
Why AI in healthcare needs stringent safety protocols Premium
The Hindu
AI safety in medicine is crucial to prevent catastrophic errors in patient care with the rise of large language models.
In 1982, a chilling tragedy in Chicago claimed seven lives after Tylenol (paracetamol) capsules were mixed with cyanide—not during manufacturing, but after reaching store shelves by unknown killer(s). Until the 1980s, products weren’t routinely sealed, and consumers could not know if items had been tampered with. The incident exposed a critical vulnerability and led to a sweeping reform: the introduction of tamper-evident sealed packaging. What was once optional became essential. Today, whether it’s food, medicine, or cosmetics, a sealed cover signifies safety. That simple seal, born from crisis, transformed into a universal symbol of trust.
We are once again at a similar crossroads. Large Language Models (LLM) like ChatGPT, Gemini, and Claude are advanced systems trained to generate human-like text. In the medical field, LLMs are increasingly being used to draft clinical summaries, explain diagnoses in simple language, generate patient instructions, and even help in decision-making processes. A recent survey found that over 65% of healthcare professionals have used LLMs, and more than half do so weekly for administrative relief or clinical insight in the United States. This integration is quick and often unregulated, especially in private settings. The success of these systems depends on the propriety Artificial Intelligence (AI) models built by companies, and the quality of training data.
To put it simply, an LLM is an advanced computer programme that generates text based on patterns it has learned. It is trained using a training dataset—vast text collections from books, articles, web pages, and medical databases. These texts are broken into tokens (words or word parts), which the model digests to predict the most likely next word in a sentence. The model weights—numbers encode this learning—are adjusted during training and stored as part of the AI’s core structure. When someone queries the LLM—whether a patient asking for drug side effects or a doctor seeking help with a rare disease—the model draws from its trained knowledge and formulates a response. The model performs well if the training data is accurate and balanced.
Training datasets are the raw material on which LLMs are built. Some of the most widely used biomedical and general training datasets include The Pile, PubMed Central, Open Web Text, C4, Refined Web, and Slim Pajama. These contain moderated content (like academic journals and books) and unmoderated content (like web pages, GitHub posts, and online forums).
A recent study in Nature Medicine published online in January 2025, explored a deeply concerning threat: data poisoning. Unlike hacking into an AI model that requires expertise, this study intentionally created a poisonous training dataset using the OpenAI GPT-3.5-turbo API. It generated fake but convincing medical articles containing misinformation—such as anti-vaccine content or incorrect drug indications at a cost of around $1,000. The study investigated what happened if the training dataset was poisoned with misinformation. Only a tiny fraction, 0.001% (1 million per billion) of the data was misinformed. However the results revealed that it displayed a staggering 4.8% to 20% increase in medically harmful responses, depending on the size and complexity of the model (ranging from 1.3 to 4 billion parameters) during prompts.
Benchmarks are test sets that check if an AI model can answer questions correctly. In medicine, these include datasets like PubMedQA, MedQA, and MMLU, which draw on standardised exams and clinical prompts based on multiple-choice style evaluations. If a model performs well on these, it is assumed to be “safe” for deployment. They are widely used to claim LLMs perform at or above the human level. But, the Nature study revealed that poisoned models scored as well as uncorrupted ones. This means existing benchmarks may not be sensitive enough to detect underlying harm, revealing a critical blind spot about benchmarks.
LLMs are trained on billions of documents, and expecting human reviewers—such as physicians—to screen through each and every one of these is unrealistic. Automated quality filters are available to eliminate garbage content containing abusive language or sexual content. But these filters often miss syntactically elegant, misleading information—the kind a skilled propagandist or AI can produce. For example, a medically incorrect statement written in polished academic prose will likely bypass these filters entirely.

Half a century has passed since India successfully launched Aryabhata, its first satellite, on April 19, 1975. This success proved to be the solid foundation for India’s space programme, which has grown by leaps and bounds in the five decades since. A.S.Ganesh takes you back to where it all started…