Dr. Tanishq Abraham

The Medical AI Manifesto - Introducing Sophont

Tanishq Mathew Abraham — Mon, 31 Mar 2025 07:00:00 GMT

Let’s face it: the healthcare system is broken. Issues include overworked care staff and a system that generates massive amounts of data—growing larger every year—making it impossible to track patient health closely or combine data from all sources in an integrated plan for care and diagnosis. Humans can’t process it all alone. AI is the clear solution, with the potential to augment and enhance doctors’ ability to provide care.

Multimodal foundation models are the future of medical AI and medical AI is the future of healthcare.

AI is already being used in healthcare: in 2023, 223 AI-powered medical devices/software were approved by the FDA, as part of a clearly exponential trend. However, currently deployed medical AI models are inflexible and rigid, suited for narrow tasks focused on individual data modalities. These approaches succumb to the parable of the blind men and the elephant: The blind men are unimodal medical models and the patient is the elephant.

We need broadly useful models that can communicate information across all the relevant medical modalities. LLMs are already proving to be broadly useful for a variety of purposes from suggesting differential diagnoses to writing more empathetic responses to patient questions, with patients often reporting better diagnoses from models like ChatGPT than through the healthcare system. Even doctors are already using ChatGPT! Outside of LLMs, unimodal medical foundation models have been improving analysis of tissue slides, chest X-rays, and more. Our vision is to merge highly performant, specialized medical foundation models into a single holistic multimodal foundation model.

To reach this goal, we propose these three critical elements for progressing medical AI research and deployment:

Open research is critical for medical domains
Generalist models like ChatGPT are insufficient; models must be carefully trained with medicine in mind
“Truly” multimodal AI is the future

Open-source is necessary for medicine

Even though models like ChatGPT are not FDA-approved, much of the use of LLMs or foundation models in medicine right now is happening under the understanding that the “practice of medicine” is not regulated by the FDA. This means leeway is given to doctors to use the tools they feel allow them to give the best care to patients. Of course, these tools cannot be marketed for medical diagnosis or treatment without FDA approval, and it must be clear that they don’t replace professional medical judgment.

Instead, it’s been easier for AI tools sold to doctors to focus on administrative tasks, and there are indeed many companies focusing on this direction, like Abridge Health, Suki AI, Nabla, and Nuance Communications. While the FDA has not released specific guidance and is likely still developing a regulatory framework for LLMs in healthcare, it is not unreasonable to expect that the FDA will impose the following requirements:

Transparency
Explainability
Privacy
Consistency

Current AI solutions often fail to meet these requirements. For example, let’s look at ChatGPT or other related proprietary LLM services:

There is no transparency regarding the kind of data the model is trained on, the sort of biases it may exhibit, etc.
Since no model weights are available, most interpretability methods for LLMs are not applicable. Only prompt-based interpretability methods, which are extremely brittle, can be applied.
OpenAI does have privacy provisions, but this may not be true for every LLM product. Even where such provisions exist, some healthcare providers are not comfortable sending protected health information (PHI) to the cloud and need on-premises solutions instead.
LLMs are continuously being updated (especially through RLHF), leading to unexpected changes in behavior.

For these reasons, we are instead bullish on open-source AI for healthcare:

Open-source AI tools can be more easily audited due to the availability of model weights, dataset information, etc. This enables complete transparency that builds trust among all the stakeholders involved, from doctors and patients to regulatory bodies.
Interpretability methods can be more easily applied to open-source models
Open-source models can be deployed locally on-premises
Models can be static, resulting in more consistent and reliable outputs
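One way to make the "static model" point concrete is to pin a released set of weights by its checksum and refuse to load anything else. The sketch below only illustrates that idea: the function names and the pinned digest are hypothetical, and a real deployment would add signing, access control, and an audit trail.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def weights_match_pin(path: Path, pinned_digest: str) -> bool:
    """True only if the local weights file is byte-identical to the pinned release.

    `pinned_digest` is a hypothetical value that a hospital IT team would record
    when a model version is validated.
    """
    return sha256_of_file(path) == pinned_digest.lower()
```

If the check fails, the serving code can decline to start, so the model that was validated is the model that answers.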

Generalist AI vs. medical-specific AI

An ongoing question surrounding foundation models in medicine is whether generalist models trained on massive amounts of data from the web are enough or if medical AI needs separate models specially trained to handle medical inputs.

On one hand, with many clinical NLP-related tasks, generalist LLMs have shown impressive advances, often beating medical-specific LLMs. On the other hand, some medical modalities have unique data formats that generalist foundation models will not have seen or are unable to process. For example, CT scans are three-dimensional images, while digitized tissue slides are gigapixel images, so different architectures and training datasets will be needed. Additionally, while text about medical practice may be present across web crawls that generalist foundation models are trained on, large datasets of other medical modalities are typically not in the pretraining corpora of these models (such data is typically private or gated).
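To get a feel for why these formats do not fit a standard image model, here is a rough back-of-the-envelope sketch. The sizes are assumptions chosen for illustration (a 100,000 x 100,000 pixel slide, a CT volume of 300 slices at 512 x 512, and 256 x 256 tiles), not figures from any particular dataset.

```python
def tile_count(width_px: int, height_px: int, tile_px: int) -> int:
    """Number of square tiles needed to cover an image, counting partial edge tiles."""
    tiles_across = -(-width_px // tile_px)  # integer ceiling division
    tiles_down = -(-height_px // tile_px)
    return tiles_across * tiles_down


def voxel_count(slices: int, height_px: int, width_px: int) -> int:
    """Total number of voxels in a 3D volume."""
    return slices * height_px * width_px


# Assumed, illustrative sizes (not taken from a specific scanner or dataset).
slide_tiles = tile_count(100_000, 100_000, 256)
ct_voxels = voxel_count(300, 512, 512)

print(f"tiles per slide: {slide_tiles:,}")  # 152,881
print(f"voxels per CT:   {ct_voxels:,}")    # 78,643,200
```

Under these assumptions a single slide becomes over a hundred thousand tiles, which is why pathology models typically need some form of tiling plus aggregation rather than a single forward pass over the whole image.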

For these reasons, foundation models trained specifically for medicine are needed.

The bitter lesson and medical foundation models

Building medical-specific models doesn’t mean that we develop models with architectural and training approaches highly specific to the medical domain. Sure, the medical domain has some unique challenges, but the bitter lesson is also true: approaches that can be easily scaled with more compute and data will usually win.

Unfortunately, this is not the approach currently taken by a majority of the medical AI field. For example, the best paper award at the Medical Imaging with Deep Learning 2023 conference went to a paper that utilized a specialized graph attention neural network to solve the narrow task of epithelial cell classification in pathology images. But we can confidently bet that a pathology foundation model (like UNI, CONCH, Virchow, etc.) could easily be finetuned to achieve SOTA on this narrow task.

Why is this the case? Because currently most medical AI research is happening in academia which has access to limited compute and in many cases limited data. So academics spend time manually designing approaches to solve narrow tasks on their limited compute and data budget.

In our opinion, this is not the right way to do impactful medical AI research.

Truly multimodal AI is the future of medicine

From GPT-4o to Gemini to Grok-3, it is clear that multimodal AI is the future of foundation models, and it will also be the future of medical AI.

People have started developing multimodal foundation models for medicine: LLaVA-Med (Microsoft), PathChat (Harvard), and CheXagent (Stanford; I was a co-author). However, most of these approaches are very limited, basically just trying to make a multimodal chatbot for medicine. Other sorts of multimodal models are being developed, like BiomedCLIP (Microsoft) and RoentGen (Stanford and Stability AI; I was also a co-author), but these are still limited to image-text modalities.

There is more to medicine than medical images and clinical notes! There are lab tests, there are 3D images like CTs, there are time series data like ECG, there are even video data like surgical videos.

All these modalities should be processed together in order to provide more holistic care support! This is what would make a truly multimodal AI system for medicine. So far, very little research is addressing this. ¹

Biomedical foundation models can enable new possibilities in medicine and research

When people think about the potential of medical foundation models, the ability to alleviate the burden on doctors and nurses is probably what comes to mind for most. But we suspect that people are not realizing that there are many other unique opportunities that medical foundation models can provide.

For example, narrow medical AI systems have already shown the potential to detect biomarkers in seemingly unrelated places. Dr. Eric Topol terms this “opportunistic AI”. AI can detect biomarkers for heart calcium score, diabetes, kidney disease, etc. from just a retinal scan, and it can detect tumor gene expression (which typically requires molecular testing) directly from the tissue slide. This is all additional information that doctors typically don’t get in this way, and it is worth wondering how it may affect the practice of medicine.

But none of these studies I cite use foundation models or incorporate multiple modalities. Our hypothesis is this: By using complementary information from different modalities, a multimodal medical foundation model will be able to pull out completely novel and highly predictive biomarkers for a variety of diseases that doctors would have previously been unable to identify.

Then there is the question of how the AI even picks up on these features in the first place. Understanding this may lead to novel insights about disease. Let me give a hypothetical example.

Let’s say we give an AI model an image of a cancer tissue slide and it is able to predict that this tumor has a specific gene mutation. Maybe the AI picked up on how this gene mutation leads to a different morphology of the cancer cells or a different cell organization in the tumor microenvironment. If we could use AI interpretability approaches to determine the reason underlying its predictions, this could provide useful insight to understanding how this mutation affects the progression of this cancer.

As far as we are aware, while there is a large body of AI interpretability work applied to standard medical classification tasks, there has been limited work studying how AI picks up on novel features like those described above.

However, there has recently been an explosion of (mechanistic) interpretability research for foundation models. The most promising direction so far has been sparse autoencoders. It wouldn’t be surprising to me if these novel interpretability approaches could be applied to medical foundation models to derive novel insights about biomedicine.
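For readers who have not seen one, a sparse autoencoder is a small network trained to reconstruct a model's internal activations through a wide hidden layer whose units are encouraged to be mostly zero. The sketch below shows only the forward pass and loss in plain Python; the class name, dimensions, and L1 coefficient are illustrative assumptions, and the training loop is omitted.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Toy sparse autoencoder: ReLU encoder, linear decoder, L1 sparsity penalty."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.encoder = [[rng.gauss(0, 0.1) for _ in range(input_dim)]
                        for _ in range(hidden_dim)]
        self.enc_bias = [0.0] * hidden_dim
        self.decoder = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)]
                        for _ in range(input_dim)]
        self.dec_bias = [0.0] * input_dim

    def encode(self, activation):
        """Map a model activation vector to non-negative feature activations."""
        pre = matvec(self.encoder, activation)
        return [max(0.0, p + b) for p, b in zip(pre, self.enc_bias)]

    def decode(self, features):
        """Reconstruct the original activation from the feature activations."""
        out = matvec(self.decoder, features)
        return [o + b for o, b in zip(out, self.dec_bias)]

    def loss(self, activation, l1_coeff=1e-3):
        """Mean squared reconstruction error plus an L1 penalty on the features."""
        features = self.encode(activation)
        reconstruction = self.decode(features)
        mse = sum((a - r) ** 2 for a, r in zip(activation, reconstruction)) / len(activation)
        sparsity = sum(features)  # equals the L1 norm because features are >= 0
        return mse + l1_coeff * sparsity, features
```

After training, each hidden unit can be inspected by looking at the inputs that activate it most strongly, which is where the hoped-for biomedical insight would come from.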

So the real potential of biomedical foundation models is not about replacing or augmenting the practice of medicine, but rather enhancing it in novel ways and even generating novel biomedical knowledge.

AI must be ready to enable tomorrow’s healthcare

Healthcare right now looks a little something like this:

You don’t feel well
You go to a doctor
The doctor measures you in a pretty standard set of ways, starting out with simple point measurements of vital signs, going to blood tests, medical imaging, etc.
Based on all the recorded information the doctor makes a diagnosis and suggests treatment

However, we can see this approach to healthcare is very reactive, not proactive. But healthcare shouldn’t be sick-care! We can expect this to change in the future, though, and AI can enable this.

In order to enable a proactive approach to healthcare, we need to measure a patient’s baseline health and how it changes over time. For this reason, continuous monitoring of patient health will be crucial. We already see people using wearable sensors and continuous monitors to track vital signs like heart rate, pulse oxygen levels, glucose levels, etc. over time. In the future we may expect to see new wearables that continuously track even more biomarkers, or even portable medical imaging tools that patients use to scan themselves every day (akin to a Tricorder).
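As a toy illustration of what "measuring a baseline and how it changes" could mean for a single wearable signal, the sketch below flags readings that sit far from a rolling personal baseline. The window length and threshold are arbitrary assumptions, and a simple z-score is a stand-in here for whatever a real foundation model would learn.

```python
from statistics import mean, stdev


def deviations_from_baseline(readings, baseline_window=7, threshold=3.0):
    """Flag readings that deviate strongly from the patient's own recent baseline.

    Returns a list of (index, value, z_score) tuples. Readings that follow a
    perfectly flat window are skipped because the z-score is undefined there.
    """
    flagged = []
    for i in range(baseline_window, len(readings)):
        window = readings[i - baseline_window:i]
        baseline, spread = mean(window), stdev(window)
        if spread == 0:
            continue
        z_score = (readings[i] - baseline) / spread
        if abs(z_score) >= threshold:
            flagged.append((i, readings[i], round(z_score, 2)))
    return flagged


# Hypothetical resting heart rate readings, one per day.
resting_heart_rate = [60, 62, 61, 59, 60, 61, 60, 85]
print(deviations_from_baseline(resting_heart_rate))
```

The point of the personal baseline is that 85 bpm may be unremarkable for one person and a large departure for another; population-level thresholds miss that.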

Foundation models will be necessary to identify patterns and features in extremely rich, continuously recorded, high-resolution, multi-modality patient data.

What do we need?

Summarizing all of the above, we believe this is what’s needed to advance the field of medical AI:

Build truly multimodal, highly flexible, medical-specific models
This should be done open-source for maximum transparency and flexibility
Such models will enhance how clinicians provide care and also enable tomorrow’s healthcare

In order to achieve this, we believe we need a dedicated research lab company: a DeepSeek for medical AI.

Just like any AI research lab, we need the following three components:

Data - This comes from strong partnerships with healthcare systems, medical practices, pharma companies, etc. Ideally, training data comes from a variety of sources with lots of patients from different demographics: we need both scale and diversity.
Compute - Training these models requires significant compute resources. This may even require more compute than typical training of an LLM or diffusion model, because medical data poses its own challenges. For example, pathology images are gigapixels in size, while CT data is three-dimensional.
Talent - We need diverse talent working on this:
- Data engineers to wrangle the medical datasets
- Research engineers who can manage large-scale foundation model training
- Clinicians to help scope research and evaluate models

So far, there are no entities that have such an environment:

Academia often has relevant domain (clinical) expertise and useful academic medical datasets but little large-scale AI training expertise and very little compute
Industry can have large-scale AI training expertise and plenty of compute, but little domain expertise and medical data

Our proposed new research lab company aims to be in-between an academic lab and a standard AI company, focusing on open research and development while also translating and deploying our developed foundation models into real-world clinical applications.

Introducing Sophont

We are founding Sophont, a new company focused on building open multimodal medical foundation models in order to build towards what we believe is the future of healthcare. We are an AI-first company building for medicine. Our goal is to become the de facto leader of impactful medical AI research and deployment.

We are not a company doing research for the sake of research. We also need to be translating that research into the hands of the people, making an actual direct impact. We are a research and translation company. We believe foundation models will become an important layer of infrastructure throughout healthcare and the life sciences, and we will help make that happen.

While brainstorming company names, my younger sister, Tiara Abraham (the creative genius of my family) suggested “Sophont.” The term, often used in science fiction, refers to an intelligent, self-aware being capable of advanced reasoning and problem-solving. Derived from the Greek sophos, meaning wise or intelligent, “Sophont” reflects our vision for medical AI—an advanced, cognitive system that enhances decision-making, drives innovation, and pushes the boundaries of AI-driven healthcare.

With over five years of experience applying generative AI to medicine, I bring a wealth of expertise, having previously served as the Research Director at Stability AI and CEO of MedARC. My co-founder, Paul Scotti, has a decade of experience in computational neuroscience, was a postdoctoral researcher at Princeton University, and served as Head of NeuroAI at Stability AI. He led the team behind the MindEye publications, which achieved state-of-the-art reconstructions of seen images from brain activity.

We are strong believers in the importance and usefulness of both open-source and open, transparent research, as evidenced by our previous experiences at MedARC. During our time at MedARC, we were able to scale up research projects to many external collaborators from around the world to make significant advances in neuroAI research. We hope to continue this science-in-the-open approach to research at Sophont to build towards the future of healthcare.

If you are interested in building and collaborating in this space, whether you’re in generative AI, hospital work, life science R&D, pharma, etc., feel free to reach out at: contact@sophontai.com

To stay updated about Sophont, follow our company on Twitter and LinkedIn.

Footnotes

As a side note, it’s worth pointing out that our vision is somewhat similar to the one laid out by Moor et al. 2023, which we also agree with heavily. They propose a medical AI that has the following characteristics: (1) the ability to carry out tasks that are dynamically specified (e.g., with in-context learning), (2) the ability to support flexible combinations of data modalities, and (3) the ability to formally represent medical domain knowledge and leverage it to carry out advanced medical reasoning.

Citation

For attribution, please cite this work as:

Tanishq Mathew Abraham, and Paul Scotti. 2025. “The Medical AI Manifesto - Introducing Sophont.” March 31, 2025. https://www.tanishq.ai/blog/posts/sophont.html.

LLMs in medicine: evaluations, advances, and the future

Tanishq Mathew Abraham — Tue, 04 Mar 2025 08:00:00 GMT

Introduction

Large Language Models (LLMs) have shown significant potential for medical applications yet many challenges remain. Let’s talk about the state of LLMs in medicine, how these models are evaluated, how the latest models are improving, and the future of the field.

Significant progress in the field has been made by Google and Microsoft/OpenAI¹. Here are some of the models these companies have either developed for medical use cases or tested specifically on medical tasks:

Google: Med-PaLM → Med-PaLM 2 → Med-PaLM M → AMIE and Med-Gemini

OpenAI: GPT-3 → GPT-4 → GPT-o1-preview²

Note that Google’s strategy so far has been to fine-tune their general purpose models for medical use-cases, while Microsoft takes OpenAI general-purpose models and directly applies them to medical tasks.

The open-source community has also produced models that have been evaluated for medical tasks, including Llama, Mistral, Qwen, and the DeepSeek-series models. In addition, certain models have been specifically fine-tuned for medical applications, such as Meditron [1].

How LLM medical capabilities are evaluated

Until fairly recently, the prevalent way of evaluating model performance on medical tasks was to report performance on multiple-choice medical question answering (MCQA) benchmarks. This practice was popularized by Google’s Med-PaLM paper [2], which introduced the MultiMedQA benchmark. This benchmark is itself a suite of a few other benchmarks:

PubMedQA [3] - 1,000 expert-labeled Q&A pairs where a question and corresponding PubMed abstract as context is given, and a yes/maybe/no answer must be produced. Unlike the rest of the tasks in this suite, PubMedQA is a closed-domain Q&A task.
MedQA [4] - US Medical License Exam (USMLE) questions with 4 or 5 possible answers. Typically, only the 4-option questions are used.
MedMCQA [5] - 4-option multiple choice questions from Indian medical entrance examinations, >191k total questions.
MMLU [6] - 4-option multiple choice exam questions from a variety of domains. The following six domains are utilized here:
- Anatomy
- Clinical Knowledge
- College Medicine
- Medical Genetics
- Professional Medicine
- College Biology

Here is a representative example of a QA pair from each dataset:

Dataset	Question	Options	Answer
PubMedQA	Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?	yes, no, or maybe	yes
MedQA	A 57-year-old man presents to his primary care physician with a 2-month history of right upper and lower extremity weakness. He noticed the weakness when he started falling far more frequently while running errands. Since then, he has had increasing difficulty with walking and lifting objects. His past medical history is significant only for well-controlled hypertension, but he says that some members of his family have had musculoskeletal problems. His right upper extremity shows forearm atrophy and depressed reflexes while his right lower extremity is hypertonic with a positive Babinski sign. Which of the following is most likely associated with the cause of this patient’s symptoms?	A: HLA-B8 haplotype B: HLA-DR2 haplotype C: Mutation in SOD1 D: Mutation in SMN1 E: Viral infection	C
MedMCQA	Which drug is a selective COX 2 inhibitor?	A: Celecoxib B: Acetaminophen C: Ketorolac D: Aspirin	A
MMLU	Which of the following conditions does not show multifactorial inheritance?	A: Pyloric stenosis B: Schizophrenia C: Spina bifida (neural tube defects) D: Marfan syndrome	D

When papers report “LLM X beats humans on medical license exams” they typically refer to models surpassing baseline human performance on benchmarks such as MedQA and MedMCQA (or also a separate USMLE benchmark).
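Mechanically, scoring a model on one of these benchmarks is simple, which is part of why they became so popular. The sketch below is a minimal version under stated assumptions: `answer_fn` is a hypothetical stand-in for whatever calls the model, the item layout is invented for illustration, and the letter extraction is deliberately naive (a response that begins with the word "A" would fool it).

```python
import re
from typing import Callable, Dict, Iterable, List, Optional


def format_question(question: str, options: Dict[str, str]) -> str:
    """Render a question and its lettered options as a single prompt."""
    lines = [question] + [f"{letter}: {text}" for letter, text in options.items()]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def extract_choice(response: str, valid_letters: Iterable[str]) -> Optional[str]:
    """Return the first standalone option letter found in the response, if any."""
    letters = "".join(valid_letters)
    match = re.search(rf"\b([{letters}])\b", response)
    return match.group(1) if match else None


def mcqa_accuracy(items: List[dict], answer_fn: Callable[[str], str]) -> float:
    """Fraction of items where the extracted letter equals the labeled answer."""
    correct = 0
    for item in items:
        prompt = format_question(item["question"], item["options"])
        choice = extract_choice(answer_fn(prompt), item["options"].keys())
        correct += int(choice == item["answer"])
    return correct / len(items)
```

Note that the metric only checks the final letter. Nothing in it looks at whether the reasoning was sound, which is one reason high scores here say little about clinical practice.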

Current state-of-the-art LLMs have already achieved very high accuracy on these tasks. For example, Med-Gemini [7] achieved 91.1% on MedQA. Many of these benchmarks are being saturated and losing their utility. In fact, some researchers noticed that many of the questions that LLMs get wrong actually have incorrect ground truth labels!

Based on this, you might assume LLMs have solved medicine. You would be incorrect.

All of these benchmarks consist of multiple choice questions. When you consult a doctor, do you imagine a list of potential diagnoses floating above your head, waiting for the doctor to simply choose the right one? Unfortunately not! Multiple choice question answering benchmarks can assess whether an LLM contains medical knowledge and basic medical capabilities, but they do not accurately represent the actual practice of medicine!

Going beyond multiple-choice question answering

Over the past year or so, especially with MultiMedQA being saturated by recent well-performing LLMs, researchers have become acutely aware of the challenges of current medical MCQA benchmarks. For this reason, researchers have started evaluating LLMs in new ways that they hope better align with actual medical use-cases.

For example, at the end of 2023, Google published a paper on how well a medically finetuned variant of PaLM-2 (distinct from Med-PaLM 2) performed on differential diagnoses using a new benchmark [8]. New England Journal of Medicine (NEJM) Clinicopathological Conference Case Reports are lightly edited transcriptions of the clinicopathological conferences of the Mass General Hospital. In a clinicopathological conference, a patient case (medical history, test results, etc.) is described and an expert physician is asked to provide a differential diagnosis and a final diagnosis. These cases are published regularly in the NEJM as “diagnostic puzzles”. Many papers are now using these cases to evaluate LLMs, although different cases are used between different studies.

Google’s LLM achieved 59.1% accuracy in including the correct diagnosis somewhere in its top 10 differential diagnoses—better than unassisted doctors at 33.6%. When doctors used the LLM as an assistant, they achieved 51.7% accuracy versus 44.4% with just search tools.
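The "top 10" number above is a top-k accuracy: a case counts as a hit if the correct diagnosis appears anywhere in the first k entries of the ranked differential. Here is a minimal sketch of that metric. The case layout is hypothetical, and exact string matching is a simplifying assumption; deciding whether two diagnosis strings mean the same thing would in practice need expert or model-based judgment.

```python
def top_k_accuracy(cases, k=10):
    """Fraction of cases whose final diagnosis appears in the top-k differential.

    Each case is a dict with a ranked list under "differential" and the
    reference label under "final_diagnosis" (hypothetical layout).
    """
    hits = 0
    for case in cases:
        ranked = [diagnosis.strip().lower() for diagnosis in case["differential"][:k]]
        hits += int(case["final_diagnosis"].strip().lower() in ranked)
    return hits / len(cases)
```

Comparing `top_k_accuracy(cases, k=1)` with `top_k_accuracy(cases, k=10)` separates "the model's first guess was right" from "the right answer was somewhere on the list", which are very different claims clinically.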

This certainly seems very promising, but we shouldn’t read too much into these results, because this benchmark still has weaknesses. All the necessary information to make the diagnosis is provided upfront and there is no back-and-forth between patient and physician. Real medical practice requires gathering information dynamically from patients, and dealing with uncertainty, which these carefully structured benchmarks cannot evaluate.

Researchers recognized this gap and several research groups have now developed novel frameworks for evaluating LLMs for clinical settings, each taking different approaches to attempt to simulate more realistic medical practices:

CRAFT-MD (Oct 2023) [9], developed by researchers at Harvard Medical School and Stanford, introduced a framework using three AI agents: a patient agent, a doctor agent (the LLM being evaluated), and a grader agent. Their evaluation used three main data sources: 1,800 case vignettes from the MedQA dataset, 100 questions from an online medical question bank focused on dermatology, and 100 newly generated private dermatology cases created by dermatologists. The researchers specifically focused on dermatology to evaluate how well LLMs could conduct nuanced conversations about symptoms, progression, and medical history. Their evaluation revealed that diagnostic accuracy drops significantly when LLMs need to gather information through conversation rather than being presented with all information upfront. For example, GPT-4’s accuracy dropped from 82% to 62.7% when moving from static case descriptions to conversations. They also examined how biases affected performance, finding that while GPT-4 was relatively robust, other models like Mixtral-8x7B showed significant performance degradation when biases were introduced.

AMIE (Jan 2024) [10], developed by Google DeepMind, took a different approach by conducting a randomized, double-blind crossover study comparing their LLM system against primary care physicians using standardized patient actors. The study used 149 scenario packs sourced from multiple regions: 75 from India, 60 from Canada, and 14 from the UK. These scenarios covered conditions across multiple specialties including cardiovascular, respiratory, gastroenterology, neurology, and obstetrics/gynecology, though notably excluded pediatric and psychiatric cases. The study revealed that AMIE outperformed physicians on 28 of 32 evaluation axes according to specialist physicians and 24 of 26 axes according to patient actors. However, the authors note important limitations, particularly that the text-chat interface may have disadvantaged human physicians who are more accustomed to in-person or video consultations.

AgentClinic (May 2024) [11], from researchers at Stanford and Johns Hopkins, introduced a broader doctor/patient evaluation framework supporting multiple medical specialties, multiple languages, and multimodal inputs like medical imaging. Their evaluation environment drew from several datasets: the MedQA USMLE dataset for general medical scenarios, the MIMIC-IV database for realistic patient cases, NEJM case challenges (120 cases) for multimodal evaluation, and the MedMCQA dataset for specialist cases across 9 medical specialties. They also created multilingual versions in 7 languages (Chinese, Hindi, Korean, Spanish, French, Persian, and English). The authors compared the accuracy of models on MedQA-USMLE vs. on AgentClinic’s MedQA-USMLE agentic environment. Interestingly, they found that MedQA accuracy is not predictive of accuracy on AgentClinic-MedQA due to the additional complexity of the agentic environment. Their evaluation showed Claude-3.5 achieved the highest performance with 62.1% accuracy on their benchmark (as of Oct. 2024), while also demonstrating that different models varied significantly in their ability to utilize tools like experiential learning and adaptive retrieval.

MIMIC-CDM (Sept 2024) [12], from researchers at Technical University of Munich and Imperial College London, demonstrated that current LLMs still face significant challenges in real clinical settings. Using a comprehensive dataset of 2,400 real patient cases from the MIMIC database, focused on four common abdominal pathologies (appendicitis, cholecystitis, diverticulitis, and pancreatitis), they showed that LLMs struggled with guideline adherence and lab result interpretation. Each case included comprehensive medical data: admission info, lab events, radiology reports, diagnoses, and discharge summaries. The study found that even state-of-the-art models performed significantly worse than physicians across all pathologies, with accuracy dropping further when models had to gather information themselves rather than having it provided upfront.
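The conversational setups above share a common skeleton: a doctor agent asks questions, a patient agent answers from a hidden case, and a grader judges the final diagnosis. The sketch below shows that skeleton only. It is not the code of any of these frameworks; the three callables, the `FINAL DIAGNOSIS:` convention, and the case layout are all assumptions made for illustration.

```python
FINAL_PREFIX = "FINAL DIAGNOSIS:"


def run_consultation(doctor, patient, grader, case, max_turns=10):
    """Simulate one doctor-patient conversation and grade the outcome.

    doctor(transcript) -> str         the model being evaluated
    patient(case, transcript) -> str  answers using only the hidden case details
    grader(predicted, truth) -> bool  judges whether the diagnosis is correct
    """
    transcript = []
    for _ in range(max_turns):
        doctor_message = doctor(transcript)
        transcript.append(("doctor", doctor_message))
        if doctor_message.startswith(FINAL_PREFIX):
            diagnosis = doctor_message[len(FINAL_PREFIX):].strip()
            return {
                "correct": grader(diagnosis, case["diagnosis"]),
                "turns": len(transcript),
                "transcript": transcript,
            }
        transcript.append(("patient", patient(case, transcript)))
    # The doctor never committed to a diagnosis within the turn budget.
    return {"correct": False, "turns": len(transcript), "transcript": transcript}
```

The key difference from a static benchmark is that the doctor agent only sees what it thinks to ask for, so poor history-taking shows up directly as lower accuracy.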

Physicians only spend roughly 27% of their time performing direct clinical care duties, with the rest being spent on laborious documentation and administrative tasks, so it’s important to also evaluate LLMs on these complex administrative tasks for realistic/practical medical applications. MedAgentBench (Feb 2025) [13], developed by Stanford researchers, is a comprehensive benchmark designed to evaluate the agent capabilities of LLMs to work with realistic electronic health records. The MedAgentBench framework comprises 300 clinically-derived tasks across 10 categories, utilizing realistic profiles of 100 patients with over 700,000 records. The authors show that while current LLMs like Claude 3.5 Sonnet v2 can achieve success rates close to 70%, substantial challenges remain, especially in executing action-based tasks, highlighting the need for further advancements before such systems can be reliably integrated into clinical workflows.

Overall, these different approaches to evaluation have helped expose the current limitations of LLMs (poor performance in more realistic patient-doctor interactions) in medical applications while also highlighting areas for improvement. While multiple-choice medical exam benchmarks suggested near-human or superhuman performance, these more realistic evaluations reveal that LLMs still fall short of clinical practice in many ways.

Before moving on, it’s worth noting two things:

While it seems like these alternative (often agentic) approaches are more useful now for highlighting the limitations of current LLMs, this doesn’t mean MCQA benchmarks are completely over. New, more challenging and unsaturated MCQA benchmarks can be constructed. For example, researchers at Tsinghua University included questions from specialty board exams for specialized scenarios with easy and highly similar questions filtered out to construct the challenging MedXpertQA benchmark (Jan 2025) [14].
LLM limitations can also be evaluated with more unrealistic toy questions that specifically evaluate the model’s flexible reasoning abilities or capacity to identify knowledge gaps. Two good examples of this are MetaMedQA (Jan 2025) [15] and M-ARC (Feb 2025) [16].

The future of LLM medical capabilities

I think it’s becoming clear that the challenges of general purpose LLMs also apply to medical LLMs. Namely, addressing reasoning is extremely important for medical tasks as well. My hypothesis is that the progression and development of reasoning models will address many of the limitations of LLMs in medicine, supported by the fact that Microsoft reports o1-preview achieved SoTA on MultiMedQA [17]. What’s particularly interesting is that models like o1-preview can spend more or less compute “thinking” about a problem, and spending more reasoning tokens tends to improve performance!

Another study instead analyzes how well o1-preview performs on case studies like the NEJM case challenges [18]. Here, two physicians scored o1-preview’s responses on a scale from 0 (no suggestions close to the target diagnosis) to 5 (contains the exact target diagnosis). The model included the correct diagnosis in its differential in 78.3% of cases, with the correct diagnosis being the top suggestion 52% of the time. It also significantly outperformed GPT-4 on a subset of 70 cases, where o1-preview achieved 88.6% accuracy compared to GPT-4’s 72.9%.
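Percentages on a 70-case subset are easier to interpret as case counts. The quick check below infers the counts implied by the reported percentages. These counts are my own back-calculation from the numbers quoted here, not figures taken from the paper.

```python
def implied_count(percent: float, total: int) -> int:
    """Nearest whole number of cases that a reported percentage corresponds to."""
    return round(percent / 100 * total)


SUBSET_SIZE = 70
o1_cases = implied_count(88.6, SUBSET_SIZE)    # 62
gpt4_cases = implied_count(72.9, SUBSET_SIZE)  # 51

# Sanity check: the inferred counts round back to the reported percentages.
assert round(100 * o1_cases / SUBSET_SIZE, 1) == 88.6
assert round(100 * gpt4_cases / SUBSET_SIZE, 1) == 72.9

print(f"o1-preview: {o1_cases}/{SUBSET_SIZE}, GPT-4: {gpt4_cases}/{SUBSET_SIZE}, "
      f"difference: {o1_cases - gpt4_cases} cases")
```

Seen this way, the gap between the two models on this subset comes down to roughly eleven cases, which is worth keeping in mind when comparing results across studies that use different case selections.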

So what’s next? So far, there hasn’t been a study of o1 in the agentic patient-doctor environments discussed above, so that’s a clear next step. Additionally, it will be interesting to see how o3 models perform. The open-source community has also developed its own reasoning-enhanced LLMs (e.g., DeepSeek-R1), and it will be valuable to see how well these o1 replication models perform on medical benchmarks.

Finally, this is still a very narrow look at how foundation models can be applied to medicine. Specifically, most of the analyses discussed do not include processing of multiple modalities like medical images, lab tests, etc. which is an important aspect of clinical care. Instead, analyses of these tests are usually described in text and provided to the LLM. Additionally, much of the research focuses on evaluating general-purpose LLMs. I believe we need to go beyond general-purpose LLMs for medicine and instead focus on medical-specific multimodal models. Of course, this will necessitate new evaluations. I will discuss both of these points in future blog posts.

The trajectory of LLMs in medicine demonstrates both impressive advances and important limitations. While early evaluations on multiple-choice medical exams suggested near-superhuman performance, newer frameworks like CRAFT-MD, AMIE, and AgentClinic have exposed crucial gaps between simplistic benchmark performance and real clinical capabilities. The emergence of reasoning-enhanced models offers promise, with their ability to achieve state-of-the-art results on both traditional benchmarks and complex case studies suggesting that enhanced reasoning capabilities may be key to bridging this gap.

Conclusion

Moving forward, the field must focus on two key challenges: evaluating these reasoning-enhanced models in realistic clinical scenarios and developing true multimodal medical AI systems capable of processing and reasoning about diverse clinical data modalities. As we continue to refine both our models and evaluation methods, the goal remains clear: creating AI systems that can genuinely support and enhance medical practice rather than just excel at standardized tests.

Acknowledgements

Thank you to Paul Scotti, Ph.D., Jean-Benoit Delbrouck, Ph.D., and Alireza Nejati, Ph.D., for review and feedback.

References

[1]

Z. Chen et al., “MEDITRON-70B: Scaling Medical Pretraining for Large Language Models.” arXiv, Nov. 2023. doi: 10.48550/arXiv.2311.16079. Available: http://arxiv.org/abs/2311.16079. [Accessed: Mar. 03, 2025]

[2]

K. Singhal et al., “Large language models encode clinical knowledge,” Nature, vol. 620, no. 7972, pp. 172–180, Aug. 2023, doi: 10.1038/s41586-023-06291-2. Available: https://www.nature.com/articles/s41586-023-06291-2. [Accessed: Mar. 03, 2025]

[3]

Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu, “PubMedQA: A Dataset for Biomedical Research Question Answering,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan, Eds., Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 2567–2577. doi: 10.18653/v1/D19-1259. Available: https://aclanthology.org/D19-1259/. [Accessed: Mar. 03, 2025]

[4]

D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits, “What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams.” arXiv, Sep. 2020. doi: 10.48550/arXiv.2009.13081. Available: http://arxiv.org/abs/2009.13081. [Accessed: Mar. 03, 2025]

[5]

A. Pal, L. K. Umapathi, and M. Sankarasubbu, “MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering.” arXiv, Mar. 2022. doi: 10.48550/arXiv.2203.14371. Available: http://arxiv.org/abs/2203.14371. [Accessed: Mar. 03, 2025]

[6]

D. Hendrycks et al., “Measuring Massive Multitask Language Understanding.” arXiv, Jan. 2021. doi: 10.48550/arXiv.2009.03300. Available: http://arxiv.org/abs/2009.03300. [Accessed: Mar. 03, 2025]

[7]

K. Saab et al., “Capabilities of Gemini Models in Medicine.” arXiv, May 2024. doi: 10.48550/arXiv.2404.18416. Available: http://arxiv.org/abs/2404.18416. [Accessed: Mar. 03, 2025]

[8]

D. McDuff et al., “Towards Accurate Differential Diagnosis with Large Language Models.” arXiv, Nov. 2023. doi: 10.48550/arXiv.2312.00164. Available: http://arxiv.org/abs/2312.00164. [Accessed: Mar. 03, 2025]

[9]

S. Johri et al., “An evaluation framework for clinical use of large language models in patient interaction tasks,” Nature Medicine, vol. 31, no. 1, pp. 77–86, Jan. 2025, doi: 10.1038/s41591-024-03328-5. Available: https://www.nature.com/articles/s41591-024-03328-5. [Accessed: Mar. 03, 2025]

[10]

T. Tu et al., “Towards Conversational Diagnostic AI.” arXiv, Jan. 2024. doi: 10.48550/arXiv.2401.05654. Available: http://arxiv.org/abs/2401.05654. [Accessed: Mar. 03, 2025]

[11]

S. Schmidgall, R. Ziaei, C. Harris, E. Reis, J. Jopling, and M. Moor, “AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments.” arXiv, Oct. 2024. doi: 10.48550/arXiv.2405.07960. Available: http://arxiv.org/abs/2405.07960. [Accessed: Mar. 03, 2025]

[12]

P. Hager et al., “Evaluation and mitigation of the limitations of large language models in clinical decision-making,” Nature Medicine, vol. 30, no. 9, pp. 2613–2622, Sep. 2024, doi: 10.1038/s41591-024-03097-1. Available: https://www.nature.com/articles/s41591-024-03097-1. [Accessed: Mar. 03, 2025]

[13]

Y. Jiang et al., “MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents.” arXiv, Feb. 2025. doi: 10.48550/arXiv.2501.14654. Available: http://arxiv.org/abs/2501.14654. [Accessed: Mar. 03, 2025]

[14]

Y. Zuo et al., “MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding.” arXiv, Feb. 2025. doi: 10.48550/arXiv.2501.18362. Available: http://arxiv.org/abs/2501.18362. [Accessed: Mar. 03, 2025]

[15]

M. Griot, C. Hemptinne, J. Vanderdonckt, and D. Yuksel, “Large Language Models lack essential metacognition for reliable medical reasoning,” Nature Communications, vol. 16, no. 1, p. 642, Jan. 2025, doi: 10.1038/s41467-024-55628-6. Available: https://www.nature.com/articles/s41467-024-55628-6. [Accessed: Mar. 03, 2025]

[16]

J. Kim, A. Podlasek, K. Shidara, F. Liu, A. Alaa, and D. Bernardo, “Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning.” arXiv, Feb. 2025. doi: 10.48550/arXiv.2502.04381. Available: http://arxiv.org/abs/2502.04381. [Accessed: Mar. 03, 2025]

[17]

H. Nori et al., “From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond.” arXiv, Nov. 2024. doi: 10.48550/arXiv.2411.03590. Available: http://arxiv.org/abs/2411.03590. [Accessed: Mar. 03, 2025]

[18]

P. G. Brodeur et al., “Superhuman performance of a large language model on the reasoning tasks of a physician.” arXiv, Dec. 2024. doi: 10.48550/arXiv.2412.10849. Available: http://arxiv.org/abs/2412.10849. [Accessed: Mar. 03, 2025]

Footnotes

Even though Microsoft and OpenAI are separate, I keep them together because so far OpenAI hasn’t published any separate medical AI research while Microsoft publishes their evaluations of OpenAI models on medical tasks, although that may change in the future with OpenAI’s new health AI team.↩︎
o1 and o3-mini have not been evaluated on medical tasks yet, as far as I know.↩︎

Citation

BibTeX citation:

@online{abraham2025,
  author = {Abraham, Tanishq Mathew},
  title = {LLMs in Medicine: Evaluations, Advances, and the Future},
  date = {2025-03-04},
  url = {https://www.tanishq.ai/blog/posts/llm-medical-evals.html},
  langid = {en}
}

For attribution, please cite this work as:

T. M. Abraham, “LLMs in medicine: evaluations, advances, and the future,” Mar. 04, 2025. Available: https://www.tanishq.ai/blog/posts/llm-medical-evals.html

Debunking DeepSeek Delusions

Tanishq Mathew Abraham, Ph.D. — Tue, 04 Feb 2025 08:00:00 GMT

Introduction

On January 20th, 2025, a Chinese AI company called DeepSeek open-sourced and released their reasoning model, R1. A few things set this model apart from all the other open-source LLMs:

Performance is actually as good as OpenAI’s o1, which is a frontier model, marking the first time open-source has truly caught up to closed-source
This was done with a relatively low training budget compared to other frontier models
The easy-to-use UI, combined with a good UX with visible chain-of-thought in their website and app led to millions of new users

Given that DeepSeek is a Chinese company, the U.S. and its AGI companies have a variety of “national security concerns”. Rampant misinformation has been spreading about the model due to this. The goal of this blog post is to counteract many of the extremely bad AI-related takes about DeepSeek since its release and provide a balanced take as an AI researcher who works at the forefront of generative AI.

Let’s get started!

Myth 1: DeepSeek is a Chinese company that came out of nowhere, deeply suspicious!

Completely false. Pretty much any generative AI researcher had already heard of DeepSeek by January 2025. DeepSeek even previewed R1 a couple months before its full release!

Anybody spreading this myth is likely someone who doesn’t work in AI and it is preposterous and extremely pretentious to assume that you know everything about what’s going on in a field if you are not actively a part of it.

DeepSeek’s first open-source models were released in November 2023, which were state-of-the-art coding LLMs (DeepSeek-Coder). As you can see in the below graph, DeepSeek continued shipping over the course of a year to reach R1:

deepseek progress plot

So this isn’t some overnight success, and there’s nothing suspicious about their rate of progress. With everything moving so fast in AI and with the clearly cracked team they have, this much progress in a year seems quite reasonable to me.

If you are wondering what other companies are under the radar of the broader public but that AI circles are bullish on, I would look into Qwen (Alibaba), Yi (01.AI), Mistral, Cohere, and AI2. I will note that none of them have the consistent shipping of SOTA models like DeepSeek, but they all have the potential to release stellar models, as they have demonstrated in the past.

Myth 2: The model does not cost $6 million to make, the Chinese are lying about it

Okay this is an interesting one. The claim is that DeepSeek is lying about the true cost of model training in order to avoid admitting they had illegal under-the-table dealings to obtain compute they shouldn’t have access to (due to export controls).

First of all, it’s worth understanding where this $6 million figure comes from. It’s mentioned in the DeepSeek-V3 paper, which was released a month before the DeepSeek-R1 paper:

deepseek cost

DeepSeek-V3 is the base model of DeepSeek-R1, which means DeepSeek-R1 is DeepSeek-V3 with some additional reinforcement learning training. So in some sense the cost is already inaccurate simply because there’s an additional cost for the RL training that’s not accounted for. But that would likely only cost a few hundred thousand dollars.

Okay then, so is the $5.5 million claim of the DeepSeek-V3 paper incorrect? Numerous analyses based on GPU cost, dataset size, and model size achieve similar ballpark estimates. Note that while DeepSeek V3/R1 is a 671B parameter model, it is a mixture-of-experts model which means any function call/forward pass of the model only uses ~37B parameters and this is the value used in calculations for training cost.
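As a rough sanity check on that figure, the arithmetic can be reproduced in a few lines. The inputs below (about 2.788M H800 GPU-hours for the final run, an assumed rental price of $2 per GPU-hour, and 14.8T training tokens) are assumptions based on the DeepSeek-V3 technical report rather than numbers stated in this post, and the "6 × parameters × tokens" FLOPs estimate is only a common rule of thumb.

```python
# Back-of-the-envelope check of the DeepSeek-V3 final-run cost.
# All inputs are assumptions based on the DeepSeek-V3 technical report.
GPU_HOURS = 2.788e6          # total H800 GPU-hours for the final training run
PRICE_PER_GPU_HOUR = 2.0     # assumed rental price in USD per H800 GPU-hour
ACTIVE_PARAMS = 37e9         # parameters used per token (mixture-of-experts)
TOTAL_PARAMS = 671e9         # total parameters, shown only for comparison
TRAIN_TOKENS = 14.8e12       # pre-training tokens

cost_usd = GPU_HOURS * PRICE_PER_GPU_HOUR

# Rule of thumb: training FLOPs ~ 6 * parameters * tokens.
flops_active = 6 * ACTIVE_PARAMS * TRAIN_TOKENS
flops_if_dense = 6 * TOTAL_PARAMS * TRAIN_TOKENS

print(f"Estimated cost: ${cost_usd / 1e6:.3f}M")
print(f"Training FLOPs (active params): {flops_active:.2e}")
print(f"A dense 671B model would need {flops_if_dense / flops_active:.1f}x more")
```

The last line shows why the active parameter count, not the total, is the right input: counting all 671B parameters would overstate the compute by more than an order of magnitude.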

However, note that DeepSeek is reporting an estimated cost based on current market prices for these GPUs. We don’t actually know how much their 2048 H800 GPU cluster (note: not H100s, a common misconception and confusion!) costs. Typically, contiguous GPU clusters cost less when bought together, so it may even be cheaper.

But here’s the thing, this is the cost for the final run. There are numerous experiments and ablations that are done at smaller scales to get to the final run which can cost a significant amount and this is not reported here.

On top of that, there are probably numerous other costs, like researcher salaries. SemiAnalysis reports that DeepSeek research salaries are rumored to be on the order of $1 million. This is comparable to the higher end of salaries at AGI frontier labs like OpenAI or Anthropic.

Typically when costs of training different models have been reported and compared, they have always focused on the final training run cost. But due to the poor discourse and misinformation spreading, people have been arguing that the additional costs discredit the cheap costs of DeepSeek and the efficient nature of their operation. This is wildly unfair. The additional costs both in terms of ablations/experiments and researcher salaries at other AGI frontier labs are quite significant but these are not typically mentioned in such discussions!

Myth 3: It’s so cheap, all the US AGI companies have been wasting their money, this is extremely bearish for NVIDIA

Okay I consider this to be another fairly dumb take. DeepSeek definitely was significantly more efficient in training compared to many other LLMs. And yes, it’s very much possible many US frontier labs were being inefficient with their compute. However, that does not necessarily imply that having more compute is a bad thing.

Honestly, whenever I hear a take like this, it’s clear to me that they don’t understand scaling laws and they don’t understand the mindset of AGI company CEOs (and anyone who is treated as an expert in AI should understand such things). Let me dispense some alpha on this topic.

Scaling laws have demonstrated that as long as we continue to put more compute into the model, we get better and better performance. Of course, the exact approach and aspect of the AI being scaled has changed over time: first it was with model size, then with dataset size, now with inference-time compute and synthetic data. Nevertheless, the overall trend of more compute=better performance seems to be holding since the original Transformer in 2017.

More efficient models mean you can squeeze more performance out of a given compute budget, but more compute will still be better. More efficient models mean you can do more with less compute, but you can do even more with more compute!

Now you may have your own opinions on scaling laws. You may think there is a plateau coming. You may argue past performance is not indicative of future results, as they say in finance. But that frankly doesn’t matter much if you want to understand the moves the largest AGI companies are making. All of the largest AGI companies are betting on scaling laws to hold long enough to reach AGI and ASI. This is their whole-hearted belief. And if they deeply believe this, then the only logical move is to obtain more compute.

(Personally, I am quite “scaling-pilled” but am open to evidence that suggests otherwise)

Now you may argue that NVIDIA GPUs are going to be obsolete soon, look at the performance of AMD, Cerebras, Graphcore, TPUs, Trainium, blah blah blah. There’s a million of these AI-specific hardware products that are all trying to compete with NVIDIA. And one of them might win in the future. In which case, maybe these AGI companies will switch to them. But this is completely orthogonal to DeepSeek’s success.

(Personally, I don’t see very strong evidence that other companies will topple NVIDIA’s domination of AI accelerator chips, given NVIDIA’s current market domination and continued level of innovation.)

So overall, I see no reason why DeepSeek means you should be bearish on NVIDIA. You may be bearish on NVIDIA for other reasons which may very well be justifiable and correct, but DeepSeek does not seem like the right justification to me.

Myth 4: DeepSeek didn’t make any meaningful innovations and are copying American companies

Wrong. There are numerous innovations in the design of the language model and how it was trained, some more significant than others. Here are a few (not a comprehensive list, read the DeepSeek-V3 and DeepSeek-R1 papers for more details):

Multi-head latent attention (MLA) - LLMs are usually Transformers, which utilize what is known as a multi-head attention (MHA) mechanism. The DeepSeek team developed a variant of the MHA mechanism that is more memory-efficient and also yields better performance.
GRPO with verifiable rewards - The AI community has been trying to replicate o1 since its release. Since OpenAI had been quite closed about how it works, the community had to explore a variety of different approaches for achieving o1-like results. There were various directions like Monte Carlo Tree Search (the approach used by Google DeepMind to win at Go) which turned out to be less promising than initially expected. On the other hand, DeepSeek demonstrated that a very simple reinforcement learning (RL) pipeline can actually achieve o1-like results. On top of that, they developed their own variant of the common PPO RL algorithm called GRPO that is more efficient and better-performing. I think many in the AI community have been wondering, why didn’t we try this before already?
DualPipe - When training an AI model over many GPUs there are a lot of efficiency aspects to consider. You need to figure out how the model and dataset are split across all the GPUs, how the data flows through the GPUs, etc. You also need to reduce any transfer of data between GPUs because it’s very slow; it’s better to process as much as you can on each individual GPU. Anyway, there are many ways to set up this sort of multi-GPU training, and the DeepSeek team designed a new approach called DualPipe that is significantly more efficient and faster.
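
To make the GRPO idea a little more concrete, here is a minimal sketch of its group-relative advantage step. Several responses are sampled for the same prompt, each gets a verifiable reward (for example 1 if the final answer is correct, 0 otherwise), and each reward is normalized by the group's mean and standard deviation instead of by a learned value model as in PPO. The function name, the epsilon constant, the choice of population standard deviation, and the toy rewards are all hypothetical, and the sketch leaves out the clipped policy-ratio objective and KL penalty of the full algorithm.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy example: four sampled answers to one math prompt, two of them correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, and a group where every answer gets the same reward contributes no learning signal at all.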

We are extremely lucky that DeepSeek has completely open-sourced and written in great detail these innovations, unlike American AGI companies. Now, everyone can benefit and improve their own training of AI models by utilizing these advances.

Myth 5: DeepSeek is “sucking knowledge” from ChatGPT

It has been claimed by David Sacks (AI and crypto czar for the US government) and OpenAI that DeepSeek is “sucking knowledge” from ChatGPT with a technique called distillation.

First of all, the term distillation is being used very weirdly here. Typically distillation refers to training on full probabilities (logits) of all the possible next words (tokens) but this info isn’t even exposed by ChatGPT.
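
The distinction can be illustrated with a small sketch. Distillation in the sense used above trains the student to match the teacher's full next-token distribution, typically through a KL-divergence loss on the logits, whereas training on generated text only ever sees the single token that was sampled. The three-token vocabulary, the logits, and the helper names below are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over the full next-token distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sampled_text_loss(sampled_token, student_logits):
    """Cross-entropy on one sampled token: all that generated text provides."""
    q = softmax(student_logits)
    return -math.log(q[sampled_token])

teacher_logits = [2.0, 1.0, 0.1]   # hypothetical 3-token vocabulary
student_logits = [1.5, 1.2, 0.3]

print(distillation_loss(teacher_logits, student_logits))
print(sampled_text_loss(0, student_logits))  # token 0 happened to be sampled
```

The first loss needs the teacher's logits for every token in the vocabulary, which is exactly the information a chat interface does not return.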

But okay, let’s say we’re talking about training on text generated by ChatGPT, despite that not being the typical use of the term.

OpenAI and its employees are claiming that DeepSeek themselves generated text with ChatGPT and trained on it. They have provided no evidence for this, but if this is true, then DeepSeek has clearly violated the ChatGPT Terms of Service. I think the legal ramifications of this, especially for a Chinese firm, are unclear, but I don’t know much about that.

Note that this is only if DeepSeek themselves generated the data to train on. If DeepSeek used ChatGPT-generated data available from other sources (there are many public datasets at this point), my understanding is that this form of “distillation” or synthetic data training is not prohibited by the TOS.

That said, in my opinion this doesn’t take away from the achievements of DeepSeek. Rather than the efficiency side of DeepSeek, what was more impressive to me as a researcher was their replication of o1. And I highly doubt performing “distillation” of ChatGPT would help in any way, simply because the o1 CoT thinking process was never exposed publicly, so how would DeepSeek be able to learn it?

Additionally, many LLMs do perform training on ChatGPT (and other LLM) synthetic data, plus there’s naturally going to be AI text in any new Internet scrapes anyway.

Overall, the argument that DeepSeek’s model performs well because it simply distilled ChatGPT ignores the reality of their engineering, efficiency and architectural innovations, as detailed in DeepSeek’s technical report.

Should we be worried about China’s dominance in AI?

Maybe a little bit? Frankly, not much really changed in terms of the Chinese-US AI race between now and 2 months ago. Rather, the reaction from outsiders has been quite dramatic and this may indeed affect the overall AI landscape through changes in funding, regulation, etc.

The Chinese have always been competitive in the AI space, but DeepSeek makes them impossible to ignore now.

The typical argument regarding open-source has been that because China is behind we shouldn’t openly share our technology for them to catch up. But clearly China has already caught up, and they frankly did a while back, and they are actually leading on open-source, so it’s unclear if closing off our technology actually helps significantly.

Note that companies like OpenAI, Anthropic, and Google DeepMind definitely have models better than DeepSeek R1. For example, the benchmark results for OpenAI’s o3 model are quite impressive and they likely already have another subsequent model wrapping up development.

On top of that, with significant additional investment like Project Stargate and OpenAI’s upcoming funding round, OpenAI and other American frontier labs will have plenty of compute to be able to maintain their lead.

Of course, China will be pouring lots of additional capital into AI development. So overall, the competition is heating up! But I think the path continues to be quite promising for American AGI frontier labs to remain at the top.

Conclusion

On one hand, some AI folks, especially some at OpenAI, are trying to underhype DeepSeek. On the other hand, the reaction to DeepSeek from some pundits and self-proclaimed experts is exaggerated and even dangerous. No, it’s not over for OpenAI/Anthropic/Meta/Google/xAI/NVIDIA/etc. No, DeepSeek is (probably) not lying about what they did. That said, DeepSeek deserves the recognition and R1 is an impressive model.

Finally, I want to note that there is so much more nuance and detail to what is discussed here. But I hope this article served as a useful jumping-off point for your own exploration of these topics. If other sources are sharing these falsehoods with no nuance, you can safely disregard them. But there are all kinds of more in-depth discussions from folks like Teortaxes, SemiAnalysis, etc., so be sure to check them out!

Acknowledgements

Thanks to Paul Scotti for his feedback.

Reinforcement Learning for Diffusion Models from Scratch

Tanishq Mathew Abraham, Ph.D. — Fri, 22 Sep 2023 07:00:00 GMT

Introduction

Over the past year we have seen the rise of generative AI that has mainly come in two forms:

Text-to-image generation powered by diffusion models like Stable Diffusion and DALL-E 2
Language models like ChatGPT and LLaMA-2

It turns out one of the key ingredients for the mainstream success of language models is the use of Reinforcement Learning from Human Feedback (RLHF) where language models are trained with human feedback to produce outputs that users are more likely to prefer. This has enabled these language models to more easily follow instructions, making these models significantly more accessible. Therefore the question arises if RLHF can be applied to diffusion models. This is a natural question to ask, since text-to-image diffusion models also struggle to follow prompts and tend to need prompt engineering skills in order to get desired results. A paper in May 2023 by the Levine Lab at UC Berkeley explored how the RLHF paradigm can be applied to diffusion models, resulting in an algorithm called DDPO. Here we’ll walk through a simple implementation of this DDPO algorithm. Let’s get started!

First let’s start with some basic imports:

import os
import requests
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
import clip # pip install git+https://github.com/openai/CLIP.git
import torch
import random
import math
import wandb
from torch import nn
from diffusers import StableDiffusionPipeline, DDIMScheduler
from PIL import Image
from fastprogress import progress_bar, master_bar

Let’s load our Stable Diffusion model. Let’s also enable some performance optimizations (TF32 support, attention slicing, memory-efficient xformers attention) that will make it faster to work with our Stable Diffusion model for training.

torch.backends.cuda.matmul.allow_tf32 = True
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
pipe.enable_attention_slicing()
pipe.enable_xformers_memory_efficient_attention()
pipe.text_encoder.requires_grad_(False)
pipe.vae.requires_grad_(False)

We’re using the diffusers library, which provides a simple-to-use interface for sampling the Stable Diffusion model using their pipeline:

prompt = "a photograph of an astronaut riding a horse"
img = pipe(prompt).images[0]

100%|██████████| 50/50 [00:03<00:00, 15.99it/s]

img

Okay then, we want to improve the images coming out of our model. In order to do so we should have some sort of score for the image that we can later optimize for. This score could represent how aesthetic the image is. This is frankly something that is quite subjective, and there is no mathematical equation for the aestheticness of an image. Instead we will use LAION’s aesthetic predictor, which was trained on thousands of human aesthetic ratings of AI-generated images and is a linear model on top of CLIP features. Below is the standard inference code for the aesthetic predictor model:

class MLP(nn.Module):
    def __init__(self, input_size, xcol='emb', ycol='avg_rating'):
        super().__init__()
        self.input_size = input_size
        self.xcol = xcol
        self.ycol = ycol
        self.layers = nn.Sequential(
            nn.Linear(self.input_size, 1024),
            #nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(1024, 128),
            #nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            #nn.ReLU(),
            nn.Dropout(0.1),

            nn.Linear(64, 16),
            #nn.ReLU(),

            nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.layers(x)

def load_aesthetic_model_weights(cache="."):
    weights_fname = "sac+logos+ava1-l14-linearMSE.pth"
    loadpath = os.path.join(cache, weights_fname)

    if not os.path.exists(loadpath):
        url = (
            "https://github.com/christophschuhmann/"
            f"improved-aesthetic-predictor/blob/main/{weights_fname}?raw=true"
        )
        r = requests.get(url)
        r.raise_for_status()  # fail loudly instead of caching an error page as weights

        with open(loadpath, "wb") as f:
            f.write(r.content)

    weights = torch.load(loadpath, map_location=torch.device("cpu"))
    return weights

def aesthetic_model_normalize(a, axis=-1, order=2):
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2 == 0] = 1
    return a / np.expand_dims(l2, axis)

We need the CLIP model, whose features will be passed into our aesthetic predictor:

clip_model, preprocess = clip.load("ViT-L/14", device="cuda")

aesthetic_model = MLP(768)

aesthetic_model.load_state_dict(load_aesthetic_model_weights())
aesthetic_model.cuda()

MLP(
  (layers): Sequential(
    (0): Linear(in_features=768, out_features=1024, bias=True)
    (1): Dropout(p=0.2, inplace=False)
    (2): Linear(in_features=1024, out_features=128, bias=True)
    (3): Dropout(p=0.2, inplace=False)
    (4): Linear(in_features=128, out_features=64, bias=True)
    (5): Dropout(p=0.1, inplace=False)
    (6): Linear(in_features=64, out_features=16, bias=True)
    (7): Linear(in_features=16, out_features=1, bias=True)
  )
)

image = preprocess(img).unsqueeze(0).cuda()
with torch.no_grad(): image_features = clip_model.encode_image(image)

im_emb_arr = aesthetic_model_normalize(image_features.cpu().detach().numpy())
prediction = aesthetic_model(torch.from_numpy(im_emb_arr).float().cuda())

print(f'Aesthetic score: {prediction}')

Aesthetic score: tensor([[5.1999]], device='cuda:0', grad_fn=)

Just like that, we get the aesthetic score from this predictor. Let’s package this code into a function:

def aesthetic_scoring(img, preprocess, clip_model, aesthetic_model_normalize, aesthetic_model):
    image = preprocess(img).unsqueeze(0).cuda()
    with torch.no_grad(): image_features = clip_model.encode_image(image)
    im_emb_arr = aesthetic_model_normalize(image_features.cpu().detach().numpy())
    prediction = aesthetic_model(torch.from_numpy(im_emb_arr).float().cuda())
    return prediction

prompt = "a horse"
img = pipe(prompt).images[0]
print(f'Aesthetic score: {aesthetic_scoring(img, preprocess, clip_model, aesthetic_model_normalize, aesthetic_model)[0][0]}')
img

100%|██████████| 50/50 [00:02<00:00, 19.03it/s]

Aesthetic score: 5.421084403991699

prompt = "a beautiful, exquisite portrait of a horse, 4k, unreal engine"
img = pipe(prompt).images[0]
print(f'Aesthetic score: {aesthetic_scoring(img, preprocess, clip_model, aesthetic_model_normalize, aesthetic_model)[0][0]}')
img

100%|██████████| 50/50 [00:02<00:00, 19.17it/s]

Aesthetic score: 5.927961826324463

prompt = "a very ugly photograph of a donkey"
img = pipe(prompt).images[0]
print(f'Aesthetic score: {aesthetic_scoring(img, preprocess, clip_model, aesthetic_model_normalize, aesthetic_model)[0][0]}')
img

100%|██████████| 50/50 [00:02<00:00, 19.16it/s]

Aesthetic score: 5.116610050201416

You can see if you prompt for something “ugly” we do get an image with a lower score. But to be honest, it’s not that much lower. So I found another “ugly donkey” image online to score instead:

import requests
from io import BytesIO
img = Image.open(BytesIO(requests.get("https://i.redd.it/8wbqtdequzv41.jpg").content)).resize((512,512))
print(f'Aesthetic score: {aesthetic_scoring(img, preprocess, clip_model, aesthetic_model_normalize, aesthetic_model)[0][0]}')
img

Aesthetic score: 4.846365928649902

Okay that’s definitely much lower.

As you can see, for Stable Diffusion-generated images, variation of the score doesn’t tend to be much. There seems to be a sort of average aesthetic score that most images fluctuate around (~5.3). Can we increase the average aesthetic score of the images produced by Stable Diffusion? Reinforcement learning (RL) will help us do this.

What if we could optimize the aesthetic score?

Now that we have some sort of measure of quality of our image, our aesthetic score, we can optimize for it. In the RL literature, this measure of quality that we are optimizing for is referred to as the reward. The goal of RL algorithms is to optimize the reward. We will see how DDPO does this for diffusion models.

Before we go down the RL route though, it is worth examining if there are alternative approaches. Diffusion models, after all, are an extremely versatile framework, and people have been incorporating different constraints and forms of guidance during sampling in order to achieve desired results. Let’s do a quick refresher about diffusion models and how guidance is applied.

Diffusion model refresher

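A diffusion model is described by a forward and reverse process. The forward process is when we start out with a clean image $\mathbf{x}_0$ and repeatedly add Gaussian noise $\epsilon_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ ($\mathbf{I}$ being the identity matrix) to give us noisier and noisier images $\mathbf{x}_t$. This is described by the following:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\right)$$

where $\beta_t$ is a predefined monotonically increasing variance schedule. The forward process runs for a total of $T$ timesteps and finally ends with pure noise $\mathbf{x}_T$. The reverse process starts with pure noise $\mathbf{x}_T$ and uses a neural network to repeatedly denoise the image, giving us $\mathbf{x}_{t-1}$. The end of the reverse process gives us back our samples $\mathbf{x}_0$. This is described as follows:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \tilde{\beta}_t \mathbf{I}\right)$$

where $\tilde{\beta}_t$ is the variance schedule for the reverse process and $\mu_\theta$ is the denoising neural network. Note that the denoising neural network can be reparameterized to predict the noise in the image, $\epsilon_\theta(\mathbf{x}_t, t)$. So instead of predicting the denoised image $\hat{\mathbf{x}}_0$ directly, we can predict the noise in the image and subtract it out to get $\hat{\mathbf{x}}_0$. We train the reparameterized denoising neural network in the reverse diffusion process with a simple MSE loss:

$$L_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(\mathbf{x}_t, t) \right\|^2\right]$$

In practice, training and sampling are quite simple. During each training step, a random image $\mathbf{x}_0$ and timestep $t$ are selected, the forward process is run from $\mathbf{x}_0$ to timestep $t$ to get $\mathbf{x}_t$ using the noise $\epsilon$, this is passed into our denoising model, and the MSE between the $\epsilon$ used to calculate $\mathbf{x}_t$ and the predicted $\epsilon_\theta(\mathbf{x}_t, t)$ is optimized. During sampling, we start out with random Gaussian noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and the denoising neural network is repeatedly applied to give us $\mathbf{x}_{t-1}$ until we reach our sample $\mathbf{x}_0$.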
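
The training step described above can be sketched without any deep learning framework. The sketch assumes a linear variance schedule and uses the standard closed form for jumping straight from the clean image to timestep $t$, $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)$, which follows from composing the Gaussian forward steps. The schedule endpoints, the four-pixel "image", and the placeholder denoiser are all hypothetical stand-ins.

```python
import math
import random

T = 1000
# Assumed linear variance schedule beta_1..beta_T (monotonically increasing).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s).
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def q_sample(x0, t, eps):
    """Noise a clean image x0 directly to timestep t (0-indexed)."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dummy_denoiser(xt, t):
    """Placeholder for the noise-prediction network; always predicts zero noise."""
    return [0.0 for _ in xt]

# One training step: random timestep, random noise, MSE on the noise.
random.seed(0)
x0 = [0.5, -0.2, 0.1, 0.9]                   # toy "image" with four pixels
t = random.randrange(T)
eps = [random.gauss(0.0, 1.0) for _ in x0]
xt = q_sample(x0, t, eps)
loss = mse(eps, dummy_denoiser(xt, t))
print(f"t={t}, loss={loss:.4f}")
```

In a real setup the placeholder would be a U-Net and the loss would be backpropagated, but the data flow (sample $t$, sample $\epsilon$, noise the image, compare predicted and true noise) is the same.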

It is also worth noting the score matching intuition behind diffusion models. Specifically, $\epsilon_\theta(\mathbf{x}_t, t)$ turns out to actually be an estimate (up to a factor) of the “score function” $\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$. Basically, what this means is that when we sample from a diffusion model, we are iteratively taking steps in the direction of this score function, which is the gradient of the log-likelihood. So the sampling is very much like an optimization problem.

If any of this is unfamiliar to you, I recommend checking out the fast.ai course on the subject (I am somewhat biased though given I co-taught the class!).

Let’s now discuss how additional constraints and guidance can be added during diffusion sampling. Basically, we want to model $p(\mathbf{x} \mid y)$ where $y$ is some sort of condition or constraint (for example a class condition). In diffusion models, we could instead try to estimate the conditional score $\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t \mid y)$ and use this in sampling. This can be expressed differently using Bayes’ Rule:

$$ \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t \mid y) = \nabla_{\mathbf{x}_t} \log p(y \mid \mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) $$

The second term is our score function that is already being estimated by our diffusion model $\epsilon_\theta(\mathbf{x}_t, t)$. The first term, however, is the gradient of the log likelihood of a classifier with respect to $\mathbf{x}_t$. What this means is that during diffusion model sampling, if we use a modified $\epsilon_\theta(\mathbf{x}_t, t)$ with the classifier gradient added to it, we can get samples that adhere to the desired condition.
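
To make the Bayes’ Rule decomposition concrete, here is a small worked check on a one-dimensional toy problem where everything is known in closed form. The setup is an assumption made purely for illustration (it is not how guidance is implemented for images): the unconditional density is $p(x) = \mathcal{N}(0, 1)$ and the "classifier" is $p(y \mid x) = \mathcal{N}(y; x, s^2)$, so the exact posterior is $\mathcal{N}(y/(1+s^2),\, s^2/(1+s^2))$.

```python
def prior_score(x):
    """Score of the unconditional density p(x) = N(0, 1)."""
    return -x

def classifier_grad(x, y, s2):
    """Gradient of log p(y | x) w.r.t. x for the toy classifier N(y; x, s2)."""
    return (y - x) / s2

def guided_score(x, y, s2):
    """Bayes' rule in score form: classifier gradient plus unconditional score."""
    return classifier_grad(x, y, s2) + prior_score(x)

def posterior_score(x, y, s2):
    """Exact score of the posterior p(x | y), used only as a reference."""
    mean = y / (1.0 + s2)
    var = s2 / (1.0 + s2)
    return -(x - mean) / var

x, y, s2 = 0.3, 2.0, 0.5
guided = guided_score(x, y, s2)
exact = posterior_score(x, y, s2)
```

The two values agree, which is the one-dimensional version of adding the classifier gradient to the model’s score estimate.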

More broadly, a loss can be applied to the noisy images and its gradient can be added to the score function/denoising network output to try to obtain images that better adhere to some desired constraints. This is the idea behind CLIP-guided diffusion, for example, where the similarity between the CLIP text embedding of a prompt and the CLIP image embedding of the images during sampling is maximized.

Can we use guidance to get diffusion models to generate more aesthetic images that better adhere to user prompts? Potentially yes, but there are many challenges that may make it undesirable.

Strictly speaking, the proper way to perform classifier guidance is to use a classifier trained on either the noisy images or the predicted denoised images (which tend to be blurry, especially early on in sampling), which is what the original classifier guidance paper demonstrates. Note, it is possible to use classifiers and other models trained on regular images and get reasonable results for guidance, even though noisy or blurry images will likely be out-of-distribution. For example CLIP isn’t trained on noisy images but is used in CLIP-guided diffusion. But often to get reasonable results, various hacks and tricks are required (in the case of CLIP-guided diffusion, the use of this technique fell out of popularity once diffusion models properly conditioned on CLIP features like Stable Diffusion were developed, and CLIP guidance applied on top of models like Stable Diffusion often showed minimal benefit).

Additionally, note that guidance requires the calculation of whatever guidance loss we have and autograd of that loss at each step in the sampling process. This can add a significant overhead to the sampling time compared to guidance-free sampling. The situation is made worse with latent diffusion models like Stable Diffusion, where the latents often need to be decoded to full images in order to apply the classifier/loss, resulting in additional computational overhead.

For these reasons, we will instead apply reinforcement learning to obtain a diffusion model after optimizing arbitrary constraints (the reward function), such as aesthetic scores. As we will see, the generated images from the starting diffusion model are passed into the reward function, so there is no concern of images being out-of-distribution. Additionally, we will obtain a diffusion model that provides higher scoring images directly, not through sampling changes like guidance.

Okay let’s proceed with trying to apply RL to diffusion models. First we’ll create a dataset generator - Stable Diffusion generated images given some prompts. We’ll use animal prompts:

!wget https://raw.githubusercontent.com/formigone/tf-imagenet/master/LOC_synset_mapping.txt

--2023-06-13 09:54:06--  https://raw.githubusercontent.com/formigone/tf-imagenet/master/LOC_synset_mapping.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31675 (31K) [text/plain]
Saving to: ‘LOC_synset_mapping.txt.14’

LOC_synset_mapping. 100%[===================>]  30.93K  --.-KB/s    in 0.006s  

2023-06-13 09:54:06 (5.24 MB/s) - ‘LOC_synset_mapping.txt.14’ saved [31675/31675]

synsets = {k:v for k,v in [o.split(',')[0].split(' ', maxsplit=1) for o in Path('LOC_synset_mapping.txt').read_text().splitlines()]}

imagenet_classes = list(synsets.values())

def imagenet_animal_prompts():
    animal = random.choice(imagenet_classes[:397])
    prompts = f'{animal}'
    return prompts

imagenet_animal_prompts()

'sea urchin'

Put into a dataset class:

class PromptDataset(torch.utils.data.Dataset):
    def __init__(self, prompt_fn, num):
        super().__init__()
        self.prompt_fn = prompt_fn
        self.num = num
        
    def __len__(self): return self.num
    def __getitem__(self, x): return self.prompt_fn()

Next let’s set up our sampling loop. For simplicity, we’ll just use the DDIM scheduler:

pipe.scheduler = DDIMScheduler(
    num_train_timesteps=pipe.scheduler.num_train_timesteps,
    beta_start=pipe.scheduler.beta_start,
    beta_end=pipe.scheduler.beta_end,
    beta_schedule=pipe.scheduler.beta_schedule,
    trained_betas=pipe.scheduler.trained_betas,
    clip_sample=pipe.scheduler.clip_sample,
    set_alpha_to_one=pipe.scheduler.set_alpha_to_one,
    steps_offset=pipe.scheduler.steps_offset,
    prediction_type=pipe.scheduler.prediction_type
)

Below we have a sampling function that also gives us the intermediate timesteps. Again, this is a pretty standard diffusion sampling loop; check out the Hugging Face blog post for more information.

@torch.no_grad()
def sd_sample(prompts, pipe, height, width, guidance_scale, num_inference_steps, eta, device):
    scheduler = pipe.scheduler
    unet = pipe.unet
    text_embeddings = pipe._encode_prompt(prompts,device, 1, do_classifier_free_guidance=guidance_scale > 1.0)

    scheduler.set_timesteps(num_inference_steps, device=device)
    latents = torch.randn((len(prompts), unet.in_channels, height//8, width//8)).to(device)

    all_step_preds = []

    for i, t in enumerate(progress_bar(scheduler.timesteps)):
        input = torch.cat([latents] * 2)
        input = scheduler.scale_model_input(input, t)

        # predict the noise residual
        pred = unet(input, t, encoder_hidden_states=text_embeddings).sample

        # perform guidance
        pred_uncond, pred_text = pred.chunk(2)
        pred = pred_uncond + guidance_scale * (pred_text - pred_uncond)

        # compute the "previous" noisy sample
        scheduler_output = scheduler.step(pred, t, latents, eta)

        all_step_preds.append(scheduler_output)
        latents = scheduler_output.prev_sample
    
    return latents, all_step_preds

preds, all_step_preds = sd_sample([prompt]*2, pipe, 512, 512, 7.5, 50, 1, 'cuda')

The sampling function only gives us latents, and they need to be decoded by a VAE to get the final images:

@torch.no_grad()
def decoding_fn(latents,pipe):
    images = pipe.vae.decode(1 / 0.18215 * latents.cuda()).sample
    images = (images / 2 + 0.5).clamp(0, 1)
    images = images.detach().cpu().permute(0, 2, 3, 1).numpy()
    images = (images * 255).round().astype("uint8")
    return images

Image.fromarray(decoding_fn(preds,pipe)[0])

Let’s again calculate the aesthetic score. We have to make a slight modification to our aesthetic_scoring function so it can handle batches.

def aesthetic_scoring(imgs, preprocess, clip_model, aesthetic_model_normalize, aesthetic_model):    
    imgs = torch.stack([preprocess(Image.fromarray(img)).cuda() for img in imgs])
    with torch.no_grad(): image_features = clip_model.encode_image(imgs)
    im_emb_arr = aesthetic_model_normalize(image_features.cpu().detach().numpy())
    prediction = aesthetic_model(torch.from_numpy(im_emb_arr).float().cuda())
    return prediction

imgs = decoding_fn(preds,pipe)
aesthetic_scoring(imgs, preprocess, clip_model, aesthetic_model_normalize, aesthetic_model)

tensor([[5.1635],
        [5.4475]], device='cuda:0', grad_fn=)

Now we can initialize our dataset generator, which provides prompts. We then sample with those prompts to generate images and pass the images into the aesthetic predictor, all in just a few lines of code:

train_set = PromptDataset(imagenet_animal_prompts, 1000)
train_dl = torch.utils.data.DataLoader(train_set, batch_size=2, shuffle=True, num_workers=0)

prompts = next(iter(train_dl))
preds, all_step_preds = sd_sample(prompts, pipe, 512, 512, 7.5, 50, 1, 'cuda')
imgs = decoding_fn(preds,pipe)
rewards = aesthetic_scoring(imgs, preprocess, clip_model, aesthetic_model_normalize, aesthetic_model)

index = torch.where(rewards == rewards.min())[0][0]
print(prompts[index])
Image.fromarray(imgs[index])

polecat

index = torch.where(rewards == rewards.max())[0][0]
print(prompts[index])
Image.fromarray(imgs[index])

garter snake

Once again, the aesthetic predictor provides rewards and these are used in optimization with RL. Our goal is to maximize the reward!

For stability during RL training, the rewards are usually normalized. There are different ways of doing this, but the DDPO paper utilizes a simple approach, which is to normalize based on unique prompts. Basically, a queue is set up with prompts and the corresponding rewards, with some given buffer size. When a new (prompt, reward) pair is obtained, it is added to the queue. If the queue for that unique prompt is not long enough, the reward is just normalized over the whole batch. Otherwise, it is normalized based on the statistics for that specific prompt. This is implemented below:

from collections import deque
class PerPromptStatTracker:
    def __init__(self, buffer_size, min_count):
        self.buffer_size = buffer_size
        self.min_count = min_count
        self.stats = {}

    def update(self, prompts, rewards):
        unique = np.unique(prompts)
        advantages = np.empty_like(rewards)
        for prompt in unique:
            prompt_rewards = rewards[prompts == prompt]
            if prompt not in self.stats:
                self.stats[prompt] = deque(maxlen=self.buffer_size)
            self.stats[prompt].extend(prompt_rewards)

            if len(self.stats[prompt]) < self.min_count:
                mean = np.mean(rewards)
                std = np.std(rewards) + 1e-6
            else:
                mean = np.mean(self.stats[prompt])
                std = np.std(self.stats[prompt]) + 1e-6
            advantages[prompts == prompt] = (prompt_rewards - mean) / std

        return advantages

per_prompt_stat_tracker = PerPromptStatTracker(buffer_size=32, min_count=16)

rewards.squeeze().cpu()

tensor([5.4786, 5.5764], grad_fn=)

advantages = per_prompt_stat_tracker.update(np.array(prompts), rewards.squeeze().cpu().detach().numpy())

advantages

array([-0.99998444,  0.9999747 ], dtype=float32)

per_prompt_stat_tracker.stats

{'garter snake': deque([5.576433], maxlen=32),
 'polecat': deque([5.478563], maxlen=32)}

We have the reward normalization set up. Let’s look at some of the intermediate timesteps:

Image.fromarray(decoding_fn(all_step_preds[0].prev_sample, pipe)[0])

Image.fromarray(decoding_fn(all_step_preds[30].prev_sample, pipe)[0])

As expected, we get pure noise at the beginning of sampling (pure noise in the latent space, that is; it looks somewhat different when decoded by the Stable Diffusion VAE), and as sampling progresses, the image starts to take shape.

The DDPO algorithm

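Okay let’s now dig into how reinforcement learning works and derive the DDPO objective. A reminder that what we want to do is to maximize the reward signal. We can mathematically express this as follows:

$$ \theta^\star = \arg\max_\theta \; \mathbb{E}_{\mathbf{c} \sim p(\mathbf{c}),\, \mathbf{x}_0 \sim p_\theta(\cdot \mid \mathbf{c})}\left[ r(\mathbf{x}_0, \mathbf{c}) \right] $$

where $\theta$ is the weights of our diffusion model, $\mathbf{c}$ is some conditioning for the diffusion model, and $r(\cdot)$ is our reward function.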

It would be nice to directly maximize $r(\mathbf{x}_0, \mathbf{c})$, and if our model were a single evaluation of a neural network, we could simply backpropagate through the neural network and use an optimizer to update the weights. But that’s not what happens with a diffusion model! We have multiple timesteps for which we apply our denoising neural network. This constructs a trajectory, as it’s known in the RL literature. In standard RL literature, our trajectory is composed of states and actions. A model that we are optimizing provides the next action given the current state, and this model is referred to as the policy. This framework is known as a Markov Decision Process (MDP). Note that in the general MDP framework, a reward is usually given after each action, and we optimize over the sum of rewards over the whole trajectory.

We can easily describe diffusion models as an MDP, which will allow us to use standard results in RL for diffusion model optimization.

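$\mathbf{s}_t$ is the state, which is just the current noisy image $\mathbf{x}_t$ (along with the timestep and condition info), so $\mathbf{s}_t = (\mathbf{c}, t, \mathbf{x}_t)$.
$\mathbf{a}_t$ is the action, which is the slightly less noisy image $\mathbf{x}_{t-1}$.
$\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)$ is the policy that takes the current state and provides the next action, which is our diffusion model $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c})$.
$\rho_0(\mathbf{s}_0)$ is the distribution of the initial states, which in this case is the same distribution as for $\mathbf{x}_T$, a standard isotropic normal distribution, with the timestep always being $T$ and the conditioning having whatever prior distribution $p(\mathbf{c})$ it has in the dataset.
$P(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$ is the transition distribution giving the next state $\mathbf{s}_{t+1}$ given the current state and action, and is basically just saying it always goes to the state associated with $\mathbf{x}_{t-1}$.
$R(\mathbf{s}_t, \mathbf{a}_t)$ is the reward after each action/state (which is zero until the very last step when the generation is complete).

The entire trajectory can be represented as $\tau$ and the probability density for trajectories as $p_\theta(\tau)$. Note this is different from the diffusion model $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c})$ as it is a function of $\tau$, a bit confusing! So going forward the diffusion model is referred to as $p_\theta^{\text{diff}}$.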

Diffusion models as a Markov Decision Process.
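
The mapping from a denoising chain to MDP transitions can be sketched in a few lines. This is only an illustration of the bookkeeping: the `Transition` dataclass and `chain_to_transitions` helper are hypothetical names introduced here, and strings stand in for the latent tensors.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Transition:
    condition: Any   # c, e.g. the text prompt
    timestep: int    # t, counting down from T to 1
    state: Any       # x_t, the current noisy image
    action: Any      # x_{t-1}, the slightly less noisy image
    reward: float    # zero except on the final denoising step

def chain_to_transitions(latents: List[Any], condition: Any, final_reward: float) -> List[Transition]:
    """Turn a denoising chain [x_T, ..., x_0] into a list of MDP transitions."""
    T = len(latents) - 1
    transitions = []
    for i in range(T):
        t = T - i
        # Only the step that produces x_0 receives the reward r(x_0, c).
        reward = final_reward if t == 1 else 0.0
        transitions.append(Transition(condition, t, latents[i], latents[i + 1], reward))
    return transitions

chain = ["x_3", "x_2", "x_1", "x_0"]  # placeholders for latents
trajectory = chain_to_transitions(chain, condition="sea urchin", final_reward=5.2)
```

Each element of `trajectory` pairs a state with the action the policy took from it, and only the last one carries a nonzero reward.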

Okay, with this framework in place it becomes trivial to apply standard policy gradient optimization methods like REINFORCE and proximal policy optimization (PPO). Let’s go over these two algorithms now. I have also written a more principled from-scratch derivation here.

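We’ll start by looking at the general case of taking the gradient of the expectation over $p_\theta(x)$ of some function $f(x)$.

$$ \nabla_\theta \mathbb{E}_{x \sim p_\theta}\left[f(x)\right] = \nabla_\theta \int p_\theta(x) f(x)\, dx = \int p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, f(x)\, dx = \mathbb{E}_{x \sim p_\theta}\left[f(x)\, \nabla_\theta \log p_\theta(x)\right] $$

This is referred to as the score function gradient estimator.

Let’s think more about what this is doing. We want to calculate the gradient of $\mathbb{E}_{x \sim p_\theta}[f(x)]$ with respect to $\theta$. That is, we want to know how we can change $\theta$ such that we get samples from $p_\theta$ that on average give higher $f(x)$ values. What this estimator says is that we can take the gradient $\nabla_\theta \log p_\theta(x)$ (which tells you how to change $\theta$ to increase the likelihood of $x$ under your distribution $p_\theta$), and weight it with $f(x)$. So if this is being used in gradient ascent for example, we are placing more weight on updates that make high-scoring samples more likely under our model $p_\theta$.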
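
The score function gradient estimator can be checked numerically on a case where the answer is known. The following sketch assumes a toy distribution chosen for illustration, $p_\theta = \mathcal{N}(\theta, 1)$ with $f(x) = x^2$, so that $\mathbb{E}[f(x)] = \theta^2 + 1$ and the true gradient is $2\theta$; for this Gaussian, $\nabla_\theta \log p_\theta(x) = x - \theta$. The function name and sample count are arbitrary.

```python
import random

def score_function_gradient(theta, f, num_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)].

    Uses grad_theta log p_theta(x) = x - theta for a unit-variance Gaussian.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(theta, 1.0)
        total += f(x) * (x - theta)  # f(x) weighted by the score
    return total / num_samples

theta = 1.5
estimate = score_function_gradient(theta, lambda x: x * x)
exact = 2.0 * theta  # because E[x^2] = theta^2 + 1
```

Note that `f` is only ever evaluated, never differentiated, which is what lets the same trick work with a black-box reward model.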

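When we apply this to the MDP framework and simplify (the initial-state and transition terms in $\log p_\theta(\tau)$ do not depend on $\theta$, so only the policy terms survive), we can get our policy gradient:

$$ \hat{g} = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\right)\left(\sum_{t=0}^{T} R(\mathbf{s}_t, \mathbf{a}_t)\right)\right] $$

This gradient is referred to as the REINFORCE gradient and is only one type of policy gradient that could be used. Of course, this policy gradient is then used to update the weights of our model using gradient ascent:

$$ \theta \leftarrow \theta + \alpha\, \hat{g} $$

where $\alpha$ is some learning rate.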

One implementation point is that the expectation is over the trajectories, but of course we can’t sum over all possible trajectories. The expectation is estimated with just the sampled trajectories in the current batch. One other implementation point to mention: we could calculate our gradient and then pass that gradient to our optimizer, or we could let autograd handle the calculation of the gradient by constructing a loss function and treating it as a standard training loop. The latter is what is done in practice even though it is not often mentioned explicitly in the papers. So our loss function is:

$$ L(\theta) = -\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=0}^{T} \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\right)\left(\sum_{t=0}^{T} R(\mathbf{s}_t, \mathbf{a}_t)\right)\right] $$

Again, let’s reinforce the intuition behind the REINFORCE gradient/loss function (pun definitely intended). You can see the loss looks very much like a negative log-likelihood loss, with the actions as our target. The difference here is that it is weighted by the reward. What this loss is doing is trying to make high-reward trajectories more likely and low-reward trajectories less likely.

Higher reward trajectories (represented by the checkmark) are encouraged by the policy gradient (represented by the higher peak).
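
To see the "reward-weighted negative log-likelihood" behavior in isolation, here is a minimal sketch with a softmax policy over three discrete actions (a toy stand-in for the diffusion policy; the helper names, reward values, and learning rate are all illustrative assumptions). It takes one gradient descent step on the loss $-R \log \pi(a)$ for a sampled action and compares the action’s probability before and after.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_loss_grad(logits, action, reward):
    """Gradient w.r.t. the logits of loss = -reward * log pi(action)."""
    probs = softmax(logits)
    # d log pi(action) / d logit_k = 1[k == action] - pi_k
    return [-reward * ((1.0 if k == action else 0.0) - p) for k, p in enumerate(probs)]

def sgd_step(logits, grad, lr):
    return [l - lr * g for l, g in zip(logits, grad)]

logits = [0.0, 0.0, 0.0]  # uniform policy over three actions
before = softmax(logits)[1]

# The sampled action (index 1) received a positive reward ...
after_good = softmax(sgd_step(logits, reinforce_loss_grad(logits, 1, reward=2.0), lr=0.5))[1]
# ... versus the same action receiving a negative (below-average) reward.
after_bad = softmax(sgd_step(logits, reinforce_loss_grad(logits, 1, reward=-2.0), lr=0.5))[1]
```

With a positive reward the action becomes more likely after the step, and with a negative reward it becomes less likely.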

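Okay so we can simply plug in our diffusion model terms based on how it fits into the MDP framework, which we described earlier. We get:

$$ \hat{g}_{\text{SF}} = \mathbb{E}\left[\sum_{t=1}^{T} \nabla_\theta \log p_\theta^{\text{diff}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c})\; r(\mathbf{x}_0, \mathbf{c})\right] $$

where the expectation is over conditions $\mathbf{c}$ and denoising trajectories sampled from the current model.

This objective and gradient estimator is referred to in the paper as DDPO_SF.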

One challenge with this approach is that for each optimization step, sampling from the current iteration of the model needs to be performed, since the trajectories in the expectation must come from the current version of the model. This can be computationally expensive and wasteful, as the samples collected with previous iterations of the model cannot be used to learn.

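One trick to address this is known as importance sampling. This relies on the following identity (trivial to demonstrate based on the expectation definition):

$$ \mathbb{E}_{x \sim p(x)}\left[f(x)\right] = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right] $$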
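
A quick numerical sanity check of the importance sampling idea, as a sketch under assumed toy distributions: we want the mean of $p = \mathcal{N}(1, 1)$ but only draw samples from a different, wider distribution $q = \mathcal{N}(0, 2^2)$, reweighting each sample by $p(x)/q(x)$. The distributions, sample count, and function names are illustrative choices.

```python
import math
import random

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def importance_sampled_mean(f, num_samples=50_000, seed=0):
    """Estimate E_{x ~ p}[f(x)] for p = N(1, 1) using samples from q = N(0, 2^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 2.0)                                     # sample from q, not p
        weight = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # importance ratio p(x) / q(x)
        total += weight * f(x)
    return total / num_samples

estimate = importance_sampled_mean(lambda x: x)  # the true value is the mean of p, i.e. 1.0
```

In the RL setting, $q$ plays the role of the old policy that generated the stored trajectories and $p$ the current policy being updated.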

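Applying importance sampling gives us the following:

$$ \hat{g} = \mathbb{E}_{\tau \sim p_{\theta_{\text{old}}}(\tau)}\left[\left(\sum_{t=0}^{T} \frac{\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)}{\pi_{\theta_{\text{old}}}(\mathbf{a}_t \mid \mathbf{s}_t)} \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\right)\left(\sum_{t=0}^{T} R(\mathbf{s}_t, \mathbf{a}_t)\right)\right] $$

where $\theta_{\text{old}}$ denotes the weights of the model that generated the trajectories.

Again this can be written down as a loss function that we perform gradient descent on (sometimes referred to as the surrogate loss):

$$ L(\theta) = -\mathbb{E}_{\tau \sim p_{\theta_{\text{old}}}(\tau)}\left[\left(\sum_{t=0}^{T} \frac{\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)}{\pi_{\theta_{\text{old}}}(\mathbf{a}_t \mid \mathbf{s}_t)}\right)\left(\sum_{t=0}^{T} R(\mathbf{s}_t, \mathbf{a}_t)\right)\right] $$

Again, we can plug in the diffusion model terms based on the MDP framework and get this loss function:

$$ L(\theta) = -\mathbb{E}\left[\sum_{t=1}^{T} \frac{p_\theta^{\text{diff}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c})}{p_{\theta_{\text{old}}}^{\text{diff}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c})}\; r(\mathbf{x}_0, \mathbf{c})\right] $$

Minimization of this loss function is equivalent to gradient ascent with the following gradient:

$$ \hat{g}_{\text{IS}} = \mathbb{E}\left[\sum_{t=1}^{T} \frac{p_\theta^{\text{diff}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c})}{p_{\theta_{\text{old}}}^{\text{diff}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c})}\; \nabla_\theta \log p_\theta^{\text{diff}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c})\; r(\mathbf{x}_0, \mathbf{c})\right] $$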

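Note that the reward is usually normalized, and the normalized reward is referred to as the advantage $A(\mathbf{x}_0, \mathbf{c})$. So the advantage can be negative if the reward is less than average.

Note that we don’t want the current policy to diverge too much from the previous policy, otherwise training may become unstable and we may get a bad policy. To help address this, we can apply clipping to the importance sampling ratio in the loss function:

$$ L(\theta) = -\mathbb{E}\left[\sum_{t=1}^{T} \min\left(\rho_t(\theta)\, A(\mathbf{x}_0, \mathbf{c}),\; \text{clip}\left(\rho_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) A(\mathbf{x}_0, \mathbf{c})\right)\right], \qquad \rho_t(\theta) = \frac{p_\theta^{\text{diff}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c})}{p_{\theta_{\text{old}}}^{\text{diff}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c})} $$

where $\epsilon$ is the clipping range. So if the policy diverges too much (the ratio is either much larger or much smaller than 1) the loss function is clipped to a certain value and the gradient will be zero and no updates will be made. The below diagram (taken from here) clarifies this further: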

Annotated PPO loss function.
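
The clipping behavior is easy to verify for a single timestep with plain Python floats. This is a minimal sketch: the function name is made up for illustration, and the clipping range of 0.2 is just a commonly used PPO value chosen to make the numbers readable (the training code in this post uses a much smaller `clip_ratio`).

```python
def clipped_surrogate_loss(ratio, advantage, eps):
    """Per-timestep loss: -min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return -min(ratio * advantage, clipped_ratio * advantage)

eps = 0.2
cases = [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]
losses = [clipped_surrogate_loss(ratio, advantage, eps) for ratio, advantage in cases]
```

For the first case (ratio above the range with a positive advantage) and the last case (ratio below the range with a negative advantage), the clipped term is the one selected, so the loss no longer depends on the ratio and contributes no gradient. In the other two cases the unclipped term is selected, so the policy can still be pulled back toward the old one.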

Note that DDPO also clips the advantages themselves, $A(\mathbf{x}_0, \mathbf{c})$, in its implementation but this is not described in the paper, so I have not included it in the loss function.

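The loss function can be written in a way that’s numerically easier to calculate/more stable (using logs, ignoring the clipping for now):

$$ L(\theta) = -\mathbb{E}\left[\sum_{t=1}^{T} A(\mathbf{x}_0, \mathbf{c})\, \exp\left(\log p_\theta^{\text{diff}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c}) - \log p_{\theta_{\text{old}}}^{\text{diff}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c})\right)\right] $$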
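
A small example of why the log form helps, using made-up log-probability values: when the individual probabilities are tiny, exponentiating them first underflows to zero and the ratio cannot be formed, whereas subtracting the log-probabilities and exponentiating once is fine.

```python
import math

log_p_new = -800.0  # illustrative log-probabilities of one action
log_p_old = -801.0  # under the new and old policy

p_new = math.exp(log_p_new)  # underflows to 0.0 in double precision
p_old = math.exp(log_p_old)  # underflows to 0.0 in double precision

try:
    naive_ratio = p_new / p_old
except ZeroDivisionError:
    naive_ratio = None  # the direct ratio is undefined after underflow

stable_ratio = math.exp(log_p_new - log_p_old)  # exp(1.0), no underflow
```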

The objective and gradient estimator described here is referred to in the paper as DDPO_IS. It is pretty much the same as proximal policy optimization (PPO), applied to diffusion models.

For a more complete derivation of the DDPO objective from scratch, see here.

In order to start implementing this loss function, let’s calculate the log probs, which is easy for a normal distribution:

def calculate_log_probs(prev_sample, prev_sample_mean, std_dev_t):
    std_dev_t = torch.clip(std_dev_t, 1e-6)
    log_probs = -((prev_sample.detach() - prev_sample_mean) ** 2) / (2 * std_dev_t ** 2) - torch.log(std_dev_t) - math.log(math.sqrt(2 * math.pi))
    return log_probs
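
As a sanity check on the formula, here is a scalar version of the same expression compared against the Gaussian density from Python’s standard library. The function name `gaussian_log_prob` and the test values are illustrative; the tensor version above applies the same expression elementwise.

```python
import math
from statistics import NormalDist

def gaussian_log_prob(x, mean, std):
    """Scalar version of the expression in calculate_log_probs."""
    std = max(std, 1e-6)  # same lower bound on the standard deviation
    return -((x - mean) ** 2) / (2 * std ** 2) - math.log(std) - math.log(math.sqrt(2 * math.pi))

x, mean, std = 0.3, 0.1, 0.5
ours = gaussian_log_prob(x, mean, std)
reference = math.log(NormalDist(mu=mean, sigma=std).pdf(x))
```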

We need to get those log probs of the original model so our sampling function should return that. Let’s update our sampling function to do that:

@torch.no_grad()
def sd_sample(prompts, pipe, height, width, guidance_scale, num_inference_steps, eta, device):
    scheduler = pipe.scheduler
    unet = pipe.unet
    text_embeddings = pipe._encode_prompt(prompts,device, 1, do_classifier_free_guidance=guidance_scale > 1.0)

    scheduler.set_timesteps(num_inference_steps, device=device)
    latents = torch.randn((len(prompts), unet.in_channels, height//8, width//8)).to(device)

    all_step_preds, log_probs = [latents], []


    for i, t in enumerate(progress_bar(scheduler.timesteps)):
        input = torch.cat([latents] * 2)
        input = scheduler.scale_model_input(input, t)

        # predict the noise residual
        pred = unet(input, t, encoder_hidden_states=text_embeddings).sample

        # perform guidance
        pred_uncond, pred_text = pred.chunk(2)
        pred = pred_uncond + guidance_scale * (pred_text - pred_uncond)

        # compute the "previous" noisy sample mean and variance, and get log probs
        scheduler_output = scheduler.step(pred, t, latents, eta, variance_noise=0)
        t_1 = t - scheduler.config.num_train_timesteps // num_inference_steps
        variance = scheduler._get_variance(t, t_1)
        std_dev_t = eta * variance ** (0.5)
        prev_sample_mean = scheduler_output.prev_sample # this is the mean and not full sample since variance is 0
        prev_sample = prev_sample_mean + torch.randn_like(prev_sample_mean) * std_dev_t # get full sample by adding noise
        log_probs.append(calculate_log_probs(prev_sample, prev_sample_mean, std_dev_t).mean(dim=tuple(range(1, prev_sample_mean.ndim))))

        all_step_preds.append(prev_sample)
        latents = prev_sample
    
    return latents, torch.stack(all_step_preds), torch.stack(log_probs)

We can get everything we need for the loss function now (intermediate timesteps, log_probs, rewards):

per_prompt_stat_tracker = PerPromptStatTracker(buffer_size=32, min_count=16)
prompts = next(iter(train_dl))
pipe.text_encoder.to('cuda')
pipe.vae.to('cuda')
preds, all_step_preds, log_probs = sd_sample(prompts, pipe, 512, 512, 7.5, 50, 1, 'cuda')
imgs = decoding_fn(preds,pipe)
rewards = aesthetic_scoring(imgs, preprocess, clip_model, aesthetic_model_normalize, aesthetic_model)
advantages = torch.from_numpy(per_prompt_stat_tracker.update(np.array(prompts), rewards.squeeze().cpu().detach().numpy())).float().to('cuda')

Here’s a function to compute the loss function:

def compute_loss(x_t, original_log_probs, advantages, clip_advantages, clip_ratio, prompts, pipe, num_inference_steps, guidance_scale, eta, device):
    scheduler = pipe.scheduler
    unet = pipe.unet
    text_embeddings = pipe._encode_prompt(prompts,device, 1, do_classifier_free_guidance=guidance_scale > 1.0).detach()
    scheduler.set_timesteps(num_inference_steps, device=device)
    loss_value = 0.
    for i, t in enumerate(progress_bar(scheduler.timesteps)):
        clipped_advantages = torch.clip(advantages, -clip_advantages, clip_advantages).detach()
        
        input = torch.cat([x_t[i].detach()] * 2)
        input = scheduler.scale_model_input(input, t)

        # predict the noise residual
        pred = unet(input, t, encoder_hidden_states=text_embeddings).sample

        # perform guidance
        pred_uncond, pred_text = pred.chunk(2)
        pred = pred_uncond + guidance_scale * (pred_text - pred_uncond)

        # compute the "previous" noisy sample mean and variance, and get log probs
        scheduler_output = scheduler.step(pred, t, x_t[i].detach(), eta, variance_noise=0)
        t_1 = t - scheduler.config.num_train_timesteps // num_inference_steps
        variance = scheduler._get_variance(t, t_1)
        std_dev_t = eta * variance ** (0.5)
        prev_sample_mean = scheduler_output.prev_sample
        current_log_probs = calculate_log_probs(x_t[i+1].detach(), prev_sample_mean, std_dev_t).mean(dim=tuple(range(1, prev_sample_mean.ndim)))

        # calculate loss

        ratio = torch.exp(current_log_probs - original_log_probs[i].detach()) # this is the importance ratio of the new policy to the old policy
        unclipped_loss = -clipped_advantages * ratio # this is the surrogate loss
        clipped_loss = -clipped_advantages * torch.clip(ratio, 1. - clip_ratio, 1. + clip_ratio) # this is the surrogate loss, but with artificially clipped ratios
        loss = torch.max(unclipped_loss, clipped_loss).mean() # we take the max of the clipped and unclipped surrogate losses, and take the mean over the batch
        loss.backward() # perform backward here, gets accumulated for all the timesteps

        loss_value += loss.item()
    return loss_value

loss = compute_loss(all_step_preds, log_probs, advantages, 10, 1e-4, prompts, pipe, 50, 7.5, 1, 'cuda')
print(loss)

100.00% [50/50 00:29<00:00]

2.6168066263198853

Complete training loop

Now that we can calculate the loss function, we can construct the full training loop. For a single epoch we:

Sample from the diffusion model num_samples_per_epoch times, collecting the intermediate noisy images and log probs.
Pass the samples to the reward model and get the rewards, which we normalize to get the advantages.
For num_inner_epochs times, go over each sample, compute the loss, backpropagate, and update our diffusion model.

Let’s define all our hyperparameters. We will be training Stable Diffusion v1.4 on ImageNet animal prompts as defined earlier, using the LAION Aesthetic classifier as our reward model.

Note that if we set num_inner_epochs to a high value, this would be very data-efficient since we would be repeatedly using the previously generated trajectories, but we would probably significantly diverge from the original policy that we used to get those trajectories (or it would at least get clipped frequently in the loss). So we set num_inner_epochs=1. This is still pretty efficient using DDPO_IS, because with DDPO_SF you’d need to resample after every iteration (whenever the model is updated) instead of once per num_samples_per_epoch=128 samples as we do here.

num_samples_per_epoch = 128
num_epochs = 50
num_inner_epochs = 1
num_timesteps = 50
batch_size = 4
img_size = 512
lr = 5e-6
clip_advantages = 10.0
clip_ratio = 1e-4
cfg = 5.0

Okay let’s set everything up:

# group all reward function stuff
def reward_fn(imgs, device):
    clip_model.to(device)
    aesthetic_model.to(device)
    rewards = aesthetic_scoring(imgs, preprocess, clip_model, aesthetic_model_normalize, aesthetic_model)
    clip_model.to('cpu')
    aesthetic_model.to('cpu')
    return rewards

# a function to sample from the model and calculate rewards
def sample_and_calculate_rewards(prompts, pipe, image_size, cfg, num_timesteps, decoding_fn, reward_fn, device):
    preds, all_step_preds, log_probs = sd_sample(prompts, pipe, image_size, image_size, cfg, num_timesteps, 1, device)
    imgs = decoding_fn(preds,pipe)    
    rewards = reward_fn(imgs, device)
    return imgs, rewards, all_step_preds, log_probs

Here we create our dataset, which is just randomly chosen prompts:

train_set = PromptDataset(imagenet_animal_prompts, num_samples_per_epoch)
train_dl = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True, num_workers=0)
per_prompt_stat_tracker = PerPromptStatTracker(buffer_size=32, min_count=16)
sample_prompts = next(iter(train_dl)) # sample a batch of prompts to use for visualization

pipe.unet.enable_gradient_checkpointing() # memory optimization: recompute activations in the backward pass instead of storing them

optimizer = torch.optim.AdamW(pipe.unet.parameters(), lr=lr, weight_decay=1e-4) # optimizer

Now that we are set up, let’s start training! You can see the training loop is quite simple!

for epoch in master_bar(range(num_epochs)):
    all_step_preds, log_probs, advantages, all_prompts, all_rewards = [], [], [], [], []

    # sampling `num_samples_per_epoch` images and calculating rewards
    for i, prompts in enumerate(progress_bar(train_dl)):
        batch_imgs, rewards, batch_all_step_preds, batch_log_probs = sample_and_calculate_rewards(prompts, pipe, img_size, cfg, num_timesteps, decoding_fn, reward_fn, 'cuda')
        batch_advantages = torch.from_numpy(per_prompt_stat_tracker.update(np.array(prompts), rewards.squeeze().cpu().detach().numpy())).float().to('cuda')
        all_step_preds.append(batch_all_step_preds)
        log_probs.append(batch_log_probs)
        advantages.append(batch_advantages)
        all_prompts += prompts
        all_rewards.append(rewards)
    
    all_step_preds = torch.cat(all_step_preds, dim=1)
    log_probs = torch.cat(log_probs, dim=1)
    advantages = torch.cat(advantages)
    all_rewards = torch.cat(all_rewards)

    # inner loop
    for inner_epoch in progress_bar(range(num_inner_epochs)):
        # chunk them into batches
        all_step_preds_chunked = torch.chunk(all_step_preds, num_samples_per_epoch // batch_size, dim=1)
        log_probs_chunked = torch.chunk(log_probs, num_samples_per_epoch // batch_size, dim=1)
        advantages_chunked = torch.chunk(advantages, num_samples_per_epoch // batch_size, dim=0)
        
        # chunk the prompts (list of strings) into batches
        all_prompts_chunked = [all_prompts[i:i + batch_size] for i in range(0, len(all_prompts), batch_size)]
        
        for i in progress_bar(range(len(all_step_preds_chunked))):
            optimizer.zero_grad()

            loss = compute_loss(all_step_preds_chunked[i], log_probs_chunked[i], 
                                advantages_chunked[i], clip_advantages, clip_ratio, all_prompts_chunked[i], pipe, num_timesteps, cfg, 1, 'cuda'
                                ) # loss.backward happens inside
            
            torch.nn.utils.clip_grad_norm_(pipe.unet.parameters(), 1.0) # gradient clipping
            optimizer.step()

That’s it! Let’s see what results we get with this code.

Results

(These results were obtained with this script that included W&B tracking)

Here is the loss curve from training Stable Diffusion v1.4 with the LAION Aesthetic classifier reward model on ImageNet animal prompts:

As you can see, it’s quite noisy but clearly does decrease. That said, I’ve observed that typically the loss curve can be quite uninformative. Instead, as expected, the reward curve is a better indicator of the performance:

Here you can see a clear increase in average reward over training, which is what we want! So what do the samples look like? Let’s see:

I’d argue these images are definitely more aesthetic! It works! A few observations:

Sometimes the prompt isn’t being followed, as seen with the wolf spider example. This is because the reward model does not take the prompt into consideration and only looks at the image, so if generating something that is slightly unrelated to the prompt gives a better reward score then the model will do so. A reward model that explicitly takes in the prompt as well and ensures prompt alignment would be needed. One such model that could be used is PickScore.
Sometimes the RL-trained diffusion model generates pencil/charcoal drawings of the animals, which was observed in the original DDPO paper as well.
Another common pattern that the RL-trained diffusion model demonstrates is the generation of images with a narrow depth of field, which of course look more artistic and aesthetic.

DRLX - A library for performing RLHF training with diffusion models

In order to make RLHF for diffusion models easy-to-use and accessible, I have co-developed a library with Shahbuland Matiana at CarperAI called DRLX. It implements DDPO, complete with W&B experiment tracking, distributed GPU training support, and other features. We also will be implementing other RL algorithms and adding more features in the coming weeks. Here I provide a quick overview, but check out our blog post for more information!

Let’s see how to do the same DDPO training of Stable Diffusion 1.4 with LAION aesthetic classifier on ImageNet prompts. First we’ll do our imports:

from drlx.trainer.ddpo_trainer import DDPOTrainer
from drlx.configs import DRLXConfig
from drlx.reward_modelling.aesthetics import Aesthetics
from drlx.pipeline.imagenet_animal_prompts import ImagenetAnimalPrompts

DRLX has a simple-to-use config system, where the model information and hyperparameters are described in a YAML file. See the config file for this example here.

config = DRLXConfig.load_yaml("configs/ddpo_sd_imagenet.yml")

DRLX has PromptPipeline to implement the prompts we pass into the diffusion model. We already have a prompt pipeline for ImageNet animal prompts implemented in the library:

pipe = ImagenetAnimalPrompts(prefix='', postfix='', num=config.train.num_samples_per_epoch)

All we have to do is instantiate our DDPOTrainer and call train(), passing in our reward model. We have a RewardModel class that can be subclassed to implement the desired reward function, and the LAION aesthetic classifier is already provided as the Aesthetics class:

trainer = DDPOTrainer(config)
trainer.train(pipe, Aesthetics())

It’s that simple!

Conclusion

Note that much of what we discuss here for how RL is applied to diffusion models also applies to language models. Specifically, the autoregressive generation of text from a language model can be viewed as a trajectory from an MDP. The state is the previous tokens, the action is the next token to be predicted, and the policy is of course the language model. The PPO algorithm that we described is the most common RL algorithm for RLHF training of language models. Overall, this hints at a deeper connection between diffusion models and autoregressive language models and how ideas can be transferred from one domain to another. For example, recently it was demonstrated how classifier-free guidance could be applied to language models. There may continue to be interesting ideas to apply from diffusion models to language models or vice versa.
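To make the MDP view of text generation concrete, here is a minimal sketch in plain Python. The toy "language model" (`TOY_POLICY`), its vocabulary, and the function names are all hypothetical and only stand in for a real model; the point is just to show the state, action, and log-probability recorded at each step of a trajectory.

```python
import math
import random

# Hypothetical toy "language model": next-token probabilities that depend
# only on the last token. A real policy would condition on the full prefix.
TOY_POLICY = {
    "<bos>": {"the": 0.7, "a": 0.3},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}


def policy(state):
    """Return the next-token distribution given the state (tokens so far)."""
    return TOY_POLICY[state[-1]]


def rollout(rng, max_steps=10):
    """Sample one trajectory as a list of (state, action, log_prob) steps."""
    state = ("<bos>",)
    trajectory = []
    for _ in range(max_steps):
        probs = policy(state)
        tokens = list(probs)
        action = rng.choices(tokens, weights=[probs[t] for t in tokens])[0]
        trajectory.append((state, action, math.log(probs[action])))
        state = state + (action,)  # deterministic transition: append the token
        if action == "<eos>":
            break
    return trajectory


trajectory = rollout(random.Random(0))
for state, action, log_prob in trajectory:
    print(state, "->", action, round(log_prob, 3))
```

The per-step log-probabilities are exactly what a policy gradient method like PPO needs, whether the "step" is a denoising step of a diffusion model or the next token of a language model.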

This paper is only the start of applying RL to diffusion models. DDPO_IS is only a baseline, and there are changes that can easily be made to improve the performance, such as value function baselines, reward discounting, KL regularization, etc. Additionally, alternative RLHF algorithms like direct preference optimization could be explored. We plan to explore these directions further via DRLX.
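As a rough illustration of two of these ideas, here is a small sketch of reward discounting and baseline subtraction in plain Python. It is not taken from the DDPO paper or from DRLX; the function names, the discount factor, and the constant baseline are all assumptions chosen for readability (a real value function baseline would be learned).

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute the discounted return-to-go for every step of one trajectory."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, baselines):
    """Subtract a per-step baseline (e.g. a value estimate) from the returns."""
    return [g - b for g, b in zip(returns, baselines)]


# A trajectory where the reward only arrives at the final step, as is the
# case when a reward model scores the final generated image.
rewards = [0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)

# Hypothetical constant baseline, used here only to show the subtraction.
print(advantages(returns, [0.25] * len(returns)))
```

With discounting, earlier steps receive less credit for the final reward, and subtracting a baseline reduces the variance of the policy gradient estimate without changing its expectation.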

Finally, I just want to say this was an interesting learning experience implementing my very first RL algorithm (and building a library based on that too). It took me a lot of reading of RL course material, blog posts, etc. looking at code implementations, wrestling with very subtle bugs, etc. but after about 2 weeks I managed to get it working. As many people in the RL field have experienced, it didn’t take much time for me to form a love-hate relationship with RL 😂 but I still think it’s a very interesting field and I’m excited to explore it further.

Acknowledgements

Thank you to Costa Huang, who helped me debug my code and provided feedback on the blog post. Thank you to Jonathan Whitaker and inox/hayley for providing feedback on my blog post. Thank you to Shahbuland Matiana, with whom I worked closely on DRLX.

Gradio + HuggingFace Spaces: A Tutorial

Tanishq Mathew Abraham, Ph.D. — Tue, 16 Nov 2021 08:00:00 GMT

Introduction

After you train a machine learning model, the next thing to do is showcase it to the world by making a demo. Currently, the easiest way to do so is with Gradio, hosting on HuggingFace Spaces. With the Gradio framework deployed on Spaces, it takes <10 minutes to deploy a model! Let’s see how we can easily deploy a model for the world to try out with these platforms. We will use a classic CNN pet classifier as an example.

Preliminaries: Training a pet classifier

Before we make a demo, we need to have a model to actually demo! Let’s quickly train a simple ResNet50 pet classifier on the Oxford Pets dataset using fastai.

from fastai.vision.all import *
path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_re(path, get_image_files(path/'images'), pat=r'(.+)_\d+.jpg', item_tfms=Resize(460), batch_tfms=aug_transforms(size=224, min_scale=0.75))
learn = vision_learner(dls, models.resnet50, metrics=accuracy)
learn.fine_tune(1)
learn.path = Path('.')
learn.export()

epoch	train_loss	valid_loss	accuracy	time
0	0.973277	0.309940	0.905954	00:32

epoch	train_loss	valid_loss	accuracy	time
0	0.420781	0.260167	0.910690	00:34

And with fastai, it’s that simple! Learn more about fastai, a simple and flexible PyTorch training framework, over here.

Using Gradio

Let’s see how to make a demo web app with Gradio. First let’s load our model:

learn = load_learner('export.pkl')

Next, let’s define a prediction function for our model:

labels = learn.dls.vocab
def predict(img):
    img = PILImage.create(img)
    pred,pred_idx,probs = learn.predict(img)
    return {labels[i]: float(probs[i]) for i in range(len(labels))}

Finally, let’s import Gradio and use its functionality to make an interface and launch it. Note that if you are doing this from a notebook, the Gradio demo will also show up within the notebook for you to try interactively (here I just show screenshots).

import gradio as gr
gr.Interface(fn=predict, inputs=gr.inputs.Image(shape=(512, 512)), outputs=gr.outputs.Label(num_top_classes=3)).launch(share=True)

Running on local URL:  http://127.0.0.1:7860/
Running on public URL: https://10290.gradio.app

This share link will expire in 72 hours. To get longer links, send an email to: support@gradio.app

(, 'http://127.0.0.1:7860/', 'https://10290.gradio.app')

That’s it! The actual creation of the demo takes one line!¹

All Gradio interfaces are created by constructing a gradio.Interface() object. As you can see in this example, the Interface object takes in the function that we want to make an interface for (usually an ML model inference function), Gradio input components (the number of input components should match the number of parameters of the provided function), and Gradio output components (the number of output components should match the number of values returned by the provided function). Gradio provides components for various types of input and output types. This includes: images (upload, draw, or webcam), video, audio (upload or microphone), textboxes, dataframes, timeseries, generic files, and more! So you should be able to create a Gradio demo for virtually any type of ML task you can think of!

After the gradio.Interface() object is defined, the interface is launched with the launch method.
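To clarify what our prediction function hands to the label output, here is a minimal sketch in plain Python. The function returns a dictionary mapping every label to a probability, and a label component limited to three classes only needs the three highest entries. The `top_classes` helper and the example probabilities are hypothetical and are not Gradio's implementation.

```python
def top_classes(confidences, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical output of a predict() function for a four-class model
example = {"Ragdoll": 0.80, "Birman": 0.15, "Siamese": 0.04, "Sphynx": 0.01}
print(top_classes(example))
```

Returning the full dictionary from the prediction function keeps it independent of how many classes the interface chooses to display.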

Optional: customizing our Gradio app

Gradio has lots of features that we can use to customize our app. Let’s go over a few of these features and add them to our demo. All of these features are arguments for the instantiation of the Interface class.

First of all, we can pass in a title and description for our app which goes at the top before our input and output components:

title = "Pet Breed Classifier"
description = "A pet breed classifier trained on the Oxford Pets dataset with fastai. Created as a demo for Gradio and HuggingFace Spaces."

We can also put a link at the bottom of our demo. Here I will link to this blog post:

article="Blog post"

We can also provide some example inputs that people can try out. Here I have provided an example Siamese cat image, which is in the same directory as my code:

examples = ['siamese.jpg']

Another interesting feature that Gradio has is the ability for interpretation so that users can understand what parts of the input are responsible for the output. We’ll use the default interpretation function provided by Gradio but you can use your own as well:

interpretation='default'

Note that the default interpretation function needs scikit-image to be installed. More information on the interpretation feature is provided here.

Gradio also provides a screenshotting feature that can make it really easy to share your examples and results with others. It is enabled by default.

Finally, Gradio also supports serving of inference requests with a queue. This can be helpful when your app receives a significant amount of traffic. We’ll enable a queue here:

enable_queue=True

You can also add custom CSS for your Gradio app but we’ll not do that here (my CSS skills are essentially non-existent! 😂). Additionally, you can set live=True so that the app automatically submits whenever you change the input, but this removes the Submit button, so I won’t use it for now.

Let’s put it all together and make our interface with these additional features:

gr.Interface(fn=predict,inputs=gr.inputs.Image(shape=(512, 512)),outputs=gr.outputs.Label(num_top_classes=3),title=title,description=description,article=article,examples=examples,interpretation=interpretation,enable_queue=enable_queue).launch(share=True)

Running on local URL:  http://127.0.0.1:7861/
Running on public URL: https://30513.gradio.app

This share link will expire in 72 hours. To get longer links, send an email to: support@gradio.app

(,
 'http://127.0.0.1:7861/',
 'https://30513.gradio.app')

Check the Gradio documentation for more information on how to customize your interface.

Let’s put it all into one file which we name app.py:

import gradio as gr
from fastai.vision.all import *
import skimage

learn = load_learner('export.pkl')

labels = learn.dls.vocab
def predict(img):
    img = PILImage.create(img)
    pred,pred_idx,probs = learn.predict(img)
    return {labels[i]: float(probs[i]) for i in range(len(labels))}

title = "Pet Breed Classifier"
description = "A pet breed classifier trained on the Oxford Pets dataset with fastai. Created as a demo for Gradio and HuggingFace Spaces."
article="Blog post"
examples = ['siamese.jpg']
interpretation='default'
enable_queue=True

gr.Interface(fn=predict,inputs=gr.inputs.Image(shape=(512, 512)),outputs=gr.outputs.Label(num_top_classes=3),title=title,description=description,article=article,examples=examples,interpretation=interpretation,enable_queue=enable_queue).launch()

Let’s also make a requirements.txt file which will allow us to install the packages that we need in whatever environment we need:

fastai
scikit-image

Now that we have our self-contained web app, we could deploy this on any webserver or cloud platform that we want. But let’s see how we can use HuggingFace Spaces to deploy it.

Using HuggingFace Spaces

HuggingFace Spaces is a free-to-use platform for hosting machine learning demos and apps. The Spaces environment provided is a CPU environment with 16 GB RAM and 8 cores. It currently supports the Gradio and Streamlit platforms. Here we will make a Space for our Gradio demo.

In order to be able to create a HuggingFace Space, you need to have a HuggingFace account. You can sign up for free here. After signing up, you can create a Space by clicking “New Space” on the navigation menu (press on your profile image).

Now you will be shown instructions on how to add your code to this Space from the command line to prepare the demo. Spaces are essentially git repositories (like GitHub) with an app.py file from which the demo is prepared.

So we can clone the repository to a local directory,

git clone https://huggingface.co/spaces/tmabraham/fastai_pet_classifier

add the app.py, requirements.txt, export.pkl, and siamese.jpg files,

cp app.py fastai_pet_classifier/app.py
cp requirements.txt fastai_pet_classifier/requirements.txt
cp export.pkl fastai_pet_classifier/export.pkl
cp siamese.jpg fastai_pet_classifier/siamese.jpg

Now before we commit our files, there is something we need to pay attention to. Our model file export.pkl is too big to be handled by git. So instead we need to use git-lfs which you first need to install. If you are on Debian or Ubuntu, you can directly use apt-get install git-lfs (which installs an older version but that’s not really an issue). For other Linux distros, you can use this script which Jeremy Howard has prepared. For Windows, you can download and run the installer from here. For MacOS, you can do brew install git-lfs.

Once you have installed git-lfs, you can then initialize git-lfs in the repository for the app in the following way:

cd fastai_pet_classifier
git lfs install
git lfs track "*.pkl"
git add .gitattributes
git commit -m "update .gitattributes so git lfs will track .pkl files"
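If you are not sure which files in your Space should go through git-lfs, a quick size check can help. Below is a minimal sketch using only the standard library. The 10 MB threshold is an assumption on my part (check the current HuggingFace documentation for the actual limit), and the `files_needing_lfs` helper is hypothetical.

```python
from pathlib import Path

# Assumed threshold: files above roughly 10 MB are treated as "too big" here.
SIZE_LIMIT_BYTES = 10 * 1024 * 1024


def files_needing_lfs(directory, limit=SIZE_LIMIT_BYTES):
    """Return files under `directory` (skipping .git) larger than `limit` bytes."""
    root = Path(directory)
    large = []
    for path in sorted(root.rglob("*")):
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file() and path.stat().st_size > limit:
            large.append(path)
    return large
```

Calling `files_needing_lfs("fastai_pet_classifier")` from the parent directory would list the files to pass to `git lfs track`, which in this example should just be export.pkl.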

Now, we can commit and push the changes to the Space.

git add app.py requirements.txt export.pkl siamese.jpg
git commit -m "let's deploy to huggingface spaces"
git push

Alternatively, the files can be uploaded via the Spaces UI. When you go to your Space, under “Files and versions”, there is an “Add files” button which you can use to upload your app files.

After a few moments, during which the app is being built, our demo should show up on the HuggingFace Space.

That’s it! In a few minutes, you trained a pet classifier model with fastai, made a demo interface with Gradio, and hosted it for free on a HuggingFace Space! You can try it out right below or you can try it out on HuggingFace Spaces here. All the files described in this post are located here.²

If you are a more advanced user with expertise in web development, you might be interested to know that there is an API available for any Gradio interface (there is a “view the api” link at the bottom of the interface). For example, here is a link to the API docs for my interface. This provides much more flexibility, like interacting with your model very easily in code. For example, here I can take any image URL and get a pet breed prediction with my model.

import json
import requests
import gradio as gr
from IPython.display import Image, display
image_url = 'https://petkeen.com/wp-content/uploads/2021/05/grey-cat.jpeg'
data = gr.processing_utils.encode_url_or_file_to_base64(image_url)
r = requests.post(url='https://hf.space/embed/tmabraham/fastai_pet_classifier/+/api/predict/', json={"data":[data]})


print(f"The breed of this pet is a {(' '.join(r.json()['data'][0]['label'].split('_')))}:")
display(Image(url=image_url, width=475))
print('Original JSON returned from the request: ', json.dumps(r.json(), indent=2))

The breed of this pet is a British Shorthair:

Original JSON returned from the request:  {
  "data": [
    {
      "label": "British_Shorthair",
      "confidences": [
        {
          "label": "British_Shorthair",
          "confidence": 0.9997965693473816
        },
        {
          "label": "Russian_Blue",
          "confidence": 0.00019805884221568704
        },
        {
          "label": "Sphynx",
          "confidence": 2.037774265772896e-06
        }
      ]
    }
  ],
  "flag_index": null,
  "updated_state": null,
  "durations": [
    0.09037947654724121
  ],
  "avg_durations": [
    0.13969146820806688
  ]
}
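When calling the API from your own code, you will usually want to pull out just the top prediction rather than print the whole response. Here is a minimal sketch that works on an abbreviated copy of the JSON shown above. The `summarize_prediction` helper and its confidence threshold are hypothetical additions, not part of Gradio.

```python
import json

# Abbreviated copy of the JSON returned above
response_text = """
{
  "data": [
    {
      "label": "British_Shorthair",
      "confidences": [
        {"label": "British_Shorthair", "confidence": 0.9997965693473816},
        {"label": "Russian_Blue", "confidence": 0.00019805884221568704},
        {"label": "Sphynx", "confidence": 2.037774265772896e-06}
      ]
    }
  ]
}
"""


def summarize_prediction(payload, threshold=0.5):
    """Return (readable label, confidence), or None if the top score is too low."""
    prediction = payload["data"][0]
    top = max(prediction["confidences"], key=lambda c: c["confidence"])
    if top["confidence"] < threshold:
        return None
    return top["label"].replace("_", " "), top["confidence"]


result = summarize_prediction(json.loads(response_text))
print(result)
```

Returning None below a threshold gives the calling website a simple way to show "not sure" instead of a low-confidence breed.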

Some examples of using the API in custom websites are provided here (put together by Jeremy Howard and members of the fast.ai community).

For more information on Gradio and HuggingFace Spaces, check the relevant docs and forums:

There are so many features of Gradio and Spaces that I haven’t mentioned here (like multiple models per demo, the Blocks feature, etc.). Additionally, both Gradio and HuggingFace Spaces are in active development and new, amazing features are always being added by the Gradio and HuggingFace teams! For this reason, I also recommend following HuggingFace and Gradio on Twitter to hear about the latest updates and newest features.

I’ll end by sharing a quick example prediction by my pet classifier of our new kitten! Her name is Mimi and, as predicted by my classifier here, she is indeed a Ragdoll kitten!:

Acknowledgements

Thanks to Zach Mueller, Ahsen Khaliq, Abhishek Thakur, and Jeremy Howard for reviewing my blog post.

Footnotes

¹ One of the developers of Gradio created a simple Python module to easily create Gradio demos for fastai Learner objects. Check it out here. It currently only supports image-to-label interfaces but it could likely be expanded to other tasks fairly easily.
² Recently, HuggingFace added direct support for pushing and loading fastai models to the HuggingFace Hub with the push_to_hub_fastai and from_pretrained_fastai functions, respectively. This can make creating Spaces much easier, since you can just load it in the Space and not have to add it to the repository with git-lfs. See an example of this over here.

Coding with GitHub Copilot

Tanishq Mathew Abraham, Ph.D. — Wed, 14 Jul 2021 07:00:00 GMT

On July 1st, I was able to obtain access to GitHub Copilot, thanks to Hamel Husain. I wanted to share my experience and discoveries about this new tool. Many of the findings were demonstrated with the help of Mazen Alotaibi, Ryan Panwar, and Mark Saroufim.

What is GitHub Copilot?

GitHub Copilot is a tool that helps you to code faster

If you haven’t logged onto Twitter or Hacker News in the last couple weeks, you might not know about GitHub Copilot. Developed out of a partnership between OpenAI and Microsoft (GitHub’s parent company), it’s an AI-based autocomplete tool that helps you to write code faster. The GitHub team has termed it “your AI pair programmer”. OpenAI CTO Greg Brockman has explained that it utilizes the currently-unreleased Codex model, which is apparently a successor to the (in)famous GPT-3 language model. It has been trained on billions of lines of code available on GitHub ¹.

Based on the demos that GitHub Copilot provided and favorable reviews from beta-testers, I was eager to give it a try, but I was also skeptical if it really was as life-changing as people claimed it was. To my surprise, it was much better than I expected.

Here is a demo of GitHub Copilot in action (specifically for an ML-related task):

It’s clear that GitHub Copilot understands the general PyTorch training workflow, as well as intricacies like choosing appropriate augmentations for images (resizing, random crop, normalization, etc.), putting the model into evaluation mode and using torch.no_grad() during validation, etc. These are things that we may sometimes forget to do, so it’s great that GitHub Copilot can help prevent us from making these common mistakes.

GitHub Copilot performs best when you provide it with comments describing what you are trying to do. It then uses the comments to generate a list of possible completions. This is highlighted in the example above, where I wrote a few lines about what I wanted to do (fine-tuning a pretrained ResNet50 on a custom dataset) and how I wanted to do that, and it mostly completed the rest of the code for me. I think this is great, because it changes the way we code. It now drives code development to focus on documentation, since writing good documentation often results in better Copilot suggestions.

On a related note, some have hypothesized that GitHub Copilot might also lead to more test-driven development:

I also want to point out that while most demos directly use GitHub Copilot in the editor, it’s also possible to open GitHub Copilot in a separate tab and have it generate and present multiple suggestions for you. Here’s an example:

I quite like this feature, because it provides various approaches for solving a particular task, and I can select which approach I want to use. For instance, in the above example, it shows various approaches for defining a ResNet50 model for fine-tuning. I typically prefer defining a class for the ResNet50, so I select that option.

There is another unintended consequence of GitHub Copilot that I find interesting. GitHub Copilot actually makes a pretty good autocomplete tool for regular writing. I actually discovered this when I started writing this blog post in a Markdown file in the VS Code editor. Of course, this is likely GitHub Copilot learning from README files and other documentation in various repositories, and there could be some residual general knowledge from the underlying GPT-3 model (if that is indeed the base model used) ². But I would genuinely consider writing more in Markdown files with VS Code + GitHub Copilot because some of the autocomplete suggestions are actually quite helpful.

Challenges with GitHub Copilot

There are several challenges that I think could preclude widespread use of GitHub Copilot:

Leaking of personal information
Limited multi-lingual capabilities
Copyright/licensing issues
Usage of outdated APIs

Let’s dive into each of these issues further.

Personal information shared by GitHub Copilot

One aspect we discovered was that GitHub Copilot would inadvertently share information that would be considered personal, such as people’s names, phone numbers, emails, etc. This was something Mazen and I explored further. Here are a few examples of this.

In a Python file, simply asking it to create a function to list author names does indeed give the name of a person who exists:

Mark demonstrated an example here in which an actual person’s name was suggested as an autocompletion while writing a bash script.

Interestingly, this method did not work for returning other types of information like phone numbers:

But if we just ask GitHub Copilot to autocomplete a phone number in a comment at the beginning of a Python file, it does work:

Mazen looked more into this number, and found out it was used in several GitHub repositories, including a programming example problem here.

Mark also discovered that working API keys were provided by GitHub Copilot:

Interestingly, from my experiments, I was not able to get GitHub Copilot to leak any e-mail addresses.

On their website, GitHub Copilot has the following information:

So this confirms that private information was indeed present in the training set, which is what allows GitHub Copilot to leak it. I was unable to easily get email addresses because of the rudimentary filtering that GitHub Copilot performs.
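I don’t know how GitHub’s filtering is actually implemented, but to give a sense of what a rudimentary filter could look like, here is a hypothetical sketch that redacts anything resembling an email address from a suggestion. The pattern, the placeholder, and the `redact_emails` helper are all my own assumptions.

```python
import re

# Deliberately simple pattern; real-world email matching has many edge cases.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(suggestion, placeholder="<email>"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, suggestion)


print(redact_emails("# Contact: jane.doe@example.com"))
print(redact_emails("# Author: Jane Doe, phone: 555-0100"))
```

A filter like this only catches strings with an obvious email shape, which would be consistent with names and phone numbers still slipping through.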

Multi-lingual capabilities of GitHub Copilot

As we mentioned before, GitHub Copilot performs best when you provide it with comments explaining your intent. Therefore, Mazen and I wanted to explore how well GitHub Copilot performs with comments in various languages. I used Google Translate to translate my English comments into various languages and observed how well Copilot performed. Let’s go over an example. Below, I give GitHub Copilot the prompt to “Add two numbers” and see what Python code it suggests:

English

Mandarin

Spanish

Arabic

Of course, if you comment in English, GitHub Copilot provides a good suggestion. It gives us an adding function as well as some use-cases. But as demonstrated in these experiments, the quality of GitHub Copilot suggestions when given comments in other languages is likely correlated with the overall frequency of these languages in the training data. It’s likely that Mandarin and Spanish are more common than Arabic in the training set, so GitHub Copilot performs better with Mandarin and Spanish comments. Of course, this is a single example (although I observed similar results with other prompts). However, given that it’s well-established that biases in the training data are reflected in the output of any ML algorithm (unless they are appropriately counteracted), I think it is safe to assume that GitHub Copilot will likely be less useful for non-English-speaking users.

Copyright/licensing issues

Let’s move on to the elephant in the room: copyright/licensing issues. GitHub Copilot/Codex was trained on all public GitHub code, regardless of license (confirmed by GitHub). While some argue that training on copyrighted code is not an issue, it becomes much more challenging to argue that when Copilot is regurgitating public code verbatim ³. According to GitHub, Copilot repeats code snippets verbatim about 0.1% of the time. They have also provided a more in-depth study here. Thankfully, they are currently developing an origin tracker that tells you where the verbatim code is coming from and lets you decide whether to include proper attribution or not use that code altogether.

In my opinion, because of these copyright issues, GitHub Copilot in its current state is not usable for commercial purposes. I think that once the origin tracker is released, copyright issues will be resolved, although it puts the onus on the user to make sure that code is properly attributed. Of course, the easier solution would have been to avoid training on copyrighted and GPL-licensed code altogether, which would have likely prevented the significant controversy that arose, and I wonder what led to the decision to train on all public GitHub code instead of further curating the dataset.

Usage of outdated APIs

As an ML researcher and developer, I am typically working with the latest ML frameworks and tools. However, GitHub Copilot is trained on older codebases and does not have knowledge of these cutting-edge tools and is often unable to provide relevant suggestions.

I first discovered this issue when trying to write fastai-related code and get GitHub Copilot to provide relevant suggestions. However, since the latest version of fastai was only released in August 2020, GitHub Copilot was not able to provide any relevant suggestions and instead provided code for using older versions of fastai. This indicates that the codebases that GitHub Copilot is trained on must date from before August 2020, if not earlier. Similarly, I discovered that GitHub Copilot was unable to provide any suggestions regarding the usage of the timm library, which is one of the leading deep learning+computer vision libraries.

Here is a video that demonstrates this issue:

To me, this is a major concern regarding the current usability of GitHub Copilot ⁴. If we are using cutting edge tools like PyTorch XLA, JAX, fastai, timm, GitHub Copilot has no knowledge of this and cannot provide useful suggestions. Somehow, the GitHub team needs to keep Copilot updated on newer codebases. Given that telemetry of GitHub Copilot usage is being sent to GitHub, it’s possible that the GitHub team can further train their model on the usage of these newer codebases. Indeed, it is mentioned in the documentation that the telemetry data is used for “improving the underlying code generation models, e.g. by providing positive and negative examples (but always so that your private code is not used as input to suggest code for other users of GitHub Copilot)”. Additionally, a GitHub Developer Advocate has mentioned that “the model is being trained everyday, so the more people use it, Copilot will learn that these suggestions need to be updated”.

I wonder if the GitHub team might also develop a way of fine-tuning GitHub Copilot to specific use-cases. For example, there may be specific GitHub Copilot models for fastai, JAX, etc. They would be fine-tuned on the source code of these libraries and on codebases that use these libraries. But making sure that the tool does not provide outdated suggestions would still be a challenge. I don’t think it would be possible to provide suggestions for a brand-new library that does not have enough codebases using it to train on. Additionally, for situations like fastai where there are older APIs and newer APIs, when fine-tuning a model, the codebases using the older APIs would have to be filtered out.

All in all, I personally think that for practical applications, it is necessary for GitHub Copilot to provide suggestions for new codebases, and doing so might be a difficult but potentially solvable challenge.

How might GitHub Copilot be commercialized?

While it is currently available for free to the beta-testers, the GitHub team has already mentioned they plan to commercialize this product. There are several ways that GitHub Copilot could be commercialized:

A monthly fee for personal use of a generic GitHub Copilot model
Enterprises paying for a model fine-tuned to their specific, private codebases
Separate fees for domain-specific models (ex: a GitHub Copilot model for writing machine learning code, or a GitHub Copilot model for web development)

Conclusion

In conclusion, GitHub Copilot is a mind-blowing and extremely powerful tool. Additionally, it is a very interesting and practical application of AI. In the domains with which it is most familiar, GitHub Copilot works exceptionally well and can write most of the code for you! It may very well change the approach and workflow many programmers have and lead to documentation-driven and test-driven development.

But it’s not yet ready for prime time. There are clear issues with leaking of personal information, copyright/licensing, accessibility to foreign-language users, and its use on more cutting-edge projects. Thankfully, the GitHub team is working on these issues and I’m excited by the future of AI-augmented programming!

Acknowledgements

Thank you to Hamel Husain for helping to provide access to the GitHub Copilot tool and also for reviewing the blog post.

Thank you to Mazen Alotaibi, Ryan Panwar, and Mark Saroufim for sharing their ideas to try with GitHub Copilot and also for reviewing the blog post.

Footnotes

¹ The OpenAI team has recently released a paper on the Codex model that was trained on Python code, and it is noted that the GitHub Copilot model is a descendant of the one reported in the paper. Importantly, this paper indicates that the Codex model is a fine-tuned GPT-3 model. It is likely that the GitHub Copilot version is also a GPT-3 model that is instead fine-tuned on the whole GitHub dataset.
² This video demonstrates an example of some of the more general knowledge GitHub Copilot seems to have.
³ Yannic Kilcher provides a nice explanation of the potential copyright/GPL licensing issues over here.
⁴ A related issue that many people, including myself, have observed is that sometimes recommendations are for older versions of a programming language, such as providing Python 2.7 suggestions instead of Python 3.

Introducing Noisy Imagenette

Tanishq Mathew Abraham, Ph.D. — Tue, 02 Mar 2021 08:00:00 GMT

TL;DR: We introduce a dataset, Noisy Imagenette, which is a version of the Imagenette dataset with noisy labels. We hope this dataset is useful for rapid experimentation and testing of methods to address noisy label training.

Introduction

Datasets have noisy labels!

Deep learning has led to impressive results on datasets of all types, but its success often shines when models are trained with large datasets with human-annotated labels (extreme example: GPT-3 and more recently CLIP/ALIGN/DALL-E). A major challenge when constructing these datasets is obtaining enough labels to train a neural network model. There is an inherent tradeoff between the quality of the annotations and the cost of annotation (in the form of time or money). For example, while sources like Amazon Mechanical Turk provide cheap labeling, these non-expert labeling services will often produce unreliable labels. These are what are referred to as noisy labels, as these unreliable labels are not necessarily ground truth. Unfortunately, neural networks are known to be susceptible to overfitting to noisy labels (see here), which means alternative approaches are needed to achieve good generalization in the presence of noisy labels.

Prior research on noisy labels

Recently, many techniques have been presented in order to address label noise. These include novel loss functions like Bi-Tempered Logistic Loss, Taylor Cross Entropy Loss, or Symmetric Cross Entropy. Additionally, there are many novel training techniques that have been recently developed like MentorMix, DivideMix, Early-Learning Regularization and Noise-Robust Contrastive Learning.

Most of these papers are using MNIST, SVHN, CIFAR10 or related datasets with synthetically-added noise. Other common datasets are the WebVision and Clothing1M datasets, which are real-world noisy, large-scale datasets with millions of images. Therefore there is an opportunity to develop a mid-scale dataset that allows for rapid prototyping but is complex enough to provide useful results when it comes to noisy label training.

fastai’s Imagenette - a dataset for rapid prototyping

The idea of mid-scale datasets for rapid prototyping has been explored in the past. For example, in 2019, fast.ai released the Imagenette and Imagewoof datasets (subsequently updated in 2020), subsets of ImageNet for rapid experimentation and prototyping. Imagenette can serve as a small dataset proxy for ImageNet, or as a dataset with more complexity than MNIST or CIFAR10 but still small and simple enough for benchmarking and rapid experimentation. This dataset has been used to test and establish new training techniques like the Mish activation function and the Ranger optimizer (see here). The dataset has also been used in various papers (see here, here, here, here, here, and here). Clearly, this dataset has been quite useful to machine learning researchers and practitioners for testing and comparing new methods. We believe that an analogous dataset could be useful to researchers with modest compute for testing and comparing new methods for addressing label noise.

Introducing Noisy Imagenette

We introduce Noisy Imagenette, a version of Imagenette (and Imagewoof) that has synthetically noisy labels at different levels: 1%, 5%, 25%, and 50% incorrect labels. The Noisy Imagenette dataset already comes with the Imagenette dataset:

from fastai.vision.all import *
source = untar_data(URLs.IMAGENETTE)

While the regular labels for the Imagenette dataset are given by the names of the image folders, the noisy labels are provided in a separate CSV file, with one column for the image filename and one column of labels for each noise level:

csv_file = pd.read_csv(source/'noisy_imagenette.csv')
csv_file.head()

	path	noisy_labels_0	noisy_labels_1	noisy_labels_5	noisy_labels_25	noisy_labels_50	is_valid
0	train/n02979186/n02979186_9036.JPEG	n02979186	n02979186	n02979186	n02979186	n02979186	False
1	train/n02979186/n02979186_11957.JPEG	n02979186	n02979186	n02979186	n02979186	n03000684	False
2	train/n02979186/n02979186_9715.JPEG	n02979186	n02979186	n02979186	n03417042	n03000684	False
3	train/n02979186/n02979186_21736.JPEG	n02979186	n02979186	n02979186	n02979186	n03417042	False
4	train/n02979186/ILSVRC2012_val_00046953.JPEG	n02979186	n02979186	n02979186	n02979186	n03394916	False
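To sanity-check a noise column, one option is to compare it against the `noisy_labels_0` column, which in the rows above matches the folder name and so appears to hold the clean labels. The sketch below is illustrative only: `noise_rate` is a hypothetical helper, and the five rows are copied from the output above rather than loaded from the real CSV.

```python
COLUMNS = ["path", "noisy_labels_0", "noisy_labels_1", "noisy_labels_5",
           "noisy_labels_25", "noisy_labels_50", "is_valid"]

# Rows copied from the csv_file.head() output above (illustration only).
C = "n02979186"
RAW = [
    ("train/n02979186/n02979186_9036.JPEG", C, C, C, C, C, False),
    ("train/n02979186/n02979186_11957.JPEG", C, C, C, C, "n03000684", False),
    ("train/n02979186/n02979186_9715.JPEG", C, C, C, "n03417042", "n03000684", False),
    ("train/n02979186/n02979186_21736.JPEG", C, C, C, C, "n03417042", False),
    ("train/n02979186/ILSVRC2012_val_00046953.JPEG", C, C, C, C, "n03394916", False),
]
SAMPLE_ROWS = [dict(zip(COLUMNS, row)) for row in RAW]


def noise_rate(rows, noisy_col, clean_col="noisy_labels_0"):
    """Fraction of training rows whose label in noisy_col differs from clean_col.

    Assumes clean_col holds the uncorrupted labels (an assumption based on
    the column name and the sample rows, not on the dataset documentation).
    """
    train = [r for r in rows if not r["is_valid"]]
    if not train:
        raise ValueError("no training rows to measure")
    flipped = sum(r[noisy_col] != r[clean_col] for r in train)
    return flipped / len(train)


rates = {col: noise_rate(SAMPLE_ROWS, col) for col in COLUMNS[2:6]}
print(rates)
```

On the full dataset, the same function should accept `csv_file.to_dict('records')`. Five rows are far too few for the measured rates to resemble the nominal noise levels, so treat the sample output as a demonstration of the mechanics only.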

The code that generates these noisy labels is provided in this Jupyter notebook. We have also updated fastai’s train_imagenette.py to use the new noisy labels. To train on the Noisy Imagenette dataset with this script, simply pass the --pct-noise argument with the desired noise level.
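For intuition about what synthetic label noise means, here is a minimal sketch of one common scheme: symmetric noise, where a fixed fraction of labels is replaced by a different class chosen uniformly at random. This is an assumption for illustration and not necessarily how the notebook mentioned above generates the labels; `corrupt_labels` and its parameters are hypothetical names.

```python
import random


def corrupt_labels(labels, classes, pct_noise, seed=0):
    """Return a copy of labels with pct_noise percent of entries made incorrect.

    Hypothetical helper: a sketch of symmetric label noise, not the
    dataset's actual generation code.
    """
    if len(classes) < 2:
        raise ValueError("need at least two classes to corrupt labels")
    rng = random.Random(seed)
    n_flip = round(len(labels) * pct_noise / 100)
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        # Exclude the current label so every selected item really becomes wrong.
        candidates = [c for c in classes if c != labels[i]]
        noisy[i] = rng.choice(candidates)
    return noisy


classes = ["n02979186", "n03000684", "n03417042", "n03394916"]
clean = [classes[i % len(classes)] for i in range(200)]
noisy = corrupt_labels(clean, classes, pct_noise=25)
n_wrong = sum(a != b for a, b in zip(clean, noisy))
print(n_wrong / len(clean))  # 0.25
```

Excluding the original class from the candidates is a deliberate choice here: if the replacement were drawn from all classes, some "flipped" labels would land back on the correct class and the effective noise rate would be lower than the nominal one.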

Note

The validation set remains clean and its labels are not changed. While technically the accuracy metric is robust to noise, I believe it’s simpler to use a clean validation set to clearly see whether a model is learning appropriate decision boundaries on the ground truth.

For the original Imagenette dataset, there are technically (3 image sizes) × (4 epoch settings) leaderboards for each of Imagenette and Imagewoof, giving a total of 24 leaderboards. If we had each of these 24 leaderboards for the previously mentioned 4 noise levels (1%, 5%, 25%, 50%), that would give us 96 leaderboards! Instead, the Imagenette repository only maintains leaderboards for Noisy Imagenette (and not Imagewoof) at 5% and 50% noise (24 leaderboards). Just like with the regular Imagenette leaderboards, feel free to send a pull request to the Imagenette repository with your results if they beat the current top score. I have provided a baseline which is currently on the leaderboards, as well as a CSV file with the baseline accuracy for all 96 leaderboards.
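The leaderboard counts above can be double-checked with a few lines of arithmetic. The variable names are only for illustration; the counts themselves come from the paragraph above.

```python
n_sizes = 3            # image sizes per dataset
n_epoch_settings = 4   # epoch budgets per image size
datasets = ["Imagenette", "Imagewoof"]
noise_levels = [1, 5, 25, 50]   # percent of incorrect labels
maintained_levels = [5, 50]     # noise levels with maintained leaderboards

# Original leaderboards: every size/epoch combination for both datasets.
original = n_sizes * n_epoch_settings * len(datasets)
# Hypothetical full grid: the original leaderboards at every noise level.
all_noisy = original * len(noise_levels)
# Maintained noisy leaderboards: Imagenette only, at two noise levels.
maintained = n_sizes * n_epoch_settings * 1 * len(maintained_levels)

print(original, all_noisy, maintained)  # 24 96 24
```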

Backstory

For some background, I started looking into training with label noise because of the recent Cassava Leaf Disease Kaggle Competition (my team was able to win a silver medal, see here), which had a really noisy dataset. One of the recent techniques I heard about was SAM, which recently achieved a state-of-the-art score on ImageNet (only to be beaten within a few weeks by techniques/models like Meta Pseudo Labels and NFNets). The SAM paper also reported improvements to noisy label training. I had a fastai implementation of the SAM optimizer in progress (I will probably describe it in an upcoming blog post) and I wanted to test its noisy label training capabilities on a dataset. I thought about corrupting the Imagenette labels and using that as my dataset for testing SAM. Jeremy Howard suggested adding it to the main Imagenette dataset, and here we are!

Closing Remarks

In conclusion, I hope that the Noisy Imagenette dataset serves as a useful benchmark for the machine learning community when it comes to testing and comparing techniques for training on noisy labels. I hope to experiment with some of these techniques, like SAM and the different loss functions, and record the results on this blog, so be sure to keep an eye on it, or follow me on Twitter to get the latest updates!

Acknowledgments

I’d like to thank Jeremy Howard and especially Hamel Husain for adding the Noisy Imagenette dataset. I also would like to thank Hamel Husain for reviewing my blog post and providing feedback. I’d like to thank Isaac Flath for pointing out an error I originally had when generating the dataset.