June 13, 2024

Performance Evaluation Metrics for the Apollo Model

To evaluate the effectiveness of commercially available Large Language Models (LLMs) in the medical field, we focus on their ability to address a wide array of medical inquiries. These include not only clinical questions but also the financial aspects of healthcare, such as medical billing, coding, and appeals. Currently, many medical LLMs struggle to handle these specialized financial queries accurately because of the complexity of healthcare regulations and the detailed nature of medical coding. Furthermore, the length of medical records often poses a challenge for clinical models like ClinicalBERT, which have a limited context size. To overcome these obstacles, we have developed the Apollo Model. This Medical Large Language Model is designed to combine a deep understanding of the medical domain with the ability to retrieve relevant medical information, decode intricate coding systems, and apply expert-level reasoning efficiently.

The Apollo Model was fine-tuned on a large amount of medical data, as well as data related to medical coding, billing, and appeal generation.

To evaluate the model, we assessed it on the following medical benchmark datasets:

  1. MedQA-USMLE-4-options
  2. MedMCQA
  3. PubMedQA
  4. MMLU Professional Medicine
  5. MMLU Clinical Knowledge

In addition, we conducted a human evaluation to assess long-form answers on the following metrics:

  • Scientific Consensus
  • Extent of Possible Harm
  • Likelihood of Possible Harm
  • Evidence of Correct Comprehension
  • Evidence of Correct Retrieval
  • Evidence of Correct Reasoning
  • Evidence of Incorrect Comprehension
  • Evidence of Incorrect Retrieval
  • Evidence of Incorrect Reasoning
  • Incorrect Content
  • Missing Content

Benchmark datasets:


MedQA-USMLE-4-options

The "MedQA-USMLE-4-options" accuracy metric measures the performance of Large Language Models (LLMs) on the subset of the MedQA suite that is styled after the United States Medical Licensing Examination (USMLE) and presents four answer choices per question. It indicates how well a model handles medical multiple-choice questions structured like those in the USMLE, the comprehensive examination for medical licensure in the United States.
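As a minimal sketch of how an accuracy figure of this kind could be computed, assume each benchmark item has a gold answer letter and the model returns one letter per question. The function name and the sample data are illustrative and are not taken from Apollo's evaluation harness.

```python
def multiple_choice_accuracy(predictions, gold_answers):
    """Return the fraction of questions whose predicted option matches the gold option."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must have the same length")
    if not gold_answers:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, gold_answers)
    )
    return correct / len(gold_answers)


# Hypothetical four-option items, with answers given as letters A-D.
gold = ["A", "C", "B", "D"]
predicted = ["A", "C", "D", "D"]
accuracy = multiple_choice_accuracy(predicted, gold)  # 3 of 4 correct
```

The same calculation would apply to the other multiple-choice benchmarks listed above, with only the set of allowed answer labels changing.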

*Medprompt retrieval is a prompting and retrieval technique that inserts similar questions and answers from a large knowledge base into the prompt before the model answers the question.
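The idea in the note above can be sketched as follows. This is an illustration under stated assumptions, not the Medprompt implementation: word overlap (Jaccard similarity) stands in for whatever similarity measure a real system would use, and the knowledge base entries, function names, and prompt layout are all hypothetical.

```python
def token_set(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(text.lower().split())


def jaccard(a, b):
    """Overlap between two token sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_prompt(question, knowledge_base, k=2):
    """Prepend the k most similar solved questions to the new question."""
    query_tokens = token_set(question)
    ranked = sorted(
        knowledge_base,
        key=lambda item: jaccard(query_tokens, token_set(item["question"])),
        reverse=True,
    )
    parts = [f"Q: {item['question']}\nA: {item['answer']}" for item in ranked[:k]]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical knowledge base entries, for illustration only.
kb = [
    {"question": "Which vitamin deficiency causes scurvy?", "answer": "Vitamin C"},
    {"question": "Which organ produces insulin?", "answer": "Pancreas"},
    {"question": "Which vitamin deficiency causes rickets?", "answer": "Vitamin D"},
]
prompt = build_prompt("Which vitamin deficiency causes night blindness?", kb, k=2)
```

Here the two vitamin questions are selected as examples and the unrelated insulin question is left out of the prompt.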


PubMedQA

The "PubMedQA" accuracy metric measures the performance of Large Language Models (LLMs) on the PubMedQA dataset. PubMedQA is designed to evaluate how well models answer biomedical questions based on abstracts from PubMed, a free search engine that primarily accesses the MEDLINE database of references and abstracts on life sciences and biomedical topics. The questions are derived from article titles, and the answer is yes, no, or maybe, depending on the content of the corresponding abstract.

Apollo's results on this benchmark represent a notable improvement, suggesting that specific prompting strategies can greatly improve a model's ability to understand and answer questions based on biomedical abstracts.


MedMCQA

The "MedMCQA" accuracy metric evaluates the performance of Large Language Models (LLMs) on the MedMCQA dataset, which is designed specifically for medical multiple-choice question answering. It contains questions covering a wide range of medical subjects and challenges models to choose the correct answer from the options provided.

The MedMCQA dataset is critical for testing how well LLMs can navigate complex medical scenarios in which they must use their understanding of medical facts and their reasoning abilities to pick the correct answer from several plausible options. This ability is essential in real-world medical applications, where decisions often have to be made by choosing the best option from many potential ones. The scores on this metric offer insight into the Apollo Model's depth of medical knowledge and its practical application in a testing scenario akin to what medical professionals might encounter.

MMLU Clinical Knowledge

The "MMLU Clinical Knowledge" accuracy metric evaluates the performance of Large Language Models (LLMs) on the Clinical Knowledge subset of the MMLU (Massive Multitask Language Understanding) dataset. The MMLU dataset is extensive and covers a wide range of subjects; the Clinical Knowledge subset specifically focuses on testing the models' understanding of medical concepts, practices, and their ability to apply this knowledge to answer clinically relevant questions.

The MMLU Clinical Knowledge accuracy metric is crucial for assessing the capability of LLMs in medical domains, particularly in understanding and applying clinical information in practical scenarios. These scores provide insight into each model's effectiveness in real-world medical settings, where accurate and contextually appropriate application of knowledge is critical. The high performance of the Apollo Model on the MMLU Clinical Knowledge subset highlights its potential utility in supporting healthcare professionals by providing reliable information.

MMLU Professional Medicine

The "MMLU Professional Medicine" accuracy metric assesses the performance of Large Language Models (LLMs) on the Professional Medicine subset of the MMLU (Massive Multitask Language Understanding) dataset. This particular subset is designed to test the models' grasp of advanced medical knowledge that would typically be used by healthcare professionals in their practice. It involves a deeper level of understanding and application of medical principles, diagnoses, treatments, and ethical considerations in medicine.

The MMLU Professional Medicine accuracy metric is vital for evaluating how well LLMs can support medical professionals by providing accurate, actionable, and ethically sound medical advice. The Apollo Model's high scores in this category indicate its potential to act as a reliable aid in clinical decision-making, capable of handling the sophisticated challenges faced in healthcare environments.

Final scores are shown below. Compared with every model evaluated without Medprompt retrieval, Apollo matched or exceeded the other models' scores across the board.

With Medprompt retrieval added, and compared with all models (including those using Medprompt retrieval), Apollo again matched or beat every other clinical LLM.

Human Evaluation Test:

We wanted to go further than standard benchmark tests and evaluate Apollo with a panel of human experts, from billers and coders to practicing clinicians.

For this test, 100 random medical questions were crafted by medical experts, encompassing 60 clinical inquiries and 40 questions related to billing, coding, and denials. These questions covered a range of topics, including clinical diagnosis and management, billing queries, prediction of ICD and CPT codes, generating appeal letters, and creating a clinical summary from deidentified medical notes.

Answers to these questions were obtained from four sources:

  1. A panel of human experts in the relevant medical fields, including doctors and billers/coders.
  2. The Apollo Model with few-shot prompting, with or without retrieval-augmented generation (RAG) and with or without Tree of Thought (TOT), especially for billing and coding questions.
  3. Claude 3 Sonnet.
  4. GPT-4.


Answers from all four sources were evaluated by a separate panel of human experts, each specializing in the relevant medical field, to minimize bias. Each question was assigned to a single expert, so each answer was rated by exactly one expert. The source of each answer (Apollo Model, human expert, Claude, or GPT-4) was concealed from the evaluator so that perceptions of a source's reliability could not influence the ratings.
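The concealment step could be implemented along the following lines. This is a sketch under assumptions: the function name, the neutral "Answer N" labels, and the placeholder answer texts are hypothetical, and the post does not describe the tooling actually used.

```python
import random


def blind_answers(answers_by_source, rng):
    """Shuffle one question's answers and replace source names with neutral labels.

    Returns (blinded, key): blinded maps label -> answer text for the evaluator,
    key maps label -> source and is kept away from the evaluator.
    """
    sources = list(answers_by_source)
    rng.shuffle(sources)
    blinded, key = {}, {}
    for index, source in enumerate(sources, start=1):
        label = f"Answer {index}"
        blinded[label] = answers_by_source[source]
        key[label] = source
    return blinded, key


# Placeholder answer texts; the four sources are those listed in the study.
answers = {
    "Human expert": "text of expert answer",
    "Apollo Model": "text of Apollo answer",
    "Claude 3 Sonnet": "text of Claude answer",
    "GPT-4": "text of GPT-4 answer",
}
blinded, key = blind_answers(answers, random.Random(0))
```

The evaluator sees only `blinded`; the `key` mapping is used after scoring to attribute ratings back to each source.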

For each criterion, evaluators assigned a score from 0 to 2 based on predefined definitions. These scores reflected the quality and appropriateness of the answers relative to the criterion. After all evaluations were complete, the scores were analyzed to compare the performance of the human experts and the LLMs across the criteria.

Statistical Analysis: The non-parametric bootstrap method was used to estimate the variability of the results. The procedure consisted of:

  • Generating 1,000 bootstrap replicas for each set of answers.
  • Constructing a distribution for each set based on these replicas.
  • Utilizing the 95% bootstrap percentile interval to assess and report variations.
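The three steps above can be sketched with the standard library alone. The statistic used here (the percentage of ratings that fall in one score category) is an assumption about what was bootstrapped, and the ratings are made-up numbers on the 0-2 scale, not study data.

```python
import random


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 100]."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def bootstrap_interval(ratings, category, replicas=1000, seed=0):
    """95% percentile interval for the share of ratings equal to `category`."""
    rng = random.Random(seed)
    n = len(ratings)
    shares = []
    for _ in range(replicas):
        # Draw n ratings with replacement to form one bootstrap replica.
        resample = [ratings[rng.randrange(n)] for _ in range(n)]
        shares.append(100 * resample.count(category) / n)
    shares.sort()
    return percentile(shares, 2.5), percentile(shares, 97.5)


# Made-up ratings for one criterion and one answer source.
ratings = [2] * 90 + [1] * 7 + [0] * 3
low, high = bootstrap_interval(ratings, category=2)
```

Two sources can then be compared on a criterion by checking how much their intervals overlap.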

Evaluation criteria table:

Scientific Consensus:

The Apollo Model demonstrates strong alignment with scientific consensus, closely rivaling other advanced models such as GPT-4. Its specific strengths include:

  • High Alignment with Consensus: The Apollo Model showcases a 95.07% alignment with the current scientific consensus, which is marginally higher than the Human Expert and almost on par with the best-performing GPT-4. This highlights the model's robust ability to produce scientifically accurate and accepted responses.
  • Low Opposition and No Consensus Rates: The Apollo Model has a slightly higher rate of opposition to consensus (0.99%) than the other models, but this is still under 1%, indicating that contrary responses are rare. Its low no-consensus rate (1.86%) further supports its reliability in adhering to established scientific views.
  • Comparative Advantage: Compared to the Claude Model and even the Human Expert, the Apollo Model's more favorable confidence intervals in the highest compliance category (aligned with consensus) suggest a strong, dependable performance that closely matches the top tier of current AI capabilities in understanding and replicating scientific consensus.

Extent of Possible Harm and Likelihood of Possible Harm:

  • The Apollo Model excels in minimizing the risk of severe harm, with a significantly lower mean percentage (1.94%) compared to the other models. This highlights its effectiveness in delivering safe advice.
  • With a mean of 77.30% for "No harm," the Apollo Model also leads in providing advice with the least potential for harm, surpassing even the Human Expert and slightly outperforming the Claude Model.
  • The Apollo Model's performance in these safety metrics suggests it has a strong ability to evaluate and mitigate risks in its advice, making it particularly valuable in scenarios where safety is paramount.
  • The Apollo Model performs well in minimizing high-risk advice, with a lower percentage of high-risk scores (4.86%) compared to the Human Expert and GPT-4 Model, and comparable to the Claude Model. This indicates a safer approach in its recommendations.
  • In terms of medium risk, the Apollo Model is slightly below the Claude Model but has a similar risk profile to the other models. This suggests a balanced approach to risk management, where the model doesn't overly prioritize caution at the expense of potentially valuable advice.
  • The Apollo Model also has a relatively high percentage of low-risk scores (71.40%), demonstrating its capability to provide safe advice that is unlikely to result in harm. This is marginally higher than both the Human Expert and GPT-4 Model, though very close to the Claude Model, indicating strong performance in ensuring the safety of its outputs.

Overall, the Apollo Model's strong performance in minimizing potential harm, while maintaining a good balance in risk management, highlights its robustness and reliability in scenarios where minimizing risk is crucial.

Comprehension, Retrieval, and Reasoning:


Comprehension

  • Apollo Model: Demonstrates excellent comprehension, leading with a high score of 91.08% in the strong evidence category, which highlights its capability to understand and accurately align with the expected understanding of questions.
  • Human Expert and GPT-4 Model: Both show strong comprehension abilities, though slightly behind the Apollo Model.
  • Claude Model: Generally performs well in comprehension despite a minor percentage of responses showing no evidence of understanding.
  • The superior comprehension performance of the Apollo Model underscores its suitability for applications needing precise interpretation of complex queries.


Retrieval

  • Apollo Model: Excels in retrieving relevant and accurate information, leading with 88.97% in the strong evidence category and showing the lowest incidences of no evidence or partial evidence in retrieval.
  • GPT-4 and Claude Model: Also perform well but occasionally include less targeted information compared to Apollo, suggesting areas for improvement in information precision.
  • The Apollo Model's robust performance in information retrieval underscores its reliability for tasks requiring high fidelity in data handling, like academic research or medical diagnosis.


Reasoning

  • Apollo Model: Exhibits superior logical reasoning, with the highest performance in minimizing incorrect reasoning (92.24% in the "No evidence" category) and leading in correct reasoning (90.24% in the strong evidence category).
  • GPT-4 and Claude Models: Perform well but show slightly less consistency in logical reasoning compared to Apollo.
  • Human Expert: Shows more variability in reasoning correctness, highlighting some limitations in consistently applying logical rules compared to automated models.

Incorrect Comprehension and Retrieval

  • Apollo Model: Demonstrates the highest accuracy in avoiding misunderstandings, with the lowest percentages in "Some misunderstanding" and "Strong misunderstanding" categories.
  • Incorrect Retrieval Across Models: All models generally minimize incorrect retrieval effectively, with Apollo and GPT-4 standing out for their precision in avoiding retrieval of incorrect information.

Incorrect Reasoning

  • Apollo Model: Stands out in effectively avoiding logical fallacies, emphasizing its capability in sophisticated and reliable logical processing.
  • The very low occurrences of illogical reasoning across all models underline their sophistication and reliability in handling complex reasoning tasks.

In summary, the Apollo Model particularly shines across all dimensions of comprehension, retrieval, and reasoning, proving its efficacy in environments that demand high accuracy and logical consistency.

Incorrect Content and Missing Content

Incorrect Content:

  • Apollo Model exhibits a solid performance with a majority indicating no clinical significance of incorrect content (72.94%), which is similar to the Claude Model and slightly higher than the Human Expert. However, it has a higher percentage of responses categorized under great clinical significance, suggesting that when incorrect content appears, it tends to be more significant.
  • Human Expert and Claude Model have moderate levels of incorrect content with great clinical significance, suggesting occasional lapses in content appropriateness.
  • GPT-4 Model shows a lower percentage for no clinical significance and the highest for great clinical significance among all models, indicating a higher tendency to include incorrect content that may have significant impacts.

Missing Content

  • The Apollo Model stands out with the highest score in minimizing the omission of crucial information, as indicated by its high percentage in the "No clinical significance" category (84.23%). This suggests that it is particularly adept at including necessary and relevant information in its responses.
  • Apollo Model also demonstrates a comparatively low percentage of responses with great clinical significance of missing content, reinforcing its efficiency in comprehensive information delivery.
  • The GPT-4 and Claude Models perform well but show higher percentages in the categories for little and great clinical significance, indicating they sometimes miss including some key details in their responses.

FewShot Prompting and TOT +/- RAG:

The Apollo Model's performance in coding questions, particularly those related to ICD (International Classification of Diseases) and CPT (Current Procedural Terminology) codes, benefits significantly from advanced prompting techniques like FewShot Prompting and TOT (Tree of Thought) +/- RAG (Retrieval-Augmented Generation). These techniques are used to enhance the model's ability to generate more accurate and relevant responses based on a deep understanding of the coding standards and medical billing practices. Here’s how each technique contributes to the model's performance:

FewShot Prompting

FewShot Prompting involves placing a small set of worked examples, representative of the larger task, directly in the prompt so that the model can infer the expected pattern without additional training or fine-tuning. In the context of the Apollo Model:

  • Enhanced Learning from Limited Examples: By presenting the model with a few high-quality examples of coding questions and their correct answers, FewShot Prompting helps the model learn to generalize from these examples to unseen questions. This is particularly useful for coding questions where nuances in the language of the question can significantly alter the required codes.
  • Improved Accuracy: The model learns the patterns and context specific to ICD and CPT coding, which improves its accuracy in predicting the correct codes based on the medical descriptions or procedures mentioned in the questions.
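In a few-shot setup the worked examples are typically placed in the prompt itself. A minimal sketch, assuming a simple "Diagnosis / Code" layout: the function name, the instruction line, and the choice of examples are illustrative and are not Apollo's actual prompt. The example pairs are standard ICD-10-CM codes.

```python
FEW_SHOT_EXAMPLES = [
    ("Type 2 diabetes mellitus without complications", "E11.9"),
    ("Essential (primary) hypertension", "I10"),
]


def few_shot_prompt(description, examples=FEW_SHOT_EXAMPLES):
    """Build a prompt that shows worked examples before the new query.

    The examples live only in the prompt text; the model itself is unchanged.
    """
    lines = ["Assign the ICD-10-CM code for each diagnosis.", ""]
    for diagnosis, code in examples:
        lines += [f"Diagnosis: {diagnosis}", f"Code: {code}", ""]
    lines += [f"Diagnosis: {description}", "Code:"]
    return "\n".join(lines)


prompt = few_shot_prompt("Unspecified asthma, uncomplicated")
```

The prompt ends with an open "Code:" line, so the model's completion is the predicted code for the new diagnosis.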


TOT +/- RAG

TOT (Tree of Thought) and RAG (Retrieval-Augmented Generation) are techniques that extend the capabilities of the model by enabling it to reason more effectively and access a broader range of information:

  • Enhanced Reasoning with TOT: The TOT strategy prompts the model to generate intermediate reasoning steps, explore several candidate reasoning paths, and evaluate them before committing to an answer. For ICD and CPT coding questions, this means the model might first identify key medical terms or conditions in the query, consider the relevant coding guidelines, and then select the appropriate codes.
  • Information Retrieval with RAG: RAG integrates external information retrieval into the response generation process. The model retrieves relevant documents or data (such as coding manuals or previously answered queries) while generating answers. This is especially useful for complex coding scenarios where the model might benefit from referencing current coding guidelines or examples similar to the question at hand.
  • Adaptability with +/- RAG: Running the model with or without the retrieval component allows the Apollo Model to balance its internal knowledge against externally retrieved information. This flexibility can be crucial in adapting the model's responses to the specificity and complexity of the coding question.
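The techniques in this list can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the word-overlap retriever, the prompt wording, and the `propose` and `score` stand-ins, which in a real system would be model calls. None of it reflects Apollo's internal implementation.

```python
def retrieve(question, documents, k=1):
    """Rank documents by shared lowercase words with the question (toy retriever)."""
    query = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def coding_prompt(question, documents, use_rag):
    """Build the prompt with (+RAG) or without (-RAG) retrieved reference text."""
    parts = []
    if use_rag:
        parts += [f"Reference: {doc}" for doc in retrieve(question, documents)]
    parts.append(f"Question: {question}")
    parts.append("Reason step by step, then state the final code.")
    return "\n".join(parts)


def tree_of_thought(question, propose, score, depth=2, beam_width=2):
    """Expand several reasoning paths per level and keep only the best-scoring ones."""
    frontier = [[]]
    for _ in range(depth):
        candidates = [path + [step] for path in frontier for step in propose(question, path)]
        if not candidates:
            break
        candidates.sort(key=lambda path: score(question, path), reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]


# Toy stand-ins for the model: fixed candidate steps and hand-set preferences.
def propose(question, path):
    options = [["identify diagnosis", "identify procedure"], ["check guideline", "guess code"]]
    return options[len(path)] if len(path) < len(options) else []


def score(question, path):
    preferred = {"identify diagnosis": 2, "identify procedure": 1, "check guideline": 2, "guess code": 0}
    return sum(preferred[step] for step in path)


documents = [
    "I10 is the ICD-10-CM code for essential (primary) hypertension.",
    "E11.9 is the ICD-10-CM code for type 2 diabetes mellitus without complications.",
]
question = "Which ICD-10-CM code applies to essential hypertension?"
with_rag = coding_prompt(question, documents, use_rag=True)
without_rag = coding_prompt(question, documents, use_rag=False)
best_path = tree_of_thought(question, propose, score)
```

With `use_rag=True` the hypertension reference is prepended to the prompt, and with `use_rag=False` the model relies on its internal knowledge alone. The search keeps the path that identifies the diagnosis and then checks the guideline.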

The combination of FewShot Prompting and TOT +/- RAG in the Apollo Model particularly enhances its performance on medical coding questions by:

  • Providing a robust framework for understanding and applying complex medical and coding terminology accurately.
  • Enabling the model to adapt its responses based on the depth of coding knowledge required and the context of each specific query.
  • Ensuring high fidelity in the generation of coding-related responses, which reduces errors and increases reliability.

Overall, these techniques help the Apollo Model excel in tasks that require precise and contextually correct responses, such as those involving the generation of ICD and CPT codes in medical billing and coding questions.
