Appraising the Quality of Medical Education Research Methods

The Medical Education Research Study Quality Instrument and the Newcastle–Ottawa Scale-Education.

Cook, David A. MD, MHPE; Reed, Darcy A. MD, MPH

D.A. Cook is professor of medicine and medical education, Department of Medicine; associate director, Mayo Center for Online Learning; and research chair, Mayo Clinic Multidisciplinary Simulation Center, Mayo Clinic College of Medicine, Rochester, Minnesota.

D.A. Reed is associate professor of medicine and medical education, Department of Medicine, and senior associate dean of academic affairs, Mayo Medical School, Mayo Clinic College of Medicine, Rochester, Minnesota.

Funding/Support: None reported.

Other disclosures: None reported.

Ethical approval: Reported as not applicable.

Correspondence should be addressed to David A. Cook, Division of General Internal Medicine, Mayo Clinic College of Medicine, Mayo 17, 200 First St. SW, Rochester, MN 55905; telephone: (507) 266-4156; e-mail: [email protected] .

Purpose 

The Medical Education Research Study Quality Instrument (MERSQI) and the Newcastle–Ottawa Scale-Education (NOS-E) were developed to appraise methodological quality in medical education research. The study objective was to evaluate the interrater reliability, normative scores, and between-instrument correlation for these two instruments.

Method 

In 2014, the authors searched PubMed and Google for articles using the MERSQI or NOS-E. They obtained or extracted data for interrater reliability—using the intraclass correlation coefficient (ICC)—and normative scores. They calculated between-scale correlation using Spearman rho.

Results 

Each instrument contains items concerning sampling, controlling for confounders, and integrity of outcomes. Interrater reliability for overall scores ranged from 0.68 to 0.95. Interrater reliability was “substantial” or better (ICC > 0.60) for nearly all domain-specific items on both instruments. Most instances of low interrater reliability were associated with restriction of range, and raw agreement was usually good. Across 26 studies evaluating published research, the median overall MERSQI score was 11.3 (range 8.9–15.1, of a possible 18). Across six studies, the median overall NOS-E score was 3.22 (range 2.08–3.82, of a possible 6). Overall MERSQI and NOS-E scores correlated reasonably well (rho 0.49–0.72).

Conclusions 

The MERSQI and NOS-E are useful, reliable, complementary tools for appraising methodological quality of medical education research. Interpretation and use of their scores should focus on item-specific codes rather than overall scores. Normative scores should be used for relative rather than absolute judgments because different research questions require different study designs.
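The agreement and correlation statistics named in the abstract (the ICC and Spearman rho) can be computed for one's own appraisal data. Below is a minimal sketch using hypothetical ratings from two raters; the scores, the choice of a two-way random-effects ICC(2,1), and the NOS-E totals are illustrative assumptions, not the authors' actual data or analysis.

```python
# Sketch: interrater reliability (ICC) and between-scale correlation (Spearman rho).
# All ratings below are hypothetical.
import numpy as np
from scipy.stats import spearmanr

# Rows = studies being appraised, columns = two raters (hypothetical MERSQI totals).
ratings = np.array([
    [10.5, 11.0],
    [12.0, 12.5],
    [ 9.0,  9.5],
    [14.0, 13.5],
    [11.5, 11.5],
])
n, k = ratings.shape

# Two-way random-effects, single-measure ICC(2,1) from the ANOVA decomposition.
grand = ratings.mean()
ms_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2) / (n - 1)
ms_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2) / (k - 1)
ss_total = np.sum((ratings - grand) ** 2)
ms_err = (ss_total - ms_rows * (n - 1) - ms_cols * (k - 1)) / ((n - 1) * (k - 1))
icc_2_1 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Between-scale correlation: hypothetical overall MERSQI vs. NOS-E scores per study.
mersqi_totals = np.array([10.5, 12.0, 9.0, 14.0, 11.5])
nose_totals = np.array([3.0, 3.5, 2.5, 4.0, 3.0])
rho, p_value = spearmanr(mersqi_totals, nose_totals)

print(f"ICC(2,1): {icc_2_1:.2f}, Spearman rho: {rho:.2f} (p = {p_value:.3f})")
```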


Study Quality Assessment Tools

In 2013, NHLBI developed a set of tailored quality assessment tools to assist reviewers in focusing on concepts that are key to a study’s internal validity. The tools were specific to certain study designs and tested for potential flaws in study methods or implementation. Experts used the tools during the systematic evidence review process to update existing clinical guidelines, such as those on cholesterol, blood pressure, and obesity. Their findings are outlined in the following reports:

  • Assessing Cardiovascular Risk: Systematic Evidence Review from the Risk Assessment Work Group
  • Management of Blood Cholesterol in Adults: Systematic Evidence Review from the Cholesterol Expert Panel
  • Management of Blood Pressure in Adults: Systematic Evidence Review from the Blood Pressure Expert Panel
  • Managing Overweight and Obesity in Adults: Systematic Evidence Review from the Obesity Expert Panel

While these tools have not been independently published and would not be considered standardized, they may be useful to the research community. These reports describe how experts used the tools for the project. Researchers may want to use the tools for their own projects; however, they would need to determine their own parameters for making judgments. Details about the design and application of the tools are included in Appendix A of the reports.

Quality Assessment of Controlled Intervention Studies

Criteria (Yes / No / Other: CD, NR, NA*)
1. Was the study described as randomized, a randomized trial, a randomized clinical trial, or an RCT?      
2. Was the method of randomization adequate (i.e., use of randomly generated assignment)?      
3. Was the treatment allocation concealed (so that assignments could not be predicted)?      
4. Were study participants and providers blinded to treatment group assignment?      
5. Were the people assessing the outcomes blinded to the participants' group assignments?      
6. Were the groups similar at baseline on important characteristics that could affect outcomes (e.g., demographics, risk factors, co-morbid conditions)?      
7. Was the overall drop-out rate from the study at endpoint 20% or lower of the number allocated to treatment?      
8. Was the differential drop-out rate (between treatment groups) at endpoint 15 percentage points or lower?      
9. Was there high adherence to the intervention protocols for each treatment group?      
10. Were other interventions avoided or similar in the groups (e.g., similar background treatments)?      
11. Were outcomes assessed using valid and reliable measures, implemented consistently across all study participants?      
12. Did the authors report that the sample size was sufficiently large to be able to detect a difference in the main outcome between groups with at least 80% power?      
13. Were outcomes reported or subgroups analyzed prespecified (i.e., identified before analyses were conducted)?      
14. Were all randomized participants analyzed in the group to which they were originally assigned, i.e., did they use an intention-to-treat analysis?      
Quality Rating (Good, Fair, or Poor)
Rater #1 initials:
Rater #2 initials:
Additional Comments (If POOR, please state why):

*CD, cannot determine; NA, not applicable; NR, not reported

Guidance for Assessing the Quality of Controlled Intervention Studies

The guidance document below is organized by question number from the tool for quality assessment of controlled intervention studies.

Question 1. Described as randomized

Was the study described as randomized? A study does not satisfy quality criteria as randomized simply because the authors call it randomized; however, it is a first step in determining whether a study is randomized.

Questions 2 and 3. Treatment allocation–two interrelated pieces

Adequate randomization: Randomization is adequate if it occurred according to the play of chance (e.g., a computer-generated sequence in more recent studies, or a random number table in older studies). Inadequate randomization: Randomization is inadequate if there is a preset plan (e.g., alternation, where every other subject is assigned to the treatment arm, or allocation by another method such as time or day of hospital admission or clinic visit, ZIP Code, phone number, etc.). In fact, this is not randomization at all–it is another method of assignment to groups. If assignment is not by the play of chance, then the answer to this question is no. There may be some tricky scenarios that will need to be read carefully and considered for the role of chance in assignment. For example, randomization may occur at the site level, where all individuals at a particular site are assigned to receive treatment or no treatment. This scenario is used for group-randomized trials, which can be truly randomized, but often are "quasi-experimental" studies with comparison groups rather than true control groups. (Few, if any, group-randomized trials are anticipated for this evidence review.)
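As a concrete illustration of assignment by the play of chance, here is a minimal sketch of a computer-generated allocation sequence using permuted blocks; the block size, arm labels, and fixed seed are illustrative assumptions, not part of the NHLBI guidance.

```python
# Sketch: a computer-generated allocation sequence (permuted blocks), the kind of
# "play of chance" assignment the guidance treats as adequate randomization.
import random

def permuted_block_sequence(n_participants, block_size=4, arms=("treatment", "control"), seed=42):
    """Return a randomized allocation list built from shuffled balanced blocks."""
    rng = random.Random(seed)
    per_arm = block_size // len(arms)
    sequence = []
    while len(sequence) < n_participants:
        block = [arm for arm in arms for _ in range(per_arm)]
        rng.shuffle(block)          # chance, not alternation, decides the order
        sequence.extend(block)
    return sequence[:n_participants]

print(permuted_block_sequence(10))
```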

Allocation concealment: This means that one does not know in advance, or cannot guess accurately, to what group the next person eligible for randomization will be assigned. Methods include sequentially numbered opaque sealed envelopes, numbered or coded containers, central randomization by a coordinating center, computer-generated randomization that is not revealed ahead of time, etc.

Questions 4 and 5. Blinding

Blinding means that one does not know to which group–intervention or control–the participant is assigned. It is also sometimes called "masking." The reviewer assessed whether each of the following was blinded to knowledge of treatment assignment: (1) the person assessing the primary outcome(s) for the study (e.g., taking the measurements such as blood pressure, examining health records for events such as myocardial infarction, reviewing and interpreting test results such as x ray or cardiac catheterization findings); (2) the person receiving the intervention (e.g., the patient or other study participant); and (3) the person providing the intervention (e.g., the physician, nurse, pharmacist, dietitian, or behavioral interventionist).

Generally placebo-controlled medication studies are blinded to patient, provider, and outcome assessors; behavioral, lifestyle, and surgical studies are examples of studies that are frequently blinded only to the outcome assessors because blinding of the persons providing and receiving the interventions is difficult in these situations. Sometimes the individual providing the intervention is the same person performing the outcome assessment. This was noted when it occurred.

Question 6. Similarity of groups at baseline

This question relates to whether the intervention and control groups have similar baseline characteristics on average, especially those characteristics that may affect the intervention or outcomes. The point of randomized trials is to create groups that are as similar as possible except for the intervention(s) being studied, in order to compare the effects of the interventions between groups. When reviewers abstracted baseline characteristics, they noted when there was a significant difference between groups. Baseline characteristics for intervention groups are usually presented in a table in the article (often Table 1).

Groups can differ at baseline without raising red flags if: (1) the differences would not be expected to have any bearing on the interventions and outcomes; or (2) the differences are not statistically significant. When concerned about baseline difference in groups, reviewers recorded them in the comments section and considered them in their overall determination of the study quality.

Questions 7 and 8. Dropout

"Dropouts" in a clinical trial are individuals for whom there are no end point measurements, often because they dropped out of the study and were lost to followup.

Generally, an acceptable overall dropout rate is considered 20 percent or less of participants who were randomized or allocated into each group. An acceptable differential dropout rate is an absolute difference between groups of 15 percentage points at most (calculated by subtracting the dropout rate of one group from the dropout rate of the other group). However, these are general guidelines. Lower overall dropout rates are expected in shorter studies, whereas higher overall dropout rates may be acceptable for studies of longer duration. For example, a 6-month study of weight loss interventions should be expected to have nearly 100 percent followup (almost no dropouts–nearly everybody gets their weight measured regardless of whether or not they actually received the intervention), whereas a 10-year study testing the effects of intensive blood pressure lowering on heart attacks may be acceptable if there is a 20-25 percent dropout rate, especially if the dropout rate between groups was similar. The panels for the NHLBI systematic reviews may set different levels of dropout caps.

Conversely, differential dropout rates are not flexible; the cap is 15 percentage points. If the differential dropout rate between arms is 15 percentage points or higher, then there is a serious potential for bias. This constitutes a fatal flaw, resulting in a poor quality rating for the study.
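The overall and differential dropout thresholds above are simple arithmetic; a minimal sketch with hypothetical counts is shown below.

```python
# Sketch: overall and differential dropout checks using the thresholds in
# Questions 7 and 8 (20% overall, 15 percentage points between arms).
def dropout_check(randomized_a, completed_a, randomized_b, completed_b):
    drop_a = 1 - completed_a / randomized_a
    drop_b = 1 - completed_b / randomized_b
    overall = 1 - (completed_a + completed_b) / (randomized_a + randomized_b)
    differential = abs(drop_a - drop_b)
    return {
        "overall_ok": overall <= 0.20,
        "differential_ok": differential < 0.15,
        "overall_pct": round(100 * overall, 1),
        "differential_points": round(100 * differential, 1),
    }

# Hypothetical example: 200 randomized per arm; 170 and 150 with endpoint data.
print(dropout_check(200, 170, 200, 150))
```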

Question 9. Adherence

Did participants in each treatment group adhere to the protocols for assigned interventions? For example, if Group 1 was assigned to 10 mg/day of Drug A, did most of them take 10 mg/day of Drug A? Another example is a study evaluating the difference between a 30-pound weight loss and a 10-pound weight loss on specific clinical outcomes (e.g., heart attacks), but the 30-pound weight loss group did not achieve its intended weight loss target (e.g., the group only lost 14 pounds on average). A third example is whether a large percentage of participants assigned to one group "crossed over" and got the intervention provided to the other group. A final example is when one group that was assigned to receive a particular drug at a particular dose had a large percentage of participants who did not end up taking the drug or the dose as designed in the protocol.

Question 10. Avoid other interventions

Changes that occur in the study outcomes being assessed should be attributable to the interventions being compared in the study. If study participants receive interventions that are not part of the study protocol and could affect the outcomes being assessed, and they receive these interventions differentially, then there is cause for concern because these interventions could bias results. The following scenario is another example of how bias can occur. In a study comparing two different dietary interventions on serum cholesterol, one group had a significantly higher percentage of participants taking statin drugs than the other group. In this situation, it would be impossible to know if a difference in outcome was due to the dietary intervention or the drugs.

Question 11. Outcome measures assessment

What tools or methods were used to measure the outcomes in the study? Were the tools and methods accurate and reliable–for example, have they been validated, or are they objective? This is important as it indicates the confidence you can have in the reported outcomes. Perhaps even more important is ascertaining that outcomes were assessed in the same manner within and between groups. One example of differing methods is self-report of dietary salt intake versus urine testing for sodium content (a more reliable and valid assessment method). Another example is using BP measurements taken by practitioners who use their usual methods versus using BP measurements done by individuals trained in a standard approach. Such an approach may include using the same instrument each time and taking an individual's BP multiple times. In each of these cases, the answer to this assessment question would be "no" for the former scenario and "yes" for the latter. In addition, a study in which an intervention group was seen more frequently than the control group, enabling more opportunities to report clinical events, would not be considered reliable and valid.

Question 12. Power calculation

Generally, a study's methods section will address the sample size needed to detect differences in primary outcomes. The current standard is at least 80 percent power to detect a clinically relevant difference in an outcome using a two-sided alpha of 0.05. Often, however, older studies will not report on power.
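For reference, a power calculation of the kind described above can be sketched as follows; the effect size of 0.5 standard deviations is an illustrative assumption.

```python
# Sketch: a priori sample size for 80% power, two-sided alpha of 0.05,
# for a two-group comparison of means (statsmodels).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   alternative="two-sided")
print(f"Participants needed per group: {n_per_group:.0f}")
```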

Question 13. Prespecified outcomes

Investigators should prespecify outcomes reported in a study for hypothesis testing–which is the reason for conducting an RCT. Without prespecified outcomes, the study may be reporting ad hoc analyses, simply looking for differences supporting desired findings. Investigators also should prespecify subgroups being examined. Most RCTs conduct numerous post hoc analyses as a way of exploring findings and generating additional hypotheses. The intent of this question is to give more weight to reports that are not simply exploratory in nature.

Question 14. Intention-to-treat analysis

Intention-to-treat (ITT) means everybody who was randomized is analyzed according to the original group to which they were assigned. This is an extremely important concept because conducting an ITT analysis preserves the whole reason for doing a randomized trial; that is, to compare groups that differ only in the intervention being tested. When the ITT philosophy is not followed, the groups being compared may no longer be the same. In this situation, the study would likely be rated poor. However, if an investigator used another type of analysis that could be viewed as valid, this would be explained in the "other" box on the quality assessment form. Some researchers use a completers analysis (an analysis of only the participants who completed the intervention and the study), which introduces significant potential for bias. Characteristics of participants who do not complete the study are unlikely to be the same as those who do. The likely impact of participants withdrawing from a study treatment must be considered carefully. ITT analysis provides a more conservative (potentially less biased) estimate of effectiveness.
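A small simulation can illustrate why a completers-only analysis may diverge from ITT; the data and the dropout mechanism below (sicker treated participants drop out more often) are illustrative assumptions, not derived from any real trial.

```python
# Sketch: ITT vs. completers-only estimates under selective dropout (simulated data).
import numpy as np

rng = np.random.default_rng(3)
n = 5000
treated = rng.integers(0, 2, n)
severity = rng.normal(0, 1, n)
# Outcome improves with treatment and worsens with severity.
outcome = 1.0 * treated - 1.0 * severity + rng.normal(0, 1, n)
# Dropout is more likely among severe patients in the treated arm (assumed mechanism).
p_drop = 1 / (1 + np.exp(-(-2 + 1.5 * treated * severity)))
completed = rng.binomial(1, 1 - p_drop).astype(bool)

itt_effect = outcome[treated == 1].mean() - outcome[treated == 0].mean()
comp_effect = (outcome[(treated == 1) & completed].mean()
               - outcome[(treated == 0) & completed].mean())
# Under this mechanism the completers estimate is typically inflated.
print(f"ITT estimate:        {itt_effect:.2f}")
print(f"Completers estimate: {comp_effect:.2f}")
```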

General Guidance for Determining the Overall Quality Rating of Controlled Intervention Studies

The questions on the assessment tool were designed to help reviewers focus on the key concepts for evaluating a study's internal validity. They are not intended to create a list that is simply tallied up to arrive at a summary judgment of quality.

Internal validity is the extent to which the results (effects) reported in a study can truly be attributed to the intervention being evaluated and not to flaws in the design or conduct of the study–in other words, the ability for the study to make causal conclusions about the effects of the intervention being tested. Such flaws can increase the risk of bias. Critical appraisal involves considering the risk of potential for allocation bias, measurement bias, or confounding (the mixture of exposures that one cannot tease out from each other). Examples of confounding include co-interventions, differences at baseline in patient characteristics, and other issues addressed in the questions above. High risk of bias translates to a rating of poor quality. Low risk of bias translates to a rating of good quality.

Fatal flaws: If a study has a "fatal flaw," then risk of bias is significant, and the study is of poor quality. Examples of fatal flaws in RCTs include high dropout rates, high differential dropout rates, and lack of an ITT analysis or use of another unsuitable statistical analysis (e.g., completers-only analysis).

Generally, when evaluating a study, one will not see a "fatal flaw;" however, one will find some risk of bias. During training, reviewers were instructed to look for the potential for bias in studies by focusing on the concepts underlying the questions in the tool. For any box checked "no," reviewers were told to ask: "What is the potential risk of bias that may be introduced by this flaw?" That is, does this factor cause one to doubt the results that were reported in the study?

NHLBI staff provided reviewers with background reading on critical appraisal, while emphasizing that the best approach to use is to think about the questions in the tool in determining the potential for bias in a study. The staff also emphasized that each study has specific nuances; therefore, reviewers should familiarize themselves with the key concepts.

Quality Assessment of Systematic Reviews and Meta-Analyses

Criteria (Yes / No / Other: CD, NR, NA*)
1. Is the review based on a focused question that is adequately formulated and described?      
2. Were eligibility criteria for included and excluded studies predefined and specified?      
3. Did the literature search strategy use a comprehensive, systematic approach?      
4. Were titles, abstracts, and full-text articles dually and independently reviewed for inclusion and exclusion to minimize bias?      
5. Was the quality of each included study rated independently by two or more reviewers using a standard method to appraise its internal validity?      
6. Were the included studies listed along with important characteristics and results of each study?      
7. Was publication bias assessed?      
8. Was heterogeneity assessed? (This question applies only to meta-analyses.)      

Guidance for Quality Assessment Tool for Systematic Reviews and Meta-Analyses

A systematic review is a study that attempts to answer a question by synthesizing the results of primary studies while using strategies to limit bias and random error.[424] These strategies include a comprehensive search of all potentially relevant articles and the use of explicit, reproducible criteria in the selection of articles included in the review. Research designs and study characteristics are appraised, data are synthesized, and results are interpreted using a predefined systematic approach that adheres to evidence-based methodological principles.

Systematic reviews can be qualitative or quantitative. A qualitative systematic review summarizes the results of the primary studies but does not combine the results statistically. A quantitative systematic review, or meta-analysis, is a type of systematic review that employs statistical techniques to combine the results of the different studies into a single pooled estimate of effect, often given as an odds ratio. The guidance document below is organized by question number from the tool for quality assessment of systematic reviews and meta-analyses.
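Before the question-by-question guidance, here is a minimal sketch of the pooled estimate just described, using a fixed-effect inverse-variance model on hypothetical 2x2 counts; real meta-analyses may instead use random-effects or other pooling models.

```python
# Sketch: an inverse-variance fixed-effect pooled odds ratio, the "single pooled
# estimate of effect" a meta-analysis produces. The 2x2 counts are hypothetical.
import numpy as np

# Each row: events_treated, n_treated, events_control, n_control.
studies = np.array([
    [15, 100, 25, 100],
    [30, 200, 45, 200],
    [ 8,  80, 14,  80],
])

a, n1, c, n2 = studies.T
b, d = n1 - a, n2 - c
log_or = np.log((a * d) / (b * c))            # per-study log odds ratio
var = 1 / a + 1 / b + 1 / c + 1 / d           # Woolf variance of the log OR
weights = 1 / var

pooled_log_or = np.sum(weights * log_or) / np.sum(weights)
se = np.sqrt(1 / np.sum(weights))
ci = np.exp(pooled_log_or + np.array([-1.96, 1.96]) * se)

print(f"Pooled OR: {np.exp(pooled_log_or):.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f})")
```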

Question 1. Focused question

The review should be based on a question that is clearly stated and well-formulated. An example would be a question that uses the PICO (population, intervention, comparator, outcome) format, with all components clearly described.

Question 2. Eligibility criteria

The eligibility criteria used to determine whether studies were included or excluded should be clearly specified and predefined. It should be clear to the reader why studies were included or excluded.

Question 3. Literature search

The search strategy should employ a comprehensive, systematic approach in order to capture all of the evidence possible that pertains to the question of interest. At a minimum, a comprehensive review has the following attributes:

  • Electronic searches were conducted using multiple scientific literature databases, such as MEDLINE, EMBASE, Cochrane Central Register of Controlled Trials, PsychLit, and others as appropriate for the subject matter.
  • Manual searches of references found in articles and textbooks should supplement the electronic searches.

Additional search strategies that may be used to improve the yield include the following:

  • Studies published in other countries
  • Studies published in languages other than English
  • Identification by experts in the field of studies and articles that may have been missed
  • Search of grey literature, including technical reports and other papers from government agencies or scientific groups or committees; presentations and posters from scientific meetings, conference proceedings, unpublished manuscripts; and others. Searching the grey literature is important (whenever feasible) because sometimes only positive studies with significant findings are published in the peer-reviewed literature, which can bias the results of a review.

In their reviews, researchers described the literature search strategy clearly and confirmed that it could be reproduced by others with similar results.

Question 4. Dual review for determining which studies to include and exclude

Titles, abstracts, and full-text articles (when indicated) should be reviewed by two independent reviewers to determine which studies to include and exclude in the review. Reviewers resolved disagreements through discussion and consensus or with third parties. They clearly stated the review process, including methods for settling disagreements.

Question 5. Quality appraisal for internal validity

Each included study should be appraised for internal validity (study quality assessment) using a standardized approach for rating the quality of the individual studies. Ideally, at least two independent reviewers should appraise each study for internal validity. However, there is not one commonly accepted, standardized tool for rating the quality of studies. So, in the research papers, reviewers looked for an assessment of the quality of each study and a clear description of the process used.

Question 6. List and describe included studies

All included studies were listed in the review, along with descriptions of their key characteristics. This was presented either in narrative or table format.

Question 7. Publication bias

Publication bias is a term used when studies with positive results have a higher likelihood of being published, being published rapidly, being published in higher impact journals, being published in English, being published more than once, or being cited by others.[425,426] Publication bias can be linked to favorable or unfavorable treatment of research findings due to investigators, editors, industry, commercial interests, or peer reviewers. To minimize the potential for publication bias, researchers can conduct a comprehensive literature search that includes the strategies discussed in Question 3.

A funnel plot–a scatter plot of component studies in a meta-analysis–is a commonly used graphical method for detecting publication bias. If there is no significant publication bias, the graph looks like a symmetrical inverted funnel.
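A funnel plot can be sketched directly from per-study effect estimates and standard errors; the data below are simulated for illustration only.

```python
# Sketch: a basic funnel plot (effect estimate vs. standard error, SE axis
# inverted), the graphical publication-bias check described above.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
true_log_or = 0.3
se = rng.uniform(0.05, 0.5, size=40)                 # smaller SE = larger study
log_or = rng.normal(true_log_or, se)                 # observed study effects

plt.scatter(log_or, se)
plt.axvline(true_log_or, linestyle="--")
plt.gca().invert_yaxis()                             # most precise studies at the top
plt.xlabel("Log odds ratio")
plt.ylabel("Standard error")
plt.title("Funnel plot (symmetry suggests little publication bias)")
plt.show()
```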

Reviewers assessed and clearly described the likelihood of publication bias.

Question 8. Heterogeneity

Heterogeneity is used to describe important differences in studies included in a meta-analysis that may make it inappropriate to combine the studies.[427] Heterogeneity can be clinical (e.g., important differences between study participants, baseline disease severity, and interventions); methodological (e.g., important differences in the design and conduct of the study); or statistical (e.g., important differences in the quantitative results or reported effects).

Researchers usually assess clinical or methodological heterogeneity qualitatively by determining whether it makes sense to combine studies. For example:

  • Should a study evaluating the effects of an intervention on CVD risk that involves elderly male smokers with hypertension be combined with a study that involves healthy adults ages 18 to 40? (Clinical Heterogeneity)
  • Should a study that uses a randomized controlled trial (RCT) design be combined with a study that uses a case-control study design? (Methodological Heterogeneity)

Statistical heterogeneity describes the degree of variation in the effect estimates from a set of studies; it is assessed quantitatively. The two most common methods used to assess statistical heterogeneity are the Q test (a chi-square, or χ2, test) and the I2 statistic.
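Both statistics can be computed directly from per-study effect estimates and variances; a minimal sketch with hypothetical values follows.

```python
# Sketch: Cochran's Q and the I-squared statistic for statistical heterogeneity,
# computed from per-study effect estimates and variances (hypothetical values).
import numpy as np
from scipy.stats import chi2

effects = np.array([0.20, 0.35, 0.10, 0.50, 0.28])   # e.g., log odds ratios
variances = np.array([0.02, 0.05, 0.03, 0.04, 0.02])

weights = 1 / variances
pooled = np.sum(weights * effects) / np.sum(weights)
q = np.sum(weights * (effects - pooled) ** 2)
df = len(effects) - 1
p_value = chi2.sf(q, df)
i_squared = max(0.0, (q - df) / q) * 100

print(f"Q = {q:.2f} (df = {df}, p = {p_value:.3f}), I^2 = {i_squared:.0f}%")
```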

Reviewers examined studies to determine if an assessment for heterogeneity was conducted and clearly described. If the studies are found to be heterogeneous, the investigators should explore and explain the causes of the heterogeneity, and determine what influence, if any, the study differences had on overall study results.

Quality Assessment Tool for Observational Cohort and Cross-Sectional Studies

Criteria (Yes / No / Other: CD, NR, NA*)
1. Was the research question or objective in this paper clearly stated?      
2. Was the study population clearly specified and defined?      
3. Was the participation rate of eligible persons at least 50%?      
4. Were all the subjects selected or recruited from the same or similar populations (including the same time period)? Were inclusion and exclusion criteria for being in the study prespecified and applied uniformly to all participants?      
5. Was a sample size justification, power description, or variance and effect estimates provided?      
6. For the analyses in this paper, were the exposure(s) of interest measured prior to the outcome(s) being measured?      
7. Was the timeframe sufficient so that one could reasonably expect to see an association between exposure and outcome if it existed?      
8. For exposures that can vary in amount or level, did the study examine different levels of the exposure as related to the outcome (e.g., categories of exposure, or exposure measured as continuous variable)?      
9. Were the exposure measures (independent variables) clearly defined, valid, reliable, and implemented consistently across all study participants?      
10. Was the exposure(s) assessed more than once over time?      
11. Were the outcome measures (dependent variables) clearly defined, valid, reliable, and implemented consistently across all study participants?      
12. Were the outcome assessors blinded to the exposure status of participants?      
13. Was loss to follow-up after baseline 20% or less?      
14. Were key potential confounding variables measured and adjusted statistically for their impact on the relationship between exposure(s) and outcome(s)?      

Guidance for Assessing the Quality of Observational Cohort and Cross-Sectional Studies

The guidance document below is organized by question number from the tool for quality assessment of observational cohort and cross-sectional studies.

Question 1. Research question

Did the authors describe their goal in conducting this research? Is it easy to understand what they were looking to find? This issue is important for any scientific paper of any type. Higher quality scientific research explicitly defines a research question.

Questions 2 and 3. Study population

Did the authors describe the group of people from which the study participants were selected or recruited, using demographics, location, and time period? If you were to conduct this study again, would you know who to recruit, from where, and from what time period? Is the cohort population free of the outcomes of interest at the time they were recruited?

An example would be men over 40 years old with type 2 diabetes who began seeking medical care at Phoenix Good Samaritan Hospital between January 1, 1990 and December 31, 1994. In this example, the population is clearly described as: (1) who (men over 40 years old with type 2 diabetes); (2) where (Phoenix Good Samaritan Hospital); and (3) when (between January 1, 1990 and December 31, 1994). Another example is women 34 to 59 years of age in 1980 who were in the nursing profession and had no known coronary disease, stroke, cancer, hypercholesterolemia, or diabetes, and were recruited from the 11 most populous States, with contact information obtained from State nursing boards.

In cohort studies, it is crucial that the population at baseline is free of the outcome of interest. For example, the nurses' population above would be an appropriate group in which to study incident coronary disease. This information is usually found either in descriptions of population recruitment, definitions of variables, or inclusion/exclusion criteria.

You may need to look at prior papers on methods in order to make the assessment for this question. Those papers are usually in the reference list.

If fewer than 50% of eligible persons participated in the study, then there is concern that the study population does not adequately represent the target population. This increases the risk of bias.

Question 4. Groups recruited from the same population and uniform eligibility criteria

Were the inclusion and exclusion criteria developed prior to recruitment or selection of the study population? Were the same underlying criteria used for all of the subjects involved? This issue is related to the description of the study population, above, and you may find the information for both of these questions in the same section of the paper.

Most cohort studies begin with the selection of the cohort; participants in this cohort are then measured or evaluated to determine their exposure status. However, some cohort studies may recruit or select exposed participants in a different time or place than unexposed participants, especially retrospective cohort studies–which is when data are obtained from the past (retrospectively), but the analysis examines exposures prior to outcomes. For example, one research question could be whether diabetic men with clinical depression are at higher risk for cardiovascular disease than those without clinical depression. So, diabetic men with depression might be selected from a mental health clinic, while diabetic men without depression might be selected from an internal medicine or endocrinology clinic. This study recruits groups from different clinic populations, so this example would get a "no."

However, the women nurses described in the question above were selected based on the same inclusion/exclusion criteria, so that example would get a "yes."

Question 5. Sample size justification

Did the authors present their reasons for selecting or recruiting the number of people included or analyzed? Do they note or discuss the statistical power of the study? This question is about whether or not the study had enough participants to detect an association if one truly existed.

A paragraph in the methods section of the article may explain the sample size needed to detect a hypothesized difference in outcomes. You may also find a discussion of power in the discussion section (such as the study had 85 percent power to detect a 20 percent increase in the rate of an outcome of interest, with a 2-sided alpha of 0.05). Sometimes estimates of variance and/or estimates of effect size are given, instead of sample size calculations. In any of these cases, the answer would be "yes."

However, observational cohort studies often do not report anything about power or sample sizes because the analyses are exploratory in nature. In this case, the answer would be "no." This is not a "fatal flaw." It just may indicate that attention was not paid to whether the study was sufficiently sized to answer a prespecified question–i.e., it may have been an exploratory, hypothesis-generating study.

Question 6. Exposure assessed prior to outcome measurement

This question is important because, in order to determine whether an exposure causes an outcome, the exposure must come before the outcome.

For some prospective cohort studies, the investigator enrolls the cohort and then determines the exposure status of various members of the cohort (large epidemiological studies like Framingham used this approach). However, for other cohort studies, the cohort is selected based on its exposure status, as in the example above of depressed diabetic men (the exposure being depression). Other examples include a cohort identified by its exposure to fluoridated drinking water and then compared to a cohort living in an area without fluoridated water, or a cohort of military personnel exposed to combat in the Gulf War compared to a cohort of military personnel not deployed in a combat zone.

With either of these types of cohort studies, the cohort is followed forward in time (i.e., prospectively) to assess the outcomes that occurred in the exposed members compared to nonexposed members of the cohort. Therefore, you begin the study in the present by looking at groups that were exposed (or not) to some biological or behavioral factor, intervention, etc., and then you follow them forward in time to examine outcomes. If a cohort study is conducted properly, the answer to this question should be "yes," since the exposure status of members of the cohort was determined at the beginning of the study before the outcomes occurred.

For retrospective cohort studies, the same principle applies. The difference is that, rather than identifying a cohort in the present and following them forward in time, the investigators go back in time (i.e., retrospectively) and select a cohort based on their exposure status in the past and then follow them forward to assess the outcomes that occurred in the exposed and nonexposed cohort members. Because in retrospective cohort studies the exposure and outcomes may have already occurred (it depends on how long they follow the cohort), it is important to make sure that the exposure preceded the outcome.

Sometimes cross-sectional studies are conducted (or cross-sectional analyses of cohort-study data), where the exposures and outcomes are measured during the same timeframe. As a result, cross-sectional analyses provide weaker evidence than regular cohort studies regarding a potential causal relationship between exposures and outcomes. For cross-sectional analyses, the answer to Question 6 should be "no."

Question 7. Sufficient timeframe to see an effect

Did the study allow enough time for a sufficient number of outcomes to occur or be observed, or enough time for an exposure to have a biological effect on an outcome? In the examples given above, if clinical depression has a biological effect on increasing risk for CVD, such an effect may take years. In the other example, if higher dietary sodium increases BP, a short timeframe may be sufficient to assess its association with BP, but a longer timeframe would be needed to examine its association with heart attacks.

The issue of timeframe is important to enable meaningful analysis of the relationships between exposures and outcomes to be conducted. This often requires at least several years, especially when looking at health outcomes, but it depends on the research question and outcomes being examined.

Cross-sectional analyses allow no time to see an effect, since the exposures and outcomes are assessed at the same time, so those would get a "no" response.

Question 8. Different levels of the exposure of interest

If the exposure can be defined as a range (examples: drug dosage, amount of physical activity, amount of sodium consumed), were multiple categories of that exposure assessed? (for example, for drugs: not on the medication, on a low dose, medium dose, high dose; for dietary sodium, higher than average U.S. consumption, lower than recommended consumption, between the two). Sometimes discrete categories of exposure are not used, but instead exposures are measured as continuous variables (for example, mg/day of dietary sodium or BP values).

In any case, studying different levels of exposure (where possible) enables investigators to assess trends or dose-response relationships between exposures and outcomes–e.g., the higher the exposure, the greater the rate of the health outcome. The presence of trends or dose-response relationships lends credibility to the hypothesis of causality between exposure and outcome.

For some exposures, however, this question may not be applicable (e.g., the exposure may be a dichotomous variable like living in a rural setting versus an urban setting, or vaccinated/not vaccinated with a one-time vaccine). If there are only two possible exposures (yes/no), then this question should be given an "NA," and it should not count negatively towards the quality rating.
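One common way to examine the dose-response relationship described above is a test for trend across ordered exposure categories. The sketch below uses simulated data and a logistic model with the ordinal exposure entered as a single linear term; the variable names, coding, and effect sizes are illustrative assumptions, and this is only one of several valid approaches.

```python
# Sketch: a test for trend across ordered exposure categories (simulated data).
# exposure_level is coded 0 (none), 1 (low), 2 (medium), 3 (high).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
exposure_level = rng.integers(0, 4, size=n)
# Simulate an outcome whose log-odds rise with exposure level.
p_event = 1 / (1 + np.exp(-(-2.0 + 0.5 * exposure_level)))
event = rng.binomial(1, p_event)
df = pd.DataFrame({"exposure_level": exposure_level, "event": event})

# Treating the ordinal exposure as a single linear term gives a test for trend.
model = smf.logit("event ~ exposure_level", data=df).fit(disp=0)
print(model.summary().tables[1])
```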

Question 9. Exposure measures and assessment

Were the exposure measures defined in detail? Were the tools or methods used to measure exposure accurate and reliable–for example, have they been validated or are they objective? This issue is important as it influences confidence in the reported exposures. When exposures are measured with less accuracy or validity, it is harder to see an association between exposure and outcome even if one exists. Also as important is whether the exposures were assessed in the same manner within groups and between groups; if not, bias may result.

For example, retrospective self-report of dietary salt intake is not as valid and reliable as prospectively using a standardized dietary log plus testing participants' urine for sodium content. Another example is measurement of BP, where there may be quite a difference between usual care, where clinicians measure BP however it is done in their practice setting (which can vary considerably), and use of trained BP assessors using standardized equipment (e.g., the same BP device which has been tested and calibrated) and a standardized protocol (e.g., patient is seated for 5 minutes with feet flat on the floor, BP is taken twice in each arm, and all four measurements are averaged). In each of these cases, the former would get a "no" and the latter a "yes."

Here is a final example that illustrates the point about why it is important to assess exposures consistently across all groups: If people with higher BP (exposed cohort) are seen by their providers more frequently than those without elevated BP (nonexposed group), it also increases the chances of detecting and documenting changes in health outcomes, including CVD-related events. Therefore, it may lead to the conclusion that higher BP leads to more CVD events. This may be true, but it could also be due to the fact that the subjects with higher BP were seen more often; thus, more CVD-related events were detected and documented simply because they had more encounters with the health care system. Thus, it could bias the results and lead to an erroneous conclusion.

Question 10. Repeated exposure assessment

Was the exposure for each person measured more than once during the course of the study period? Multiple measurements with the same result increase our confidence that the exposure status was correctly classified. Also, multiple measurements enable investigators to look at changes in exposure over time, for example, people who ate high dietary sodium throughout the followup period, compared to those who started out high then reduced their intake, compared to those who ate low sodium throughout. Once again, this may not be applicable in all cases. In many older studies, exposure was measured only at baseline. However, multiple exposure measurements do result in a stronger study design.

Question 11. Outcome measures

Were the outcomes defined in detail? Were the tools or methods for measuring outcomes accurate and reliable–for example, have they been validated or are they objective? This issue is important because it influences confidence in the validity of study results. Also important is whether the outcomes were assessed in the same manner within groups and between groups.

An example of an outcome measure that is objective, accurate, and reliable is death–the outcome measured with more accuracy than any other. But even with a measure as objective as death, there can be differences in the accuracy and reliability of how death was assessed by the investigators. Did they base it on an autopsy report, death certificate, death registry, or report from a family member? Another example is a study of whether dietary fat intake is related to blood cholesterol level (cholesterol level being the outcome), and the cholesterol level is measured from fasting blood samples that are all sent to the same laboratory. These examples would get a "yes." An example of a "no" would be self-report by subjects that they had a heart attack, or self-report of how much they weigh (if body weight is the outcome of interest).

Similar to the example in Question 9, results may be biased if one group (e.g., people with high BP) is seen more frequently than another group (people with normal BP) because more frequent encounters with the health care system increases the chances of outcomes being detected and documented.

Question 12. Blinding of outcome assessors

Blinding means that outcome assessors did not know whether the participant was exposed or unexposed. It is also sometimes called "masking." The objective is to look for evidence in the article that the person(s) assessing the outcome(s) for the study (for example, examining medical records to determine the outcomes that occurred in the exposed and comparison groups) is masked to the exposure status of the participant. Sometimes the person measuring the exposure is the same person conducting the outcome assessment. In this case, the outcome assessor would most likely not be blinded to exposure status because they also took measurements of exposures. If so, make a note of that in the comments section.

As you assess this criterion, think about whether it is likely that the person(s) doing the outcome assessment would know (or be able to figure out) the exposure status of the study participants. If the answer is no, then blinding is adequate. An example of adequate blinding of the outcome assessors is to create a separate committee, whose members were not involved in the care of the patient and had no information about the study participants' exposure status. The committee would then be provided with copies of participants' medical records, which had been stripped of any potential exposure information or personally identifiable information. The committee would then review the records for prespecified outcomes according to the study protocol. If blinding was not possible, which is sometimes the case, mark "NA" and explain the potential for bias.

Question 13. Followup rate

Higher overall followup rates are always better than lower followup rates, even though higher rates are expected in shorter studies, whereas lower overall followup rates are often seen in studies of longer duration. Usually, an acceptable overall followup rate is considered 80 percent or more of participants whose exposures were measured at baseline. However, this is just a general guideline. For example, a 6-month cohort study examining the relationship between dietary sodium intake and BP level may have over 90 percent followup, but a 20-year cohort study examining effects of sodium intake on stroke may have only a 65 percent followup rate.

Question 14. Statistical analyses

Were key potential confounding variables measured and adjusted for, such as by statistical adjustment for baseline differences? Logistic regression or other regression methods are often used to account for the influence of variables not of interest.

This is a key issue in cohort studies, because statistical analyses need to control for potential confounders, in contrast to an RCT, where the randomization process controls for potential confounders. All key factors that may be associated both with the exposure of interest and the outcome–that are not of interest to the research question–should be controlled for in the analyses.

For example, in a study of the relationship between cardiorespiratory fitness and CVD events (heart attacks and strokes), the study should control for age, BP, blood cholesterol, and body weight, because all of these factors are associated both with low fitness and with CVD events. Well-done cohort studies control for multiple potential confounders.
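A minimal sketch of such adjustment, using multivariable logistic regression on simulated data, is shown below; the variable names, confounding structure, and effect sizes are illustrative assumptions rather than results from any real cohort.

```python
# Sketch: crude vs. confounder-adjusted association, as in the fitness-and-CVD
# example above. All data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 1000
age = rng.normal(55, 10, n)
fitness = rng.normal(35, 8, n) - 0.2 * (age - 55)        # fitness declines with age
systolic_bp = rng.normal(130, 15, n) + 0.3 * (age - 55)  # BP rises with age
logit = -3 + 0.04 * (age - 55) + 0.02 * (systolic_bp - 130) - 0.05 * (fitness - 35)
cvd_event = rng.binomial(1, 1 / (1 + np.exp(-logit)))
df = pd.DataFrame({"cvd_event": cvd_event, "fitness": fitness,
                   "age": age, "systolic_bp": systolic_bp})

crude = smf.logit("cvd_event ~ fitness", data=df).fit(disp=0)
adjusted = smf.logit("cvd_event ~ fitness + age + systolic_bp", data=df).fit(disp=0)
print("Crude OR per unit fitness:   ", round(np.exp(crude.params["fitness"]), 3))
print("Adjusted OR per unit fitness:", round(np.exp(adjusted.params["fitness"]), 3))
```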

General Guidance for Determining the Overall Quality Rating of Observational Cohort and Cross-Sectional Studies

The questions on the form are designed to help you focus on the key concepts for evaluating the internal validity of a study. They are not intended to create a list that you simply tally up to arrive at a summary judgment of quality.

Internal validity for cohort studies is the extent to which the results reported in the study can truly be attributed to the exposure being evaluated and not to flaws in the design or conduct of the study–in other words, the ability of the study to draw associative conclusions about the effects of the exposures being studied on outcomes. Any such flaws can increase the risk of bias.

Critical appraisal involves considering the risk of potential for selection bias, information bias, measurement bias, or confounding (the mixture of exposures that one cannot tease out from each other). Examples of confounding include co-interventions, differences at baseline in patient characteristics, and other issues throughout the questions above. High risk of bias translates to a rating of poor quality. Low risk of bias translates to a rating of good quality. (Thus, the greater the risk of bias, the lower the quality rating of the study.)

In addition, the more attention in the study design to issues that can help determine whether there is a causal relationship between the exposure and outcome, the higher quality the study. These include exposures occurring prior to outcomes, evaluation of a dose-response gradient, accuracy of measurement of both exposure and outcome, sufficient timeframe to see an effect, and appropriate control for confounding–all concepts reflected in the tool.

Generally, when you evaluate a study, you will not see a "fatal flaw," but you will find some risk of bias. By focusing on the concepts underlying the questions in the quality assessment tool, you should ask yourself about the potential for bias in the study you are critically appraising. For any box where you check "no" you should ask, "What is the potential risk of bias resulting from this flaw in study design or execution?" That is, does this factor cause you to doubt the results that are reported in the study or doubt the ability of the study to accurately assess an association between exposure and outcome?

The best approach is to think about the questions in the tool and how each one tells you something about the potential for bias in a study. The more you familiarize yourself with the key concepts, the more comfortable you will be with critical appraisal. Examples of studies rated good, fair, and poor are useful, but each study must be assessed on its own based on the details that are reported and consideration of the concepts for minimizing bias.

Quality Assessment of Case-Control Studies

Criteria (Yes / No / Other: CD, NR, NA*)
1. Was the research question or objective in this paper clearly stated and appropriate?      
2. Was the study population clearly specified and defined?      
3. Did the authors include a sample size justification?      
4. Were controls selected or recruited from the same or similar population that gave rise to the cases (including the same timeframe)?      
5. Were the definitions, inclusion and exclusion criteria, algorithms or processes used to identify or select cases and controls valid, reliable, and implemented consistently across all study participants?      
6. Were the cases clearly defined and differentiated from controls?      
7. If less than 100 percent of eligible cases and/or controls were selected for the study, were the cases and/or controls randomly selected from those eligible?      
8. Was there use of concurrent controls?      
9. Were the investigators able to confirm that the exposure/risk occurred prior to the development of the condition or event that defined a participant as a case?      
10. Were the measures of exposure/risk clearly defined, valid, reliable, and implemented consistently (including the same time period) across all study participants?      
11. Were the assessors of exposure/risk blinded to the case or control status of participants?      
12. Were key potential confounding variables measured and adjusted statistically in the analyses? If matching was used, did the investigators account for matching during study analysis?      
Quality Rating (Good, Fair, or Poor)
Rater #1 Initials:
Rater #2 Initials:
Additional Comments (If POOR, please state why):

Guidance for Assessing the Quality of Case-Control Studies

The guidance document below is organized by question number from the tool for quality assessment of case-control studies.

Question 1. Research question

Did the authors describe their goal in conducting this research? Is it easy to understand what they were looking to find? This issue is important for any scientific paper of any type. High quality scientific research explicitly defines a research question.

Question 2. Study population

Did the authors describe the group of individuals from which the cases and controls were selected or recruited, while using demographics, location, and time period? If the investigators conducted this study again, would they know exactly who to recruit, from where, and from what time period?

Investigators identify case-control study populations by location, time period, and inclusion criteria for cases (individuals with the disease, condition, or problem) and controls (individuals without the disease, condition, or problem). For example, the population for a study of lung cancer and chemical exposure would be all incident cases of lung cancer diagnosed in patients ages 35 to 79, from January 1, 2003 to December 31, 2008, living in Texas during that entire time period, as well as controls without lung cancer recruited from the same population during the same time period. The population is clearly described as: (1) who (men and women ages 35 to 79 with (cases) and without (controls) incident lung cancer); (2) where (living in Texas); and (3) when (between January 1, 2003 and December 31, 2008).

Other studies may use disease registries or data from cohort studies to identify cases. In these cases, the populations are individuals who live in the area covered by the disease registry or included in a cohort study (i.e., nested case-control or case-cohort). For example, a study of the relationship between vitamin D intake and myocardial infarction might use patients identified via the GRACE registry, a database of heart attack patients.

NHLBI staff encouraged reviewers to examine prior papers on methods (listed in the reference list) to make this assessment, if necessary.

Question 3. Target population and case representation

In order for a study to truly address the research question, the target population–the population from which the study population is drawn and to which study results are believed to apply–should be carefully defined. Some authors may compare characteristics of the study cases to characteristics of cases in the target population, either in text or in a table. When study cases are shown to be representative of cases in the appropriate target population, it increases the likelihood that the study was well-designed per the research question.

However, because these statistics are frequently difficult or impossible to measure, publications should not be penalized if case representation is not shown. For most papers, the response to question 3 will be "NR." Those subquestions are combined because the answer to the second subquestion–case representation–determines the response to this item. However, it cannot be determined without considering the response to the first subquestion. For example, if the answer to the first subquestion is "yes," and the second, "CD," then the response for item 3 is "CD."

Question 4. Sample size justification

Did the authors discuss their reasons for selecting or recruiting the number of individuals included? Did they discuss the statistical power of the study and provide a sample size calculation to ensure that the study is adequately powered to detect an association (if one exists)? This question does not refer to a description of the manner in which different groups were included or excluded using the inclusion/exclusion criteria (e.g., "Final study size was 1,378 participants after exclusion of 461 patients with missing data" is not considered a sample size justification for the purposes of this question).

An article's methods section usually contains information on the sample size needed to detect differences in exposures and on the statistical power of the study.
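Where a formal justification is reported, it typically resembles the following minimal sketch (not part of the NHLBI tool); the exposure prevalences, power, and alpha below are invented for illustration, using Python's statsmodels:

```python
# Hypothetical sample-size justification for a case-control study: how many
# cases (and an equal number of controls) are needed to detect a difference in
# exposure prevalence of 30% among cases vs. 20% among controls?
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect_size = proportion_effectsize(0.30, 0.20)   # Cohen's h for the two proportions
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,               # two-sided type I error
    power=0.80,               # desired statistical power
    ratio=1.0,                # one control per case
    alternative="two-sided",
)
print(f"About {n_per_group:.0f} cases and {n_per_group:.0f} controls would be needed.")
```

A paper reporting this kind of calculation, with its assumptions stated, would generally satisfy this question.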

Question 5. Groups recruited from the same population

To determine whether cases and controls were recruited from the same population, one can ask hypothetically, "If a control was to develop the outcome of interest (the condition that was used to select cases), would that person have been eligible to become a case?" Case-control studies begin with the selection of the cases (those with the outcome of interest, e.g., lung cancer) and controls (those in whom the outcome is absent). Cases and controls are then evaluated and categorized by their exposure status. For the lung cancer example, cases and controls were recruited from hospitals in a given region. One may reasonably assume that controls in the catchment area for the hospitals, or those already in the hospitals for a different reason, would attend those hospitals if they became a case; therefore, the controls are drawn from the same population as the cases. If the controls were recruited or selected from a different region (e.g., a State other than Texas) or time period (e.g., 1991-2000), then the cases and controls were recruited from different populations, and the answer to this question would be "no."

The following example further explores selection of controls. In a study, eligible cases were men and women, ages 18 to 39, who were diagnosed with atherosclerosis at hospitals in Perth, Australia, between July 1, 2000 and December 31, 2007. Appropriate controls for these cases might be sampled using voter registration information for men and women ages 18 to 39, living in Perth (population-based controls); they also could be sampled from patients without atherosclerosis at the same hospitals (hospital-based controls). As long as the controls are individuals who would have been eligible to be included in the study as cases (if they had been diagnosed with atherosclerosis), then the controls were selected appropriately from the same source population as cases.

In a prospective case-control study, investigators may enroll individuals as cases at the time they are found to have the outcome of interest; the number of cases usually increases as time progresses. At this same time, they may recruit or select controls from the population without the outcome of interest. One way to identify or recruit cases is through a surveillance system. In turn, investigators can select controls from the population covered by that system. This is an example of population-based controls. Investigators also may identify and select cases from a cohort study population and identify controls from outcome-free individuals in the same cohort study. This is known as a nested case-control study.

Question 6. Inclusion and exclusion criteria prespecified and applied uniformly

Were the inclusion and exclusion criteria developed prior to recruitment or selection of the study population? Were the same underlying criteria used for all of the groups involved? The investigators should have used the same selection criteria for cases and controls, except for the criterion defining the disease or condition itself, which by definition differs between the two groups. Thus, the investigators should use the same age (or age range), gender, race, and other characteristics to select cases and controls. Information on this topic is usually found in a paper's description of the study population.

Question 7. Case and control definitions

For this question, reviewers looked for a specific description of "case" and "control," a discussion of the validity of those definitions, and the processes or tools used to identify study participants as such. They determined if the tools or methods were accurate, reliable, and objective. For example, cases might be identified as "adult patients admitted to a VA hospital from January 1, 2000 to December 31, 2009, with an ICD-9 discharge diagnosis code of acute myocardial infarction and at least one of the two confirmatory findings in their medical records: at least 2 mm of ST-elevation changes in two or more ECG leads and an elevated troponin level." Investigators might also use ICD-9 or CPT codes to identify patients. All cases should be identified using the same methods. Unless the distinction between cases and controls is accurate and reliable, investigators cannot use study results to draw valid conclusions.

Question 8. Random selection of study participants

If a case-control study did not use 100 percent of eligible cases and/or controls (e.g., not all disease-free participants were included as controls), did the authors indicate that random sampling was used to select controls? When it is possible to identify the source population fairly explicitly (e.g., in a nested case-control study, or in a registry-based study), then random sampling of controls is preferred. When investigators used consecutive sampling, which is frequently done for cases in prospective studies, then study participants are not considered randomly selected. In this case, the reviewers would answer "no" to Question 8. However, this would not be considered a fatal flaw.

If investigators included all eligible cases and controls as study participants, then reviewers marked "NA" in the tool. If 100 percent of cases were included (e.g., NA for cases) but only 50 percent of eligible controls, then the response would be "yes" if the controls were randomly selected, and "no" if they were not. If this cannot be determined, the appropriate response is "CD."

Question 9. Concurrent controls

A concurrent control is a control selected at the time another person became a case, usually on the same day. This means that one or more controls are recruited or selected from the population without the outcome of interest at the time a case is diagnosed. Investigators can use this method in both prospective case-control studies and retrospective case-control studies. For example, in a retrospective study of adenocarcinoma of the colon using data from hospital records, if hospital records indicate that Person A was diagnosed with adenocarcinoma of the colon on June 22, 2002, then investigators would select one or more controls from the population of patients without adenocarcinoma of the colon on that same day. This assumes they conducted the study retrospectively, using data from hospital records. The investigators could have also conducted this study using patient records from a cohort study, in which case it would be a nested case-control study.

Investigators can use concurrent controls with or without matching; conversely, a study that uses matching did not necessarily use concurrent controls.

Question 10. Exposure assessed prior to outcome measurement

Investigators first determine case or control status (based on the presence or absence of the outcome of interest) and then assess the exposure history of the case or control. Although the exposure is therefore assessed after the outcome is known, reviewers ascertained that the exposure itself occurred prior to the outcome. For example, if the investigators used tissue samples to determine exposure, did they collect them from patients prior to their diagnosis? If hospital records were used, did investigators verify that the date a patient was exposed (e.g., received medication for atherosclerosis) occurred prior to the date they became a case (e.g., was diagnosed with type 2 diabetes)? For an association between an exposure and an outcome to be considered causal, the exposure must have occurred prior to the outcome.

Question 11. Exposure measures and assessment

Were the exposure measures defined in detail? Were the tools or methods used to measure exposure accurate and reliable–for example, have they been validated or are they objective? This is important, as it influences confidence in the reported exposures. Equally important is whether the exposures were assessed in the same manner within groups and between groups. This question pertains to bias resulting from exposure misclassification (i.e., exposure ascertainment).

For example, a retrospective self-report of dietary salt intake is not as valid and reliable as prospectively using a standardized dietary log plus testing participants' urine for sodium content, because participants' retrospective recall of dietary salt intake may be inaccurate and result in misclassification of exposure status. Similarly, blood pressure (BP) results from practices that use an established protocol for measuring BP would be considered more valid and reliable than results from practices that did not use standard protocols. A protocol may include using trained BP assessors, standardized equipment (e.g., the same BP device which has been tested and calibrated), and a standardized procedure (e.g., patient is seated for 5 minutes with feet flat on the floor, BP is taken twice in each arm, and all four measurements are averaged).

Question 12. Blinding of exposure assessors

Blinding or masking means that the person assessing exposure did not know whether the participant was a case or a control. To answer this question, reviewers examined articles for evidence that the exposure assessor(s) was masked to the case or control status of the research participants. An exposure assessor, for example, may examine medical records to determine the exposure history of cases and controls. Sometimes the person determining case or control status is the same person conducting the exposure assessment. In this case, the exposure assessor would most likely not be blinded. A reviewer would note such a finding in the comments section of the assessment tool.

One way to ensure good blinding of exposure assessment is to have a separate committee, whose members have no information about the study participants' status as cases or controls, review research participants' records. To help answer the question above, reviewers determined whether it was likely that the exposure assessor knew whether the study participant was a case or control; if this was unlikely, blinding was considered adequate. Exposure assessors who used medical records should not have been directly involved in the study participants' care, since they probably would have known about their patients' conditions. If the medical records contained information on the patient's condition that identified him/her as a case (which is likely), that information would have had to be removed before the exposure assessors reviewed the records.

If blinding was not possible, which sometimes happens, the reviewers marked "NA" in the assessment tool and explained the potential for bias.

Question 13. Statistical analysis

Were key potential confounding variables measured and adjusted for, such as by statistical adjustment for baseline differences? Investigators often use logistic regression or other regression methods to account for the influence of variables not of interest.

This is a key issue in case-control studies; statistical analyses need to control for potential confounders, in contrast to RCTs, in which the randomization process controls for them. In the analysis, investigators need to control for all key factors that may be associated with both the exposure of interest and the outcome and that are not themselves of interest to the research question.

A study of the relationship between smoking and CVD events illustrates this point. Such a study needs to control for age, gender, and body weight; all are associated with smoking and CVD events. Well-done case-control studies control for multiple potential confounders.

Matching is a technique used to improve study efficiency and control for known confounders. For example, in the study of smoking and CVD events, an investigator might identify cases that have had a heart attack or stroke and then select controls of similar age, gender, and body weight to the cases. For case-control studies, if matching was performed during the selection or recruitment process, the variables used as matching criteria (e.g., age, gender, race) should also be accounted for in the analysis.
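As a concrete illustration (not part of the NHLBI tool), a minimal sketch of such an adjusted analysis in Python is shown below; the data file and column names are invented, and case, smoker and gender are assumed to be coded 0/1:

```python
# Hypothetical adjusted analysis for a case-control study of smoking and CVD
# events: logistic regression of case status on the exposure, controlling for
# age, gender and body weight.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("case_control_data.csv")   # assumed: one row per participant
model = smf.logit("case ~ smoker + age + gender + weight", data=df).fit()
print(model.summary())   # exponentiated coefficients give adjusted odds ratios

# If cases and controls were individually matched, the matched sets should also
# be reflected in the analysis, e.g. via conditional logistic regression
# (available in recent statsmodels versions as ConditionalLogit).
```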

General Guidance for Determining the Overall Quality Rating of Case-Control Studies

NHLBI designed the questions in the assessment tool to help reviewers focus on the key concepts for evaluating a study's internal validity, not to use as a list from which to add up items to judge a study's quality.

Internal validity for case-control studies is the extent to which the associations between disease and exposure reported in the study can truly be attributed to the exposure being evaluated rather than to flaws in the design or conduct of the study. In other words, what is the ability of the study to draw associative conclusions about the effects of the exposures on outcomes? Any such flaws can increase the risk of bias.

In critically appraising a study, the following factors need to be considered: the potential for selection bias, information bias, measurement bias, or confounding (the mixture of exposures that one cannot tease out from each other). Examples of confounding include co-interventions, differences at baseline in patient characteristics, and other issues addressed in the questions above. High risk of bias translates to a poor quality rating; low risk of bias translates to a good quality rating. Again, the greater the risk of bias, the lower the quality rating of the study.

In addition, the more attention in the study design to issues that can help determine whether there is a causal relationship between the outcome and the exposure, the higher the quality of the study. These include exposures occurring prior to outcomes, evaluation of a dose-response gradient, accuracy of measurement of both exposure and outcome, sufficient timeframe to see an effect, and appropriate control for confounding–all concepts reflected in the tool.

If a study has a "fatal flaw," then risk of bias is significant; therefore, the study is deemed to be of poor quality. An example of a fatal flaw in case-control studies is a lack of a consistent standard process used to identify cases and controls.

Generally, when reviewers evaluated a study, they did not see a "fatal flaw," but instead found some risk of bias. By focusing on the concepts underlying the questions in the quality assessment tool, reviewers examined the potential for bias in the study. For any box checked "no," reviewers asked, "What is the potential risk of bias resulting from this flaw in study design or execution?" That is, did this factor lead to doubt about the results reported in the study or the ability of the study to accurately assess an association between exposure and outcome?

By examining questions in the assessment tool, reviewers were best able to assess the potential for bias in a study. Specific rules were not useful, as each study had specific nuances. In addition, being familiar with the key concepts helped reviewers assess the studies. Examples of studies rated good, fair, and poor were useful, yet each study had to be assessed on its own.

Quality Assessment Tool for Before-After (Pre-Post) Studies With No Control Group

Criteria: Yes / No / Other (CD, NR, NA)*
*CD = cannot determine; NR = not reported; NA = not applicable

1. Was the study question or objective clearly stated?      
2. Were eligibility/selection criteria for the study population prespecified and clearly described?      
3. Were the participants in the study representative of those who would be eligible for the test/service/intervention in the general or clinical population of interest?      
4. Were all eligible participants that met the prespecified entry criteria enrolled?      
5. Was the sample size sufficiently large to provide confidence in the findings?      
6. Was the test/service/intervention clearly described and delivered consistently across the study population?      
7. Were the outcome measures prespecified, clearly defined, valid, reliable, and assessed consistently across all study participants?      
8. Were the people assessing the outcomes blinded to the participants' exposures/interventions?      
9. Was the loss to follow-up after baseline 20% or less? Were those lost to follow-up accounted for in the analysis?      
10. Did the statistical methods examine changes in outcome measures from before to after the intervention? Were statistical tests done that provided p values for the pre-to-post changes?      
11. Were outcome measures of interest taken multiple times before the intervention and multiple times after the intervention (i.e., did they use an interrupted time-series design)?      
12. If the intervention was conducted at a group level (e.g., a whole hospital, a community, etc.) did the statistical analysis take into account the use of individual-level data to determine effects at the group level?      

Guidance for Assessing the Quality of Before-After (Pre-Post) Studies With No Control Group

Question 1. Study question

Did the authors describe their goal in conducting this research? As with the other tools, high-quality research clearly states the study question or objective.

Question 2. Eligibility criteria and study population

Did the authors describe the eligibility criteria applied to the individuals from whom the study participants were selected or recruited? In other words, if the investigators were to conduct this study again, would they know whom to recruit, from where, and from what time period?

Here is a sample description of a study population: men over age 40 with type 2 diabetes, who began seeking medical care at Phoenix Good Samaritan Hospital, between January 1, 2005 and December 31, 2007. The population is clearly described as: (1) who (men over age 40 with type 2 diabetes); (2) where (Phoenix Good Samaritan Hospital); and (3) when (between January 1, 2005 and December 31, 2007). Another sample description is women who were in the nursing profession, who were ages 34 to 59 in 1995, had no known CHD, stroke, cancer, hypercholesterolemia, or diabetes, and were recruited from the 11 most populous States, with contact information obtained from State nursing boards.

To assess this question, reviewers examined prior papers on study methods (listed in reference list) when necessary.

Question 3. Study participants representative of clinical populations of interest

The participants in the study should be generally representative of the population in which the intervention will be broadly applied. Studies on small demographic subgroups may raise concerns about how the intervention will affect broader populations of interest. For example, interventions that focus on very young or very old individuals may affect middle-aged adults differently. Similarly, researchers may not be able to extrapolate study results from patients with severe chronic diseases to healthy populations.

Question 4. All eligible participants enrolled

To further explore this question, reviewers may need to ask: Did the investigators develop the I/E criteria prior to recruiting or selecting study participants? Were the same underlying I/E criteria used for all research participants? Were all subjects who met the I/E criteria enrolled in the study?

Question 5. Sample size

Did the authors present their reasons for selecting or recruiting the number of individuals included or analyzed? Did they note or discuss the statistical power of the study? This question addresses whether there was a sufficient sample size to detect an association, if one did exist.

An article's methods section may provide information on the sample size needed to detect a hypothesized difference in outcomes and a discussion on statistical power (such as, the study had 85 percent power to detect a 20 percent increase in the rate of an outcome of interest, with a 2-sided alpha of 0.05). Sometimes estimates of variance and/or estimates of effect size are given, instead of sample size calculations. In any case, if the reviewers determined that the power was sufficient to detect the effects of interest, then they would answer "yes" to Question 5.

Question 6. Intervention clearly described

Another pertinent question regarding interventions is: Was the intervention clearly defined in detail in the study? Did the authors indicate that the intervention was consistently applied to the subjects? Did the research participants have a high level of adherence to the requirements of the intervention? For example, if the investigators assigned a group to 10 mg/day of Drug A, did most participants in this group take the specific dosage of Drug A? Or did a large percentage of participants end up not taking the specific dose of Drug A indicated in the study protocol?

Reviewers ascertained that changes in study outcomes could be attributed to study interventions. If participants received interventions that were not part of the study protocol and could affect the outcomes being assessed, the results could be biased.

Question 7. Outcome measures clearly described, valid, and reliable

Were the outcomes defined in detail? Were the tools or methods for measuring outcomes accurate and reliable–for example, have they been validated or are they objective? This question is important because the answer influences confidence in the validity of study results.

An example of an outcome measure that is objective, accurate, and reliable is death–the outcome measured with more accuracy than any other. But even with a measure as objective as death, differences can exist in the accuracy and reliability of how investigators assessed death. For example, did they base it on an autopsy report, death certificate, death registry, or report from a family member? Another example of a valid and reliable outcome measure is blood cholesterol level measured from fasting blood samples that are all sent to the same laboratory, in a study of whether dietary fat intake affects cholesterol level. These examples would get a "yes."

An example of a "no" would be self-report by subjects that they had a heart attack, or self-report of how much they weight (if body weight is the outcome of interest).

Question 8. Blinding of outcome assessors

Blinding or masking means that the outcome assessors did not know whether the participants received the intervention or were exposed to the factor under study. To answer the question above, the reviewers examined articles for evidence that the person(s) assessing the outcome(s) was masked to the participants' intervention or exposure status. An outcome assessor, for example, may examine medical records to determine the outcomes that occurred in the exposed and comparison groups. Sometimes the person applying the intervention or measuring the exposure is the same person conducting the outcome assessment. In this case, the outcome assessor would not likely be blinded to the intervention or exposure status. A reviewer would note such a finding in the comments section of the assessment tool.

In assessing this criterion, the reviewers determined whether it was likely that the person(s) conducting the outcome assessment knew the exposure status of the study participants. If not, then blinding was adequate. An example of adequate blinding of the outcome assessors is to create a separate committee whose members were not involved in the care of the patient and had no information about the study participants' exposure status. Using a study protocol, committee members would review copies of participants' medical records, which would be stripped of any potential exposure information or personally identifiable information, for prespecified outcomes.

Question 9. Followup rate

Higher overall followup rates are always preferable to lower followup rates, although higher rates are expected in shorter studies and lower overall followup rates are often seen in longer studies. Usually an acceptable overall followup rate is considered 80 percent or more of participants whose interventions or exposures were measured at baseline. However, this is a general guideline.

In the analysis, investigators may have accounted for those lost to followup by imputing values of the outcome or by using other methods. For example, they may carry forward the baseline value or the last observed value of the outcome measure and use it as the imputed final outcome for research participants lost to followup.
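For illustration (not part of the NHLBI tool), a minimal last-observation-carried-forward sketch using pandas; the data frame and values are invented:

```python
# Hypothetical last-observation-carried-forward (LOCF) imputation: for each
# participant, a missing outcome at a later visit is filled in with the most
# recent observed value for that participant.
import numpy as np
import pandas as pd

long_data = pd.DataFrame({
    "participant_id": [1, 1, 1, 2, 2, 2],
    "visit":          [0, 1, 2, 0, 1, 2],
    "outcome":        [140.0, 132.0, np.nan, 150.0, np.nan, np.nan],
})

long_data = long_data.sort_values(["participant_id", "visit"])
long_data["outcome_locf"] = long_data.groupby("participant_id")["outcome"].ffill()
print(long_data)
```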

Question 10. Statistical analysis

Were formal statistical tests used to assess the significance of the changes in the outcome measures between the before and after time periods? The reported study results should present values for statistical tests, such as p values, to document the statistical significance (or lack thereof) for the changes in the outcome measures found in the study.
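For example, a pre-to-post change in a continuous outcome is commonly tested with a paired test. A minimal SciPy sketch (not part of the NHLBI tool) with invented measurements:

```python
# Hypothetical before-after analysis: paired t-test of systolic blood pressure
# measured in the same eight participants before and after an intervention.
from scipy import stats

before = [148, 152, 139, 160, 145, 158, 150, 143]
after  = [140, 147, 136, 151, 142, 150, 149, 138]

t_stat, p_value = stats.ttest_rel(before, after)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")

# If pre-to-post changes are clearly non-normal, a non-parametric alternative
# such as the Wilcoxon signed-rank test (stats.wilcoxon) may be preferable.
```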

Question 11. Multiple outcome measures

Were the outcome measures for each person measured more than once during the course of the before and after study periods? Multiple measurements with the same result increase confidence that the outcomes were accurately measured.

Question 12. Group-level interventions and individual-level outcome efforts

Group-level interventions are usually not relevant for clinical interventions such as bariatric surgery, in which the interventions are applied at the individual patient level. In those cases, the questions were coded as "NA" in the assessment tool.

General Guidance for Determining the Overall Quality Rating of Before-After Studies

The questions in the quality assessment tool were designed to help reviewers focus on the key concepts for evaluating the internal validity of a study. They are not intended to create a list from which to add up items to judge a study's quality.

Internal validity is the extent to which the outcome results reported in the study can truly be attributed to the intervention or exposure being evaluated, and not to biases, measurement errors, or other confounding factors that may result from flaws in the design or conduct of the study. In other words, what is the ability of the study to draw associative conclusions about the effects of the interventions or exposures on outcomes?

Critical appraisal of a study involves considering the potential for selection bias, information bias, measurement bias, or confounding (the mixture of exposures that one cannot tease out from each other). Examples of confounding include co-interventions, differences at baseline in patient characteristics, and other issues addressed in the questions above. High risk of bias translates to a rating of poor quality; low risk of bias translates to a rating of good quality. Again, the greater the risk of bias, the lower the quality rating of the study.

In addition, the more attention in the study design to issues that can help determine if there is a causal relationship between the exposure and outcome, the higher quality the study. These issues include exposures occurring prior to outcomes, evaluation of a dose-response gradient, accuracy of measurement of both exposure and outcome, and sufficient timeframe to see an effect.

Generally, when reviewers evaluate a study, they will not see a "fatal flaw," but instead will find some risk of bias. By focusing on the concepts underlying the questions in the quality assessment tool, reviewers should ask themselves about the potential for bias in the study they are critically appraising. For any box checked "no" reviewers should ask, "What is the potential risk of bias resulting from this flaw in study design or execution?" That is, does this factor lead to doubt about the results reported in the study or doubt about the ability of the study to accurately assess an association between the intervention or exposure and the outcome?

The best approach is to think about the questions in the assessment tool and how each one reveals something about the potential for bias in a study. Specific rules are not useful, as each study has specific nuances. In addition, being familiar with the key concepts will help reviewers be more comfortable with critical appraisal. Examples of studies rated good, fair, and poor are useful, but each study must be assessed on its own.

Quality Assessment Tool for Case Series Studies

Criteria: Yes / No / Other (CD, NR, NA)*
*CD = cannot determine; NR = not reported; NA = not applicable

1. Was the study question or objective clearly stated?       
2. Was the study population clearly and fully described, including a case definition?      
3. Were the cases consecutive?      
4. Were the subjects comparable?      
5. Was the intervention clearly described?      
6. Were the outcome measures clearly defined, valid, reliable, and implemented consistently across all study participants?      
7. Was the length of follow-up adequate?      
8. Were the statistical methods well-described?      
9. Were the results well-described?      


Protect us from poor-quality medical research

The list of the ESHRE Capri Workshop Group contributors is given in the  Appendix .


ESHRE Capri Workshop Group. Protect us from poor-quality medical research. Human Reproduction, Volume 33, Issue 5, May 2018, Pages 770–776, https://doi.org/10.1093/humrep/dey056


Much of the published medical research is apparently flawed, cannot be replicated and/or has limited or no utility. This article presents an overview of the current landscape of biomedical research, identifies problems associated with common study designs and considers potential solutions. Randomized clinical trials, observational studies, systematic reviews and meta-analyses are discussed in terms of their inherent limitations and potential ways of improving their conduct, analysis and reporting. The current emphasis on statistical significance needs to be replaced by sound design, transparency and willingness to share data with a clear commitment towards improving the quality and utility of clinical research.

Much of the published medical research is apparently flawed, cannot be replicated and/or has limited or no utility. Poor medical research has long been called a scandal (Altman, 1994). Even though there have been some improvements in many research practices over time, some of the new opportunities in medical research create increasingly complex challenges in avoiding and dealing with poor research. The curricula of most medical schools do not prioritize the conduct and interpretation of medical research. This creates a problem for future clinicians who wish to practice evidence based medicine, an issue that is compounded by the unreliability of much of the published clinical research. Doctors need methodological training in order to critically appraise the quality of available evidence instead of taking all published literature on trust (Ioannidis et al., 2017).

The present manuscript is based on an ESHRE Capri Workshop held 31 August–1 September 2017. The workshop and this resulting document tried to: first, define the main current problems underlying poor biomedical research, with emphasis on examples that are relevant for reproductive medicine in particular; second, analyze the main causes of the problems; and third, propose changes that would solve some of these problems. This has major implications not only for research, but also for the conduct of medicine and for medical outcomes that depend on research evidence.

We recognize upfront that perfectly reliable/credible and useful research is clearly an unattainable utopia. However, there are many ways in which the existing situation can be improved. In the following sections, we overview challenges in credibility and utility that affect medical research at large and then focus on specific challenges that are more specific for some key types of influential studies: clinical trials and clinical research; big data and large observational studies; and systematic reviews (SRs) and meta-analyses (MAs).

Most biomedical research studies are of poor quality

It has been estimated that 85% of all research funding is actually wasted, due to inappropriate research questions, faulty study design, flawed execution, irrelevant endpoints, poor reporting and/or non-publication ( MacLeod et al. , 2014 ; Moher et al. , 2016 ).

Yet, credibility of biomedical research is an essential pre-requisite for evidence based medical decision-making. Reliability and credibility refer to how likely the results of a study are to be true. Accuracy refers to the difference between the observed results and the ‘truth’. Reproducibility of methods implies that use of the same methods and tools on the same data and samples will generate the same results. Reproducibility of results denotes the ability to generate comparable results in a new study using methods which are similar to those in the original study. Finally, reproducibility of inferences indicates the ability to reach similar conclusions when different individuals read the same results ( Goodman et al. , 2016 ). Apart from these essential attributes, a highly desirable characteristic of pre-clinical and clinical research is utility, i.e. clinical usefulness.

The elusive P -value

Although reliability and utility are critical, most research studies primarily aim to obtain and present significant results. Significance itself can be conceptual, clinical and statistical, each carrying a very specific meaning. Statistical significance (typically expressed through P -values obtained from null hypothesis testing) is almost ubiquitous in the biomedical literature. An overwhelming majority of published papers claim to have found (statistically and/or conceptually) significant results. An empirical evaluation of all abstracts published in Medline (1990–2015) reporting P -values showed that 96% reported statistically significant results. An in depth analysis of close to 1 million full-text papers in the same time period identified a similarly high proportion with statistically significant P -values ( Chavalarias et al. , 2016 ). Simulation studies have shown that in the absence of a pre-specified protocol and analysis plan, analytical manipulation can produce almost any desired result as a spurious artefact ( Patel et al. , 2015 ). Multiple analyses of the same dataset can lead to results which demonstrate variations in both magnitude and direction of effect, occasionally leading to a Janus phenomenon where different analyses of the same data provide conflicting results to the same question ( Patel et al. , 2015 ).

While these problems are most prevalent in observational studies, even experimental research is not immune from them. Small and biased randomized trials can also produce unreliable results. Large treatment effects produced by trials with modest sample sizes and questionable quality often disappear when the same interventions are tested in large populations by well-conducted trials ( Pereira et al. , 2012 ). The literature is replete with ways of assessing quality and the risk of bias in clinical trials and other types of studies. Empirical studies have shown that deficiencies in study characteristics that reflect low quality, or high risk of bias, can lead on average to inflated treatment effects ( Savović et al. , 2012 ). However, as the effect of quality shows large between-trial and between-topic heterogeneity, the impact of poor design in a single study cannot be accurately assessed. A low quality study should lead to greater uncertainty, but this does not make it any easier to compute a correction factor to get a clean, ‘corrected’ result.

So far we have focused too much on P-values. The P-value suggests a black-and-white distinction that is elusive (Farland et al., 2016). Effect sizes and confidence intervals are to be preferred, interpreted in the context of clinically relevant questions, biological plausibility, and good study design and conduct. Interpretation of data should be performed in view of prior knowledge, and should preferably lead to the generation of a scientific theory. Our goal should be to perform relevant studies (for which collective equipoise is mandatory) that have adequate power (Braakhekke et al., 2017). Their findings should be placed in the context of broader research agendas, and the updated evidence should be used to inform clinical practice.
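As a simple illustration of reporting an effect size with its confidence interval rather than a bare P-value, consider this minimal sketch (the event counts are invented):

```python
# Hypothetical trial result reported as an effect size (risk difference) with a
# 95% confidence interval rather than a bare P-value.
import math

events_a, n_a = 60, 400    # e.g. live births in the intervention arm
events_b, n_b = 44, 400    # e.g. live births in the control arm

p_a, p_b = events_a / n_a, events_b / n_b
risk_diff = p_a - p_b
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
ci_low, ci_high = risk_diff - 1.96 * se, risk_diff + 1.96 * se

print(f"risk difference = {risk_diff:.3f} "
      f"(95% CI {ci_low:.3f} to {ci_high:.3f})")
```

In this invented example the interval spans both no effect and a clinically important benefit, so the result is inconclusive rather than "negative"; the interval conveys this, whereas a P-value above 0.05 alone would not.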

The research landscape changes

The landscape of clinical research is also being transformed by an increasing volume of studies from outside Europe and the USA. There is some evidence that published results from developing countries without an established tradition of clinical research tend to report larger estimates of benefits for medical interventions ( Panagiotou et al. , 2013 ), even in multi-centre randomized trials ( De Denus et al. , 2017 ).

Commercial sponsors may design research in ways to maximize the chances of success of a new discovery, especially where large markets are involved. In these circumstances, trials may not necessarily be of lower quality but the questions may be defined and the analyses pre-specified in such ways as to yield favourable conclusions. For example, 96.5% of non-inferiority trials in 2011 resulted in conclusions that favoured a new drug or intervention ( Flacco et al. , 2015 ).

The advent of big data (see below) allows for more ambitious analyses but most available data are of questionable quality and the chance of uncovering genuine effects tends to be relatively low mainly because of high risk of bias. Bias is separate from random error; while random error affects the precision of the signal and big data diminishes the random error, bias may create signals that do not exist or may inflate signals or cause signals in the entirely wrong direction. The availability of big data has been perceived as the dawn of a new paradigm which liberates researchers from some of the more stringent aspects of scientific rigour such as a clear hypothesis, pre-planned analysis, validation and replication, but this is wrong. Hype surrounding new technologies can sway the best academic institutions and innovative entrepreneurs, leading to false expectations about what the new tools and massive datasets can deliver ( Lipworth et al. , 2017 ).

Finally, utility is an attribute that seems to have been overlooked by much of medical research. It comprises the following key elements ( Ioannidis, 2016c ): having a real problem to fix, appropriate anchoring of the question within the context of prior evidence, substantial prospects of acquiring relevant new information from the new study (irrespective of the direction of its results), pragmatism, patient-centeredness (‘what the patient wants’), value for money, feasibility and transparency (including protection from bias). For a full discussion of these eight features of useful research, see a previous discussion publication ( Ioannidis, 2016c ). Most studies published even in the very best journals meet only a minority of these features ( Ioannidis, 2016c ).

Conflicts of interest

While recent years have seen major improvements in reporting of conflicts of interest, many continue to go unreported, and there is a growing realization that non-financial conflicts may have a bigger impact than previously imagined. High-level evidence synthesis (e.g. SRs and MAs) and guidelines may help streamline some of the uncertainty surrounding the available evidence and facilitate medical decision-making. However, these tools also have their weaknesses ( Clinical Practice Guidelines We Can Trust, 2011 ). As an example, a series of red flags has been proposed for guidelines ( Lenzer et al. , 2013 ), suggesting caution for those planning to use them in clinical practice. Some of these red flags are difficult to detect, e.g. when a committee for a guideline does not seem to have any major conflicts of interest among its members, but the selection of the members has been pre-emptively biased in favour of a particular recommendation, based on their known views on a subject.

There are many proposed solutions to improve research practices

While the challenges listed above are considerable, there is also a large body of research that has identified examples of good practice and highlighted ways of bypassing problems ( Ioannidis, 2014 ; Munafò et al. , 2017 ). Solutions need to be tailored to the type of study design and the questions being asked. For example, the following aspects can all be helpful for clinical trials: pre-registration of protocols with detailed description of outcomes, adoption of reporting standards, data sharing, multi-site trials with careful selection of sites, involvement of methodological experts, appropriate regulatory oversight and containment of conflicts of interest. There are still many unanswered questions about who needs to lead these positive changes in research practices; whether it is the responsibility of investigators, institutions, funders, journals, professional societies, industry or other stakeholders. There is continuing debate on how to protect the biomedical literature from preventable bias and error. However special considerations need to be made for specific influential types or research.

Clinical relevance of selected outcomes

Outcomes for effectiveness studies should be relevant. Efficacy and mechanistic studies can be used judiciously to inform the best conduct of effectiveness trials with relevant outcomes. Standardization of outcomes is useful for both effectiveness and efficacy studies. Many specialties are reaching consensus on the core outcomes that are worth prioritizing. For example, the CROWN initiative aims at developing core outcome measures in women's health (Core Outcomes in Women's Health (CROWN) Initiative, 2014).

Multiplicity issues in clinical research

If researchers perform many analyses, some will turn out to be statistically significant purely by chance, yielding false-positive results. Multiple testing might represent a particular problem in infertility treatment; due to the multistage nature of many treatments, many outcomes may be reported in a study.
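Where many outcomes are tested, adjustment for multiplicity is one safeguard. A minimal sketch using statsmodels, with invented P-values:

```python
# Hypothetical adjustment for multiple testing: raw P-values from five
# secondary outcomes of a fertility trial, corrected with the Holm and
# Benjamini-Hochberg procedures.
from statsmodels.stats.multitest import multipletests

raw_p = [0.004, 0.021, 0.048, 0.090, 0.230]

for method in ("holm", "fdr_bh"):
    reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    print(method, [round(float(p), 3) for p in adjusted_p], list(reject))
```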

Registration

Registration of clinical research has become more common, especially for clinical trials, but still many trials are not pre-registered. Ideally, before carrying out a clinical trial, its full study design, including all primary and secondary outcomes (e.g. number of oocytes obtained per woman randomized or cumulative live birth rate after three completed cycles of ART treatment), should be pre-specified and the trial registered in a WHO approved clinical trial registry, together with the latest approved version of the protocol ( COMPare, 2017 ). In the absence of registration (or with incomplete details about registration), it is not possible to tell which goals, objectives, design aspects or analyses were pre-specified and which were post-hoc explorations.

Reporting of pre-specified outcomes

Once the trial is finished, the trial report should present all pre-specified outcomes. When reported outcomes differ from those pre-specified, this must be declared in the report, along with an appropriate explanation ( COMPare, 2017 ). Changing endpoints of a study after the analysis of the data has occurred may denote scientific misconduct, especially if the change is instigated by the lack of significance in the primary outcome, but not in some arbitrary subordinate outcomes ( COMPare, 2017 ). This is popularly known as P -hacking, data dredging, cherry picking, snooping, significance chasing or the Texas sharpshooter fallacy ( Evers, 2017 ).

Reporting guidance exists for randomized trials (CONSORT), as well as for other types of clinical research, e.g. STARD (for diagnostic test studies), PRISMA (for meta-analyses) and IMPRINT (specifically for fertility trials). These guidance documents aim to improve the quality and completeness of clinical research reports (Glasziou et al., 2014). It is very disturbing that comparisons of protocols with publications in major medical journals revealed that most studies had at least one primary endpoint changed, introduced or omitted (Chan et al., 2004, 2014; Glasziou et al., 2014).

Power considerations in clinical trials

Lack of sufficient power is a major problem across various types of studies, including randomized trials in diverse disciplines, and reproductive medicine is no exception. Differences in live birth rates of 3–5% may still be clinically relevant to detect, but hardly any trials in the field have a sufficient sample size for this. Therefore, one should be careful to consider confidence intervals when interpreting study results. Some trials, where it is concluded that the intervention had no effect, may in fact offer no conclusive information about whether the treatment is effective or not. Moreover, small trials are more likely to generate exaggerated effects and even false-positive spurious effects.
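To illustrate the scale of the problem, a minimal sketch of the sample size needed to detect a 4-percentage-point difference in live birth rates (25% vs. 29%) with 80% power; the numbers are invented for illustration:

```python
# Hypothetical illustration: sample size per arm needed to detect a difference
# in live birth rates of 25% vs. 29% with 80% power and two-sided alpha 0.05.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect_size = proportion_effectsize(0.29, 0.25)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Roughly {n_per_arm:.0f} women per arm")   # on the order of 2,000 per arm
```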

Database linkage: maximum temptation meets maximum opportunity

Sources of health care data include governments, health care providers, insurers and registers of births and deaths as well as registries of specific conditions, treatments and medical devices. Increasingly, data are available in electronic formats and can be linked with other health, social, geographical and education data to create massive datasets incorporating complex longitudinal records with large-scale population coverage and long-term follow-up. Medical records can provide demographic information, lifestyle choices, clinical findings, laboratory and imaging results, treatment details and outcomes. The ability to link sociodemographic and clinical details with genomic, proteomic and metabolomic data could potentially allow physicians to deliver precision medicine ( Peek et al. , 2014 ) for individual patients. Routinely collected health data can also allow a real-world evaluation of treatment outcomes. While opportunities seem to abound in theory, there are many serious limitations to big data and large observational datasets. Here, we discuss some of the key ones.

Problems with information

The event-based nature of routinely collected health data is a potential limitation, as important problems or treatments not resulting in hospital contacts may be missing. Inaccuracies in the data can occur due to mistakes in data entry and lack of appropriate checks. Routine data are also likely to contain a minimum set of variables, and many key confounders such as body weight, height, smoking status, alcohol intake and socio-economic status may be missing. Many historical datasets lack a planned schema, which can create problems during analysis (Jorm, 2015), although others have detailed metadata (Ayorinde et al., 2016). Finally, data are often missing in a non-random fashion, thus introducing the possibility of bias. While some ways of dealing with missing data (Jagsi et al., 2014) are better than others, it may be difficult to address missing data with high confidence.

Ethical challenges

Major concerns arise in any discussion around the use of routinely collected data to answer questions for which the data were not originally collected. These concerns involve lack of informed consent; possible identification of subjects during linkage procedures (even after anonymization); the dilemma of dealing with detected individual risks in an anonymised (rather than anonymous) population, who could potentially be identified and informed; and the identifiability of individuals in very small groups with unusual conditions. Instead of widely open use of big data, it may be necessary to employ data safe havens, where access is limited to trained staff and safe release of data occurs only after rigorous checks that minimize the risk of identification (Lea et al., 2016).

Difficulties in linkage

Linkage presents a common technical challenge and could introduce significant error if done incorrectly. The most accurate approach is the deterministic method, which uses a unique identifier such as the personal identity number in the Nordic countries or the community health index (CHI) number in Scotland (Ayorinde et al., 2016). Where this is not feasible, probabilistic methods based on characteristics such as name, date of birth and geographical location have been used, but this approach can result in errors.

Dealing with confounding in large observational datasets

All large observational datasets are prone to confounding that can cause spurious associations. For example, in the context of fertility data, age is a common confounder that influences the choice of treatment as well as its outcome. The choice of therapy is often based on preference, predicted response or other non-random selection features, which can affect outcomes (Jagsi et al., 2014). For example, as women with more severe endometriosis may be more likely to receive surgery than women with less severe disease, the outcome of surgical treatment may appear to be worse than that of medical alternatives. Methods such as propensity score matching, propensity score stratification and inverse probability of treatment weighting have been used to correct for this. Other methods have been used to counteract the effects of hidden bias, such as instrumental variable analysis, which uses counterfactuals to try to approximate a randomized design. Although some reviews (Anglemyer et al., 2014) suggest that there is limited evidence for significant differences in health care outcomes between observational studies and randomized trials, it is clear from other studies that further refinements in analysis are needed in order to achieve the same degree of accuracy (McGale et al., 2016). Empirical evaluations suggest that routinely collected data are not yet used to their maximal potential utility (Hemkens et al., 2016a) and that they tend to generate inflated treatment effects even when sophisticated propensity score methods are used (Hemkens et al., 2016b).
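As a sketch of one such approach, inverse probability of treatment weighting with a propensity score estimated by logistic regression; the data file, variable names and coding below are hypothetical:

```python
# Hypothetical inverse-probability-of-treatment-weighting (IPTW) analysis:
# estimate each woman's propensity to receive surgery from measured
# confounders, then compare outcomes using weights that balance those
# confounders between the surgical and medical groups.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("endometriosis_cohort.csv")        # assumed: one row per woman
confounders = df[["age", "bmi", "disease_stage"]]   # illustrative covariates
treated = df["surgery"].to_numpy()                  # 1 = surgery, 0 = medical therapy
outcome = df["pregnancy"].to_numpy()                # 1 = pregnancy achieved

ps = LogisticRegression(max_iter=1000).fit(confounders, treated).predict_proba(confounders)[:, 1]
weights = np.where(treated == 1, 1.0 / ps, 1.0 / (1.0 - ps))

rate_surgery = np.average(outcome[treated == 1], weights=weights[treated == 1])
rate_medical = np.average(outcome[treated == 0], weights=weights[treated == 0])
print("IPTW-adjusted difference in pregnancy rate:", rate_surgery - rate_medical)
```

Note that weighting of this kind can only balance the confounders that were actually measured; hidden bias requires other strategies, such as the instrumental variable analyses mentioned above.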

Overpowered big data

Studies based on large datasets can have sample sizes that are so large that they detect very small and clinically unimportant effect sizes. Such studies should be interpreted appropriately according to their clinical significance. Highly statistically significant results may actually be attributable to bias ( Peek et al. , 2014 ). With small effects, bias or confounding cannot be excluded. Interpretation must therefore be cautious, despite the statistical significance.

Personalized medicine prospects

Advances in computational infrastructure for dealing with big datasets, and the related explosion in data science methodology, have led to speculation that the future of life sciences is likely to be dominated by systems which can ingest and sift through large volumes of -omics data to generate reliable information for individualized decision making (e.g. personalized or precision medicine). However, these expectations have yet to be fully realized. A naïve expectation of accurate predictions from inherently flawed and incomplete data could turn out to be no more than blind faith in fool's gold (Khoury and Ioannidis, 2014; Lipworth et al., 2017). Personalized medicine is an interesting concept, but it meets with many conceptual (Senn, 2016) and practical difficulties in making it work.

A prolific industry of meta-analyses

Most hierarchies of evidence place well-conducted SRs and MAs at the top of the evidence pyramid and these publications have grown in volume as well as influence. As of mid-2017, nearly 100 000 published meta-analysis articles were indexed in PubMed with over 1000 new ones indexed every month ( Ioannidis, 2016a ). There are also ~250 000 published SRs in PubMed, with another 2500 new ones indexed every month. In many fields there are more SRs than primary studies ( Prior et al. , 2017 ) and, in many situations, SRs have replaced experience and clinical acumen in terms of driving clinical decision-making. This has not gone unnoticed by individuals and groups with vested interests (financial or non-financial) who have used them as tools to influence practice in favour of their preferred drugs and interventions ( Ioannidis, 2016b ).

Most SRs and MAs are not very useful and many are not useful at all

A common conclusion of many SRs, particularly those that address questions on effective treatment is that primary evidence is lacking, suboptimal or unreliable. This statement alone has some utility, because it can still help calibrate the level of uncertainty in decision-making and may suggest avenues of new research. However, very often the primary data feeding into SRs and MAs are so unreliable that these may have a more important role in detecting bias rather than uncovering the truth. SRs and MAs may also help identify gaps in the use of patient-relevant outcomes where multiple studies exist but outcomes that matter are not addressed.

The global profile of SRs and MAs

The profile of SRs and MAs has changed over the last decade, with increasing numbers of MAs now being generated in China. Most of these MAs are unreliable or misleading (especially the bulk-produced meta-analyses of candidate gene associations). Moreover, there is a new large portfolio of MAs conducted by contractor companies that are commissioned and paid by industry ( Schuit and Ioannidis, 2016 ). Only a small proportion of these MAs are published and publication bias may be related to the results of the MAs and the interests of the sponsor. An online search suggested that over 100 service-offering companies perform SRs and MAs ( Schuit and Ioannidis, 2016 ).

Redundancy in SRs and MAs

A recent evaluation suggested that only ~3% of current MAs are both methodologically sound and clinically useful ( Ioannidis, 2016a ). There is a lot of redundancy and large numbers of SRs and MAs continue to be conducted on some topics without clear evidence for the additional value of the newer publications, e.g. in the area of urinary derived versus recombinant FSH treatment ( Van Wely et al. , 2011 ).

More sophisticated MA designs

Even for more sophisticated forms of evidence synthesis such as network MAs, an empirical evaluation identified 28 publications on the same topic, each including part of the available evidence with inconsistent conclusions ( Naudet et al. , 2017 ). Registration of MAs at the protocol stage, e.g. in registries like PROSPERO, may be helpful, but it is unclear whether this alone can create a more efficient, transparent and, ultimately, a more accurate compilation of all the available facts ( Moher et al. , 2014 ; Tricco et al. , 2016 ).

An increasing number of MAs have been able to use individual participant data. These require more resources to perform compared with MAs of aggregated data, but they have a number of advantages: the data can be cleaned, definitions, outcomes and co-variates can be standardized across studies, and subgroup differences can be explored in a more reliable fashion (Simmonds et al., 2005). Apart from higher costs, their disadvantages include incomplete retrieval of data, which can lead to bias if some trials with specific directions of effect are missed. As results from randomized trials and other types of studies become more readily available, it may be easier to perform comprehensive MAs using individual-level data in the future. Using advanced meta-analysis methods requires statistical and methodological competence that the reviewers undertaking such analyses often lack, sometimes using software that they do not fully understand.

Systematic reviews and meta-analyses in the future

Despite their limitations, SRs and MAs will continue to be indispensable for summarizing the evidence and understanding its biases, strengths and weaknesses. Hopefully, more MAs will use optimal methods for systematically searching for, retrieving, analysing and reporting data. It is also likely that more MAs will use networks, individual-level data or both, allowing for more informative analyses and data syntheses. Eventually, MAs may be planned as prospective exercises, i.e. designed contemporaneously with the primary evaluative studies, with a clear a priori plan for combining results from the primary studies on completion (Ioannidis, 2017). This approach may help to minimize some of the biases that exist in retrospective data synthesis.

Given the challenges described above, it is probably not surprising that most medical research shows poor reproducibility of methods and results. Some of the problems are increasingly recognized by the scientific community: a 2016 Nature survey showed that more than two-thirds of scientists believed that there is a reproducibility problem (Baker, 2017). Replicability is a benchmark of scientific quality; authors should always try to replicate their own results and provide sufficiently detailed instructions for others to do so. While research fraud is uncommon, the temptation to cut corners prompts many authors to indulge in poor scientific practices (Tanksalva, 2017). The 'publish or perish' attitude favours hasty, low-quality, incomplete research aimed at maximizing the number of papers from a single research project (sometimes known as salami slicing). Researchers should confine their claims to their findings and resist the temptation to sensationalize their results. Incentive structures for rewarding research, e.g. publication, funding, promotion and tenure, need to pay more attention to the quality and reproducibility of the work produced.

Investigators can learn from studies that cannot be replicated. Adoption of reporting standards will help, as will multi-site trials, involvement of methodological experts, appropriate regulatory oversight and transparency about conflicts of interest. As gatekeepers, journals can offer high-quality peer review, which should include proper statistical and methodological review where appropriate. Prospective trial registration is not enough: full protocols should also be published, and data should be shared.

Finally, many changes will require an emphasis on education, including training researchers in methodological competence and training at medical schools (where physicians should be sensitized to the strengths and weaknesses of the evidence that affects their practice) or, more realistically, leaving the research to those who train in it for years and never lose sight of how difficult it is.

The secretarial assistance of Mrs Simonetta Vassallo is gratefully acknowledged.

Lecturers and discussants (Chairs of the scientific sessions and past-Chairs of ESHRE) contributed to the preparation of the final article.

The meeting was organized by the European Society of Human Reproduction and Embryology with an unrestricted educational grant from Institut Biochimique S.A. (Switzerland).

None declared.

Altman DG . The scandal of poor medical research . Br Med J 1994 ; 308 : 283 .


Anglemyer A , Horvath HT , Bero L . Healthcare outcomes assessed with observational study designs compared with those assessed in randomized trials . Cochrane Database Syst Rev 2014 ; 4 : MR000034 .

Ayorinde AA , Wilde K , Lemon J , Campbell D , Bhattacharya S . Data resource profile: the Aberdeen Maternity and Neonatal Databank (AMND) . Int J Epidemiol 2016 ; 45 : 389 – 394 .

Baker M . www.nature.com (21 August 2017, date last accessed).

Braakhekke M , Mol F , Mastenbroek S , Mol BW , van der Veen F . Equipoise and the RCT . Hum Reprod 2017 ; 32 : 257 – 260 .

Chan AW , Hróbjartsson A , Haahr MT , Gøtzsche PC , Altman DG . Empirical evidence for selective reporting of outcomes in randomized trials: comparison of protocols to published articles . J Am Med Assoc 2004 ; 291 : 2457 – 2465 .

Chan AW , Krleza-Jerić K , Schmid I , Altman DG . Outcome reporting bias in randomized trials funded by the Canadian Institutes of Health Research . CMAJ 2004 ; 171 : 735 – 740 .

Chavalarias D , Wallach J , Li A , Ioannidis JPA . Evolution of reporting of p-values in the biomedical literature, 1990–2015 . J Am Med Assoc 2016 ; 315 : 1141 – 1148 .

Graham R , Mancher M , Miller Wolman D , Greenfield S , Steinberg E (eds) . Clinical Practice Guidelines We Can Trust . Institute of Medicine (US) Committee on Standards for Developing Trustworthy Clinical Practice Guidelines . Washington, DC : National Academies Press (US) , 2011 .


COMPare Trials Project . Goldacre B, Drysdale H, Powell-Smith A, Dale A, Milosevic I, Slade E, Hartley P, Marston C, Mahtani K, Heneghan C. www.COMPare-trials.org . (23 May 2017 , date last accessed).

Core Outcomes in Women’s Health (CROWN) Initiative . The CROWN Initiative: journal editors invite researchers to develop core outcomes in women’s health . Hum Reprod 2014 ; 29 : 1349 – 1350 .

De Denus S , O’Meara E , Desai AS , Claggett B , Lewis EF , Leclair G , Jutras M , Lavoie J , Solomon SD , Pitt B et al.  . Spironolactone metabolites in TOPCAT—new insights into regional variation . N Engl J Med 2017 ; 376 : 1690 – 1692 .

Evers JL . The Texas sharpshooter fallacy . Hum Reprod 2017 ; 32 : 1363 .

Farland LV , Correia KF , Wise LA , Williams PL , Ginsburg ES , Missmer SA . P-values and reproductive health: what can clinical researchers learn from the American Statistical Association? Hum Reprod 2016 ; 31 : 2406 – 2410 .

Flacco ME , Manzoli L , Boccia S , Capasso L , Aleksovska K , Rosso A , Scaioli G , De Vito C , Siliquini R , Villari P et al.  . Head-to-head randomized trials are mostly industry-sponsored and almost always favour the industry sponsor . J Clin Epidemiol 2015 ; 68 : 811 – 820 .

Glasziou P , Altman DG , Bossuyt P , Boutron I , Clarke M , Julious S , Michie S , Moher D , Wager E . Reducing waste from incomplete or unusable reports of biomedical research . Lancet 2014 ; 383 : 267 – 276 .

Goodman SN , Fanelli D , Ioannidis JP . What does research reproducibility mean? Sci Transl Med 2016 ; 8 : 341ps12 .

Hemkens LG , Contopoulos-Ioannidis DG , Ioannidis JP . Current use of routinely collected health data to complement randomized controlled trials: a meta-epidemiological survey . CMAJ Open 2016 a; 4 : E132 – E140 .

Hemkens LG , Contopoulos-Ioannidis DG , Ioannidis JP . Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey . Br Med J 2016 b; 352 : i493 .

Ioannidis JP . How to make more published research true . PLoS Med 2014 ; 11 : e1001747 .

Ioannidis JP . The mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses . Milbank Q 2016 a; 94 : 485 – 514 .

Ioannidis JP . Evidence-based medicine has been hijacked: a report to David Sackett . J Clin Epidemiol 2016 b; 73 : 82 – 86 .

Ioannidis JP . Why most clinical research is not useful . PLoS Med 2016 c; 13 : e1002049 .

Ioannidis JP . Meta-analyses can be credible and useful: a new standard . JAMA Psychiatry 2017 ; 74 : 311 – 312 .

Ioannidis JPA , Stuart ME , Brownlee S , Strite SA . How to survive the medical misinformation mess . Eur J Clin Invest 2017 ; 47 : 795 – 802 .

Jagsi R , Bekelman JE , Chen A , Chen RC , Hoffman K , Shih YC , Smith BD , Yu JB . Considerations for observational research using large data sets in radiation oncology . Int J Radiat Oncol Biol Phys 2014 ; 90 : 11 – 24 .

Jorm L . Routinely collected data as a strategic resource for research: priorities for methods and workforce . Public Health Res Pract 2015 ; 25 : e2541540 .

Khoury MJ , Ioannidis JP . Medicine. Big data meets public health . Science 2014 ; 346 : 1054 – 1055 .

Lea NC , Nicholls J , Dobbs C , Sethi N , Cunningham J , Ainsworth J , Heaven M , Peacock T , Peacock A , Jones K et al.  . Data safe havens and trust: toward a common understanding of trusted research platforms for governing secure and ethical health research . JMIR Med Inform 2016 ; 4 : e22 .

Lenzer J , Hoffman JR , Furberg CD , Ioannidis JP , Guideline Panel Review working group . Ensuring the integrity of clinical practice guidelines: a tool for protecting patients . Br Med J 2013 ; 347 : f5535 .

Lipworth W , Mason PH , Kerridge I , Ioannidis JP . Ethics and epistemology in big data research . J Bioeth Inq 2017 . doi:10.1007/s11673-017-9771-3 . [Epub ahead of print].

MacLeod MR , Michie S , Roberts I , Dirnagl U , Chalmers I , Ioannidis JP , Al-Shahi Salman R , Chan AW , Glasziou P . Biomedical research: increasing value, reducing waste . Lancet 2014 ; 383 : 101 – 104 .

McGale P , Cutter D , Darby SC , Henson KE , Jagsi R , Taylor CW . Can observational data replace randomized trials? J Clin Oncol 2016 ; 34 : 3355 – 3357 .

Moher D , Booth A , Stewart L . How to reduce unnecessary duplication: use PROSPERO . BJOG 2014 ; 121 : 784 – 786 .

Moher D , Glasziou P , Chalmers I , Nasser M , Bossuyt PM , Korevaar DA , Graham ID , Ravaud P , Boutron I . Increasing value and reducing waste in biomedical research: who’s listening? Lancet 2016 ; 387 : 1573 – 1586 .

Munafò MR , Bishop DV , Button KS , Chambers C , Nosek B , Percie du Sert N , Simonsohn U , Wagenmakers E-J , Ware JJ , Ioannidis JPA . A manifesto for reproducible science . Nature Human Behaviour 2017 ; 1 : 0021 .

Naudet F , Schuit E , Ioannidis JPA . Overlapping network meta-analyses on the same topic: survey of published studies . Int J Epidemiol 2017 . doi:10.1093/ije/dyx138 . [Epub ahead of print].

Panagiotou OA , Contopoulos-Ioannidis DG , Ioannidis JP . Comparative effect sizes in randomised trials from less developed and more developed countries: meta-epidemiological assessment . Br Med J 2013 ; 346 : f707 .

Patel C , Burford B , Ioannidis JP . Assessment of vibration of effects due to model specification can demonstrate the instability of observational associations . J Clin Epidemiol 2015 ; 68 : 1046 – 1058 .

Peek N , Holmes JH , Sun J . Technical challenges for big data in biomedicine and health: data sources, infrastructure, and analytics . Yearb Med Inform 2014 ; 9 : 42 – 47 .

Pereira TV , Horwitz RI , Ioannidis JP . Empirical evaluation of very large treatment effects of medical interventions . J Am Med Assoc 2012 ; 308 : 1676 – 1684 .

Prior M , Hibberd R , Asemota N , Thornton JG . Inadvertent P-hacking among trials and systematic reviews of the effect of progestogens in pregnancy? A systematic review and meta-analysis . BJOG 2017 ; 124 : 1008 – 1015 .

Savović J , Jones HE , Altman DG , Harris RS , Jüni P , Pildal J , Als-Nielsen B , Balk EM , Gluud C , Gluud LL et al.  . Influence of reported study design characteristics on intervention effect estimates from randomized controlled trials: combined analysis of meta-epidemiologic studies . Ann Intern Med 2012 ; 157 : 429 – 438 .

Schuit E , Ioannidis JPA . Network meta-analyses performed by contracting companies and commissioned by industry . Sys Rev 2016 ; 5 : 198 .

Senn S . Mastering variation: variance components and personalised medicine . Stat Med 2016 ; 35 : 966 – 977 .

Simmonds MC , Higgins JP , Stewart LA , Tierney JF , Clarke MJ , Thompson SG . Meta-analysis of individual patient data from randomized trials: a review of methods used in practice . Clin Trials 2005 ; 2 : 209 – 217 .

Tanksalva S . www.clarivate.com/blog (21 August 2017, date last accessed).

Tricco AC , Cogo E , Page MJ , Polisena J , Booth A , Dwan K , MacDonald H , Clifford TJ , Stewart LA , Straus SE et al.  . A third of systematic reviews changed or did not specify the primary outcome: a PROSPERO register study . J Clin Epidemiol 2016 ; 79 : 46 – 54 .

Van Wely M , Kwan I , Burt AL , Thomas J , Vail A , Van der Veen F , Al-Inany HG . Recombinant versus urinary gonadotrophin for ovarian stimulation in assisted reproductive technology cycles . Cochrane Database Syst Rev 2011 ; 2 : CD005354 . doi:10.1002/14651858.CD005354.pub2 .

List of the ESHRE Capri Workshop Group contributors:

The lecturers included John P.A. Ioannidis (2 lectures) (Departments of Medicine, of Health Research and Policy, of Biomedical Data Science, and of Statistics, and Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, USA), Siladitya Bhattacharya (Professor of Reproductive Medicine, Head of Division of Applied Health Sciences and Director Institute of Applied Health Sciences, School of Medicine and Dentistry, University of Aberdeen, Aberdeen Maternity Hospital, Foresterhill, Aberdeen, UK), J.L.H. Evers (Dept. Obstet. Gynecol., Maastricht University Medical Centre, Maastricht, The Netherlands), Fulco van der Veen (Academic Medical Centre, University of Amsterdam, Reproduction Medicine, Amsterdam, The Netherlands), Edgardo Somigliana (Clinica Ostetrica e Ginecologica, IRCCS Ca’ Granda Foundation, Maggiore Policlinico Hospital, Milano, Italy), Christopher L.R. Barratt (Division of Molecular & Clinical Medicine, School of Medicine, University of Dundee, Ninewells Hospital and Medical School, Dundee, UK), Gianluca Bontempi (co-Head of the Machine Learning Group, Département d’Informatique, Université Libre de Bruxelles, Bruxelles, Belgium). The discussants included: David T. Baird (Centre for Reproductive Biology, University of Edinburgh, UK), PierGiorgio Crosignani (IRCCS Ca’ Granda Foundation, Maggiore Policlinico Hospital, Milano, Italy), Paul Devroey (AZ-VUB, Centre for Reproductive Medicine, Brussels, Belgium), Klaus Diedrich (Klin. Frauenheilkunde und Geburtshilfe, Univ. zu Lubeck, Lubeck, Germany), Roy G. Farquharson (Liverpool Women’s Hospital, Department of OB/GYN, Liverpool, UK), Lynn R. Fraser (Centre for Reproduction, Endocrinology & Diabetes, School of Biomedical Sciences, New Hunt’s House, Kings College London, Guy’s Campus, London, UK), Joep P.M. Geraedts (AZ Maastricht, Klinische Genetica, Maastricht, The Netherlands), Luca Gianaroli (S.I.S.M.E.R., Bologna, Italy), Carlo La Vecchia (Department of Clinical Sciences and Community Health, Università degli Studi di Milano, Milan, Italy), Kersti Lundin (Reproductive Medicine, Sahlgrenska University Hospital, Gothenburg, Sweden), Cristina Magli (S.I.S.M.E.R., Bologna, Italy), Eva Negri (Department of Biomedical and Clinical Sciences, Università degli Studi di Milano, Milano, Italy), Arne Sunde (University Hospital, Dept. Obstet. Gynecol., Trondheim, Norway), Juha S. Tapanainen (University of Helsinki, Department of Obstetrics and Gynecology, Helsinki University Hospital, University of Helsinki, Helsinki, Finland), Basil C. Tarlatzis (Infertility & IVF Center, Geniki Kliniki, Thessaloniki, Greece), Andre Van Steirteghem (Centre for Reproductive Medicine, Universitair Ziekenhuis Vrije Universiteit Brussel, Belgium), Anna Veiga (Reproductive Medicine Service, Dexeus Women’s Health, Barcelona, Spain).

Author notes

The list of the ESHRE Capri Workshop Group contributors is given in the   Appendix .



A Modified Medical Education Research Study Quality Instrument (MMERSQI) developed by Delphi consensus

Mansour Al Asmri

1 Clinical Skills Training Centre, King Fahad Specialist Hospital, Dammam, Saudi Arabia

M. Sayeed Haque

2 Institute of Applied Health Research, University of Birmingham, Birmingham, B15 2TT UK

3 Institute of Clinical Sciences, College of Medical and Dental Sciences, University of Birmingham, Birmingham, B15 2TT UK

Associated Data

The datasets used and/or analysed during the current study are available in an anonymised format from the corresponding author on reasonable request.

The Medical Education Research Study Quality Instrument (MERSQI) is widely used to appraise the methodological quality of medical education studies. However, the MERSQI lacks some criteria which could facilitate better quality assessment. The objective of this study was to achieve consensus among experts on (1) the MERSQI scoring system and the relative importance of each domain and (2) modifications of the MERSQI.

A modified Delphi technique was used to achieve consensus among experts in the field of medical education. The initial item pool contained all items from the MERSQI and items added in our previous published work. Each Delphi round comprised a questionnaire and, after the first iteration, an analysis and feedback report. We modified the quality instrument's domains, items and sub-items and re-scored items/domains based on the Delphi panel feedback.

A total of 12 experts agreed to participate and were sent the first- and second-round questionnaires. In the first round, 12 questionnaires were returned, of which 11 contained analysable responses; in the second round, 10 analysable responses were returned. We started with seven domains and an initial item pool of 12 items and 38 sub-items. No change in the number of domains or items resulted from the Delphi process; however, the number of sub-items increased from 38 to 43 across the two Delphi rounds. In Delphi-2, eight respondents gave 'study design' the highest weighting, while 'setting' was given the lowest weighting by all respondents. There was no change in the domains' average weighting scores or ranks between rounds.

Conclusions

The final criteria list and the new domain weighting score of the Modified MERSQI (MMERSQI) was satisfactory to all respondents. We suggest that the MMERSQI, in building on the success of the MERSQI, may help further establish a reference standard of quality measures for many medical education studies.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12909-023-04033-6.

The Medical Education Research Study Quality Instrument (MERSQI) was introduced in 2007 to appraise the methodological quality of studies of medical education [1]. The MERSQI evaluates the quality of the research itself rather than the quality of the reporting, and the authors [1] excluded elements such as "importance of research questions" and "quality of conceptual frameworks". The MERSQI has been validated, gained acceptance and been widely used [2]. It contains ten items reflecting six domains: study design, sampling, type of data, validity of evaluation instrument, data analysis, and outcomes. Each domain has the same maximum score of three, giving a maximum total score of 18. Previous research has established validity evidence for the MERSQI, including reliability and internal consistency, as well as relationships to other variables such as likelihood of publication, citation rate, and study funding [1-3]. Cook and Reed [4] compared the MERSQI with the Newcastle–Ottawa Scale-Education method of evaluation and reported that the MERSQI is a reliable tool for appraising the methodological quality of medical education research; however, it "lacks items on blinding and comparability of cohorts". The limitations of the MERSQI presented in our report have not previously been discussed in the literature.

We argue that the existing instrument would be improved by adding or modifying criteria to facilitate better quality assessment. We suggest that: (i) the risk of bias of randomised controlled trials should be considered [5]; (ii) participant characteristics should be included [6] (particularly in some domains, such as teaching intimate examination skills); and (iii) the robustness of the objective data measurement required to discriminate learners' level of mastery should be assessed, as per Miller's pyramid [7]. Learning a skill goes through three stages [8]: cognitive (understanding), associative (practice), and autonomous (automatic). Thus a learner could, for example, form a cognitive picture of a skill but lack the fundamentals and mechanics required to perform it. The cognitive framework is clearly a prerequisite for practice. Similarly, to use Miller's framework, learners progress from 'knows' to 'knows how' to 'shows how' to 'does' (by which Miller means performing in the real clinic as a practising clinician). In assessing the acquisition of skills, therefore, we argue that it is consistent with Miller's pyramid to weight performance (in our context, high-fidelity simulation, which is the closest to actual performance in almost all of the reported studies) above testing 'on paper', which can only assess the cognitive imagining of a skill, not its performance as such. Furthermore, we argue that the impact of each of the six domains on the quality of a study is not equal (indeed, this is clear a priori) and that each domain should therefore be weighted according to its impact on study quality; see, for example, Timmer et al. [9], who gave study design the highest score in the development and evaluation of a quality score for abstracts.

The purpose of this paper, therefore, is to report our modification of the MERSQI using the modified Delphi method [10]. We aimed to achieve consensus among experts on (1) modifications of the MERSQI domains, items or sub-items and (2) the MERSQI scoring system and the weighting of each domain.

Research team

The research team consists of all the authors. The researchers have different backgrounds: clinical, academic, statistical, and simulation education.

Selection of items

We included the initial pool of MERSQI items and added new items (Table 1), which we had developed in our previous work [11] to improve the granularity of the MERSQI. Based on this first modified MERSQI list, we created a Delphi questionnaire of 12 items under seven domains, i.e. the original six domains plus a 'settings' domain. We used the Delphi method because it is implicitly based on both empirical evidence and personal opinion and allows conflicting scientific evidence to be addressed using the quasi-anonymity of experts' opinions [12-14]. We used the modified Delphi (i.e. utilising our previous work) because this method increases the response rate of the initial round [10]. Delphi rounds continue until sufficient consensus is reached (consensus is defined as general agreement of a substantial majority [12]; please see the procedure in the Methods section for more details). Expert panel members were given the opportunity, in each round, to add items, to suggest rewording of items, to score items, and to weight the seven domains; see, for example, Timmer et al. [9] (Additional file 1).

Table 1. Original MERSQI and the first modification of the MERSQI, showing the new items used in Delphi-1. Columns: A. original MERSQI item; domain; B. modified MERSQI item (the new and sub-divided response options are listed in full, with their final scores, in Table 2).

Study design. A: 1. Study design (single group cross-sectional or single group post-test only; single group pre-test & post-test; nonrandomized, 2 groups; randomized controlled trial). B: 1. Study design (a. single group cross-sectional or single group post-test only; b. single group pre-test & post-test; c. nonrandomised, 2 groups; d.-f. randomised controlled trial, sub-divided by risk of bias, see footnote).

Sampling. A: 2. Institutions studied (1; 2; >2); 3. Response rate, % (not applicable; <50 or not reported; 50-74; >75). B: additional sampling items (see Table 2); 3. Response rate, % (Pls. select one) (a. not applicable; b. <50 or not reported; c. 50-74; d. >75); 4. Institutions studied (a. single centre; b. multi centre).

Type of data. A: 4. Type of data (assessment by participants; objective measurement). B: 5. Type of data (assessment by participants; objective measurement, sub-divided into a.-c., see Table 2).

Validity of evaluation instrument. A: 5. Internal structure; 6. Content; 7. Relationships to other variables (each: not applicable; not reported; reported). B: 6. Internal structure; 7. Content; 8. Relationships to other variables (each: a. not applicable; b. not reported; c. reported).

Data analysis. A: 8. Appropriateness of analysis (inappropriate for study design or type of data; appropriate for study design, type of data); 9. Complexity of analysis (descriptive analysis only; beyond descriptive analysis). B: 9. Appropriateness of analysis (a. inappropriate for study design or type of data; b. appropriate for study design, type of data); 10. Complexity of analysis (a. descriptive analysis only; b.-c. sub-divided, see Table 2).

Outcomes. A: 10. Outcomes (satisfaction, attitudes, perceptions, opinions, general facts; knowledge, skills; behaviours; patient/health care outcome). B: 11. Outcomes (satisfaction, attitudes, perceptions, opinions, general facts; knowledge, skills measured by a.-b., see Table 2; behaviours; patient/health care outcome).

Footnote: risk of bias judgment based on sequence generation, blinding & allocation concealment. For more details, please see Additional file 2, Cochrane Risk of Bias Tool for Randomized Controlled Trials.

Selection of expert panel

Potential panel members were identified based on our knowledge of their fields of interest and published work in medical education research. We identified 22 potential respondents, who were approached by email; seven did not respond and three declined. All twelve respondents were experts in medical education: one Clinical Outcomes Assessment Consultant, one Associate Professor of Education, one Professor of Health Sciences and Medical Education, and one Professor of Clinical Communication, as well as eight medical academics: two Professors of Medical Education, two Associate Professors of Medical Education, one Professor of General Practice, one Professor of Simulation Education, one Professor of Anaesthetics, and one Professor of Clinical Epidemiology.

Questionnaires were distributed to panel members by email. In the first round (Delphi-1) we asked respondents to (i) give a score reflecting research quality for items and/or sub-items (where an item has multiple choices) within each domain, on a scale of one to ten, ten being the highest; (ii) indicate whether there should be any additional items, or modifications to existing ones; and (iii) estimate the weighting for each domain out of 100 available points to be allocated across the domains. For Delphi-2, a feedback report was prepared and shared anonymously with respondents, summarising the responses, with additional items included as recommended in Delphi-1. Items or sub-items were added, removed, or modified if eight or more of the twelve panellists agreed. We considered consensus to have been achieved when the agreement rate reached 70% or more amongst respondents [15]. In the Delphi-1 free-text feedback it was clear that respondents had different interpretations of high and low simulation fidelity, as is common in the literature [16]. Subsequently, in Delphi-2 we provided them with a clear definition of high fidelity, which we defined as "the ability of the simulated training to provide a true representation of intended learning goals". Respondents were also provided with their previous scores plus the mean score of the other respondents (anonymised) on each item from the previous round. They were asked to score any new items and re-evaluate their previous scores, bearing in mind the scores given by the rest of the panel, altering their score if they wished.

The Delphi rounds end when, on visual inspection, general consensus is achieved in all domains with respect to domain weighting or ranks between two subsequent Delphi rounds [17].
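As an illustration of these decision rules, the short sketch below checks item-level agreement against the 70% consensus threshold and compares domain mean weightings and ranks between two rounds. It is a hypothetical helper based on the description above, not the authors' analysis code; the toy data, function names and the exact convergence test are assumptions.

```python
# Hypothetical sketch of the consensus and stopping checks described above.
import numpy as np

CONSENSUS_THRESHOLD = 0.70  # agreement rate treated as consensus [15]


def item_agreement(votes):
    """Proportion of panellists endorsing the most popular option for one item."""
    values, counts = np.unique(votes, return_counts=True)
    return counts.max() / len(votes)


def domain_summary(weights):
    """Mean weighting per domain (rows = respondents, columns = domains) and ranks."""
    means = weights.mean(axis=0)
    ranks = (-means).argsort().argsort() + 1  # rank 1 = highest mean weighting
    return means, ranks


def rounds_converged(weights_round1, weights_round2):
    """Stop when rounded mean weightings and ranks are unchanged between rounds."""
    m1, r1 = domain_summary(weights_round1)
    m2, r2 = domain_summary(weights_round2)
    return np.allclose(np.round(m1), np.round(m2)) and np.array_equal(r1, r2)


# Toy example: 10 respondents x 7 domains, weights summing to 100 per respondent.
rng = np.random.default_rng(1)
w1 = rng.dirichlet(np.ones(7), size=10) * 100
w2 = w1.copy()  # pretend nobody changed their weighting in round 2

print("keep item:", item_agreement(["yes", "yes", "yes", "no", "yes"]) >= CONSENSUS_THRESHOLD)
print("stop Delphi rounds:", rounds_converged(w1, w2))
```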

The University of Birmingham Research Ethics Committee (reference number ERN_20-0728) approved this study.

Delphi round one

All 12 experts (7 male, and 5 female) returned the questionnaires. Eight respondents were from the UK and four were from outside the UK; respondents were from nine different institutions. Unfortunately, one of the questionnaires was returned unusable (mostly blank) and therefore was excluded from analysis.

Respondents suggested five sub-items to be added (bold and italic in Table 2). The 'study design' domain was given the highest weighting by eight (73%) respondents, although five of these eight respondents scored study design equal highest with another domain. Two (18%) respondents gave data analysis the highest weighting and one (9%) scored outcomes highest.

Table 2. The final Modified MERSQI (MMERSQI): items after Delphi-1, scores after Delphi-2 (points per sub-item; maximum points per domain in parentheses).

Study design (maximum 23). 1. Study design (Pls. select one): a. Single group cross-sectional or single group post-test only, 7; b. Single group pre-test & post-test, 9; c. Nonrandomised, 2 groups, 10; d. Randomised controlled trial with high risk of bias, 11; e. Randomised controlled trial with moderate risk of bias, 16; f. Randomised controlled trial with low risk of bias, 23.

Sampling (maximum 10). 2. Is there a power calculation for sample size? a. No, 0; b. Yes, 3. 3. Are detailed participant characteristics for each arm reported? a. No, 0; b. Yes, 3. 4. Response rate, % (Pls. select one): a. Not reported, 0.5; b. <50, 1; c. 50-74, 2; d. >75, 4.

Setting (maximum 8). 5. Institutions studied: a. Single centre, 5; remaining sub-items scored 5, 5 and 8.

Type of data (maximum 11). 6. Type of data: Assessment by participants, 4; Objective measurement: a. Knowledge test (e.g. recall type questions), 6; b. Applied knowledge test (e.g. analysis and problem-solving type questions), 8; c. Skills, 11.

Validity of evaluation instrument (maximum 15). 7. Internal structure: a. Not applicable; b. Not reported, 0; c. Reported, 5. 8. Content: a. Not applicable; b. Not reported, 0; c. Reported, 5. 9. Relationships to other variables: a. Not applicable; b. Not reported, 0; c. Reported, 5.

Data analysis (maximum 17). 10. Appropriateness of analysis: a. Inappropriate for study design or type of data, 0; b. Appropriate for study design, type of data, 9. 11. Complexity of analysis: a. Descriptive analysis only, 4; b. Simple inferential statistics, 4; c. Modelling and more complex analysis, 8.

Outcomes (maximum 16). 12. Outcomes: Satisfaction, attitudes, perceptions, opinions, general facts, 7; Knowledge, skills sub-items scored 9, 12, 8 and 12 (high-fidelity simulation, 12); Behaviours in clinical environment, 13; Patient/health care outcome, 16.

Total possible score: minimum 23.5, maximum 100.

Delphi round two

For Delphi-2, 12 questionnaires were distributed and 10 were returned. Of the two non-responders, one had not responded to the first round. Only two respondents modified their distribution of score weighting between the domains. Eight (80%) respondents gave 'study design' the highest weighting (average 23 percentage points) and 'setting' was given the lowest weighting by all respondents (average 8 percentage points) (Table 3). Five of the eight respondents weighted study design equally with another domain: outcomes (three respondents), evaluation instrument validity (one respondent) and data analysis (one respondent). As can be seen in Table 3, there is general consensus across all domains. There was no change in domain average weighting or ranks between Delphi-1 and Delphi-2, and we therefore ended the Delphi rounds.

Table 3. Domain weighting scores (each respondent allocated 100 points across the seven domains). Domains, in column order: study design, sampling, setting, type of data, evaluation instrument validity, data analysis, outcomes.

Respondents 1-8 did not change their scores between rounds; respondents 9 and 10 changed some of their scores between Delphi-1 and Delphi-2; two respondents did not respond in round 2: respondent 11 (Delphi-1 weights 30, 10, 10, 20, 5, 20, 5) and respondent 12 (no analysable response in either round, NR).

Delphi-1 mean: 23, 10, 8, 11, 15, 17, 16; rank: 1, 6, 7, 5, 4, 2, 3.
Delphi-2 mean: 23, 10, 8, 11, 15, 17, 16; standard error (SE) of the mean: 1.5, 0.8, 0.8, 1.4, 1.4, 2.8, 1.5; rank: 1, 6, 7, 5, 4, 2, 3.
Change in rank: 0.0 for every domain.

NR = no response.

We used the average weighting score (out of 100) to determine the weighting of each domain. For example, 'study design' received an average weight of 23 of the 100 available points, so each sub-item within that domain could score up to the full 23 points. We then used the score out of ten given by respondents for each sub-item to allocate a proportion of the domain's points (in this example up to a maximum of 23) to that sub-item. Thus the sub-item 'single group cross-sectional or single group post-test only' scored 3/10 and was allocated three tenths of the available 23 points (i.e. 7), whereas 'randomised controlled trial with low-risk bias' scored 9/10 and was therefore allocated 21 points (i.e. 90% of the domain weighting of 23). For simplicity, we rounded up the points for the item achieving the highest points in each domain so that the overall total could reach 100. For domains where more than one sub-item could be scored, we used the highest-scoring item. For example, the data analysis domain has a maximum possible score of 17 and contains two items, each with multiple sub-items. If scoring a paper containing both simple inferential statistics and modelling, we use the highest-scoring item, awarding 8 points (for modelling) rather than 4 points (for simple inferential statistics). The final quality criteria list is shown in Table 2.
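To illustrate this point-allocation logic, the sketch below re-implements the worked examples in the paragraph above: a sub-item's 0-10 panel score is converted into its share of the domain weight, and where several sub-items in a domain apply only the highest counts. It is a hypothetical illustration, not the authors' own code; the domain weights come from Table 3, the worked numbers from the text, and the dictionary layout and function names are assumptions.

```python
# Hypothetical sketch of the MMERSQI point-allocation rule described above:
# sub-item points = (panel score out of 10) x (domain weight out of 100) / 10,
# and where several sub-items in a domain apply, only the highest counts.

DOMAIN_WEIGHTS = {  # average Delphi weightings reported in Table 3
    "study design": 23, "sampling": 10, "setting": 8, "type of data": 11,
    "evaluation instrument validity": 15, "data analysis": 17, "outcomes": 16,
}

def subitem_points(domain: str, panel_score_out_of_10: float) -> float:
    """Allocate a proportion of the domain weight to one sub-item."""
    return DOMAIN_WEIGHTS[domain] * panel_score_out_of_10 / 10.0

def domain_points(candidate_points: list[float]) -> float:
    """When more than one sub-item in a domain applies, only the highest counts."""
    return max(candidate_points)

# Worked examples from the text:
print(round(subitem_points("study design", 3)))  # 7 of the 23 available points
print(round(subitem_points("study design", 9)))  # 21 points (before the final
                                                 # rounding-up of the top item)
# Data analysis domain: a paper using both simple inferential statistics
# (4 points) and modelling (8 points) is awarded the higher value.
print(domain_points([4, 8]))  # 8
```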

Summary and discussions

A group of respondents with known relevant expertise [11] participated in two Delphi rounds to achieve consensus on the MMERSQI. We derived the MMERSQI from the original MERSQI, with the addition of items developed by the research team that were then supplemented and assessed through the Delphi process.

After two rounds, there was clear consensus that some domains have significantly more importance than others in determining educational research quality. It is of course possible that a different expert panel would have given different results, but our panel comprised a wide range of people with different perspectives who were all experts in medical education. Moreover, the standard errors (SE) of the mean weightings are very small, so the scores given by other panels would most likely not differ much from those obtained from this panel.

The learning effectiveness of simulation-based medical education is well established in the literature [18, 19]. Of course, simulation-based medical education cannot replace clinical placement, but it can support and supplement it in terms of effectiveness, self-confidence, and preparation for clinical practice [20]. Surprisingly small differences were found between the points given by the Delphi panel to high-fidelity simulation (accuracy of simulation) (12 points) and to the clinical environment (13 points). This is consistent with the report from Quail et al. [21] that learning communication skills in a brief placement in virtual, standardised, or traditional learning environments achieved the same outcomes in knowledge and confidence.

The fidelity of training has to be high for all types of learners and consistent throughout, but the focus must shift from the appearance of the simulation to the accuracy of the stimulus, information processing and response in a given situation. If a learner has learned a skill incorrectly the first time, it appears, a priori, that performance may be hindered even with further training [22, 23]. On the other hand, the difficulty or simplicity of the simulated training should match the learner's level in order to improve engagement in learning [24]. As Vygotsky [25] argues, skills development takes place in the zone in which the learner is able to solve a problem independently or with the help of an expert, as described by the concept of the zone of proximal development. The most important issue, therefore, is the ability of the simulation to achieve the intended transferable learning goals.

The Delphi process achieved consensus on the MMERSQI. Respondents achieved consensus that the domain weighting should not be equal and that some domains have more importance than others. We suggest that the MMERSQI, in building on the success of the MERSQI, may help further establish a minimum reference standard of quality measures for medical education studies. The validity of this criteria list and scoring system will have to be further evaluated over time.

Acknowledgements

The authors thank Prof A’Aziz Boker, A/Prof Andy Wearn, A/Prof Anna Vnuk, A/Prof Celia Brown, Prof Chris McManus, Prof Debra Nestel, Dr Ian Davison, Prof John Skelton, A/Prof Katherine Woolf, Prof Richard Lilford, Prof Roger Jones for their participation in the Delphi panel. We also thank Dr John Easton for his comments on an earlier version of this manuscript.

Other disclosures

Abbreviations

MERSQI: Medical Education Research Study Quality Instrument
MMERSQI: Modified Medical Education Research Study Quality Instrument

Authors’ contributions

The research idea was conceived by M.A., J.P. and S.H. The initial Delphi 1&2 questionnaires were drafted by M.A. and amended and reviewed by J.P. and S.H. Respondent recruitment was by J.P. and M.A. Data collection and the initial paper draft were by M.A. Data analysis, drafting and redrafting were by M.A., J.P. and S.H. Delphi feedback to respondents and the second questionnaire were by M.A. and J.P. The authors read and approved the final manuscript.

The research was performed as part of the authors’ regular work duties. MA was funded for his PhD by the Ministry of Higher Education, Saudi Arabia.

Availability of data and materials

Declarations.

The University of Birmingham Research Ethics Committee has approved this study. Informed consent was obtained from all the participants. We confirm that all methods were performed in accordance with the guidelines and regulations of the University of Birmingham Research Ethics Committee approval reference number ERN_20-0728.

Not applicable.

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


  • Matters Arising
  • Open access
  • Published: 08 May 2023

Quantifying the Quality of Medical Education Studies: The MMERSQI Approach

  • Aaron Lawson McLean   ORCID: orcid.org/0000-0001-5528-6905 1 &
  • Falko Schwarz   ORCID: orcid.org/0000-0003-0821-193X 1  

BMC Medical Education, volume 23, Article number: 320 (2023)


Dear Editor,

We read with interest the recent publication "A Modified Medical Education Research Study Quality Instrument (MMERSQI) developed by Delphi Consensus" by Al Asmri et al. [ 1 ]. The authors present a modification of a commonly used study quality instrument, the Medical Education Research Study Quality Instrument (MERSQI), using a modified Delphi technique to reach consensus among experts in the field of medical education, with the aim of identifying any changes required in the scoring system and relative importance of each domain [ 2 ]. The authors also added new criteria to the instrument based on feedback received from the Delphi panel. The final criteria list and the new domain weighting score of the MMERSQI was satisfactory to all respondents and the authors suggest that the MMERSQI may help establish a reference standard of quality measures for many medical education studies.

The development of this tool is an important step in the evaluation of medical education research studies, and the fact that the authors obtained high levels of agreement among the Delphi panel members is a strength of the study, suggesting that the MMERSQI has good face validity. However, while the MMERSQI, like the original MERSQI, has certain strengths, we have concerns that need to be addressed before the tool can be widely adopted and used.

First, we appreciate the use of the Delphi method in reaching a consensus on the MMERSQI items. This method helps to ensure that the resulting instrument reflects the opinions of experts in the field and minimizes the risk of bias. However, the sample size of the Delphi participants could have been larger and more representative of the field, which would have increased the external validity of the results. Moreover, the specific criteria used to select the Delphi participants and the method of data analysis were not reported in the article, which limits the transparency and reproducibility of the study.

Second, the MMERSQI only covers a limited number of aspects of study quality. While the items selected are important, they do not encompass all the aspects of quality that need to be considered when evaluating medical education research studies. For example, ethical considerations like obtaining informed consent from participants, protecting participant confidentiality, and minimizing harm should also be included. Additionally, confounding variables, which can impact the validity of the results, should be reported.

Third, the scoring system used in the MMERSQI is not optimal. In the study, the weighting system was determined through a Delphi consensus, which, while useful in gaining a general consensus, may not accurately reflect the true importance of each item in relation to the quality of the study. Further, the scoring system operates on a binary principle, in which a score is assigned based on the presence or absence of certain key features, without considering the importance of these features relative to one another. The weight given to each criterion is not well justified, nor is the weighting system anchored to a clear definition of study quality or to a clear understanding of the importance of each item in determining the quality of a study. This lack of weighting can lead to a flawed representation of study quality, as critical components may be given equal consideration as less critical ones. This issue is compounded by the fact that some components, such as participant characteristics or response rate, can have a significant impact on the overall validity of the study, whereas others, such as the type of institution, may have a relatively minor impact. Thus, the scoring system as it stands may not accurately reflect the true quality of the study, which can hinder the interpretation of the results and their application in practice.

It would have been helpful if the authors had provided some guidelines or a scoring algorithm to assist in the interpretation of the scores. The current approach does not allow for a nuanced assessment of the quality of the study and may result in a misleading overall score. A more nuanced scoring system, such as a modified Likert scale, could have been used to allow for a more nuanced assessment of the quality of each item. This would have allowed for a more in-depth examination of the strengths and limitations of each study, and would have enabled a more accurate and comprehensive assessment of study quality. Moreover, the scoring system could also be enhanced by considering the interplay between components, as certain combinations of features may have a compounded impact on the overall validity of the study. For example, a study with a low response rate may be compensated for by the use of high-fidelity simulation for measuring outcomes, whereas a study with a high response rate but limited internal validity may still produce questionable results.

Finally, the MMERSQI needs to be further tested and validated before it can be used in practice. The authors reported moderate to high levels of agreement among the Delphi panel members, but this does not necessarily mean that the MMERSQI is a valid and reliable instrument. Psychometric testing of the instrument, encompassing features such as reliability, validity and responsiveness, must now be pursued. Furthermore, the generalizability of the MMERSQI to different medical education research study designs and settings has not been established. A possible research agenda would include testing the MMERSQI in a range of contexts, assessing its reliability by measuring inter-rater agreement, and validating it against external standards of study quality, including alternative instruments like the Newcastle-Ottawa Scale-Education (NOS-E) or expert evaluations [ 3 ].
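One practical starting point for the inter-rater reliability testing suggested above would be an intraclass correlation coefficient, the statistic used in earlier MERSQI/NOS-E appraisal work. The sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) for a small raters-by-studies matrix; the ratings are made up for illustration and this is a generic example, not an analysis of any real MMERSQI data.

```python
# Minimal ICC(2,1) sketch on made-up MMERSQI total scores:
# rows = studies, columns = raters.
import numpy as np

ratings = np.array([
    [62.0, 58.0, 65.0],
    [40.0, 45.0, 43.0],
    [75.0, 70.0, 78.0],
    [55.0, 50.0, 52.0],
    [88.0, 85.0, 90.0],
])
n, k = ratings.shape  # n studies rated by k raters

grand = ratings.mean()
ms_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2) / (n - 1)  # studies
ms_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2) / (k - 1)  # raters
ss_err = np.sum((ratings - grand) ** 2) \
    - (n - 1) * ms_rows - (k - 1) * ms_cols
ms_err = ss_err / ((n - 1) * (k - 1))

# ICC(2,1): two-way random effects, absolute agreement, single measurement.
icc_2_1 = (ms_rows - ms_err) / (
    ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
)
print(f"ICC(2,1) = {icc_2_1:.2f}")
```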

Of course, we recognise that obtaining expert evaluations or using a validated instrument for comparison can be time-consuming, costly, and may not always be feasible for every manuscript or study. Moreover, the use of an external gold standard assumes a level of consistency in the ratings provided, which may not always be the case. Nevertheless, efforts to develop and validate the MMERSQI should aim to incorporate as much external validation as possible, to improve its representativeness, precision in scoring, and the configuration of its multiple components, within the constraints of practicality and available resources.

In conclusion, while the MMERSQI is a useful step in the evaluation of medical education research studies, it still has several limitations and needs to be further developed and tested. It is clear to us that several of our comments, for example regarding the scoring procedure, the selection of assessment criteria and dimensions, and the interpretation of these attributes as indicators of study validity, are fundamental issues that require attention in both the original MERSQI and the newly developed MMERSQI. We recommend that the authors conduct additional psychometric testing, validate the MMERSQI in different study designs and settings, and consider adding items to address the ethical and reporting aspects of study quality. We hope that the authors will take these aspects into consideration and that the MMERSQI will be further developed and refined to provide a comprehensive and accurate evaluation of medical education research studies.

Aaron Lawson McLean & Falko Schwarz

Data Availability

Not applicable.

Al Asmri M, Haque MS, Parle J. A Modified Medical Education Research Study Quality Instrument (MMERSQI) developed by Delphi consensus. BMC Med Educ. 2023;23(1):63. https://doi.org/10.1186/s12909-023-04033-6 .


Reed DA, Cook DA, Beckman TJ, Levine RB, Kern DE, Wright SM. Association between funding and quality of published medical education research. JAMA. 2007;298(9):1002–9. https://doi.org/10.1001/jama.298.9.1002 .

Cook DA, Reed DA. Appraising the quality of medical education research methods: the Medical Education Research Study Quality Instrument and the Newcastle-Ottawa Scale-Education. Acad Med. 2015;90(8):1067–76. https://doi.org/10.1097/ACM.0000000000000786 .



Authors and affiliations

Department of Neurosurgery, Jena University Hospital – Friedrich Schiller University Jena, Am Klinikum 1, 07747, Jena, Germany

Aaron Lawson McLean & Falko Schwarz


Contributions

ALM and FS were responsible for the writing and submission of this letter. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Aaron Lawson McLean .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Lawson McLean, A., Schwarz, F. Quantifying the Quality of Medical Education Studies: The MMERSQI Approach. BMC Med Educ 23, 320 (2023). https://doi.org/10.1186/s12909-023-04303-3


Received: 10 February 2023

Accepted: 28 April 2023

Published: 08 May 2023

DOI: https://doi.org/10.1186/s12909-023-04303-3
