Statistics By Jim

Making statistics intuitive

Cohort Study: Definition, Benefits & Examples

By Jim Frost

What is a Cohort Study?

A cohort study is a longitudinal observational design that follows a group of participants who share a defining characteristic. For example, a cohort study can select subjects who have exposure to a risk factor, are in the same profession, population, or generation, or experience a particular event, such as a medical procedure. This design assesses whether exposure to a risk factor is associated with an outcome. Cohort studies are a type of longitudinal study because they track the same set of subjects over time.

Cohort studies are observational designs, meaning that the researchers do not manipulate experimental or environmental conditions. Instead, they collect data over time and try to understand how various factors affect the outcome. These projects can last for periods ranging from weeks to decades, depending on the research questions.

Learn more about Experimental Design: Definition, Types, and Examples.

Examples of Cohort Studies

Researchers frequently use cohort studies to identify disease risk factors and understand how they affect disease incidence rates.

British Doctors

This cohort study ran from 1951 to 2001 and tracked about 40,000 participants with various smoking habits. The researchers linked smoking to lung cancer and higher death rates.

Nurses’ Health

This cohort study started in 1976 and tracks over 120,000 nurses. It assesses risk factors for many major chronic diseases in women.

Framingham Heart

The study tracks 15,000 participants from three generations and started in 1948. It has identified risk factors for high blood pressure and high cholesterol, among others.

Types of Cohort Studies

Cohort designs can be retrospective or prospective.

Retrospective Cohort Study

In a retrospective cohort study, the scientists identify subjects whose outcomes are already known when the project starts. For example, they can find patients who already have the condition of interest and compare them to those who do not. They then look for patterns that predict who developed the disease.

In retrospective designs, the researchers collect their data using existing records. Consequently, they can complete their study more quickly and inexpensively than prospective designs. However, the various factors and other variables might not have been measured consistently or accurately because they weren’t explicitly designed to be part of a cohort study.

Researchers using a retrospective design have to make do with data that other people recorded in the past for other purposes. Those data were not chosen and measured with the project’s needs in mind. Alternatively, some studies might ask the subjects to recall exposure information or use other subjective evaluations, which introduces a variety of biases.

Learn more about Retrospective Studies.

Prospective Cohort Study

In a prospective cohort study, researchers identify subjects based on the cohort, but the outcomes are unknown when the study begins. Typically, the study recruits people with and without exposure to facilitate comparisons. Then, they track the participants over time, record all the necessary data, and watch for patterns in those who develop the outcome of interest.

Prospective designs are more expensive and time-consuming than retrospective studies. However, the researchers can measure all the required data at regular intervals.

Generally, conclusions from a prospective cohort study are more reliable than those from a retrospective design because the data are collected consistently and with the study's needs in mind.

Learn more about Prospective Studies.

Benefits of a Cohort Study

Scientists frequently use cohort studies in epidemiology, psychology, social sciences, and nursing. This design is well suited to identifying both protective and risk factors in natural settings and understanding how they affect incidence rates.

In other words, this design helps develop an understanding of the variables that increase and decrease the probability of contracting a disease or other condition. Additionally, researchers can track multiple outcomes (e.g., several diseases) in a single cohort. For example, do smokers have an increased incidence of both lung cancer and emphysema?

Typically, researchers recruit a group where some participants have exposure to a risk factor while others do not. Researchers can include multiple subgroups related to various risk and protective factors in the cohort study. In this manner, analysts can track those factors and link them to occurrences of the outcome they’re studying.

When a risk factor is rare, a cohort study can specifically recruit participants with exposure and follow them. In contrast, other methods are unlikely to obtain a sufficient number of subjects exposed to the risk factor, making it difficult to produce meaningful results.

The longitudinal nature of this design allows a cohort study to understand how exposure and timing relate to the outcome. The scientists don’t need to understand those relationships fully to conduct the research. Instead, they can collect data and evaluate relationships as they appear. Additionally, exposure can change over time, providing insight into its relationship with the outcome.

Weaknesses of a Cohort Study

As mentioned earlier, a prospective cohort study can be expensive and time-consuming. In some cases, these studies involve tens of thousands of participants and last for years or decades. The researchers must track these subjects, perform follow-up evaluations regularly, and record all the data. Over this time, some participants will drop out, making the results vulnerable to attrition bias.

A cohort study is not a true experiment. It’s a type of observational design and, as such, it opens the door to the problem of confounding variables and spurious correlations. The observed relationships between risk factors and the outcome might be only correlational and not causal. Confounders can bias the results. While these studies are an excellent way to identify potential factors, they require follow-up experiments to verify causal relationships. Learn more about Correlation vs. Causation: Understanding the Differences.

Because cohort studies are observational, they do not use random assignment. Researchers must be wary of confounding factors and take appropriate countermeasures.
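One common countermeasure is to stratify by a suspected confounder and pool the stratum-specific estimates. The sketch below uses the Mantel-Haenszel risk ratio, a standard technique that this article does not itself name, and every count in it is hypothetical. It assumes age is the confounder: older participants are both more likely to be exposed and more likely to develop the outcome.

```python
from typing import NamedTuple


class Stratum(NamedTuple):
    """Counts for one level of a suspected confounder (hypothetical data)."""
    exposed_cases: int
    exposed_total: int
    unexposed_cases: int
    unexposed_total: int


def crude_risk_ratio(strata: list[Stratum]) -> float:
    """Risk ratio that ignores the confounder by pooling all strata."""
    risk_exposed = sum(s.exposed_cases for s in strata) / sum(s.exposed_total for s in strata)
    risk_unexposed = sum(s.unexposed_cases for s in strata) / sum(s.unexposed_total for s in strata)
    return risk_exposed / risk_unexposed


def mantel_haenszel_risk_ratio(strata: list[Stratum]) -> float:
    """Risk ratio pooled across strata with Mantel-Haenszel weights."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        n = s.exposed_total + s.unexposed_total
        numerator += s.exposed_cases * s.unexposed_total / n
        denominator += s.unexposed_cases * s.exposed_total / n
    return numerator / denominator


# Hypothetical cohort in which age drives both exposure and outcome.
strata = [
    Stratum(exposed_cases=80, exposed_total=400, unexposed_cases=20, unexposed_total=100),  # older
    Stratum(exposed_cases=5, exposed_total=100, unexposed_cases=20, unexposed_total=400),   # younger
]
crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)
print(f"Crude risk ratio: {crude:.3f}")
print(f"Age-adjusted (Mantel-Haenszel) risk ratio: {adjusted:.3f}")
```

In these made-up data, the exposed and unexposed have the same risk within each age group, yet the pooled comparison makes exposure look harmful. Stratifying removes that spurious association, but only for confounders the researchers thought to measure.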

For more information, read my posts about Observational Studies Explained and Confounding Variables.

Cohort Study vs. Case-Control Study

Cohort and case-control studies are observational designs that medical and epidemiological researchers use to evaluate risk factors. While they are similar, there are crucial differences.

A cohort study evaluates the frequency of a disease/condition by exposure. The researchers assess differences in exposure and see how that relates to differences in the incidence rate.

Does exposure affect the incidence rate?

To answer this question, cohort studies often use regression models to estimate the relationships and control for confounders.
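Before fitting a model, the same question can be summarized with a relative risk: the incidence among the exposed divided by the incidence among the unexposed. The following is a minimal sketch, not the method of any particular study. The function names and the counts are hypothetical.

```python
def cumulative_incidence(cases: int, total: int) -> float:
    """Proportion of a group that developed the outcome during follow-up."""
    if total <= 0:
        raise ValueError("total must be positive")
    return cases / total


def relative_risk(exposed_cases: int, exposed_total: int,
                  unexposed_cases: int, unexposed_total: int) -> float:
    """Incidence in the exposed group divided by incidence in the unexposed group."""
    risk_exposed = cumulative_incidence(exposed_cases, exposed_total)
    risk_unexposed = cumulative_incidence(unexposed_cases, unexposed_total)
    if risk_unexposed == 0:
        raise ValueError("relative risk is undefined when the unexposed group has no cases")
    return risk_exposed / risk_unexposed


# Hypothetical cohort: 30 of 1,000 exposed and 10 of 1,000 unexposed develop the disease.
rr = relative_risk(30, 1000, 10, 1000)
print(f"Relative risk: {rr:.2f}")
```

A value above 1 means the outcome was more frequent among the exposed, and a value below 1 suggests a protective factor. This crude ratio does not adjust for confounders, which is one reason regression models are used.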

Case-Control

In contrast, a case-control design focuses on the comparative exposure for those who have the condition relative to those without it.

Do people with a condition have greater exposure?

To answer this question, case-control studies typically report an odds ratio.
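As a rough illustration, the odds ratio compares the odds of exposure among cases with the odds of exposure among controls. The function name and the counts below are hypothetical and serve only to show the arithmetic.

```python
def odds_ratio(cases_exposed: int, cases_unexposed: int,
               controls_exposed: int, controls_unexposed: int) -> float:
    """Odds of exposure among cases divided by odds of exposure among controls."""
    if cases_unexposed == 0 or controls_exposed == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)


# Hypothetical case-control data: 40 of 100 cases and 20 of 100 controls were exposed.
estimate = odds_ratio(cases_exposed=40, cases_unexposed=60,
                      controls_exposed=20, controls_unexposed=80)
print(f"Odds ratio: {estimate:.2f}")
```

Because a case-control study fixes the number of cases and controls in advance, it cannot estimate incidence directly, which is why it reports odds instead of risks.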

Case-control designs are always retrospective, whereas cohort research can be retrospective or prospective.

For more information, read my post about Case-Control Studies.

Cohort Studies: Design, Analysis, and Reporting

Affiliations.

  • 1 Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH.
  • 2 Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH.
  • PMID: 32658655
  • DOI: 10.1016/j.chest.2020.03.014

Cohort studies are a type of observational study in which a cohort, or a group of individuals sharing some characteristic, is followed over time, and outcomes are measured at one or more time points. Cohort studies can be classified as prospective or retrospective studies, and they have several advantages and disadvantages. This article reviews the essential characteristics of cohort studies and includes recommendations on the design, statistical analysis, and reporting of cohort studies in respiratory and critical care medicine. Tools are provided for researchers and reviewers.

Keywords: bias; cohort studies; confounding; prospective; retrospective.

Copyright © 2020 American College of Chest Physicians. Published by Elsevier Inc. All rights reserved.

What Is a Cohort Study?

A cohort study often looks at two (or more) groups of people who differ in a particular attribute (for example, some smoke and some don't) to try to understand how that attribute affects an outcome. The goal is to understand the relationship between one group's shared attribute (in this case, smoking) and its eventual outcome.

Cohort Study Design

There are two categories of evidence-based human medical research:

Experimental research: This involves a controlled process through which each participant in a clinical trial is exposed to some type of intervention or situation—like a drug, vaccine, or environmental exposure. Sometimes there is also a control group that is not exposed for comparison. The results come from tracking the effects of the exposure or intervention over a set period of time.

Observational research: This is when there is no intervention. The researchers simply observe the participants' exposure and outcomes over a set period of time in an attempt to identify potential factors that could affect a variety of health conditions.

Cohort studies are longitudinal, meaning that they take place over a set period of time—frequently, years—with periodic check-ins with the participants to record information like their health status and health behaviors.

They can be either:

  • Prospective: Start in the present and continue into the future
  • Retrospective: Start in the present, but look to the past for information on medical outcomes and events

Purpose of Cohort Studies

The purpose of cohort studies is to help advance medical knowledge and practice, such as by getting a better understanding of the risk factors that increase a person's chances of getting a particular disease.

Participants in cohort studies are grouped together based on having a shared characteristic—like being from the same geographic location, having the same occupation, or having a diagnosis of the same medical condition.

Each time the researchers check in with participants in cohort studies, they're able to measure health behaviors and outcomes up to that point. For example, a study could involve two cohorts: one that smokes and one that doesn't. As the data are collected over time, the researchers get a better idea of whether there appears to be a link between a behavior (in this case, smoking) and a particular outcome (such as lung cancer).

Strengths of Cohort Studies

Much of the medical profession's current knowledge of disease risk factors comes from cohort studies. In addition to showing disease progression, cohort studies also help researchers calculate the incidence rate, cumulative incidence, relative risk, and hazard ratio of health conditions.  
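To make two of these measures concrete, the sketch below computes cumulative incidence (cases per person enrolled) and the incidence rate (cases per unit of person-time) from a handful of hypothetical follow-up records. The record layout is an assumption made for illustration. Hazard ratios need a survival model and are not covered here.

```python
# Hypothetical follow-up records: (years of follow-up, developed the outcome?)
records = [
    (5.0, False),
    (2.5, True),
    (4.0, False),
    (1.5, True),
    (5.0, False),
]

new_cases = sum(1 for _, developed in records if developed)
person_years = sum(years for years, _ in records)

# Cumulative incidence: share of participants who developed the outcome.
cumulative_incidence = new_cases / len(records)

# Incidence rate: new cases per person-year of observation.
incidence_rate = new_cases / person_years

print(f"Cumulative incidence: {cumulative_incidence:.1%}")
print(f"Incidence rate: {incidence_rate * 1000:.1f} per 1,000 person-years")
```

The incidence rate accounts for participants being observed for different lengths of time, which matters when people join late, drop out, or develop the outcome early.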

  • Size: Large cohort studies with many participants usually support more confident conclusions than small studies.
  • Timeline: Because they track the progression of diseases over time, cohort studies can also help establish the timeline of a health condition and determine whether specific behaviors are potential contributing factors to disease.
  • Multiple measures: Often, cohort studies allow researchers to observe and track multiple outcomes from the same exposure. For example, if a cohort study is following a group of people undergoing chemotherapy, researchers can study the incidence of nausea and skin rashes in the patients. In this case, there is one exposure (chemotherapy) and multiple outcomes (nausea and skin rashes).
  • Accuracy: Another strength of cohort studies (specifically, prospective cohort studies) is that researchers might be able to measure the exposure variable, other variables, and the participants' health outcomes with relative accuracy.
  • Consistency: Outcomes can be measured uniformly across all participants in a study.

Retrospective cohort studies have their own benefits, namely that they can be conducted more quickly, easily, and cheaply than other types of research.

Weaknesses of Cohort Studies

While cohort studies are an essential part of medical research, they are not without their limitations.

These can include:

  • Time: Researchers aren't simply bringing participants into the lab for one day to answer a few questions. Cohort studies can last for years—even decades—which means that the costs of running the study can really add up.
  • Self-reporting: Even though retrospective cohort studies are less costly, they come with their own significant weakness in that they might rely on participants' self-reporting of past conditions, outcomes, and behaviors. Because of this, it can be more difficult to get accurate results.  
  • Drop-out: Given the lengthy time commitment required to be a part of a cohort study, it's not unusual for participants to drop out of this type of research. Though they have every right to do that, having too many people leave the study could potentially increase the risk of bias.
  • Behavior alteration: Another weakness of cohort studies is that participants may alter their behavior in ways they wouldn't otherwise if they were not part of a study, which could alter the results of the research.
  • Potential for biases: Even the most well-designed cohort studies won't achieve results as robust as those reached via randomized controlled trials. This is because participants are grouped by shared traits rather than assigned at random, so the design inherently lacks randomization.
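One simple way to gauge how much drop-out could matter is to bound the true risk by two extreme assumptions: none of the participants who left developed the outcome, or all of them did. This is only a sketch of that idea. The function name and the numbers are hypothetical.

```python
def attrition_bounds(enrolled: int, completed: int, cases_observed: int) -> dict[str, float]:
    """Retention rate plus best- and worst-case risk under drop-out."""
    if not 0 < completed <= enrolled:
        raise ValueError("completed must be between 1 and enrolled")
    if not 0 <= cases_observed <= completed:
        raise ValueError("cases_observed cannot exceed completed")
    dropped = enrolled - completed
    return {
        "retention": completed / enrolled,
        "observed_risk": cases_observed / completed,
        # Lower bound: no participant who dropped out developed the outcome.
        "lower_bound": cases_observed / enrolled,
        # Upper bound: every participant who dropped out developed the outcome.
        "upper_bound": (cases_observed + dropped) / enrolled,
    }


# Hypothetical cohort: 1,000 enrolled, 800 completed follow-up, 80 observed cases.
summary = attrition_bounds(enrolled=1000, completed=800, cases_observed=80)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

A wide gap between the two bounds signals that conclusions depend heavily on what happened to the people who left the study.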

A Word From Verywell

Medicines, devices, and other treatments come to the market after many years of research. There's a long journey between the first tests of early formulations of a drug in a lab, and seeing commercials for it on TV with a list of side effects read impossibly quickly.

Think about the last time you had a physical. Your healthcare provider likely measured several of your vital signs and gave you a blood test, then reported back to you about the various behaviors you may need to change in order to reduce your risk of developing certain diseases. Those risk factors aren't just guesses; many of them are the result of cohort studies.

By Elizabeth Yuko, PhD

Yuko has a doctorate in bioethics and medical ethics and is a freelance journalist based in New York.

Study Design 101: Cohort Study

A study design where one or more samples (called cohorts) are followed prospectively, and subsequent status evaluations with respect to a disease or outcome are conducted to determine which of the participants' initial exposure characteristics (risk factors) are associated with it. As the study is conducted, the outcome for participants in each cohort is measured and relationships with specific characteristics are determined.

Advantages

  • Subjects in cohorts can be matched, which limits the influence of confounding variables
  • Standardization of criteria/outcome is possible
  • Easier and cheaper than a randomized controlled trial (RCT)

Disadvantages

  • Cohorts can be difficult to identify due to confounding variables
  • No randomization, which means that imbalances in patient characteristics could exist
  • Blinding/masking is difficult
  • Outcome of interest could take time to occur

Design pitfalls to look out for

The cohorts need to be chosen from separate, but similar, populations.

How many differences are there between the control cohort and the experiment cohort? Will those differences cloud the study outcomes?

Fictitious Example

A cohort study was designed to assess the impact of sun exposure on skin damage in beach volleyball players. During a weekend tournament, players from one team wore waterproof, SPF 35 sunscreen, while players from the other team did not wear any sunscreen. At the end of the volleyball tournament, the skin of players from both teams was analyzed for texture, sun damage, and burns. Comparisons of skin damage were then made based on the use of sunscreen. The analysis showed a significant difference between the cohorts in terms of the skin damage.
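The example does not say how significance was assessed. One possibility, shown here purely as an assumption, is a Pearson chi-square test on a 2x2 table of sunscreen use versus skin damage. The counts are invented, and with teams this small an exact test might be preferred in practice.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square statistic and p-value for the 2x2 table [[a, b], [c, d]].

    No continuity correction is applied. With 1 degree of freedom, the
    upper-tail probability equals erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows are teams, columns are (damaged, not damaged).
sunscreen_damaged, sunscreen_ok = 2, 10
no_sunscreen_damaged, no_sunscreen_ok = 9, 3

chi2, p_value = chi_square_2x2(sunscreen_damaged, sunscreen_ok,
                               no_sunscreen_damaged, no_sunscreen_ok)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

A small p-value indicates that a difference this large would be unlikely if sunscreen use and skin damage were unrelated. It says nothing about whether the two teams differed in other ways.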

Real-life Examples

Hoepner, L., Whyatt, R., Widen, E., Hassoun, A., Oberfield, S., Mueller, N., ... Rundle, A. (2016). Bisphenol A and Adiposity in an Inner-City Birth Cohort. Environmental Health Perspectives, 124 (10), 1644-1650. https://doi.org/10.1289/EHP205

This longitudinal cohort study looked at whether exposure to bisphenol A (BPA) early in life affects obesity levels in children later in life. Positive associations were found between prenatal BPA concentrations in urine and increased fat mass index, percent body fat, and waist circumference at age seven.

Lao, X., Liu, X., Deng, H., Chan, T., Ho, K., Wang, F., ... Yeoh, E. (2018). Sleep Quality, Sleep Duration, and the Risk of Coronary Heart Disease: A Prospective Cohort Study With 60,586 Adults. Journal Of Clinical Sleep Medicine, 14 (1), 109-117. https://doi.org/10.5664/jcsm.6894

This prospective cohort study explored "the joint effects of sleep quality and sleep duration on the development of coronary heart disease." The study included 60,586 participants and an association was shown between increased risk of coronary heart disease and individuals who experienced short sleep duration and poor sleep quality. Long sleep duration did not demonstrate a significant association.

Related Formulas

  • Relative Risk
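
For reference, relative risk is conventionally defined from a 2x2 table of exposure by outcome. The cell labels a, b, c, and d are notation introduced here for illustration: a and b are the exposed participants with and without the outcome, and c and d are the unexposed participants with and without the outcome.

```latex
\[
\mathrm{RR} \;=\; \frac{a / (a + b)}{c / (c + d)}
\]
```

The numerator is the risk of the outcome among the exposed, and the denominator is the risk among the unexposed.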

Related Terms

Cohort

A group that shares the same characteristics among its members (population).

Confounding Variables

Variables that cause or prevent an outcome outside of, or along with, the variable being studied. These variables make it difficult or impossible to distinguish the relationship between the variable and the outcome being studied.

Population Bias/Volunteer Bias

A sample may be skewed by those who are selected or self-selected into a study. If only certain portions of a population are considered in the selection process, the results of a study may have poor validity.

Prospective Study

A study that moves forward in time, observing outcomes as they occur, as opposed to a retrospective study, which looks back on outcomes that have already taken place.

Now test yourself!

1. In a cohort study, an exposure is assessed and then participants are followed prospectively to observe whether they develop the outcome.

a) True b) False

2. Cohort Studies generally look at which of the following?

a) Determining the sensitivity and specificity of diagnostic methods
b) Identifying patient characteristics or risk factors associated with a disease or outcome
c) Variations among the clinical manifestations of patients with a disease
d) The impact of blinding or masking a study population

Creative Commons License

  • Last Updated: Sep 25, 2023 10:59 AM
  • URL: https://guides.himmelfarb.gwu.edu/studydesign101

Quantitative study designs: Cohort Studies

Cohort Study

Did you know that the majority of people will develop a diagnosable mental illness whilst only a minority will experience enduring mental health?  Or that groups of people at risk of having high blood pressure and other related health issues by the age of 38 can be identified in childhood?  Or that a poor credit rating can be indicative of a person’s health status?

These findings (and more) have come out of a large cohort study started in 1972 by researchers at the University of Otago in New Zealand.  This study is known as The Dunedin Study and it has followed the lives of 1037 babies born between 1 April 1972 and 31 March 1973 since their birth. The study is now in its fifth decade and has produced over 1200 publications and reports, many of which have helped inform policy makers in New Zealand and overseas.

In Introduction to Study Designs, we learnt that there are many different study design types and that these are divided into two categories:  Experimental and Observational. Cohort Studies are a type of observational study. 

What is a Cohort Study design?

  • Cohort studies are longitudinal, observational studies, which investigate predictive risk factors and health outcomes. 
  • They differ from clinical trials, in that no intervention, treatment, or exposure is administered to the participants. The factors of interest to researchers already exist in the study group under investigation.
  • Study participants are observed over a period of time. The incidence of disease in the exposed group is compared with the incidence of disease in the unexposed group.
  • Because of the observational nature of cohort studies they can only find correlation between a risk factor and disease rather than the cause. 
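
The comparison of incidence between the exposed and unexposed groups is usually reported with a measure of uncertainty. The sketch below computes a risk ratio with an approximate 95% confidence interval using the log (Katz) method, a common choice that this guide does not itself specify. All counts are hypothetical.

```python
import math


def risk_ratio_with_ci(exposed_cases: int, exposed_total: int,
                       unexposed_cases: int, unexposed_total: int,
                       z: float = 1.96) -> tuple[float, float, float]:
    """Risk ratio and approximate confidence interval (log method).

    z = 1.96 corresponds to a 95% interval. Both groups need at least one case.
    """
    if exposed_cases <= 0 or unexposed_cases <= 0:
        raise ValueError("the log method needs at least one case in each group")
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    ratio = risk_exposed / risk_unexposed
    # Standard error of ln(ratio).
    se_log = math.sqrt(1 / exposed_cases - 1 / exposed_total
                       + 1 / unexposed_cases - 1 / unexposed_total)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


# Hypothetical cohort: 45 of 900 exposed and 30 of 1,200 unexposed develop the disease.
ratio, lower, upper = risk_ratio_with_ci(45, 900, 30, 1200)
print(f"Risk ratio {ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1 suggests the difference in incidence is unlikely to be due to chance alone, though it still describes an association and not a cause.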

Cohort studies are useful if:

  • There is a persuasive hypothesis linking an exposure to an outcome.
  • The time between exposure and outcome is not too long (adding to the study costs and increasing the risk of participant attrition).
  • The outcome is not too rare.

The stages of a Cohort Study

  • A cohort study starts with the selection of a group of participants (known as a ‘cohort’) sourced from the same population, who must be free of the outcome under investigation but have the potential to develop that outcome.
  • The participants should be as similar as possible, sharing common characteristics except for their exposure status.
  • The participants are divided into two groups – the first group is the ‘exposure’ group, the second group is free of the exposure. 
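
The stages above can be sketched in a few lines. The `Person` record and its field names are hypothetical and stand in for whatever baseline data a real study would collect.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """Baseline record for one member of the source population (hypothetical fields)."""
    person_id: int
    has_outcome_at_baseline: bool
    exposed: bool


def assemble_cohort(population: list[Person]) -> tuple[list[Person], list[Person]]:
    """Keep people free of the outcome at baseline, then split them by exposure status."""
    at_risk = [p for p in population if not p.has_outcome_at_baseline]
    exposed = [p for p in at_risk if p.exposed]
    unexposed = [p for p in at_risk if not p.exposed]
    return exposed, unexposed


# Hypothetical source population of four people.
population = [
    Person(1, has_outcome_at_baseline=False, exposed=True),
    Person(2, has_outcome_at_baseline=True, exposed=True),   # excluded: already has the outcome
    Person(3, has_outcome_at_baseline=False, exposed=False),
    Person(4, has_outcome_at_baseline=False, exposed=False),
]
exposed_group, unexposed_group = assemble_cohort(population)
print(len(exposed_group), "exposed;", len(unexposed_group), "unexposed")
```

Excluding people who already have the outcome is what lets the study count new cases and therefore estimate incidence.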

Types of Cohort Studies

There are two types of cohort studies:  Prospective and Retrospective .

How Cohort Studies are carried out

[Figure: how cohort studies are carried out. Adapted from: Cohort Studies: A brief overview by Terry Shaneyfelt (video), https://www.youtube.com/watch?v=FRasHsoORj0]

Which clinical questions does this study design best answer?

  • What risk factors predict disease? Forman et al. (2009) look at dietary and lifestyle risk factors and investigate how they might contribute to hypertension in women.
  • What factors cause these outcomes? Dykxhoorn et al. (2017) look at factors in early life that may predict the occurrence of adolescent suicide.
  • What happens with this disease over time? Suarez (2002) examines the instances of recovery from a first-time episode of psychosis.
  • If the test is positive, what happens to the patient? Young et al. (2018) examine adults recently released from prison who have been diagnosed with both a mental illness and substance use disorder and investigate what happens to them following their diagnosis.

What are the advantages and disadvantages to consider when using a Cohort Study?

What does a strong Cohort Study look like?

  • The aim of the study is clearly stated.
  • It is clear how the sample population was sourced, including inclusion and exclusion criteria, with justification provided for the sample size.  The sample group accurately reflects the population from which it is drawn.
  • Losses of participants to follow-up are stated and explanations provided.
  • The control group is clearly described, including the selection methodology, whether they were from the same sample population, whether randomised or matched to minimise bias and confounding.
  • It is clearly stated whether the study was blinded or not, i.e. whether the investigators were aware of how the subject and control groups were allocated.
  • The methodology was rigorously adhered to.
  • Involves the use of valid measurements (recognised by peers) as well as appropriate statistical tests.
  • The conclusions are logically drawn from the results – the study demonstrates what it says it has demonstrated.
  • Includes a clear description of the data, including accessibility and availability.

What are the pitfalls to look for?

  • Confounding factors within the sample groups may be difficult to identify and control for, thus influencing the results.
  • Participants may move between exposure/non-exposure categories or not properly comply with methodology requirements.
  • Being in the study may influence participants’ behaviour.
  • Too many participants may drop out, thus rendering the results invalid.
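
When participants move between exposure categories, one way analysts can handle it is to split each person's follow-up into intervals and credit the time to whichever category applied. The sketch below assumes a simple list of (start, end, exposed) intervals. That layout and the example history are hypothetical.

```python
def person_time_by_exposure(intervals: list[tuple[float, float, bool]]) -> dict[str, float]:
    """Sum follow-up time separately for exposed and unexposed intervals.

    Each interval is (start_year, end_year, exposed).
    """
    totals = {"exposed": 0.0, "unexposed": 0.0}
    for start, end, exposed in intervals:
        if end < start:
            raise ValueError("interval end must not precede its start")
        totals["exposed" if exposed else "unexposed"] += end - start
    return totals


# Hypothetical participant: smoked for the first 3 years of follow-up, then quit.
history = [(0.0, 3.0, True), (3.0, 10.0, False)]
totals = person_time_by_exposure(history)
print(totals)
```

Summing these totals across participants gives the person-time denominators for separate incidence rates in exposed and unexposed time.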

Critical appraisal tools

To assist with the critical appraisal of a cohort study here are some useful tools that can be applied.

Critical appraisal checklist for cohort studies (JBI)

CASP appraisal checklist for cohort studies

Real World Examples

Bell, A.F., Rubin, L.H., Davis, J.M., Golding, J., Adejumo, O.A. & Carter, C.S. (2018). The birth experience and subsequent maternal caregiving attitudes and behavior: A birth cohort study. Archives of Women’s Mental Health.

Dykxhoorn, J., Hatcher, S., Roy-Gagnon, M.H., & Colman, I. (2017). Early life predictors of adolescent suicidal thoughts and adverse outcomes in two population-based cohort studies. PLoS ONE, 12(8).

Feeley, N., Hayton, B., Gold, I. & Zelkowitz, P. (2017). A comparative prospective cohort study of women following childbirth: Mothers of low birthweight infants at risk for elevated PTSD symptoms. Journal of Psychosomatic Research, 101, 24–30.

Forman, J.P., Stampfer, M.J. & Curhan, G.C. (2009). Diet and lifestyle risk factors associated with incident hypertension in women. JAMA: Journal of the American Medical Association, 302(4), 401–411.

Suarez, E. (2002). Prognosis and outcome of first-episode psychoses in Hawai’i: Results of the 15-year follow-up of the Honolulu cohort of the WHO international study of schizophrenia. ProQuest Information & Learning, Dissertation Abstracts International: Section B: The Sciences and Engineering, 63(3-B), 1577.

Young, J.T., Heffernan, E., Borschmann, R., Ogloff, J.R.P., Spittal, M.J., Kouyoumdjian, F.G., Preen, D.B., Butler, A., Brophy, L., Crilly, J. & Kinner, S.A. (2018). Dual diagnosis of mental illness and substance use disorder and injury in adults recently released from prison: a prospective cohort study. The Lancet. Public Health, 3(5), e237–e248.

References and Further Reading

Greenhalgh, T. (2014). How to Read a Paper: The Basics of Evidence-Based Medicine. John Wiley & Sons, Incorporated, Somerset, United Kingdom.

Hoffmann, T., Bennett, S. & Del Mar, C. (2017). Evidence-Based Practice Across the Health Professions (3rd ed.). Elsevier.

Song, J.W. & Chung, K.C. (2010). Observational studies: cohort and case-control studies. Plastic and Reconstructive Surgery, 126(6), 2234-42.

Mann, C.J. (2003). Observational research methods. Research design II: cohort, cross sectional, and case-control studies. Emergency Medicine Journal, 20(1), 54-60.

  • Last Updated: Jun 13, 2024 10:34 AM
  • URL: https://deakin.libguides.com/quantitative-study-designs

  • Volume 22, Issue 4

What are cohort studies?

  • http://orcid.org/0000-0003-4308-4219 David Barrett 1,
  • Helen Noble 2
  • 1 Faculty of Health Sciences, University of Hull, Hull, UK
  • 2 School of Nursing and Midwifery, Queen’s University Belfast, Belfast, UK
  • Correspondence to Dr David Barrett, Faculty of Health Sciences, University of Hull, Hull HU6 7RX, UK; D.I.Barrett{at}hull.ac.uk

https://doi.org/10.1136/ebnurs-2019-103183

  • statistics and research methods

In 1951, Richard Doll and Austin Bradford-Hill commenced a ground-breaking research project by writing to all registered doctors in the UK to ask about their smoking habits. The British Doctors Study recruited and followed up over 40 000 participants, monitoring mortality rates and causes of death over the subsequent years and decades. Even by the time of the first set of preliminary results in 1954, there was evidence to link smoking with lung cancer and increased mortality. 1 Over the following decades, the study provided further definitive evidence of the health risks from smoking, and was extended to explore other causes of death (eg, heart disease) and other behavioural variables (eg, alcohol intake).

The British Doctors Study is one of the largest, most ambitious and best-known cohort studies and demonstrates the value of this approach in supporting our understanding of disease risk. However, as a method, cohort studies can have much wider applications. This article provides an overview of cohort studies, identifying the opportunities and challenges they present to researchers, and the role they play in developing the evidence base for nursing and healthcare more broadly.

Cohort studies are a type of longitudinal study, an approach that follows research participants over a period of time (often many years). Specifically, cohort studies recruit and follow participants who share a common characteristic, such as a particular occupation or demographic similarity. During the period of follow-up, some of the cohort will be exposed to a specific risk factor or characteristic; by measuring outcomes over a period of time, it is then possible to explore the impact of this variable (eg, identifying the link between smoking and lung cancer in the British Doctors Study). Cohort studies are, therefore, of particular value in epidemiology, helping to build an understanding of what factors increase or decrease the likelihood of developing disease.

Though the most high-profile types of cohort studies are usually related to large epidemiological research studies, they are not the only application of this method. Within nursing research, cohort studies have focused on the progress of nurses through their education and careers. Li et al, as part of the European NEXT study group, recruited almost 6500 female nurses who, at the time of recruitment, had no intention to leave the profession. The study followed the cohort for a year, identifying that 8% developed the intention to leave nursing, often due to issues such as poor salary or limited promotion prospects. 4

Usually, cohort studies should adopt a purely observational approach. However, some research is labelled as a cohort study while exploring the effectiveness of specific interventions. For example, Landsperger et al explored nurse practitioner (NP)-led critical care in a large university hospital in the USA. They collected data on all patients who were admitted to the intensive care unit over a 3-year period. Patients from this cohort were cared for by teams led by either doctors or NPs, and outcomes (primarily 90-day mortality) were monitored. By comparing the groups, the researchers established that outcomes were similar regardless of whether patient care was led by a doctor or an NP. 5

Strengths and weaknesses of cohort studies

Cohort studies are an effective and robust method of investigating cause and effect. As they are usually large in size, researchers are able to draw confident conclusions regarding the link between risk factors and disease. In many cases, because participants are often free of disease at the commencement of the study, cohort studies are particularly useful for identifying the timelines over which certain behaviours can contribute to disease.

However, the nature of cohort studies can cause challenges. Collecting prospective data on thousands of participants over many years (and sometimes decades) is complex, time-consuming and expensive. Participants may drop out, increasing the risk of bias; equally, it is possible that the behaviour of participants may alter because they are aware that they are part of a study cohort. The analysis of data from these large-scale studies is also complex, with large numbers of confounding variables making it difficult to link cause and effect. Where cohort (or ‘cohort-like’) studies link to a specific intervention (as in the case of the Landsperger et al study into nurse practitioner-led critical care 5 ), the lack of randomisation to different arms of the study makes the approach less robust than randomised controlled trials.

One way of making a cohort study less time-consuming is to carry it out retrospectively. This is a more pragmatic approach, as it can be completed more quickly using historical data. For example, Wray et al used a retrospective cohort study to identify factors that were associated with non-continuation of students on nursing programmes. By exploring characteristics in five previous cohorts of students, they were able to identify that factors such as being older and/or local were linked to higher levels of continuation. 6

However, this retrospective approach increases the risk of bias in the sampling of the cohort, with greater likelihood of missing data. Retrospective cohort studies are also weakened by the fact that the data fields available are not designed with the study in mind—instead, the researcher simply has to make use of whatever data are available, which may hinder the quality of the study.

Reporting and critiquing of cohort studies

When reporting a cohort study, it is recommended that STROBE guidance 7 is followed. STROBE is an international, collaborative enterprise which includes experts with experience in the organisation and dissemination of observational studies, including cohort studies. The aim is to STrengthen the Reporting of OBservational studies in Epidemiology. The STROBE checklist for cohort studies - available at https://www.strobe-statement.org/fileadmin/Strobe/uploads/checklists/STROBE_checklist_v4_combined.pdf - includes detail related to the introduction/methods/results/discussion of the study.

Critical appraisal of any cohort study is essential to identify the strengths and weaknesses of the study and to determine the usefulness and validity of the study findings. Components of critical appraisal in relation to cohort studies include evaluation of the study design in relation to the research question, assessment of the methodology, suitability of statistical methods used, conflicts of interest and how relevant the research is to practice. 8–10

Cohort studies are the cornerstone of epidemiological research, providing an understanding of risk factors for disease based on findings in thousands of participants over many years. Disease prevention guidelines used by nurses and other healthcare professionals across the globe are based on the evidence from high-profile studies, such as the British Doctors Study, the Framingham Heart Study and the Nurses’ Health Study. However, cohort studies offer opportunities outside epidemiology: in nursing research, the approach is useful in exploring areas such as factors that influence students’ progression through their programme or nurses’ progression through their career.

This approach to research does bring with it some important challenges, often related to the size, complexity and longevity of the studies. However, with careful planning and implementation, cohort studies can make valuable contributions to the development of evidence-based healthcare.

  • Colditz GA, Philpott SE, Hankinson SE
  • Galatsch M, Siegrist J, et al
  • Landsperger JS, Semler MW, Wang L, et al
  • Aspland J, Barrett D, et al
  • von Elm E, Altman DG, Egger M, et al
  • Rochon PA, Gurwitz JH, Sykora K, et al
  • Critical Appraisal Skills Programme

Competing interests None declared.

Patient consent for publication Not required.

Provenance and peer review Commissioned; internally peer reviewed.

Cohort Study (Retrospective, Prospective): Definition, Examples

A cohort study, used in the medical fields and social sciences, is an observational study used to estimate how often disease or life events happen in a certain population. Measures commonly calculated from cohort data include incidence rate, relative risk and absolute risk.
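
As a rough illustration of absolute and relative risk, the sketch below computes both from a simple table of counts for an exposed and an unexposed group. The counts, the class name RiskMeasures and the function name risk_measures are hypothetical, chosen only for this example; they do not come from any study mentioned here.

```python
from dataclasses import dataclass


@dataclass
class RiskMeasures:
    risk_exposed: float     # absolute risk in the exposed group
    risk_unexposed: float   # absolute risk in the unexposed group
    relative_risk: float    # exposed risk divided by unexposed risk
    risk_difference: float  # exposed risk minus unexposed risk


def risk_measures(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Compute basic risk measures from (hypothetical) cohort counts."""
    if exposed_total <= 0 or unexposed_total <= 0:
        raise ValueError("group sizes must be positive")
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    if risk_unexposed == 0:
        raise ValueError("relative risk is undefined when the unexposed risk is zero")
    return RiskMeasures(
        risk_exposed=risk_exposed,
        risk_unexposed=risk_unexposed,
        relative_risk=risk_exposed / risk_unexposed,
        risk_difference=risk_exposed - risk_unexposed,
    )


# Hypothetical cohort: 500 smokers (30 develop the disease)
# and 500 non-smokers (10 develop the disease).
result = risk_measures(30, 500, 10, 500)
print(f"Absolute risk, exposed:   {result.risk_exposed:.3f}")
print(f"Absolute risk, unexposed: {result.risk_unexposed:.3f}")
print(f"Relative risk:            {result.relative_risk:.2f}")
print(f"Risk difference:          {result.risk_difference:.3f}")
```

With these made-up counts, the exposed group's risk is about three times that of the unexposed group. A real analysis would also report confidence intervals and adjust for confounding variables.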

Cohort Study Classification: Prospective, Retrospective, Case

Cohort studies can be grouped in several ways:

  • Prospective: none of the subjects have the disease (or other outcome) being measured when the study commences; data analysis happens after a period of time has elapsed.
  • Retrospective (Historical): the researcher looks at historical data for a group. Some of the people in this group have developed the disease, and some have not. This can result in finding out who has the disease and when they developed it.
  • Case-control nested within a cohort: a smaller group is chosen from within the cohort for a deeper look. These investigations may include genotyping, collecting tissue samples or other factors.
  • Case-cohort: similar to case-control nested within a cohort. The difference is that in a case-cohort study, participants are evaluated for outcome risk factors at any time before the first outcome (i.e. the first incidence of disease).

What is a Prospective Cohort Study?

A prospective cohort study takes a group of similar people (a cohort) and studies them over time. At the time the baseline data is collected, none of the people in the study have the condition of interest. This is in contrast to a retrospective cohort study, which uses historical data on a group in which some people have already developed the condition and looks back to see how earlier exposures relate to that outcome. The now famous Framingham Heart Study is one example of a prospective cohort study; the researchers have, to date, studied three generations of Framingham residents in order to understand the causes of heart disease and stroke.

Although none of the participants actually have the disease of interest in a prospective cohort study, some of the cohort are expected to develop the disease in the future. For example, a cohort of thirty-year-old people in a certain town might be studied to see who develops lung cancer. Half of the cohort might be smokers and half may not. This enables comparisons between the two groups.

Once the prospective cohort study has been established, researchers follow up with the participants and track their progress. Follow-ups can be:

  • In-person interviews.
  • Imaging tests.
  • Internet questionnaires.
  • Mail questionnaires.
  • Phone interviews.
  • Physical exams.

A combination of the above methods may also be used.

Advantages and Disadvantages

  • One major advantage of a prospective cohort study is that researchers don’t have to grapple with the ethical issues surrounding randomized controlled trials (i.e. who receives a placebo and who gets the actual treatment).
  • Incidence and prevalence of a disease can be easily calculated.
  • Multiple diseases and outcomes can be studied at the same time.
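
To illustrate how an incidence figure might be calculated from follow-up data, here is a minimal sketch that expresses new cases per 1,000 person-years. The follow-up records and the function name incidence_rate are hypothetical and serve only as an example.

```python
def incidence_rate(new_cases, person_years, per=1000):
    """Incidence rate expressed per `per` person-years of follow-up."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return new_cases / person_years * per


# Hypothetical follow-up records: (years observed, developed the disease?).
# A participant who drops out or develops the disease contributes only
# the time for which they were actually observed.
follow_up = [
    (10.0, False),
    (4.5, True),
    (10.0, False),
    (2.0, False),   # dropped out after 2 years
    (7.5, True),
]

total_person_years = sum(years for years, _ in follow_up)
new_cases = sum(1 for _, developed in follow_up if developed)
rate = incidence_rate(new_cases, total_person_years)

print(f"{new_cases} new cases over {total_person_years} person-years")
print(f"Incidence rate: {rate:.1f} per 1,000 person-years")
```

Using person-time in the denominator is one common way to account for participants who are followed for different lengths of time.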

Disadvantages

  • Selection bias and confounding variables can be a problem.
  • Cohort studies can be expensive and time consuming.
  • Sample sizes required are usually very large.

What is a Retrospective Cohort Study?

A retrospective cohort study (also known as a historical cohort study) is a study in which the disease or outcome of interest has already occurred in some participants by the time the research begins. The study looks back into the past, using existing records, to determine when participants were exposed to a risk factor and how that exposure relates to the outcome. In a retrospective cohort study the researcher:

  • Uses historical data to identify members of a population who have been exposed (or not exposed) to a risk factor.
  • Assembles a group to be studied.
  • Determines the current status of the disease or outcome in the participants.

Prospective vs. Retrospective Cohort Study

In a retrospective cohort study, the exposure and the disease/outcome have both already occurred when the study begins, so some of the group already have the disease/outcome. In a prospective cohort study, the group does not have the disease/outcome at the start, although some participants usually have high risk factors.

Retrospective example: medical records are used to identify a group of 100 people who had high risk factors for AIDS 20 years ago and a second group of 100 people who had low risk factors; the records are then checked to see who in each group went on to develop the disease. Prospective example: a group of 100 people with high risk factors for AIDS are followed for 20 years to see if they develop the disease. A control group of 100 people who have low risk factors are also followed for comparison.

A retrospective cohort study can be combined with a prospective cohort study: the researcher takes the retrospective study groups, and then follows the cohort in the future.

What is the Cohort Effect?

Cohort Effect Example

Let’s say you were conducting cross-sectional research (a method that compares different age groups at the same point in time) to find out how basic mathematics ability improves with age. You give the same basic math standardized test to groups of students who are 7 years old, 14 years old, and 21 years old. You get the following mean results:

  • 7-years-old: 24% correct
  • 14-years-old: 48% correct
  • 21-years-old: 72% correct

You might conclude that every 7 years that pass makes a difference of 24 percentage points in scores. However, what you haven’t accounted for is the cohort effect. The students differ not only in age; they also belong to different cohorts (in this case, groups of people born around the same time), some of which may have grown up when basic mathematics was strongly emphasized in schools. If the 21-year-old cohort in the above study experienced strong emphasis on basic math, it’s a possibility that they could have achieved 72% when they were 14 years old or even 7 years old.

The problems associated with the cohort effect can be lessened by testing the same cohort over a period of time, a method called longitudinal research. In the above example, you would test a group of 7-year-olds, then test the same group every 7 years. A disadvantage to longitudinal research is that it’s costly, and dropout rates can affect the results.
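
The difference between the two designs can be sketched with a toy model in which each score is the sum of an age effect and a cohort effect. All numbers, cohort labels and names below (AGE_EFFECT, COHORT_BONUS, score) are assumptions made up for illustration; they are chosen so that the cross-sectional results match the 24%, 48% and 72% figures in the example above.

```python
# Hypothetical additive model: score = effect of age + effect of cohort.
AGE_EFFECT = {7: 24, 14: 48, 21: 60}   # assumed "true" effect of age alone
COHORT_BONUS = {                       # assumed effect of schooling emphasis
    "youngest": 0,
    "middle": 0,
    "oldest": 12,                      # this cohort had extra emphasis on basic math
}


def score(age, cohort):
    """Mean test score (percent correct) for a cohort tested at a given age."""
    return AGE_EFFECT[age] + COHORT_BONUS[cohort]


# Cross-sectional design: three different cohorts tested at the same time.
cross_sectional = {
    7: score(7, "youngest"),
    14: score(14, "middle"),
    21: score(21, "oldest"),
}

# Longitudinal design: the same cohort tested at ages 7, 14 and 21.
longitudinal = {age: score(age, "youngest") for age in (7, 14, 21)}

print("Cross-sectional:", cross_sectional)
print("Longitudinal:   ", longitudinal)
```

In this toy model the cross-sectional comparison suggests a gain of 24 points between ages 14 and 21, while following a single cohort shows a gain of only 12; the remainder is the assumed cohort effect rather than an effect of age.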

Image: SUNY Downstate. Retrieved April 1, 2018 from: http://library.downstate.edu/EBM2/2400.htm

Healthtalk

What are cohort studies and why are they important?

We spoke to people who had taken part in or been invited to join cohort studies. The term ‘cohort study’ was often unfamiliar and could be confusing.

What are cohort studies?

A cohort study identifies a group of people and follows them over a period of time. The aim is to look at how a group of people are exposed to different risk factors which may affect their lives. Cohort studies can look at many different aspects of people’s lives, including their health and/or social factors. Cohort studies can be prospective (meaning that data are collected as individual lives unfold), or retrospective (meaning that researchers look at historical data about individuals after a certain outcome, such as diagnosis with a disease, has occurred).

Cohort studies can collect data on people that covers a long period of time. The longest running is a British birth cohort study which started in 1946 when study members were born and has now been following up participants for over 70 years.

Jenny, a senior researcher, describes what a cohort study is.

Gender Female

A cohort study is a type of medical research study. It’s different to a randomised trial which is, in which people may be given a drug or type of intervention and it’s more of a long-term study where you’re just following a group of participants for a long time. It’s particularly useful to find out risk factors for certain, developing certain diseases and often you want to follow the people until an outcome occurs and this may be developing a disease, it may be dying, so you may be following people up for life or it may be something else.

A cohort study will follow people up and be looking more at risk factors for, for developing a disease. So, for example, you may want to work out if smoking causes cancer and this can’t really be done in any kind of other study design, you can’t ask people to start smoking so you can see what effect this has.

So you’ll follow up a group of people and compare people who smoke with those who don’t smoke and look at their outcomes long-term and those outcomes, because they’re likely to be health outcomes like developing cancer or respiratory illness or dying means that you’ll have to follow those people up for a long time.

There are various designs of cohort studies. Some are referred to as ‘birth cohorts’, because the people within the group being researched were born in the same time period (a week, a year, a few years) and/or location (city, region). Some cohort studies are interested in a particular health condition or event and follow people who have experience of these (such as having motor neurone disease/MND, or having had a transient ischaemic attack/TIA/‘mini stroke’ or stroke). Other studies involve healthy volunteers. Across the people we spoke to, there were experiences of all of these different types of cohort study and other types of medical research.

A number of biobanking studies recruit large groups of people across the population, collect samples from them over time, and then look at the data to try and work out why some people go on to develop particular illnesses when others don’t. Some biobanking studies recruit healthy people from across the population. Others look at particular conditions, such as motor neurone disease (MND), and recruit people already diagnosed with the condition and sometimes healthy volunteers too.

The length of time that studies can last varies. For some people we spoke to, the study they were involved in was ongoing. For others, the study had closed and/or their participation in it had stopped. Most people we spoke to were comfortable with the length of time their study was running for, although some people who had been in birth cohort studies for many decades were surprised that the research was still going.

Jenny, a senior researcher, talks about the difference between a prospective and retrospective cohort study.

Yes there are-, cohort studies can be prospective or retrospective generally and a prospective study means that you recruit a group of people who consent to be in the study and you follow that group up over time you collect data throughout the follow up generally, I mean sometimes you’ll just collect data at baseline and then follow them up, find out what their outcomes are longer term but you may want to do annual follow ups where you give people questionnaires to find out about their lifestyles, to find out about their smoking habits whether they’ve developed other illnesses, take some measurements, potentially find out their weight and height so you can get their BMI, find out if they smoke, find out if they’ve got another disease which may affect your outcome. A retrospective cohort study is the other possibility and that’s using generally a dataset that already exists so you already have all the data that you need and you don’t need that long term follow up because it may be hospital records, it may be GP records and you can look at similarly a group of participants, for example, a group of people with and without diabetes and do some of them go on to develop more complicated health problems, but you’ve got all that in the records already. So, you’re doing it retrospectively, so you don’t have to follow them up for 20 years to find out what happens.

Alan Z is part of a five-year renal (kidney) study. He doesn’t feel this is too long a commitment, and thinks it is a good length of time for the researchers to gather data.

Age at interview 86

Gender Male

Yeah well, I had to do it over a certain number of years because to get a comparison of how you’re functioning over a period of time. I suppose two years would not be enough; three years would hardly be very much about it. I suppose five years, you’ve got a broader landscape to see how the actual performance of the kidneys is actually bearing up.

We spoke to people who had taken part in a number of different cohort studies. These studies varied in terms of how long they had been running, what taking part involved, what the research topics and study aims were, the characteristics of the participants, and so on. In this Healthtalk resource, we haven’t directly named the specific cohort studies that people took part in but some details about study aims and activities mean that the studies are identifiable.

For some people, there were certain features of the cohort study that were meaningful to them. This included studies looking at people born in a particular time frame (the same week, or same few years) or location (a city or region). Linda, who took part in a birth cohort study, reflected on her upbringing and how times had changed. Barbara said she “now feels quite privileged to be part of this study” because it was unique and pioneering in the length of time it had been running. Margaret also said she felt “very proud” to be part of her study. Nadera and Mr S took part in a study focused on their home city and they talked about there being high rates of health problems, like diabetes, and obesity.

A few people, like Emily and Keith, had taken part in more than one cohort study. Brian, for example, was part of a birth cohort study and a biobanking study. Sometimes they could remember distinct differences between them; other times, their experience of studies blurred together. A few people found it difficult to separate out their experiences of health care in general and their experiences of health research; sometimes they felt that the two overlapped or were the same thing.

Why are cohort studies important?

In a general sense, most people we talked to felt that the studies they had taken part in were important. As Iram said, these studies are ultimately being done “to help people”. Often people said they hoped the studies would lead to improved knowledge about health. Margaret highlighted some important findings that had come out of these types of studies, such as the links between smoking and lung cancer. Ian was in a biomarker study (looking at particular characteristics of the samples) which “seems to be on the verge of some exciting results, and we’ll wait and see what happens.”

Jenny, a senior researcher, describes why cohort studies are important.

So the reason that cohort studies are important is that they can answer health questions which you couldn’t answer in another way. So, for example, the study I work on, chronic kidney disease, you can’t do a randomised trial of whether chronic kidney disease gives you results in dialysis or results in cardiovascular health problems.

So you have to just follow up a population and look within that population at what other risk factors they may have, which then may make them more likely to go onto develop long-term health problems or cardiovascular disease. So, and similarly with smoking, you can’t ask some people to smoke and not to smoke, so you take a population, find out who smokes, who doesn’t smoke and follow them both simultaneously, and that could apply to a lot of things, that could apply to weight or other lifestyle dietary issues, dietary behaviours. To, yeah, long-term find out how this affects their health outcomes.

Some people emphasised that cohort studies had a unique and special role in improving understandings about health. Keith talked about cohort studies gathering “information on a grand scale and over a long time.” Teresa felt it was valuable to have data “that spans a lifetime.” Barbara felt that, as the study she participated in had “gone on and on, it just seems more and more important.”

People talked about cohort studies allowing researchers to make ‘links’ or ‘connections’ and to see ‘patterns’ and ‘trends’ associated with certain outcomes. Lucy saw cohort studies as important for being able to “isolate socioeconomic or other factors that might be influential.” Douglas recognised that there could be unexpected factors and that researchers “might pick up something that is important which might not be important to me but might be important to them” and their research aims.

Alan Z sees cohort studies as having an important role in understanding health in relation to lifestyle and diet.

Well because throughout my life I’ve obviously heard or witnessed many sort of medical advances which even my own family have benefited from and obviously it’s all for the good of, say, everyone, you know, not just in the UK but throughout the world, these wonderful breakthroughs in medical technology are achieved and a lot of it’s done by, obviously lots and lots of painful long studies of humans and guinea pigs shall we say, and obviously they can see a pattern. I mean, I can’t think of an example but I have read where due to all, this intensive research, I suppose, like, in ongoing heart-related diseases and cancer diseases, not only is it just working in laboratories but a lot of it is done by this type of study where they get the demographics of a people or a race of people even and find out what makes them tick, something comes up they’re taking a certain type of lifestyle. Like, they hear quite often Mediterranean lifestyle where people are on a very high diet of olive oil, you know, and fruit based, vegetable based and drink probably a lot of wine, [laughs] but that’s part of it, they don’t have a highly intensive sort of fat based diet, like we in the UK do, and they seem to avoid a lot of heart-related issues.

But yeah, so, obviously, looking at all those issues they can determine certain traits and then they can narrow down onto it and, I suppose, take the study forward through doing various other measures, you know, research in laboratories etc.

Margaret feels birth cohort studies like hers are needed to improve our understanding about medical and social issues.

Age at interview 73

And what’s your general understanding of why this study is needed?

I think for various reasons actually. It’s to advance medical research. It’s to advance social research. It’s important because it’s got such a widespread, and because it’s gone on for so long, they can see changes and they will. After all it’s coming in at the beginning of the National Health Service almost. And if you look at, because they’ve now written a book about it, there’s one or two things that we didn’t know before, like, for instance, they’ve got a list of things that mothers had before they had their babies and you think of the, all the luxury items now and what it was there and then the questions that the mothers were asked. Apparently, they were asked, ‘If they had any pain relief during labour?’ ‘Who looked after your husband after you had the baby?’ [Laughs] which is an interesting question and, you know, ‘How soon they were back doing their normal life and so on?’

Mr S describes cohort studies as allowing links to be made between risk factors and outcomes in health.

Age at interview 35

The reason being is so that they can obviously try to pinpoint and try to evaluate and conclude situations regarding conditions and illnesses that have occurred. Whether anything is linked to a particular diet or linked to a particular genes or – for example. So, it’s basically what they’re trying, it’s just so they have all that information and that information can then be fed through. Obviously, the more information you have, the more likely you are to get an accurate record of what’s been said, what’s been done. Does that information then link to anything towards a particular condition, is it because of this for example, is this particular condition linked because of this? But you won’t know that until you’ve got the information there. But that’s why it’s very important to collect the information because then you can take out something that’s not going to be linked with that. ‘Oh well we’ve found out this, but it’s not in relation to this. So, we can remove that from that and just see if it’s linked to something else,’ you know what I mean? So, it’s very important.

Others were unclear how cohort studies were different to other types of medical research. For some, the differences between types of medical research weren’t important; most felt that, providing the outcomes of research were to improve health and healthcare practice, then it was a good thing to support.

Often people thought future generations would benefit from these studies and didn’t expect there to be any direct benefit from the findings for them. Some hoped though that they and other people like them would also benefit. Eisha felt there were lots of everyday things that cohort studies could make better, such as improving green spaces, advising on oral hygiene, and encouraging dads to be more involved with their children.

By improving knowledge about health, many people felt that these studies could then improve the practices of healthcare professionals and treatment options. Alan Y explained, “It’s knowledge for them [doctors] and the more knowledge they have, the better healthcare they can give down to people like me.”

John is pleased to be a data point and to see research moving forward.

Age at interview 68

Well, if I can provide a data point in their research fantastic, just to be a data point is, is hopefully of use to them because the more people that they assess, the more understanding will emerge from that and the team is quite often in the broadsheet press, as it were, about recommendations and research findings. Oh, that’s interesting. They’re still extremely active and our understanding of stroke/TIAs/the whole, you know, working of the brain is moving forward rapidly only because of research. So, if I’m a data point in that research that’s good news rather than my own experience being completely of no use to anybody, much better.

Jenny, a senior researcher, explains why it is often a long time before findings can be drawn out of a cohort study and shared with others.

So, this is one of the downsides of being a participant in a cohort study, is that if it’s long term health outcomes it’s going to be a long time before you hear anything. Part of the reason for this is the outcomes are actually not occurring for many years, but another part is that you can’t really feedback too much to the patients until they’ve had, until the paper, until the work has been published and seen by the scientific community and approved by a journal. So that can take a long, long time.

We do try to give newsletters saying, you know, feeding back that participation is, the participants contribution is really valuable and we value it and we want them to keep going. And then obviously when we do have results that are published, we’ll try and present those in a format which is meaningful to the participants, the study participants so that they understand exactly what we have done and why we have done it.

Some people had a clear grasp of the aims of the specific cohort studies they had taken part in. Others were unsure, sometimes because they had begun participating many years ago and had now forgotten this information. Most people in this situation were not worried and they felt confident that the research was being done for “a good reason”, whatever that reason was.

Derek couldn’t recall the details of why the study was being carried out, but he felt sure that it was “for good purposes”. Ronald felt that the study “must have ended up helping people somehow. But I wouldn’t know how”, otherwise it wouldn’t have been continued. Margaret Ann said, “If someone somewhere down the line thinks this is interesting, well I’m quite happy to answer the questions.” As a child, Jade withdrew her participation in a twins study but is now thinking about re-joining. She thinks it would have helped if a researcher had explained the value of this type of study at the time and the types of questions they hoped the research might answer.

Whilst Nadera felt research was important to enhance knowledge about health, she was also keen to see more action to improve people’s lives, including around poverty and access to healthy foods. She felt that some studies don’t lead to any actual change.

Some people, like Ian, felt that the researchers running these studies were themselves unsure about how they would use or share the data collected in the future. Ian thought that there might be new and unexpected uses of his data, which he saw as a positive thing, and that data sharing with other researchers “could lead to some breakthrough.”

Ian had blood samples taken as part of a motor neurone disease (MND) study. He isn’t certain how the researchers will use these samples to meet their research aims but feels okay about this.

Age at interview 54

Age at diagnosis 51

Probably a little more vague on the blood samples, because I think they’re still struggling to understand how they’re going to use them. But yeah, very open about the fact, ‘You will be part of a blood bank and we may use them, we may actually share them with other people. Do you have any problem with that?’ No, I don’t, you know. It may not be so accurate on, ‘this blood test will do this and this blood test will do that,’ but as they went along, they, if they took four, five, six phials of blood they would sort of tell me roughly why they were going to use them.

Lucy knows she has been in a study since she was a baby but doesn’t know the name or the aim of it.

Age at interview 30

So, this is a little bit of a tricky one because I have no idea what the name of the study that I’ve been part of what the name of it is. I’ve been in it for over 30 years, as far as I’m aware I’m still enrolled in it. It was one that I was enrolled into when I was a newborn by my mum, and she has no idea what the name of the study is, she’s got no paperwork on it. She doesn’t really know what the aims of it were [laughs].

I don’t really remember receiving a leaflet or anything about the study, maybe it was just assumed because I’d been in it for a while that I did know and that it had been communicated to me by my mum, but I don’t think it had, I think I just accepted [laugh] that it was a thing I did and that was alright. And I don’t-, I don’t know if it would be in medical notes either. I did think about-, because I spent quite a long time trying to find it online actually, this study. And to work out whether it was run by paediatrics, or dermatology, or oncology, or- or exactly where it was coming from and I couldn’t find anything. Which could be because, you know, if it’s been running for 30 years or it’s stopped a while ago, I don’t know.

However, not everyone we interviewed felt this way and some had concerns. Richard declined to take part in a biobanking study because it wasn’t clear exactly what the study was hoping to achieve. He explained, “Probably I felt some misgiving about the nature of it, in as much as it’s a long-term study without specific aims that I could identify… the non-specificity of it.” After a long gap without communication from the research team, Isobel had attended an event where they shared some findings. However, because she didn’t know much about the background, she said it made her feel “out of my depth” and “ill-informed” about the purpose of the study.

Often people said they would like to understand more about the study they had taken part in, especially what it aimed to find out and what it has found out so far (see also sections on communication with research staff and key messages to research staff). Lucy has been in a study since she was a baby and found it an “annoyance actually that I just don’t know anything about the study.” She couldn’t remember if it had been explained to her when she was older or if the researchers had “just assumed” she already knew.

The future of cohort studies

A few people talked about the future of the specific studies they were part of. Some people taking part in birth cohort studies wondered when their study would end and said that they hadn’t thought it would have lasted as long as it has.

Jess remembers a question from the survey she filled in when she was a child. She didn’t expect the study to still be following her up so many decades later.

Age at interview 67

Well, it’s not, it’s just useful to do and I think if I’d thought, I wouldn’t have thought 50 years ago, and my mum and dad probably didn’t even think there would be a follow up. It would just be information that goes into not a computer then, but some sort of thing that would be followed up that, so a percentage of so and so had one brother or five brothers or whatever and they were, and the conditions they were living in and they, I think there was a question in the survey about income and my dad was self-employed and I don’t know, I can’t remember the figure that he put in, but I remember thinking, Gee, that’s a lot of money that you’re, that you’re having. So I thought, you know, when I think about it now, they were probably looking at our conditions, our income, our social status, our what might happen, what would happen to us in the future, but I didn’t know that, I didn’t expect or ever think that anything would come of it.

Some cohort studies involved sub-projects with new interests or aims that have developed over time. Some aims had evolved based on new knowledge, technology, the availability of funding and interest from other researchers who would like to use the data.

Barbara is in a birth cohort study which has been extended several times. She thinks that securing funding to continue data collection has always been an issue.

I mean obviously this started with our mothers it was a study initially to look at pregnant mothers actually, pregnant women pre- sort of birth or whatever and I guess it’s always been part of my life. I, from a toddler just going to have these special check-ups so it’s just always been there. And it’s, at some points I think they, I mean this whole study, I think funding’s always been an issue and there’ve been gaps when I think once we got to seven or eight that was going to be the end of the study but then the Medical Research Council, whoever they were then actually got the extra funding to go onto maybe teenage and then it’s always, we’ve always thought that was going to be the end of it but it’s gone on and on.

A few people we spoke to had been involved in advising research teams on future sub-studies they were planning. Salma liked learning about and “shaping the research.” Brian thought it was good that the researchers were “consulting us before they start the research.”

Some people who had been taking part in a study for many years talked about noticing when the research took a particular interest in a topic or a new line of inquiry. Often this was noticed in questionnaires. Barbara had completed a questionnaire every few years and she recalled that “there was one year, the survey was all about smoking and cancer; then another year you can just see that there’s an emphasis on maybe heart [health].” Gill and Brian had also noticed new research interests developing over time in the studies they were part of.

Brian is in a birth cohort study. He talks about how the emphasis of questions has evolved over time, and the focus is now on ageing.

I’m not sure that it’s really changed I suppose that the, what they’re researching is changing, you know, from being I think it started with just sort of health being a baby, how there was concern that a lot of children were dying, I think that’s how this research started. But over the years research has changed significantly, it’s to the point now where it’s all about ageing and how we’re coping with all the problems that arise as you get older but, it’s changed right through, each time there’s a different emphasis on, on what’s on what they’re looking at, so. So that’s the main thing and so I suppose the research that they do and the questions they ask are sort of related to that research so.


  • Can J Hosp Pharm
  • v.67(5); Sep-Oct 2014


An Introduction to the Fundamentals of Cohort and Case–Control Studies

INTRODUCTION

As pharmacotherapy experts, pharmacists are continually updating their knowledge about drug effects. In addition to being knowledge users of research findings, pharmacists increasingly play a larger role in observational studies of drug effects. Observational studies are inherently nonexperimental and, unlike randomized clinical trials (RCTs), do not involve any manipulation (such as randomization) of the treatment and control groups by the investigator.

This article reviews for the practising pharmacist the fundamental design elements and foundational methodologic knowledge for conducting cohort and case–control studies, 2 common and robust observational study designs for elucidating drug–outcome associations. Readers interested in learning about other observational study designs, such as cross-sectional studies, ecological studies, case series, case reports, within-person studies, and quasi-experimental designs, or the critical appraisal of such designs, are referred elsewhere. 1 – 6

WHY WE NEED COHORT AND CASE–CONTROL STUDIES

We need well-designed and rigorous cohort and case–control studies because their findings provide knowledge complementary to that garnered from RCTs ( Table 1 ). The design properties of RCTs maximize their ability to estimate the potential causal effects of drugs under ideal circumstances and thereby to estimate the efficacy of those drugs. However, many RCTs involve a relatively limited number of highly selected patients and a limited duration. Indeed, RCTs typically follow patients for only a small fraction of the time that the drug would be used in clinical practice, especially when the medications are for chronic diseases. Moreover, RCTs typically exclude complex patients, they often use irrelevant comparators (e.g., placebo), and they frequently measure outcomes that are not patient-centred (i.e., surrogate end points). 7 Although many of these limitations may be overcome by designing more pragmatic RCTs that do indeed measure effectiveness, 8 cohort and case–control studies are 2 feasible study design alternatives that address the limitations of RCTs ( Table 1 ) without the considerable financial and human resource costs of pragmatic RCTs.

Table 1. Limitations of Randomized Clinical Trials (RCTs) Potentially Addressed by Cohort and Case–Control Studies

Limitation of RCTs | Cohort and case–control studies
Use of a strict study protocol that is often not representative of typical care | Usually representative of settings of routine medical care
Exclusion of key patient populations, such as children, pregnant women, and elderly people | May focus on vulnerable and under-represented populations
Limited sample size | May include large number of patients, especially if secondary data sources are used, thereby allowing rare events to be detected
Short duration | May follow patients for long periods of time (e.g., years)
Evaluation of irrelevant treatment comparisons | May compare several relevant therapies
Outcomes measured may not be important to the patient (e.g., surrogate end points) | May include any outcome that is measurable within the data source
High cost | Relatively low cost

COHORT STUDIES

A cohort is a group of people who share a common experience or characteristic. The term “cohort” first appeared in the medical literature in the 1930s in an article by epidemiologist W H Frost. 9 Interestingly, the word “cohort” has military roots, originating from the Latin word “cohors”. 10 The term was first used in the Roman military, where a group of 300 to 600 soldiers constituted a cohort. 11

A cohort study compares the experience of 2 or more groups of patients who are followed concurrently forward in time ( Figure 1 ). This prospective tracking, from exposure to outcome, is in fact one of the defining features of a cohort study. 11 The temporal sequence involved in following a group of patients who are exposed to a certain factor (the treatment group) and a group of patients who are not exposed to that factor (the control group) is akin to that of a clinical trial, where instead of chance determining a patient’s exposure status (as occurs in an RCT), choice or happenstance determines exposure status.

Figure 1. Schematic for the design of cohort and case–control studies.

Selecting the Study Cohort

For any cohort study, a source population must be defined, from which the eligible study cohort is derived through application of various inclusion and exclusion criteria. At a minimum, patients entering the study cohort must be free of the outcome of interest. For example, in a cohort study designed to measure the association between atypical antipsychotics and diabetes mellitus, patients with diabetes would have to be excluded from the study cohort because they are not at risk of the outcome. Often, other restrictions are put in place to minimize the risk of bias. For example, restriction to new users of a medication will ensure avoidance of multiple biases. 12 Inclusion of prevalent or current drug users can lead to significant bias because patients who experience early intolerance or adverse effects of a drug may discontinue the drug, and the remaining cohort will consist of a healthier and usually more adherent group. 13 Risk that varies over time, whereby new users have a higher risk of an adverse event, has been observed for numerous associations, including those between nonsteroidal anti-inflammatory drugs and upper gastrointestinal bleeding, 14 oral contraceptives and venous thromboembolism, 15 benzodiazepines and falls, 16 and angiotensin-converting enzyme inhibitors and angioedema. 17
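The two eligibility rules described above (freedom from the outcome at cohort entry, and restriction to new users) can be sketched in a few lines of Python. This is a minimal illustration only: the `Patient` record, its field names, and the four example patients are hypothetical and are not taken from any study cited here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Patient:
    """Hypothetical record; field names are assumptions for illustration."""
    patient_id: str
    first_antipsychotic_rx: date        # date of cohort entry
    diabetes_diagnosis: Optional[date]  # None if never diagnosed
    prior_antipsychotic_use: bool       # any use before the study window


def eligible(p: Patient) -> bool:
    """Keep patients who are outcome-free at entry and who are new users."""
    outcome_free = (
        p.diabetes_diagnosis is None
        or p.diabetes_diagnosis > p.first_antipsychotic_rx
    )
    new_user = not p.prior_antipsychotic_use
    return outcome_free and new_user


source_population = [
    Patient("A", date(2012, 3, 1), None, False),
    Patient("B", date(2012, 5, 10), date(2010, 1, 15), False),  # diabetes before entry
    Patient("C", date(2012, 6, 20), date(2013, 2, 1), False),   # diabetes after entry
    Patient("D", date(2012, 7, 5), None, True),                 # prevalent user
]

study_cohort = [p.patient_id for p in source_population if eligible(p)]
print(study_cohort)  # ['A', 'C']
```

Patient C stays in the cohort because the diagnosis occurs after entry, so it can be counted as an outcome event during follow-up.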

Defining Drug Exposure Groups

Once the study cohort has been created, 2 or more exposure groups must be clearly defined, 1 of which must serve as the control or reference group. The reference group should be clinically relevant. For example, in a comparative safety or effectiveness study, patients taking a drug within the same therapeutic class or receiving usual care may serve as the reference group. If clinically and scientifically relevant, a group with no therapeutic exposure may be the reference group. Drug exposure may be measured in terms of persons or person-time (the time for which a person is exposed to a particular drug). Drug exposure is often categorized in a binary fashion (i.e., yes or no), based on either a minimum number of prescription records (e.g., at least 3 records) or a specified duration of exposure (e.g., at least 90 days’ exposure), or a combination of these 2 factors (i.e., cumulative exposure). Irrespective of how exposure is defined, it is essential that follow-up time be properly categorized following entry into the cohort to avoid time-related bias. 18 Furthermore, the definition of exposure should be coherent with the study hypothesis. For example, a certain amount of time or a certain dose of drug may be required to elicit an effect, or a drug may continue to have an effect once discontinued (e.g., bisphosphonates). Moreover, decisions about when to discontinue drug exposure must be made. There are 2 common approaches: “as treated”, whereby drug exposure is recorded as being stopped when a person no longer meets the definition of exposure; and “intention-to-treat”, whereby a person is considered exposed from the time of first meeting the study’s exposure definition until experiencing the outcome of interest or the end of the study, irrespective of changes in actual exposure status. 
There is no consensus on how to best define drug exposures, and hence the definitions of exposure often vary considerably among cohort studies assessing identical drug–outcome associations.
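To make the difference between the “as treated” and “intention-to-treat” approaches concrete, the sketch below counts exposed person-days for one hypothetical patient under each approach. The function names, the choice to combine the record and duration thresholds with a logical OR, and all dates are assumptions made for illustration; a real study would fix these rules in its protocol.

```python
from datetime import date
from typing import Optional


def is_exposed(n_records: int, days_supplied: int,
               min_records: int = 3, min_days: int = 90) -> bool:
    """Binary exposure flag: enough prescription records OR enough days supplied."""
    return n_records >= min_records or days_supplied >= min_days


def exposed_person_days(first_exposed: date, stopped: Optional[date],
                        outcome: Optional[date], study_end: date,
                        approach: str) -> int:
    """Days counted as exposed from first meeting the exposure definition."""
    if approach not in ("as_treated", "intention_to_treat"):
        raise ValueError(f"unknown approach: {approach}")
    # Follow-up always ends at the outcome or the end of the study.
    end = study_end if outcome is None else min(outcome, study_end)
    # Under "as treated", exposure also ends when the drug is stopped.
    if approach == "as_treated" and stopped is not None:
        end = min(end, stopped)
    return (end - first_exposed).days


patient = dict(first_exposed=date(2013, 1, 1), stopped=date(2013, 7, 1),
               outcome=None, study_end=date(2014, 1, 1))
as_treated_days = exposed_person_days(approach="as_treated", **patient)
itt_days = exposed_person_days(approach="intention_to_treat", **patient)
print(as_treated_days, itt_days)  # 181 365
```

The same patient contributes roughly twice as much exposed person-time under intention-to-treat, which is one reason estimates can differ between studies of the same drug–outcome association.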

Measuring Occurrence of Outcomes

Complete and accurate measurement of the outcome of interest is essential to ensure the validity of study results. When subjective outcome data (e.g., diagnosis of pneumonia) are being collected during the study period, exposure status should be blinded for the outcome assessors and adjudicators, to prevent responder bias. When previously collected data (i.e., secondary data) are being used, investigators should ideally use outcome definitions that have been validated in previous studies. For example, Hux and others 19 validated definitions of diabetes by comparing International Classification of Diseases codes obtained from administrative health care databases in Ontario with diagnostic data from primary care charts.

Quantifying the Drug–Outcome Association

For cohort studies, the drug–outcome association is usually expressed as a relative risk, a relative rate, or a hazard ratio. Advanced statistical techniques are used to account for factors other than the drug exposure of interest that might distort the drug–outcome association. These factors or potential confounders are often handled simultaneously with multivariable regression models. Because these statistical models account for measured variables, it is crucial that the data source capture as many potential confounding variables (or proxies of confounders) as possible. Potential confounders should usually be measured before entry into the cohort, to avoid adjustment for factors in the causal pathway.
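As a rough illustration of the first two measures, the sketch below computes a crude relative risk (events per person) and a crude relative rate (events per unit of person-time) from hypothetical counts. The numbers are invented, and the calculation is unadjusted: it does not include the multivariable modelling described above, nor does it estimate a hazard ratio.

```python
def relative_risk(events_exposed: int, n_exposed: int,
                  events_unexposed: int, n_unexposed: int) -> float:
    """Ratio of cumulative incidence (risk) in exposed vs. unexposed persons."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    return risk_exposed / risk_unexposed


def rate_ratio(events_exposed: int, person_years_exposed: float,
               events_unexposed: int, person_years_unexposed: float) -> float:
    """Ratio of incidence rates, using person-time as the denominator."""
    rate_exposed = events_exposed / person_years_exposed
    rate_unexposed = events_unexposed / person_years_unexposed
    return rate_exposed / rate_unexposed


# Hypothetical cohort: 30 events among 1000 exposed persons (2500 person-years),
# 20 events among 2000 unexposed persons (5000 person-years).
rr = relative_risk(30, 1000, 20, 2000)
irr = rate_ratio(30, 2500.0, 20, 5000.0)
print(f"relative risk = {rr:.2f}, rate ratio = {irr:.2f}")
```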

Strengths and Weaknesses

One of the major strengths of a cohort study is that the temporal sequence—drug exposure preceding outcome—is explicit in the study design. The incidence of a particular outcome among persons exposed to a certain drug can be directly calculated using a cohort design. Cohort studies are also relatively efficient for studying rare exposures, and multiple outcomes may be assessed for a single exposure. However, cohort studies with long observation periods may be more susceptible to losses to follow-up and to inaccurate measurement of exposures and outcomes. Large numbers of patients may be required to precisely estimate meaningful drug–outcome associations, especially for rare outcomes or outcomes that take a long time to occur.

CASE–CONTROL STUDIES

The first case–control study using the design with which we are familiar today was published in 1926. However, the concept of case–control studies has its origins in the investigation of disease etiologies through detailed histories and examination of patients. 20

In a case–control study, a number of cases and noncases (controls) are identified, and the occurrence of one or more prior exposures is compared between groups to evaluate drug–outcome associations ( Figure 1 ). A case–control study runs in reverse relative to a cohort study. 21 As such, study inception occurs when a patient experiences an outcome and is thus designated a “case”. A modern conceptual view holds that the case–control study can be thought of as an efficient cohort design. Essentially, patients who would have experienced the outcome of interest in a cohort study are the cases in a case–control study. Similarly, patients who were at risk but did not experience the outcome of interest in a cohort study are the controls in a case–control study. The potential data sources for a case–control study are identical with those for a cohort study, and the investigator may collect data after study inception or may use previously collected data. An extension of the case–control study is the nested case–control study, which is a case–control study conducted within a cohort. Details regarding this design are beyond the scope of this article and are reviewed elsewhere. 22 , 23

Selection of Cases

The first step in a case–control study is to identify the cases through application of explicitly defined inclusion and exclusion criteria. Ideally, cases should be directly sampled from the source population in a manner that is unrelated to the drug exposures of interest; however, the source population that gave rise to the cases is often unknown and difficult to identify (except in a nested case–control study, where the source population is known). The case-selection process and the data sources from which cases were selected should be described in detail, especially if cases are from a variety of sources, such as hospital and community-based sources. Selecting only hospital-based cases may lead to systematic error related to hospital admission practices, whereby exposed cases may be more likely to be admitted and therefore selected into the study (a phenomenon known as Berksonian bias). Furthermore, only new (incident) cases should be selected, as nonincident cases usually over-represent long-term survivors, and diagnostic practices may change over time, introducing potential bias. When cases are selected from a secondary data source, the case definitions should be supported by previous validation studies.

Selection of Controls

The selection of controls in a case–control study is fraught with difficulty and is often the source of significant bias. Essentially, the controls should be selected from the same source population as the cases. 24 In other words, controls should be at risk of becoming cases and should come from a population with the same exposure distribution as the cases. Multiple controls are usually selected for each case, to increase the statistical efficiency of the study; however, the gains are minimal beyond 3 or 4 controls per case. Nonetheless, modern case–control studies involving large databases often use much higher control–case ratios to maximize study precision. To control for potential confounding, cases and controls are often matched on one or more patient characteristics, such as age or sex (although it may not always be appropriate to match on these variables). The study investigator must be careful not to match on too many factors or on factors that are not confounders, as doing so might lead to overmatching and bias. Furthermore, matching should not involve variables that the investigator is interested in examining in association with an outcome. The selection of controls is one of the most difficult aspects of epidemiologic research, and readers are encouraged to consult additional resources. 24 – 28
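The diminishing return from adding controls can be seen with a commonly quoted textbook approximation, which is not taken from this article: with m controls per case, the statistical efficiency relative to an unlimited number of controls is roughly m / (m + 1) when the true association is near the null. The short sketch below simply tabulates that ratio; the function name is hypothetical.

```python
def relative_efficiency(controls_per_case: int) -> float:
    """Approximate efficiency of m controls per case vs. unlimited controls.

    Uses the m / (m + 1) rule of thumb, which assumes an association
    close to the null; it is an approximation, not an exact result.
    """
    m = controls_per_case
    return m / (m + 1)


for m in (1, 2, 3, 4, 5, 10):
    print(f"{m:>2} controls per case: {relative_efficiency(m):.3f}")
```

Under this approximation, moving from 1 to 4 controls per case raises efficiency from 0.5 to 0.8, while moving from 4 to 10 adds only about 0.11 more.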

Similar to the situation for a cohort study, the drug exposures of interest and their definitions should be clearly specified in the methods. Because exposure in a case–control study is determined after the cases have been identified, a period before occurrence of the case, called the “look-back period” or “look-back window”, must be defined. A comparable look-back period must be defined for the control group. Look-back periods should consider the study hypothesis and thus may vary considerably from one study to another. For example, Abdelmoneim and others 29 specified a 120-day look-back period before the date of their cases (patients with acute coronary syndrome) to assess recent exposure to glyburide and gliclazide. Azoulay and others 30 specified an exposure window of any time prior to a year before the date of cases in their study evaluating the association between pioglitazone and bladder cancer. If the investigators are collecting exposure data themselves, then outcome status should be blinded to study personnel.
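A look-back window amounts to a date-range check relative to each subject's index date (the case's event date, or the corresponding date assigned to a control). The sketch below is a minimal illustration using the 120-day window mentioned above; the function name, the choice to exclude the index date itself, and the dispensing dates are assumptions.

```python
from datetime import date, timedelta
from typing import Iterable


def exposed_in_lookback(dispensing_dates: Iterable[date], index_date: date,
                        lookback_days: int = 120) -> bool:
    """True if any dispensing falls in the window before the index date.

    The window runs from `lookback_days` before the index date up to,
    but not including, the index date (an assumption for this sketch).
    """
    window_start = index_date - timedelta(days=lookback_days)
    return any(window_start <= d < index_date for d in dispensing_dates)


index_date = date(2010, 6, 30)
recent_user = exposed_in_lookback([date(2010, 5, 1)], index_date)   # 60 days before
remote_user = exposed_in_lookback([date(2010, 1, 15)], index_date)  # 166 days before
print(recent_user, remote_user)  # True False
```

Applying the same function, with the same window length, to cases and controls keeps the exposure assessment comparable between the two groups.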

In a case–control study, the odds ratio is the usual measure of association reported. This measure is the ratio of the odds of an exposure between cases and controls and in most cases approximates the relative risk. As in a cohort study, the analytic plan for a case–control study typically involves advanced statistical methods to adjust for multiple potential confounders.
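The crude odds ratio can be computed directly from a 2×2 table of exposure by case status. The sketch below uses invented counts and omits the confounder adjustment described above. It also compares the odds ratio with the relative risk in a hypothetical cohort where the outcome is uncommon, to show why the two are often close.

```python
def odds_ratio(exposed_cases: int, unexposed_cases: int,
               exposed_controls: int, unexposed_controls: int) -> float:
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls


# Hypothetical case-control data: 100 cases (40 exposed), 400 controls (80 exposed).
crude_or = odds_ratio(40, 60, 80, 320)
print(f"odds ratio = {crude_or:.2f}")

# Hypothetical full cohort with an uncommon outcome:
# 30 events among 1000 exposed, 10 events among 1000 unexposed.
cohort_rr = (30 / 1000) / (10 / 1000)
cohort_or = odds_ratio(30, 10, 970, 990)  # non-events play the role of controls
print(f"relative risk = {cohort_rr:.2f}, odds ratio = {cohort_or:.2f}")
```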

The major strengths of the case–control design are statistical efficiency (i.e., uses fewer data to quantify a drug–outcome association than would be required in a cohort study), efficiency for studying rare outcomes, efficiency for studying conditions with long latency periods, efficiency for handling the time-varying nature of drug exposures, and relatively low cost. The weaknesses of case–control studies include inefficiency for studying rare exposures, difficulty of selecting unbiased controls, and inability to directly calculate incidence rates of outcomes.

LIMITATIONS OF COHORT AND CASE–CONTROL STUDIES

Bias and Confounding

Observational studies are methodologically difficult, susceptible to bias and confounding, and difficult to interpret, given the many types of bias potentially at play. For these reasons, observational studies are limited to studying drug–outcome associations and cannot be used to measure the causal effects of drugs. Recent methodologic advances in design and analytic techniques in pharmacoepidemiology have helped to combat the various types of selection bias, information bias, and confounding at play in cohort and case–control studies (see Appendix 1 , available online at www.cjhp-online.ca/index.php/cjhp/issue/view/104/showToc ). Many of these techniques can account for multiple potential confounders simultaneously. A comprehensive review of these techniques is beyond the scope of this article, but such reviews may be found elsewhere. 25 , 31 – 33 Bias and confounding result in spurious drug–outcome associations and are introduced at both the design and analysis stages. Appendix 2 (available online at www.cjhp-online.ca/index.php/cjhp/issue/view/104/showToc ) illustrates the origin of bias in relation to the cohort design, and Appendix 3 (available online at www.cjhp-online.ca/index.php/cjhp/issue/view/104/showToc ) lists common types of bias that occur in cohort and case–control studies of drug effects.

Study of Intended Drug Effects

Cohort and case–control studies are powerful approaches for estimating the association between drugs and unintended outcomes 34 ; however, their use for studying the intended effects of drugs has spurred debate in the past and remains controversial today. 35 – 37 This controversy has arisen because the propensity for bias and confounding is much higher when estimating the intended effects of drugs (i.e., benefits). 37 This higher propensity for bias is in turn due to the nonrandom nature of prescribing practices and is often referred to as “confounding by the reason for the prescription” or simply “confounding by indication”. Confounding by indication is expected with these types of studies, as it is good medical practice to prescribe intentionally and rationally, as opposed to prescribing according to a random process. 38 Some authors strongly recommend against using observational studies to study intended effects, suggesting instead that we consider restricting our research questions to those of unintended effects because confounding by indication introduces uncontrollable bias. 31 , 34 , 39 , 40 The literature contains numerous examples of confounding by indication. A most striking example is the distorted 27-fold increased risk of thrombotic events associated with use of warfarin, when in fact warfarin prevents thrombotic events. 39 Another example of confounding by indication is the observed relationship between short-acting β-receptor agonists (e.g., salbutamol) and increased risk of death from asthma. 41 Of course confounding by indication is not verifiable, but it must be considered when studying the intended effects of drugs.
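A small numerical sketch can show how confounding by indication distorts a crude estimate. In the invented data below, the drug halves the risk of the event in both high-risk and low-risk patients, but because it is prescribed mostly to high-risk patients, the crude comparison makes it look harmful. All counts are hypothetical, and the example assumes that the indication (baseline risk) was recorded, which in practice is often not the case.

```python
def risk_ratio(events_treated: int, n_treated: int,
               events_untreated: int, n_untreated: int) -> float:
    """Risk in treated patients divided by risk in untreated patients."""
    return (events_treated / n_treated) / (events_untreated / n_untreated)


# (events_treated, n_treated, events_untreated, n_untreated) per stratum
strata = {
    "high risk": (90, 900, 20, 100),  # most high-risk patients receive the drug
    "low risk": (1, 100, 18, 900),    # few low-risk patients receive the drug
}

stratum_rr = {name: risk_ratio(*counts) for name, counts in strata.items()}

# Crude analysis: pool the strata and ignore the indication.
pooled = [sum(column) for column in zip(*strata.values())]
crude_rr = risk_ratio(*pooled)

for name, value in stratum_rr.items():
    print(f"{name}: risk ratio = {value:.2f}")
print(f"crude: risk ratio = {crude_rr:.2f}")
```

Within each stratum the risk ratio is about 0.5, while the pooled estimate is above 2. Stratifying removes the distortion here only because the confounder is known and measured in this toy example.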

GENERAL CONSIDERATIONS IN CONDUCTING A COHORT OR CASE–CONTROL STUDY

Protocol and Study Team

Cohort and case–control studies aim to quantitatively estimate the association between a drug exposure and outcome. Before embarking on a cohort or case–control study, the investigators must develop a well-articulated and focused research question. 42 Furthermore, the study protocol, including a detailed methodologic and analytic plan, should be consistent with international guidelines. 43 , 44 The study team should have appropriate clinical and methodologic expertise. Clinical expertise is essential for developing exposure and outcome definitions, as well as for understanding the overall clinical context of how the research question fits into the current body of knowledge. Methodologic expertise is critical for ensuring that robust methods are used, to minimize bias and confounding.

Data Sources

To estimate a drug–outcome association in a cohort or case–control study, accurate and comprehensive data must be collected on the drug exposures and outcomes of interest. Study investigators may collect data after study inception or may use previously collected data. The major advantage of prospectively collecting the data (primary data collection) is that the investigators have control over what information is collected; in contrast, when previously collected data are used (secondary data collection), the investigators are limited to the information already collected. Data may often be missing from or inaccurately recorded in secondary data sources, which creates challenges when the data are used for research purposes. Although previously collected data are considered retrospective to study inception, the data themselves are often collected prospectively; therefore, use of the terms “retrospective” and “prospective” may be misleading and usually does not provide any clarity in terms of important design characteristics. 25 There are 3 main sources of existing data: administrative data, medical records, and surveys. Special considerations and the advantages and disadvantages of these secondary data sources are discussed elsewhere. 45 , 46 For studying drug effects, secondary data sources are more commonly used than primary data collection, primarily because of gains in time, cost, and statistical efficiency. Furthermore, use of secondary data sources avoids the Hawthorne effect, whereby knowledge of participation in a study changes the behaviour of study participants and may lead to bias.

CONCLUSIONS

Pharmacists use knowledge from cohort and case–control studies to inform patients, clinicians, and the general public about drug effects. At a basic level, cohort and case–control studies quantitatively estimate the relation between exposures and outcomes. They represent rigorous study designs for answering drug safety and effectiveness questions, with case–control studies being more prone to bias. The methodologic rigour of cohort and case–control studies evaluating drug–outcome associations is advancing, and approaches are being developed and refined that limit the generation of misleading study results. Indeed, both RCTs and observational studies are necessary, and neither is sufficient to learn about the totality of drug effects in the population.

Acknowledgments

John-Michael Gamble is supported by a New Investigator Award in drug safety and effectiveness from the Canadian Institutes of Health Research and a Clinician Scientist Award from the Canadian Diabetes Association.

This article is the sixth in the CJHP Research Primer Series, an initiative of the CJHP Editorial Board and the CSHP Research Committee. The planned 2-year series is intended to appeal to relatively inexperienced researchers, with the goal of building research capacity among practising pharmacists. The articles, presenting simple but rigorous guidance to encourage and support novice researchers, are being solicited from authors with appropriate expertise.

Previous articles in this series:

Bond CM. The research jigsaw: how to get started. Can J Hosp Pharm. 2014;67(1):28–30.

Tully MP. Research: articulating questions, generating hypotheses, and choosing study designs. Can J Hosp Pharm. 2014;67(1):31–4.

Loewen P. Ethical issues in pharmacy practice research: an introductory guide. Can J Hosp Pharm. 2014;67(2):133–7.

Tsuyuki RT. Designing pharmacy practice research trials. Can J Hosp Pharm. 2014;67(3):226–9.

Bresee LC. An introduction to developing surveys for pharmacy practice research. Can J Hosp Pharm. 2014;67(4):286–91.

Competing interests: None declared.

Prospective Cohort Study Design: Definition & Examples

Julia Simkus

Editor at Simply Psychology

BA (Hons) Psychology, Princeton University

Julia Simkus is a graduate of Princeton University with a Bachelor of Arts in Psychology. She is currently studying for a Master's Degree in Counseling for Mental Health and Wellness in September 2023. Julia's research has been published in peer reviewed journals.

Saul Mcleod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul Mcleod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.

Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.

A prospective study, sometimes called a prospective cohort study, is a type of longitudinal study where researchers will follow and observe a group of subjects over a period of time to gather information and record the development of outcomes.

The participants in a prospective study are selected based on specific criteria and are often free from the outcome of interest at the beginning of the study. Data on exposures and potential confounding factors are collected at regular intervals throughout the study period.

By following the participants prospectively, researchers can establish a temporal relationship between exposures and outcomes, providing valuable insights into the causality of the observed associations.

This study design allows for the examination of multiple outcomes and the investigation of various exposure levels, contributing to a comprehensive understanding of the factors influencing health and disease.

How it Works

Participants are enrolled in the study before they develop the outcome or disease in question and then are observed as it evolves to see who develops the outcome and who does not.

Cohort studies are observational, so researchers will follow the subjects without manipulating any variables or interfering with their environment.

Similar to retrospective studies, prospective studies are beneficial for medical researchers, specifically in the field of epidemiology, as scientists can watch the development of a disease and compare the risk factors among subjects.

Before any appearance of the disease is investigated, medical professionals will identify a cohort, observe the target participants over time, and collect data at regular intervals.

Weeks, months, or years later, depending on the duration of the study design, the researchers will examine any factors that differed between the individuals who developed the condition and those who did not.

They can then determine if an association exists between an exposure and an outcome and even identify disease progression and relative risk.
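As a rough illustration of the relative risk mentioned above, the sketch below compares the risk of the outcome in exposed and unexposed participants of a hypothetical cohort. The counts are invented, the function name `relative_risk` is an assumption made for this example, and the confidence interval uses the common log-based (Katz) approximation, which is only one of several available methods.

```python
import math


def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Return (RR, lower, upper) with an approximate 95% confidence interval."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed

    # Standard error of log(RR) under the log (Katz) approximation
    se_log_rr = math.sqrt(
        1 / exposed_cases - 1 / exposed_total
        + 1 / unexposed_cases - 1 / unexposed_total
    )
    z = 1.96  # normal quantile for a 95% interval
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical follow-up results: 30 of 1,000 exposed and
# 10 of 1,000 unexposed participants develop the disease.
rr, lower, upper = relative_risk(30, 1000, 10, 1000)
print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

A relative risk above 1 with an interval that excludes 1 suggests an association between the exposure and the outcome in this sample; because the design is observational, it does not by itself show that the exposure causes the outcome.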

Advantages

Determine cause-and-effect relationships

Because researchers study groups of people before they develop an illness, they can discover potential cause-and-effect relationships between certain behaviors and the development of a disease.

Multiple diseases and conditions can be studied at the same time

Prospective cohort studies enable researchers to study causes of disease and identify multiple outcomes associated with a single exposure. These studies can also reveal links between diseases and risk factors.

Can measure a continuously changing relationship between exposure and outcome

Because prospective cohort studies are longitudinal, researchers can study changes in levels of exposure over time and any changes in outcome, providing a deeper understanding of the dynamic relationship between exposure and outcome.

Limitations

Time-consuming and expensive

Prospective studies usually require multiple months or years before researchers can identify a disease’s causes or discover significant results.

Because of this, they are often more expensive than other types of studies. Recruiting and enrolling participants is another added cost and time commitment.

Requires large subject pool

Prospective cohort studies require large sample sizes in order for any relationships or patterns to be meaningful. Researchers are unable to generate results if there is not enough data.

Examples

  • Framingham Heart Study: Studied the effects of diet, exercise, and medications on the development of hypertensive or arteriosclerotic cardiovascular disease in residents of the town of Framingham, Massachusetts.
  • Caerphilly Heart Disease Study: Examined relationships between a wide range of social, lifestyle, dietary, and other factors with incident vascular disease.
  • The Million Women Study: Analyzed data from more than one million women aged 50 and over to understand the effects of hormone replacement therapy use on women’s health.
  • Nurses’ Health Study: Started in 1976 and tracks over 120,000 nurses to assess risk factors for many major chronic diseases in women.
  • Sleep-Disordered Breathing and Mortality: Determined whether sleep-disordered breathing and its sequelae of intermittent hypoxemia and recurrent arousals are associated with mortality in a community sample of adults aged 40 years or older (Punjabi et al., 2009)

Frequently Asked Questions

1. What does it mean when an observational study is prospective?

A prospective observational study is a type of research where investigators select a group of subjects and observe them over a certain period.

The researchers collect data on the subjects’ exposure to certain risk factors or interventions and then track the outcomes. This type of study is often used to study the effects of suspected risk factors that cannot be controlled experimentally.

2. What is the primary difference between a retrospective study and a prospective study?

In a retrospective study, the subjects have already experienced the outcome of interest or developed the disease before the start of the study.

The researchers then look back in time to identify a cohort of subjects before they had developed the disease and use existing data, such as medical records, to discover any patterns.

In a prospective study, on the other hand, the investigators will design the study, recruit subjects, and collect baseline data on all subjects before any of them have developed the outcomes of interest.

The subjects are followed and observed over a period of time to gather information and record the development of outcomes.

3. What is the primary difference between a randomized clinical trial and a prospective cohort study?

In randomized clinical trials, the researchers control the experiment, whereas prospective cohort studies are purely observational, so researchers will observe subjects without manipulating any variables or interfering with their environment.

Researchers in randomized clinical trials will randomly divide participants into groups, either an experimental group or a control group.

However, in prospective cohort studies, researchers will identify a cohort and observe the target participants as a whole to examine any factors that differ between the individuals who develop the condition and those who do not.

Euser, A. M., Zoccali, C., Jager, K. J., & Dekker, F. W. (2009). Cohort studies: prospective versus retrospective. Nephron. Clinical practice, 113(3), c214–c217. https://doi.org/10.1159/000235241

Hariton, E., & Locascio, J. J. (2018). Randomised controlled trials – the gold standard for effectiveness research: Study design: randomised controlled trials. BJOG : an international journal of obstetrics and gynaecology, 125(13), 1716. https://doi.org/10.1111/1471-0528.15199

de Mutsert, R., Grootendorst, D. C., Boeschoten, E. W., Brandts, H., van Manen, J. G., Krediet, R. T., Dekker, F. W., & Netherlands Cooperative Study on the Adequacy of Dialysis-2 Study Group. (2009). Subjective global assessment of nutritional status is strongly associated with mortality in chronic dialysis patients. The American journal of clinical nutrition, 89(3), 787–793.

Punjabi, N. M., Caffo, B. S., Goodwin, J. L., Gottlieb, D. J., Newman, A. B., O’Connor, G. T., Rapoport, D. M., Redline, S., Resnick, H. E., Robbins, J. A., Shahar, E., Unruh, M. L., & Samet, J. M. (2009). Sleep-disordered breathing and mortality: a prospective cohort study. PLoS medicine, 6(8), e1000132. https://doi.org/10.1371/journal.pmed.1000132

Ranganathan, P., & Aggarwal, R. (2018). Study designs: Part 1 – An overview and classification. Perspectives in clinical research, 9(4), 184–186.

Song, J. W., & Chung, K. C. (2010). Observational studies: cohort and case-control studies. Plastic and reconstructive surgery, 126(6), 2234–2242. https://doi.org/10.1097/PRS.0b013e3181f44abc

Further Information

  • Euser, A. M., Zoccali, C., Jager, K. J., & Dekker, F. W. (2009). Cohort studies: prospective versus retrospective. Nephron Clinical Practice, 113(3), c214-c217.
  • Design of Prospective Studies
  • Hammoudeh, S., Gadelhaq, W., & Janahi, I. (2018). Prospective cohort studies in medical research (pp. 11-28). IntechOpen.
  • Nabi, H., Kivimaki, M., De Vogli, R., Marmot, M. G., & Singh-Manoux, A. (2008). Positive and negative affect and risk of coronary heart disease: Whitehall II prospective cohort study. Bmj, 337.
  • Bramsen, I., Dirkzwager, A. J., & Van der Ploeg, H. M. (2000). Predeployment personality traits and exposure to trauma as predictors of posttraumatic stress symptoms: A prospective study of former peacekeepers. American Journal of Psychiatry, 157(7), 1115-1119.


What Is a Retrospective Cohort Study? | Definition & Examples

Published on February 10, 2023 by Tegan George. Revised on June 22, 2023.

A retrospective cohort study is a type of observational study that focuses on individuals who have an exposure to a disease or risk factor in common. Retrospective cohort studies analyze the health outcomes over a period of time to form connections and assess the risk of a given outcome associated with a given exposure.

It is crucial to note that in a retrospective cohort study, both the exposure and the health outcomes being studied have already occurred by the time the study begins. The cohort is defined by a shared exposure, not by every participant having the outcome.

Retrospective cohort studies are a type of observational study. They are often used in fields related to medicine to study the effect of exposures on health outcomes. While most observational studies are qualitative in nature, retrospective cohort studies are often quantitative, as they use preexisting secondary research data. They can be used to conduct both exploratory research and explanatory research.

Retrospective cohort studies are often used as an intermediate step between a weaker preliminary study and a prospective cohort study, as the results gleaned from a retrospective cohort study strengthen assumptions behind a future prospective cohort study.

A retrospective cohort study could be a good fit for your research if:

  • A prospective cohort study is not (yet) feasible for the variables you are investigating.
  • You need to quickly examine the effect of an exposure, outbreak, or treatment on an outcome.
  • You are seeking to investigate an early-stage or potential association between your variables of interest.

Retrospective cohort studies use secondary research data, such as existing medical records or databases, to identify a group of people with an exposure or risk factor in common. They then look back in time to observe how the health outcomes developed. By contrast, case-control studies rely on primary research, comparing a group of participants with a condition of interest to a group lacking that condition in real time.

Retrospective cohort studies are common in fields like medicine, epidemiology, and healthcare.

Example: You collect data on participants’ exposure to organophosphates, focusing on variables like the timing and duration of exposure, and analyze the health effects of the exposure.

Example: Healthcare retrospective cohort study. You are examining the relationship between tanning bed use and the incidence of skin cancer diagnoses.

Retrospective cohort studies can be a good fit for many research projects, but they have their share of advantages and disadvantages.

Advantages of retrospective cohort studies

  • Retrospective cohort studies are a great choice if you have any ethical considerations or concerns about your participants that prevent you from pursuing a traditional experimental design.
  • Retrospective cohort studies are quite efficient in terms of time and budget. They require fewer subjects than other research methods and use preexisting secondary research data to analyze them.
  • Retrospective cohort studies are particularly useful when studying rare or unusual exposures, as well as diseases with a long latency or incubation period where prospective cohort studies cannot yet form conclusions.

Disadvantages of retrospective cohort studies

  • Like many observational studies, retrospective cohort studies are at high risk for many research biases. They are particularly at risk for recall bias and observer bias due to their reliance on memory and self-reported data.
  • Retrospective cohort studies are not a particularly strong standalone method, as they can never establish causality. This leads to low internal validity and external validity.
  • As most patients will have had a range of healthcare professionals involved in their care over their lifetime, there is significant variability in the measurement of risk factors and outcomes. This leads to issues with reliability and credibility of data collected.

The primary difference between a retrospective cohort study and a prospective cohort study is the timing of the data collection and the direction of the study.

A retrospective cohort study looks back in time. It uses preexisting secondary research data to examine the relationship between an exposure and an outcome. Data is collected after the outcome you’re studying has already occurred.

Alternatively, a prospective cohort study follows a group of individuals over time. It collects data on both the exposure and the outcome of interest as they are occurring. Data is collected before the outcome of interest has occurred.

Retrospective cohort studies are at high risk for research biases like recall bias. Whenever individuals are asked to recall past events or exposures, recall bias can occur. This is because individuals with a certain disease or health outcome of interest are more likely to remember and/or report past exposures differently to individuals without that outcome. This can result in an overestimation or underestimation of the true relationship between variables and affect your research.

No, retrospective cohort studies cannot establish causality on their own.

Like other types of observational studies, retrospective cohort studies can suggest associations between an exposure and a health outcome. They cannot prove without a doubt, however, that the exposure studied causes the health outcome.

In particular, retrospective cohort studies suffer from challenges arising from the timing of data collection, research biases like recall bias, and how variables are selected. These lead to low internal validity and the inability to determine causality.

George, T. (2023, June 22). What Is a Retrospective Cohort Study? | Definition & Examples. Scribbr. Retrieved June 27, 2024, from https://www.scribbr.com/methodology/retrospective-cohort-study/

Power of Cohorts: Public Health Advances from Prospective Cohort Studies

June 24, 2024, by Jennifer K. Loukissas, M.P.P.

DCEG's Commitment to Collaboration

Etiologic discovery of the causes of cancer and other chronic diseases depends upon the power of the prospective cohort study. The intermingled factors that influence risk—heredity, environment, occupation, lifestyle—are challenging enough to tease apart without the limitations of the other primary approach, case-control studies. While swift to produce results, case-control studies offer limited insight and have been documented to produce the wrong results for many exposures.

The multi- and inter-disciplinary, collaborative approaches that cohort studies require are the hallmark of DCEG research. Teams of epidemiologists, geneticists, biostatisticians, and other experts employ various tools to uncover the causes of cancer.

As part of the intramural research program at the National Cancer Institute, DCEG is a natural incubator for high-risk, high-reward, time-intensive projects, such as cohort studies, that depend on stable, long-term funding. Collaborations are key to their success. Partnerships across the Division and with extramural investigators across the country and around the world have expanded exponentially the value of these resources.

Among the tremendous discoveries and significant public health advances to come from such undertakings are the benefits of exercise for cancer prevention and the association of various exposures to elevated cancer risk, including the determination that smoking causes lung cancer. Cohort studies have informed recommendations like those in Healthy People 2030; regulatory guidelines for population-level exposures to potential or known carcinogens; safety procedures in the workplace; programs to prevent infections and chronic disease; and clinical management following a cancer diagnosis.

The length of longitudinal studies, which may continue for 20 to 30 years, allows researchers to track changes in exposures, lifestyle, or health status over time. Participants contribute maximally when they remain active in the study for decades, providing detailed information repeatedly, from various sources, such as lengthy questionnaires, blood samples, linkage to wearable digital devices, and clinic or home visits for collection of biological samples, like urine. Future studies have plans to collect stool, which will be valuable for examining the microbiome and other metabolic factors.

Participant samples become time capsules. Vials of frozen material stored in biobanks increase in value as the years go by until a future investigator with a novel assay discovers biomarkers unimaginable at the time of collection. For example, ‘omics’ technologies in use today are being applied to data and biological samples collected a generation ago.

The following is an overview of some U.S.-based longitudinal cohort studies utilized and maintained by investigators in DCEG and news about two new, exciting, modern cohorts. Many of these studies have pooled their data as part of the NCI Cohort Consortium and other consortia.

General Population Cohorts Inform Population-Level Prevention

Historically, cancer has been a disease of aging. As such, most cohorts enroll individuals in mid-life or later. Two of the most celebrated, both launched in the 1990s, are the NIH-AARP Diet and Health Study, and PLCO, the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Study. In addition to the wealth of knowledge generated by DCEG investigators, the broader scientific community can access data and biospecimens from these studies for their own investigations. Details on accessing that information follow each study description.

NIH-AARP Diet and Health Study

The NIH-AARP Diet and Health Study recruited participants from the membership rolls of AARP, formerly the American Association of Retired Persons, to amass what was then the largest cohort study in the world. Thirty years on, data collected from those half-million individuals are still being analyzed and new findings continue to improve our understanding of patterns of behavior in mid-life and their effect on future cancer risk.

Detailed information from multiple questionnaires has enabled over 900 project proposals resulting in over 600 publications. Using dietary information, investigators in the Metabolic Epidemiology Branch, along with colleagues, have observed many important patterns, such as the safety of coffee consumption, even at five or six cups per day. Other critical observations from this cohort: there is no safe level of exposure to tobacco smoke; even low-intensity smokers benefit from cessation. By mapping the residential histories provided by study participants to air pollution data, investigators in the Occupational and Environmental Epidemiology Branch (OEEB) linked elevated levels of ultrafine particulate exposure with increased risk of adenocarcinoma of the lung. Similarly, high levels of fine particulate air pollution were associated with increased breast cancer incidence. The effort to map participant residences also led to the important observation of an association between industrial emissions of ethylene oxide and in situ breast cancer. A similar pattern was described for ambient dioxin emissions and the risk of non-Hodgkin lymphoma. These studies demonstrate the power of residence history mapping (i.e., geocoding), an important add-on to the cohort.

Learn more about the NIH-AARP Diet and Health Study and see a summary of select findings.

Researchers interested in accessing data can use the NIH-AARP STARS portal to learn about the process and submit their proposals.

PLCO: A Screening Trial that Became a Prospective Cohort

The Prostate, Lung, Colorectal, and Ovarian Cancer Screening Study (PLCO) Cohort began as a screening trial. DCEG investigators and colleagues from NCI and the 10 participating centers transformed it into a large observational cohort study that is still producing critical findings today.

Survey data and serial biological samples allowed for over 2,000 projects resulting in over 1,400 publications, including the identification of novel biomarkers. For example, investigators in the Infections and Immunoepidemiology Branch (IIB) discovered that human papillomavirus (HPV) antibodies in the blood could be used to predict risk of HPV-positive oropharyngeal cancer years before diagnosis.

In the 2020s, DCEG investigators completed whole genome sequencing of the PLCO samples and made summary statistics available to the public through the PLCO Atlas – GWAS Explorer.

Investigators seeking access to complete cancer incidence, mortality data, and biospecimens for each subject in the PLCO trial can enter those requests online.

A Modern Cohort: Connect for Cancer Prevention

Exposures and lifestyles change with time among individuals and at the population level. To protect the health of today's generation and prevent the cancers of the future, epidemiologists must embark upon the construction of new cohorts. In the mid-2010s, DCEG investigators began planning the Connect for Cancer Prevention Study. To sow the seeds of future research discoveries, they are recruiting Gen X and Millennials whose lifestyles include entirely novel practices and experiences, compared to their parents of the Silent and Baby Boom generations.

Connect began enrollment in 2021 and, as of June 2024, had surpassed 40,000 participants. The aim is to enroll 200,000 adults between the ages of 30 and 70 who have not previously been diagnosed with cancer and who receive their health care from one of 10 partner health care systems. With this latter criterion, participants can readily share access to their electronic medical records (EMRs), a component missing from the general population cohorts described above. Furthermore, EMR integration will aid with long-term follow-up and increase the completeness of the participant data.

Consented participants in Connect complete extensive, online questionnaires and biospecimen collections—blood, urine, and saliva—at enrollment and periodically throughout the duration of follow-up. Over the course of the study, tissue collected from biopsies and invasive cancers will also be shared with Connect investigators for molecular studies. Passive follow-up via tumor registries, the National Death Index, and EMRs will provide additional outcome information for cancers and their precursors.

Connect is a digital-first cohort, built with a Findable, Accessible, Interoperable, and Reusable (FAIR) data infrastructure that allows for sharing and collaboration on scales legacy cohorts could not achieve. This state-of-the-art cohort is built with an efficient, flexible, and integrated data infrastructure that makes the most of modern interoperability standards to serve as a research resource for future generations of scientists at the NCI and across the broader scientific community.

Additionally, by incorporating a diverse Participant Advisory Board and partnering with health systems that serve diverse communities, Connect can enhance recruitment of populations typically underrepresented in research. Several studies in the U.S. have sought to address these gaps, including the Southern Community Cohort, the Multiethnic Cohort, and the Black Women’s Health Study. Historically, however, most cohorts recruited from a relatively narrow segment of the population (predominantly White, cisgender, well-educated, higher-income individuals), limiting the generalizability of the findings.

Learn more about Connect on the GitHub site.

Browse the Connect for Cancer Prevention Study participant recruitment website.

Exposure-based Cohorts

DCEG also prioritizes research in populations with unique exposures, such as workers exposed in occupational settings, or individuals with unique health conditions or medical exposures. The discoveries from such investigations benefit not only the populations studied but also the general population, which may experience similar exposures, though typically at lower rates or doses. 

Cohorts to Study the Health of Workers

Experts in OEEB and the Radiation Epidemiology Branch (REB) have studied worker populations for over 40 years. These cohorts provided some of the earliest data on the potential harms from industrial chemicals and ionizing radiation.

Industry and Manufacturing: Formaldehyde, Diesel Exhaust, and Acrylonitrile

photograph of bottle on lab desk labeled formaldehyde

Occupational Formaldehyde Exposure and Cancer Risk

Studies informed classification of formaldehyde as a human carcinogen

Workers whose jobs involve the use of toxic chemicals and other potentially harmful substances are often exposed at levels well above those of the general public. With well-designed questionnaires, reliable exposure assessment, careful participant recruitment, and long-term follow up, occupational cohort studies can profoundly influence safety in the workplace and regulations to protect public health.

For example, OEEB investigators have conducted numerous studies resulting in important discoveries, from dry cleaners exposed to solvents to workers whose jobs involved exposure to formaldehyde. Data from these cohorts informed the classification of those exposures as carcinogenic to humans by the International Agency for Research on Cancer (IARC) Monograph Programme and the National Toxicology Program Report on Carcinogens.

The Diesel Exhaust in Miners Study (DEMS), launched in 1992 in collaboration with the National Institute for Occupational Safety and Health, enrolled workers at eight non-metal mines across the country. DEMS captured comprehensive exposure and lifestyle data, which allowed the investigators to clarify the relationship between exposure to diesel engine exhaust and the risk of death from lung cancer. The findings played a critical role in the classification of diesel exhaust as a Group 1 carcinogen by IARC in 2012 and have important implications for miners, tens of millions of workers in the U.S. and worldwide who are exposed to diesel exhaust in the workplace, and people who live in cities with high levels of diesel exhaust.

Bulldozer that runs on diesel in a mine.

Diesel Exhaust in Miners Study II Reveals New Insights

Extended follow-up shows elevated lung cancer risk remained 20 or more years after diesel exhaust exposure ceased.

Acrylonitrile is a chemical used in the production of synthetic fibers and many other products. While results from animal bioassays suggested it might cause cancers at multiple sites, findings from early epidemiologic studies were inconsistent and inconclusive due to small sample size and poor exposure characterization. The NCI Acrylonitrile Cohort, the largest to date, found workers with the highest cumulative exposure experienced excess lung cancer more than 20 years after first exposure. An additional 21 years of mortality data showed an exposure-response relationship for lung cancer death and positive associations for death from bladder cancer and for non-malignant respiratory disease. IARC’s Monograph Programme evaluated this exposure in June 2024.

Farmers and Pesticide Applicators

The Agricultural Health Study (AHS) works to understand how agricultural, lifestyle, and genetic factors affect the health of farming populations. Since its inception, AHS investigators have evaluated agricultural practices and pesticide use, other occupational exposures, and a broad range of factors as they relate to risk for cancer and other outcomes. Data from the AHS have contributed to determinations of carcinogenicity of agricultural exposures as well as regulatory decisions in the U.S. and internationally.

Collecting biospecimens for use in the BEEA Study

Biomarkers of Exposure and Effect in Agriculture

BEEA investigates biological mechanisms of pesticides and cancer risk.

More recently, DCEG investigators have led a molecular epidemiologic initiative known as the Biomarkers of Exposure and Effect in Agriculture (BEEA) study. Within BEEA, biospecimens and updated exposure information are being used to investigate the biologic mechanisms underlying associations between agricultural exposures and risk of cancer and other chronic diseases.

More information about AHS and BEEA can be found on the AHS website.

Medical Radiation Workers

The U.S. Radiologic Technologists Study (USRT or Rad Tech) has expanded our understanding of the radiation-related health effects for medical workers who administer diagnostic and therapeutic medical exams. This nationwide study began in 1982 with more than 110,000 current and former radiologic technologists, certified by the American Registry of Radiologic Technologists, who completed one or more questionnaires about their work history, health status, and other factors.

The Rad Tech Study has yielded important findings related to health risks from repeated exposure to relatively low doses of ionizing radiation, including associations between cumulative lifetime radiation exposure and risks of female breast cancer, lung cancer, and cataracts. Additional analysis demonstrated that cataract risk was particularly high for technologists who were positioned closer to the radiation source, while risk was much lower for those who used personal protection equipment (room shields, lead glasses). The cohort has also been a valuable resource for investigating the effects of ultraviolet light exposure and other lifestyle factors on cancer and other health outcomes.

Individuals with Specific Medical Exposures or Diagnoses

The DES Story: Lessons Learned

Dr. Robert Hoover discusses a follow-up study of diethylstilbestrol (DES), a drug once prescribed to pregnant women. (Video produced and edited by Natalie Giannosa)

Multi-Generation Study of DES-Exposed Individuals

From the mid-1940s through the early 1970s, diethylstilbestrol (DES), the first synthetic estrogen, was given to millions of pregnant women, exposing daughters and sons while in utero. It was thought to prevent miscarriage. Instead, DES was later identified as a human carcinogen and the first known trans-placental carcinogen.

In 1971, the first study was published connecting a mother’s prescription for DES during pregnancy with the occurrence of vaginal cancer in her daughter, prompting the FDA to revoke the use of DES in pregnant women. Several field studies were launched across the country. In 1992, NCI investigators and collaborators brought together those individual study centers to create the NCI Follow-up of Combined DES Cohorts. With the greater statistical power of the combined studies, investigators identified a constellation of adverse health outcomes in three generations, including an increased frequency of problems of the reproductive tract, changes in the tissue of the vagina, infertility, and poor pregnancy outcomes in daughters. As DES-exposed offspring reach the age when cancer rates begin to rise, it is important to continue to monitor the long-term risk of cancer and other adverse health outcomes in this unique population.

Graph showing the risk of cardiomyopathy/heart failure in the years since breast cancer diagnosis for patients that received anthracyclines and/or trastuzumab or other chemotherapies compared to those who did not receive chemotherapy. Those who received anthracyclines and/or trastuzumab had increased risk of cardiomyopathy/heart failure at 1-5 years, 5-10 years, and 10+ years since breast cancer diagnosis.

Some Breast Cancer Treatments Linked to Long-term Cardiovascular Disease Risk

This study may inform long-term and age-specific cardiovascular disease follow-up for breast cancer survivors.

Cancer Survivors

Over 18 million Americans are survivors of one or more cancers. Survivors of cancer are at risk for a second primary malignancy either because of their exposures in life, genetic predisposition, or adverse effects of their treatment.

To investigate these risks, investigators in REB and the Integrative Tumor Epidemiology Branch convened a retrospective record-linkage cohort, the Kaiser Permanente (KP) Breast Cancer Survivors Cohort, a transdisciplinary resource to investigate treatment patterns over time and the risk of second cancers, cardiovascular disease, and mortality.

Among their findings to date, they observed that breast cancer patients who received radiotherapy, had breast-conserving surgery, and had a history of hypertension or diabetes at the time of their breast cancer diagnosis had elevated risks for thoracic angiosarcoma.

A New Cohort of Children Treated for Cancer

As therapies to treat cancer continue to evolve, it is important to monitor short- and long-term adverse health outcomes. The Pediatric Proton and Photon Therapy Comparison Cohort, supported by the Childhood Cancer Data Initiative since 2020, is a multi-center retrospective cohort to compare the risk of second cancers among childhood cancer patients treated with proton radiotherapy to those treated with photon radiotherapy.

Doctor conferring with mother and daughter

Pediatric Proton and Photon Therapy Comparison Cohort

A comparison of second cancer risk following proton versus photon therapy for pediatric cancer.

Investigators in REB and collaborators from Massachusetts General Hospital are assembling patient and radiotherapy treatment data from participating study centers across the United States and Canada. REB experts are developing state-of-the-art dosimetry methods to quantify radiation doses to exposed organs and tissues. Investigators will examine dose-response relationships and assess dose-volume effects for the most common and radiosensitive second cancer sites (brain tumors, sarcomas, breast and thyroid cancer). The study is expected to continue for decades in order to capture the range of late effects that may be associated with these therapies.

International Cohorts

The encyclopedic breadth of research within and across cohort studies in the Division could not begin to fit in the length of this article; the focus here was limited to projects in the United States. In collaboration with international partners, we have assembled many cohorts of truly unique populations. For example, the Shanghai Women’s Health Study, in collaboration with Vanderbilt University and the Shanghai Cancer Center, is a population-based prospective cohort of about 75,000 mostly never-smoking women recruited between 1997 and 2000. Participants provided blood and urine samples and are followed via multiple in-person surveys and record linkages with population-based registries.

In Costa Rica, where DCEG has been studying cervical cancer for over 40 years, the Guanacaste HPV Natural History Study has followed over 10,000 women since 1993. It has yielded many critical insights into HPV natural history, including the evidence to establish the performance of then-novel HPV and cytologic screening techniques.

Collaborations in Ukraine have advanced our understanding of the health effects of low-dose exposure to ionizing radiation. Among a cohort of individuals exposed to radioactive fallout following the accident at the Chernobyl power plant, investigators have quantified the relationship between internal exposure to radiation in childhood and thyroid cancer detected through standardized screening procedures.

The Alpha-Tocopherol, Beta-Carotene Cancer Prevention (ATBC) Study, conducted in southwestern Finland, has been an integral research resource for NCI for over three decades. It was designed to test nutritional approaches to cancer prevention and the biological and anti-neoplastic properties of two antioxidant micronutrients, beta-carotene and vitamin E, among nearly 30,000 male smokers.

See an inventory of cohorts in DCEG on our website.

  • Research article
  • Open access
  • Published: 26 June 2024

Metformin use correlated with lower risk of cardiometabolic diseases and related mortality among US cancer survivors: evidence from a nationally representative cohort study

  • Yukun Li (ORCID: orcid.org/0000-0003-0516-5480) 1,2
  • Xiaoying Liu 1,2
  • Wenhe Lv 1,2
  • Xuesi Wang 1,2
  • Zhuohang Du 1,2
  • Xinmeng Liu 1,2
  • Fanchao Meng 1,2
  • Shuqi Jin 1,2
  • Songnan Wen 4
  • Rong Bai 3
  • Nian Liu 1,2
  • Ribo Tang 1,2

BMC Medicine, volume 22, Article number: 269 (2024)

In the USA, the prolonged survival of the cancer population has brought significant attention to the rising risk of cardiometabolic morbidity and mortality in this group. This heightened risk underscores the urgent need for research into effective pharmacological interventions for cancer survivors. Notably, metformin, a well-known metabolic regulator with pleiotropic effects, has shown protective effects against cardiometabolic disorders in diabetic individuals. Despite these promising indications, evidence supporting its efficacy in improving cardiometabolic outcomes in cancer survivors remains scarce.

A prospective cohort was established using a nationally representative sample of cancer survivors enrolled in the US National Health and Nutrition Examination Survey (NHANES), spanning 2003 to 2018. Outcomes were derived from patient interviews, physical examinations, and public-access linked mortality archives up to 2019. The Oxidative Balance Score was utilized to assess participants’ levels of oxidative stress. To evaluate the correlations between metformin use and the risk of cardiometabolic diseases and related mortality, survival analysis of cardiometabolic mortality was performed by Cox proportional hazards model, and cross-sectional analysis of cardiometabolic diseases outcomes was performed using logistic regression models. Interaction analyses were conducted to explore the specific pharmacological mechanism of metformin.

Among 3995 cancer survivors (weighted population, 21,671,061; weighted mean [SE] age, 62.62 [0.33] years; 2119 [53.04%] females; 2727 [68.26%] Non-Hispanic White individuals), 448 reported metformin usage. During the follow-up period of up to 17 years (median, 6.42 years), there were 1233 recorded deaths, including 481 deaths from cardiometabolic causes. Multivariable models indicated that metformin use was associated with a lower risk of all-cause (hazard ratio [HR], 0.62; 95% confidence interval [CI], 0.47–0.81) and cardiometabolic (HR, 0.65; 95% CI, 0.44–0.97) mortality compared with nonuse. Metformin use was also correlated with a lower risk of total cardiovascular disease (odds ratio [OR], 0.41; 95% CI, 0.28–0.59), stroke (OR, 0.44; 95% CI, 0.26–0.74), hypertension (OR, 0.27; 95% CI, 0.14–0.52), and coronary heart disease (OR, 0.41; 95% CI, 0.21–0.78). The observed inverse associations were consistent across subgroup analyses in four specific cancer populations identified as cardiometabolic high-risk groups. Interaction analyses suggested that metformin use as compared to non-use may counter-balance oxidative stress.

Conclusions

In this cohort study involving a nationally representative population of US cancer survivors, metformin use was significantly correlated with a lower risk of cardiometabolic diseases, all-cause mortality, and cardiometabolic mortality.

Is metformin use associated with diminished risk of cardiometabolic diseases and related mortality in cancer survivors? If so, what mechanisms contribute to this inverse association with cardiometabolic risk?

In this cohort study of 3995 cancer survivors over a median period of 6.42 years, metformin use was correlated with decreased risks of cardiometabolic diseases, all-cause and cardiometabolic mortality, likely due to its oxidative stress antagonism.

These study findings indicated that metformin use was associated with improved cardiometabolic health, leading to enhanced overall survival and quality of life in cancer survivors. Moreover, targeting oxidative stress may be crucial in developing cardiometabolic protective drugs for patients with cancer in the future.

Cardiometabolic disease (CMD) and cancer are two major global public health concerns [1, 2]. Recent advancements in cancer therapies, including chemotherapy, radiotherapy, targeted therapy, and immunotherapy, have led to an expanding population of cancer survivors (CS). Two-thirds of patients diagnosed with cancer survive beyond 5 years post-diagnosis. However, the extended lifespan of CS presents new challenges for long-term care and comorbidity management. CMD has emerged as the primary comorbidity in patients with cancer, ranking as the leading cause of noncancer deaths in the CS population [3, 4, 5]. This increased risk of CMD and related mortality arises from various factors, including direct effects of cancers, anticancer treatments (including radiation and chemotherapy), pre-existing cardiometabolic risk factors (such as dysglycemia, dyslipidemia, and obesity), and physical deconditioning [3].

Apart from the previously mentioned risk factors, cancer has been hypothesized to be a type of metabolic disease. The link between cancer and CMD can possibly be explained by the abnormal metabolic phenotype of cancer cells (known as the Warburg effect) and elevated levels of oxidative stress [6, 7], particularly considering that the cardiovascular system is highly energy-consuming and sensitive to altered metabolic patterns. Currently, there are no specific guidelines for managing and preventing cardiometabolic complications in this highly metabolically challenged group; practice relies instead on recommendations extrapolated from general populations. In this context, urgent attention is required for the development of prevention strategies aimed at alleviating the CMD burden in cancer survivors.

Both experimental and clinical data indicate that metformin, the primary oral antihyperglycemic agent and a pharmacological activator of adenosine 5′-monophosphate-activated protein kinase (AMPK), could improve the cardiometabolic status in populations with obesity, diabetes, or psychiatric disorders [8]. This improvement was attributed to mechanisms including oxidative stress inhibition and redox rebalance [9, 10, 11]. To date, few studies have investigated the correlations of metformin use with CMD risk and CMD-related mortality in patients with cancer. Therefore, this study aimed to investigate the correlations between metformin use and the risk of CMD and CMD-related mortality, as well as the role of the antioxidative stress mechanism, in a nationally representative sample of US cancer survivors with sufficient follow-up time. Our study findings offer crucial insights into the clinical application, mechanistic understanding, and future development of effective interventions to mitigate the increasing trend of cardiometabolic dysfunction among cancer survivors.

Study sample and population

This cohort study utilized a nationally representative population of cancer survivors from the National Health and Nutrition Examination Survey (NHANES) [12, 13]. This study’s analysis adhered to the analytical guidelines of NHANES, which adopted a multi-stage stratified systematic sampling design. The sampling and testing processes in NHANES have been thoroughly documented in previously published articles. Briefly, the survey conducted health-related interviews and examinations, encompassing participants from diverse geographical locations and racial/ethnic backgrounds to ensure its nationwide representativeness. The NHANES protocols received approval from the National Center for Health Statistics research ethics review board. Written informed consent was obtained from all participants involved in the survey.

NHANES is an ongoing series of cross-sectional studies conducted in 2-year cycles. Except for mortality data, which were obtained from the 2019 Public-Use Linked Mortality Files, all other participant data, including the exposure of interest (metformin use) and the covariates of our study, were collected during the survey cycle in which the participants were enrolled. Specifically, NHANES gathered health-related information through a combination of health interviews, medical measurements, and laboratory assessments. One or more individuals per household were selected to participate. Data were collected from each participant via face-to-face interviews conducted in the respondents’ homes. Medical measurements were performed in specially equipped mobile centers that travel to various locations throughout the country. Subsequently, participants were invited to provide biological samples and undergo laboratory assessments. In the Cox proportional hazards regression analysis, follow-up was calculated in person-months from the date of the interview to either the date of death or the end date of the 2019 Public-Use Linked Mortality Files (December 31, 2019). Comprehensive methodology and protocols can be found on the NHANES website.
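The follow-up rule described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `follow_up` and the average month length are assumptions made here for illustration, and only the censoring date of December 31, 2019 comes from the text.

```python
from datetime import date
from typing import Optional, Tuple

# End date of the 2019 Public-Use Linked Mortality Files, as stated in the text.
CENSOR_DATE = date(2019, 12, 31)
# Average month length in days; an assumption of this sketch.
DAYS_PER_MONTH = 30.4375


def follow_up(interview: date, death: Optional[date] = None) -> Tuple[float, bool]:
    """Return (person-months of follow-up, death observed during follow-up)."""
    # A death counts as an event only if it occurred on or before the censoring date.
    event = death is not None and death <= CENSOR_DATE
    end = death if event else CENSOR_DATE
    if end < interview:
        raise ValueError("follow-up end precedes the interview date")
    return (end - interview).days / DAYS_PER_MONTH, event
```

A participant interviewed on `date(2018, 12, 31)` with no recorded death would contribute 365 days (about 12 person-months) and be treated as censored.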

In this study, we analyzed data on sociodemographic variables, lifestyle factors, and medical and medication history from adult cancer survivors with follow-up records spanning eight cycles of NHANES from 2003 to 2018. Cancer diagnosis and type information were obtained through in-person interviews, encompassing details regarding specific cancer type(s), recodes of up to three cancer diagnoses, and the age at each diagnosis. Participants were asked, “Have you ever been told by a doctor or other health professional that you had cancer or a malignancy of any kind?” Those answering “yes” were identified as cancer survivors and further asked “What kind of cancer was it?” and “How old were you when this cancer was first diagnosed?” The value of the variable “years since first cancer diagnosis” was calculated by subtracting the age at the first cancer diagnosis from the participant’s current age. During household interviews, participants were questioned about their prescription medication use in the past month. If their response was affirmative, the participants were further asked to present medication containers for recording. Medication names, upon entry, were systematically matched against an existing prescription drug database. In cases where medication containers were unavailable, participants were requested to report the medication name verbally. Metformin use was obtained from participants’ self-reports during the in-home questionnaire. Figure 1 illustrates the participant enrollment process in a flowchart.

Figure 1. Flowchart diagram of the screening and enrollment of study participants

Oxidative balance score

This study computed the overall Oxidative Balance Score (OBS) by totaling the points allocated to each component. A lower OBS indicated a higher individual oxidative stress level. Sixteen nutrients and four lifestyle-related components were selected for OBS calculation, including fifteen antioxidants and five prooxidants, based on previous studies examining their relationships with oxidative stress (OS) [14, 15].

The scoring allocation scheme for OBS components is detailed in Supplementary Table 1 (Additional file 1: Table S1). For alcohol consumption, the scoring was as follows: nondrinkers received 2 points, nonheavy drinkers (0–15 g/day for women and 0–30 g/day for men) received 1 point, and heavy drinkers (≥ 15 g/day for women and ≥ 30 g/day for men) received 0 points. Notably, smoking exposure was quantified using serum cotinine levels, reflecting both direct tobacco use and environmental tobacco smoke exposure. Other components were categorized into three groups based on gender-specific tertiles. Antioxidants were scored from 0 to 2, progressively assigned from the first to the third tertile. However, prooxidants were scored inversely, with the highest tertile receiving 0 points and the lowest tertile receiving 2 points. Patients with cancer were classified into high OS and low OS groups based on OS level, defined by the lower and higher 50% of OBS values, respectively.
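The scoring rules above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function names, the simple rank-based tertile cut points, and the handling of values that fall exactly on a cut point or on the median are assumptions. In the study the tertiles are gender-specific, so the cut points would be computed separately for each gender.

```python
from typing import Iterable, Sequence, Tuple


def tertile_cutoffs(values: Sequence[float]) -> Tuple[float, float]:
    """Return two cut points that split the sorted values into thirds (simple rank rule)."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def tertile_score(value: float, cutoffs: Tuple[float, float], prooxidant: bool = False) -> int:
    """Antioxidants score 0, 1, 2 from lowest to highest tertile; prooxidants are reversed."""
    low, high = cutoffs
    tertile = 0 if value < low else (1 if value < high else 2)
    return 2 - tertile if prooxidant else tertile


def alcohol_score(grams_per_day: float, female: bool) -> int:
    """Nondrinkers 2 points, nonheavy drinkers 1 point, heavy drinkers 0 points."""
    heavy_cutoff = 15.0 if female else 30.0  # g/day thresholds from the text
    if grams_per_day == 0:
        return 2
    return 1 if grams_per_day < heavy_cutoff else 0


def obs_total(component_scores: Iterable[int]) -> int:
    """Overall OBS is the sum of all component points."""
    return sum(component_scores)


def os_group(obs: float, cohort_median: float) -> str:
    """Lower half of OBS values means high oxidative stress (tie handling is assumed)."""
    return "high OS" if obs < cohort_median else "low OS"
```

Keeping the cut points as an explicit argument makes it straightforward to pass gender-specific values without changing the scoring function.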

Outcome definition

In this study, mortality data were obtained from the 2019 public-access linked mortality archives up to December 31, 2019. All-cause and cardiometabolic mortality statuses were acquired from the publicly accessible dataset mentioned above. For participants identified as “deceased,” death cases were coded according to the Tenth Revision of the International Classification of Diseases (ICD-10). The underlying causes of death were categorized into the following 10 types: diseases of the heart; malignant neoplasms; chronic lower respiratory diseases; accidents (unintentional injuries); cerebrovascular diseases; Alzheimer’s disease; diabetes mellitus; influenza and pneumonia; nephritis, nephrotic syndrome, and nephrosis; and all other causes (residual).

Cardiometabolic mortality outcome was defined as the combination of death events resulting from three primary causes: diseases of the heart, cerebrovascular diseases, and diabetes mellitus. Death from accidents (unintentional injuries) was defined as a negative control outcome, which was unlikely to be influenced by the metformin treatment.
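As a rough illustration of these outcome definitions, the sketch below maps a cause-of-death category to the three outcome indicators. The category labels are taken from the list in the text; the function name `classify_death` and the dictionary layout are hypothetical, and the real mortality files encode causes with numeric codes rather than these strings.

```python
from typing import Dict, Optional

# Causes that make up the cardiometabolic mortality outcome, per the text.
CARDIOMETABOLIC_CAUSES = {
    "diseases of the heart",
    "cerebrovascular diseases",
    "diabetes mellitus",
}
# Cause used as the negative control outcome, per the text.
NEGATIVE_CONTROL_CAUSE = "accidents (unintentional injuries)"


def classify_death(deceased: bool, cause: Optional[str]) -> Dict[str, bool]:
    """Derive all-cause, cardiometabolic, and negative-control death indicators."""
    cause_norm = cause.strip().lower() if cause else None
    return {
        "all_cause": deceased,
        "cardiometabolic": deceased and cause_norm in CARDIOMETABOLIC_CAUSES,
        "negative_control": deceased and cause_norm == NEGATIVE_CONTROL_CAUSE,
    }
```

For a participant who died of diabetes mellitus, both the all-cause and cardiometabolic indicators would be true, while the negative-control indicator would be false.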

Although there is no consensus on the exact scope of CMD, prior studies typically encompassed coronary heart disease, stroke, hypertension, dyslipidemia, and diabetes mellitus within this category. Notably, all metformin users in our study had a history of diabetes comorbidity. Based on previous NHANES-related research on CMD and considering the sample sizes of certain diseases in the NHANES database, along with clinical practice, four common CMDs were included to assess the cardiometabolic diseases risk of cancer survivors: total cardiovascular disease, stroke, hypertension, and coronary heart disease (CHD). Hypertension diagnosis was based on the response “yes” to self-reported hypertension questions (“Have you ever been told by a doctor or other health professional that you had hypertension, also called high blood pressure?”), antihypertension drug usage (“Because of your high blood pressure/hypertension, have you ever been told to take prescribed medicine?”), or abnormal average values in three blood pressure measurements (systolic blood pressure greater than 130 mmHg or diastolic blood pressure greater than 80 mmHg). Stroke was diagnosed among those who responded “yes” to the self-reported stroke question (“Has a doctor or other health professional ever told you that you had a stroke?”). CHD was diagnosed among individuals responding “yes” to the self-reported CHD question (“Has a doctor or other health professional ever told you that you had coronary heart disease?”). The diagnosis of total cardiovascular disease included individuals diagnosed with any or a combination of the following conditions: coronary heart disease, congestive heart failure, heart attack, stroke, or angina.
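The case definitions above translate into simple boolean rules. The following Python sketch is illustrative only; the function and parameter names are hypothetical and do not correspond to NHANES variable names.

```python
from statistics import mean
from typing import Sequence


def has_hypertension(
    told_by_doctor: bool,
    told_to_take_medicine: bool,
    systolic: Sequence[float],
    diastolic: Sequence[float],
) -> bool:
    """Self-report, prescribed medication, or elevated average blood pressure."""
    if told_by_doctor or told_to_take_medicine:
        return True
    # Average of the available readings; thresholds (> 130 / > 80 mmHg) are from the text.
    high_sbp = bool(systolic) and mean(systolic) > 130
    high_dbp = bool(diastolic) and mean(diastolic) > 80
    return high_sbp or high_dbp


def has_total_cvd(
    chd: bool = False,
    heart_failure: bool = False,
    heart_attack: bool = False,
    stroke: bool = False,
    angina: bool = False,
) -> bool:
    """Total cardiovascular disease: any one or more of the listed conditions."""
    return any((chd, heart_failure, heart_attack, stroke, angina))
```

Note that an average systolic pressure of exactly 130 mmHg does not meet the threshold, because the text specifies values greater than 130.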

Study covariates

We constructed a directed acyclic graph (DAG) to visualize the hypothesized associations of the primary exposure (metformin treatment) with the outcomes of interest (cardiometabolic outcomes of cancer survivors), and potential covariates. The selection of clinical and biochemical covariates incorporated into the DAG was guided by pragmatic considerations and prior mechanistic insights into the pathophysiology of cardiometabolic diseases. The resulting DAG is presented in Supplementary Fig. 1 (Additional file 1: Figure S1).

In our study, covariates including age, gender, ethnicity/race (categorized as non-Hispanic white, non-Hispanic black, Mexican American, other Hispanic, and others), educational level (categorized as less than high school, high school or equivalent, and college or above), and socioeconomic status measured by the poverty to income ratio (PIR = Family income / Poverty threshold for family size and composition) were extracted from interviews and physical examinations. The PIR index was stratified into three levels: < 1.30, 1.30–3.49, and ≥ 3.5. Body mass index (BMI) was calculated as weight (kg) divided by the square of height (m²) and categorized into three subgroups based on the cutoff values of 25 and 30, with BMI ≥ 30 indicating obesity. Smoking status was assessed and classified as “never” for individuals who had smoked fewer than 100 cigarettes in their lifetime, “former” for those who had smoked more than 100 cigarettes but did not currently smoke, and “now” for individuals who had smoked more than 100 cigarettes in their lifetime and smoked some days or every day. Alcohol consumption status was grouped into five categories: (1) never (< 12 drinks in a lifetime), (2) former (≥ 12 drinks in 1 year and did not drink last year, or did not drink last year but drank ≥ 12 drinks in a lifetime), (3) current mild alcohol use (< two drinks per day for women, < three drinks per day for men), (4) current moderate alcohol use (≥ two drinks per day for women, ≥ three drinks per day for men, or binge drinking ≥ two days per month), and (5) current heavy alcohol use (≥ three drinks per day for women, > four drinks per day for men, or binge drinking on ≥ 5 days per month).
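A minimal Python sketch of the PIR, BMI, and smoking categorizations described above is given here for illustration. The function names and category labels are hypothetical; assigning a BMI of exactly 25 to the middle group is an assumption, since the text only states the cutoff values.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Three BMI groups using the cutoffs 25 and 30 (>= 30 indicates obesity)."""
    if value < 25:
        return "<25"
    return "25-<30" if value < 30 else ">=30"


def pir_category(pir: float) -> str:
    """Three poverty-to-income ratio levels from the text."""
    if pir < 1.30:
        return "<1.30"
    return "1.30-3.49" if pir < 3.5 else ">=3.5"


def smoking_status(smoked_100_lifetime: bool, smokes_now: bool) -> str:
    """'never', 'former', or 'now', following the definitions in the text."""
    if not smoked_100_lifetime:
        return "never"
    return "now" if smokes_now else "former"
```

For example, a participant weighing 81 kg with a height of 1.80 m has a BMI of about 25 kg/m².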

The metabolic equivalent (MET) measures energy metabolism during various daily activities. Physical activity was assessed through the Physical Activity Questionnaire (PAQ) section and quantified as PA (MET-h/week) = MET × weekly frequency × duration of each physical activity. Distinct MET values were assigned to the different physical activities according to NHANES guidelines, including vigorous work-related activity (MET = 8.0), vigorous leisure-time physical activity (MET = 8.0), moderate work-related activity (MET = 4.0), walking or bicycling for transportation (MET = 4.0), and moderate leisure-time physical activity (MET = 4.0). Participants were categorized into two subgroups based on their PA scores: low-intensity physical activity (PA < 48 MET-h/week) and high-intensity physical activity (PA > 48 MET-h/week). The history of hyperlipidemia, diabetes, and depression (Patient Health Questionnaire-9 [PHQ9] ≥ 10) was identified through the questionnaire section. The use of antihypertensive and antihyperlipidemic agents was ascertained based on participants’ self-reported medication usage in the in-home questionnaire. Multiple imputation methods were employed for missing covariate values.
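The physical activity score can be sketched as a sum over activity types. This Python example is a hedged illustration rather than the authors' code: the dictionary keys, the function names, and the use of hours per day as the duration unit are assumptions, and a score of exactly 48 MET-h/week is placed in the low group only because the text does not say where it belongs.

```python
from typing import Iterable, Tuple

# MET values per activity type, as listed in the text.
MET_VALUES = {
    "vigorous_work": 8.0,
    "vigorous_leisure": 8.0,
    "moderate_work": 4.0,
    "transport": 4.0,
    "moderate_leisure": 4.0,
}


def weekly_met_hours(activities: Iterable[Tuple[str, float, float]]) -> float:
    """Sum MET x days per week x hours per day over (kind, days, hours) tuples."""
    return sum(MET_VALUES[kind] * days * hours for kind, days, hours in activities)


def pa_group(met_hours: float, threshold: float = 48.0) -> str:
    """Classify the weekly score as 'high' (> threshold) or 'low' otherwise."""
    return "high" if met_hours > threshold else "low"
```

Three one-hour sessions of vigorous leisure activity plus five half-hour transport trips per week give 8 × 3 × 1 + 4 × 5 × 0.5 = 34 MET-h/week, which falls in the low-intensity group.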

Statistical analysis

All data analyses incorporated the complex stratified survey design and NHANES sampling weights to ensure national representativeness. Continuous variables in this study were reported as means and standard errors of the mean (SE), while categorical variables were presented as weighted percentages. Statistical tests were two-sided, and a significance threshold of p < 0.05 was applied. Data analysis was conducted from May 1 to August 1, 2023, using R and RStudio (R Foundation for Statistical Computing, Version 4.2.0). Multiple logistic regression models were employed in this study to evaluate the correlation between metformin use and the risk of CMD. Furthermore, multivariable Cox proportional hazards regression models were utilized to assess the impact of metformin use on risks of all-cause mortality and cardiometabolic mortality. The fully adjusted models included a range of covariates: age, gender, race and ethnicity, educational level, PIR, BMI, smoking status, alcohol use, intensity of physical activity, health conditions (including histories of hyperlipidemia, depression, and diabetes), medication history (use of antihypertensive and antihyperlipidemic agents), and years since first cancer diagnosis. The final reported outcomes from these analyses were the adjusted odds ratios or hazard ratios and their corresponding 95% confidence intervals (95% CI). The robustness of all logistic regression and Cox proportional hazards models was further evaluated by calculating the E-value [16], which represents the minimum strength of relationship, on the OR/HR scale, that an unmeasured confounding variable would need to have with both metformin use and cardiometabolic outcomes to entirely explain away the observed correlations, after adjusting for the measured covariates.
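For readers unfamiliar with the E-value, the sketch below applies the widely used point-estimate formula, E = RR + sqrt(RR × (RR − 1)), after inverting ratios below 1. It is an illustration under assumptions: the function name is hypothetical, the hazard or odds ratio is treated as an approximation of the risk ratio, and the text does not state whether the authors applied any additional transformation for common outcomes.

```python
import math


def e_value(ratio: float) -> float:
    """E-value for a point estimate on the ratio scale (risk-ratio approximation)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    # Protective estimates (< 1) are inverted so the formula is applied to a ratio >= 1.
    rr = 1.0 / ratio if ratio < 1 else ratio
    return rr + math.sqrt(rr * (rr - 1.0))
```

Applied to a ratio of 0.62, this function returns roughly 2.61; a ratio of exactly 1 gives an E-value of 1, meaning no unmeasured confounding is needed to explain a null result.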

To further investigate whether metformin exerted its cardiometabolic protective effect by counteracting oxidative stress, the distinctive nature of interaction analyses in logistic regression and Cox proportional hazards regression models was taken into account. Notably, Rothman et al. have highlighted that the interaction terms (a*b) in these models reflect only multiplicative interaction from a statistical perspective, rather than a biologically interpretable additive interaction effect [ 17 ]. Accordingly, a new variable was created with four mutually exclusive categories based on combinations of metformin use and oxidative stress levels: (1) metformin users in the low OS group, (2) metformin nonusers in the low OS group, (3) metformin users in the high OS group and (4) metformin nonusers in the high OS group. This categorization allowed us to quantify the additive interaction effect between metformin use and OS levels using the Relative Excess Risk due to Interaction (RERI), as recommended by the STROBE statement [ 18 ]. The RERI was computed using the formula $\mathrm{RERI} = RR_{11} - RR_{10} - RR_{01} + 1$, where $RR_{11}$ represents the relative risk for those exposed to both factors (metformin nonuser + high OS), $RR_{10}$ the relative risk for exposure to one factor only (metformin user + high OS), and $RR_{01}$ the relative risk for exposure to the other factor only (metformin nonuser + low OS), each relative to the reference category (metformin user + low OS). The 95% CI for these estimations was derived using the delta method described by Hosmer and Lemeshow [ 19 ]. This analysis was adjusted for the same covariates as in the fully adjusted multivariable model above.
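The RERI formula and a delta-method interval can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the inputs are the log-scale coefficients of the three non-reference categories and their covariance matrix as a fitted regression model would supply them, and a normal-approximation 95% interval (z = 1.96) is assumed.

```python
import math


def reri(rr11, rr10, rr01):
    """Relative excess risk due to interaction: RR11 - RR10 - RR01 + 1."""
    return rr11 - rr10 - rr01 + 1.0


def reri_delta_ci(betas, cov, z=1.96):
    """Point estimate and delta-method interval for the RERI.

    betas: (b10, b01, b11), the log ratios of the three non-reference
           categories versus the doubly unexposed reference.
    cov:   3x3 covariance matrix of those coefficients, same order.
    Returns (estimate, lower, upper).
    """
    b10, b01, b11 = betas
    estimate = reri(math.exp(b11), math.exp(b10), math.exp(b01))
    # Gradient of exp(b11) - exp(b10) - exp(b01) + 1 with respect to
    # (b10, b01, b11).
    grad = [-math.exp(b10), -math.exp(b01), math.exp(b11)]
    variance = sum(
        grad[i] * cov[i][j] * grad[j] for i in range(3) for j in range(3)
    )
    se = math.sqrt(variance)
    return estimate, estimate - z * se, estimate + z * se


# Hypothetical ratios, not taken from the study.
print(reri(3.0, 1.5, 2.0))  # 0.5
```

A RERI above zero indicates that the joint effect exceeds the sum of the two separate effects, which is the additive interaction the categorization is designed to expose.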

A series of sensitivity analyses was conducted to assess the robustness of the findings. First, we excluded deaths occurring in the first year of follow-up to minimize the potential for reverse causation. Second, accidental death was applied as a negative control outcome (deaths from “Accidents (unintentional injuries) (V01-X59, Y85-Y86)”). Third, considering that patients taking metformin all had comorbid diabetes, we examined the impact of sulfonylureas, another medication commonly used by cancer survivors with type 2 diabetes mellitus (T2DM) in our study cohort, on cardiometabolic outcomes to rule out confounding by the presence of T2DM. Fourth, we further adjusted for variables reflecting the severity of diabetes (HbA1c, diabetic retinopathy) and for the use of glucose-lowering medications with potential cardiometabolic benefits, including GLP-1 receptor agonists and SGLT-2 inhibitors. Fifth, patients who underwent renal dialysis within the past 12 months were excluded from the analysis. Lastly, analyses were conducted in the overall population and within specific subgroups stratified by age, gender, BMI, and race.

In this study involving 3995 cancer survivors (representative of a national weighted population of 21,671,061 individuals, weighted mean [SE] age, 62.62 [0.33] years, 53.04% women), the ethnic composition was diverse. The majority, 2727 participants (68.26%), identified as Non-Hispanic White. Five hundred eighty five individuals (14.64%) were Non-Hispanic Black, 260 (6.51%) were Mexican American, 220 (5.51%) were of other Hispanic backgrounds, and 203 (5.08%) belonged to other racial groups, including American Indian/Native Alaskan/Pacific Islander, Asian, and multiracial categories. There were a total of 3547 metformin nonusers and 448 metformin users. Table  1 presents the baseline profile of these participants, categorized based on their usage of metformin.

Among 3995 cancer survivors, a total of 1233 deaths occurred during a median follow-up of 6.42 years (77 months), including 481 deaths from cardiometabolic diseases. Table 2 illustrates the notable reduction in both all-cause and cardiometabolic mortality among cancer survivors undergoing metformin therapy. In the minimally adjusted model, hazard ratios (HRs) with 95% CIs for all-cause and cardiometabolic mortality were 0.60 (95% CI, 0.45–0.81) and 0.62 (95% CI, 0.42–0.93), respectively. Upon further adjustments in the fully adjusted model for covariates, the HRs for all-cause and cardiometabolic mortality were 0.62 (95% CI, 0.47–0.81, E-value = 2.61) and 0.65 (95% CI, 0.44–0.97, E-value = 2.45), respectively, among cancer survivors receiving metformin treatment compared to nonusers.
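For readers who want to check the E-values, the short Python sketch below applies the closed-form expression commonly used for ratio measures, E = RR + sqrt(RR × (RR − 1)), inverting protective estimates first. That the authors used exactly this form, with the HR treated as an approximation of the risk ratio, is an assumption, although it reproduces the two E-values reported above. The function name is hypothetical.

```python
import math


def e_value(ratio):
    """E-value for a point estimate on the ratio scale.

    Estimates below 1 (protective) are inverted first, so that the
    formula E = RR + sqrt(RR * (RR - 1)) is applied to RR >= 1.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    rr = 1.0 / ratio if ratio < 1.0 else ratio
    return rr + math.sqrt(rr * (rr - 1.0))


print(round(e_value(0.62), 2))  # 2.61, all-cause mortality
print(round(e_value(0.65), 2))  # 2.45, cardiometabolic mortality
```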

In our study cohort, 981 participants experienced cardiovascular diseases (CVDs), representing a weighted prevalence of 19.66% (95% CI, 17.64–21.69%). Furthermore, stroke occurred in 356 participants (weighted prevalence: 6.54% [95% CI, 5.51–7.56%]), hypertension in 2885 (weighted prevalence: 67.96% [95% CI, 63.49–72.43%]), and coronary heart disease in 390 (weighted prevalence: 8.01% [95% CI, 6.81–9.22%]). Among these cancer survivors, a total of 448 individuals were identified as metformin users. Survey-weighted logistic regression analysis on the cross-sectional data in Table  3 showed an inverse association between metformin use and cardiometabolic diseases in cancer survivors. The odds ratio (OR) for the incidence of total CVD in those undergoing metformin use, compared to those not receiving it, was 0.41 (95% CI, 0.28–0.59) after full adjustment. The ORs for the outcomes of stroke, hypertension, and coronary heart disease were 0.44 (95% CI, 0.26–0.74), 0.27 (95% CI, 0.14–0.52), and 0.41 (95% CI, 0.21–0.78), respectively, in the fully adjusted model. The E-values ranged from 2.49 to 4.31, which indicated that an unmeasured confounder would need to have at least an OR of 2.49 to explain the observed associations.

Previous research has indicated an elevated risk of cardiometabolic diseases in patients with certain types of tumors. In our study involving a nationally representative cohort of cancer survivors, the inverse association between cardiometabolic risk and metformin use was validated in specific cancer types. Fully adjusted analyses, as illustrated in Fig.  2 , show a reduction in all-cause mortality risk in survivors of hematologic (HR, 0.58; 95% CI, 0.40–0.82), breast (HR, 0.63; 95% CI, 0.45–0.89), and colorectal (HR, 0.57; 95% CI, 0.38–0.85) cancers treated with metformin. Decreased risk of cardiometabolic mortality was also noted in survivors of hematologic (HR, 0.56; 95% CI, 0.33–0.97) and breast (HR, 0.58; 95% CI, 0.36–0.94) cancers, but not among those with colorectal (HR, 0.70; 95% CI, 0.38–1.29) and prostate (HR, 0.78; 95% CI, 0.41–1.51) cancers when compared to individuals without metformin use. Regarding the risk of cardiometabolic diseases, a significant reduction in total CVD risk was observed in patients with all four specific cancer types treated with metformin. In breast, colorectal, and prostate cancer survivor subgroups, metformin demonstrated significantly beneficial effects on the risks of stroke and hypertension, with the protective effect of metformin against coronary heart disease primarily observed in survivors of hematologic and breast cancers.

Fig. 2

Association of metformin use with the risk of all-cause mortality ( A ), cardiometabolic mortality ( B ), and cardiometabolic diseases ( C - F ) in the overall cohort of cancer survivors and four specific cancer subgroups with high cardiometabolic risk. Metformin nonusers group was defined as the reference. Hazard ratios (depicted by solid symbols) with corresponding 95% CIs (represented by error bars) of metformin use for all-cause mortality ( A ) and cardiometabolic mortality ( B ) were estimated using weighted multivariable Cox regression models. Odds ratios (indicated by solid symbols) with corresponding 95% CIs (represented by error bars) of metformin use for the total cardiovascular diseases ( C ), stroke ( D ), hypertension ( E ), and coronary heart disease ( F ) were estimated using weighted multivariable logistic regression models. Both the multivariable Cox and logistic regression models were adjusted for age, gender, race/ethnicity, educational level, family poverty income ratio, BMI, smoking status, alcohol use, physical activity, hyperlipidemia, diabetes, depression, antihyperlipidemic drug use, antihypertensive drug use, and years since the first cancer diagnosis. HR, Hazard Ratio; OR, Odds Ratio; CI, Confidence interval; CVD, Cardiovascular disease; BMI, body mass index

Oxidative stress is a known hallmark of cardio-oncology, and metformin plays an important role in regulating the oxidant-antioxidant system. We further investigated whether antioxidant properties could explain the inverse relationship between metformin use and cardiometabolic risk in cancer survivors. Participants’ exposure to OS-related damage was evaluated by oxidative balance scores, and an interaction analysis was conducted to examine the antagonistic effect of metformin on OS. The interaction effects between metformin use and oxidative stress levels on all the cardiometabolic outcomes are shown in Table 4 . Compared with metformin users with a low OS level (reference), low-OS survivors without metformin use (“single-hit”), and high-OS survivors with metformin use (“single-hit”), high-OS survivors without metformin use (“double-hit”) exhibited the highest risks of all-cause/cardiometabolic mortality and cardiometabolic diseases (Table 4 ). A significant additive interaction was observed for all-cause mortality (RERI, 0.47; 95% CI, 0.21 to 0.73), cardiometabolic mortality (RERI, 0.53; 95% CI, 0.24 to 0.82), total CVD (RERI, 0.29; 95% CI, 0.06 to 0.52), stroke (RERI, 0.79; 95% CI, 0.28 to 1.30), and hypertension (RERI, 0.66; 95% CI, 0.09 to 1.23).

In the stratified analyses detailed in Supplementary Table 2 (Additional file 1: Table S2), based on age, gender, race, and baseline BMI, the inverse association of metformin use with the risk of total CVD, hypertension, and CHD was more pronounced among older individuals. No significant variations in the inverse relationship of metformin use with cardiometabolic outcomes were found across genders, races, or obesity categories. The robustness of our findings was further validated by a series of sensitivity analyses. To minimize potential reverse causation, we excluded deaths that occurred within the first year of follow-up, and the results remained significant (Additional file 1: Table S3, 4). The analysis of the negative control outcome of accidental death revealed no significant association between metformin use and the risk of accidental death ( N = 34; HR, 2.59; 95% CI, 0.27–24.78). When sulfonylurea use, instead of metformin use, was considered as the exposure, the associations with all-cause mortality and cardiometabolic outcomes were non-significant (Additional file 1: Table S5, 6). Further adjustments for HbA1c, diabetic retinopathy, GLP-1 receptor agonist use, and SGLT-2 inhibitor use did not substantially alter the results (Additional file 1: Table S7, 8). After excluding participants who underwent renal dialysis within the past 12 months, the inverse association between cardiometabolic risk and metformin use remained evident (Additional file 1: Table S9, 10).

In this study, conducted within a nationally representative cohort of cancer survivors in the USA, with a median follow-up duration of 6.4 years, metformin use as compared to non-use was inversely associated with the risk of cardiometabolic diseases, all-cause mortality, and cardiometabolic mortality in cancer survivors. This inverse association was observed not only in the overall population of cancer survivors, but also in patients with specific cancer types associated with a higher cardiometabolic risk. The potential mechanisms underlying this inverse association were further explored. Considering the pivotal role of oxidative stress in cardio-oncology and the widely reported antioxidative capacity of metformin, the OBS system was implemented for a quantitative assessment of oxidative stress levels within our cohort. The findings suggested that metformin use might exert cardiometabolic protection in patients with cancer by antagonizing oxidative stress. These findings remained consistent across diverse clinical subgroups and were corroborated by sensitivity analyses.

Challenges in managing cardiometabolic risks among cancer survivors

In the current landscape of oncological advancements, the USA has witnessed an increase in the number of cancer survivors, reaching nearly 17 million, many of whom continue to receive long-term cancer treatment post-diagnosis. Despite extended survival and reduced mortality due to cancer treatment innovations, the growing burden of CMD and associated mortality risks among cancer survivors is gaining attention. Early screening and treatment approaches in breast cancer have raised the 5-year survival rate above 90% [ 20 ]. However, this contrasts sharply with an increased risk of heart disease, diabetes, and cardiovascular-related deaths in these patients [ 21 , 22 ]. A common reason is reduced adherence to cardiometabolic medications following cancer diagnosis. Research has indicated a notable decline in statin adherence in patients with breast cancer, from 67% pre-diagnosis to just 35% 2 years post-diagnosis, with similar declines for antihypertensive and antidiabetic medications [ 23 , 24 ]. Furthermore, treatment regimens for breast cancer, such as anthracycline therapy and endocrine therapy, have been associated with adverse cardiac and metabolic effects [ 22 , 25 ]. These factors, coupled with reduced physical activity during cancer treatment, contribute to a detrimental cycle that heightens CMD risk in patients with breast cancer [ 26 , 27 ]. This cycle is recognized as prevalent across genders and cancer types. Addressing and actively preventing the potential CMD risk in patients with cancer is crucial, not only for cancer cure or chronic management but also for maximizing long-term health and productivity.

Interplay of cancer and cardiometabolic diseases: the key role of oxidative stress

Cancer and cardiometabolic diseases are intricately linked and mutually exacerbating. Specifically, the treatment pattern and lifestyle factors in patients with cancer can elevate the risk of CMD, as discussed above. Conversely, an elevated risk of cancer incidence and cancer-related mortality has also been observed in populations with cardiometabolic diseases [ 28 ]. These two conditions appear to share common mechanisms related to metabolic disorders [ 29 ]. The “Warburg effect” underscores that cancer cells display a distinct metabolic phenotype, marked by augmented glucose uptake compared to normal cells. Advances in sequencing technology have also unveiled the significant role of metabolic dysregulation in tumor growth and metastasis. The observed metabolic abnormalities, such as fumarate in renal cell carcinoma and glycine in breast cancer, suggest that somatic mutations may emerge as downstream effects of disruptions in cellular energy metabolism [ 30 , 31 ].

Existing research has highlighted the pivotal role of oxidative stress in the field of cardio-oncology. The cancer metabolic theory posits that the crux of metabolic dysregulation in cancer centers on defects in mitochondrial oxidative phosphorylation (OXPHOS) [ 32 ]. Impaired OXPHOS complements glycolysis in tumor metabolism. Concurrently, decreased respiratory efficiency in the OXPHOS pathway produces heightened reactive oxygen species (ROS). The resulting oxidative stress, characterized by elevated ROS levels, possesses notable mutagenic and carcinogenic properties. Oxidative stress increases the mutation rate in cells owing to its DNA-damaging capacity [ 33 ]. Furthermore, chemotherapeutic agents like doxorubicin have also been reported to exhibit substantial cardiotoxicity via oxidative stress [ 34 ]. Turning to cardiometabolic diseases, an imbalance between oxidants and antioxidants is also a common characteristic. Markers of redox imbalance were found to be elevated in models of hypertension [ 35 , 36 ]. Moreover, Niemann et al. observed increased OS markers in cardiomyocytes of patients undergoing coronary artery bypass graft surgery [ 37 ].

In summary, oxidative stress emerges as a significant potential comorbid mechanism in the relationship between cancer and cardiometabolic diseases. Interventions targeting oxidative stress might play a key role in breaking the vicious cycle of “cancer-cardiometabolic disease” interplay.

Metformin: a potential game-changer in cardiometabolic disease management among cancer survivors

Metformin, acknowledged as a crucial first-line agent in managing T2DM, primarily functions by activating the AMPK pathway in cells and curtailing hepatic glucose production. Beyond its conventional hypoglycemic efficacy, the pharmacological versatility of metformin across various human systems has garnered significant clinical interest. Apart from regulating glucose and lipid metabolism in cardiomyocytes, metformin lowers advanced glycation end products and ROS levels in the endothelium, offering substantial protection for cardiometabolic health [ 38 ]. Ongoing clinical trials have affirmed metformin’s beneficial impact on diverse cardiometabolic outcomes in the diabetic population, including coronary death, primary cardiovascular disease, and body weight [ 39 ]. Furthermore, Zheng et al., by assessing the genetically proxied effects of metformin targets on a comprehensive array of cardiometabolic outcomes, have demonstrated its effectiveness in improving cardiometabolic conditions such as CHD, BMI, and blood pressure in the general population [ 40 ].

Apart from its prominent regulatory role in CMD, metformin may be promising in combating malignancies in organs such as the breast, kidney, and endometrium. Although its anti-carcinogenic effect has yet to be proven clinically [ 41 ], properties found in preclinical studies, including inhibiting growth and inducing cell death of cancerous cells, support its anticancer potential [ 42 ]. With the increasing risk of CMD in cancer survivors, it is regrettable that no medication targeting the underlying mechanisms has yet gained approval for treatment. Our findings suggest that metformin might be crucial in disrupting the detrimental “cancer-cardiometabolic disease” cycle: patients with cancer using metformin experienced significant reductions in the risks of all-cause mortality, CMD, and cardiometabolic mortality compared with those not using metformin.

To further evaluate the robustness of all regression models, E-values were also applied to explore the potential impact of unmeasured confounding on our findings. The E-values for all-cause and cardiometabolic mortality were 2.61 and 2.45, respectively. These findings imply that an unobserved confounder would need to exhibit stronger associations with both metformin use and all-cause/cardiometabolic mortality than the measured confounder diabetes (HRs of 1.56 and 2.38, respectively) in order to fully explain away the observed HRs for metformin use. The same was true for the E-values of specific cardiometabolic diseases. The existence of an unobserved confounder that would have a stronger association with both metformin use and cardiometabolic outcomes than diabetes seems unlikely.

Oxidative stress mechanism and robust cardiometabolic protection across cancer subgroups

The mechanisms through which metformin intervenes to modulate cardiometabolic risk among cancer survivors were further investigated. As mentioned earlier, oxidative stress is a key shared pathogenic mechanism in both cancer and cardiometabolic diseases. Previous studies have already confirmed the potential of metformin in counteracting oxidative stress [ 43 , 44 , 45 ]. Expanding on our research affirming the protective impact of metformin on cardiometabolic health of cancer survivors, the OBS was utilized to assess OS levels in these individuals quantitatively. Subsequent interaction analysis revealed that metformin enhances cardiometabolic outcomes by exerting an antagonistic effect on the pathological process of oxidative stress. This observation shed light on the potential mechanism underlying metformin’s cardiometabolic protective properties.

In our comprehensive cancer-patient cohort, metformin was found to significantly reduce the risks of all-cause mortality, cardiometabolic mortality, and the specific risk of cardiometabolic diseases. However, the risk of subsequent CMD varied among patients with different cancer types. A recent cohort study involving 126,120 cancer survivors indicated an increased risk of cardiovascular events, such as CHD and stroke. Subgroup analysis further revealed elevated cardiovascular event risks in patients with hematologic malignancies and increased stroke risks in patients with breast cancer [ 46 ]. Another study of cancer survivors in UK Biobank suggested the highest hypertension comorbidity risk in prostate and colorectal cancer subgroups [ 47 ]. A subgroup analysis was conducted on patients with hematologic cancer, breast cancer, colorectal cancer, and prostate cancer who are at high risk for the CMD as mentioned above. The results of subgroup analysis similarly supported that metformin effectively reduces the risk of CMD and associated mortality risks in these specific cancer subgroups.

Strengths and limitations

As our understanding of the pathogenesis and treatment principles of cancer deepens, the therapeutic objectives for cancer survivors are evolving beyond merely “curing” cancer, aiming instead to foster a prolonged and productive lifestyle. With increasing oncologic survival, cancer survivors are at growing risk for various chronic conditions. Effective management of cancer survivors with concurrent CMD is critically essential. However, no medication has been widely recognized as effective in treating CMD among cancer survivors. Our current study addressed gaps in existing literature concerning the pharmacological management of cardiometabolic comorbidities in cancer survivors. It offered concrete evidence linking metformin use to enhanced cardiometabolic health post-cancer. One key strength of our study lies in the novel finding that metformin significantly reduces the risk of cardiometabolic diseases, all-cause mortality, and cardiometabolic mortality in cancer survivors. Moreover, even within subgroups of patients with four cancer types at higher cardiometabolic risk, the robust protective effect of metformin on cardiometabolic outcomes persisted. Crucially, our analysis utilized a large, nationally representative sample of cancer survivors in the USA, encompassing cancers with relatively high 5-year survival rates, such as breast, colorectal, and prostate cancers, and those with less favorable outcomes, including hematologic and ovarian cancers. This diverse representation enhances the clinical applicability of our findings, offering promising aspects for CMD treatment in cancer survivors with metabolic vulnerability and suggesting a new avenue for expanding the therapeutic spectrum of the classic drug metformin. 
Lastly, by systematically assessing oxidative stress levels in patients and conducting interaction analyses with metformin use, our results elucidated the potential pharmacological mechanism of metformin’s cardiometabolic protective effect, namely antagonizing oxidative stress in cancer survivors. This insight is relevant to future interventions targeting oxidative stress and to the development of CMD treatments for patients with cancer.

Despite these strengths, our study has several limitations that merit consideration. First, due to the observational nature of the NHANES database, it was challenging to incorporate all residual covariates into the adjusted model. However, we determined E-values to illustrate that the influence of unmeasured confounding factors is unlikely to be sufficient to nullify the observed correlations. Second, although adjustments for tumor-related factors, such as “years since the first cancer diagnosis,” were incorporated, NHANES lacked detailed data on specific cancer staging and treatment.

Third, information on cardiometabolic disease outcomes in this study was derived from cross-sectional data. Therefore, we could not definitively ascertain the temporal relationship between metformin use and reported cardiometabolic diseases. We also concur that reliance on self-reported data for cardiometabolic disease outcomes and cancer status might lead to misclassification, which possibly resulted in non-differential or differential bias. Non-differential misclassification, which occurs when the misclassification of cardiometabolic outcome is independent of the exposure status, could potentially attenuate the observed correlations towards the null. In certain cases, this bias may lead to false positive results, especially when the misclassification interacts with other variables. Although the NHANES survey employed a stringent data collection protocol to minimize misclassification, we cannot fully rule out the possibility of misclassification. To mitigate this concern, future studies could incorporate medical records or clinical assessments to validate the self-reported diagnoses, thereby reducing the likelihood of misclassification and providing more precise correlation estimation.
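The attenuation produced by non-differential outcome misclassification can be made concrete with a small worked example in Python. The counts, sensitivity, and specificity below are hypothetical and are not taken from NHANES; the function names are likewise illustrative. The same sensitivity and specificity are applied to exposed and unexposed groups, which is what makes the misclassification non-differential.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: exposed with outcome,   b: exposed without outcome,
    c: unexposed with outcome, d: unexposed without outcome.
    """
    return (a * d) / (b * c)


def misclassify(cases, noncases, sensitivity, specificity):
    """Expected observed counts after imperfect outcome ascertainment."""
    observed_cases = sensitivity * cases + (1 - specificity) * noncases
    observed_noncases = (1 - sensitivity) * cases + specificity * noncases
    return observed_cases, observed_noncases


def observed_odds_ratio(a, b, c, d, sensitivity, specificity):
    """Odds ratio after applying the same error rates to both groups."""
    a_obs, b_obs = misclassify(a, b, sensitivity, specificity)
    c_obs, d_obs = misclassify(c, d, sensitivity, specificity)
    return odds_ratio(a_obs, b_obs, c_obs, d_obs)


# Hypothetical true table: 100/900 among exposed, 200/800 among unexposed.
true_or = odds_ratio(100, 900, 200, 800)
obs_or = observed_odds_ratio(100, 900, 200, 800,
                             sensitivity=0.80, specificity=0.95)
print(round(true_or, 3), round(obs_or, 3))
```

With these inputs the true odds ratio is about 0.44 and the observed odds ratio about 0.57, that is, closer to the null value of 1.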

Fourth, another limitation of this study was that we only had binary information on whether participants used metformin or not, but lacked data on the duration of medication use, dosage, and other relevant details. Fifth, information on metformin use (exposure) was obtained by reviewing participants’ medication use in the past month during the household interview. However, the NHANES database lacked survival data of cardiometabolic outcomes for the period between metformin treatment initiation and the start of the follow-up (household interview). Consequently, our findings were possibly influenced by immortal time bias. Although we conducted a sensitivity analysis by excluding cancer survivors who died early in the follow-up period and found that the results remained robust, the potential impact of immortal time bias should be considered when applying our findings.

Sixth, although the association between the exposure and the negative control outcome (i.e., accidental deaths) was not statistically significant, the point estimate was large and imprecise ( N = 34; HR, 2.59; 95% CI, 0.27–24.78), so we cannot exclude the possibility of a certain degree of bias in our main analyses. Lastly, the oxidative balance score was computed based on the assumption that all prooxidants and antioxidants correlate linearly with oxidative stress levels, without considering the threshold effect of antioxidants. Studies have indicated that specific antioxidants, such as carotenoids and copper, might demonstrate prooxidative effects when administered in elevated concentrations. Future research should consider these factors or introduce more robust biomarkers related to oxidative stress to provide more reliable validation of the cardiometabolic protective mechanisms of metformin.

In this prospective cohort study encompassing a nationally representative sample of US cancer survivors, metformin use as compared to non-use was inversely associated with the risks of all-cause mortality, cardiometabolic disease, and associated mortality. This study provided novel evidence and perspectives on the pharmaceutical management of cancer survivors in the field of cardio-oncology, highlighting potential directions for the design and development of cardiometabolic protective drugs specifically beneficial for cancer populations. Considering the lack of detailed data on specific cancer staging and treatment as well as the cross-sectional nature of cardiometabolic disease outcomes, subsequent longitudinal and more comprehensive studies are urgently needed to further elucidate the practical application and pharmacological mechanisms of metformin in the management of cardiometabolic health among cancer survivors.

Availability of data and materials

The NHANES data supporting the results of this study are available online through https://wwwn.cdc.gov/nchs/nhanes/Default.aspx .

Abbreviations

95% CI: 95% Confidence intervals
AMPK: Adenosine 5’ monophosphate-activated protein kinase
BMI: Body mass index
CHD: Coronary heart disease
CMD: Cardiometabolic disease
CS: Cancer survivors
CVD: Cardiovascular diseases
DAG: Directed acyclic graph
HR: Hazard ratio
ICD-10: The tenth revision of the international classification of diseases
MET: Metabolic equivalent
NHANES: National health and nutrition examination survey
OS: Oxidative stress
OXPHOS: Oxidative phosphorylation
PAQ: Physical activity questionnaire
PHQ-9: Patient health questionnaire-9
PIR: Poverty to income ratio
RERI: Relative excess risk due to interaction
ROS: Reactive oxygen species
SE: Standard error of mean
T2DM: Type 2 diabetes mellitus

GBD 2017 Causes of Death Collaborators. Global, regional, and national age-sex-specific mortality for 282 causes of death in 195 countries and territories, 1980–2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet. 2018;392(10159):1736–88.


Freisling H, Viallon V, Lennon H, Bagnardi V, Ricci C, Butterworth AS, et al. Lifestyle factors and risk of multimorbidity of cancer and cardiometabolic diseases: a multinational cohort study. BMC Med. 2020;18(1):5.


Zullig LL, Sung AD, Khouri MG, Jazowski S, Shah NP, Sitlinger A, et al. Cardiometabolic comorbidities in cancer survivors: JACC: CardioOncology State-of-the-Art review. JACC CardioOncol. 2022;4(2):149–65.


Leerink JM, de Baat EC, Feijen EAM, Bellersen L, van Dalen EC, Grotenhuis HB, et al. Cardiac disease in childhood cancer survivors: risk prediction, prevention, and surveillance: JACC CardioOncology State-of-the-Art Review. JACC CardioOncol. 2020;2(3):363–78.

Maras AF, Penedo FJ, Ramirez AG, Worch SM, Ortiz MS, Yanez B, et al. Cardiometabolic comorbidities in Hispanic/Latino cancer survivors: prevalence and impact on health-related quality of life and supportive care needs. Support Care Cancer. 2023;31(12):711.


Cairns RA, Harris IS, Mak TW. Regulation of cancer cell metabolism. Nat Rev Cancer. 2011;11(2):85–95.


Lei K, Xia Y, Wang XC, Ahn EH, Jin L, Ye K. C/EBPβ mediates NQO1 and GSTP1 anti-oxidative reductases expression in glioblastoma, promoting brain tumor proliferation. Redox Biol. 2020;34: 101578.

Geng Y, Wang Z, Xu X, Sun X, Dong X, Luo Y, et al. Extensive therapeutic effects, underlying molecular mechanisms and disease treatment prediction of Metformin: a systematic review. Transl Res. 2024;263:73–92.

Chow E, Yang A, Chung CHL, Chan JCN. A clinical perspective of the multifaceted mechanism of metformin in diabetes, infections, cognitive dysfunction, and cancer. Pharmaceuticals (Basel). 2022;15(4):442.

Karmanova E, Chernikov A, Usacheva A, Ivanov V, Bruskov V. Metformin counters oxidative stress and mitigates adverse effects of radiation exposure: an overview. Fundam Clin Pharmacol. 2023;37(4):713–25.

Teng X, Brown J, Morel L. Redox homeostasis involvement in the pharmacological effects of metformin in systemic lupus erythematosus. Antioxid Redox Signal. 2022;36(7–9):462–79.

Centers for Disease Control and Prevention (CDC). National health and nutrition examination survey data. 2003–2018. Available from: https://www.cdc.gov/nchs/nhanes/index.htm .

Kim S, Cho J, Shin DW, Jeong SM, Kang D. Racial differences in long-term social, physical, and psychological health among adolescent and young adult cancer survivors. BMC Med. 2023;21(1):289.

Li H, Song L, Cen M, Fu X, Gao X, Zuo Q, et al. Oxidative balance scores and depressive symptoms: Mediating effects of oxidative stress and inflammatory factors. J Affect Disord. 2023;334:205–12.

Zhang W, Peng SF, Chen L, Chen HM, Cheng XE, Tang YH. Association between the Oxidative Balance Score and Telomere Length from the National Health and Nutrition Examination Survey 1999–2002. Oxid Med Cell Longev. 2022;2022:1345071.

PubMed   PubMed Central   Google Scholar  

Gaster T, Eggertsen CM, Støvring H, Ehrenstein V, Petersen I. Quantifying the impact of unmeasured confounding in observational studies with the E value. BMJ Med. 2023;2(1): e000366.

Rothman KJ, Greenland S, Lash TL. Modern epidemiology. Vol. 3. Wilkins Philadelphia: Wolters Kluwer Health/Lippincott Williams; 2008.

Google Scholar  

Vandenbroucke JP, von Elm E, Altman DG, Gøtzsche PC, Mulrow CD, Pocock SJ, et al. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration. PLoS Med. 2007;4(10): e297.

Hosmer DW, Lemeshow S. Confidence interval estimation of interaction. Epidemiology. 1992;3(5):452–6.

Siegel RL, Miller KD, Wagle NS, Jemal A. Cancer statistics, 2023. CA Cancer J Clin. 2023;73(1):17–48.

Bradshaw PT, Stevens J, Khankari N, Teitelbaum SL, Neugut AI, Gammon MD. Cardiovascular Disease Mortality Among Breast Cancer Survivors. Epidemiology. 2016;27(1):6–13.

Thomas NS, Scalzo RL, Wellberg EA. Diabetes mellitus in breast cancer survivors: metabolic effects of endocrine therapy. Nat Rev Endocrinol. 2024;20(1):16–26.

Calip GS, Elmore JG, Boudreau DM. Characteristics associated with nonadherence to medications for hypertension, diabetes, and dyslipidemia among breast cancer survivors. Breast Cancer Res Treat. 2017;161(1):161–72.

Calip GS, Boudreau DM, Loggers ET. Changes in adherence to statins and subsequent lipid profiles during and following breast cancer treatment. Breast Cancer Res Treat. 2013;138(1):225–33.

Glen C, Morrow A, Roditi G, Hopkins T, Macpherson I, Stewart P, et al. Cardiovascular sequelae of trastuzumab and anthracycline in long-term survivors of breast cancer. Heart. 2023;heartjnl-2023–323437.

Cespedes Feliciano EM, Kwan ML, Kushi LH, Weltzien EK, Castillo AL, Caan BJ. Adiposity, post-diagnosis weight change, and risk of cardiovascular events among early-stage breast cancer survivors. Breast Cancer Res Treat. 2017;162(3):549–57.

Jones LW, Courneya KS, Mackey JR, Muss HB, Pituskin EN, Scott JM, et al. Cardiopulmonary function and age-related decline across the breast cancer survivorship continuum. J Clin Oncol. 2012;30(20):2530–7.

Ren QW, Yu SY, Teng THK, Li X, Cheung KS, Wu MZ, et al. Statin associated lower cancer risk and related mortality in patients with heart failure. Eur Heart J. 2021;42(32):3049–59.

Hoang G, Nguyen K, Le A. Metabolic intersection of cancer and cardiovascular diseases: opportunities for cancer therapy. In: The heterogeneity of cancer metabolism. 2nd edition. Springer; 2021. Available from: https://www.ncbi.nlm.nih.gov/books/NBK573679/ .

Gyamfi J, Kim J, Choi J. Cancer as a Metabolic Disorder. Int J Mol Sci. 2022;23(3):1155.

Yang M, Soga T, Pollard PJ. Oncometabolites: linking altered metabolism with cancer. J Clin Invest. 2013;123(9):3652–8.

Seyfried TN, Shelton LM. Cancer as a metabolic disease. Nutr Metab (Lond). 2010;27(7):7.

Taucher E, Mykoliuk I, Fediuk M, Smolle-Juettner FM. Autophagy, Oxidative Stress and Cancer Development. Cancers (Basel). 2022;14(7):1637.

Jiang H, Zuo J, Li B, Chen R, Luo K, Xiang X, et al. Drug-induced oxidative stress in cancer treatments: Angel or devil? Redox Biol. 2023;63: 102754.

Montezano AC, Dulak-Lis M, Tsiropoulou S, Harvey A, Briones AM, Touyz RM. Oxidative stress and human hypertension: vascular mechanisms, biomarkers, and novel therapies. Can J Cardiol. 2015;31(5):631–41.

Griendling KK, Camargo LL, Rios FJ, Alves-Lopes R, Montezano AC, Touyz RM. Oxidative Stress and Hypertension. Circ Res. 2021;128(7):993–1020.

Niemann B, Chen Y, Teschner M, Li L, Silber RE, Rohrbach S. Obesity induces signs of premature cardiac aging in younger patients: the role of mitochondria. J Am Coll Cardiol. 2011;57(5):577–85.

Ding Y, Zhou Y, Ling P, Feng X, Luo S, Zheng X, et al. Metformin in cardiovascular diabetology: a focused review of its impact on endothelial function. Theranostics. 2021;11(19):9376–96.

Zhu J, Yu X, Zheng Y, Li J, Wang Y, Lin Y, et al. Association of glucose-lowering medications with cardiovascular outcomes: an umbrella review and evidence map. Lancet Diabetes Endocrinol. 2020;8(3):192–205.

Zheng J, Xu M, Yang Q, Hu C, Walker V, Lu J, et al. Efficacy of metformin targets on cardiometabolic health in the general population and non-diabetic individuals: a Mendelian randomization study. EBioMedicine. 2023;96: 104803.

Lord SR, Harris AL. Is it still worth pursuing the repurposing of metformin as a cancer therapeutic? Br J Cancer. 2023;128(6):958–66.

Dutta S, Shah RB, Singhal S, Dutta SB, Bansal S, Sinha S, et al. Metformin: A Review of Potential Mechanism and Therapeutic Utility Beyond Diabetes. Drug Des Devel Ther. 2023;17:1907–32.

Darbandi N, Moghadasi S, Momeni HR, Ramezani M. Comparing the acute and chronic effects of metformin and antioxidant protective effects of N-acetyl cysteine on memory retrieval and oxidative stress in rats with Alzheimer’s disease. Pak J Pharm Sci. 2023;36(3):731–9.

CAS   Google Scholar  

Jaikumkao K, Thongnak L, Htun KT, Pengrattanachot N, Phengpol N, Sutthasupha P, et al. Dapagliflozin and metformin in combination ameliorates diabetic nephropathy by suppressing oxidative stress, inflammation, and apoptosis and activating autophagy in diabetic rats. Biochim Biophys Acta Mol Basis Dis. 2024;1870(1): 166912.

Kamel AM, Ismail B, Abdel Hafiz G, Sabry N, Farid S. Effect of metformin on oxidative stress and left ventricular geometry in nondiabetic heart failure patients: a randomized controlled trial. Metab Syndr Relat Disord. 2024;22(1):49–58.

Strongman H, Gadd S, Matthews A, Mansfield KE, Stanway S, Lyon AR, et al. Medium and long-term risks of specific cardiovascular diseases in survivors of 20 adult cancers: a population-based cohort study using multiple linked UK electronic health records databases. Lancet. 2019;394(10203):1041–54.

Raisi-Estabragh Z, Cooper J, McCracken C, Crosbie EJ, Walter FM, Manisty CH, et al. Incident cardiovascular events and imaging phenotypes in UK Biobank participants with past cancer. Heart. 2023;109(13):1007–15.

Download references

Acknowledgements

We thank our colleagues for helpful discussions related to the study. We thank Home for Researchers and Bullet Edits Limited for providing linguistic support for editing and proofreading the manuscript.

This work was supported by the National Natural Science Foundation of China [grant numbers 81870243, 82170318, 82170310].

Author information

Authors and affiliations.

Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, 100012, China

Yukun Li, Xiaoying Liu, Wenhe Lv, Xuesi Wang, Zhuohang Du, Xinmeng Liu, Fanchao Meng, Shuqi Jin, Nian Liu & Ribo Tang

National Clinical Research Center for Cardiovascular Diseases, Beijing, 100012, China

Banner University Medical Center Phoenix, College of Medicine University of Arizona, Phoenix, AZ, 85123, USA

Department of Cardiovascular Medicine, Mayo Clinic, Scottsdale, AZ, 85259, USA

Songnan Wen

Contributions

XL1, WL, and XW collected data and related articles. YL analyzed data and wrote this manuscript. ZD, SJ, XL2, and FM contributed to the revision of the manuscript. SW, RB, NL, and RT designed the study and interpreted results. All authors read and approved of the final manuscript.

Corresponding authors

Correspondence to Rong Bai , Nian Liu or Ribo Tang .

Ethics declarations

Ethics approval and consent to participate.

The National Center for Health Statistics Ethics Review Committee granted ethics approval of NHANES (Protocol No. 98–12 for the 2003–2004 cycle, Protocol No. 2005–06 for the 2005–2010 cycles, Protocol No. 2011–17 for the 2011–2016 cycles, Protocol No. 2018–01 for the 2017–2018 cycle), as detailed on their official site ( https://www.cdc.gov/nchs/nhanes/irba98.htm ). NHANES provides open-access data for public use. The analysis in our study is a secondary analysis of NHANES data. Secondary analyses of publicly accessible de-identified datasets are exempt from review by the Research Ethics Committee at our institution (human research ethics committees of Beijing Anzhen Hospital). All individuals signed written informed consent before participating in the NHANES study.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

12916_2024_3484_moesm1_esm.doc.

Additional file 1: Figure S1. The hypothetical directed acyclic graph used to select potential covariates. Table S1. Components of the oxidative balance score. Table S2. Association of Metformin Use with All-Cause Mortality and Cardiometabolic Outcomes Among US Cancer Survivors by Age, Gender, BMI, and Race, NHANES 2003 to 2018. Table S3. Association of Metformin Use with All-Cause and Cardiometabolic Mortality Risk Among US Cancer Survivors, NHANES 2003 to 2018. Table S4. Relative excess risk of all-cause mortality, cardiometabolic mortality due to antagonistic interaction effect of metformin use and oxidative stress levels in cancer survivors. Table S5. Association of Sulfonylurea Use with All-Cause and Cardiometabolic Mortality Risk Among US Cancer Survivors, NHANES 2003 to 2018. Table S6. Correlations of Sulfonylurea Use with Four Specific Cardiometabolic Diseases Risk Among US Cancer Survivors, NHANES 2003 to 2018. Table S7. Association between Metformin Use and All-Cause/Cardiometabolic Mortality Risk with further adjustment of HbA1c, Diabetic Retinopathy, GLP-1 Receptor Agonists Use and SGLT-2 Inhibitors Use. Table S8. Correlations between Metformin Use and Four Specific Cardiometabolic Diseases Risk with further adjustment of HbA1c, Diabetic Retinopathy, GLP-1 Receptor Agonists Use and SGLT-2 Inhibitors Use. Table S9. Association between Metformin Use and All-Cause/Cardiometabolic Mortality Risk Among US Cancer Survivors after Excluding Patients Receiving Dialysis in Past 12 Months, NHANES 2003 to 2018. Table S10. Correlations between Metformin Use and Four Specific Cardiometabolic Diseases Risk Among US Cancer Survivors after Excluding Patients Receiving Dialysis in Past 12 Months, NHANES 2003 to 2018.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Li, Y., Liu, X., Lv, W. et al. Metformin use correlated with lower risk of cardiometabolic diseases and related mortality among US cancer survivors: evidence from a nationally representative cohort study. BMC Med 22 , 269 (2024). https://doi.org/10.1186/s12916-024-03484-y

Received : 21 January 2024

Accepted : 13 June 2024

Published : 26 June 2024

DOI : https://doi.org/10.1186/s12916-024-03484-y

  • Cardio-oncology

BMC Medicine

ISSN: 1741-7015

  • Open access
  • Published: 18 June 2024

Plasma proteomics identify biomarkers predicting Parkinson’s disease up to 7 years before symptom onset

  • Jenny Hällqvist   ORCID: orcid.org/0000-0001-6709-3211 1 , 2   na1 ,
  • Michael Bartl   ORCID: orcid.org/0000-0002-7752-2443 3 , 4   na1 ,
  • Mohammed Dakna 3 ,
  • Sebastian Schade   ORCID: orcid.org/0000-0002-6316-6804 5 ,
  • Paolo Garagnani   ORCID: orcid.org/0000-0002-4161-3626 6 ,
  • Maria-Giulia Bacalini 7 ,
  • Chiara Pirazzini 6 ,
  • Kailash Bhatia   ORCID: orcid.org/0000-0001-8185-286X 8 ,
  • Sebastian Schreglmann   ORCID: orcid.org/0000-0002-4129-5808 8 ,
  • Mary Xylaki   ORCID: orcid.org/0000-0002-7892-8621 3 ,
  • Sandrina Weber 3 ,
  • Marielle Ernst 9 ,
  • Maria-Lucia Muntean 5 ,
  • Friederike Sixel-Döring 5 , 10 ,
  • Claudio Franceschi   ORCID: orcid.org/0000-0001-9841-6386 6 ,
  • Ivan Doykov 1 ,
  • Justyna Śpiewak 1 ,
  • Héloїse Vinette   ORCID: orcid.org/0009-0000-4360-1293 1 , 11 ,
  • Claudia Trenkwalder 5 , 12 ,
  • Wendy E. Heywood   ORCID: orcid.org/0000-0003-2106-8760 1 ,
  • Kevin Mills 2   na2 &
  • Brit Mollenhauer   ORCID: orcid.org/0000-0001-8437-3645 3 , 5   na2  

Nature Communications volume 15, Article number: 4759 (2024)

  • Parkinson's disease

Parkinson’s disease is increasingly prevalent. It progresses from the pre-motor stage (characterised by non-motor symptoms like REM sleep behaviour disorder) to the disabling motor stage. We need objective biomarkers for early/pre-motor disease stages to be able to intervene and slow the underlying neurodegenerative process. Here, we validate a targeted multiplexed mass spectrometry assay for blood samples from recently diagnosed motor Parkinson’s patients (n = 99), pre-motor individuals with isolated REM sleep behaviour disorder (two cohorts: n = 18 and n = 54 longitudinally), and healthy controls (n = 36). Our machine-learning model accurately identifies all Parkinson patients and classifies 79% of the pre-motor individuals up to 7 years before motor onset by analysing the expression of eight proteins: Granulin precursor, Mannan-binding-lectin-serine-peptidase-2, Endoplasmic-reticulum-chaperone-BiP, Prostaglandin-H2-D-isomerase, Intercellular-adhesion-molecule-1, Complement C3, Dickkopf-WNT-signalling pathway-inhibitor-3, and Plasma-protease-C1-inhibitor. Many of these biomarkers correlate with symptom severity. This specific blood panel indicates molecular events in early stages and could help identify at-risk participants for clinical trials aimed at slowing/preventing motor Parkinson’s disease.

Introduction.

Parkinson’s disease (PD) is a complex and increasingly prevalent neurodegenerative disease of the central nervous system (CNS). It is clinically characterised by progressive motor and non-motor symptoms that are caused by α-synuclein aggregation predominantly in dopaminergic cells, which leads to Lewy body (LB) formation 1 . The failure of neuroprotective strategies in preventing disease progression is due, in part, to the clinical heterogeneity of the disease—it has several phenotypes—and to the lack of objective biomarker readouts 2 . To facilitate the approval of neuroprotective strategies, governing agencies and pharmaceutical companies need regulatory pathways that use objectively measurable markers—potential therapeutical targets as well as state and rate biomarkers—directly associated with PD pathophysiology and clinical phenotypes 3 .

The recently emerged α-synuclein seed amplification assays (SAA) can identify α-synuclein pathology in vivo and support stratification purposes but still rely on cerebrospinal fluid (CSF) obtained through relatively invasive lumbar punctures 4 . Therefore, this test remains specialised and not readily suitable for large-scale clinical use. As peripheral fluid biomarkers are less invasive and easier to obtain, they could be used in repeated and long-term monitoring, which is necessary for population-based screenings for upcoming neuroprotective trials. The only serum biomarker to have emerged in recent years, the axonal marker neurofilament light chain (NfL), increases longitudinally and correlates with motor and cognitive PD progression 5 , but it is non-specific to the disease process.

Growing data support evidence of PD pathology in the peripheral system, which increases the likelihood of finding a source of matrices for less invasive biomarkers. We know α-synuclein aggregation induces neurodegeneration, which is propagated throughout the CNS. Evidence indicates that additional inflammatory events are an early and potentially initial step in a pathophysiological cascade leading to downstream α-synuclein aggregation that activates the immune system 6 . Inflammatory risk factors in circulating blood (i.e. C-reactive-protein and Interleukin-6 and α-synuclein-specific T-cells) are associated with motor deterioration and cognitive decline in PD 7 , 8 . These inflammatory blood markers can even be identified in plasma/serum samples of individuals with isolated REM sleep behaviour disorder (iRBD), the early stage of a neuronal synuclein disease (NSD), and the most specific predictor for PD and dementia with Lewy bodies (DLB) 6 . NSD was recently proposed as a biologically defined term, for a spectrum of clinical syndromes, including iRBD, PD and DLB, that follow an integrated clinical staging system of progressing neuronal α-synuclein pathology (NSD-ISS) 9 .

In this study, we used mass spectrometry-based proteomic phenotyping to identify a panel of blood biomarkers in early PD. In the initial discovery stage, we analysed samples from a well-characterised cohort of de novo PD patients and healthy controls (HC) who had been subjected to rigorous collection protocols 10 . Using unbiased state-of-the-art mass spectrometry, we identified putatively involved proteins, suggesting an early inflammatory profile in plasma. We thereafter moved on to the validation phase by creating a high-throughput and targeted proteomic assay that was applied to samples from an independent replication cohort, consisting of de novo PD, HC and iRBD patients. Finally, after refining the targeted proteomic panel to include a multiplex of only the biomarkers which were reliably measured, an independent analysis was performed on a larger and independent cohort of longitudinal, high-risk subjects who had been confirmed as iRBD by state-of-the-art video-recorded polysomnography (vPSG), including follow-up sampling of up to 7 years.

In summary, using a panel of eight blood biomarkers identified in a machine-learning approach, we were able to differentiate between PD and HC with a specificity of 100%, and to identify 79% of the iRBD subjects, up to 7 years before the development of either DLB or motor PD (NSD stage 3). Our identified panel of biomarkers significantly advances NSD research by providing potential screening and detection markers for use in the earliest stages of NSD for subject identification/stratification for the upcoming prevention trials.

Proteomic discovery phase 0

We performed a bottom-up proteomics analysis of plasma, which had been depleted of the major blood proteins, using two-dimensional in-line liquid chromatography fractionation into ten fractions and label-free mass spectrometric analysis by QTOF MS^E. The discovery cohort consisted of ten randomly selected drug-naïve patients with PD and ten matched HC from the de novo Parkinson’s disease (DeNoPa 10 ) cohort (details can be found in Supplementary Table 1). This analysis identified 1238 proteins when identification required at least one peptide per protein and at least two fragments per peptide. After excluding proteins with fewer than two unique peptides or with an identification score below a set threshold (see method section below), 895 distinct proteins remained. Of these proteins, 47 were differentially expressed between the de novo PD and control groups at a nominal significance level of p < 0.05. Pathway analysis suggested enrichment in several inflammatory pathways. The workflow and results are shown in Figs. 1 and 2 and Supplementary Figs. 1 and 2.

figure 1

The study included three phases. Phase 0 consisted of discovery proteomics by untargeted mass spectrometry to identify putative biomarkers, followed by phase I in which targets from the discovery phase were transferred to a targeted, mass spectrometric MRM method and applied to a new and larger cohort of samples, and finally phase II in which the targeted MRM method was refined and a larger number of samples were analysed to evaluate the clinical feasibility of the targeted protein panel.

figure 2

The circle radii in the Volcano plot represent the identification certainty, where large radii represent proteins identified by at least two unique peptides and an identification score >15, and smaller radii are given for proteins identified by two or more unique peptides or a confidence score >15. The horizontal axis shows log2 of the average fold-change and the vertical axis shows −log10 of the p values. The significantly different proteins are annotated by gene name and coloured in pink, while the non-significant proteins are coloured in grey. GO annotations for the significant proteins are shown; the dashed line represents p = 0.05. Disease and function annotations from IPA are shown, divided into annotations with a positive or negative activation score. Source data are provided as a Source Data file.
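The two volcano-plot axes described in this caption can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name `volcano_point`, the protein labels and the abundance values are all hypothetical, and the 0.05 cut-off simply mirrors the dashed line in the figure.

```python
import math

def volcano_point(mean_case, mean_control, p_value, alpha=0.05):
    """Return (log2 fold-change, -log10 p, significant?) for one protein.

    mean_case / mean_control are average abundances in the two groups;
    alpha is the nominal significance threshold (assumed 0.05 here).
    """
    if mean_case <= 0 or mean_control <= 0:
        raise ValueError("mean abundances must be positive")
    if not 0 < p_value <= 1:
        raise ValueError("p_value must be in (0, 1]")
    log2_fc = math.log2(mean_case / mean_control)   # horizontal axis
    neg_log10_p = -math.log10(p_value)              # vertical axis
    return log2_fc, neg_log10_p, p_value < alpha

# Hypothetical values for illustration only (not study data).
example = {
    "PROT_A": (200.0, 100.0, 0.01),  # doubled in cases, nominally significant
    "PROT_B": (90.0, 100.0, 0.40),   # slightly lower in cases, not significant
}
points = {name: volcano_point(*vals) for name, vals in example.items()}
```

A protein with a positive log2 fold-change lies to the right of the plot (higher in the case group), and points above −log10(0.05) ≈ 1.3 fall above the dashed line.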

Selection of proteins for the targeted proteomic assay

We next developed a validatory, high-throughput and multiplexed, mass spectrometric targeted proteomic assay based on the potential biomarkers identified in the discovery phase. Additional proteins were also included in the assay, several of which had been identified in previous discovery studies of PD, Alzheimer’s disease (AD), and ageing 11 . In addition, we also included several known pro- and anti-inflammatory proteins identified in the literature 12 , 13 , 14 , 15 , which had been previously developed into an in-house targeted proteomic neuroinflammatory panel. Using this approach, we created a targeted proteomic panel, including biomarkers from current scientific developments and preliminary findings from our own work 16 , 17 . This targeted proteomic and multiplexed assay included 121 proteins and aimed to validate biomarkers and probe the pathways identified as being perturbed in the discovery phase. Details can be found in Supplementary Table  2 and Fig.  3 .

figure 3

Workflow and overview of the results of the targeted proteomic analysis of de novo Parkinson’s disease (PD) subjects, healthy controls (HC), and the validation cohorts of other neurological disorders (OND) and isolated REM sleep behaviour disorder (iRBD). A A targeted mass spectrometric proteomic assay was developed and optimised. The assay was then applied to plasma samples from cohorts comprising de novo PD (n = 99) and HC (n = 36), and validated in patients with OND (n = 41) and prodromal subjects with iRBD (n = 18). The protein expression difference between the groups was compared using Mann–Whitney’s two-sided U-test with Benjamini–Hochberg FDR adjustment at 5%. The lollipop charts show the log10 p values, signed according to fold-changes. Pink icons represent a protein upregulated in an affected group and grey represents a protein upregulated in controls. B Significantly differentially expressed proteins in the comparison between de novo PD and healthy controls. C Significantly differentially expressed proteins between iRBD, OND and HC. Source data are provided as a Source Data file.

Demographics-targeted proteomic validation phase (phase I)

For the targeted proteomics analysis, we used plasma samples, independent from the proteomic discovery step, from 99 individuals recently diagnosed with de novo PD (48 men, 50%, mean age 67 years) and 36 healthy controls (HC; 20 men, 57%, mean age 64 years). This was the main cohort, to which we added further samples for validation that consisted of a heterogeneous group of 41 patients with other neurological diseases (OND) (29 men, 71%, mean age 70 years) and 18 patients with vPSG-confirmed iRBD (10 men, 56%, mean age 67 years). Further details can be found in Table  1 and Fig.  3 .

The identification of biomarkers that were significantly and differentially expressed between patients with de novo Parkinson’s disease and healthy controls - Targeted proteomic validation phase (phase I)

Our targeted proteomic assay was developed for 121 proteins, 32 of which we consistently and reliably detected in plasma. Of these 32 markers, 23 were confirmed as being significantly and differentially expressed between PD and HC. We identified six differentially expressed proteins in the comparison between iRBD patients and HC and between OND and HC (Fig.  3 ). Both the de novo PD and iRBD groups demonstrated an upregulated expression of the serine protease inhibitors SERPINA3, SERPINF2 and SERPING1, and of the central complement protein C3. Granulin precursor protein was shown to be downregulated in all three patient groups (PD, iRBD and OND) compared to HC. The OND and PD groups had a shared and upregulated expression of the proteins PTGDS, CST3, VCAM1 and PLD3. Detailed information about the diagnoses of the OND group can be found in Table  1 , and detailed information about the proteins can be found in Supplementary Table  2 . Figure  4 shows the significantly different proteins as Box-scatter plots.

figure 4

The data are displayed as Box and Whisker plots overlaid with scatter plots of the individual measurements. The whiskers show the minimum and maximum, and the boxes show the 25th percentile, the median and the 75th percentile. The protein expression difference between the groups was compared using Mann–Whitney’s two-sided U -test with Benjamini–Hochberg multiple testing correction (FDR adjustment at 5%). ns not significant, * p  < 0.05, ** p  < 0.01, *** p  < 0.001 and **** p  < 0.0001. The proteins are represented by gene names. Source data are provided as a Source Data file.
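The Benjamini–Hochberg adjustment named in the caption above can be sketched as follows. This is a plain-Python illustration of the standard step-up procedure, not the authors' code; the function name `benjamini_hochberg` and the example p values are hypothetical, and the raw p values would in practice come from the per-protein Mann–Whitney U-tests.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order.

    Each raw p value is scaled by m / rank (rank 1 = smallest p), then a
    running minimum is taken from the largest rank downwards so that the
    adjusted values stay monotone and never exceed 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p values for four proteins (illustration only).
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adj]  # FDR controlled at 5%
```

Proteins whose adjusted value falls below 0.05 would be reported as differentially expressed at a 5% false discovery rate.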

The biological significance of the differentially expressed proteins- Targeted proteomic validation phase (phase I)

The involvement of the differentially expressed proteins and their impact on biological processes were evaluated using pathway analysis (Ingenuity Pathway Analysis [IPA], Qiagen). The significantly differentially expressed proteins between PD and HC were used as input, with a fold-change set as the expression observation. We considered pathways as significant if they had an enrichment p value <0.05 and included at least two of the input proteins. Three major pathway clusters were identified and consisted of (i) the expression of serine protease inhibitors or serpins and complement and coagulation components, (ii) endoplasmic reticulum (ER) stress/heat shock-related proteins and (iii) the expression of VCAM1, SELE and PPP3CB. The highest enrichment scores were identified in the pathways acute phase response signalling (p = 7.8E−10), coagulation system (p = 7.4E−6), complement system (p = 8.1E−6), LXR/RXR activation (p = 9.1E−6), FXR/RXR activation (p = 9.8E−6) and glucocorticoid receptor signalling (p = 2.0E−5). These are all pathways involved in inflammatory responses. We also identified pathways related to the unfolded protein response (p = 0.004) and neuroinflammation (p = 0.04), although with lower enrichment scores. For details, see Supplementary Fig. 1.
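To make the notion of an enrichment p value concrete, the sketch below computes a right-tailed hypergeometric probability, a common way to score pathway over-representation. It is an assumption that this resembles what the commercial software does internally; the function names and all counts are hypothetical, and only the two selection criteria (p < 0.05, at least two input proteins) come from the text.

```python
from math import comb

def enrichment_p_value(population, pathway_size, input_size, overlap):
    """Right-tailed hypergeometric P(X >= overlap).

    population   : number of proteins in the reference set
    pathway_size : reference proteins annotated to the pathway
    input_size   : differentially expressed proteins used as input
    overlap      : input proteins that belong to the pathway
    """
    if not (0 <= pathway_size <= population and 0 <= input_size <= population):
        raise ValueError("sizes must lie between 0 and the population size")
    if not 0 <= overlap <= min(pathway_size, input_size):
        raise ValueError("overlap is larger than the pathway or the input")
    total = comb(population, input_size)
    upper = min(pathway_size, input_size)
    tail = sum(
        comb(pathway_size, i) * comb(population - pathway_size, input_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total

def pathway_is_significant(p_value, overlap, alpha=0.05, min_overlap=2):
    """Apply both criteria from the text: p < alpha and >= min_overlap proteins."""
    return p_value < alpha and overlap >= min_overlap

# Hypothetical counts for illustration only (not study data).
p = enrichment_p_value(population=20, pathway_size=5, input_size=4, overlap=2)
```

With these toy counts the overlap is unremarkable, so the pathway would not pass the significance filter even though it contains two input proteins.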

Inflammation-related pathways (including both the complement system and the acute phase response) demonstrated the highest significance levels, followed by pathways regulating protein folding, ER stress, and heat shock proteins. A network representation of proteins and pathways showed clusters consisting of inflammation/coagulation/lipid metabolism (FXR/RXR and LXR/RXR), heat shock proteins/protein misfolding, and more heterogeneous pathway clusters related to Wnt-signalling and extracellular matrix proteins. Figure 5 illustrates the potential detrimental and protective mechanisms suggested to be taking place based on the protein expressions observed in this study, leading to oligomerisation and accumulation of α-synuclein in neuronal Lewy body inclusions and, finally, dopaminergic neuronal cell loss.

figure 5

Oligomerisation and accumulation of α-synuclein in Lewy body inclusions is a key process in the pathophysiology of neuronal synuclein disease, i.e. Parkinson’s disease and dementia with Lewy bodies. Starting from aggregation and accumulation, the pathological pathway includes different steps that finally lead to the loss of dopaminergic neurons. Protective and detrimental mechanisms influence these processes, based on the differently expressed protein profiles, assessed by targeted mass spectrometry in our study. Detailed information about the proteins can be found in Supplementary Table 2.

Multivariate analysis shows differences between the proteomes of Parkinson’s disease and controls- Targeted proteomic validation phase (phase I)

Principal component analysis (PCA) demonstrated that the HC and PD groups formed two clusters separate from each other over the first and second principal components (PC), attributed with 23.5% and 13.9% of the model’s total variance, respectively. The iRBD group was situated in the middle of HC and PD, and the OND group varied considerably with no evident clustering, as expected due to the heterogeneity of diseases. The corresponding loadings of PC1 and PC2 demonstrated that those with PD correlated with lower levels of PPP3CB, DKK3, SELE and GRN, and higher levels of most of the other proteins. The loadings plot had a high level of covariation in the expression of the PPP3CB, DKK3 and SELE proteins, which were all downregulated in PD. These proteins correlated negatively with the expression of SERPINs, complement C3 and HPX, which all showed a high degree of covariation, and were upregulated in the PD group. Data are displayed in Supplementary Fig.  2 .

The use of multiplexed protein panels of protein biomarkers for the prediction of de novo Parkinson’s disease- Targeted proteomic validation phase (phase I)

We next applied machine learning to construct a discriminant OPLS-DA model using the PD and HC samples from the validation phase. The samples clustered into two distinct and well-separated classes, and evaluation of the model showed that it was highly significant (p = 2.3E−27; permutation p < 0.001). The proteins with the greatest influence on the class separations were GRN, DKK3, C3, SERPINA3, HPX, SERPINF2, CAPN2, SERPING1 and SELE. We predicted the iRBD samples in the model, which resulted in 13 subjects classified as PD (72%) and five not belonging to either group. None of the iRBD samples were classified as controls. We additionally predicted the OND samples, out of which nine were classified as HC, 12 as PD and 19 were not classified as belonging to either group. The 12 samples predicted as PD did not demonstrate enrichment according to the OND groups. The random distribution of the OND samples between PD and HC indicates that the heterogeneous group of OND individuals does not share a distinct protein expression with either the HC or PD groups. That the iRBD samples were classified as PD, and not as HC, strongly suggests a shared proteomic profile between iRBD and the protein expression observed in the newly diagnosed PD patients.

We subsequently explored if the observed protein expressions could be used to build a regression model capable of predicting whether individuals belonged to the PD or HC groups. We identified a panel of proteins that discriminated between PD and HC with 100% accuracy and then constructed a linear support vector classification model and applied recursive feature elimination to pinpoint the most discriminating variables. The data were divided into two parts: one consisting of 70% for model training and one containing 30% for testing. The proportion of PD and control samples was maintained in each part. The number of features included in the model was determined by feature ranking with cross-validated recursive feature elimination in the training dataset. The feature selection resulted in a model with eight predictors: GRN, MASP2, HSPA5, PTGDS, ICAM1, C3, DKK3 and SERPING1. The training data were predicted in the model and resulted in all samples being classified in the correct class. We further constructed receiver operating characteristic (ROC) and precision-recall (PR) curves to illustrate the ability of each protein to distinguish between PD and HC and compared this with the ability of the combined multiplexed protein panel. The combined panel achieved an AUC of 1.0 on both ROC and PR curves. The AUC of the individual predictors ranged from 0.53 to 0.92 in the ROC curve, and from 0.79 to 0.96 in the PR curve (Fig.  6 ). We further evaluated the whole dataset by performing repeated cross-validation with six splits of the data and 40 repetitions. The resulting classification metrics (Supplementary Fig.  3 ) demonstrated average and standard deviation for precision, recall, F1 score, and balanced accuracy score of 0.87 ± 0.09, 0.87 ± 0.08, 0.86 ± 0.09 and 0.82 ± 0.12, respectively, thereby indicating a highly robust classification model. 
Testing the model’s specificity for PD, we predicted the heterogeneous group of OND, resulting in 26 of the 42 samples being classified as PD-like. Prediction of the prodromal iRBD group resulted in 17 of 18 samples being classified as PD-like. We compared the prediction of the OND and iRBD samples between the OPLS-DA and SVM models, finding that most of the samples were classified in the same group in both models (out of the samples with a classification in the OPLS-DA model: 82% in OND and 100% in iRBD). The proportion of iRBD samples classified as PD in our models (72% in the OPLS-DA model and 94% in the SVM model) is in line with clinical evidence based on longitudinal cohort studies, reporting that over 80% of iRBD subjects will develop an advanced NSD with motor impairment and/or cognitive decline 18 . We evaluated the influence of age and sex on the proteins included in the support vector model and found that neither influenced the model’s classification ability (see Supplementary Methods  2 for details).
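The cross-validation metrics reported above (precision, recall, F1 score and balanced accuracy, each summarised as mean ± standard deviation over six splits and 40 repetitions, i.e. 240 held-out folds) can be illustrated with a minimal sketch. The function names, the toy labels and the use of the sample standard deviation are assumptions made for illustration; this is not the authors' pipeline or data.

```python
import statistics


def classification_metrics(y_true, y_pred, positive="PD"):
    """Compute precision, recall, F1 and balanced accuracy for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Balanced accuracy is the mean of sensitivity and specificity, which
    # keeps the larger class (here PD) from dominating the score.
    balanced_accuracy = (recall + specificity) / 2
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }


def mean_sd(fold_scores):
    """Summarise one metric across cross-validation folds as (mean, sample SD)."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# Hypothetical labels for a single held-out fold (not study data).
toy_true = ["PD", "PD", "PD", "PD", "HC", "HC"]
toy_pred = ["PD", "PD", "PD", "HC", "HC", "PD"]
toy_metrics = classification_metrics(toy_true, toy_pred)
```

In a repeated cross-validation, `classification_metrics` would be evaluated once per held-out fold and `mean_sd` applied to the resulting list for each metric.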

figure 6

The model was trained on 70% of the samples to establish the most discriminating features. Applying cross-validated recursive feature elimination, the top predictors were determined as a granulin precursor, mannan-binding lectin-serine peptidase 2, endoplasmic reticulum chaperone-BiP, prostaglandin-H2 d -isomerase, intercellular adhesion molecule-1, complement C3, dickkopf-3 and plasma protease C1 inhibitor. The remaining 30% of samples were predicted in the model and resulted in 100% prediction accuracy. Receiver operating characteristics (ROC) and precision-recall (PR) curves of the individual and combined proteins in the test set demonstrated that the individual proteins achieved ROC area under the curve (AUC) values 0.53–0.92 and PR values 0.79–0.96, while the combined predictors reached an area under the curve = 1.0. Source data are provided as a Source Data file.
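As a reading aid for the ROC AUC values quoted in this legend, the sketch below computes the area under the ROC curve as the probability that a randomly chosen PD sample scores higher than a randomly chosen HC sample (the Mann-Whitney formulation). The function name and the toy scores are hypothetical, and the sketch assumes that a higher score means "more PD-like"; it is not the code used to generate Fig. 6.

```python
def roc_auc(scores, labels, positive="PD"):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for s, lab in zip(scores, labels) if lab == positive]
    neg = [s for s, lab in zip(scores, labels) if lab != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute an AUC")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive sample ranked above negative sample
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical decision scores (not study data).
toy_labels = ["PD", "PD", "HC", "HC"]
perfect_auc = roc_auc([0.9, 0.8, 0.4, 0.3], toy_labels)
chance_auc = roc_auc([0.9, 0.3, 0.8, 0.4], toy_labels)
```

An AUC of 1.0 means that every PD sample is scored above every HC sample, whereas 0.5 corresponds to chance-level ranking.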

Development of a rapid and refined LC-MS/MS method and evaluation of an independent and longitudinal iRBD cohort (Independent replication cohort-phase II)

To evaluate the results from the initial prediction models focusing on at-risk subjects, we developed and refined our targeted and multiplexed proteomic test to quantitate only those proteins that were readily and reliably detectable from the initial targeted proteomic assay ( n  = 32). Next, we analysed an additional set of 146 longitudinal samples from an independent cohort of 54 individuals with iRBD. This cohort was available from continuing recruitment at the same centre and consisted of longitudinally followed iRBD subjects. Deep phenotyping revealed 100% (54/54) had RBD on PSG, 88.9% (48/54) had hyposmia as identified with the Sniffin’ Stick Identification Test, and 91.7 % (22/24) had neuronal α-synuclein positivity as shown by α-synuclein Seed Amplification Assay (SAA) in cerebrospinal fluid (CSF) 19 . Longitudinal follow-up was available for up to 10 years, during which 16 subjects (20%) phenoconverted to either PD ( n  = 11) or dementia with Lewy bodies (DLB; n  = 5). Since only serum samples were available from the independent replication cohort (further details can be found in Supplementary Table  3 ), we investigated how the proteins in our assay correlated between plasma, serum, and CSF and found good correlations between plasma and serum, but poor correlations between these blood matrices and CSF. The limited correlations between blood and CSF proteins correspond to those of other studies comparing the protein expression between plasma/serum and CSF 20 , 21 and underscore that our test does not necessarily reflect a prodromal and PD-specific proteomic signature of the protein expression in the CSF in proximity to the brain, but rather shows an earlier change in the blood protein expression between healthy status and very early PD patients (Details from this comparison can be found in Supplementary Methods  1 and Supplementary Fig.  4 ).

We applied all available longitudinal iRBD samples ( n  = 146) from phase II to the two machine-learning models (OPLS-DA and support vector machine) constructed in phase I (PD vs. HC). The OPLS-DA model, based on all 32 detected proteins, identified 70% of the iRBD samples as PD, while the SVM model, which was based on a panel of eight proteins, identified 79% of the samples as PD. As mentioned above, at the time of analysis, 16 of the 54 subjects in our longitudinal iRBD validation cohort had developed PD/DLB. The earliest correct classification was 7.3 years prior to phenoconversion and the latest was 0.9 years prior to diagnosis (average 3.5 ± 2.4 years). Detailed information can be found in Fig.  7 and Supplementary Methods  3 .

figure 7

146 new serum samples from individuals diagnosed with iRBD, several with longitudinal follow-up samples, were predicted in the OPLS-DA model. 70% of the samples were predicted as Parkinson’s disease (PD), and 23 of 40 individuals had all their longitudinal samples predicted as PD. In the more refined support vector machine (SVM) model, 79% of the 146 new samples were predicted as PD and 27 of 40 individuals consistently had all their longitudinal samples predicted as PD. Source data are provided as a Source Data file.

The correlation between differentially expressed protein biomarkers and patients’ clinical data in the targeted proteomic validation phase (phase I)

We next evaluated the relationship between proteins and clinical data by correlating the protein expression in PD and HC (from phase I) with clinical scores (Mini-Mental State Examination [MMSE], Hoehn & Yahr stage [H&Y] and UPDRS [Unified Parkinson’s Disease Rating Scale; I–III, and total score]). We found negative correlations for GRN, DKK3, PPP3CB, and SELE with H&Y and UPDRS parts II, III, and total score, possibly indicating a connection between a more severe clinical (especially motor) impairment and lower expression of markers in the Wnt-signalling pathways (DKK3 and PPP3CB). Higher Cystatin C plasma levels correlated with higher numbers in UPDRS part III (motor performance) and UPDRS total score. The same was found for PTGDS plasma levels, which were also negatively correlated with MMSE. The central complement cascade protein, C3, negatively correlated with MMSE, and positively correlated with H&Y, UPDRS part III, and total score. The UPR-regulating protein BiP (HSPA5) correlated negatively with MMSE, and positively with H&Y and UPDRS parts II, III, and total score. The ERAD-associated proteins, HSPA1L and adiponectin, were positively correlated with H&Y, and UPDRS parts II, III, and total score. SERPINs (SERPINA3, SERPINF2 and SERPING1) and hemopexin (HPX) correlated negatively with MMSE and positively with H&Y and UPDRS parts II, III, and total score. In general, the MMSE score was inversely correlated with H&Y stage and UPDRS scores. For detailed information, see Fig.  8 and Table  2 .

figure 8

The correlation was performed using Spearman’s procedure, and the clustering method was set to average. The clustering metric was Euclidean. The heatmap is coloured by correlation coefficient where red represents positive and blue negative correlations. The proteins are represented by gene names. Detailed information about the protein correlations can be found in Supplementary Table  3 . De novo Parkinson’s disease ( n  = 99) and healthy controls ( n  = 36). MMSE Mini-Mental State Examination, UPDRS Unified Parkinson’s Disease Rating Scale. Source data are provided as a Source Data file.
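Spearman's procedure, named in this legend, is the Pearson correlation computed on ranks. A minimal sketch follows, assuming tied values receive the average of their ranks; the function names and the example values are hypothetical and do not come from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical protein levels and clinical scores (not study data).
rho_up = spearman([1.2, 2.5, 3.1, 4.8], [10, 20, 30, 40])
rho_down = spearman([1.2, 2.5, 3.1, 4.8], [40, 30, 20, 10])
```

Because only ranks are used, the coefficient is insensitive to monotonic transformations of either variable, which suits ordinal clinical scales such as H&Y.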

Comparison of clinical outcomes and measurements in the longitudinal iRBD cohort-Independent replication cohort-phase II

The longitudinal expression in the iRBD samples was evaluated using linear mixed-effects models. Conditional growth models with random slopes and random intercepts between the individuals were constructed. After adjusting the p values for multiple testing by applying the Benjamini–Hochberg (BH) procedure with alpha = 0.05, we found that Butyrylcholinesterase (BCHE) was significantly decreased over the timepoints in the iRBD individuals ( p  = 0.01). We next focused only on the iRBD samples with at least two timepoints and for which PD had consistently been predicted in the SVM model ( n  = 90). This produced comparable results to the initial model with BCHE significantly related with time since baseline ( p  = 0.01), but also TUBA4A was nominally significantly increased ( p  = 0.04) although not passing the BH FDR threshold. The modelling also demonstrated that the clinical measurements H&Y ( p  = 0.02), UPDRS I–III ( p  = 0.02), and UPDRS I and III ( p  = 0.03 and 0.03, respectively), were significantly related to the time since baseline in the iRBD group post multiple testing correction. PD non-motor symptoms, as measured on the PD NMS sum score, were strongly correlated with longitudinal motor progression ( p  = 5E −8 ). Similarly, the questionnaire for quality of life PDQ-39’s mean values also correlated with longitudinal motor progression ( p  = 0.005). From available routine blood values, cholesterol was associated with longitudinal timepoints ( p  = 0.02). Details can be found in Supplementary Table  4 . Correlating the clinical measurements with the targeted proteomic data, we applied Spearman’s correlation and found that cholesterol was positively correlated with six of the identified proteins (Supplementary Table  5 ), including HSPA8, APOE and MASP2 ( p  = 5E −9 , 0.0003 and 0.003, respectively). 
Also significantly correlated, but to a lesser degree and not passing the BH FDR threshold, were the PD NMS sum which correlated negatively with TUBA4A (p unadjusted = 0.01) and the PDQ-39 mean values, which correlated negatively with CST3 and PTGDS ( p unadjusted = 0.03 and 0.05, respectively).
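The Benjamini–Hochberg (BH) correction applied above can be sketched as follows. This is a minimal illustration that returns BH-adjusted p values, which are then compared with alpha = 0.05; the function name and the input p values are hypothetical and are not the values from the Supplementary Tables.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the original order of the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest, enforcing monotonicity:
    # an adjusted value may never exceed the one at the next higher rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p values for four proteins (not study data).
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
passes_fdr = [p <= 0.05 for p in adj_p]
```

A result described as nominally significant but not passing the BH threshold corresponds to a raw p value below 0.05 whose adjusted value exceeds 0.05.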

PD has emerged as the world’s fastest-growing neurodegenerative disorder and currently affects close to 10 million people worldwide. Consequently, there is an urgent need for disease-modifying and prevention strategies 22 , 23 . The development of such strategies is hampered by two limitations: there are major gaps in our understanding of the earliest events in the molecular pathophysiology of PD, and we lack reliable and objective biomarkers and tests in easily accessible bio-fluids. We, therefore, need biomarkers that can identify PD earlier, preferably a significant time before an individual develops significant neuronal loss and disabling motor and/or cognitive disease. Such biomarkers would advance population-based screenings to identify individuals at risk and who could be included in upcoming prevention trials.

In recent years, CSF SAA has emerged as the most specific indicator for NSD in prodromal stages like iRBD, with an impressively high sensitivity and specificity of up to 74 and 93%, respectively, across various cohorts 9 , 24 . Despite the many questions surrounding SAA that need to be answered, including the ultimate understanding of its functionality, it is a true milestone for advancing prevention trials. It is, however, hampered by having only been shown to be robust in CSF and by the slow development and high variability of SAA in peripheral blood 25 , as well as by the lack of quantification capabilities. An easier and more accessible biofluid test would enable screening large population-based cohorts for at-risk status to develop an NSD. Therefore, the identification of additional biomarkers is needed, as is further knowledge of the biomarkers and pathways of the underlying pathophysiology (e.g. inflammation) during the earliest stage of NSD.

Other emerging multiplex technologies are increasingly used to identify individual proteomic biomarkers. However, these techniques are not true proteomic or ‘eyes open’ methods, as they rely on large panels of selected antibody-based or other (e.g. aptamer-based) assay technologies. These techniques, although useful, have not provided consistent results 3 , 26 . Proteomics using mass spectrometry measures all expressed proteins in an unbiased fashion, as opposed to only those selectively included in a panel, which is also subject to variability due to cross-reactivity. Therefore, proteomic screening using mass spectrometry-based techniques is much more likely to identify pathways or biomarkers and provides more meaningful insights into the disease mechanisms involved in PD. We found a discrepancy between the detected markers during the discovery and the targeted phases. This is a known phenomenon in biomarker translation 27 that is also reflected in the low number of biomarkers having received FDA approval 28 . We addressed this by using previously reported successful improvement strategies in proteomic approaches, namely by refining our panel, reducing the number of markers, and increasing the sample size 29 . Furthermore, the validation of potential biomarkers was performed on a second and different type of mass spectrometer (triple quadrupole), which has the advantage of being available in all large hospitals.

Targeted MS has previously been applied in PD, including by the current authors, but the biological fluid used in the majority of studies is CSF 30 and not peripheral fluids such as blood. Here we demonstrate that targeted proteomics is feasible even with a very low required volume of plasma/serum (10 µl).

The targeted proteomic assay presented here was developed from proteins identified in an unbiased discovery study, from our previous research, and from the literature. It included several inflammatory markers, Wnt-signalling members, and proteins indicative of protein misfolding. When analysing PD, OND, iRBD and HC in the targeted proteomic validation phase, we identified and confirmed 23 distinct and differentially expressed proteins between PD and HC. Our analysis moreover demonstrated that iRBD possesses a significantly different protein profile compared to HC, consisting of decreased levels of GRN and MASP2 and increased levels of the complement factor C3 and SERPINs (SERPINA3, SERPINF2 and SERPING1), thus indicating early involvement of inflammatory pathways in the initial pathophysiological steps of PD. Comparing these results to previous findings by our and other groups 8 , 31 highlights the link between these proteins and the pathways of complement activation, coagulation cascades, and Wnt-signalling.

By applying machine-learning models, we classified and separated de novo PD or control samples with 100% accuracy based on the expression of eight proteins (GRN, MASP2, HSPA5, PTGDS, ICAM1, C3, DKK3 and SERPING1).

With an independent validation, we added (a) a larger sample set and (b) longitudinal samples from the most interesting subgroup with 54 iRBD subjects and a total of 146 serum samples. We were able to validate our previous panel with a high prediction rate (79%) of these individuals as seen in PD in the targeted approach. Interestingly, the biomarker panel itself did not correlate with longitudinal expression but remained robust after the initial classification of iRBD. So far, 16 of the 54 iRBD subjects converted to PD/DLB (stage 3 NSD). Out of these samples, the SVM model predicted ten individuals with all their timepoints classified as PD, and of the 11 iRBD subjects who converted to PD/DLB, eight were identified as PD by the proteome analysis. Our panel, therefore, identified a PD-specific change in blood up to 7 years before the development of the stage 3 NSD.

The main shortcoming with many previously explored PD biomarkers is weak or no correlation with clinical progression data. So far, outcome measures in clinical trials are primarily based on motor progression, often by a clinical rating scale such as the UPDRS and/or wearable technologies. More objective biomarkers correlating with or reflecting the progression of the pathophysiology and clinical symptoms would be of the utmost importance. We, therefore, calculated correlations with clinical parameters and identified an association with multiple markers, including DKK3, PPP3CB and C3, indicating downregulation of Wnt-signalling pathways. Increased activity of the complement cascade correlated with higher scores in symptom severity (UPDRS part III and total score) and lower scores in cognitive performance (MMSE).

Protein (i.e. α-synuclein) misfolding is a well-known component of PD pathology and is believed to be the key factor behind Lewy body formation 32 . The transport of excessive amounts of misfolded proteins or increased folding cycles can induce ER stress. A cellular defence mechanism to alleviate ER stress is the unfolded protein response (UPR), which reduces ER protein influx and increases protein folding capacity 33 . The UPR is mainly activated by BiP-bound misfolded proteins 34 . The higher expression of the markers HSPA5 (UPR-regulating protein BiP) and HSPA1L in our plasma samples of early PD indicates ER stress as a significant factor in the disease process, which has previously been linked to PD in both mouse models and brain tissue studies 35 , 36 .

As mentioned by other groups and confirmed in our results, increasing evidence suggests inflammation is a specific feature in early PD. Complement activation has been associated with the formation of α-synuclein and Lewy bodies in PD and deposits of the complement factors iC3b and C9 have been found in Lewy bodies 37 . C3 is a central molecule in the complement cascade and was highly upregulated in blood in PD and in both independent iRBD sample sets analysed in this study. This upregulation in the earliest phase of motor PD (stage 3 NSD), and even in the prodromal phase (stage 2 NSD), clearly indicates inflammation as an early, if not the initial, event in PD neurodegeneration. Complement C3 levels correlated positively with indicators of motor dysfunction (H&Y stage and UPDRS), indicating a direct connection between high plasma levels of inflammatory proteins and motor symptoms, and negatively with cognitive performance, here measured with the MMSE.

The protein mannan-binding lectin serine peptidase 2 (MASP2), an initiator of the lectin part of the complement cascade, was significantly downregulated in PD and iRBD. MASP1 and MASP2 proteins are inhibited by plasma protease C1 inhibitor SERPING1 in the lectin pathway, with SERPING1 modulating the complement cascade as it belongs to the SERPIN family of acute phase proteins 38 . In experimental PD mouse models, increased SERPING1 levels are associated with dopaminergic cell death 39 . Acting as a serine/cysteine proteinase inhibitor, SERPING1 can increase serine levels, which could also affect αSyn phosphorylation. This can play a crucial role in PD pathology, as almost 90% of αSyn in Lewy bodies is phosphorylated on Serine129 40 , 41 . We identified increased SERPING1 plasma levels in both PD and iRBD in our analysis (compared to HC), thus contributing to conditions with increased αSyn phosphorylation, consecutive aggregation, Lewy body formation, and finally degeneration of dopaminergic neurons. Furthermore, we observed a strong correlation of SERPING1 plasma levels with UPDRS II, III and total score, as a direct measure of dopaminergic cell loss 39 .

Alpha-2-antiplasmin (SERPINF2) was also significantly upregulated in PD and iRBD. SERPINF2 is a major regulator of the clotting pathway, acting as an inhibitor of plasmin, a serine protease formed upon the proteolytic cleavage of its precursor, plasminogen, by tissue-type plasminogen activator (t-PA) or by the urokinase-type plasminogen activator (u-PA). Plasmin has been reported to cleave and degrade extracellular and aggregated αSyn 42 . Recently, we showed that activation of the plasminogen/plasmin system is decreased in PD, indicated by decreased plasma levels of uPA and its corresponding receptor uPAR, while t-PA was associated with faster disease progression 8 . The upregulation of SERPINF2 observed here is another indicator of decreased plasmin activity. Alpha-1-antichymotrypsin (SERPINA3), a third member of the SERPIN family, was also upregulated in the PD subjects. In the CNS, the primary source of SERPINA3 is astrocytes, where its expression is upregulated by various inflammatory receptor complexes 38 .

Overall, independent upregulation of these three members of the SERPIN (SERPING1, SERPINF2, SERPINA3) family is also indicative of increased inflammatory activity, combined with less activation of the plasmin system, and correlation with motor and non-motor symptom severity. In addition, a strong downregulation of progranulin ( GRN ) was detected, indicating a potential loss of neuroprotection and increased susceptibility to neuroinflammation. GRN may act as a neurotrophic factor, promoting neuronal survival and modulating lysosomal function. Loss-of-function mutations in the GRN gene are a cause of frontotemporal dementia and familial DLB. GRN gene variants are also known to increase the risk of developing Alzheimer’s disease (AD) and PD 43 . The main characteristics of neurodegeneration related to GRN are TDP43(-Transactive response DNA binding protein 43) inclusions, but Lewy body pathology is also very common. Loss of progranulin has further been linked to increased production of pro-inflammatory species such as tumour necrosis factor (TNF) and IL-6 in microglia 15 . A study in mice showed that Grn -/- mice had elevated levels of complement proteins, including C3, even before the onset of neurodegeneration 44 . Additionally, previous studies have found GRN downregulated in serum samples of advanced PD compared to AD and healthy individuals 45 .

As a possible compensatory reaction to the described increased inflammatory markers, the levels of Prostaglandin-H2 D-isomerase (PTGDS)/Prostaglandin-D2 synthase (PGDS2), better known as β-trace protein, were upregulated. PTGDS is an important brain enzyme producing prostaglandin D2 (PGD2), which has a neuroprotective and anti-inflammatory function. The upregulation reported here could be a reaction to the amount of neuronal cell loss, which is also seen in the significant correlation with the clinical motor and cognitive scales (see below). Furthermore, β-trace protein is a marker for CSF and is used to identify the fluid in clinical routine diagnostics, thus helping detect CSF leakage 46 . Increased plasma levels could be indicative of a disrupted blood–brain barrier (BBB), often discussed in PD pathology 47 and demonstrated in our cohorts.

Our study shows that the Wnt-related proteins DKK3 and PPP3CB are strongly downregulated in de novo PD. DKK3 is an activator of the canonical Wnt/β-catenin branch and PPP3CB is a component of the non-canonical Wnt/Ca2+ signalling pathway. Wnts are secreted, cysteine-rich glycoproteins that act as ligands to locally stimulate receptor-mediated signal transduction of the Wnt-pathway 48 . Wnt-signalling is crucial for the development and maintenance of dopaminergic neurons 49 , shows protective effects on midbrain dopaminergic neurons 50 , and seems to be involved in the maintenance of the BBB 48 , 51 . Wnt-ligands and agonists trigger a “Wnt-On” stage, characterised by neuronal plasticity and protection, while the opposite “Wnt-Off” stage, potentially leading to neurodegeneration, is triggered by the phosphorylation activity of glycogen synthase kinase-3β (GSK-3β) 50 , 52 . Wnt-inhibitors are separated into secreted Frizzled-related proteins (sFRP) and Dickkopf proteins (DKK). DKK1, DKK2 and DKK4 act as antagonists, while DKK3 is an agonist and activator 53 . Adult neurogenesis is primarily governed by canonical Wnt/β-catenin signalling 54 and downregulation of Wnt-signalling promotes dysfunction and/or death of dopaminergic neurons. Restoration of dopaminergic neurons was shown in mice where β-catenin was activated in situ 52 , and neural stem cells transplanted to the substantia nigra of medically PD-induced mice induced re-expression of Wnt1 and repaired dopaminergic neurons 55 . DKK3 and PPP3CB were strongly downregulated in de novo PD, removing an important line of defence against the detrimental loss of dopaminergic neurons. The downregulation of the Wnt-signalling pathways was further correlated with higher motor scores (UPDRS and H&Y stages).

Wnt-signalling in PD is not only promising as a potential biomarker. In oncology, drugs can modify Wnt-pathways, which is of interest to the PD field 56 . Some substances show no BBB-permeability. As a disrupted BBB seems to be apparent in PD, these drugs may be effective. Furthermore, these substances are also relevant for PD treatment: research points towards a peripheral starting point of PD and future therapies should be administered as early as possible 57 . These promising substances include DKK- as well as GSK inhibitors, but to date, no drugs targeting the Wnt-signalling pathways have been effectively tested in clinical trials, including in those with neurodegenerative diseases. Progress and clinical trials are urgently needed here.

The transfer of multi-omics analysis to clinically meaningful results that directly impact future drug trial planning and biomarker validation, depends fundamentally on correlating these results and altered pathway regulations with established clinical scores. The markers we analysed in our targeted mass spectrometry panel did not only show different expression patterns between HC, PD, and in both of our independent iRBD sample sets, but most of the markers also robustly correlated with important clinical scores (UPDRS and MMSE, see Table  1 ). Cognitive decline correlated negatively with the SERPINs and complement factor C3. The burden of motor and non-motor symptoms and overall symptom severity rated by UPDRS and its subscores correlated positively with the SERPINs, Complement C3, and negatively with DKK3, GRN, and SELE. So, increased inflammatory activity and downregulation of Wnt-signalling seem to strongly affect the clinical picture of PD subjects.

The iRBD subjects showed decreased levels of BCHE over time compared to controls. BCHE has been reported as decreased in serum samples of PD with cognitive impairment 58 . Validation of this easily assessable marker in serum is needed to evaluate its predictive potential.

While we did not find significant differences when we compared paired serum and plasma samples, the marker concentrations in paired samples of plasma/serum and CSF correlated only weakly between these peripheral and central compartments. This discrepancy has been reported by several groups 20 , 21 . One reason is that mass spectrometry-based proteome analysis is always biased towards quantification and detection of the most abundant proteins in each sample matrix, and the total protein concentrations in human plasma/serum are more than two orders of magnitude higher than that in CSF. Further, the regulatory function of the blood–brain barrier seems to play a different role for different proteins, as some, like C-reactive protein, show a strong correlation between CSF and plasma, but most of the proteins do not. CSF and blood proteome show complex dynamics influenced by multiple and still mostly unknown factors. The protein shift in samples with a known BBB dysfunction (determined by the CSF/serum albumin index or the CSF/plasma ratio) cannot be determined for individual proteins, nor can the dysfunction be localised by mass spectrometry 20 .

Our model could not correctly predict phenoconversion in all cases. The reasons for this can be varied: the proteome pattern changes over time, and the period between sampling and phenoconversion may play a role. The three PD phenoconverters that were not predicted as PD differ neither clinically nor demographically from the other phenoconverters or from the non-phenoconverters. iRBD diagnosis in our study was confirmed by vPSG, supported by a high percentage of additional measurements including hyposmia and CSF SAA positivity. Therefore, even those iRBD cases that do not show the PD-proteome pattern still have a high-risk constellation of converting to PD/DLB on three different levels (PSG, olfaction, and SAA). Continuing longitudinal follow-up of these subjects will improve our understanding of when and potentially why conversion occurs or does not occur. It is known that around 80% of iRBD subjects develop NSD, i.e. PD/DLB, with a rate of 6% per year, as shown in a multicenter cohort including ours 59 . To a lesser extent, iRBD subjects develop the intracytoplasmic glial α-synuclein aggregation disorder multiple system atrophy (MSA) 59 , 60 . Although RBD is common in MSA (summary prevalence of 73% 61 ), none of our iRBD subjects have as yet converted to MSA. Recruiting and following large longitudinal at-risk cohorts is, therefore, very important, and future studies will not only identify biomarkers for phenoconversion from stage 1 or 2 to eventually stage 3 NSD or MSA, but also identify the many possible factors of resilience (including genetics, etc.) underlying non-conversion, which will be as important as, if not more important than, identifying indicators for phenoconversion. Progression biomarkers in both directions from stage 1 and 2 cohorts will have tremendous implications for future neuroprevention trials, as phenoconversion itself is (due to the low annual rate) unlikely to be an outcome measure.

A significant strength of our biomarker discovery to translation pipeline is that it allows for the developed test to be easily validated and translated to any clinical laboratory equipped with a tandem LC-MS instrument. One advantage of using triple quadrupole platforms is that additional and better biomarkers can easily be augmented into the test described in this manuscript. Thus, any test could be refined and optimised over time with very little modification to the assay as additional biomarkers are discovered. Clinical testing for neurological disorders is limited to the use of a selected few well-characterised individual markers and translating biomarkers to eventual clinical application is notoriously challenging. The power of using multiplexed biomarker technologies with machine learning enables biomarkers to be evaluated in context with other markers of pathological events, thereby creating a ‘disease profile’ as opposed to individual markers. This approach opens the biomarker discovery field for many disorders and increases the specificity and sensitivity of testing, as demonstrated in this study. The combination of multiplexed analysis of biomarker panels analysed on triple quadrupole platforms can advance biomarker translation to clinical application; this mass spectral technology is already embedded in many clinical diagnostics labs for routine small molecule analyses.

Our peripheral blood protein pattern for PD helps not only to classify but also to predict the earliest stage of the disease. We find differentially expressed proteins in pre-motor iRBD and early motor stages of the disease compared to HC. Multiple markers also correlated with the progression of motor and non-motor symptoms. Thus, our blood panel can also identify subjects at risk (stage 2) of developing PD up to 7 years before they advance to motor stage 3. Next steps will be independent validation in other (and even earlier) non-motor cohorts, e.g. in subjects with hyposmia who are also at risk for PD 62 and in our population-based Healthy Brain Ageing cohort in Kassel 63 . It would further be interesting to evaluate the predictive potential of these identified markers with continuing clinical follow-up and together with other established PD progression markers like serum neurofilament light chain 5 and dopamine transporter imaging in a longitudinal analysis.

Our work was predominantly focused on the similarities between PD and iRBD. The authors are unaware of any study that has analysed longitudinally collected samples and prodromal cohorts, including iRBD and phenoconverters. Future work would include (i) validation of our findings in independent cohorts consisting of iRBD and other at-risk subjects for the synuclein aggregation disorders in neurons (PD, DLB) and oligodendrocytes (MSA), (ii) refinement of the panels of biomarkers developed in this study, including sensitivity and technical performance, and (iii) identification and validation, using the pipeline described in this manuscript, of additional biomarkers that could distinguish between the different clinical syndromes, with the ultimate goal of identifying progression biomarkers as outcome measures for prevention trials.

In summary, instead of evaluating single biomarkers in a univariate approach, we have created a pipeline using a targeted proteomic test of a multiplexed panel of proteins, together with machine learning. This powerful combination of multiple well-selected biomarkers with state-of-the-art machine-learning bioinformatics allowed us to use a panel of eight biomarkers that could distinguish early PD from HC. This biomarker panel provided a distinct signature of protective and detrimental mechanisms, finally triggering oxidative stress and neuroinflammation, leading to α-synuclein aggregation and LB formation. Moreover, this signature was already present in the prodromal non-motor stage (stage 2 NSD), up to 7 years before the development of motor/cognitive symptoms (stage 3), supporting the high specificity of iRBD and its high conversion rate to PD/DLB 18 . Most importantly, this blood panel can, in the future and upon further validation, help identify subjects at risk of developing PD/DLB and stratify them for upcoming prevention trials.

Patient cohorts and sample collection and processing

Our research complies with all relevant ethical regulations. Institutional review board statements were obtained from the University Medical Centre in Goettingen, Germany, Approval No. 9/7/04 and 36/7/02. The study was conducted according to the Declaration of Helsinki, and all participants gave written informed consent. All plasma, serum and CSF samples from subjects were selected from known cohorts using identical sample processing protocols designed by the Movement Disorder Center Paracelsus-Elena-Clinic.

Patients with de novo PD were diagnosed according to the UK Brain Bank Criteria, without PD-specific medication. Diagnosis in all subjects was supported by (1) a positive acute levodopa challenge test (i.e. >30% improvement of UPDRS III after 250 mg of levodopa) 64 in all PD subjects, (2) hyposmia by smell identification test (Sniffin Sticks 65 ) in all PD subjects and (3) 1.5-tesla Magnetic Resonance Imaging (MRI) without significant abnormalities or evidence for other diseases in all but three subjects, who were excluded from the analysis (due to significant vascular lesions or evidence for hydrocephalus). Participants not fulfilling the above criteria and meeting criteria for other neurological disorders were classified as other neurological disorders (OND). OND consists of subjects with vascular parkinsonism (n = 10), essential tremor (n = 7), progressive supranuclear palsy (PSP; n = 7), multiple system atrophy (MSA; n = 3), corticobasal syndrome (CBS; n = 2), DLB (n = 2), drug-induced tremor (n = 2), dystonic tremor (n = 2), restless legs syndrome (n = 1), hemifacial spasm (n = 1), motoneuron disease (n = 1), amyotrophic shoulder neuralgia (n = 1), and Alzheimer’s disease (n = 1). The initial exploratory cohort consisted of ten PD subjects (8 men, mean age 67.1, SD ± 10.6) and ten healthy controls (5 men, mean age 65.7, SD ± 8.6; for details, see Supplementary Table 3). The validation cohort included 99 PD subjects (49 men, mean age 66.1, SD ± 10.8), 36 healthy controls (20 men, mean age 63.7, SD ± 6.5) and the 41 OND subjects described above (29 men, mean age 70, SD ± 8.9; for details, see Supplementary Table 1). The prodromal validation cohort consisted of 54 patients with iRBD (27 men, mean age 67.5, SD ± 8.1; for details, see Supplementary Table 4). RBD was diagnosed with two nights of state-of-the-art vPSG. 
Samples from HC were selected from the DeNoPa cohort 10 and matched for age and sex with the PD patients. HC had to be between 40 and 85 years old, without any active known/treated CNS condition, and with a negative family history of idiopathic PD. Use of antipsychotic drugs was an exclusion criterion. The provided data for sex are based on self-report.

The paired sample analysis of CSF, plasma and serum was applied in samples from subjects with OND (7 men, mean age 74 years, SD ± 7; diagnosis: four Alzheimer’s disease, three vascular parkinsonism, one essential tremor, one multiple system atrophy, one progressive supranuclear palsy).

Clinical assessments included the UPDRS subscores (parts I–III), the sum (UPDRS total score), and cognitive screening using the MMSE 10 .

Plasma and serum samples for both cohorts were collected in the morning under fasting conditions by venipuncture, using Monovette tubes (Sarstedt, Nümbrecht, Germany) for EDTA plasma and serum collection. Tubes were centrifuged at 2500× g at room temperature (20 °C) for 10 min and aliquoted and frozen within 30 min of collection at −80 °C until analysis 10 , 66 . Single-use aliquots were used for all analyses presented here. For further details, we refer to the following publication 67 .

CSF was collected in polypropylene tubes (Sarstedt, Nümbrecht, Germany) directly after the plasma collection by lumbar puncture in the sitting position. Tubes were centrifuged at 2500× g at room temperature (20 °C) for 10 min and aliquoted and frozen within 30 min after collection at −80 °C until analysis. Before centrifugation, white and red blood cell counts in CSF were determined manually 10 , 66 . CSF β-amyloid 1–42, total tau protein (t-tau), phosphorylated tau protein (p-tau181) and neurofilament light chains (NFL) concentrations were measured by board-certified laboratory technicians, who were blinded to clinical data, using commercially available INNOTEST ELISA kits for the tau and Aβ markers (Fujirebio Europe, Ghent, Belgium) and the UmanDiagnostics NF-light® assay (UmanDiagnostics, Umeå, Sweden) for NFL. Total protein and albumin levels were measured by nephelometry (Dade Behring/Siemens Healthcare Diagnostics) 66 .

For the α-synuclein seeding aggregation assay (αSyn-SAA), the CSF samples were blindly analyzed in triplicate (40 μL/well) in a reaction mixture consisting of 0.3 mg/mL recombinant α-Syn (Amprion [California, USA]; catalogue number S2020), 100 mM piperazine-N,N′-bis(2-ethanesulfonic acid) (PIPES) pH 6.50, 500 mM sodium chloride, 10 μM thioflavin T, and one bovine serum albumin (BSA)-blocked 2.4-mm silicon nitride G3 bead (Tsubaki-Nakashima [Georgia, USA]). Beads were blocked in 1% BSA 100 mM PIPES pH 6.50 and washed with 100 mM PIPES pH 6.50. The assay was performed in 96-well plates (Costar [New York, USA], catalogue number 3916) using a FLUOstar Omega fluorometer (BMG [Ortenberg, Germany]). Plates were orbitally shaken (800 rpm for 1 min every 29 min at 37 °C). Results from the triplicates were used as input for a three-output probabilistic algorithm that labelled each sample as “positive,” “negative,” or “inconclusive”. The algorithm was based on the following parameters: maximum fluorescence (Fmax), time to reach 50% Fmax (T50), slope, and the coefficient of determination for the fit, which were calculated for each replicate using a sigmoidal equation available in Mars data analysis software (BMG). The time to reach the 5000 relative fluorescence units (RFU) threshold (TTT) was calculated with a user-defined equation in Mars 19 .
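
The kinetic read-outs named above can be illustrated with a minimal sketch. The sigmoidal equation and the user-defined TTT equation used in Mars are not given in the text, so the version below is an assumption: it reads Fmax directly from the trace and finds T50 and TTT by linear interpolation between adjacent readings. The function names and the example trace are hypothetical.

```python
from typing import Optional, Sequence


def time_to_level(times: Sequence[float], rfu: Sequence[float], level: float) -> Optional[float]:
    """Return the first time at which the fluorescence trace reaches `level`.

    Linear interpolation between adjacent readings is an assumption made for
    this sketch; the original analysis used fitted curves in Mars.
    Returns None if the level is never reached.
    """
    if rfu[0] >= level:
        return times[0]
    for i in range(1, len(rfu)):
        if rfu[i - 1] < level <= rfu[i]:
            frac = (level - rfu[i - 1]) / (rfu[i] - rfu[i - 1])
            return times[i - 1] + frac * (times[i] - times[i - 1])
    return None


def summarise_replicate(times: Sequence[float], rfu: Sequence[float], threshold: float = 5000.0) -> dict:
    """Collect Fmax, T50 and TTT for one replicate well."""
    fmax = max(rfu)
    return {
        "Fmax": fmax,
        "T50": time_to_level(times, rfu, 0.5 * fmax),  # time to 50% of Fmax
        "TTT": time_to_level(times, rfu, threshold),   # time to the RFU threshold
    }


# Hypothetical trace: time in hours, fluorescence in RFU.
times_h = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
rfu = [1000.0, 1000.0, 3000.0, 7000.0, 9000.0, 9000.0]
summary = summarise_replicate(times_h, rfu)
```

A replicate that never crosses the threshold yields `TTT = None`, which a downstream classifier could treat as a non-seeding well; how the published algorithm combines the three replicates is not specified here.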

Discovery plasma proteomics (phase 0)

In the mass spectrometry-based proteomic discovery analysis of plasma, we depleted the control and de novo PD samples of the twelve most abundant plasma proteins using Pierce Top12 columns (Thermo Fisher Scientific, Waltham, MA, USA) according to the manufacturer’s instructions. The depleted samples were freeze-dried before the addition of 20 µL of lysis buffer (100 mM Tris pH 7.8, 6 M urea, 2 M thiourea, and 2% ASB-14). The samples were shaken on an orbital shaker for 60 min at 1500 rpm. To break disulphide bonds, 45 µg DTE was added, and the samples were incubated for 60 min. To prevent disulphide bonds from reforming, 108 µg IAA was added, and the samples were incubated for 45 min protected from light. About 165 µL MilliQ water was added to dilute the urea, and 1 µg trypsin gold (Promega, Mannheim, Germany) was added before 16 h of incubation at +37 °C to digest the proteins into peptides. To purify the peptides, solid phase extraction was performed using 100 mg C18 cartridges (Biotage, Uppsala, Sweden). The cartridges were washed with two 1 mL aliquots of 60% ACN, 0.1% TFA before equilibration with two 1 mL aliquots of 0.1% TFA. The concentration of TFA in the samples was adjusted to 0.1%. The samples were loaded, and the flow-through was captured and re-applied. Salts were washed away from the bound peptides with two 1 mL aliquots of 0.1% TFA. The peptides were eluted with two 250 µL aliquots of 60% ACN, 0.1% TFA. Solvents were evaporated using a vacuum concentrator. The samples were re-suspended in 50 µL 3% ACN, 0.1% FA prior to analysis. About 4 µL was injected into a 2D-NanoAcquity liquid chromatography system (Waters, Manchester, UK). All samples were fractionated online into ten fractions over 12 h. The mobile phase in the first chromatographic system consisted of A1: 10 mM ammonium hydroxide titrated to pH 9 and B1: acetonitrile. 
The second chromatographic system’s mobile phase was A2: 5% dimethylsulfoxide (DMSO) + 0.1% formic acid, B2: acetonitrile with 5% DMSO + 0.1% formic acid. 2D-liquid chromatography fractionation was performed by loading the sample onto a 300 µm × 50 mm, 5 µm Peptide BEH C18 column (Waters). The peptides were eluted from the first column at a flow rate of 2 µL/min. The initial condition of the gradient elution was 3% B, held over 0.5 min before linearly increasing the proportion of organic solvent B, fraction per fraction, over 0.5 min. The conditions thereafter remained static for 4 min before returning to the initial conditions over 0.5 min and equilibration for 10 min prior to the next elution. The eluted peptides from the first-dimensional column were loaded into a 180 µm × 20 mm, 5 µm Symmetry C18 trap column (Waters) before entering the analytical column, a 75 µm × 150 mm, 1.7 µm Peptide BEH C18 (Waters). The column temperature was +45 °C. The gradient elution applied to the analytical column started at 3% B and was linearly increased to 40% B over 40 min, after which it was increased to 85% B over 2 min and washed for 2 min before returning to initial conditions over 2 min, followed by equilibration for 15 min before the subsequent injection. The eluted peptides were detected using a Synapt G2-Si (Waters) equipped with a nano-electrospray ion source. Data were acquired in positive MSE mode from 0 to 60 min within the m/z range 50−2000. The capillary voltage was set to 3 kV and the source temperature to +100 °C. The desolvation gas consisted of nitrogen with a flow rate of 50 L/h, and the desolvation temperature was set to +200 °C. The purge and desolvation gas consisted of nitrogen, operated at a flow rate of 600 mL/h and 600 L/h, respectively. The gas in the IMS cell was helium, with a flow rate of 90 mL/h. The low energy acquisition was performed by applying a constant collision energy of 4 V with a 1-s scan time. 
High energy acquisition was performed by applying a collision energy ramp, from 15 to 40 V, and the scan time was 1 s. The lock mass consisted of 500 fmol/µL [glu1]-fibrinopeptide B, continuously infused at a flow rate of 0.3 µL/min and acquired every 30 s. The doubly charged precursor ion, m/z 785.8426, was utilised for mass correction. After acquisition, data were imported to Progenesis QI for proteomics (Waters), and the individual fractions were processed before all results were merged into one experiment. The Ion Accounting workflow was utilised, with UniProt Canonical Human Proteome as a database (build 2016). The digestion enzyme was set as trypsin. Carbamidomethyl on cysteines was set as a fixed modification; deamidation of glutamine and asparagine, and oxidation of tryptophan and pyrrolidone carboxylic acid on the N-terminus were set as variable modifications. The identification tolerance was restricted to at least two fragments per peptide, three fragments per protein, and one peptide per protein. An FDR of 4% or less was accepted. The resulting identifications and intensities were exported and variables with a confidence score less than 15 and only one unique peptide were filtered out.

Targeted plasma proteomics (phase I)

The peptides included in the targeted assay were selected from several proteomic screening studies in which we analysed plasma, serum, urine, and CSF in ageing, PD and AD. The analytical method is described in ref. 17 . Furthermore, due to the suggested involvement of inflammation in neurodegenerative diseases, several known pro- and anti-inflammatory proteins identified from the literature were included in the multiplexed assay. The final panel consisted of 121 proteins (Supplementary Table  2 ), out of which a number were measured with two peptides, leading to a total of 167 unique peptides. When possible, the peptides were chosen to have an amino acid sequence length between 7 and 20. The amino acid sequences were confirmed to be unique to the proteins by using the Basic Local Alignment Search Tool (BLAST) provided by UniProt 68 . Synthetic peptide standards were purchased from GenScript (Amsterdam, Netherlands). To establish the optimal transitions, repeated injections of 1 pmol peptide standard onto a Waters Acquity ultra-performance liquid chromatography (UPLC) system coupled to a Waters Xevo-TQ-S triple quadrupole MS were performed. The most abundant precursor-to-product ion transitions and their optimal collision energies were determined manually or using Skyline 69 . Detection was performed in positive ESI mode. The capillary voltage was set to 2.8 kV, the source temperature to 150 °C, the desolvation temperature to 600 °C, and the cone gas and desolvation gas flows to 150 and 1000 L/h, respectively. The collision gas consisted of nitrogen and was set to 0.15 mL/min. The nebuliser operated at 7 bar. Two transitions were chosen per peptide, one quantifier for relative concentration determination and one qualifier for identification, giving a total of 334 analyte transitions. Cone and collision energies varied depending on the optimal settings for each peptide. 
Each peptide was measured with a minimum of 12 points per peak and a dwell time of 10 ms or more to ensure adequate data acquisition. The optimised transitions were distributed over two multiple reaction monitoring (MRM) methods, always keeping the quantifier and qualifier for each peptide in the same MRM segment. Plasma, serum, and CSF samples were depleted from albumin and IgG using Pierce Top2 cartridges (Thermo Fisher Scientific, Waltham, MA, USA) following the manufacturer’s instructions. About 150 µg whole protein yeast enolase (ENO1) was added to the cartridges as an internal standard to account for digestion efficiency. Digestion was performed as described above. Solid phase extraction was carried out on BondElute 100 mg C18 96-well plates (Agilent, Santa Clara, USA) using the same methodology as in the preparation of untargeted proteomic analyses. Quality control samples were prepared from acetone-precipitated plasma, digested and solid phase extracted. Calibration curves ranging from 0 to 1 pmol/μL were constructed in blank and matrix by spiking increasing amounts of peptides into blank and QC samples. Before analysis, the samples were reconstituted in 30 µL 3% ACN, 0.1% FA containing 0.1 μM heavy isotope labelled peptides from the following proteins (annotated by gene name): ALDOA, C3, GSTO1, RSU1 and TSP1. About 5 µL were injected. The peptides were separated and detected on an Acquity UPLC system coupled to a Xevo-TQ-S triple quadrupole mass spectrometer (Waters, Manchester, UK). Chromatographic separation of the peptides was performed using a 1 × 100 mm, 1.7 μm ACQUITY UPLC Peptide CSH C18 column (Waters).

The mobile phase consisted of A: 0.1% formic acid and B: 0.1% formic acid in acetonitrile pumped at a flow rate of 0.2 mL/min. The column temperature was set to +55 °C. The initial mobile phase composition was 3% B, which was kept static for 0.8 min before initialising the linear gradient, running for 7.6 min to 25% B, eluting most of the peptides. B was thereafter linearly increased to 80% over 0.5 min and held for 1.9 min, eluting the most apolar peptides and washing the column before returning to the initial conditions over 0.1 minutes followed by equilibration for 6 min prior to the subsequent injection. Two subsequent injections of each sample were performed, each paired with one of the two MRM acquisition methods.
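
The gradient programme above can be written out as a table of breakpoints, which makes the cycle time per injection explicit. The following is a minimal sketch; the breakpoint times are obtained by adding up the segment durations stated in the text, and the names `GRADIENT` and `percent_b` are hypothetical.

```python
# (time in min, % B) breakpoints derived from the segment durations in the text.
GRADIENT = [
    (0.0, 3.0),    # initial composition
    (0.8, 3.0),    # end of the 0.8 min static hold
    (8.4, 25.0),   # end of the 7.6 min linear gradient to 25% B
    (8.9, 80.0),   # end of the 0.5 min ramp to 80% B
    (10.8, 80.0),  # end of the 1.9 min hold (wash)
    (10.9, 3.0),   # return to initial conditions over 0.1 min
    (16.9, 3.0),   # end of the 6 min equilibration
]


def percent_b(t_min: float) -> float:
    """Linearly interpolate % B at time t_min (minutes) within one injection cycle."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time outside the gradient programme")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    raise AssertionError("unreachable: breakpoints cover the whole range")


cycle_min = GRADIENT[-1][0]         # one injection: 16.9 min
minutes_per_sample = 2 * cycle_min  # two injections, one per MRM method
```

With two injections per sample, the instrument time adds up to roughly 34 min per sample, not counting autosampler overhead.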

After acquisition, peak-picking and integration were performed using TargetLynx (version 4.1, Waters) or an in-house application ('mrmIntegrate') written in Python (version 3.8). mrmIntegrate is publicly available to download via the GitHub repository https://github.com/jchallqvist/mrmIntegrate . The application takes text files as input (.raw files are transformed into text files through the application 'MSConvert' from ProteoWizard 70 ) and applies a LOWESS filter over five points of the chromatogram. The integration method to produce areas under the curve is trapezoidal integration. The application enables retention time alignment and simultaneous integration of the same transition for all samples. Peptide peaks were identified using the blank and matrix calibration curves. The integrated peak areas were exported to Microsoft Excel, where first the ratio between quantifier and qualifier peak areas was evaluated to ensure that the correct peaks had been integrated. The digestion efficiency was evaluated by monitoring the presence of baker’s yeast ENO1 in the samples; all samples without a signal were excluded from further analysis. After the initial quality assessment, the quantifier area was divided by the area of one of the internal standards, ALDOA or GSTO1, to yield a ratio used for the determination of relative concentrations. Any compound that also showed an intensity signal in the blank samples had the blank signal subtracted from the analyte peak intensity. Pooled plasma quality control samples were additionally evaluated to assess the robustness of the run.
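
The integration and normalisation steps can be sketched as follows. This is not the mrmIntegrate code: the LOWESS smoothing and retention time alignment are omitted, the function names are hypothetical, and subtracting the blank before dividing by the internal standard is an assumption about the order of operations.

```python
from typing import Sequence


def trapezoid_area(times: Sequence[float], intensities: Sequence[float]) -> float:
    """Area under a chromatographic peak by the trapezoidal rule."""
    if len(times) != len(intensities):
        raise ValueError("times and intensities must have the same length")
    return sum(
        0.5 * (intensities[i] + intensities[i - 1]) * (times[i] - times[i - 1])
        for i in range(1, len(times))
    )


def relative_response(analyte_area: float, standard_area: float, blank_area: float = 0.0) -> float:
    """Blank-subtracted quantifier area divided by the internal standard area."""
    if standard_area <= 0:
        raise ValueError("internal standard area must be positive")
    return (analyte_area - blank_area) / standard_area


# Hypothetical traces sampled at the same time points (arbitrary units).
t = [0.0, 1.0, 2.0, 3.0, 4.0]
quantifier = [0.0, 10.0, 20.0, 10.0, 0.0]
internal_standard = [0.0, 5.0, 10.0, 5.0, 0.0]

area_q = trapezoid_area(t, quantifier)           # 40.0
area_is = trapezoid_area(t, internal_standard)   # 20.0
ratio = relative_response(area_q, area_is, blank_area=4.0)  # (40 - 4) / 20
```

The same `trapezoid_area` call applied to the qualifier trace would give the quantifier-to-qualifier ratio used for the peak identity check.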

Refined LC-MS/MS method (phase II)

The rapid and refined targeted proteomics LC-MS/MS method contained only peptides from the 31 proteins observed in the original targeted proteomics method (121 proteins). We utilised a Waters Acquity (UPLC) system coupled to a Waters Xevo-TQ-XS triple quadrupole operating in positive ESI mode. The column was an ACQUITY Premier Peptide BEH C18, 300 Å, 1.7 µm, maintained at 40 °C. The mobile phase was A: 0.1% formic acid in water, and B: 0.1% formic acid in acetonitrile. The gradient elution profile was initiated with 5% B and held for 0.25 min before linearly increasing to 40% B over 9.75 min to elute and separate the peptides. The column was washed for 1.6 min with 85% B before returning to the initial conditions and equilibrating for 0.4 min. The flow rate was 0.6 mL/min. The settings of the mass spectrometer and the peak-picking method were the same as described in the prior section. Baker’s yeast ENO1 was utilised to monitor digestion efficiency and as an internal standard.

Statistical methods

Most of the statistical analyses were performed in Python (version 3.8.5). The untargeted and targeted datasets were inspected for outliers and instrumental drift using principal component analysis (PCA) and orthogonal projection to latent variables (OPLS) in SIMCA, version 17 (Umetrics Sartorius Stedim, Umeå, Sweden). Outliers exceeding ten median deviations from each variable’s median were excluded. Instrumental drift was corrected by applying a non-parametric LOWESS filter from statsmodels (version 0.14.0), using a fraction of 0.5 of the data to estimate the LOWESS curve 71 . The data were evaluated for normal distribution using D’Agostino and Pearson’s method from SciPy (version 1.9.3) 72 . The non-normally distributed variables in the untargeted data were transformed to normality by the Box-Cox procedure using the SciPy function 'boxcox'. Significance testing between the independent groups of HC and PD/OND/iRBD individuals was performed by Student’s two-tailed t-test for the untargeted proteomic data and by Mann–Whitney’s non-parametric U-test (SciPy) for the targeted data. Due to the limited sample numbers, no multiple testing correction was performed in the untargeted data. In the targeted data, the Benjamini–Hochberg multiple testing correction procedure (statsmodels) was applied with an accepted false discovery rate of 5%. Fold-changes were calculated by dividing the mean of each affected group by the mean of the control group. Correlation analyses in the targeted data were performed by Spearman’s correlation (SciPy) and the correlation p values were adjusted variable-wise by the Benjamini–Hochberg procedure (FDR = 5%).
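
Two of the steps above, outlier exclusion and Benjamini–Hochberg correction, can be illustrated in plain Python. This is a sketch for illustration only: the analyses themselves used statsmodels and SciPy. It assumes that "median deviations" means the median absolute deviation (MAD), and all function names and example values are hypothetical.

```python
from statistics import median
from typing import List, Sequence, Tuple


def exclude_outliers(values: Sequence[float], n_dev: float = 10.0) -> List[float]:
    """Keep values within n_dev median absolute deviations of the median.

    Interpreting "median deviation" as the MAD is an assumption. If the MAD
    is zero, every value that differs from the median is dropped.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) <= n_dev * mad]


def benjamini_hochberg(p_values: Sequence[float], fdr: float = 0.05) -> Tuple[List[float], List[bool]]:
    """Return BH-adjusted p values and a significance flag per test."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted, [q <= fdr for q in adjusted]


# Hypothetical data.
kept = exclude_outliers([1.0, 1.1, 0.9, 1.2, 0.8, 50.0])
adj, significant = benjamini_hochberg([0.001, 0.2, 0.03, 0.8])
```

In this example only the first test remains significant at an FDR of 5%: the third test has an unadjusted p of 0.03 but an adjusted value of 0.06.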

We implemented a support vector classifier model to discriminate between PD and HC and to predict new samples. The data were first z-scored protein-wise and any 'not a number' values were replaced by the median. We used the 'LinearSVC' method from SciKit Learn and applied cross-validated recursive feature elimination to determine the number of variables to use in the model. The most discriminating variables for distinguishing between controls and PD were thereafter chosen by recursive feature elimination 73 . Feature selection and model training were performed on 70% of the data, partitioned using the SciKit Learn function “train_test_split”, and cross-validation was performed using a stratified k-fold with five splits. The remaining 30% of the data were then predicted by the model. PR and ROC curves were constructed from the test data for each individual predictor and for the combined predictors, using the functions precision_recall_curve and roc_curve from SciKit Learn. Linear mixed models were fitted using the R-to-Python bridge software pymer4 (version 0.8.0), where individual was set as a random effect, and the associations between the MS-measured proteins and clinical variables were evaluated for significance after Benjamini–Hochberg correction for multiple testing. Plots of the data were constructed using the Seaborn and Matplotlib packages (versions 0.12.2 and 3.6.0, respectively) 74 .
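
The preprocessing step that precedes the classifier can be sketched in plain Python for a single protein. The text does not state which standard deviation was used or whether the median was taken before or after scaling, so the sketch assumes the population standard deviation of the observed values and the median of the z-scored values. The function name and example column are hypothetical; the classifier itself (LinearSVC with recursive feature elimination) is not reproduced here.

```python
import math
from statistics import mean, median, pstdev
from typing import List, Sequence


def zscore_and_impute(column: Sequence[float]) -> List[float]:
    """Z-score one protein across samples, then fill NaN with the column median.

    Assumptions: missing values are ignored when computing the mean and the
    (population) standard deviation, and the fill value is the median of the
    z-scored observations.
    """
    observed = [v for v in column if not math.isnan(v)]
    mu = mean(observed)
    sd = pstdev(observed)
    if sd == 0:
        raise ValueError("cannot z-score a constant column")
    scaled = [math.nan if math.isnan(v) else (v - mu) / sd for v in column]
    fill = median([z for z in scaled if not math.isnan(z)])
    return [fill if math.isnan(z) else z for z in scaled]


# Hypothetical relative concentrations for one protein, one sample missing.
protein = [2.0, 4.0, math.nan, 6.0]
scaled_protein = zscore_and_impute(protein)
```

Applying the function column by column gives the protein-wise scaled matrix. To avoid information leaking from test to training samples, the scaling parameters would normally be estimated on the training partition only; the text does not say whether this was done.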

All multivariate analyses were performed in SIMCA, version 17. OPLS and OPLS-discriminant analysis (OPLS-DA) models were evaluated for significance by ANOVA p values and by permutation tests applying 1000 permutations, where p  < 0.05 and p  < 0.001 were deemed significant, respectively.

Data were analysed for pathway enrichment using IPA (QIAGEN Inc., https://digitalinsights.qiagen.com/products-overview/discovery-insights-portfolio/analysis-and-visualization/qiagen-ipa/ ). Input variables were set to proteins demonstrating a significant difference between PD individuals and HC, with fold-change as expression observation. The accepted output pathways were restricted to those with p < 0.05 and at least two included proteins. Gene Ontology (GO) annotations were extracted using DAVID Bioinformatics Resources (2021 build) 75 , 76 . Networks were built in Cytoscape 77 (version 3.8.0) by applying the “Organic layout” from yFiles 77 .

Obtaining biological materials

Patient samples can be provided to other researchers for certain projects after contact with the corresponding authors, subject to availability and approval by the team in Kassel, Germany.

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

The chromatograms from the targeted mass spectrometric data generated in this study have been deposited in the ProteomeXchange database under accession code PXD041419 and in the Panorama repository ( https://panoramaweb.org/DNP_Pub.url , https://doi.org/10.6069/p9cy-h335 ). The integrated targeted mass spectrometric data generated in this study are provided in the Supplementary Information. Source data for all data presented in graphs within the figures are provided in a source data file.  Source data are provided with this paper.

Code availability

Peak-picking and integrations were performed in TargetLynx (part of the MassLynx suite, version 4.1), or using an in-house application written in Python which can be found on GitHub ( https://github.com/jchallqvist/mrmIntegrate ). The data visualisation and statistical analyses were performed in Python (version 3.8.5) using the packages SciPy (version 1.9.3), statsmodels (version 0.14.0), SciKit Learn (version 1.1.2), Seaborn (version 13.0) and Matplotlib (version 3.6.0). The code used can be found on GitHub ( https://github.com/jchallqvist/DNP_Pub/blob/main/DNP_Code , https://doi.org/10.5281/zenodo.11130369 ).

References

Simuni, T. et al. Baseline prevalence and longitudinal evolution of non-motor symptoms in early Parkinson’s disease: the PPMI cohort. J. Neurol. Neurosurg. Psychiatry 89 , 78–88 (2018).

Michell, A. W., Lewis, S. J., Foltynie, T. & Barker, R. A. Biomarkers and Parkinson’s disease. Brain 127 , 1693–1705 (2004).

Kieburtz, K., Katz, R., McGarry, A. & Olanow, C. W. A new approach to the development of disease-modifying therapies for PD; fighting another pandemic. Mov. Disord. 36 , 59–63 (2021).

Shahnawaz, M. et al. Development of a biochemical diagnosis of Parkinson disease by detection of α-synuclein misfolded aggregates in cerebrospinal fluid. JAMA Neurol. 74 , 163–172 (2017).

Mollenhauer, B. et al. Validation of serum neurofilament light chain as a biomarker of Parkinson’s disease progression. Mov. Disord . https://doi.org/10.1002/mds.28206 (2020).

Lindestam Arlehamn, C. S. et al. α-Synuclein-specific T cell reactivity is associated with preclinical and early Parkinson’s disease. Nat. Commun. 11 , 1875 (2020).

Mollenhauer, B. et al. Baseline predictors for progression 4 years after Parkinson’s disease diagnosis in the De Novo Parkinson Cohort (DeNoPa). Mov. Disord. 34 , 67–77 (2019).

Bartl, M. et al. Blood markers of inflammation, neurodegeneration, and cardiovascular risk in early Parkinson’s disease. Mov. Disord . https://doi.org/10.1002/mds.29257 (2022).

Simuni, T. et al. A biological definition of neuronal α-synuclein disease: towards an integrated staging system for research. Lancet Neurol. 23 , 178–190 (2024).

Mollenhauer, B. et al. Nonmotor and diagnostic findings in subjects with de novo Parkinson disease of the DeNoPa cohort. Neurology 81 , 1226–1234 (2013).

Hällqvist, J. et al. A multiplexed urinary biomarker panel has potential for Alzheimer’s disease diagnosis using targeted proteomics and machine learning. Int. J. Mol. Sci . https://doi.org/10.3390/ijms241813758 (2023).

Hu, W., Ralay Ranaivo, H., Craft, J. M., Van Eldik, L. J. & Watterson, D. M. Validation of the neuroinflammation cycle as a drug discovery target using integrative chemical biology and lead compound development with an Alzheimer’s disease-related mouse model. Curr. Alzheimer Res. 2 , 197–205 (2005).

Notter, T. et al. Translational evaluation of translocator protein as a marker of neuroinflammation in schizophrenia. Mol. Psychiatry 23 , 323–334 (2018).

Jonsson, M., Gerdle, B., Ghafouri, B. & Backryd, E. The inflammatory profile of cerebrospinal fluid, plasma, and saliva from patients with severe neuropathic pain and healthy controls-a pilot study. BMC Neurosci. 22 , 6 (2021).

Chen, X. et al. Progranulin does not bind tumor necrosis factor (TNF) receptors and is not a direct regulator of TNF-dependent signaling or bioactivity in immune or neuronal cells. J. Neurosci. 33 , 9202–9213 (2013).

Captur, G. et al. Plasma proteomic signature predicts who will get persistent symptoms following SARS-CoV-2 infection. EBioMedicine 85 , 104293 (2022).

Doykov, I. et al. ‘The long tail of Covid-19’ - The detection of a prolonged inflammatory response after a SARS-CoV-2 infection in asymptomatic and mildly affected patients. F1000Res 9 , 1349 (2020).

Hu, M. T. REM sleep behavior disorder (RBD). Neurobiol. Dis. 143 , 104996 (2020).

Concha-Marambio, L. et al. Accurate detection of α-synuclein seeds in cerebrospinal fluid from isolated rapid eye movement sleep behavior disorder and patients with Parkinson’s disease in the de novo Parkinson (DeNoPa) cohort. Mov. Disord. 38 , 567–578 (2023).

Dayon, L. et al. Proteomes of paired human cerebrospinal fluid and plasma: relation to blood-brain barrier permeability in older adults. J. Proteome Res. 18 , 1162–1174 (2019).

Whelan, C. D. et al. Multiplex proteomics identifies novel CSF and plasma biomarkers of early Alzheimer’s disease. Acta Neuropathol. Commun. 7 , 169 (2019).

Bloem, B. R., Okun, M. S. & Klein, C. Parkinson’s disease. Lancet 397 , 2284–2303 (2021).

Dorsey, E. R., Sherer, T., Okun, M. S. & Bloem, B. R. The emerging evidence of the Parkinson pandemic. J. Parkinsons Dis. 8 , S3–S8 (2018).

Grossauer, A. et al. α-Synuclein seed amplification assays in the diagnosis of synucleinopathies using cerebrospinal fluid-A systematic review and meta-analysis. Mov. Disord. Clin. Pr. 10 , 737–747 (2023).

Okuzumi, A. et al. Propagative α-synuclein seeds as serum biomarkers for synucleinopathies. Nat. Med. 29 , 1448–1455 (2023).

Raffield, L. M. et al. Comparison of proteomic assessment methods in multiple cohort studies. Proteomics 20 , e1900278 (2020).

Hernández, B., Parnell, A. & Pennington, S. R. Why have so few proteomic biomarkers “survived” validation? (Sample size and independent validation considerations). Proteomics 14 , 1587–1592 (2014).

Füzéry, A. K., Levin, J., Chan, M. M. & Chan, D. W. Translation of proteomic biomarkers into FDA approved cancer diagnostics: issues and challenges. Clin. Proteom. 10 , 13 (2013).

Bader, J. M., Albrecht, V. & Mann, M. MS-based proteomics of body fluids: the end of the beginning. Mol. Cell Proteom. 22 , 100577 (2023).

Pan, C. et al. Targeted discovery and validation of plasma biomarkers of Parkinson’s disease. J. Proteome Res. 13 , 4535–4545 (2014).

Qin, X. Y., Zhang, S. P., Cao, C., Loh, Y. P. & Cheng, Y. Aberrations in peripheral inflammatory cytokine levels in Parkinson disease: a systematic review and meta-analysis. JAMA Neurol. 73 , 1316–1324 (2016).

Choi, M. L. & Gandhi, S. Crucial role of protein oligomerization in the pathogenesis of Alzheimer’s and Parkinson’s diseases. FEBS J. 285 , 3631–3644 (2018).

Walter, P. & Ron, D. The unfolded protein response: from stress pathway to homeostatic regulation. Science 334 , 1081–1086 (2011).

Bertolotti, A., Zhang, Y. H., Hendershot, L. M., Harding, H. P. & Ron, D. Dynamic interaction of BiP and ER stress transducers in the unfolded-protein response. Nat. Cell Biol. 2 , 326–332 (2000).

Colla, E. Linking the endoplasmic reticulum to Parkinson’s disease and alpha-synucleinopathy. Front. Neurosci. 13 , 560 (2019).

Mercado, G., Castillo, V., Soto, P. & Sidhu, A. ER stress and Parkinson’s disease: pathological inputs that converge into the secretory pathway. Brain Res. 1648 , 626–632 (2016).

Loeffler, D. A., Camp, D. M. & Conant, S. B. Complement activation in the Parkinson’s disease substantia nigra: an immunocytochemical study. J. Neuroinflamm 3 , 29 (2006).

Zattoni, M. et al. Serpin signatures in Prion and Alzheimer’s diseases. Mol. Neurobiol. 59 , 3778–3799 (2022).

Seo, M. H. & Yeo, S. Association of increase in Serping1 level with dopaminergic cell reduction in an MPTP-induced Parkinson’s disease mouse model. Brain Res. Bull. 162 , 67–72 (2020).

Anderson, J. P. et al. Phosphorylation of Ser-129 is the dominant pathological modification of alpha-synuclein in familial and sporadic Lewy body disease. J. Biol. Chem. 281 , 29739–29752 (2006).

Fujiwara, H. et al. alpha-Synuclein is phosphorylated in synucleinopathy lesions. Nat. Cell Biol. 4 , 160–164 (2002).

Kim, K. S. et al. Proteolytic cleavage of extracellular alpha-synuclein by plasmin implications for Parkinson disease. J. Biol. Chem. 287 , 24862–24872 (2012).

Reho, P. et al. GRN mutations are associated with Lewy body dementia. Mov. Disord. 37 , 1943–1948 (2022).

Kao, A. W., Mckay, A., Singh, P. P., Brunet, A. & Huang, E. J. Progranulin, lysosomal regulation and neurodegenerative disease. Nat. Rev. Neurosci. 18 , 325–333 (2017).

Mateo, I. et al. Reduced serum progranulin level might be associated with Parkinson’s disease risk. Eur. J. Neurol. 20 , 1571–1573 (2013).

Bachmann-Harildstad, G. Diagnostic values of beta-2 transferrin and beta-trace protein as markers for cerebrospinal fluid fistula. Rhinology 46 , 82–85 (2008).

Pediaditakis, I. et al. Modeling alpha-synuclein pathology in a human brain-chip to assess blood-brain barrier disruption. Nat. Commun. 12 , 5907 (2021).

Serafino, A., Giovannini, D., Rossi, S. & Cozzolino, M. Targeting the Wnt/β-catenin pathway in neurodegenerative diseases: recent approaches and current challenges. Expert Opin. Drug Discov. 15 , 803–822 (2020).

Arenas, E. Wnt signaling in midbrain dopaminergic neuron development and regenerative medicine for Parkinson’s disease. J. Mol. Cell Biol. 6 , 42–53 (2014).

L’Episcopo, F. et al. A Wnt1 regulated Frizzled-1/β-Catenin signaling pathway as a candidate regulatory circuit controlling mesencephalic dopaminergic neuron-astrocyte crosstalk: therapeutical relevance for neuron survival and neuroprotection. Mol. Neurodegener. 6 , 49 (2011).

Sweeney, M. D., Sagare, A. P. & Zlokovic, B. V. Blood-brain barrier breakdown in Alzheimer disease and other neurodegenerative disorders. Nat. Rev. Neurol. 14 , 133–150 (2018).

L’Episcopo, F. et al. Wnt/beta-catenin signaling is required to rescue midbrain dopaminergic progenitors and promote neurorepair in ageing mouse model of Parkinson’s disease. Stem Cells 32 , 2147–2163 (2014).

Marchetti, B. Wnt/beta-catenin signaling pathway governs a full program for dopaminergic neuron survival, neurorescue and regeneration in the MPTP mouse model of Parkinson’s disease. Int. J. Mol. Sci. 19 , 3743 (2018).

Marchetti, B. et al. Parkinson’s disease, aging and adult neurogenesis: Wnt/beta-catenin signalling as the key to unlock the mystery of endogenous brain repair. Aging Cell 19 , e1310110 (2020).

L’Episcopo, F. et al. Neural stem cell grafts promote astroglia-driven neurorestoration in the aged Parkinsonian brain via Wnt/beta-catenin signaling. Stem Cells 36 , 1179–1197 (2018).

Serafino, A. et al. Developing drugs that target the Wnt pathway: recent approaches in cancer and neurodegenerative diseases. Expert Opin. Drug Discov. 12 , 169–186 (2017).

Harms, A. S., Ferreira, S. A. & Romero-Ramos, M. Periphery and brain, innate and adaptive immunity in Parkinson’s disease. Acta Neuropathol. 141 , 527–545 (2021).

Dong, M. X. et al. Serum butyrylcholinesterase activity: a biomarker for Parkinson’s disease and related dementia. Biomed. Res. Int. 2017 , 1524107 (2017).

Postuma, R. B. et al. Risk and predictors of dementia and parkinsonism in idiopathic REM sleep behaviour disorder: a multicentre study. Brain 142 , 744–759 (2019).

Zhang, H. et al. Risk factors for phenoconversion in rapid eye movement sleep behavior disorder. Ann. Neurol. 91 , 404–416 (2022).

Palma, J. A. et al. Prevalence of REM sleep behavior disorder in multiple system atrophy: a multicenter study and meta-analysis. Clin. Auton. Res. 25 , 69–75 (2015).

Jennings, D. et al. Imaging prodromal Parkinson disease: the Parkinson Associated Risk Syndrome Study. Neurology 83 , 1739–1746 (2014).

Schade, S. et al. Identifying prodromal NMS in a population-based recruitment strategy: Kassel data of Healthy Brain Ageing. Zenodo (2023).

Schade, S. et al. Acute levodopa challenge test in patients with de novo Parkinson’s disease: data from the DeNoPa cohort. Mov. Disord. Clin. Pr. 4 , 755–762 (2017).

Hummel, T., Sekinger, B., Wolf, S. R., Pauli, E. & Kobal, G. ‘Sniffin’ sticks’: olfactory performance assessed by the combined testing of odor identification, odor discrimination and olfactory threshold. Chem. Senses 22 , 39–52 (1997).

Mollenhauer, B. et al. Monitoring of 30 marker candidates in early Parkinson disease as progression markers. Neurology 87 , 168–177 (2016).

Mollenhauer, B. et al. α-Synuclein and tau concentrations in cerebrospinal fluid of patients presenting with parkinsonism: a cohort study. Lancet Neurol. 10 , 230–240 (2011).

UniProt. BLAST https://www.uniprot.org/blast/ . Accessed January 2024.

MacLean, B. et al. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics 26 , 966–968 (2010).

Chambers, M. C. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 30 , 918–920 (2012).

Seabold, S. & Perktold, J. Statsmodels: econometric and statistical modeling with Python. In Proc. 9th Python in Science Conference (SciPy, 2010).

Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17 , 261–272 (2020).

Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).

Hunter, J. D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 9 , 90–95 (2007).

Sherman, B. T. et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 50 , W216–W221 (2022).

Huang Da, W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4 , 44–57 (2009).

Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13 , 2498–2504 (2003).

Acknowledgements

This work was supported by the Michael J Fox Foundation, PDUK, The Peto Foundation, The TMSRG (UCL), The BRC at Great Ormond Street Hospital, and the Horizon 2020 Framework Programme (Grant number 634821, PROPAG-AGING). We thank the PROPAG-AGING consortium; a full list of its members can be found in the supplementary material.

Open Access funding enabled and organized by Projekt DEAL.

Author information

These authors contributed equally: Jenny Hällqvist, Michael Bartl.

These authors jointly supervised this work: Kevin Mills, Brit Mollenhauer.

Authors and Affiliations

UCL Institute of Child Health and Great Ormond Street Hospital, London, UK

Jenny Hällqvist, Ivan Doykov, Justyna Śpiewak, Héloïse Vinette & Wendy E. Heywood

UCL Queen Square Institute of Neurology, Clinical and Movement Neurosciences, London, UK

Jenny Hällqvist & Kevin Mills

Department of Neurology, University Medical Center Goettingen, Goettingen, Germany

Michael Bartl, Mohammed Dakna, Mary Xylaki, Sandrina Weber & Brit Mollenhauer

Institute for Neuroimmunology and Multiple Sclerosis Research, University Medical Center Goettingen, Goettingen, Germany

Michael Bartl

Paracelsus-Elena-Klinik, Kassel, Germany

Sebastian Schade, Maria-Lucia Muntean, Friederike Sixel-Döring, Claudia Trenkwalder & Brit Mollenhauer

Department of Experimental, Diagnostic, and Specialty Medicine (DIMES), University of Bologna, Bologna, Italy

Paolo Garagnani, Chiara Pirazzini & Claudio Franceschi

IRCCS Istituto delle Scienze Neurologiche di Bologna, Bologna, Italy

Maria-Giulia Bacalini

National Hospital for Neurology & Neurosurgery, Queen Square, WC1N3BG, London, UK

Kailash Bhatia & Sebastian Schreglmann

Institute of Diagnostic and Interventional Neuroradiology, University Medical Center Goettingen, Goettingen, Germany

Marielle Ernst

Department of Neurology, Philipps-University, Marburg, Germany

Friederike Sixel-Döring

UCL: Food, Microbiomes and Health Institute Strategic Programme, Quadram Institute Bioscience, Norwich Research Park, Norwich, UK

Héloïse Vinette

Department of Neurosurgery, University Medical Center Goettingen, Goettingen, Germany

Claudia Trenkwalder

Contributions

J.H., M.B., K.M., and B.M. conceptualised, planned and oversaw all aspects of the study. J.H., K.M., J.S., H.V., M.B. and S. Schreglmann performed and analyzed most of the experiments. S. Schade, S.W. and M.B. obtained consent from the subjects and collected the samples. M.-L.M., F.S.-D. and S. Schade analyzed the sleep lab data and diagnosed the iRBD subjects. J.H. and M.D. performed the statistical data analysis. J.H. applied the machine learning methods and designed the figures. W.H., I.D., C.F., M.-G.B., P.G., C.P., K.B. and M.X. provided substantial contributions to the conception of the work and to the acquisition and interpretation of the data, particularly for the mass spectrometry setup and the refinement of the targeted panel. S. Schade, S.W., C.T., M.B., B.M., M.-L.M. and F.S.-D. conceptualised the clinical study, analyzed the clinical data and reevaluated the diagnosis. M.E. provided substantial contributions to the clinical data analyses, particularly the patient imaging data with regard to differential diagnosis. J.H., M.B., K.M. and B.M. wrote the manuscript with input and substantial revisions from all authors.

Corresponding authors

Correspondence to Jenny Hällqvist or Michael Bartl .

Ethics declarations

Competing interests.

JH, MD, MX, SW, KB, ME, PG, MGB, CP, KM, ID, WH, JS, HV and CF have no competing interests to report. MB has received funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 413501650. CT has received honoraria for consultancy from Roche and honoraria for educational lectures from UCB, and has received research funding for the PPMI study from the Michael J. Fox Foundation, funding from the EU (Horizon 2020) and stipends from the International Parkinson’s and Movement Disorder Society (IPMDS). BM has received honoraria for consultancy from Roche, Biogen, AbbVie, UCB, and Sun Pharma Advanced Research Company. BM is a member of the executive steering committee of the Parkinson Progression Marker Initiative and PI of the Systemic Synuclein Sampling Study of the Michael J. Fox Foundation for Parkinson’s Research and has received research funding from the Deutsche Forschungsgemeinschaft (DFG), EU (Horizon 2020), Parkinson Fonds Deutschland, Deutsche Parkinson Vereinigung, Parkinson’s Foundation and the Michael J. Fox Foundation for Parkinson’s Research. MLM has received honoraria for speaking engagements from Deutsche Parkinson Gesellschaft e.V., and royalties from Gesellschaft fur Medien + Kommunikation mbH + Co. FSD has received honoraria for speaking engagements from AbbVie, Bial, Ever Pharma and Medtronic, and royalties from Elsevier and Springer. She served on an advisory board for Zambon and Stada Pharma. FSD participated in advisory boards for AbbVie, UCB, Bial, Ono and Roche, and received honoraria for lecturing from Stada Pharm, AbbVie, Alexion and Bial. S. Schade received institutional salaries supported by the EU Horizon 2020 research and innovation programme under grant agreement No. 863664 and by the Michael J. Fox Foundation for Parkinson’s Research under grant agreement No. MJFF-021923. He is supported by a PPMI Early Stage Investigators Funding Programme fellowship of the Michael J. Fox Foundation for Parkinson’s Research under grant agreement No. MJFF-022656. S. Schreglmann received institutional salaries supported by the EU Horizon 2020 research and innovation programme under grant agreement No. 863664, support from the Advanced Clinician Scientist programme by the Interdisciplinary Centre for Clinical Research, Wuerzburg, Germany, and from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) Project-ID 424778381-TRR 295. He is a fellow of the Thiemann Foundation. He serves as a scientific adviser to Elemind Inc.

Peer review

Peer review information.

Nature Communications thanks Kanta Horie and the other, anonymous, reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information, peer review file, reporting summary, source data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Hällqvist, J., Bartl, M., Dakna, M. et al. Plasma proteomics identify biomarkers predicting Parkinson’s disease up to 7 years before symptom onset. Nat Commun 15 , 4759 (2024). https://doi.org/10.1038/s41467-024-48961-3

Received: 06 April 2023

Accepted: 20 May 2024

Published: 18 June 2024

DOI: https://doi.org/10.1038/s41467-024-48961-3

meaning of cohort study in research

COMMENTS

  1. What Is a Cohort Study?

    Cohort studies are a type of observational study that can be qualitative or quantitative in nature. They can be used to conduct both exploratory research and explanatory research depending on the research topic. In prospective cohort studies, data is collected over time to compare the occurrence of the outcome of interest in those who were ...

  2. Cohort study: What are they, examples, and types

    Nurses' Health Study. One famous example of a cohort study is the Nurses' Health Study. This was a large, long-running analysis of female health that began in 1976. It investigated the ...

  3. Cohort Study: Definition, Designs & Examples

    A prospective cohort study is a type of longitudinal research where a group of individuals sharing a common characteristic (cohort) is followed over time to observe and measure outcomes, often to investigate the effect of suspected risk factors. In a prospective study, the investigators will design the study, recruit subjects, and collect ...

  4. Cohort study

    A cohort study is a particular form of longitudinal study that samples a cohort (a group of people who share a defining characteristic, typically those who experienced a common event in a selected period, such as birth or graduation), performing a cross-section at intervals through time. It is a type of panel study where the individuals in the panel share a common characteristic.

  5. Cohort Study: Definition, Benefits & Examples

    Cohort studies are observational designs, meaning that the researchers do not manipulate experimental or environmental conditions. Instead, they collect data over time and try to understand how various factors affect the outcome. These projects can last for periods ranging from weeks to decades, depending on the research questions.

  6. Cohort Studies: Design, Analysis, and Reporting

    Abstract. Cohort studies are types of observational studies in which a cohort, or a group of individuals sharing some characteristic, are followed up over time, and outcomes are measured at one or more time points. Cohort studies can be classified as prospective or retrospective studies, and they have several advantages and disadvantages.

  7. Methodology Series Module 1: Cohort Studies

    The term "cohort" refers to a group of people who have been included in a study by an event that is based on the definition decided by the researcher. For example, a cohort of people born in Mumbai in the year 1980 will be called a "birth cohort." Another example of a cohort is people who smoke.

  8. What Is a Cohort Study?

    A cohort study often looks at 2 (or more) groups of people that have a different attribute (for example, some smoke and some don't) to try to understand how the specific attribute affects an outcome. The goal is to understand the relationship between one group's shared attribute (in this case, smoking) and its ...

  9. Overview: Cohort Study Designs

    To further understand the meaning of the risk ratio results, if the result was equal to 1, then the exposure (smoker) did not affect the outcome. In other words, the risk was the same for the exposed and unexposed groups. ... Observational research methods—Cohort studies, cross sectional studies, and case-control studies.

  10. Cohort Study

    Definition. A study design where one or more samples (called cohorts) are followed prospectively and subsequent status evaluations with respect to a disease or outcome are conducted to determine which initial participants' exposure characteristics (risk factors) are associated with it. As the study is conducted, outcome from participants in each ...

  11. Cohort Studies: Design, Analysis, and Reporting

    Design, Analysis, and Reporting. Cohort studies are types of observational studies in which a cohort, or a group of individuals sharing some characteristic, are followed up over time, and outcomes are measured at one or more time points. Cohort studies can be classified as prospective or retrospective studies, and they have several advantages ...

  12. Research Design: Cohort Studies

    Abstract. In a cohort study, a group of subjects (the cohort) is followed for a period of time; assessments are conducted at baseline, during follow-up, and at the end of follow-up. Cohort studies are, therefore, empirical, longitudinal studies based on data obtained from a sample; they are also observational and (usually) naturalistic.

  13. LibGuides: Quantitative study designs: Cohort Studies

    Cohort studies are longitudinal, observational studies, which investigate predictive risk factors and health outcomes. They differ from clinical trials, in that no intervention, treatment, or exposure is administered to the participants. The factors of interest to researchers already exist in the study group under investigation.

  14. Cohort Studies: Design, Analysis, and Reporting

    Cohort studies can be either prospective or retrospective. The type of cohort study is determined by the outcome status. If the outcome has not occurred at the start of the study, then it is a prospective study; if the outcome has already occurred, then it is a retrospective study. 4 Figure 1 presents a graphical representation of the designs of prospective and retrospective cohort studies.

  15. What Is a Prospective Cohort Study?

    A prospective cohort study is a type of observational study focused on following a group of people (called a cohort) over a period of time, collecting data on their exposure to a factor of interest. Their outcomes are then tracked, in order to investigate the association between the exposure and the outcome. Prospective cohort studies look ...

  16. What are cohort studies?

    Cohort studies are a type of longitudinal study, an approach that follows research participants over a period of time (often many years). Specifically, cohort studies recruit and follow participants who share a common characteristic, such as a particular occupation or demographic similarity. During the period of follow-up, some of the cohort ...

  17. Cohort (statistics)

    Case-control study versus cohort on a timeline. "OR" stands for "odds ratio" and "RR" stands for "relative risk". In statistics, epidemiology, marketing and demography, a cohort is a group of subjects who share a defining characteristic (typically subjects who experienced a common event in a selected time period, such as birth or graduation).

  18. Cohort Study (Retrospective, Prospective): Definition, Examples

    A Cohort study, used in the medical fields and social sciences, is an observational study used to estimate how often disease or life events happen in a certain population. "Life events" might include: incidence rate, relative risk or absolute risk. A cohort is a defined group, like "nurses," "10-19 year-olds," or "college students

  19. What are cohort studies and why are they important?

    Cohort studies can be prospective (meaning that data are collected as individual lives unfold), or retrospective (meaning that researchers look at historical data about individuals after a certain outcome, such as diagnosis with a disease, has occurred). ... A cohort study is a type of medical research study. It's different to a randomised ...

  20. An Introduction to the Fundamentals of Cohort and Case-Control Studies

    Design. In a case-control study, a number of cases and noncases (controls) are identified, and the occurrence of one or more prior exposures is compared between groups to evaluate drug-outcome associations ( Figure 1 ). A case-control study runs in reverse relative to a cohort study. 21 As such, study inception occurs when a patient ...

  21. Cohort studies investigating the effects of exposures: key principles

    Cohort studies follow a population exposed or not exposed to a potential causal agent forward in time and assess outcomes. Cohort studies are beneficial because these studies allow the ...

  22. Prospective Cohort Study Design: Definition & Examples

    A prospective observational study is a type of research where investigators select a group of subjects and then observe them over a certain period of time. The researchers collect data on the subjects' exposure to certain risk factors or interventions and then track the outcomes. This type of study is often used to study the effects of suspected risk factors that cannot be controlled ...

  23. What Is a Retrospective Cohort Study?

    A retrospective cohort study is a type of observational study that focuses on individuals who have an exposure to a disease or risk factor in common. Retrospective cohort studies analyze the health outcomes over a period of time to form connections and assess the risk of a given outcome associated with a given exposure.

  24. The Power of Cohorts

    Prospective cohort studies have informed our understanding of cancer, directing scientific inquiries in basic and clinical laboratory science, as well as translational studies and treatment trials, and led to the development of guidelines and regulatory actions to protect public health. Learn about the different types of cohort studies in the Division and the major accomplishments they have ...

  25. Metformin use correlated with lower risk of cardiometabolic diseases

    Study sample and population. This cohort study utilized a nationally representative population of cancer survivors from the National Health and Nutrition Examination Survey (NHANES) [12, 13]. This study's analysis adhered to the analytical guidelines of NHANES, which adopted a multi-stage stratified systematic sampling design.

  26. Plasma proteomics identify biomarkers predicting Parkinson's ...

    The study included three phases. Phase 0 consisted of discovery proteomics by untargeted mass spectrometry to identify putative biomarkers, followed by phase I in which targets from the discovery ...