Advanced Computational Methods Drive Large-Scale Data Analysis in 4D

December 3, 2020

Modern high-throughput proteomics experiments produce huge amounts of data that, in raw form, provide little usable information about biological processes. At this scale, manual analysis is nearly impossible, so only automated, computer-based methods allow identification of proteins and other biomolecules.

The Computational Systems Biochemistry Group at the Max Planck Institute of Biochemistry develops computational methods for the identification and quantification of the molecular components of cells, tissues, and body fluids. The group's key focus is the development of computational approaches to analyze large-scale data from mass spectrometry (MS)-based proteomics.

Over the last two decades, significant advances in technology, together with new methodologies for data analysis, have made proteomics an extremely powerful tool for protein scientists, biologists, and clinical researchers.¹ With analytical instruments constantly evolving, each technical advance in proteomics research produces more data. More data allows a wider selection of proteins to be studied, but it also creates new challenges for software, which must constantly evolve to keep up with the volume of data generated.

Software to manage large-scale research challenges

The MaxQuant software developed at the Max Planck Institute of Biochemistry is the most widely used platform in computational proteomics, enabling the analysis of large sets of MS data. Prior to its creation, researchers would manually print out and inspect the data for every peptide they wanted to identify, and quantification was also done by hand. This was time-consuming, inefficient, and impossible to scale.

MaxQuant enabled an immediate transition to high-throughput proteomics, with software rather than researchers transforming the raw data output from MS instruments. The software is used globally and is freely available to both academic and nonacademic researchers.

This software can also be used for liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) shotgun proteomics, a method for identifying proteins in complex mixtures with wide dynamic range and coverage. Shotgun proteomics, the most commonly used MS-based approach, studies proteins by digesting them into peptides ahead of MS analysis. The software provides a large ecosystem of algorithms for comprehensive data analysis.
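The digestion step that defines shotgun proteomics can be illustrated in silico. The following is a minimal sketch, not MaxQuant code. It assumes trypsin as the protease (the article does not name one) and the common rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The function name and parameters are hypothetical.

```python
def tryptic_digest(protein, missed_cleavages=0, min_length=1):
    """Split a protein sequence into peptides using a simple trypsin rule.

    Assumption: cleave C-terminal to K or R, except when followed by P.
    """
    # Collect cleavage positions, including both ends of the sequence.
    sites = [0]
    for i, residue in enumerate(protein[:-1]):
        if residue in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = []
    for start in range(len(sites) - 1):
        # Allow up to `missed_cleavages` skipped sites per peptide.
        for skipped in range(missed_cleavages + 1):
            end = start + 1 + skipped
            if end >= len(sites):
                break
            peptide = protein[sites[start]:sites[end]]
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


# Toy sequence: the R at position 8 is followed by P, so it is not cleaved.
print(tryptic_digest("ACDKEFGRPHIKLM"))
```

In a real search engine the resulting peptides would be compared against measured MS/MS spectra; this sketch covers only the sequence-splitting step.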

The addition of the ion mobility dimension in advanced MS instruments delivers greater sensitivity, selectivity, and MS/MS acquisition speeds for proteomics research. The novel design of the ion mobility device allows ions to be accumulated in the front section while ions in the rear section are released sequentially according to their ion mobility. In subsequent scans, selected precursors can be targeted for MS/MS. This process is called parallel accumulation serial fragmentation (PASEF).²
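One way to picture the serial targeting of precursors is as a scheduling problem: precursors released at different times within one ion mobility scan can be fragmented in the same scan, provided their release windows do not collide. The sketch below is a toy illustration under that assumption, using a standard greedy interval-scheduling approach. It is not the instrument's actual precursor selection logic, and the function name, the window times, and the `min_gap` switching time are hypothetical.

```python
def schedule_precursors(precursors, min_gap=0.0):
    """Pick a maximal set of precursors with non-overlapping release windows.

    precursors: list of (name, start, end) release windows within one scan,
                in arbitrary time units (hypothetical values).
    min_gap:    assumed time needed to switch between two targets.
    """
    selected = []
    last_end = float("-inf")
    # Greedy interval scheduling: always take the window that ends earliest.
    for name, start, end in sorted(precursors, key=lambda p: p[2]):
        if start >= last_end + min_gap:
            selected.append(name)
            last_end = end
    return selected


windows = [("A", 10, 14), ("B", 13, 18), ("C", 20, 25), ("D", 27, 31)]
# B overlaps A, so it would have to wait for a later scan.
print(schedule_precursors(windows, min_gap=1.0))
```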

The unique trapped ion mobility spectrometry (TIMS) design allows researchers to reproducibly measure collisional cross section (CCS) values for all detected ions. These values can be used to further increase the system's selectivity, yielding additional relative quantitation information from complex samples and short gradient analyses.

PASEF and TIMS technology have added a further dimension to proteomics data, which poses challenges for algorithm processing times and software development. The speed gained from PASEF allows more samples to be analyzed in a shorter time frame but generates vast amounts of spectral data, which is challenging for large sample cohorts. The MaxQuant shotgun proteomics workflow was adapted to extract this information, making it possible to handle 4D features in the space spanned by retention time, ion mobility, mass, and signal intensity, which advance the identification and quantification of peptides, proteins, and post-translational modifications.³
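To make the idea of a 4D feature concrete, the sketch below represents a feature by its retention time, ion mobility, mass-to-charge ratio, and intensity, and checks whether two features from different runs plausibly belong to the same analyte. It is an illustration only and does not reproduce MaxQuant's algorithms. The class name, field units, and tolerance values are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature4D:
    retention_time: float  # minutes (assumed unit)
    ion_mobility: float    # reduced inverse mobility 1/K0 (assumed unit)
    mz: float              # mass-to-charge ratio
    intensity: float       # signal intensity, arbitrary units


def ppm_difference(mz_a, mz_b):
    """Relative mass difference in parts per million."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def same_analyte(a, b, ppm_tol=10.0, rt_tol=0.5, im_tol=0.02):
    """Match two features only if mass, retention time, and ion mobility agree.

    The tolerances are hypothetical defaults for illustration.
    """
    return (
        ppm_difference(a.mz, b.mz) <= ppm_tol
        and abs(a.retention_time - b.retention_time) <= rt_tol
        and abs(a.ion_mobility - b.ion_mobility) <= im_tol
    )


reference = Feature4D(30.0, 0.950, 650.3200, 1.0e6)
close = Feature4D(30.1, 0.955, 650.3230, 8.0e5)
co_eluting = Feature4D(30.1, 1.100, 650.3230, 5.0e5)

print(same_analyte(reference, close))       # agrees in all three coordinates
print(same_analyte(reference, co_eluting))  # separated only by ion mobility
```

The third feature agrees with the reference in mass and retention time and is rejected only because of its ion mobility, which is the kind of added selectivity the extra dimension provides.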

Clinical research

The analysis of proteomics data from samples derived from patients requires special computational strategies and presents several problems that need to be addressed. These include how to extract meaningful protein expression signatures from data with high individual variability, how to integrate the genomic background of the patients into the analysis of proteomic data, and how to determine biomarkers and properly estimate their predictive power. Because clinical data analysis has direct implications for the health of individuals, these problems, and the reliability of the answers obtained, must be treated with special care.

Researchers at the Max Planck Institute use machine learning and feature selection algorithms to extract predictive protein signatures (Figures 1 & 2). Clinical research proteomics is expected to be one of the main applications of the future, and modern TIMS quadrupole time-of-flight (QTOF) instruments are expected to benefit greatly from its growth worldwide.
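The article does not specify which feature selection algorithms are used. As a generic illustration, the sketch below ranks proteins by the absolute Welch t-statistic between two patient groups, a simple univariate filter. The function names, protein labels, and abundance values are hypothetical and the data are invented for the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se if se > 0 else 0.0


def rank_proteins(abundances, labels, top_n=2):
    """Return the top_n proteins that best separate cases (1) from controls (0)."""
    scores = {}
    for protein, values in abundances.items():
        cases = [v for v, y in zip(values, labels) if y == 1]
        controls = [v for v, y in zip(values, labels) if y == 0]
        scores[protein] = abs(welch_t(cases, controls))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented log-abundance values for three proteins in six samples.
labels = [1, 1, 1, 0, 0, 0]
abundances = {
    "P1": [10.1, 10.3, 9.9, 5.0, 5.2, 4.8],
    "P2": [7.0, 7.1, 6.9, 7.0, 7.2, 6.8],
    "P3": [3.0, 3.5, 2.5, 2.0, 2.6, 1.9],
}
print(rank_proteins(abundances, labels))
```

To estimate predictive power properly, a selection step like this should be repeated inside each cross-validation fold rather than run once on the full data set. Otherwise information from the held-out samples leaks into the signature.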

Figure 1. MaxQuant temperature check software (visualization tab)

Figure 2. MaxQuant temperature check software (raw data tab)

The future: Analysis of single cells

The continual evolution of software such as MaxQuant is key to allowing scientists to work through large-scale data sets. Without technological advances such as TIMS and PASEF and the integration of different algorithms, many questions in biology could not be answered through proteomics, and researchers would be unable to address deeper questions.

For example, technological advances are expanding single-cell (SC) analysis. SC genomics and SC transcriptomics are already being implemented in many laboratories, but SC proteomics is still relatively new. It is, however, set to evolve over the coming years. It holds the potential to enable researchers to quantify the proteins in single cells directly, avoiding the need to infer protein levels from cellular mRNA levels.⁴ It also creates new challenges for computational analysis: TIMS instrumentation must deliver the necessary sensitivity, and the right software is needed to provide deeper cellular insights to scientists.

References
1. Cox J, Mann M. Quantitative, High-Resolution Proteomics for Data-Driven Systems Biology. Annu. Rev. Biochem. 2011; 80: 273–299.
2. Meier F, Brunner AD, Koch S, et al. Online Parallel Accumulation-Serial Fragmentation (PASEF) with a Novel Trapped Ion Mobility Mass Spectrometer. Mol. Cell. Proteomics 2018; 17(12): 2534–2545.
3. Prianichnikov N, Koch H, Koch S, et al. MaxQuant Software for Ion Mobility Enhanced Shotgun Proteomics. Mol. Cell. Proteomics 2020; 19(6): 1058–1069.
4. Marx V. A dream of single-cell proteomics. Nat. Methods 2019; 16: 809–812.

Jürgen Cox, PhD, is research group leader, computational systems biochemistry at Max Planck Institute of Biochemistry. Gary Kruppa, PhD, is vice president, proteomics at Bruker Daltonics.
