Keywords: metaproteomics, microbiome, bioinformatics, taxonomy, mass spectrometry
Metaproteomics research using mass spectrometry data has emerged as a powerful strategy to understand the mechanisms underlying microbiome dynamics and the interaction of microbiomes with their immediate environment. Recent advances in sample preparation, data acquisition, and bioinformatics workflows have greatly contributed to progress in this field. In 2020, the Association of Biomolecular Resource Facilities Proteome Informatics Research Group launched a collaborative study to assess the bioinformatics options available for metaproteomics research. The study was conducted in 2 phases. In the first phase, participants were provided with mass spectrometry data files, but no protein sequence database, and were asked to identify the taxonomic composition and relative taxa abundances in the samples. The most challenging question asked of the participants was to postulate the nature of any biological phenomena that may have taken place in the samples, such as interactions among taxonomic species. In the second phase, participants were provided with a protein sequence database composed of the species present in the sample and were asked to answer the same set of questions as for phase 1. In this report, we summarize the data processing methods and tools used by participants, including database searching and software tools used for taxonomic and functional analysis. This study provides insights into the status of metaproteomics bioinformatics in participating laboratories and core facilities.
ADDRESS CORRESPONDENCE TO: Pratik D. Jagtap, 321 Church Street SE, University of Minnesota, Minneapolis, Minnesota 55455 (Phone: 612-816-4232; Email: [email protected]).
ADDRESS CORRESPONDENCE TO: Susan T. Weintraub, 7703 Floyd Curl Drive, University of Texas Health Science Center, San Antonio, Texas 78229 (Phone: 210-567-4043; Email: [email protected]).
ADDRESS CORRESPONDENCE TO: Magnus Palmblad, 2300 RC Leiden, Center for Proteomics and Metabolomics, Leiden University Medical Center, Leiden, The Netherlands (Phone: +31(0)71 526 6969; Email: [email protected]).
Conflict of Interest: None of the authors or funding organizations has a conflict of interest.
All mass spectrometry datasets are available via ProteomeXchange with identifier PXD034795.
Mass spectrometry–based metaproteomics provides valuable insight into microbiome composition and function by identifying and quantifying the proteins expressed by microbiota. The field has seen steady growth in the last few years. While notable progress has been made in the optimization of sample preparation and data acquisition methods, bioinformatics analysis has remained particularly challenging. For example, metaproteomics samples often have a high level of microbial diversity, and when the host is also present, the microbial content may be proportionally very low. Once the mass spectrometry data are acquired, the spectra need to be matched against protein sequences. However, when the composition is very diverse, it is necessary to search against large protein sequence databases, which can lead to low peptide identification sensitivity and/or detection of high numbers of false positives. Moreover, the protein sequence databases need to be appropriately annotated to permit protein inference as well as functional pathway and taxonomic analyses. Researchers can address this challenge by generating sample-specific databases using matched metagenomic data. However, these tasks require substantial computational resources at both the software and hardware infrastructure levels. Mapping microbial peptides to proteins is further complicated by the fact that many of the peptides are shared by homologous proteins across taxonomic groups. The relative abundance of the taxonomic species, the dynamic range of their protein expression, and biological variability add further challenges in biological interpretation.
To address issues related to the detection of microbial peptides and proteins, iterative approaches have been used to search metaproteomics datasets against large public repository databases. Other advancements in database search strategies include database reduction by processes such as sectioning of a large protein sequence database, the use of specialized database structures and searches, and de novo search methods. More recently, the availability of matched metagenome data has made it possible to generate customized search databases for optimal search results. In peptide-centric approaches, researchers have used various software tools to detect taxonomy and function. Bioinformatics approaches have also been developed for the quantitative assessment of changes in taxonomy and function.
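One common way peptide-centric tools deal with peptides shared across taxa is to assign each peptide to the lowest common ancestor (LCA) of all organisms whose proteins contain it. The following is a minimal sketch of that idea only. The lineages are simplified, hypothetical toy inputs, and `lowest_common_ancestor` is an illustrative helper, not the interface of any tool used in this study.

```python
from typing import Dict, List, Sequence

# Hypothetical, simplified lineages (root -> species) for illustration only;
# real tools resolve lineages against a complete taxonomy.
LINEAGES: Dict[str, List[str]] = {
    "Escherichia coli": [
        "Bacteria", "Enterobacteriaceae", "Escherichia", "Escherichia coli"],
    "Salmonella enterica": [
        "Bacteria", "Enterobacteriaceae", "Salmonella", "Salmonella enterica"],
    "Bacillus subtilis": [
        "Bacteria", "Bacillaceae", "Bacillus", "Bacillus subtilis"],
}


def lowest_common_ancestor(organisms: Sequence[str]) -> str:
    """Return the deepest taxon shared by the lineages of all given organisms."""
    lineages = [LINEAGES[name] for name in organisms]
    lca = "root"
    # Walk the lineages rank by rank until they diverge.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca


# A peptide found only in E. coli keeps species-level resolution, whereas a
# peptide shared by E. coli and S. enterica is only informative at family level.
specific = lowest_common_ancestor(["Escherichia coli"])
shared = lowest_common_ancestor(["Escherichia coli", "Salmonella enterica"])
broad = lowest_common_ancestor(list(LINEAGES))
```

In this toy setting, peptides conserved across all three bacteria resolve only to "Bacteria" and therefore carry no information about relative species abundance.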
The Association of Biomolecular Resource Facilities (ABRF) Proteome Informatics Research Group (iPRG) considered it important to assess the state of metaproteomics bioinformatics analysis and decided to conduct a 2-phase research study on this topic in 2020. In addition to the standard mechanisms for announcing ABRF studies, we proactively contacted metaproteomics researchers across the world and invited them to participate by submitting their analysis of our bacteriophage infection dataset. In the first phase, minimal information about the dataset was provided, while in the second phase, we provided a search database to aid in the metaproteomics analysis. We asked participants to respond with the following information: (a) a detailed explanation of their data processing workflow, including protein sequence database(s) and strategy used for peptide-spectral matching; (b) taxonomic composition (along with details on how this was determined); and (c) any biological functions or phenomena observed in the data. See Figure 1 for an overview of the study design.
Four separate liquid cultures were prepared for each of the following bacteria: Escherichia coli strain BE, Salmonella enterica strain UB-0015, and Bacillus subtilis strain 168. Each culture was derived from a different single colony that was inoculated into 15 mL of Luria Broth (LB) (Miller) supplemented with 0.2% nutrient broth (LB+N) and incubated with shaking (150 rpm) overnight at 34 °C. The following morning, the 12 cultures were separately subcultured (1:75 dilution) into 30 mL LB+N broth and incubated at 34 °C with shaking. An additional subculture of E. coli was included to permit the monitoring of optical density. At an optical density at 600 nm (OD600) of 0.9, the 5 E. coli cultures were infected with bacteriophage T4 at a multiplicity of infection of ~3, and the infection was allowed to proceed for 20 minutes. One culture of each of the 3 bacteria was used to generate each of the 4 biological mixtures, with each mixture containing 3 mL of B. subtilis, 30 mL of S. enterica, and 30 mL of T4-infected E. coli. The cells were then rapidly harvested by centrifugation (7900 × g, 23 °C, 3 minutes) and the supernatants decanted. The cell pellets were subjected to 2 freeze–thaw cycles, and each was resuspended in 1.4 mL of buffer (50 mM Tris-Cl [pH 7.5], 50 mM NaCl, and 1 mM MgCl2) that was supplemented with 1x BugBuster reagent (MilliporeSigma) and 4 µL of Lysonase Bioprocessing Reagent (MilliporeSigma). The mixtures were incubated at 23 °C for 10 minutes with vortexing and then stored at -80 °C.
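The nominal proportions of each mixture follow directly from the culture volumes given above. The short calculation below is a sketch of that arithmetic. It treats volume fraction as a rough proxy for input, which is an assumption, because actual biomass also depends on the cell density of each culture.

```python
# Culture volumes (mL) combined in each biological mixture, as stated above.
volumes_ml = {
    "B. subtilis": 3.0,
    "S. enterica": 30.0,
    "T4-infected E. coli": 30.0,
}

total_ml = sum(volumes_ml.values())

# Assumption: volume fraction is only a rough proxy for relative input;
# cell density differences between cultures are not accounted for.
volume_percent = {name: 100.0 * v / total_ml for name, v in volumes_ml.items()}

for name, pct in volume_percent.items():
    print(f"{name}: {pct:.1f}% of mixture volume")
```

By volume, the two enteric bacteria each contribute a little under half of the mixture and B. subtilis contributes just under 5%.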
Four biological replicates of the mixture described above, containing proteins from T4 bacteriophage and its host E. coli along with the non-host species S. enterica and B. subtilis, were used for mass spectrometry (MS) analysis. After thawing, samples were mixed with a buffer containing 5% sodium dodecyl sulfate (SDS)/50 mM triethylammonium bicarbonate (TEAB) in the presence of protease and phosphatase inhibitors (Halt; Thermo Scientific) and nuclease (Pierce Universal Nuclease for Cell Lysis; Thermo Scientific). Aliquots corresponding to 100 µg protein (EZQ Protein Quantitation Kit; Thermo Scientific) were reduced with tris(2-carboxyethyl)phosphine hydrochloride, alkylated in the dark with iodoacetamide, and applied to S-Traps (mini; Protifi) for tryptic digestion (sequencing grade; Promega) in 50 mM TEAB. Peptides were eluted from the S-Traps with 0.2% formic acid in 50% aqueous acetonitrile and quantified using Pierce Quantitative Fluorometric Peptide Assay (Thermo Scientific). A 1-µg sample of each digest was analyzed by capillary LC-MS/MS on a Thermo Scientific Orbitrap Fusion Lumos mass spectrometer. On-line high-performance liquid chromatography (HPLC) separation was accomplished with a Thermo Scientific/Dionex RSLC NANO HPLC system: column, PicoFrit (New Objective; 75 μm i.d.) packed to 15 cm with C18 adsorbent (Vydac; 218MS 5 μm, 300 Å); mobile phase A, 0.5% acetic acid (HAc)/0.005% trifluoroacetic acid (TFA); mobile phase B, 90% acetonitrile/0.5% HAc/0.005% TFA; gradient, 3 to 42% B in 30 minutes; flow rate, 0.4 μL/minute. Precursor ions were acquired in the Orbitrap in centroid mode (scan range, m/z 300-1500; resolution, 120000); data-dependent, higher-energy, collision-induced dissociation spectra of ions in the precursor scan were acquired at the same time in the ion trap ("top speed"; threshold to trigger MS2, 50000; quadrupole isolation, 0.7; charge states, 2+ to 5+; dynamic exclusion, 30 seconds; normalized collision energy, 30%).
The “iPRG-2020 Proteome Informatics Research Group Study on Metaproteomics” was first announced in April 2020 via the iPRG website and social media (Google Forms and Twitter). In addition, potential participants were contacted via email. In this phase, no information was provided about the composition of the samples (such as the number of species present or the domains that were represented). Raw LC-MS/MS data files that had been acquired on a Thermo Fisher Scientific Orbitrap Fusion Lumos instrument were made available to participants along with specifics about data acquisition. Participants were asked to provide the following:
1.1. A detailed explanation of the steps used for the analysis, including why the specific sequence databases were selected, how they were assembled, how spectra were matched/assigned, and how the taxa were identified.
1.2. A list (as a text file or table) of the taxa identified in the sample.
1.3. The relative abundance of the different species (based on metrics such as numbers of peptide-spectrum matches [PSMs], distinct peptides, or proteins).
1.4. A description of any biologically interesting phenomena observed (such as biological pathways, functional groups, or proteins).
The results were to be submitted via email to the study anonymizer by the end of November 2020. The anonymizer did not share the identities of the participants with anyone, including other members of the iPRG.
In the second phase, participants were given some clues about the composition of the sample and were asked to use a provided FAST-all text-based format (FASTA) protein sequence database to answer a set of additional questions. The sequence database contained all species present in the sample and 3 additional species (Citrobacter freundii, Clostridium butyricum, and Salmonella bongori), each at a different evolutionary distance from its closest relative among the species in the study sample and, consequently, with a different degree of overlap in shared peptide sequences. Participants were not told which of the 7 species should have been detectable in the sample. In this second phase, the participants were asked to provide the following:
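The overlap of peptide sequences between closely and distantly related species can be illustrated with an in silico tryptic digest. The sketch below is a simplified illustration under stated assumptions: the three sequences are invented toy proteins (not taken from the study database), cleavage follows the simple "after K or R, not before P" rule with no missed cleavages, and the helper names are hypothetical.

```python
import re
from typing import Dict, Set


def tryptic_peptides(protein: str, min_len: int = 6) -> Set[str]:
    """Cleave after K or R unless followed by P; keep peptides >= min_len."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {p for p in pieces if len(p) >= min_len}


def jaccard_overlap(a: Set[str], b: Set[str]) -> float:
    """Fraction of the combined peptide set that is shared by both digests."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def unique_peptides(digests: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Peptides found in exactly one species, which can discriminate taxa."""
    unique = {}
    for name, peps in digests.items():
        others = set().union(*(p for n, p in digests.items() if n != name))
        unique[name] = peps - others
    return unique


# Invented toy sequences: species A and B differ by one residue, C is unrelated.
toy_proteins = {
    "species_A": "MSTLEAKVDELGNRAIQWFEK",
    "species_B": "MSTLEAKVDELGNRAIQWYEK",
    "species_C": "MTPLNGKSVEHADR",
}
digests = {name: tryptic_peptides(seq) for name, seq in toy_proteins.items()}

close_overlap = jaccard_overlap(digests["species_A"], digests["species_B"])
distant_overlap = jaccard_overlap(digests["species_A"], digests["species_C"])
discriminating = unique_peptides(digests)
```

In this toy case the two closely related sequences share half of their combined peptides, so only the single differing peptide in each can distinguish them, while the unrelated sequence shares none.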
2.1. A detailed description of how you performed the analysis after being provided with the FASTA sequence databases covering the species present in the sample.
2.2. A list of the species or taxa identified in the sample, along with metrics of their relative abundances (including number of PSMs, distinct peptides or proteins, and/or their confidences).
2.3. A description of any biologically interesting phenomena you can observe (such as biological pathways, functional groups, or proteins).
Participants were also asked the following 3 “bonus” questions:
2.4. Did you find chimeric tandem mass spectra in the dataset? If so, how did you report the PSM, and how did you decide between the multiple options?
2.5. There are public datasets for some of the species that are in these samples. Comparing the proteins identified in these samples with those in the public resources, what can you infer about the physiology/state of the organisms in the study samples?
2.6. For peptides corresponding to important taxonomy or functions in a metaproteomics study, how would you validate them?
The results were to be submitted via email to the study anonymizer by the end of January 2021.
The data files were subsequently deposited in the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD034795 and DOI 10.6019/PXD034795.
Submissions were received from 9 participants in the first phase and 8 participants in the second phase; 7 groups participated in both phases. Please see https://osf.io/pze9x for Phase 1 submissions and https://osf.io/w6msx for Phase 2 submissions.
No information was initially provided about sample complexity or taxonomic composition. As such, participants used a wide variety of search engines (Table 1), protein sequence databases (Table 2), and informatics tools for taxonomy and functional analysis (Table 3), as shown below.
Table 1. Search engines used by participants (column: number of participants).
Table 2. Protein sequence databases used by participants (columns: number of sequences, number of participants; surviving entry: Wellcome Sanger Institute MetaHIT 3.3).
Table 3. Informatics tools used by participants (column: number of participants).
Spectral counts and relative biomass estimation
The sample was made up of approximately 48% T4 bacteriophage-infected E. coli, 48% S. enterica, and 4% B. subtilis. A correct ranking of the taxa in order of abundance should, therefore, have E. coli and S. enterica above B. subtilis. Only 2 groups (Figure 2A and Supplementary Table 1) reported the correct species abundance order in the samples. In Phase 1, participants used various metrics to estimate taxonomic abundance, including the number of PSMs, number of peptides, number of proteins, peptide intensities, relative abundance of detected taxa, or the relative cell biomass associated with taxonomic units. Of the 9 participants in Phase 1, all detected E. coli in the sample, and all but one found evidence for S. enterica. Only 5 participants identified B. subtilis in the sample, while 6 participants reported the presence of Enterobacteria phage T4.
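The rank-order criterion described above can be expressed as a simple check on any per-taxon abundance metric. The sketch below uses PSM counts as that metric. The counts are hypothetical, chosen only to illustrate the calculation, and the function names are illustrative rather than part of any participant's workflow.

```python
from typing import Dict


def relative_abundance(psm_counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-taxon PSM counts to fractions of all assigned PSMs."""
    total = sum(psm_counts.values())
    if total == 0:
        raise ValueError("no PSMs to normalize")
    return {taxon: n / total for taxon, n in psm_counts.items()}


def rank_order_is_correct(abundance: Dict[str, float]) -> bool:
    """True if E. coli and S. enterica both rank above B. subtilis."""
    needed = ("Escherichia coli", "Salmonella enterica", "Bacillus subtilis")
    if any(taxon not in abundance for taxon in needed):
        return False  # a missing taxon cannot be ranked
    minor = abundance["Bacillus subtilis"]
    return (abundance["Escherichia coli"] > minor
            and abundance["Salmonella enterica"] > minor)


# Hypothetical PSM counts for illustration only (not study results).
example_counts = {
    "Escherichia coli": 5200,
    "Salmonella enterica": 4700,
    "Bacillus subtilis": 600,
    "Enterobacteria phage T4": 900,
}
example_abundance = relative_abundance(example_counts)
example_ok = rank_order_is_correct(example_abundance)
```

Because E. coli and S. enterica share many peptides, the outcome of such a check depends on how shared PSMs are assigned, which this sketch does not address.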
Phase 2 participants were provided with a database containing the protein sequences of the 4 organisms in the samples plus sequences for 3 related organisms that were not present in the samples (see Supplementary Figure 1). In Phase 2, all 8 participants reported the presence of E. coli, S. enterica, B. subtilis, and Enterobacteria phage T4 in the samples. Four of the groups (Figure 2B) determined the correct order of abundance. One participant (20200626_0845) specifically reported that there was insufficient mass spectrometry evidence for any of the 3 additional species, which had been included in the provided database as decoys but were not actually present in the sample.
For functional analysis in Phase 1, participants reported Gene Ontology terms (taxon specific and nonspecific), heatmap analyses, and Sankey diagrams. Since information was not initially provided about the species present in the samples, we expected to receive a range of observations and conclusions about the dataset. One participant (20200626_0845) submitted a comprehensive functional interpretation in which detected proteins were matched to Kyoto Encyclopedia of Genes and Genomes Orthology (KO) entries, and then KO entries with normalized spectral matches were used for principal component analysis (PCA). PCA separated E. coli and S. enterica, with 2 primary differences being that (a) OmpC (receptor for T4 bacteriophage) was more abundant in S. enterica, and (b) OmpF (receptor for T2 bacteriophage) was more abundant in E. coli (Figure 3). Based on this observation (along with the detection of T4 bacteriophage), the participants hypothesized that the interaction of bacteriophage T4 with the OmpC of E. coli impacted the expression of OmpC. Their logic was that E. coli might overproduce OmpF to compensate for the functional loss of OmpC. Since the effect on OmpC expression was specific to E. coli, the participants further speculated that bacteriophage T4 specifically interacted with E. coli.
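The normalization step that precedes such a comparison can be sketched as follows. This is not the participant's workflow: instead of PCA, the example uses a plain log2 ratio of normalized spectral counts between the two taxa as a simpler stand-in. The spectral counts are hypothetical, the protein names stand in for KO entries, and the pseudocount is an assumption introduced to avoid division by zero.

```python
import math
from typing import Dict


def normalize(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale spectral counts so that they sum to 1 within one taxon."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no spectral counts to normalize")
    return {entry: c / total for entry, c in counts.items()}


def log2_ratio(a: Dict[str, float], b: Dict[str, float],
               pseudo: float = 1e-6) -> Dict[str, float]:
    """log2(a/b) per entry; the pseudocount is an assumed guard against zeros."""
    entries = set(a) | set(b)
    return {e: math.log2((a.get(e, 0.0) + pseudo) / (b.get(e, 0.0) + pseudo))
            for e in entries}


# Hypothetical spectral counts per functional entry (not study results).
ecoli_counts = {"OmpC": 20, "OmpF": 120, "other": 860}
senterica_counts = {"OmpC": 90, "OmpF": 30, "other": 880}

ratios = log2_ratio(normalize(ecoli_counts), normalize(senterica_counts))
# Positive values: relatively more abundant in E. coli; negative: in S. enterica.
```

With these invented counts, OmpF has a positive ratio and OmpC a negative one, mirroring the direction of the difference the participant reported.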
Other reported conclusions for Phase 1 of the study were the following: (a) the samples were generated by infection of E. coli and Salmonella with T4 bacteriophage (20200625_1441); (b) the T4 bacteriophage selectively infected E. coli (20200623_1230); (c) the samples contained E. coli and S. enterica (20200731_1646); (d) speculation that the samples were obtained from human gut following foodborne salmonellosis (20200626_1542); and (e) organism-specific functions were detected (20200721_1054), but the participant did not report anything about taxonomy.
For the second phase, the participant who had correctly ascertained that bacteriophage T4 had specifically interacted with E. coli (20200626_0845) reasserted their conclusion. Additionally, 1 participant detected elevated E. coli ribosomal proteins, an indication of bacteriophage T4 infection (20210128_1356). Among the other submissions were the following (not all of which are correct): (a) the samples were generated from a laboratory mixture of Bacillus, Escherichia, and Salmonella (20200625_2338); (b) the samples were from an anaerobic fermenter (20200626_1856); and (c) the samples represented a bacterial mixture cultivated in aerobiosis or with insufficient de-aeration for C. butyricum (20200731_1630).
While providing additional information about the component organisms resulted in more accurate taxonomy detection among participants in the second phase as compared to the first phase, this did not translate into improved functional analysis or biological interpretation.
Through identification and relative quantification of component proteins, metaproteomics can provide insight into how a microbiome responds to its immediate environment. However, numerous challenges remain because of the complexity of the samples, resulting in the need for optimization of sample preparation, data acquisition, and data analysis. The current study was designed to assess the bioinformatic approaches available to address the difficult task of detecting taxonomy and deducing biological interpretation from a metaproteomics dataset.
In preparation for the study, 4 biological replicates of mixtures of bacterial cells were generated, each containing approximately equal quantities of 2 related bacteria (E. coli strain BE and S. enterica strain UB-0015) and a substantially lower level of a third bacterium (B. subtilis strain 168). Since E. coli and Salmonella share many homologous proteins, it was anticipated that this would result in the identification of many peptides shared between the 2 bacteria in the downstream analysis of the mass spectral data. To introduce an additional dimension to the study, we used E. coli that had been infected with bacteriophage T4 as one of the components of the sample. We anticipated that phage infection of one of the bacteria would alter the quantity of one or more of the host proteins, thereby introducing perturbations in the levels of some shared peptides.
Our analysis of the bioinformatic approaches used by participants indicated that there is a need to educate researchers who are new to the field of metaproteomics about best practices for data processing, especially for functional data analysis. In the first phase of the study, only 2 research groups reported the correct rank order of taxonomic abundance in the samples. Interestingly, 1 of the 2 groups accurately deduced the biological implications of the dataset. Other participants concluded that there had been bacteriophage infection of E. coli. It is important to note that determining biological relevance was an especially challenging question since no metadata, metagenomic sequencing information, or protein FASTA database was provided in the first phase. In real-life scenarios, metaproteomics searches are often performed against a matched metagenome or data from a public repository, but in this study, we wanted to find out what level of information could be independently deduced from metaproteomics data.
In the second phase, participants were much more successful in identifying the taxonomic composition because the necessary protein sequence database file was provided. Four groups correctly reported the taxa and their relative abundance. For functional analysis, however, apart from the group that reported the bacteriophage–E. coli interaction in the first phase, only one more participant speculated that there was evidence for bacteriophage infection.
A valuable outcome of this iPRG 2020 study is the insight it provided into the repertoire of bioinformatics approaches in use for metaproteomics research, since a variety of software tools and data processing pipelines were used (Table 1). The study also highlighted the fact that there is a need for a wider dissemination of knowledge from expert metaproteomics laboratories about ways to process data and how to use the outputs to formulate biologically relevant conclusions.
In 2021, the Metaproteomics Initiative was formally established. The mission of this consortium of microbiome researchers is to disseminate metaproteomics fundamentals, advancements, and applications through collaborative networking in microbiome research. This type of initiative is essential to provide educational resources for researchers who are interested in metaproteomics. Additionally, it brings together researchers of diverse backgrounds, perspectives, and skillsets to achieve higher goals via collaborative efforts.
Based on this study, we feel that the metaproteomics informatics field will benefit from improvements in taxonomy detection tools for mass spectrometry–based metaproteomics datasets. While there was a marked improvement in taxonomic detections in Phase 2, when a protein database had been made available, it is clear that further improvements in taxonomic ranking and accuracy are still needed. The lack of consistent results at the taxonomic level was also evident from the CAMPI study undertaken by the Metaproteomics Initiative. We believe that there is a need for the generation of ground truth datasets both at taxonomic and functional levels to develop better algorithms and methods for taxonomic detection, protein grouping, and spectral assignment. With this in mind, the Metaproteomics Initiative has recently launched the CAMPI3 study with a focus on taxonomic and functional annotations as well as resolving protein inference and spectral assignment issues in metaproteomics research.
Although proteomics can be considered to be a generally mature discipline, metaproteomics remains one of its more challenging extensions. Nevertheless, the performance of the participants in this study demonstrated that metaproteomics is sufficiently developed to be able to answer a wide range of interesting questions about taxonomic composition. The results from this study can serve as a starting point for larger, in-depth, community efforts in metaproteomics and related fields.
Mass spectrometry analyses were conducted at the University of Texas Health Science Center at San Antonio (UTHSCSA) Institutional Mass Spectrometry Laboratory with the expert technical assistance of Sammy Pardo and Dana Molleur, supported in part by UTHSCSA and the University of Texas System Proteomics Core Network for the purchase of the Orbitrap Fusion Lumos mass spectrometer.
The identification of certain commercial equipment, instruments, software, or materials does not imply recommendation or endorsement by the National Institute of Standards and Technology or the ABRF, nor does it imply that the products identified are necessarily the best available for the purpose.
The MS data files are available in the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD034795 and DOI 10.6019/PXD034795.
The authors would like to thank all the participants who took the time to analyze and return the data for this study. Some of the participants are members of the Metaproteomics Initiative (https://metaproteomics.org/), the goals of which are to promote, improve, and standardize metaproteomics.