ProteomeGenerator: A Framework for Comprehensive Proteomics Based on de Novo Transcriptome Assembly and High-Accuracy Peptide Mass Spectral Matching
Journal contribution posted on 08.10.2018, 00:00 by Paolo Cifani, Avantika Dhabaria, Zining Chen, Akihide Yoshimi, Emily Kawaler, Omar Abdel-Wahab, John T. Poirier, Alex Kentsis
Modern mass spectrometry now permits genome-scale and quantitative measurements of biological proteomes. However, analysis of specific specimens is currently hindered by the incomplete representation of biological variability of protein sequences in canonical reference proteomes and the technical demands for their construction. Here, we report ProteomeGenerator, a framework for de novo and reference-assisted proteogenomic database construction and analysis based on sample-specific transcriptome sequencing and high-accuracy mass spectrometry proteomics. This enables the assembly of proteomes encoded by actively transcribed genes, including sample-specific protein isoforms resulting from non-canonical mRNA transcription, splicing, or editing. To improve the accuracy of protein isoform identification in non-canonical proteomes, ProteomeGenerator relies on statistical target–decoy database matching calibrated using sample-specific controls. Its current implementation includes automatic integration with MaxQuant mass spectrometry proteomics algorithms. We applied this method for the proteogenomic analysis of splicing factor SRSF2 mutant leukemia cells, demonstrating high-confidence identification of non-canonical protein isoforms arising from alternative transcriptional start sites, intron retention, and cryptic exon splicing as well as improved accuracy of genome-scale proteome discovery. Additionally, we report proteogenomic performance metrics for current state-of-the-art implementations of SEQUEST HT, MaxQuant, Byonic, and PEAKS mass spectral analysis algorithms. Finally, ProteomeGenerator is implemented as a Snakemake workflow within a Singularity container for one-step installation in diverse computing environments, thereby enabling open, scalable, and facile discovery of sample-specific, non-canonical, and neomorphic biological proteomes.
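The target-decoy matching mentioned in the abstract can be illustrated with a minimal sketch of the generic technique. This is not ProteomeGenerator's or MaxQuant's actual implementation, and it does not reproduce the sample-specific calibration the authors describe. The function names `target_decoy_fdr` and `accepted_targets`, the `(score, is_decoy)` input shape, and the 1% default threshold are assumptions chosen for illustration. The sketch assumes each peptide-spectrum match (PSM) carries a score where higher is better and a flag marking matches to decoy sequences.

```python
from typing import List, Tuple

# Hypothetical PSM representation: (score, is_decoy), higher score = better match.
PSM = Tuple[float, bool]


def target_decoy_fdr(psms: List[PSM]) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) for each PSM, sorted by descending score.

    FDR at a score threshold is estimated as decoys / targets among all PSMs
    scoring at or above that threshold. The q-value is the smallest FDR at
    which a given PSM would still be accepted.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate while walking down the ranked list.
    fdrs = []
    targets = decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # Convert to q-values: minimum FDR at this rank or any lower-scoring rank.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, q_values)]


def accepted_targets(psms: List[PSM], max_fdr: float = 0.01) -> List[float]:
    """Scores of target PSMs whose q-value is within the chosen FDR threshold."""
    return [s for s, d, q in target_decoy_fdr(psms) if not d and q <= max_fdr]
```

In practice the decoy sequences are usually reversed or shuffled versions of the target database. A sample-specific database changes the size and composition of the search space, which is presumably why the authors calibrate the matching against sample-specific controls rather than relying on a fixed threshold.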
canonical reference proteomes; Snakemake workflow; PEAKS mass; Singularity container; non-canonical protein isoforms; Modern mass spectrometry; non-canonical mRNA transcription; report ProteomeGenerator; report proteogenomic performance metrics; Novo Transcriptome Assembly; high-confidence identification; reference-assisted proteogenomic database construction; analysis algorithms; sample-specific protein isoforms; sample-specific controls; intron retention; MaxQuant mass spectrometry proteomics algorithms; leukemia cells; alternative transcriptional; factor SRSF 2; SEQUEST HT; High-Accuracy Peptide Mass Spectral; sample-specific transcriptome sequencing; high-accuracy mass spectrometry proteomics; genome-scale proteome discovery; protein sequences; protein isoform identification; non-canonical proteomes; Comprehensive Proteomics; proteogenomic analysis