While refining techniques for tracing the proteogenomic origins of cancer, researchers found protein-coding DNA sequences lurking in the shadowy realm known as junk DNA. Junk DNA is often assumed to be functionless, with the exception of certain sequences believed to play a role in gene expression. But at least some portions of junk DNA actually code for protein, the researchers say.
In fact, these scientists, pioneers in cancer proteogenomics, have identified nearly 100 new protein-coding regions, a number of which qualify as pseudogenes, sequences found in the genome’s putative junkyard.
The scientists, based at Karolinska Institutet and Science for Life Laboratory (SciLifeLab) in Sweden, described their findings November 17 in Nature Methods, in a paper entitled “HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics.”
The “HiRIEF” cited in the paper’s title refers to high-resolution peptide isoelectric-focusing fractionation. Developed by the researchers to increase sensitivity and analytical depth in shotgun proteomics, HiRIEF helps simplify samples before MS analysis. It is, in essence, a search-space reduction technique, a way to expedite comparisons of protein sequence data with data from MS spectra. Such techniques are especially valuable in the study of organisms having large genomes and low protein-coding content.
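The search-space reduction idea can be illustrated with a minimal sketch. It assumes that peptides recovered from a fraction focus within a known pH window, so only candidate peptides whose predicted isoelectric point (pI) falls in that window need to be compared with the spectra. The pKa values, function names, and bisection approach below are illustrative assumptions, not the predictor used in the paper.

```python
# Approximate textbook pKa values (assumed for illustration only).
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 9.0, 3.1


def net_charge(peptide, ph):
    """Estimate the net charge of a peptide at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))   # N-terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))  # C-terminus
    for aa in peptide:
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def predict_pi(peptide, tol=1e-4):
    """Find the pH where the net charge crosses zero, by bisection."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(peptide, mid) > 0:
            lo = mid  # still positively charged, so the pI is higher
        else:
            hi = mid
    return (lo + hi) / 2.0


def filter_by_pi(peptides, ph_min, ph_max):
    """Keep only candidate peptides whose predicted pI lies in the window."""
    return [p for p in peptides if ph_min <= predict_pi(p) <= ph_max]
```

In this sketch, a strongly acidic peptide such as `DDEE` passes a low-pH window while a basic peptide such as `KKRR` is discarded before any spectrum matching takes place.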
“We had to match experimental data for sequences of peptides with millions of possible locations in the whole genome,” said study leader Janne Lehtiö, Ph.D. “We had to develop both new experimental and bioinformatics methods to allow protein-based gene detection, but when we had everything in place it felt like participating in a Jules Verne adventure inside the genome.”
In their paper, Dr. Lehtiö and colleagues describe how they used HiRIEF at the peptide level in the 3.7–5.0 pH range, together with peptide isoelectric point prediction, to probe the six-reading-frame translation of the human and mouse genomes. This work allowed them to identify previously undiscovered protein-coding loci: 98 in the human genome and 52 in the mouse genome. According to the paper’s authors, the method also enabled deep proteome coverage, revealing 13,078 human and 10,637 mouse proteins.
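For readers unfamiliar with six-reading-frame translation, the following is a minimal sketch of the general idea: a DNA sequence is translated in three frames on the forward strand and three on the reverse-complement strand, yielding every protein sequence the DNA could encode. It uses the standard genetic code; the function names are hypothetical, and this is not the authors' pipeline.

```python
from itertools import product

# Standard genetic code, with codons ordered over the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops are '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate a sequence in all six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Calling `six_frame_translation("ATGGCCTAA")` returns a dictionary keyed `+1` to `+3` and `-1` to `-3`; the `+1` frame reads `MA*`, where `*` marks a stop codon. Applied genome-wide, this produces the very large search space that the quote above alludes to.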
The authors grouped genomic loci into three different classes: (1) refined models of known genes, (2) pseudogenes and long noncoding RNA genes, and (3) intronic and intergenic loci with no connection to gene annotations. According to the authors, 36% of human novel peptides mapped to pseudogenes, a surprisingly large percentage given that pseudogenes represented less than 0.1% of the total search space.
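To see why that percentage is surprising, the two figures can be turned into a rough fold-enrichment estimate. The sketch below is simple arithmetic on the numbers quoted above; the function name is hypothetical, and because pseudogenes made up less than 0.1% of the search space, the result is a lower bound.

```python
def fold_enrichment(observed_fraction, expected_fraction):
    """Ratio of the observed share to the share expected by chance."""
    return observed_fraction / expected_fraction


# 36% of novel peptides versus (at most) 0.1% of the search space.
lower_bound = fold_enrichment(0.36, 0.001)
print(f"Pseudogenes are enriched at least {lower_bound:.0f}-fold")
```

In other words, if novel peptides had been scattered across the search space at random, pseudogenes would have accounted for about one in a thousand of them rather than more than one in three.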
“Our study challenges the old theory that pseudogenes don’t code for proteins,” said Dr. Lehtiö. “The presented method allows for protein-based genome annotation in organisms with complex genomes and can lead to discovery of many novel protein-coding genes, not only in humans but in any species with a known DNA sequence.”