Feature Articles : Jul 1, 2013 (Vol. 33, No. 13)

Unraveling the Transcriptome

  • Richard A. Stein, M.D., Ph.D.

While developments in DNA sequencing have provided increasingly high-quality and high-confidence genetic and genomic data, the need for additional layers of inquiry in characterizing biological systems has become acute.

This fueled interest in characterizing and surveying the transcriptome of various cell types under specific conditions.

Insights into the transcriptome have been made possible as a result of several waves of technological advances and, among these, RNA-seq is assuming a central and expanding role.

“RNA-seq provides a great tool to start with and, in all certainty, people will develop more precise methods down the road,” says Chuan He, Ph.D., professor of chemistry at the University of Chicago. Dr. He and his colleagues are using RNA-seq, more specifically m6A-seq or MeRIP-seq, to characterize methylation changes on RNA transcripts.

While RNA methylation was observed and reported decades ago, its importance in shaping gene expression has only more recently come into the spotlight, and understanding the removal of methylation marks has been even more elusive.

Investigators in Dr. He’s group were the first to reveal that, just like DNA and histone methylation, RNA methylation is reversible. They found that the fat mass and obesity-associated protein, FTO, which is involved in human obesity and energy homeostasis, is an oxidative demethylase of RNA N6-methyladenosine. Subsequently, Dr. He and colleagues identified a second RNA demethylase, ALKBH5, which affects mammalian mRNA export and metabolism. While both proteins demethylate N6-methyladenosine RNA residues, they participate in distinct biological pathways and show different tissue expression patterns.

“Once we open this door, there are so many possibilities that emerge because, if we consider all the pathways and networks, RNA modifications can shape and, in some cases, dominate gene regulation,” Dr. He says.

Additional efforts in his lab revealed that RNA demethylation is functionally significant and performs a regulatory role. “We justified two critical points, change in gene expression and reversibility but, based on the more stringent definition for epigenetic modifications, we also need to ask whether these changes are heritable, and this aspect needs significantly more work,” he adds.

According to Jia Meng, Ph.D., associate researcher and bioinformatics core facility supervisor at MIT, “not much research has been done on RNA methylation in the past, but recent approaches are enabling us to study the RNA epigenome at an enhanced resolution and at the genome-wide scale.”

Dr. Meng and his colleagues recently developed FRIP-seq (fragmented RNA immunoprecipitation sequencing), a new tool that combines ChIP-seq with RNA-seq. Because of the nature of RNA, software and algorithms developed for DNA methylation analysis are not informative, and new tools are required.

“This motivated us to develop this new algorithm that will help us, in the long run, to analyze the function of RNA methylation,” says Yufei Huang, Ph.D., professor of electrical and computer engineering at the University of Texas at San Antonio and senior author of the study describing FRIP-seq.

Dr. Huang and his colleagues are currently applying this technology to examine epigenetic changes in the RNA, particularly mRNA, from cancer cell lines.

Based on FRIP-seq, and with the help of computational strategies, a new MATLAB-based package called exomePeak was developed and is freely available for researchers interested in characterizing transcriptome-wide post-transcriptional RNA modifications.

“Over the next few months we will release a new version based on R, and it will be more powerful and user-friendly,” says Dr. Meng.
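To give a sense of what such software has to decide, the sketch below tests whether a window of a transcript carries more reads in the immunoprecipitated (IP) sample than in the untreated input control. It is only an illustration under stated assumptions: it is not exomePeak’s actual algorithm, the function name and all read counts are hypothetical, and a simple one-sided binomial test stands in for whatever statistics a production tool applies.

```python
from math import comb


def enrichment_p_value(ip_count, input_count, ip_total, input_total):
    """One-sided p-value that a window is enriched in the IP sample.

    Assumption: with no enrichment, each read in the window comes from
    the IP library with probability p0 (the IP share of all sequenced
    reads), so the IP count follows Binomial(n, p0).
    """
    n = ip_count + input_count
    p0 = ip_total / (ip_total + input_total)
    # Upper tail of the binomial: P(X >= ip_count)
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(ip_count, n + 1)
    )


# Hypothetical windows: (IP reads, input reads) with equal library sizes.
windows = {"window_A": (40, 10), "window_B": (25, 25)}
for name, (ip_reads, input_reads) in windows.items():
    p = enrichment_p_value(ip_reads, input_reads, 1_000_000, 1_000_000)
    print(name, "enriched" if p < 0.01 else "not enriched", p)
```

Normalizing by library size through p0 matters because an IP library that was simply sequenced more deeply would otherwise look enriched everywhere.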

The existence of RNA methylation in several species, from humans to bacteria, reveals the importance of this process in biology. RNA methylation profiling is marked by several challenges, some of which are shared with DNA methylation profiling, while others are specific to RNA. For example, the presence of both 5-methylcytosine and N6-methyladenosine in RNA makes this modification technologically more demanding to study than it is in DNA.

“In addition, RNA can be very unstable, and this makes it even more challenging to understand how methylation is introduced into and removed from RNA,” Dr. Huang says.

While the correct alignment of RNA reads to the original genomic sequences is one of the major goals in RNA sequencing, this process may be challenging for multiple reasons. One of them is that the length of RNA reads significantly shapes the effectiveness of reconstructing the transcriptome of the original cell, and shorter reads, though less costly to generate, present a higher risk for misalignment.

“The longer the reads, the higher the likelihood to assign them to the correct location,” says Steven L. Salzberg, Ph.D., professor of medicine, biostatistics, and computer sciences at Johns Hopkins University School of Medicine.
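The effect of read length can be illustrated with a toy example. The sketch below counts how many reads of a given length occur exactly once in a short made-up sequence that contains a repeated eight-base segment. The sequence and the function name are hypothetical, and exact string matching stands in for a real aligner.

```python
def unique_fraction(genome, read_length):
    """Fraction of read-length substrings that occur exactly once."""
    n_reads = len(genome) - read_length + 1
    counts = {}
    for i in range(n_reads):
        read = genome[i:i + read_length]
        counts[read] = counts.get(read, 0) + 1
    unique = sum(
        1 for i in range(n_reads)
        if counts[genome[i:i + read_length]] == 1
    )
    return unique / n_reads


# Hypothetical sequence: the segment ACGTTGCA appears twice.
repeat = "ACGTTGCA"
toy_genome = "GATTACA" + repeat + "CCGGTTAAC" + repeat + "TGCATGCC"

for length in (4, 8, 12):
    print(length, round(unique_fraction(toy_genome, length), 2))
```

Once a read is longer than the repeated segment, it necessarily includes flanking sequence that distinguishes one copy from the other, which is why the longest reads in this example can all be placed unambiguously.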

An additional, somewhat related challenge lies in the fact that the human genome contains at least 14,000 pseudogenes. Pseudogenes have highly similar sequences to transcribed genes but, as opposed to them, lack one or several introns, or contain premature stop codons and, as a result, do not encode functional proteins. Nevertheless, intron-spanning RNA reads may align to pseudogenes. A new spliced aligner that Dr. Salzberg and colleagues designed, TopHat2, addressed this and several other concerns.

In a two-step process, TopHat2 first identifies potential intron splice sites, similar to its previous version, TopHat1, and in a second step, it aligns reads that contain multiple exons. Novel algorithms incorporated into TopHat2 allow it to process more diverse sequencing datasets and to align reads of various lengths.

“Overall, TopHat2 aligns more reads, and it does so more accurately than the earlier versions of this algorithm,” Dr. Salzberg says.
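The pseudogene problem is easy to reproduce on a small scale. In the sketch below, a made-up gene (two exons separated by an intron) sits next to an intronless pseudogene copy. A read spanning the exon junction matches the pseudogene contiguously, and only a search that allows the read to be split across an intron places it on the real gene. This is a simplified illustration, not TopHat2’s algorithm: all sequences, function names, and parameters are hypothetical, matching is exact, and only canonical GT...AG introns are accepted.

```python
def unspliced_hits(read, genome):
    """Return every start position where the read matches contiguously."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)
    return hits


def spliced_hits(read, genome, min_anchor=4, max_intron=50):
    """Return (read_start, intron_start, intron_end) split alignments.

    The read is cut at every position leaving at least min_anchor bases
    on each side. Both pieces must match exactly and be separated by a
    gap of at most max_intron bases that starts with GT and ends with AG.
    """
    hits = []
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:cut], read[cut:]
        for left_start in unspliced_hits(left, genome):
            intron_start = left_start + cut
            for right_start in unspliced_hits(right, genome):
                intron_len = right_start - intron_start
                if 0 < intron_len <= max_intron:
                    intron = genome[intron_start:right_start]
                    if intron.startswith("GT") and intron.endswith("AG"):
                        hits.append((left_start, intron_start, right_start))
    return hits


# Hypothetical sequences: a two-exon gene and its intronless pseudogene.
exon1 = "ATGGCTCACGATCC"
intron = "GTAAGTCTTTTCTCCCTTCAG"
exon2 = "TTGACCGAAGCGTAA"
spacer = "C" * 10
gene = exon1 + intron + exon2
pseudogene = exon1 + exon2
genome = gene + spacer + pseudogene

# A read covering the last 6 bases of exon 1 and the first 6 of exon 2.
read = exon1[-6:] + exon2[:6]

print("contiguous matches:", unspliced_hits(read, genome))
print("spliced matches:", spliced_hits(read, genome))
```

An aligner that only considers contiguous matches would assign this read to the pseudogene, so weighing spliced placements against contiguous ones is central to handling such reads.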

The Pseudogene Problem

“We are interested in several aspects related to RNA-seq, as this approach allows us to find coding and noncoding transcripts, examine splicing, and perform quantification,” says Mark B. Gerstein, Ph.D., professor of biomedical informatics at Yale University.

Investigators in Dr. Gerstein’s lab recently performed a complete annotation of pseudogenes from the GENCODE Project data.

While pseudogenes were historically viewed as genomic loci that might not have any roles, some of them were recently proposed to have an active cellular role. By using locus-specific gene expression analyses combined with high-throughput RNA-seq, Dr. Gerstein and colleagues revealed that pseudogene transcription occurs in a tissue-dependent manner and is associated with active promoter regions and open chromatin states. The analysis revealed that even though many pseudogenes are inactive, some of them potentially may assume regulatory functions that are reminiscent of noncoding RNA molecules.

“RNA-seq is something that we will see being increasingly rolled out for multiple applications, including personal transcriptome profiling in cancer and other diseases,” Dr. Gerstein says.

Among the remaining challenges are the need to standardize gene expression measurements, to incorporate the degree to which specific genes are turned on or off, and to advance insights into noncoding RNA—a topic that has seen intense transformation over the past few years.

“The power of RNA-seq is that we can use it to visualize complex molecular signatures,” says B. Alex Merrick, Ph.D., group leader of the Molecular Toxicology and Informatics Group at the National Institute of Environmental Health Sciences.

Dr. Merrick and colleagues recently illustrated this in an analysis exploring the impact of subchronic aflatoxin B1 exposure on the male rat liver transcriptome. Aflatoxin B1, classified as a Group 1 carcinogen by the World Health Organization’s International Agency for Research on Cancer, is synthesized by certain Aspergillus species. Causally linked to hepatocellular carcinoma, this toxin is still a significant public health concern worldwide, particularly in developing countries.

In a comparison of transcriptome profiles obtained with RNA-seq and microarray analyses, Dr. Merrick and colleagues reported that an increased number of differentially expressed transcripts can be visualized with RNA-seq. A key finding was that 49 differentially expressed transcripts were changed upon aflatoxin exposure.

“These transcripts would not have been captured, had we relied solely on microarray data,” Dr. Merrick says. Two of these transcripts, which appear to originate from new, previously unannotated genes, were induced 10- and 25-fold, respectively, as a result of exposure. Investigators in Dr. Merrick’s group cloned one transcript, HafT1 (hepatic aflatoxin transcript 1), and reported that it appears to correspond to a unique gene, for which no corresponding ESTs were previously identified. HafT1, induced into visibility by aflatoxin exposure, lies within an exon of a transcription factor (ortholog to mouse Tcf7l1), but it is transcribed in the opposite direction.

“There are so many unique features about this gene, and we would not have been able to capture them by using microarrays,” Dr. Merrick says.

A relevant aspect of this experimental strategy is that the 90-day 1 ppm exposure that was employed in the analysis provided an opportunity to examine chemical carcinogenesis under conditions that mimic chronic, low-dose human toxicity. “This is a reasonable surrogate for human exposure,” says Dr. Merrick.

While liver tumors can form over time at this exposure level, no malignancies or advanced tissue necrosis were observed in the study. Several of the differentially regulated transcripts were related to the function of kinetochore components, which are involved in cell division. These molecular changes, in all likelihood, would not have been discernible amid the more pronounced histological damage that generally occurs at higher doses. This illustrates the ability of RNA-seq to reveal very early molecular changes that occur during tissue remodeling, at stages that precede histological damage and tumor formation, and the strategy emerges as a promising tool to dissect molecular and cellular pathways affected by other toxins.

Transcriptomics Meets Proteomics

“It is exciting to perform RNA-seq analysis on the same sample on which proteomics was done,” says Lloyd M. Smith, Ph.D., professor of chemistry at the University of Wisconsin-Madison.

One of the challenges accompanying mass spectrometry-based analyses is that human proteomic databases, despite being frequently updated, do not reflect cell-to-cell variation in the multiple protein forms that are found in various cell and tissue types.

Two major approaches have been implemented and are broadly used in proteomics. Bottom-up proteomics, which involves the enzymatic digestion of proteins into fragments that are subsequently identified by mass spectrometry, is technically more amenable, and the data are easier to interpret than in top-down proteomics, which involves the ionization and mass spectrometry analysis of intact proteins.

While bottom-up proteomics offers higher sensitivity, it is not informative about the context where the peptides originated from, such as alternatively spliced or post-translationally modified protein products. “There are a lot of things that get lost during bottom-up proteomics,” Dr. Smith says.

A new concept that Dr. Smith and colleagues introduced, that of proteoforms, refers to all the molecular forms in which the protein product of a single gene can be found.

This term, capturing a new layer of complexity thus far mostly overlooked, would ensure that protein changes resulting from coding single nucleotide polymorphisms and mutations, post-translational modifications, and RNA splicing are represented when referring to cellular proteins. “It is important to describe all the different forms of a protein that may exist in cells,” he explains.

Recently, Dr. Smith and his colleagues collected proteomics and RNA-seq data from a homogeneous cell population, and developed a bioinformatics pipeline in which novel splice junction sequences were translated into the respective polypeptides, to establish a database that can be used to characterize splice junctions during mass spectrometry.

“Using RNA-seq in conjunction with proteomics is more powerful than performing proteomics alone,” he says. While the strength of this analysis was illustrated in one cell type, efforts in Dr. Smith’s lab are currently directed toward characterizing splice junction peptides and splice site variation in additional cell types. “This is similar to giving glasses to proteomics, and its main advantage is the possibility to unveil splice variants that otherwise one could not see,” Dr. Smith adds.
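The central step of such a pipeline, turning a splice junction observed by RNA-seq into candidate peptides for a mass spectrometry search database, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline from Dr. Smith’s lab: the junction sequences and function names are hypothetical, the standard genetic code is assumed, and all three forward reading frames are translated because the sketch does not know the true frame.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate codon by codon; incomplete trailing codons are ignored."""
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3)
    )


def junction_peptides(upstream, downstream):
    """Translate a junction sequence in the three forward frames.

    upstream is the end of one exon and downstream the start of the
    next. Frames containing a stop codon (*) are discarded.
    """
    seq = upstream + downstream
    peptides = {}
    for frame in range(3):
        peptide = translate(seq[frame:])
        if "*" not in peptide:
            peptides[frame] = peptide
    return peptides


# Hypothetical junction: last 9 bases of one exon, first 9 of the next.
print(junction_peptides("ATGGCCAAG", "GATTGGTTC"))
```

The resulting peptides could then be appended to a search database so that spectra from the same sample can be matched against junction-spanning sequences that a generic reference database would not contain.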

RNA-seq helped open new research avenues, forge inter- and cross-disciplinary connections, and define new concepts. Areas that historically received relatively little attention, such as RNA methylation, are now expanding into vibrant fields, while more recent disciplines, such as proteomics, are acquiring additional levels of inquiry.

Knowing the extent to which learning about the genome reshaped our perspectives about biology, and considering the even more accentuated complexity of the transcriptome, one can only imagine the wealth and intricacy of the regulatory networks that are waiting to be elucidated.