Now that gene expression microarray experiments are routine, the challenge is to extract accurate intelligence from the data they produce. The problem isn’t inaccuracies in the analysis software or methodologies, or even in the microarrays. Rather, because complex biological processes are driven by interactions of multiple genes, different analytical approaches could yield different results. Researchers, instead of focusing on individual genes, are stepping back to see the broader picture.
“The problem with microarray analysis is that you have a small sample size versus a large number of things you’re looking at,” notes Dhammika Amaratunga, Ph.D., senior research fellow in statistics, Johnson & Johnson Pharmaceutical Research & Development (J&JPRD; www.jnjpharmarnd.com).
In such circumstances, a few genes bubble to the top, and those are what researchers tend to focus on. “That’s not enough,” Dr. Amaratunga says, “because those few samples don’t represent intersample expression differences and intergene correlations well, particularly if they are subtle, and therefore don’t directly present the true findings. You have to dig deeper.”
Data analysis for gene expression microarrays is an evolving field, Dr. Amaratunga emphasizes. “The technology is improving and people are running experiments with greater precision, but the field, including the biology, the technology, and the associated data analysis,” hasn’t matured yet. In terms of data analysis, companies throughout the world are developing approaches that integrate additional information to create more thorough analyses.
J&JPRD is integrating gene function and pathway information into routine gene expression data analysis. That approach, according to Dr. Amaratunga and Nandini Raghavan, Ph.D., principal biostatistician, “yields a different and functionally more interpretable array of genes than methods that rely solely on individual gene scores,” which tend to identify the most strongly differentially expressed genes in a given experiment at the cost of those with more subtle changes, they said, quoting one of their published papers.
Although many genes aren’t well delineated, their functional groups and pathway information often are better understood. Because many genes with a common function share a common binding site in their promoter regions for a specific transcription factor, they can be coregulated by that transcription factor, according to their paper, which discusses an example of a group of enzymes involved in the oxidation of fatty acids in rat livers.
These enzymes increase their production in the presence of peroxisome proliferators. Increasing the production of cofunctional genes could suppress other pathways, their research shows. Consequently, applying gene function and pathway information to gene expression analysis could result in more effective compounds being designed faster and with fewer side effects.
“By grouping genes by biological function or pathway, you tend to pick up the subtle signals that may not be picked up if you look at these genes individually,” Dr. Raghavan says. Although many of the genes grouped together in the analysis may seem relatively unimportant singly, their collaborative actions may regulate molecular mechanisms and biological pathways that otherwise could be undetected or ignored, thereby altering the interpretation of experiments and outcomes.
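The idea of pooling subtle per-gene signals can be illustrated with a minimal Python sketch. It is not J&JPRD's published method: the expression values are hypothetical, the cutoff of 2.5 is arbitrary, and the sum-over-square-root combination assumes the genes are independent, which real pathway members often are not.

```python
import math
from statistics import mean, variance


def welch_t(treated, control):
    """Welch t-statistic for one gene (treated vs. control samples)."""
    se = math.sqrt(variance(treated) / len(treated)
                   + variance(control) / len(control))
    return (mean(treated) - mean(control)) / se


def gene_set_score(t_stats):
    """Combine per-gene statistics: sum scaled by sqrt(set size).

    Assumes independent genes; an illustrative choice only.
    """
    return sum(t_stats) / math.sqrt(len(t_stats))


# Hypothetical log-expression values for four genes in one pathway:
# gene -> (treated samples, control samples)
pathway = {
    "geneA": ([8.3, 8.1, 8.4, 8.0], [8.0, 7.9, 8.2, 7.8]),
    "geneB": ([5.3, 5.1, 5.4, 5.0], [5.0, 4.9, 5.2, 4.8]),
    "geneC": ([6.6, 6.2, 6.5, 6.3], [6.2, 6.1, 6.4, 6.0]),
    "geneD": ([7.2, 7.0, 7.3, 7.1], [7.0, 6.9, 7.1, 6.8]),
}

CUTOFF = 2.5  # arbitrary illustrative threshold, not a recommended value

t_stats = {gene: welch_t(t, c) for gene, (t, c) in pathway.items()}
set_score = gene_set_score(list(t_stats.values()))

for gene, t in t_stats.items():
    print(f"{gene}: t = {t:.2f}, passes cutoff: {abs(t) > CUTOFF}")
print(f"pathway score = {set_score:.2f}, passes cutoff: {set_score > CUTOFF}")
```

In this toy data no single gene clears the cutoff, yet the pathway as a whole does, because four small shifts in the same direction reinforce one another.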
Integrating gene function and pathway information in gene expression data analysis is being used alongside other analytic methods in J&JPRD’s drug development programs, Dr. Raghavan says. Studies have yielded some “relevant findings.” The company is exploring using this method in other areas such as proteomics and possibly developing related methods to identify unknown pathways and biological networks.
Stratagene (www.stratagene.com) is addressing the issue of gaining more accurate and complete information from gene expression data through a gene set enrichment-analysis approach. The algorithms are based on those from the Broad Institute (www.broad.mit.edu) and look at sets of statistically significant gene changes between experiments or between the subjects and controls, according to David Edwards, Ph.D., director of software solutions.
The benefit, Dr. Edwards says, is that rather than focus on individual gene expression, you can look at a gene with some known biological function first and then locate commonalities with other genes or pathways. Dr. Edwards uses Broad subsets related to specific biopathways and diseases, and to physical location. “This method gives more clues faster and is an alternative to existing methods.” But, he emphasizes, “researchers should use multiple approaches to understanding biological function.”
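A gene set enrichment calculation can be sketched in a few lines of Python. What follows is a simplified, unweighted running-sum statistic in the spirit of the Broad Institute approach, not Stratagene's or Broad's actual implementation; the gene names, ranking, and sets are hypothetical, and a real analysis would also estimate significance, for example by permutation.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up for genes in the set and
    down for genes outside it; return the largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical genes, ranked from most to least differentially expressed
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]

top_set = {"g1", "g2", "g4"}     # members cluster near the top of the list
spread_set = {"g3", "g6", "g9"}  # members scattered through the list

es_top = enrichment_score(ranked, top_set)
es_spread = enrichment_score(ranked, spread_set)
print(f"clustered set: {es_top:.3f}, scattered set: {es_spread:.3f}")
```

A set whose members sit together near the top of the ranking earns a score close to 1, while a set scattered evenly through the list stays near zero.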
Stratagene added Gene Set Enrichment Analysis to its software applications in August, Dr. Edwards says, offering the benefit of performing multiple analyses within one application. When he investigated biological function in lung cancer, he also used gene ontology, standard expression analysis, and copy-variation analysis, he says.
The next endeavor, according to Dr. Edwards, is to combine data to look at overlapping groups and pathways, and to perform different experiments in the same system.
JMP® Genomics, statistical discovery software from SAS (www.sas.com), builds on the platforms of SAS and of JMP, a business unit of SAS, for genomics-specific analysis. The application was developed in response to the increasingly large data sets and more complex modeling needs of biologists, chemists, and biostatisticians. JMP Genomics enables researchers to run prebuilt SAS analytical methods on the desktop, according to Shannon Conners, Ph.D., JMP Genomics product manager. “JMP Genomics is a marriage of JMP and SAS,” she says.
The marriage yields “better visualization and heavy data manipulation, and you don’t need to know programming or SAS,” she says of the point-and-click interface. Dr. Conners notes, though, that “biostatisticians can look inside and see what we’re doing” and adapt existing SAS code to run customized programs.
JMP Genomics is used to identify patterns in high-throughput genetics, copy number, expression microarrays, and proteomics data. More than 100 analytical procedures help researchers generate a clearer vision of data quality and then apply sophisticated statistical modeling methods to determine relationships between experimental variables. Scientists can merge annotation from various sources or link directly to Ingenuity Systems’ (www.ingenuity.com) Pathways Analysis software for further functional analysis of results.
Because JMP is a platform, Dr. Conners emphasizes, “you aren’t limited to a specific analytic work flow, letting users employ additional analyses beyond our prebuilt ones.” JMP Genomics features dynamically interactive graphics and analysis dialog boxes so researchers can explore data relationships using traditional and advanced statistical algorithms, Dr. Conners explains.
“In the future, users will combine data sets from multiple experiments, and those data sets will be really big,” she says. For IT departments, the ongoing migration of CPUs to a 64-bit environment will help deal with the larger data sets, and a 64-bit version of JMP Genomics is being planned. Another option is the expansion of the JMP Genomics platform to take advantage of a grid-computing environment for even faster, more efficient processing power.
Alternative to Bottom-up Approach
Wei Liu, Ph.D., principal scientist at Wyeth Research (www.wyeth.com), is using data mining to find tissue-selective genes “as a complement to the traditional genomics’ bottom-up approach to drug discovery,” he says. This approach, notes Dr. Liu, yields a smaller, more focused data set than the traditional method.
Researchers know that multiple genes are involved in given diseases, yet the usual approach looks at a single gene. “About three or four years ago,” Dr. Liu says, “we began a systems biology approach, dividing the body into tissue, organs, and cells to see their involvement in disease.”
Wyeth has compiled a collection of tissue-selective genes covering more than 10% of human genes, finding 119 kinases, 33 phosphatases, and 152 transcription factors that are tissue selective. A rough literature search of 4,000 tissue-selective genes revealed about 1,600 that were linked to about 3,000 diseases. So, by targeting particular tissues, researchers can produce a smaller, more relevant, and more specific data set.
“There is no solid example of this approach in use in drug discovery yet,” Dr. Liu says, but the approach may decrease off-target and off-tissue effects, thus speeding the early pipeline. By using immune-specific genes as markers, he can replace pathology testing with electronic histology.
Citing the neutrophil as an example, Dr. Liu notes that it is involved in COPD, arthritis, and other diseases. Although it has a half-life of 4–10 hours, it is present for a longer period in diseased tissues. “We can look for genes expressed in neutrophils and find out how many are involved in the diseases. Knocking them down or destroying their expression pathway has a direct impact on the disease and the therapeutic outcome.”
Noise hasn’t traditionally been a large issue in microarray analysis, but it is growing as data sets become larger. “Consequently, many researchers aren’t aware of the substantial noise reduction that can be achieved in microarray experiments, leading to higher quality data and experiments that are reproducible,” according to Thomas J. Downey, president and CEO at Partek (www.partek.com). “Reducing noise begins at the experiment-design stage, thus ensuring that appropriate analysis methodology can be leveraged.”
For example, he says, if treated samples of DNA or RNA chips are run on Monday and controls are run on Wednesday, differences that actually are due to technical batch effects may masquerade as real biological signals. “A better strategy is to balance the treatments with the processing batches, running treated samples and controls in the same batch,” he says. The approach guarantees that apparent differences between treatment groups aren’t attributable to batch effects, and it allows analysis methods to eliminate noise, he adds, making it easier to distinguish between noise and biological signal.
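The difference between the confounded Monday/Wednesday layout and a balanced one can be made concrete with a short Python sketch. The sample names and helper functions are hypothetical and do not come from Partek's software; in practice, samples would usually also be randomized within each group before being assigned to batches.

```python
from collections import Counter


def balanced_assignment(treated, controls, batches):
    """Deal each group round-robin across batches so every batch gets both."""
    plan = []
    for group, samples in (("treated", treated), ("control", controls)):
        for i, sample in enumerate(samples):
            plan.append((sample, group, batches[i % len(batches)]))
    return plan


def is_confounded(plan):
    """Return True if any batch contains only one treatment group."""
    groups_per_batch = {}
    for _sample, group, batch in plan:
        groups_per_batch.setdefault(batch, set()).add(group)
    return any(len(groups) < 2 for groups in groups_per_batch.values())


treated = ["T1", "T2", "T3", "T4"]
controls = ["C1", "C2", "C3", "C4"]

# The problematic layout: all treated chips on Monday, all controls on Wednesday
confounded_plan = (
    [(s, "treated", "Monday") for s in treated]
    + [(s, "control", "Wednesday") for s in controls]
)

# The balanced alternative: both groups represented in both batches
balanced_plan = balanced_assignment(treated, controls, ["Monday", "Wednesday"])
counts = Counter((group, batch) for _sample, group, batch in balanced_plan)

print("confounded layout flagged:", is_confounded(confounded_plan))
print("balanced layout flagged:", is_confounded(balanced_plan))
print(dict(counts))
```

In the balanced plan each processing day holds two treated and two control samples, so a day-to-day shift affects both groups equally and can be separated from the treatment effect.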
Downey advocates the analysis of variance (ANOVA) method. “It’s the most powerful method because it looks at many factors.” The number of factors to be analyzed depends on the size of the study. “Noise is an unexplained variation in data. By explaining the variations due to technical processing batches, we can eliminate as much as 99 percent of the noise,” Downey notes.
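How an added batch factor turns "noise" into explained variation can be shown with a small sum-of-squares decomposition in Python. This is a minimal sketch for one gene in a balanced design, not Partek's implementation. The measurements are invented, with a deliberately large batch shift, so the fraction explained here is a property of the toy data rather than evidence for any particular figure.

```python
from statistics import mean


def sum_sq_between(values, labels):
    """Sum of squares explained by one factor (between-level variation)."""
    grand = mean(values)
    total = 0.0
    for level in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == level]
        total += len(group) * (mean(group) - grand) ** 2
    return total


# Hypothetical expression of one gene: 2 treatments x 2 batches x 2 replicates
values = [5.0, 5.2, 6.1, 5.9, 8.1, 7.9, 9.0, 9.2]
treatment = ["control", "control", "treated", "treated"] * 2
batch = ["Monday"] * 4 + ["Wednesday"] * 4

grand_mean = mean(values)
ss_total = sum((v - grand_mean) ** 2 for v in values)
ss_treatment = sum_sq_between(values, treatment)
ss_batch = sum_sq_between(values, batch)

# Because the design is balanced, the two factors are orthogonal and
# their sums of squares can simply be subtracted from the total.
resid_without_batch = ss_total - ss_treatment            # batch left unexplained
resid_with_batch = ss_total - ss_treatment - ss_batch    # batch modeled
noise_explained = ss_batch / resid_without_batch

print(f"residual ignoring batch: {resid_without_batch:.2f}")
print(f"residual modeling batch: {resid_with_batch:.2f}")
print(f"share of former noise explained by batch: {noise_explained:.1%}")
```

The simple subtraction works only because treatments are balanced across batches; in an unbalanced or confounded layout the two factors overlap, and the batch contribution cannot be cleanly separated from the treatment effect.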