6 Tips for Annotating Your Genome from Scratch

You’ve just received the freshly assembled genomic sequence of your favorite organism. Now what?

  • Figure: Tools with a rich UI that tightly integrate sequence, function, and pathway data enable a faster and deeper understanding of genome biology.

    A deluge of genome sequences has attended the rise of next-generation sequencing (NGS) technologies. However, learning more about the genotype-phenotype relationship can be challenging. Here are some tips to identify and understand the function of genes in a novel genome.

    1. Run an RNAseq experiment. In organisms where no transcript splicing occurs, ab initio prediction methods (implemented, for example, in the Glimmer program) can identify most of the protein-coding sequences. For all other organisms (principally eukaryotes), these algorithms have their limitations. For example, correctly identifying exon-intron boundaries remains a considerable challenge, and noncoding RNAs are ignored by these approaches. A more straightforward and comprehensive way to identify protein-coding sequences relies on empirical evidence, namely short-read sequencing of mRNA (RNAseq), which has become affordable with NGS. With a paired-end read technology, representative RNA samples can be sequenced at the same time as the genomic DNA.
    2. Map the short reads to the genome. A prerequisite to identifying full-length transcripts is mapping the RNAseq reads to the assembled genome. A tool that can map reads across exon-intron boundaries (e.g., TopHat) is required to achieve the highest possible accuracy, and is indispensable for protein sequence identification (Step 4).
    3. Identify transcripts. Exons can be identified from the mapped reads, and transcripts can then be built from the exons. To get the most from the data, it is best to employ a tool that identifies the different splice variants of each gene (e.g., Cufflinks).
    4. Generate protein sequences. Proteins are the end product of a coding gene, and a plethora of tools can predict function based on protein sequence (Step 5). Extract the longest open reading frame (ORF) from each transcript, using a tool such as EMBOSS getorf.
    5. Annotate proteins. Much information about a gene product can be inferred from sequence similarity. Functional domains can be identified using dedicated resources (e.g., Pfam), and/or function can be predicted through sequence homology with proteins in other organisms (e.g., by running BLAST searches against UniProt).
    6. Store and analyze genome annotation. Once all the annotation for a genome has been generated, it is important to secure effective data mining tools. This matters even more as the number of sequenced genomes grows and the data also need to be stored for other types of analysis (e.g., phenotype-genotype analysis).
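
To make Step 4 concrete, here is a minimal Python sketch of longest-ORF extraction from a single transcript. It is an illustration only, not a reimplementation of EMBOSS getorf: the function names are hypothetical, and it assumes that an ORF starts at ATG, ends at one of the three standard stop codons, and that ORFs without a stop codon are ignored. Assembled transcripts are not always strand-specific, so the reverse strand can optionally be scanned as well.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf(transcript: str, both_strands: bool = False) -> str:
    """Return the longest ATG..stop ORF (stop codon included), or "" if none.

    Hypothetical helper for illustration; assumes the standard genetic code
    and ignores ORFs that run off the end of the transcript.
    """
    seq = transcript.upper()
    strands = [seq]
    if both_strands:
        strands.append(reverse_complement(seq))

    best = ""
    for strand in strands:
        for frame in range(3):
            start = None  # position of the first ATG seen since the last stop
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    orf = strand[start:i + 3]
                    if len(orf) > len(best):
                        best = orf
                    start = None
    return best


if __name__ == "__main__":
    # Toy transcript with two ORFs in different reading frames.
    print(longest_orf("ccATGAAATTTGGGTAAccATGCCCTAGtt"))
```

In practice the resulting nucleotide ORF would then be translated to a protein sequence before annotation (Step 5); a dedicated tool handles alternative genetic codes and partial ORFs, which this sketch does not.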
