Three Bioinformatics Tools that Any Scientist Can Learn Today

Hello and thanks for stopping by!  The first entry of what I hope will be many is an essay by Mark N. Ziats, PhD.  He is the founder and president of the consulting firm Creative Bioinformatics (www.creativebioinformatics.com).  His publication record (http://www.ncbi.nlm.nih.gov/pubmed/?term=ziats+mn%5Bauthor%5D) demonstrates his first-hand experience in coping with the flood of structural and functional genomics data that threatens to overwhelm the modern neuroscientist.  In the essay below he shares some useful tools he has learned about in his work.  Please scroll down to the bottom of this post for more on Mark.  So, without further rambling, on to his essay:

Three Bioinformatics Tools that Any Scientist Can Learn Today

Bioinformatics is scary to the uninitiated. Fortunately, many bioinformatics tools are available to biologists with no computational expertise, and they can provide important, insightful analysis of genomics datasets. These tools are designed to be point-and-click applications that take no more computer skills to operate than Microsoft Word or PowerPoint.

I personally taught myself how to use each of the following three tools in less than one day.  With a starter dataset to work with, and a few hours spent learning each of these programs, anyone can expertly annotate genomics data to provide novel insight into cellular pathways and network interactions among the genes of interest.

 

1) DAVID Gene Ontology (http://david.abcc.ncifcrf.gov/)

The Database for Annotation, Visualization and Integrated Discovery (DAVID) is a gene set enrichment analysis tool provided to the research community for free by an intramural laboratory at the National Institute of Allergy and Infectious Diseases, NIH.  DAVID is a web interface: gene lists can be copied and pasted in, or uploaded as files, directly on the website.

DAVID takes as input lists of genes in many formats, such as the official Entrez gene symbol, UniProt ID, or even unique gene identifiers from some of the most commonly used microarray platforms.  The software then runs a functional ontology enrichment analysis on the input list, assessing it for over-representation of particular cellular or biological pathways (termed Gene Ontology, or GO, categories).  In other words, it asks whether more genes in the list belong to a given category than would be expected by chance.
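To make the idea of over-representation concrete, here is a minimal Python sketch of a plain one-sided hypergeometric test.  It is an illustration only: DAVID's own statistic is a modified Fisher exact test (the EASE score), and the function name and all of the numbers below are hypothetical.

```python
from math import comb


def enrichment_p_value(population_size, category_size, list_size, overlap):
    """Probability of seeing at least `overlap` annotated genes when
    `list_size` genes are drawn at random (without replacement) from a
    background of `population_size` genes, `category_size` of which
    carry the annotation."""
    max_overlap = min(category_size, list_size)
    total = comb(population_size, list_size)
    tail = sum(
        comb(category_size, k)
        * comb(population_size - category_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 200 annotated to one
# category, 100 genes in the input list, 8 of which carry the annotation.
p = enrichment_p_value(20000, 200, 100, 8)
expected = 100 * 200 / 20000  # overlap expected by chance alone
print(f"expected overlap by chance: {expected:.1f}, p-value: {p:.2e}")
```

A small p-value here means the category turns up in the list far more often than chance would predict, which is the kind of result an enrichment tool reports for each category.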

As output, DAVID provides lists of pathways enriched in the input dataset with corresponding significance values, which are even corrected for multiple testing.  While the default settings should be more than sufficient for most users, DAVID lets users adjust parameters such as the stringency of significance, the type of significance tests to include, and the databases/pathways to assess.
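For readers curious about what a multiple testing correction does, the following is a minimal sketch of the Benjamini-Hochberg procedure, one common way to adjust p-values when many categories are tested at once.  The function name and the p-values are hypothetical, and DAVID's own corrected columns may be computed differently.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # never increase as the raw p-values get smaller.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.6]  # hypothetical raw p-values
for p_raw, p_adj in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p_raw:.3f}  adjusted = {p_adj:.3f}")
```

The point of the adjustment is that a raw p-value just under 0.05 may no longer pass once the number of categories tested is taken into account.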

Gene ontology enrichment analysis should be the first step in the annotation of any genomics dataset, and DAVID provides users with a simple interface to do so for free.  Results can be downloaded in .txt format and then opened with Microsoft Excel, or even just copy-and-pasted from the browser interface.

Estimated time to learn: 2 hours

 

2) Cytoscape (www.cytoscape.org/)

Cytoscape is another free tool for the research community.  It visualizes network interactions among genes/proteins and can quantify network properties so that networks can be compared with one another.

Cytoscape was originally developed by the Institute for Systems Biology in Seattle, WA, and is now managed by a multi-member consortium of research organizations.  Unlike the web-based DAVID, Cytoscape must be downloaded and installed on the user's local computer.

Cytoscape is an open-source software tool, meaning that others can access the code the program is built upon and therefore can develop additional tools that integrate within Cytoscape (called plugins).  These can be incredibly useful in addition to the basic features of Cytoscape, and there are hundreds available with a myriad of functions (all free of charge).

Cytoscape can import users' files or can directly access archived datasets or known gene-gene (or protein-protein) interaction databases.  Cytoscape then creates interaction networks based on the underlying data (for example correlations in gene expression levels, or known protein-protein interactions).  These networks can be further analyzed by quantifying their graph theory properties using built-in analysis tools or more sophisticated plugins.
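As a rough illustration of what "a network built from expression correlations" and "graph theory properties" mean, here is a minimal Python sketch.  It is not Cytoscape's code: the gene names, expression values, the 0.9 correlation threshold, and the function names are all hypothetical, and Cytoscape does the equivalent work through its point-and-click interface.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_network(expression, threshold=0.9):
    """Connect two genes when |r| across samples meets the threshold."""
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[g1], expression[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


def degree(edges, genes):
    """Number of connections (edges) touching each gene."""
    counts = {g: 0 for g in genes}
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


# Hypothetical expression values for four genes across five samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.1, 3.9, 6.2, 8.0, 9.9],
    "GENE_C": [5.0, 4.1, 2.9, 2.0, 1.1],
    "GENE_D": [3.0, 1.0, 4.0, 1.0, 5.0],
}
edges = correlation_network(expression)
degrees = degree(edges, expression)
# Density: fraction of all possible gene pairs that are connected.
n_genes = len(expression)
density = len(edges) / (n_genes * (n_genes - 1) / 2)
print(sorted(edges), degrees, density)
```

Degree and density are two of the simplest network properties; they are the sort of numbers that let one network be compared with another.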

Unlike DAVID, Cytoscape provides not only quantifiable analysis outputs but also publication-ready graphics.  While an analysis may assess many networks for different properties, one high-quality graphic of a representative network makes for a nice figure in genomics manuscripts.

Cytoscape can initially seem intimidating, but after spending a day getting comfortable with its workflow any scientist should be able to create networks, analyze their properties, and generate publication-ready figures quickly.  Furthermore, the Cytoscape community is large and supportive, so there are many forums and publications to help beginners get acquainted and start analyzing their data.

Suggested reading:  Cline MS, et al. Integration of biological networks and gene expression data using Cytoscape. Nat Protoc. 2007;2(10):2366-82.

Estimated time to learn: 8 hours

 

3) Ingenuity Pathway Analysis (www.ingenuity.com/products/ipa)

For those of you at research institutions, check to see if your department or institution has a site license for Ingenuity Pathway Analysis (IPA).  This software is a hybrid of DAVID and Cytoscape in two senses.  In how it runs, the analysis is carried out on IPA's servers, but users access it through a Java-based interface that is downloaded to their local machines.  In what it does, it also sits somewhere between those two tools.

IPA takes gene lists as input, similar to DAVID, and as output provides both functional enrichment analysis lists (again similar to DAVID) and graphical pathway networks similar to Cytoscape. However, IPA differs from these two programs in a number of ways.

First, IPA is somewhat easier to learn than Cytoscape, but at the cost of providing less information about the ‘networks.’  Whereas Cytoscape allows for the statistical assessment of network properties based on graph theory, IPA only provides graphical representations of biological pathways as networks of interacting genes.  It focuses solely on the biological nature of the pathways, not on their graph theoretical properties.

Second, unlike DAVID, which assesses functional enrichment in known Gene Ontology (GO) categories, IPA assesses enrichment in pathways drawn from its proprietary ‘knowledge base.’  This knowledge base is built on manually curated descriptions of gene-gene (or protein-protein) interactions from the literature (as is GO), but without the open access to the underlying annotations that GO allows.

The output of IPA is both functional enrichment lists with statistical significance, and gene interaction pathways.  While the functional enrichment is similar to DAVID's, the pathway graphics produced by IPA are more ‘gene focused’ than those in Cytoscape, where the emphasis is more on the network as a whole.  Therefore, IPA is often a good resource for biologists interested in specific molecular/cellular pathways based on known biological interactions, as opposed to the more theoretical approach to network analysis provided by Cytoscape.

Note: requires site license to access after initial free trial period

Estimated time to learn: 4 hours

 

Summary

DAVID, Cytoscape, and IPA represent three of the core bioinformatics tools for assessing genomics data that can be learned by any biologist with no computer skills, and can be learned today.  These three tools provide complementary insight into the functional ontologies, pathways, and network properties of underlying gene lists, with varying degrees of user competency needed and different types of outputs generated.  Used together, these three tools could easily produce enough annotation of a gene expression dataset to fill the results section of a high-quality manuscript.  The best part is that any scientist can quickly and easily learn these tools, and he or she could have all the analysis complete by the end of the week.

 

 

 

About the author:

Mark N. Ziats, PhD is the founder and current president of Creative Bioinformatics Consultants, LLC (www.creativebioinformatics.com), a fee-for-service bioinformatics firm specializing in custom data analysis for academic laboratories, non-profit research institutes, and industry.  Creative Bioinformatics Consultants functions like a team of ‘temporary post-docs,’ providing customers with specific data analysis at their direction, so that customers can evaluate their specific scientific hypotheses without resorting to pipeline processing of data. Contact him via email at mark@creativebioinformatics.com.