Microbes are microscopic organisms that live in every nook and cranny of our planet. Without them, plants wouldn’t grow, garbage wouldn’t decay, humans couldn’t digest food, and there would be no life on Earth, at least not as we know it. By examining the genetic makeup of these “bugs,” scientists hope to understand how they work and how they can be used to solve important problems, such as identifying new sources of clean energy.

One important tool for this analysis is the IMG/M (Integrated Microbial Genomes with Microbiome Samples) data management system, which supports the analysis of microbial communities sequenced by the Department of Energy’s Joint Genome Institute (JGI). With computing and storage support from the National Energy Research Scientific Computing Center (NERSC), this system currently contains more than 3 billion microbial genes—more than any other similar system in the world.

“Last December IMG/M crossed the boundary of 1 billion genes recorded in the system, and we wouldn’t have been able to reach this important milestone without NERSC,” says Victor Markowitz, Chief Informatics Officer and Associate Director at JGI. Markowitz also heads Lawrence Berkeley National Laboratory’s Biological Data Management and Technology Center (BDMTC). Since the December milestone, billions more genes have been recorded in the system.

A Fire Hose of Data

No microbe is an island. In fact, these organisms live in communities of thousands of species or more. By studying the aggregate genome, or metagenome, of these communities, researchers can better understand how the different organisms interact with each other and their host, as well as how they adapt to different environments.

According to Nikos Kyrpides, who heads JGI’s Microbial Genome and Metagenome Super Program, the number and size of metagenome sequence datasets are growing at a remarkable pace, primarily because advances in DNA sequencing technologies allow rapid and relatively cheap sequencing of microbial communities. Understanding these community interactions is crucial for developing clean energy sources and for uncovering the mechanisms of disease.

Kyrpides also notes that a series of innovative data processing methods developed by JGI’s Omics group has paved the way for coping with the deluge of metagenomic data. Most of the data produced by current sequencing technologies consists of short fragments, and the new tools allow researchers to predict genes on these fragments.
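To illustrate why short fragments complicate gene prediction, consider a deliberately naive open-reading-frame (ORF) scan, sketched below in Python. This is a toy example, not JGI’s pipeline: it checks only the forward strand and assumes a gene’s start and stop codons both fall within the read, whereas a gene on a short fragment is often truncated mid-frame, which is exactly the situation specialized metagenome gene callers must handle.

    # Toy sketch: naive ORF scan on a short DNA fragment (forward strand only).
    # Real metagenome gene-prediction tools use statistical models and can
    # call partial genes that run off the edge of a fragment.

    START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

    def find_orfs(fragment, min_codons=10):
        """Return (start, end) positions of naive ORFs on the forward strand."""
        orfs = []
        for frame in range(3):  # scan all three reading frames
            start = None
            for i in range(frame, len(fragment) - 2, 3):
                codon = fragment[i:i + 3]
                if codon == START and start is None:
                    start = i  # open a candidate gene at the first ATG
                elif codon in STOPS and start is not None:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((start, i + 3))  # keep ORFs long enough to be plausible genes
                    start = None
            # Note: an ORF still open here ran off the end of the fragment
            # with no stop codon; this sketch simply discards it.
        return orfs

    print(find_orfs("CCATGAAACCCGGGTTTAAACCCGGGTTTAAACCCTAGGG"))  # [(2, 38)]

On a whole genome this kind of scan works tolerably well; on reads only a few hundred bases long, most genes straddle fragment boundaries and would be silently lost, which is why purpose-built methods were needed.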

“These new methods allow us for the first time to explore the full genetic capacity of the sequenced data,” says Kostas Mavrommatis, who leads the Omics group. But he adds that the comparative analysis and data management of billions of genes pose a tremendous computational challenge.

According to Markowitz, the rapidly growing size of metagenome sequence datasets requires supercomputers to interpret the biological function of hundreds of millions, even billions, of genes. As part of the NERSC and JGI partnership, JGI tools and pipelines can now run on NERSC’s Cray XE6 “Hopper” system, currently the eighth most powerful supercomputer in the world. Giving scientists the ability to analyze metagenome sequence datasets at this scale also required new data management methods, developed by BDMTC’s Amy Chen and NERSC’s Shane Canon.

“NERSC has a long history of successfully managing computing clusters, and its management of JGI’s cluster, together with the added capacity of Hopper, has sped up the biological interpretation process and allowed IMG/M to reach this milestone of billions of genes so quickly,” says Markowitz.

Both NERSC and JGI are managed by Berkeley Lab on behalf of the Department of Energy, and IMG/M was developed with support from Berkeley Lab’s Computational Research Division.