Amazon Web Services (AWS) India Private Limited on September 22 announced that the Centre for Cellular and Molecular Biology (CCMB), a premier research organisation focused on modern molecular biology and population-scale genomics, has chosen AWS as a cloud provider to accelerate its genomics research projects. CCMB operates under the direction of the Council of Scientific and Industrial Research (CSIR), and one of its focus areas is the study of genetic material, how it varies among populations, and how that variation leads to disparities in human health and disease.
“Understanding the genomic variation in India’s population is a government priority towards developing precision healthcare and diagnostics and delivering them at affordable costs. However, genomics research is data intensive, and the increasing volume and velocity of genomics data is a challenge for research institutions in managing both infrastructure and costs,” said Pankaj Gupta, Leader – Public Sector (Government, Education, Healthcare), AWS India Private Limited. “Finding greater computing efficiency at scale is well addressed by cloud computing, but more crucially, it can accelerate genomics research, enabling researchers to translate insights faster to enable drug development and drive better health care treatments.”
Life sciences and genomics research organisations need to access, store, and analyse large amounts of data generated by next-generation high-throughput sequencers. Historically, these organisations have relied on on-premises servers to meet their storage and compute needs. The data-intensive nature of genomics research meant that CCMB frequently had to procure more on-premises storage to manage petabyte-scale datasets and to store both the raw data and the output files generated from secondary and tertiary analysis. CCMB was also relying on on-premises high-performance computing (HPC) clusters to perform this analysis. These clusters were prone to downtime, which affected research timelines and output. Because on-premises servers created challenges for scalability and performance, CCMB turned to cloud computing to scale its data storage and analysis capacity as needed.
“At a time when genetics research is becoming critical for life sciences advancement, disease diagnosis, and drug development, we must innovate using technologies like cloud computing to achieve outcomes faster and better,” said Dr. Divya Tej Sowpati, genomics scientist at the CSIR CCMB.
CCMB moved 83 terabytes of genomics data from on-premises servers to AWS using AWS Snowball, an offline data transport service that uses secure devices to transfer large amounts of data into and out of the AWS Cloud without traversing the internet. It then migrated its genomic analysis toolkit and bioinformatics data pipelines for secondary analysis to Amazon Genomics CLI, an open-source tool that enables genomics organisations to process raw genomics and biological data. CCMB also accessed multiple genomics databases from the Registry of Open Data on AWS (RODA) without having to download them locally for processing, saving months of data download time and benefiting from access to documented sources of truth.
Running on AWS, CCMB performed short tandem repeat (STR) genotyping — an analysis to determine a person’s DNA profile — on 3,200 samples from the 1000 Genomes Project, an international research effort to establish a detailed catalogue of human genetic variation. Using services such as Amazon Aurora, Amazon Elastic Compute Cloud (Amazon EC2), EC2 Auto Scaling, Amazon Simple Storage Service (Amazon S3), and AWS Batch, CCMB was able to reduce the time taken for research analysis by up to 98%, from 550 days to just nine days on average.
In another project, CCMB has started analysing breast cancer samples to identify molecular signatures of triple negative breast cancers among the Indian population. Using CPU- and GPU-accelerated computing on the AWS Cloud, CCMB reduced the analysis time per sample by 50 to 70%.
CCMB also used AWS graphics processing unit (GPU) instances to train and test machine learning (ML) neural network models on long-read data sequenced using Oxford Nanopore sequencers to detect DNA modifications associated with various diseases, including cancer, neurodegenerative disorders, and cardiovascular diseases. It achieved an accuracy of more than 91%, and reduced the time taken to train these models from several days on its on-premises servers to approximately three to four hours per dataset on AWS.
CCMB joins a list of premier genomics research initiatives around the world running their genomics research on AWS, including organisations such as AstraZeneca, CSIRO, GRAIL, Illumina, Melbourne Genomics Health Alliance, National Institutes of Health, Regeneron, and Stanford University.