Managing Large Molecular Data Sets

Integrating large and complex molecular biomedical datasets at scales useful for research and clinical applications poses significant challenges. This is especially true for genetic and genomic data, which are commonly used in clinical practice. Similar concerns apply to other types of “omics” data, such as transcriptomics, epigenomics and proteomics.

In a research setting, exome or whole-genome profiling on a large or even population-scale cohort is increasingly common. While clinical use of molecular medicine is typically on a smaller scale, the reality of data-driven personalized medicine means that hospitals and clinicians will need to deal with the scale of data encountered in research applications. This includes addressing issues related to regulatory compliance for patient data.

Regardless of whether it is for research or clinical use, any large-scale biomedical data management effort must address several common issues.

Data Size

Storing and managing the sheer amount of data produced by modern molecular techniques is a significant challenge. Raw sequencer data can be in the hundreds of gigabytes, if not terabytes. Even processed variant data for a few thousand subjects can be many gigabytes at exome or whole-genome scales. Moreover, the typical formats used to store processed data may not be the most efficient for storage or usage, and the tools used to process these file formats are often not easily accessible to non-specialists.

Data Search and Retrieval

Ensuring that ready-to-use molecular data can be stored efficiently is not enough. The storage system must also make this data available to users quickly and straightforwardly. Even sophisticated users who can manipulate data with open-source tools still need to write scripts and work through complex processes to retrieve specific variants or genomic regions. Automating and simplifying the retrieval process, preferably with additional search criteria like clinical relevance, would allow more users to utilize molecular data effectively in more applications.
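To make the idea of retrieving variants by genomic region concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular tool or of dātma's system: the toy records, the `build_index` and `query_region` names, and the in-memory layout are all assumptions made for the example. It groups variants by chromosome, sorts them by position, and uses binary search to answer a region query.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical toy records: (chromosome, 1-based position, ref allele, alt allele).
VARIANTS = [
    ("chr1", 9000, "G", "A"),
    ("chr1", 1000, "A", "G"),
    ("chr2", 1500, "T", "C"),
    ("chr1", 2500, "C", "T"),
]


def build_index(variants):
    """Group variants by chromosome and sort each group by position.

    Returns a dict mapping chromosome -> (sorted positions, sorted records).
    """
    grouped = defaultdict(list)
    for record in variants:
        grouped[record[0]].append(record)

    index = {}
    for chrom, records in grouped.items():
        records.sort(key=lambda r: r[1])
        positions = [r[1] for r in records]
        index[chrom] = (positions, records)
    return index


def query_region(index, chrom, start, end):
    """Return variants on `chrom` with start <= position <= end (inclusive)."""
    if chrom not in index:
        return []
    positions, records = index[chrom]
    lo = bisect_left(positions, start)   # first position >= start
    hi = bisect_right(positions, end)    # one past the last position <= end
    return records[lo:hi]


index = build_index(VARIANTS)
hits = query_region(index, "chr1", 900, 3000)
for chrom, pos, ref, alt in hits:
    print(f"{chrom}:{pos} {ref}>{alt}")
```

A real system would hold far more data than fits in a Python list and would typically add filters such as clinical relevance, but the shape of the request (a chromosome plus a coordinate range) stays the same.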

Data Fragmentation

Molecular data can be scattered across different institutions, making it challenging to collate relevant data. Even within the same institution, data fragmentation can occur. This fragmentation poses significant challenges for data sharing, access rights and collaboration. A solution that centralizes access to molecular data while maintaining appropriate access controls would streamline this process and substantially improve the availability and usability of molecular data.

Security

The security of molecular data is a critical concern, especially when dealing with human and clinical data. Storing and retrieving molecular data must prioritize data security both at rest and in transit and ensure the highest possible level of protection against data breaches.

Multimodal Integration

Molecular data is most valuable when it is combined with other contextual information. In research, genomic data needs to be matched with other molecular, phenotypic and demographic data for analysis. In clinical applications, combining molecular data with imaging data and electronic health records (EHR) is essential for making informed decisions and tracking patient treatment progress. However, matching samples or patients to their corresponding data across different modalities presents challenges similar to those encountered with molecular data. A system that handles integration as an integral function, such as by synchronizing with EHR systems in clinical applications, is necessary to unlock the full potential of molecular data.
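The matching problem described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: both data sources are assumed to share a common patient identifier, and the `link_modalities` function and the sample records are hypothetical. In practice, identifiers often differ between systems and need to be reconciled before a join like this is possible.

```python
def link_modalities(genomic, clinical):
    """Join per-patient records on a shared identifier.

    Returns (linked, unmatched): `linked` maps each identifier present in
    both sources to its combined record; `unmatched` lists identifiers
    found in only one source, which need manual follow-up.
    """
    shared = sorted(genomic.keys() & clinical.keys())
    linked = {
        pid: {"genomic": genomic[pid], "clinical": clinical[pid]}
        for pid in shared
    }
    unmatched = sorted(genomic.keys() ^ clinical.keys())
    return linked, unmatched


# Hypothetical example data keyed by an assumed shared patient identifier.
genomic_records = {
    "P001": {"variant_count": 3},
    "P002": {"variant_count": 7},
}
clinical_records = {
    "P002": {"diagnosis": "example diagnosis"},
    "P003": {"diagnosis": "example diagnosis"},
}

linked, unmatched = link_modalities(genomic_records, clinical_records)
print("linked:", list(linked))
print("unmatched:", unmatched)
```

Reporting the unmatched identifiers explicitly, rather than silently dropping them, makes gaps between modalities visible before any analysis is run.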

Meeting the Challenge

Addressing these challenges is a complex task, but a comprehensive solution that handles all of them is crucial to realizing the full potential of molecular data and personalized medicine. dātma's ultimate goal is to build a system that tackles all these challenges, and we will explore how we are achieving that goal in future discussions.

Hollis Wright

Principal Bioinformatics Engineer at dātma
