Rapid Phylogenetic Tree Construction from Long Read Sequencing Data: A Novel Graph-Based Approach for the Genomic Big Data Era
Abstract
Genomics is one of the largest producers of big data, with an expected 40 EB of data generated every year. This rapid growth calls for efficient methods of analysis and classification. We present a novel, automated pipeline for fast phylogenetic tree construction from long-read sequencing data. Our approach reduces the computational cost by using compact repeat graphs instead of full genome assemblies. We integrate graph embedding techniques that combine structural and content-based approaches to capture genomic relationships efficiently. On 20 bacterial genomes spanning 5 classes, our method achieves a cophenetic correlation of 0.53 with the ground-truth phylogenetic tree. The pipeline reconstructs meaningful evolutionary relationships directly from sequencing reads, without requiring complete assemblies or time-consuming alignments. This work is a step toward rapid pathogen classification during outbreaks and offers a scalable solution for analyzing the growing number of sequenced organisms. By bridging graph theory, machine learning, and genomics, our method paves the way for more efficient phylogenetic analysis in the era of big data biology.
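To make the reported metric concrete, the following minimal sketch shows one way a cophenetic correlation between an inferred tree and a reference tree could be computed. It is an illustration under stated assumptions, not the pipeline's implementation: the abstract does not say how the tree is built from the embeddings, so average-linkage (UPGMA) clustering on Euclidean distances is assumed here, and the embeddings, reference distances, and all function names are hypothetical toy values.

```python
from itertools import combinations
from math import dist, sqrt


def upgma_cophenetic(d_matrix):
    """Average-linkage (UPGMA) clustering on a symmetric distance matrix.

    Returns a dict mapping each leaf pair (i, j), i < j, to the height at
    which the two leaves are first merged (their cophenetic distance).
    """
    n = len(d_matrix)
    clusters = {i: [i] for i in range(n)}  # cluster id -> leaf members
    d = {(i, j): d_matrix[i][j] for i, j in combinations(range(n), 2)}
    coph = {}
    next_id = n
    while len(clusters) > 1:
        # Closest pair of current clusters and their merge height.
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        # Leaves split across a and b first meet at this height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[(min(i, j), max(i, j))] = height
        # Size-weighted average distance from the merged cluster to the rest.
        na, nb = len(clusters[a]), len(clusters[b])
        new_d = {}
        for c in clusters:
            if c in (a, b):
                continue
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return coph


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def cophenetic_correlation(coph_a, coph_b):
    """Correlate two sets of cophenetic distances over the same leaf pairs."""
    pairs = sorted(coph_a)
    return pearson([coph_a[p] for p in pairs], [coph_b[p] for p in pairs])


# Hypothetical 2-D embeddings for four genomes (toy values, not real data).
embeddings = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
d_matrix = [[dist(u, v) for v in embeddings] for u in embeddings]
inferred = upgma_cophenetic(d_matrix)

# Hypothetical cophenetic distances taken from a reference tree.
reference = {(0, 1): 0.2, (2, 3): 0.4,
             (0, 2): 1.0, (0, 3): 1.0, (1, 2): 1.0, (1, 3): 1.0}

score = cophenetic_correlation(inferred, reference)
print(f"cophenetic correlation: {score:.3f}")
```

A value of 1 would mean the two trees order all pairwise divergences identically, so a score such as 0.53 indicates partial agreement with the reference tree. The pure-Python clustering is used only to keep the sketch dependency-free; a library routine for hierarchical clustering would serve the same purpose.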