Advanced computational approaches for understanding allele-specific biology of complex diseases
From Ekaterina Churkina on February 5th, 2021
Reconstructing the complete phased sequence of every chromosome copy in human and non-human species is important for medical, population and comparative genetics. Unprecedented advances in sequencing technologies have opened new avenues for reconstructing these phased sequences, which would enable a deeper understanding of the molecular, cellular and developmental processes underlying complex diseases. Despite these sequencing innovations, highly polymorphic and gene-dense regions such as the human leukocyte antigen (HLA) region are not yet fully phased in the reference genome. The reference genome also still contains gaps in multi-megabase repetitive regions, so the annotation of novel expression and methylation results is incomplete and inaccurate, which affects the interpretation of the molecular genetics and epigenetics of diseases. There is a pressing need for streamlined, production-level, easy-to-use computational approaches that can reconstruct high-quality, chromosome-scale phased sequences and that can be applied to hundreds of human genomes.
In this talk, I will first present a combinatorial optimization formulation of, and solution to, the haplotype reconstruction problem that leverages new long-range strand-specific sequencing technology and long reads to generate chromosome-scale phasing. Second, I will present an efficient graph-based algorithm for accurate haplotype-resolved assembly of human individuals. The advantage of graphs is that they provide a compact representation of massive datasets, allowing them to be integrated in a common genome sequence space. This method takes advantage of a new long, accurate data type (PacBio HiFi) and long-range Hi-C data. For the first time, we can generate accurate chromosome-scale phased assemblies with a base-level accuracy of Q50 and a contiguity of 25 Mb within 24 hours per sample, setting a milestone for the genomics community.
Third, I will present a generalized computational approach that works with any type of sequencing data, for different numbers of haplotypes and for repeat variation. Finally, I will discuss the importance of haplotype-resolved assemblies for various medical applications.
In summary, my work develops scalable computational approaches that efficiently and robustly combine data from a variety of sequencing technologies to produce high-quality diploid assemblies. These computational methods have the potential to enable high-quality precision medicine and to facilitate new and unbiased studies of human (and non-human) haplotype variation in various populations, which are current goals of the Human Genome Reference Project.