Reconstructing the complete phased sequences of every
chromosome copy in human and non-human species is important for medical,
population and comparative genetics. The unprecedented advancements in
sequencing technologies have opened up new avenues to reconstruct these phased
sequences that would enable a deeper understanding of molecular, cellular and
developmental processes underlying complex diseases. Despite these
sequencing innovations, highly polymorphic and gene-dense regions such as the
human leukocyte antigen (HLA) region are not yet fully phased in the reference genome. The
reference genome still contains gaps in multi-megabase repetitive regions, and
thus the annotation of novel expression and methylation results is incomplete and
inaccurate, which affects the interpretation of the molecular genetics and
epigenetics of diseases. There is a pressing need for streamlined,
production-level, easy-to-use computational approaches that can
reconstruct high-quality chromosome-scale phased sequences,
and that can be applied to hundreds of human genomes.
In this talk, I will first present a
combinatorial optimization formulation of, and solution to, the haplotype
reconstruction problem that leverages new long-range strand-specific technology
and long reads to generate chromosome-scale phasing. Second, I will present an
efficient graph-based algorithm to perform accurate haplotype-resolved assembly
of human individuals. The advantage of graphs is that they enable a unique
compact representation of massive datasets and their integration in a common
genome sequence space. This method takes advantage of a new long, accurate data
type (PacBio HiFi) and long-range Hi-C data. For the first time, we can generate
accurate chromosome-scale phased assemblies with base-level accuracy of Q50 and
continuity of 25 Mb within 24 hours per sample, setting a
milestone for the genomics community. Third, I will present a generalized
computational approach that works with any type of sequencing
data, for different numbers of haplotypes and repeat variation. Finally, I
will present the importance of haplotype-resolved assemblies for various medical
applications.
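For readers unfamiliar with the Q notation used above, the following is a minimal sketch of the standard Phred-scale conversion, Q = -10 * log10(P), where P is the per-base error probability. The function names are illustrative only and are not taken from the methods presented in the talk; the sketch simply shows what Q50 implies for a 25 Mb contiguous sequence.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred-scaled quality value Q to a per-base error probability."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert a per-base error probability P to a Phred-scaled quality value."""
    return -10 * math.log10(p)


# Q50 corresponds to roughly one error per 100,000 bases.
q50_error = phred_to_error_rate(50)

# Expected number of base errors in a 25 Mb stretch at Q50 (illustrative arithmetic).
expected_errors_per_25mb = q50_error * 25_000_000

if __name__ == "__main__":
    print(f"Q50 per-base error probability: {q50_error:.0e}")
    print(f"Expected errors in 25 Mb at Q50: {expected_errors_per_25mb:.0f}")
```

Under this convention, a Q50 assembly carries on the order of a few hundred base errors per 25 Mb of sequence.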
In summary, my work develops scalable computational approaches that efficiently
and robustly combine data from a variety of sequencing technologies to produce
high-quality diploid assemblies. These computational methods have the potential
to enable high-quality precision medicine and facilitate new and unbiased
studies of human (and non-human) haplotype variation in various populations, which
are current goals of the Human Genome Reference Project.