Seminar in Computer Engineering

Bilkent University
Department of Computer Engineering
CS 590/690 SEMINAR

Multiple Genome Alignment using Locally Consistent Parsing

Ecem İlgün
Master Student
(Supervisor: Assoc.Prof.Can Alkan)
Computer Engineering Department
Bilkent University

Abstract: Multiple genome alignment is a fundamental problem in computational biology, with applications in phylogenetic tree generation, variant detection, and comparative genomics. Pairwise genome alignment tools like Minimap2, Cactus, Mercator, and Mavid have seen significant advancements. Although some tools, including Cactus, can perform multiple whole genome alignments, the multiple sequence alignment problem is NP-hard; therefore, scaling these approaches to align multiple genomes simultaneously remains a challenge. Distributed and parallelized multiple genome alignment relies on efficiently partitioning the input genomes into smaller segments that can be processed independently. Existing partitioning methods often rely on maximal exact matches (MEMs) or maximal unique matches (MUMs), which require identifying anchor points for segmentation. Finding MEMs of size m in a string of size n takes O(m log n) time, which incurs additional overhead to the alignment procedure. Alternatively, minimizers, a widely used sketching technique, can be used to find alignment anchor positions, which can then be chained and verified. However, minimizers have drawbacks in their distribution patterns and frequency, owing to their short length, leading to suboptimal partitioning in terms of computational and communication overhead. This work will explore using Locally Consistent Parsing (LCP) as an alternative to minimizers and MEMs/MUMs for multiple genome partitioning and alignment. LCP identifies "cores" - short genomic sequences that are consistently present across genomes - and can provide a more comprehensive and compact representation of the input data compared to minimizers. Additionally, LCP cores can be computed hierarchically in linear time, where higher-level and longer cores can be used as primary anchors, and lower-level shorter cores can be used for tuning partitions. This makes LCP-based partitioning better suited for distributed alignment, as it can lead to more balanced computational loads and reduced communication overhead between processes. Building on these advantages, we will develop a multiple genome alignment framework that utilizes LCP-based partitioning to enable scalable and efficient alignment of large-scale genomic datasets.

DATE: November 25, Monday @ 15:50 Place: EA 502