Compression for Population Genetic Data

Supervisor: Lloyd Elliott

Previous work on dictionary methods for compressing population genetic data has yielded little improvement, even with shared dictionaries and with compressors ranging from the classical zlib to Facebook's more recent Zstandard (zstd). In this project, the student will explore compression methods based on source coding theory, among other approaches, with the aim of improving the storage efficiency and access time of population-level genetic data and of supporting multivariate genome-wide association studies. The work leverages Bayesian nonparametric notions of exchangeability and partial exchangeability in population-level genetic data. Applications include summary statistics for genome-wide association studies, logistic regression, and advances in compression theory. Familiarity with Intel AVX instructions and with parallel programming tools such as POSIX threads, OpenMP, or process forking is recommended, as is fundamental knowledge of the C programming language or Fortran.
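
To make the shared-dictionary baseline concrete, the following is a minimal sketch in Python, not the project's actual pipeline. It assumes genotypes are stored one byte per variant with values 0, 1, or 2, and it uses a reference genotype vector as the preset dictionary for zlib. The function names, the synthetic data, and the genotype encoding are all hypothetical choices made for illustration. The zeroth-order empirical entropy is included as a simple source-coding yardstick against which compressed sizes might be compared.

```python
import math
import random
import zlib
from collections import Counter


def empirical_entropy_bits(data: bytes) -> float:
    """Zeroth-order empirical entropy in bits per symbol (per byte here)."""
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def compress_with_dict(data: bytes, shared_dict: bytes) -> bytes:
    """Deflate `data` using `shared_dict` as a preset zlib dictionary."""
    comp = zlib.compressobj(level=9, zdict=shared_dict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, shared_dict: bytes) -> bytes:
    """Inverse of compress_with_dict; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=shared_dict)
    return decomp.decompress(blob) + decomp.flush()


def make_synthetic_genotypes(n_variants: int, n_changes: int, seed: int = 0):
    """Hypothetical toy data: a reference vector and a sample that differs
    from it at up to `n_changes` randomly chosen sites."""
    rng = random.Random(seed)
    reference = bytes(rng.choice((0, 1, 2)) for _ in range(n_variants))
    sample = bytearray(reference)
    for site in rng.sample(range(n_variants), n_changes):
        sample[site] = rng.choice((0, 1, 2))
    return reference, bytes(sample)


if __name__ == "__main__":
    reference, sample = make_synthetic_genotypes(n_variants=10_000, n_changes=50)
    plain = zlib.compress(sample, 9)
    with_dict = compress_with_dict(sample, reference)
    assert decompress_with_dict(with_dict, reference) == sample
    print("raw bytes:            ", len(sample))
    print("zlib, no dictionary:  ", len(plain))
    print("zlib, shared dict:    ", len(with_dict))
    print("entropy (bits/symbol):", empirical_entropy_bits(sample))
```

Running the script prints the raw size, the two compressed sizes, and the entropy estimate for the toy sample. The synthetic sample is deliberately a near copy of the dictionary, so the figures say nothing about real cohorts, where the gain from a shared dictionary has reportedly been small.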