Abstract:
Both reference genomes assembled for individual species and large, publicly
maintained sets of resequenced genomes are of immense value to researchers.
The former represent important milestones for research involving the species of
interest and serve as ostensibly static points of reference for other data, while the
latter serve as catalogues of genetic variation, enabling researchers to place their
own data in a wider context. However, maintaining sets of resequenced genomes
and ensuring their integrity as they are updated to match new releases
of their reference genome pose computational challenges, as does the general
task of manipulating and comparing such large sets of genomes.
This work reports on the detection and correction of significant errors which were
introduced into resequenced tomato data in the course of updating them to a new
version. It also introduces Tersect, a low-level utility optimized for manipulating
and comparing large sets of resequenced genomic data, and Tersect Browser, a Web
application that combines the high performance of Tersect with a higher-level
indexing and precomputation scheme to allow interactive comparison of large sets
of resequenced genomes. Tersect Browser gives biologists a tool capable of
generating visualisations of genetic distance and phylogenetic relationships
based on whole-genome sequence data from hundreds of genomes in seconds rather
than hours.