CSE60532 Bioinformatics - Correlating Genetic and Geographic Distance
I did this project with Nate Garrison, a Notre Dame undergraduate.
The study of genomes and DNA sequences in general is terribly fascinating. While it has produced many important insights about biology and life, there is still much that science doesn't understand about the systems at play. One of the big problems is the sheer glut of data; genomes can range from thousands to billions of nucleotides in size. If we were to gather the whole genome for a number of members of a species (mosquitoes, for instance), most of their DNA sequences would be the same. However, since each individual is unique, there will be differences. The regions where the sequences differ are frequently interesting areas to study.
But how can we find these areas which are different? Even more importantly, how can we find areas which differ in an interesting way, rather than through random genetic mutation alone? We propose that a difference is "interesting" when the amount of change between two genomes corresponds to how far apart the two individuals lived. (Or how far apart in time they lived, or how different the ocean depths were at which they lived, etc.) To find these interesting areas, we developed a tool that computes the correlation between genetic distance and annotated (e.g. geographic) distance and plots it along the genome. The graph to the right does this for 13 genomes of the SARS virus, where genetic distance is compared to the time of each sample.
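To make the idea concrete, here is a minimal sketch of a windowed correlation scan. It is not the tool itself: the description above does not say which distance measure or correlation statistic the tool uses, so this sketch assumes Hamming distance on pre-aligned, equal-length sequences and the Pearson correlation coefficient. The function names (hamming, pearson, windowed_correlation) and the window and step parameters are hypothetical, chosen only for illustration.

```python
from itertools import combinations
from math import sqrt


def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def windowed_correlation(sequences, annotations, window=4, step=2):
    """Slide a window along aligned sequences and, for each window,
    correlate pairwise genetic distance with pairwise annotated distance.

    Assumption: annotations are one number per sequence (e.g. sampling
    day), so annotated distance is the absolute difference.
    Returns a list of (window_start, correlation) tuples.
    """
    pairs = list(combinations(range(len(sequences)), 2))
    annotated = [abs(annotations[i] - annotations[j]) for i, j in pairs]
    length = len(sequences[0])
    results = []
    for start in range(0, length - window + 1, step):
        genetic = [
            hamming(sequences[i][start:start + window],
                    sequences[j][start:start + window])
            for i, j in pairs
        ]
        results.append((start, pearson(genetic, annotated)))
    return results


# Toy data (made up for illustration): three aligned sequences
# annotated with a sampling time.
toy_sequences = ["AAAAAAAA", "AAAAAAAC", "AAAAAACC"]
toy_times = [0, 1, 2]
for start, r in windowed_correlation(toy_sequences, toy_times, window=4, step=4):
    print(start, round(r, 3))
```

In the toy data, the first window is identical across all sequences, so it carries no signal, while the differences in the second window grow with the time between samples. Plotting the correlation against the window start gives the kind of along-the-genome curve described above. For geographic annotations, the absolute difference would need to be swapped for a distance between coordinates.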
Files: Paper (pdf).