With the emergence of high-throughput next-generation sequencing machines, data is being produced at a very high rate, and a central challenge is mapping this data back to the genome. One significant problem with many genomic mapping programs is the way they handle duplicate regions in genomic DNA. Since it is impossible to know exactly where a sequence from a duplicate region should be mapped, many programs simply discard these sequences. Often, this results in a loss of nearly 40% of the data.
This project develops gnumap, a program capable of handling such repetitive regions. Using statistical formulas, gnumap accounts for repetitive reads by distributing them across the several regions of the genome to which they could map. In addition, the program's output is formatted so that it can be viewed easily in other free, readily available programs. Several benchmark data sets were created with spiked-in duplicate regions, and gnumap accounted for these duplicate regions more accurately.
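
As a rough illustration of what distributing a repetitive read across several regions can look like, the sketch below splits each read among its candidate positions in proportion to a per-position score. This is not gnumap's actual formula: the proportional weighting, the function names `distribute_read` and `accumulate`, and the example scores are all assumptions made for illustration.

```python
from collections import defaultdict


def distribute_read(candidate_scores):
    """Split one read across its candidate positions.

    candidate_scores maps a position to a non-negative score
    (hypothetical alignment score; higher means a better match).
    Returns fractional weights that sum to 1 for the read.
    """
    total = sum(candidate_scores.values())
    if total <= 0:
        return {}  # nothing to distribute
    return {pos: score / total for pos, score in candidate_scores.items()}


def accumulate(reads):
    """Sum the fractional weights of all reads at each position."""
    coverage = defaultdict(float)
    for candidate_scores in reads:
        for pos, weight in distribute_read(candidate_scores).items():
            coverage[pos] += weight
    return dict(coverage)


# Illustrative data only (assumed, not from the benchmark sets).
reads = [
    {("chr1", 100): 1.0},                      # uniquely mapped read
    {("chr1", 100): 3.0, ("chr2", 500): 1.0},  # read from a duplicate region
]
coverage = accumulate(reads)
```

In this toy setup the repetitive read contributes a fraction of its weight to each candidate position instead of being discarded, so every read still counts exactly once in total.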