How do they sequence genomes?

Visualizing* the graph theory behind genome assembly

*Don't like reading? See it live here

A De Bruijn graph demonstrating the assembly of a DNA sequence.

Overview

DNA sequencing is not a clean science. Sequencing the genome of any animal actually produces millions to billions of short, overlapping strings of DNA, called reads. The challenge is to piece together a full genome sequence from these reads.

Assembling a DNA sequence from short reads uses a De Bruijn graph to find the overlaps between them. Often, the preliminary De Bruijn graph contains ambiguities: more than one Eulerian path (a path that uses every edge exactly once) is possible, so a single best reconstruction is not immediately apparent. These ambiguities, or tangles, are resolved by including paired-end reads, which provide a partial ordering of the nodes. In the graph above, for example, you might learn that AA comes before TC. These partial orderings allow the graph to be detangled and a full sequence to be assembled.
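As a rough illustration of the idea (not the course implementation), the sketch below treats each k-mer as an edge from its (k-1)-character prefix to its (k-1)-character suffix, then walks an Eulerian path with Hierholzer's algorithm. The function names `buildDeBruijn`, `eulerianPath` and `spell` and the example reads are hypothetical, and the sketch assumes error-free reads that form one connected graph containing an Eulerian path.

```javascript
// Build a De Bruijn graph: every k-mer becomes one edge prefix -> suffix.
function buildDeBruijn(kmers) {
  const adj = new Map();     // node -> list of successor nodes
  const balance = new Map(); // node -> (out-degree minus in-degree)
  for (const kmer of kmers) {
    const from = kmer.slice(0, -1);
    const to = kmer.slice(1);
    if (!adj.has(from)) adj.set(from, []);
    if (!adj.has(to)) adj.set(to, []);
    adj.get(from).push(to);
    balance.set(from, (balance.get(from) || 0) + 1);
    balance.set(to, (balance.get(to) || 0) - 1);
  }
  return { adj, balance };
}

// Hierholzer's algorithm: follow unused edges until stuck, then backtrack.
// Assumes an Eulerian path exists (not checked here).
function eulerianPath({ adj, balance }) {
  if (adj.size === 0) return [];
  // Start at the node with one more outgoing than incoming edge, if any.
  let start = adj.keys().next().value;
  for (const [node, diff] of balance) {
    if (diff === 1) start = node;
  }
  // Copy the adjacency lists so the input graph is left untouched.
  const remaining = new Map([...adj].map(([n, outs]) => [n, outs.slice()]));
  const stack = [start];
  const path = [];
  while (stack.length > 0) {
    const node = stack[stack.length - 1];
    const outs = remaining.get(node);
    if (outs.length > 0) {
      stack.push(outs.pop()); // consume one unused edge
    } else {
      path.push(stack.pop()); // dead end: node is final in the path
    }
  }
  return path.reverse();
}

// Spell the sequence: first node plus the last character of each later node.
function spell(path) {
  if (path.length === 0) return '';
  return path[0] + path.slice(1).map((n) => n[n.length - 1]).join('');
}

// Hypothetical 3-mers taken from a short made-up sequence.
const exampleKmers = ['ATG', 'TGG', 'GGC', 'GCG', 'CGT', 'GTG', 'TGC', 'GCA'];
console.log(spell(eulerianPath(buildDeBruijn(exampleKmers))));
```

For these example reads more than one Eulerian path exists, so the spelled sequence is not guaranteed to be the one the reads came from. That is the kind of tangle that paired-end reads help resolve.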

After implementing this assembly algorithm for a Computational Molecular Biology course, I wanted to explore how the method could be explained to friends and family with no computational or biological background. The still-evolving result is a template that renders the still-tangled De Bruijn graph of text entered by the user, who also selects the k-mer length to use.

If you'd like to explore the code, find the GitHub repository here.

Implementation

This web app is implemented using Node.js and D3.js. The input text and k-mer length are passed to the Node application and parsed. From the parsed text, substrings of the specified length are created, and relations (edges) are created between all nodes that link to one another. This information is serialized to JSON and passed back to the browser, where D3 renders it as the graph displayed to the user.
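A minimal sketch of the server-side step described above, under stated assumptions: each distinct substring of length k becomes a node, and a link joins each substring to the one that starts one character later in the text. The function name `buildGraph` and the `nodes`, `links`, `id`, `source`, `target` and `count` field names are hypothetical and are not taken from the repository; the nodes-and-links shape is simply a common input format for D3 force layouts.

```javascript
// Turn input text and a k-mer length into a nodes/links object.
function buildGraph(text, k) {
  if (!Number.isInteger(k) || k < 1 || k > text.length) {
    throw new RangeError('k must be an integer between 1 and the text length');
  }
  const nodes = [];
  const seen = new Set();  // k-mers that already have a node
  const links = new Map(); // "source,target" key -> link object
  let previous = null;

  for (let i = 0; i + k <= text.length; i++) {
    const kmer = text.slice(i, i + k);
    if (!seen.has(kmer)) {
      seen.add(kmer);
      nodes.push({ id: kmer });
    }
    if (previous !== null) {
      // Consecutive k-mers overlap by k-1 characters, so they are linked.
      const key = JSON.stringify([previous, kmer]);
      if (!links.has(key)) {
        links.set(key, { source: previous, target: kmer, count: 0 });
      }
      links.get(key).count += 1; // repeated overlaps raise the count
    }
    previous = kmer;
  }
  return { nodes, links: [...links.values()] };
}

// The object can be serialized and sent to the browser as JSON.
console.log(JSON.stringify(buildGraph('ATGATG', 3)));
```

Counting repeated links instead of emitting duplicates keeps the payload small, and the browser could use the count to draw heavier edges where the text repeats itself.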