Evenness of coverage – An important alignment feature
January 16th, 2018
The intention of this post is to give a short introduction to the topic of evenness of coverage, which is an important feature of a sequence alignment.
Evenness – Why should I check this?
In NGS (whole genome sequencing, whole exome sequencing or amplicon sequencing) you normally get reads from a sequencing machine or a sequencing company and align them to a reference genome to check for aberrations in the sequences or relative to the reference genome, such as copy number variations (CNVs) or single nucleotide variants (SNVs).
Up until now, most of the algorithms that detect these kinds of changes work with the number of reads showing one nucleotide or another at a given position. For example, if you have 100 reads covering the same position, half of them show an A, the other half show a T, and the reference has an A, it is pretty likely that a mutation occurred here and created a T; there may be two copies of this position, one carrying the A and the other the T. But what if you do not have 100 reads, but only 5? It is hard to evaluate this kind of analysis with a low number of reads, because such events (like seeing a T instead of an A) may also arise through errors in amplification or sequencing.
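To make the 100-reads-versus-5-reads argument concrete, here is a minimal sketch that asks how likely it is to see a given number of mismatching reads through errors alone. It assumes that errors hit each read independently with a fixed per-base error rate; the 1% used here is purely an illustrative assumption, not a measured value, and the function name is made up for this example.

```python
from math import comb


def prob_at_least_k_errors(n_reads: int, k_alt: int, error_rate: float) -> float:
    """Probability that k_alt or more of n_reads show a mismatch at one
    position through errors alone, i.e. P(X >= k_alt) for
    X ~ Binomial(n_reads, error_rate)."""
    return sum(
        comb(n_reads, i) * error_rate**i * (1 - error_rate) ** (n_reads - i)
        for i in range(k_alt, n_reads + 1)
    )


ERROR_RATE = 0.01  # assumed per-base error rate, for illustration only

# 100 reads, 50 of them showing a T instead of the reference A
p_deep = prob_at_least_k_errors(100, 50, ERROR_RATE)
# 5 reads, a single one showing a T
p_shallow = prob_at_least_k_errors(5, 1, ERROR_RATE)

print(f"100 reads, at least 50 mismatches by error alone: {p_deep:.3g}")
print(f"  5 reads, at least  1 mismatch  by error alone: {p_shallow:.3g}")
```

Under these assumptions, 50 mismatches out of 100 reads are practically impossible to explain by errors, while a single mismatch among 5 reads happens by chance in roughly one position out of twenty. Real variant callers use more elaborate models, so this is only meant to show the direction of the effect.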
Therefore, one wants a maximal number of reads in an experiment. But a high number of reads is worth little if they align to only 0.01% of the reference genome. In that small region you can of course do such calculations, but all other regions are left out.
So, to be sure that your reads and your alignment are suitable for further analysis, you should always check the evenness of coverage.
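The difference between "many reads" and "reads spread over the genome" can be shown with two simple numbers: the mean depth and the fraction of positions covered at all. The sketch below assumes you already have a list with the read depth at every reference position; the function names and the toy data are hypothetical.

```python
def mean_depth(depths):
    """Average number of reads covering a reference position."""
    return sum(depths) / len(depths)


def breadth_of_coverage(depths, min_depth=1):
    """Fraction of reference positions covered by at least min_depth reads."""
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Toy reference with 10 positions (hypothetical data).
even_depths = [10] * 10          # reads spread evenly
piled_depths = [100] + [0] * 9   # same number of reads, all in one place

for name, depths in (("even", even_depths), ("piled up", piled_depths)):
    print(name, mean_depth(depths), breadth_of_coverage(depths))
```

Both toy alignments have the same mean depth, but in the second one only a tenth of the reference is covered, so the mean depth alone says nothing about whether the rest of the genome can be analysed.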
Good Evenness vs. bad Evenness – examples
To give you an example of what good and bad evenness look like, here are some pictures. The first one shows uneven coverage from an experiment we did here at SYGNIS to evaluate other amplification systems. This sample was amplified with a simple MDA (multiple displacement amplification) method (you can find more information in our Nature Communications paper about TruePrime).
This second image shows the same chromosome amplified with TruePrime. It is clearly visible that the ups and downs are less pronounced: more a wavy landscape than a mountainous one.
How can I check this?
The interesting question is how to check for this evenness. The easiest way is of course to look at the alignment in the form of pictures. For this you will need software that can display the alignment visually. Two things make this harder: the large number of reads to display (which is why a lot of tools group reads into bins and display only those bins) and the potentially great length of the reference genome (the human genome has approx. 3 200 000 000 nucleotides, so you would have to scroll a lot on your screen; a small viral genome is easier to cover).
The second easiest way is to check for aberrations at the chromosome level (if your organism has chromosomes, of course). With the reference genome, the number of nucleotides in each chromosome is known, and you can check whether the percentage of reads aligning to each chromosome matches that chromosome's share of the genome length. If it does not, you know that reads align to certain chromosomes more often than expected. You can even test this statistically to get an idea of whether the difference is significant or not.
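The binning that viewers use can also be done by hand, which gives you numbers instead of a picture. The following sketch counts reads per fixed-size window and summarises the spread with the coefficient of variation (standard deviation divided by mean). Using the coefficient of variation is our assumption here as one possible summary, not an established standard; the function names, the 0-based read start positions and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def reads_per_window(read_starts, reference_length, window_size):
    """Count reads per window, assigning each read by its 0-based start."""
    n_windows = -(-reference_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window_size] += 1
    return counts


def coefficient_of_variation(counts):
    """Standard deviation relative to the mean; 0 means perfectly even."""
    m = mean(counts)
    return pstdev(counts) / m if m else float("inf")


# Toy reference of 1000 nucleotides, 100 reads each (hypothetical data).
even_starts = list(range(0, 1000, 10))              # one read every 10 nt
skewed_starts = [5] * 91 + list(range(150, 1000, 100))  # most reads pile up

even_counts = reads_per_window(even_starts, 1000, 100)
skewed_counts = reads_per_window(skewed_starts, 1000, 100)

print(even_counts, coefficient_of_variation(even_counts))
print(skewed_counts, coefficient_of_variation(skewed_counts))
```

The lower the value, the flatter the landscape. The window size is a free choice: small windows show more detail but become noisy when there are few reads.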
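The chromosome comparison can be sketched as a chi-square goodness-of-fit calculation: the expected read count per chromosome is proportional to its length, and the statistic measures how far the observed counts are from that. The chromosome names, lengths and read counts below are invented for illustration, and the sketch assumes that reads would be distributed strictly in proportion to chromosome length, which ignores effects such as repetitive regions or differing copy numbers.

```python
def expected_read_counts(chrom_lengths, total_reads):
    """Expected reads per chromosome if reads were distributed in
    proportion to chromosome length."""
    genome_length = sum(chrom_lengths.values())
    return {
        chrom: total_reads * length / genome_length
        for chrom, length in chrom_lengths.items()
    }


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all chromosomes."""
    return sum(
        (observed[chrom] - expected[chrom]) ** 2 / expected[chrom]
        for chrom in expected
    )


# Hypothetical organism with three chromosomes.
chrom_lengths = {"chrA": 5_000_000, "chrB": 3_000_000, "chrC": 2_000_000}
observed = {"chrA": 5_600, "chrB": 2_700, "chrC": 1_700}

expected = expected_read_counts(chrom_lengths, sum(observed.values()))
statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(chrom_lengths) - 1

print(expected)
print(statistic, degrees_of_freedom)
```

The statistic is then compared with a chi-square distribution with the given degrees of freedom to obtain a p-value. Keep in mind that with millions of reads even tiny deviations become statistically significant, so it is worth looking at the size of the difference as well.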
If you have any questions about the science behind these tips, or any other questions about contamination in sequencing, please contact us. If you have any other suggestions for blog post topics, please let us know.