Read composition – Quality and Nucleotides

The intention of this post is to give a short introduction to read composition, which is an important aspect of next-generation sequencing (NGS) analysis.

Quality values – what are they and why do I need them?

Reads received from e.g. Illumina's sequencing machines contain not only the called nucleotides, but also a quality value for every position in the read. These quality values describe how certain the base call at that position is. For Illumina reads, this certainty is expressed as the probability that the base call at this position is incorrect. The value is encoded as a single letter so that it is easier to process and takes less space (it is then called a Phred quality score, a scale originally developed for the Phred base-calling program), but it can be converted back to get an impression of this certainty. The exact conversion depends on the machine and pipeline that were used, because the encoding offsets differ slightly, but basically you take the ASCII number of the letter, subtract the encoding offset (33 for current Illumina and Sanger-style files, 64 for some older Illumina pipelines) to get the quality value Q, and then calculate 10 to the power of the negative of Q divided by 10: P = 10^(-Q/10).

So, if you see a quality value of 30, the nucleotide at this position is expected to be incorrect in only 0.1% of cases (1 in 1,000).
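The conversion described above can be sketched in a few lines of Python. This is only a minimal illustration, not the method of any particular tool: the function names are made up for this post, and the default offset of 33 assumes the Sanger / Illumina 1.8+ encoding (older Illumina pipelines would need offset=64).

```python
def phred_score(quality_char, offset=33):
    """Return the Phred quality value Q for one FASTQ quality character.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return ord(quality_char) - offset


def error_probability(quality_char, offset=33):
    """Return the probability that the base call is wrong: P = 10^(-Q/10)."""
    q = phred_score(quality_char, offset)
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Hypothetical quality string, as found on the fourth line of a FASTQ record.
    quality_string = "II?5!"
    for char in quality_string:
        q = phred_score(char)
        p = error_probability(char)
        print(f"{char!r}: Q={q}, P(error)={p:.5f}")
```

With the offset of 33, the character "?" corresponds to Q = 30 and therefore to an error probability of about 0.001.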

These scores are needed to judge how much you can trust a given nucleotide. If the score is low, there is a good chance that the given nucleotide is wrong. It is well known that the quality tends to be far lower at the end of reads than at the beginning; the reason for this lies in the sequencing process itself.
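To see whether the quality really drops towards the end of your reads, you can average the quality values per position. The following is a rough sketch under a few assumptions: the quality strings have already been extracted from the FASTQ file, the encoding offset is 33, and the function name is purely illustrative.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    Reads may differ in length; each position is averaged over the
    reads that are long enough to cover it.
    """
    totals = []  # sum of Phred scores seen at each position
    counts = []  # number of reads covering each position
    for quality_string in quality_strings:
        for position, char in enumerate(quality_string):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


if __name__ == "__main__":
    # Hypothetical quality strings from two reads of different length.
    example = ["IIII?", "II?5"]
    for position, mean_q in enumerate(mean_quality_per_position(example), start=1):
        print(f"position {position}: mean Q = {mean_q:.1f}")
```

A falling curve towards the last positions is the typical pattern; where to trim is a judgment call that depends on the downstream analysis.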

(Sadly, this principle is not as simple as it sounds, because sometimes even wrong nucleotides are associated with a high Phred score; see M. Schirmer, U. Z. Ijaz, R. D'Amore, N. Hall, W. T. Sloan and C. Quince, "Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform," Nucleic Acids Research, vol. 43, p. e37, Jan. 2015.)

Nucleotide composition – what letters can tell you

Now you know what these quality scores mean, but you can learn even more from the reads alone. Besides a GC analysis, you can also analyse the composition of all four nucleotides. This analysis may tell you whether there are any irregularities in your reads. A completely random library would show about 25% of each base at each position, but natural or artificially produced sequences are seldom random, so the percentages or total counts at each position will instead reflect the mean frequency of each nucleotide in the data, if everything went well of course. Sometimes some positions deviate from this “ideal” value. One common reason is the priming process in library preparation, which favours some random hexamers over others. This is often seen at the first five to ten positions, which fluctuate more than the rest of the sequence positions. Sequencing adapters that are trimmed very strictly, so that some real sequence is removed along with them, can also result in a shift at some positions. Adapter dimers are a further reason for deviating values. These dimers can be created during the preparation steps and will most likely show a different composition than the sequences that were meant to be analysed.
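A per-position nucleotide composition like the one described above can be computed with a short script. This is a minimal sketch rather than a replacement for a dedicated QC tool: the function name is hypothetical, the reads are assumed to be plain strings already parsed from the FASTQ file, and any character other than A, C, G, T or N is only counted towards the total.

```python
from collections import Counter


def base_composition_per_position(reads):
    """Return, for each read position, the percentage of A, C, G, T and N."""
    counters = []  # one Counter of observed bases per position
    for read in reads:
        for position, base in enumerate(read.upper()):
            if position == len(counters):
                counters.append(Counter())
            counters[position][base] += 1

    composition = []
    for counter in counters:
        total = sum(counter.values())
        composition.append(
            {base: 100.0 * counter[base] / total for base in "ACGTN"}
        )
    return composition


if __name__ == "__main__":
    # Hypothetical reads; real data would contain many thousands of them.
    example_reads = ["ACGT", "AAGT", "ACGA", "TCG"]
    for position, percentages in enumerate(
        base_composition_per_position(example_reads), start=1
    ):
        row = "  ".join(f"{base}={value:5.1f}%" for base, value in percentages.items())
        print(f"position {position}: {row}")
```

In real data, strong fluctuations limited to the first positions point towards the priming bias mentioned above, while a composition that shifts over a longer stretch may hint at adapter sequence or adapter dimers.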

All in all, this analysis shows you whether there are any peculiarities that should be kept in mind for further analysis.

If you have any questions about the science behind these tips, or any other questions about contamination in sequencing, please contact us. If you have any other suggestions for blog post topics, please let us know.