I work with next-generation sequencing on a daily basis and therefore interpret many coverage analysis reports to judge the quality of sequencing runs. I use Ion Torrent technology and targeted sequencing.

A coverage analysis report consists of:

  • Mapped reads (in millions) - the number of reads that were mapped to the reference genome.
  • Mean depth - a summary statistic for reads that are assigned to specific amplicons.
  • On target (%) - the percentage of mapped reads that align to any targeted region of the reference, as defined in the target regions file.
  • Uniformity (%) - the percentage of bases in targeted regions that are covered at a depth of at least 20% of the mean depth.
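To make the uniformity definition concrete, here is a minimal sketch of how it could be computed from per-base depths. The function name `uniformity`, the `fraction` parameter, and the example depths are hypothetical and only illustrate the definition; they are not taken from the coverage analysis software.

```python
def uniformity(depths, fraction=0.2):
    """Percentage of target bases covered at >= `fraction` of the mean depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    mean_depth = sum(depths) / len(depths)
    threshold = fraction * mean_depth
    # Count bases whose depth reaches the threshold.
    covered = sum(1 for d in depths if d >= threshold)
    return 100.0 * covered / len(depths)


# Hypothetical per-base depths for ten target bases (illustration only).
example_depths = [1000, 1200, 900, 50, 1100, 950, 1000, 100, 1050, 1150]
print(uniformity(example_depths))
```

In this made-up example the mean depth is 850, so the threshold is 170, and the two bases with depths 50 and 100 fall below it.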

The aim is to create a single number that gives a quick interpretation in the form of a score, so that we can efficiently determine whether a sequencing run is of sufficient quality for downstream analysis.

The standard parameters for an accepted sequencing run in our lab are:

  • Mapped reads: 5000000
  • Mean depth: 1000
  • On target: 80%
  • Uniformity: 80%

However, coverage reports can vary a lot, so a single score would be ideal for the assessment.

The equation so far:

[image: SeqScore equation]

With the above-mentioned parameters, the equation gives a SeqScore of 0.090. This means that a sequencing run with a SeqScore > 0.090 would be of bad quality and a run with a SeqScore ≤ 0.090 would be accepted.

Examples:

Sequencing 1.

  • Mapped reads: 6902500
  • Mean depth: 850
  • On target: 70%
  • Uniformity: 81%

SeqScore = 0.098 (Bad)

Sequencing 2.

  • Mapped reads: 4000000
  • Mean depth: 1100
  • On target: 75%
  • Uniformity: 87%

SeqScore = 0.082 (Good)
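For readers who want to experiment with the numbers, here is a rough Python sketch. The equation itself is only available as an image, so the form used below is an assumption: the square root of the base-10 log of on-target reads, divided by the square root of mean depth times uniformity (as a fraction). It was chosen because it fits the description given in the answer below and comes close to the three reported values; the name `seq_score` and its parameters are hypothetical.

```python
import math


def seq_score(mapped_reads, mean_depth, on_target, uniformity):
    """ASSUMED form of the SeqScore; the original formula is not available as text.

    on_target and uniformity are fractions (0.80 for 80%).
    """
    on_target_reads = mapped_reads * on_target
    return math.sqrt(math.log10(on_target_reads) / (mean_depth * uniformity))


runs = {
    "standard": dict(mapped_reads=5_000_000, mean_depth=1000, on_target=0.80, uniformity=0.80),
    "sequencing_1": dict(mapped_reads=6_902_500, mean_depth=850, on_target=0.70, uniformity=0.81),
    "sequencing_2": dict(mapped_reads=4_000_000, mean_depth=1100, on_target=0.75, uniformity=0.87),
}

for name, metrics in runs.items():
    print(name, round(seq_score(**metrics), 4))
```

Under this assumed form the three runs come out at roughly 0.091, 0.099 and 0.082, close to the reported 0.090, 0.098 and 0.082. The small differences may be due to truncation, or to the real equation differing from this guess.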

I'm not sure whether this is a valid way of creating a score. Constructive criticism and input to improve this score are very welcome.

Thank you for your time.

EdM

1 Answer

In principle you can construct any score that you want if it matches the needs of your application. The form of your score, however, seems to make little sense in your context of nucleic acid sequencing. High values of both the numerator and of the denominator would seem to be good, so it's not clear why you then take their ratio for a combined score, with a high score being bad.

The basis of the numerator of your score makes a lot of sense: multiplying the total mapped reads by the fraction of on-target reads. That tells you how many on-target reads you have to work with. A large value of the numerator thus is a good thing. (Why you then take the square root of the log of that value isn't clear.)
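As a quick illustration of that product, here is the on-target read count for the reference parameters and the two example runs from the question. This is just the arithmetic; the helper name `on_target_reads` is made up for the sketch.

```python
def on_target_reads(mapped_reads, on_target_pct):
    """Number of mapped reads that fall on target (integer percent input)."""
    return mapped_reads * on_target_pct // 100


examples = {
    "standard": (5_000_000, 80),
    "sequencing_1": (6_902_500, 70),
    "sequencing_2": (4_000_000, 75),
}

for name, (reads, pct) in examples.items():
    print(f"{name}: {on_target_reads(reads, pct):,} on-target reads")
```

Sequencing 1 has the most on-target reads of the three (4,831,750 versus 4,000,000 and 3,000,000), yet it is the one the score labels as bad.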

In the denominator, both high uniformity and high mean depth would seem to be good things. So why divide one good thing (the number of on-target reads) by another good thing (high uniformity and mean depth) to get a ratio for which a high value is supposed to be bad?

I think that trying to provide a single combined score in this context is mistaken. How are you going to apply the score? Are you just going to throw out a whole run if your single combined score falls just on the wrong side of some arbitrary threshold? That doesn't seem to be a very wise choice, as there might nevertheless be a good deal of useful information in the run. I'd recommend looking at the parts of the quality control more closely and individually.
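One way to look at the parts individually is to check each metric against its own threshold and report which ones fail. The sketch below uses the standard parameters from the question and treats each as a minimum, which is an assumption on my part; the names `THRESHOLDS`, `check_run` and `failed_metrics` are hypothetical.

```python
# Standard parameters from the question, treated here as minimum
# acceptable values (an assumption of this sketch).
THRESHOLDS = {
    "mapped_reads": 5_000_000,
    "mean_depth": 1000,
    "on_target_pct": 80,
    "uniformity_pct": 80,
}


def check_run(metrics, thresholds=THRESHOLDS):
    """Return a pass/fail flag for each metric separately."""
    return {name: metrics[name] >= minimum for name, minimum in thresholds.items()}


def failed_metrics(metrics, thresholds=THRESHOLDS):
    """List the metrics that fall below their threshold."""
    return [name for name, ok in check_run(metrics, thresholds).items() if not ok]


sequencing_1 = {"mapped_reads": 6_902_500, "mean_depth": 850,
                "on_target_pct": 70, "uniformity_pct": 81}
sequencing_2 = {"mapped_reads": 4_000_000, "mean_depth": 1100,
                "on_target_pct": 75, "uniformity_pct": 87}

print("Sequencing 1 fails:", failed_metrics(sequencing_1))
print("Sequencing 2 fails:", failed_metrics(sequencing_2))
```

With these thresholds both example runs miss two of the four criteria: Sequencing 1 on mean depth and on-target percentage, Sequencing 2 on mapped reads and on-target percentage. A per-metric report shows why a run is borderline, which a single score hides.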

EdM