
In genome assembly (the problem of reconstructing a genome from many short, random pieces of it, where a genome is one or more long strings over a limited alphabet), there is a metric called 'N50' for a set of strings (pieces of DNA sequence). The N50 length is defined as 'the length of the string for which 50% of all characters are in strings of that length or longer'. See, for example, http://seqanswers.com/forums/showthread.php?t=2332.
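
To make the definition concrete, here is a minimal Python sketch of how N50 could be computed from a list of string lengths. The function name `n50` and the example lengths are my own illustration, and the choice to count a string that lands exactly on the 50% mark as reaching it is an assumption; assembly tools may break that tie differently.

```python
def n50(lengths):
    """Return the N50 of a collection of string lengths (illustrative sketch)."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest string to the shortest, accumulating characters.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first length where at least half of all characters
        # sit in strings of this length or longer (assumed tie convention).
        if 2 * running >= total:
            return length


# Hypothetical example: nine strings with lengths 2..10 (54 characters in total).
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))  # 8, because 10 + 9 + 8 = 27 is half of 54
```

For comparison, the ordinary (unweighted) median of these nine lengths is 6.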

I have sometimes seen this metric referred to as the 'length weighted median'. Is this correct?
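
One way to probe this numerically is to compute both quantities side by side. The sketch below assumes that 'length weighted median' means a weighted median of the lengths in which each string is weighted by its own length, and it uses the "lower" weighted median convention (accumulating from the shortest string upward). Both function names are hypothetical, and other tie conventions exist.

```python
def n50(lengths):
    """N50: accumulate from the longest string until half of all characters are covered."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def lower_weighted_median(values, weights):
    """Smallest value whose cumulative weight, in ascending order, reaches half the total."""
    total = sum(weights)
    running = 0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if 2 * running >= total:
            return value


# No exact 50/50 split of characters: both measures give the same answer.
a = [2, 3, 4, 5, 6, 7, 8, 9, 11]
print(n50(a), lower_weighted_median(a, a))  # 8 8

# Exact 50/50 split (27 of 54 characters on each side): the conventions diverge.
b = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(b), lower_weighted_median(b, b))  # 8 7
```

In this sketch the two values coincide unless the characters split exactly in half between shorter and longer strings, in which case the result depends on which tie-breaking convention is chosen.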

  • The definition you give above is not equivalent to that given in the site you link to, or [in Wikipedia](http://en.wikipedia.org/wiki/N50_statistic), which both define it as "the length N for which half of all *bases* in the sequences are in a sequence of length L < N", which, translated from bases in sequences to characters in strings, would be "the length of the string for which 50% of all *characters* are in strings of that length or longer". – onestop Jan 24 '12 at 13:11
  • You are correct, sorry. Duly edited. – Lex Nederbragt Jan 24 '12 at 14:56
  • Somewhat related: http://stats.stackexchange.com/questions/137931/when-would-we-use-tantiles-and-the-medial-rather-than-quantiles-and-the-median/142384#142384 – kjetil b halvorsen Apr 02 '17 at 19:28
