Example situation

After each exam, the professor provides the following information.

  • Minimum Score
  • [Arithmetic] Mean
  • Median
  • Maximum Score
  • Standard Deviation

I also know what my score was as well as how many students took the exam.

Restating question in light of example

Is there a known way in Python to take this information into account and produce a dataset whose calculated mean, median, minimum, maximum, and standard deviation exactly match the given values, AND that includes my score?

numpy.random.normal() doesn't give me what I want

I know that I can use numpy.random.normal to generate random data that tends toward a given distribution, e.g., numpy.random.normal(loc=mean_of_scores, scale=sigma_of_scores, size=num_of_scores), but the result only tends toward those statistical parameters. It also doesn't take into account the known pieces of information (my score, the median, the minimum score, and the maximum score). Adding my score, the minimum score, and the maximum score afterward would warp the randomly generated data even further from the known population numbers.
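To make the "only tends toward" point concrete, here is a minimal sketch. It uses the standard library's random.gauss as a stand-in for numpy.random.normal so that it runs without extra packages, and the target figures are made up for illustration, not taken from a real exam.

```python
import random
import statistics

# Hypothetical published figures, made up for illustration only.
target_mean, target_stdev, num_scores = 70.0, 10.0, 30

rng = random.Random(0)  # fixed seed so the run is repeatable
scores = [rng.gauss(target_mean, target_stdev) for _ in range(num_scores)]

# Statistics of the generated sample (population stdev, i.e. dividing by N).
sample_mean = statistics.fmean(scores)
sample_stdev = statistics.pstdev(scores)

print(f"mean  target={target_mean}  sample={sample_mean:.3f}")
print(f"stdev target={target_stdev}  sample={sample_stdev:.3f}")
```

The printed sample values land near the targets but not on them, and nothing forces the minimum, maximum, or median to match either.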

Other use-cases

While my example is specific to a college course, I imagine this problem is also faced by any Python developer who works with datasets that they're only allowed to know the statistics of (for privacy reasons). For quality testing, I imagine these developers would want a way to create statistically-indistinguishable data.

Nik
  • This seems like you really have a math or statistics question, rather than a programming question. – Karl Knechtel Apr 09 '21 at 01:29
  • I think it's definitely at the intersection of the two, Karl, but a math solution wouldn't be very helpful without an accompanying explanation in how to implement it in Python. Alternatively, if someone said "you can use this library to do what you want," and didn't explain the underlying mathematics, the answer would be acceptable to me. –  Apr 09 '21 at 01:48
  • Do you seek to do either of the following? Generate a random variate from a distribution with the same mean, median, minimum, maximum, and standard deviation, or calculate the distribution function of that distribution? See also: https://stackoverflow.com/questions/61433438 . – Peter O. Apr 09 '21 at 06:06
  • @PeterO., it doesn't need to be random. For example, if I was given "Mean = 3, Min = 2, Max = 5, Population size = 3, Median = 2, St. Dev = 1.414, Example sample = 2", {2, 2, 5} would be an acceptable output. Any discrete set of elements that, if tested, would yield the same values for the provided statistical tests would suffice. – Nik Apr 09 '21 at 06:33
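The {2, 2, 5} example in the last comment can be checked with the standard library. This is only a sketch: summarize is a hypothetical helper name, and it assumes the quoted standard deviation is the population one (dividing by N), which is what 1.414 corresponds to here.

```python
import statistics

def summarize(data):
    """Return the summary statistics the professor publishes, for `data`."""
    return {
        "n": len(data),
        "min": min(data),
        "max": max(data),
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "pstdev": statistics.pstdev(data),  # population standard deviation
    }

summary = summarize([2, 2, 5])
print(summary)
```

Any candidate dataset can be run through the same helper and compared field by field against the published figures.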

1 Answer

Yes, there is. You have four data points out of N: the low, median, and high scores, as well as your own. You have the equations for mean and stdev. You need to generate more scores until you are down to those two equations and only two unknowns (the missing scores).

You already have 4 scores; you need N-2 in hand before solving for the last two, so generate N-6 more however you like, within certain limits:

  • Keep an eye on the over-under balance; if you get too many above (or below) the median and mean, you'll need to compensate by replacing those values.
  • Keep an eye on the spread: your variance is a "budget" on that spread. If you exceed that budget, you'll have to replace a few outliers.
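Both checks can be sketched in a few lines. The helper names below (balance, spread_used, spread_budget) are hypothetical, and the budget assumes the published figure is a population standard deviation, so the total squared deviation from the mean must end up equal to N times the variance.

```python
import math

def balance(values, median):
    """Count how many values fall strictly below and strictly above the median."""
    below = sum(1 for x in values if x < median)
    above = sum(1 for x in values if x > median)
    return below, above

def spread_used(values, mean):
    """Squared deviation from the target mean consumed by the values so far."""
    return sum((x - mean) ** 2 for x in values)

def spread_budget(n, stdev):
    """Total squared deviation allowed for n values (population stdev assumed)."""
    return n * stdev ** 2

# Numbers from the {2, 2, 5} example: N = 3, mean 3, median 2, stdev sqrt(2).
chosen = [2, 5]  # minimum and maximum are fixed from the start
print(balance(chosen, 2))                                      # (0, 1)
print(spread_used(chosen, 3), spread_budget(3, math.sqrt(2)))  # 5 of about 6 used
```

If spread_used ever exceeds spread_budget, no choice of the remaining scores can bring the variance back down, so some outliers have to be replaced.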

With a little testing and replacement, you should get to a set of N-2 values that have the same median, close to the same mean, and a variance that's somewhat below the needed figure. At this point there is a straightforward, if tedious, way to hit both the desired mean and the desired variance (and hence stdev) exactly.

From here, it's just a matter of doing the algebra and coding the expressions to compute the two missing scores.
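One way that algebra can be worked out, as a sketch rather than the only possible formulation: call the missing scores a and b. The mean fixes a + b, and the variance fixes a² + b², so (a − b)² = 2(a² + b²) − (a + b)² and the two values follow from a square root. The function name solve_last_two is hypothetical, and the code assumes a population standard deviation (dividing by N).

```python
import math

def solve_last_two(known, n, mean, stdev):
    """Return (a, b) with a <= b that complete `known` to n values having the
    given mean and population standard deviation."""
    if len(known) != n - 2:
        raise ValueError("expected exactly n - 2 known values")
    s = n * mean - sum(known)                                  # a + b
    q = n * (stdev ** 2 + mean ** 2) - sum(x * x for x in known)  # a^2 + b^2
    disc = 2 * q - s * s                                       # (a - b)^2
    if disc < 0:
        raise ValueError("no real solution: remaining variance budget is too small")
    root = math.sqrt(disc)
    return (s - root) / 2, (s + root) / 2

# The {2, 2, 5} example: N = 3, mean 3, stdev sqrt(2), one known score of 2.
a, b = solve_last_two([2], n=3, mean=3, stdev=math.sqrt(2))
print(round(a, 6), round(b, 6))  # 2.0 5.0
```

The result still has to be checked: both values must lie between the minimum and maximum, must not shift the median, and may need to be whole numbers if the scores are. If any check fails, adjust the generated scores and solve again.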


Do take note that your phrase "statistically identical data" is not achievable in general, without duplicating the original data set. There are many more statistics one can apply, enough that in the extreme case, there is no substitute mathematically possible. A set of N scores can always be uniquely specified by judicious choice of N statistics.

Prune
  • The median is not necessarily an element of the sample if the sample has an even number of members. – pavel Apr 09 '21 at 01:59
  • @pavel Yes, it does. By formal mathematical definition, the median is an element of the sample. – Prune Apr 09 '21 at 05:08
  • [1, 2, 3, 4, 5, 6] has a median of 3.5, which is not an element of the sample. – pavel Apr 09 '21 at 05:10
  • Again, the formal stats definition, not the colloquial "gotta cut it in half" kludge. The median of your sample is 3, not 3.5 – Prune Apr 09 '21 at 05:11
  • This is false, as it goes against the formal definition of the median: that half of the population/sample has a value greater than the median and half has a value lower. If you take 3 as the median in this case, then 2 data points have lower values and 3 have higher ones. – pavel Apr 09 '21 at 05:14
  • Numpy, R and Matlab all give 3.5 as median for this sample. – pavel Apr 09 '21 at 05:17
  • @Prune, is the "formal definition" something like "an element in a set where no more than half of the elements in the set are less than it and no more than half of the elements are greater than it"? If that's so, wouldn't 3 and 4 equally have a claim to being the median? –  Apr 09 '21 at 05:22
  • The formal definition from my mathematical and psychological statistics class is the largest value M for which at least 50% of the sample is <= M. For a set with even size, *conventional* use is the mean of the middle elements. This is a trivial point, as the approach above will work for either interpretation. Unless stated otherwise, I assume the rigorous definition. – Prune Apr 09 '21 at 05:26
  • *The formal definition from my mathematical and psychological statistics class is the largest value M for which at **least** 50% of the sample is <= M.* -- the formal definition as I know it replaces "least" with "most" and also adds that *"and at most 50% of the sample/population is >= M"* – pavel Apr 09 '21 at 05:35
  • There are many concepts and inequivalent definitions of the median (and of other quantiles) of a dataset. See https://stats.stackexchange.com/questions/24112 and https://stats.stackexchange.com/questions/134229 for some of our discussions about this. In the present instance it will suffice for the OP to adopt *some* definition of the median, but the nature of the solution will not change as a result--this is just a detail. – whuber Apr 09 '21 at 13:28