
I have a large dataset of values that range from 0 to n. I am interpreting the values as probabilities for a later pseudo-random selection process. To make the values serve as probabilities, I normalize the entire dataset to the range [0.0, 1.0] by dividing every number by n. Values are essentially random and could look like {0.0156, 0.259, 0.0844, 0.904, ...}.

After this, the dataset mean is not what I need it to be. (The end user will be specifying the desired mean). I need to transform (or dilate) all values so that the mean of the transformed dataset equals the desired mean, but the range constraint is unchanged. How can I do this?

Note: my question here is similar to "How to simulate data that satisfy specific constraints such as having specific mean and standard deviation?", but the answers to that question do not constrain the range.

Edit

I have come up with a brute-force iterative guessing approach that gets the transformed mean to within a tolerance of the target mean, but it will be slow. So now my question really is: is there a closed-form solution that hits the target mean exactly?

philologon
  • You're finding this difficult to achieve because it's not a natural thing to do. Can you clarify what property of the initial dataset you are trying to preserve (and why)? Let's say, the initial dataset has two values, 0.2 and 0.3. Do the transformed values need to have the same difference? Ratio? Can you discard them and just simulate two new values after the user specifies the parameter? – juod Jul 08 '20 at 05:15
  • @juod, thanks for your response. There are three properties to maintain. 1. No value can go beyond the range. 2. Rank order must be preserved. 3. After the transformation, the mean of the dataset must be the specified value. If I use brute force, then the mean of the transformed dataset must be approximately equal to the target value (within a tolerance). – philologon Jul 08 '20 at 05:22
  • What I am doing is essentially the same thing as algorithmic histogram transformation using an equation or a piecewise function. My brute force algorithm uses a piecewise function split at the starting mean. The left side scales from 0. The right side scales the ones-complement from 1.0. But as I have it now, it converges very slowly. – philologon Jul 08 '20 at 05:26
  • There are infinitely many closed form solutions. Among them is the one that changes every one of your numbers into the desired mean. The fact is that literally *any* dataset of numbers in the interval $[0,1]$ of the same size with the desired mean can be the result of this transformation. The problem with this question is that it doesn't offer enough constraints or context even to provide reasoned advice. – whuber Jul 08 '20 at 20:23
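
To make the "many closed form solutions" point concrete, here is one possible sketch that satisfies the three stated properties (range, rank order, exact mean). It is not taken from any of the comments: the function name `rescale_to_mean` and the choice of a linear rescaling toward 0 or toward 1 are assumptions made purely for illustration, and it is only one of the infinitely many options.

```python
def rescale_to_mean(values, target):
    """Illustrative only: linearly shrink toward 0 or toward 1 to hit `target`.

    Assumes every value lies in [0, 1] and 0 < target < 1.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target mean must lie strictly between 0 and 1")
    current = sum(values) / len(values)
    if target <= current:
        # Shrink toward 0: x' = k * x with k = target / current <= 1.
        k = target / current
        return [k * v for v in values]
    # Shrink toward 1: 1 - x' = k * (1 - x) with k = (1 - target) / (1 - current) < 1.
    k = (1.0 - target) / (1.0 - current)
    return [1.0 - k * (1.0 - v) for v in values]


data = [0.0156, 0.259, 0.0844, 0.904]
raised = rescale_to_mean(data, 0.6)   # target above the current mean
lowered = rescale_to_mean(data, 0.1)  # target below the current mean
```

Both branches are increasing linear maps with a scale factor in (0, 1], so order is kept and no value leaves [0, 1]. Whether this particular shape of transformation is appropriate depends on what else about the data should be preserved.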

1 Answer


If the only property of the initial dataset that needs to be preserved is the rank order, then a variety of transformations are possible. Here's the simplest one I can think of:
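Let $m$ be the mean specified by the user, and $x$ be the initial data ($n$ points; note that $n$ here is the number of points, not the upper bound from the question). For $i=1,\dots,n$, define the new values as $$ x_i' = \frac{2m(i-0.5)}{n} $$ and map the $i$-th smallest $x$ to $x_i'$. Done: the new mean is $$\frac{1}{n}\sum_{i=1}^n \frac{2m(i-0.5)}{n} = \frac{2m}{n^2} \sum_{i=1}^n (i-0.5) = \frac{2m}{n^2}\cdot\frac{n^2}{2} = m$$ Since every $x_i'$ is below $2m$, the new values are guaranteed to stay inside $[0,1]$ when $m \le 0.5$.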
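
The rank mapping above can be sketched in a few lines of plain Python. The function name `rank_to_uniform` is made up for this example, and two details are assumptions not covered in the answer: ties are broken by original position (so tied inputs receive different outputs), and the function rejects $m > 0.5$ as a simple sufficient guard for the range constraint.

```python
def rank_to_uniform(values, m):
    """Map the i-th smallest value to 2*m*(i - 0.5)/n, with i starting at 1.

    Assumption: 0 <= m <= 0.5, which keeps every output below 2*m <= 1.
    """
    if not 0.0 <= m <= 0.5:
        raise ValueError("this sketch only handles 0 <= m <= 0.5")
    n = len(values)
    # Indices of the data sorted by value; sorted() is stable, so ties
    # keep their original relative order (an assumption of this sketch).
    order = sorted(range(n), key=lambda k: values[k])
    out = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        out[idx] = 2.0 * m * (rank - 0.5) / n
    return out


data = [0.0156, 0.259, 0.0844, 0.904]
new = rank_to_uniform(data, 0.3)
new_mean = sum(new) / len(new)  # equals 0.3 up to floating-point rounding
```

Note that the output depends only on the ranks of the input, not on the input values themselves, which is exactly the property discussed in the comments below.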

In fact, this is a special case of the inverse rank transform, which allows you to map values between a pair of distributions. I just chose a discrete uniform on $[0, 2m]$ as the target, but you can choose any other distribution bounded on $[0, 1]$ and parametrized by its mean.

More generally, this latter requirement (controlling the distribution by its mean) makes little sense for bounded distributions, because you can't easily shift them left and right, and hence you might end up with undesired effects such as the truncation at $2m$ above. It might be more natural to use another distribution, such as the beta distribution, and have the user control its parameters rather than the mean.
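
As a rough sketch of the inverse rank transform with a different target, the example below uses the special case Beta$(a, 1)$, chosen here only because its quantile function $u^{1/a}$ and its mean $a/(a+1)$ are available in closed form without extra libraries. The names `inverse_rank_transform` and `power_quantile`, the plotting positions $(i-0.5)/n$, and the choice of Beta$(a, 1)$ are all assumptions for illustration. Unlike the uniform case, the resulting sample mean only approximates $m$; the approximation improves as $n$ grows.

```python
def power_quantile(m):
    """Quantile function of Beta(a, 1), with a chosen so the mean is m.

    Beta(a, 1) has CDF x**a on [0, 1] and mean a / (a + 1), so a = m / (1 - m).
    """
    if not 0.0 < m < 1.0:
        raise ValueError("m must lie strictly between 0 and 1")
    a = m / (1.0 - m)
    return lambda u: u ** (1.0 / a)


def inverse_rank_transform(values, quantile):
    """Replace the i-th smallest value by quantile((i - 0.5) / n)."""
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])  # ties: original order
    out = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        out[idx] = quantile((rank - 0.5) / n)
    return out


# Hypothetical data: a shuffled grid of 1000 distinct values in [0, 1).
data = [((i * 37) % 1000) / 1000 for i in range(1000)]
new = inverse_rank_transform(data, power_quantile(0.7))
new_mean = sum(new) / len(new)  # close to 0.7, but not exactly 0.7
```

Any other target with a known quantile function could be passed in place of `power_quantile(m)`; a general two-parameter beta would need a numerical quantile routine.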

juod
  • Why is $x_i$ not on the right side of the equals sign in the first equation? It looks like this is making up new data, but that is not what I am seeking. I need to transform each $x_i$ to a new $x_i'$. – philologon Jul 08 '20 at 16:51
  • Because you only want to preserve the rank of the old data, there is no need to use any other properties except that. The smallest $x$ ($x_1$) will remain the smallest in this procedure. If you want to preserve any other properties of the original data, a different approach may be needed, or a solution may not be possible at all. – juod Jul 08 '20 at 22:37
  • More fundamentally, there is no difference between a transformation and "making up new data" and then mapping old data to it. You could say that `x' = round(x) + 1` is a transformation, but conceptually it is the same as generating a bunch of integers $1,\dots,n$ and mapping $x$ to the one that solves the above equation. I chose the second way for ease of notation, but you can also define the inverse rank transform from distribution $F_A$ to $F_B$ as $x' = F_B^{-1}(F_A(x))$. – juod Jul 08 '20 at 22:45