1

I have seen the min-max normalization formula in several answers (e.g. [1], [2], [3]), where data is normalized into the interval $\left[0,1 \right]$.

However, is there a method to normalize data into the interval $\left(0,1 \right)$, i.e. excluding 0 and 1?

EDIT:

My data is a sample from a uniform distribution within the range $\left[a,b \right]$. I would like to normalize it into the interval $\left(0,1 \right)$ while remaining uniformly distributed.

skoestlmeier
  • $\frac{1}{1 + \exp(-x)} \in (0,1)$ for any $x\in \mathbb{R}$. Do you have some other requirements that would exclude this? – Sycorax Dec 04 '18 at 15:47
  • Thanks @Sycorax. To clarify, I just edited my question to point out that my data sample should remain uniformly distributed. – skoestlmeier Dec 04 '18 at 16:01

3 Answers

5

A uniform distribution on $(a, b)$ is the same as a uniform distribution on $[a, b]$, since for any $X$ distributed uniformly on $[a, b]$, $P(X = a) = P(X = b) = 0$. So, just use the formulae for translating to $[0, 1]$. On the other hand, if your sample has a value equal to $a$ or $b$, then you can safely conclude that you don't actually have a continuous uniform distribution.

Kodiologist
  • I don't agree with your latter statement. Following the same logic, you could exclude any data from ever being sampled from a uniform distribution. – dedObed Dec 04 '18 at 19:47
  • @dedObed The argument works for any countable set of points, because any such set has Lebesgue measure zero, but not for uncountable sets. – Kodiologist Dec 04 '18 at 20:27
  • I agree that a uniform distribution on (a, b) is the same as a uniform on [a, b]. The claim I challenge is "if your sample has a value equal to a or b [...] you don't actually have a continuous uniform distribution." – dedObed Dec 04 '18 at 20:34
  • @dedObed I know. I'm saying that the argument works because $\{a, b\}$, the set of just the two values $a$ and $b$, is countable. It wouldn't if you used a non-null set, which is what would be required to "follow the same logic" to "exclude any data from ever being sampled from a uniform distribution". – Kodiologist Dec 04 '18 at 20:36
  • I think I've wrapped my head around this, seems that living with IEEE 754 introduces some math-brain damage :-) I'm still confused a bit (we can trivially find the ML estimate of a uniform distribution for a finite sample, can't we?), but I'll try to distill it into a proper question. – dedObed Dec 04 '18 at 21:48
  • @dedObed I guess the chief thing to keep in mind is that continuous distributions are the sort of ethereal mathematical entities you can't get in real life. Computers fake a continuous uniform distribution with a discrete distribution that covers a large number of floating-point values. It's close enough for many applied purposes, but, e.g., a random float will always be rational, whereas a random sample from a continuous uniform distribution will be almost surely irrational. – Kodiologist Dec 04 '18 at 21:58
3

The formula $x' = \frac{x - \min{x}}{\max{x} - \min{x}}$ will normalize the values into $[0,1]$.

I am not sure why you want to exclude $0$ and $1$, but one way would be to choose a new minimum and maximum for the transformed variable, e.g. $[\epsilon,1-\epsilon]$ for some small $0 < \epsilon < 1/2$. You can then transform the variable using $$x' = \epsilon + (1-2\epsilon) \cdot \left(\frac{x - \min{x}}{\max{x} - \min{x}} \right)$$
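As a rough illustration of the shrunken min-max formula, here is a minimal sketch in plain Python. The function name `minmax_open` and the default `eps=1e-6` are my own assumptions, not anything standard. Since the map is linear in $x$, it only shifts and stretches the data, so evenly spread input stays evenly spread.

```python
def minmax_open(values, eps=1e-6):
    """Rescale values linearly into [eps, 1 - eps], a subset of (0, 1).

    `minmax_open` and `eps` are hypothetical names chosen for this sketch.
    """
    if not 0 < eps < 0.5:
        raise ValueError("eps must lie strictly between 0 and 0.5")
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must not all be equal")
    # x' = eps + (1 - 2*eps) * (x - min) / (max - min)
    return [eps + (1 - 2 * eps) * (v - lo) / (hi - lo) for v in values]


scaled = minmax_open([2.0, 3.5, 5.0, 8.0], eps=0.01)
```

The sample minimum lands on `eps` and the sample maximum on `1 - eps` (up to floating-point rounding), so the choice of `eps` decides how far from the endpoints the data is kept.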

Another way, as suggested by Sycorax in his comment, is to use a logistic transform $$ x' = \frac{1}{1 + \exp(-x)} $$ This ensures that $x' \in (0,1)$ for all $x \in \mathbb{R}$. However, depending on the original distribution of $x$, $x'$ might span only a limited part of the interval $(0,1)$, so you might want to standardize $x$ before applying the logistic transform.
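A minimal sketch of the standardize-then-logistic idea, using only the Python standard library. The function name `logistic_scaled` is hypothetical, and I am assuming the sample standard deviation is the intended scale. Keep in mind that the logistic map is nonlinear, so uniformly distributed input will generally not stay uniform after this transform.

```python
import math
import statistics


def logistic_scaled(values):
    """Standardize values, then map them into (0, 1) with the logistic function.

    `logistic_scaled` is a hypothetical helper written for this sketch.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, needs >= 2 points
    if sd == 0:
        raise ValueError("values must not all be equal")
    result = []
    for v in values:
        z = (v - mean) / sd
        # Evaluate the logistic function in a form that avoids overflow in exp().
        if z >= 0:
            result.append(1.0 / (1.0 + math.exp(-z)))
        else:
            ez = math.exp(z)
            result.append(ez / (1.0 + ez))
    return result


scaled = logistic_scaled([1.0, 2.0, 3.0])
```

Here the middle value equals the sample mean, so it maps to 0.5, and the other two land symmetrically around it.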

matteo
3

Using the property that a continuous random variable passed through its own CDF is uniformly distributed on $[0,1]$ (the probability integral transform), you can compute the empirical CDF of $x$. This is essentially the same as ranking the data and then rescaling by the number of elements $n$. To enforce the requirement that the scaled data exclude 0 and 1, you can deviate from the standard ECDF procedure and construct the scale so that the outputs are $\frac{1}{n+1}, \frac{2}{n+1},\cdots, \frac{n}{n+1}$, which is likewise uniform.
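The rank-based rescaling can be sketched as follows in plain Python. The function name `rank_scale` is hypothetical, and the tie handling (ties get consecutive ranks in order of appearance) is my own assumption; for a sample from a continuous distribution, ties should not occur anyway.

```python
def rank_scale(values):
    """Map each value to rank / (n + 1), so every output lies strictly in (0, 1).

    `rank_scale` is a hypothetical helper; ties are broken by input order.
    """
    n = len(values)
    # Indices of the values, sorted from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    scaled = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scaled[idx] = rank / (n + 1)
    return scaled


print(rank_scale([10.0, 3.0, 7.0]))  # [0.75, 0.25, 0.5]
```

The output keeps the original order of the data, with the smallest value mapped to $\frac{1}{n+1}$ and the largest to $\frac{n}{n+1}$.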

Sycorax
  • There's a whole class of symmetric versions of your scaling procedure: $u_\alpha(i) = \frac{i-\alpha}{n+1-2\alpha}$ (with $0\leq\alpha\leq 1$), of which the above has $\alpha=0$. (There are also asymmetric ones, which have uses in some applications.) – Glen_b Dec 06 '18 at 05:12
  • Does this have any particular name? – Sycorax Dec 06 '18 at 13:42
  • Several, I think, but I can't recall any right now. It comes up in probability plotting. Blom 1958 "Statistical Estimates and Transformed Beta Variables" is the standard reference for this thing (and variations). – Glen_b Dec 07 '18 at 08:49