
Suppose I run an experiment, repeat it $N$ times, and look at the distribution of the results. I would usually expect (and hope for) a Gaussian, i.e. normal, distribution.

If, however, my data turn out to be $t$-distributed, what does that say about them? What can I infer from a $t$-distribution?

I realise this may be a slightly open-ended question, so let me illustrate what I am after by example. If I run an experiment and my data come out Rayleigh-distributed, I can already say that my data arise as the magnitude of a vector whose two components are each Gaussian distributed.
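That Rayleigh example can be checked directly by simulation. A minimal sketch in Python (standard library only; the choice of $\sigma = 1$ is just for illustration):

```python
import math
import random

random.seed(0)

N = 100_000
sigma = 1.0

# Magnitude of a 2-D vector whose components are independent N(0, sigma^2):
# this is exactly the "two Gaussian components" recipe described above.
r = [math.hypot(random.gauss(0, sigma), random.gauss(0, sigma)) for _ in range(N)]

mean_r = sum(r) / N
# A Rayleigh(sigma) distribution has mean sigma * sqrt(pi / 2), about 1.2533
# for sigma = 1, so the sample mean should land close to that value.
print(mean_r)
```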

kjetil b halvorsen
Q.P.
    Your logic is inverted: when your data are the norms of Gaussian vectors, it's reasonable to use a Rayleigh distribution to model them; but when a Rayleigh model fits your data, that does not imply your data were generated by norms of some (hidden) Gaussian vectors. Similarly, there's very little you can say about the process by which $t$-distributed data were generated. – whuber Mar 04 '20 at 15:01
  • Do your data come from the ratio between normally distributed data and the sum of squares of normally distributed data? https://en.wikipedia.org/wiki/Student%27s_t-distribution – Paolo Nadalutti Mar 04 '20 at 14:25
  • @whuber I did a little more reading, and I now interpret a t-distribution as data drawn from a Gaussian distribution but where there are not enough points in the sample to fully represent the parent distribution, from which the sample is generated. But I do take your point about my logic being inverted. – Q.P. Mar 04 '20 at 15:11
  • That's not a correct interpretation: $t$ distributions do not arise that way. They arise as *sampling distributions* of a mean divided by a standard error. – whuber Mar 04 '20 at 16:39
  • Data that follows a Student's-t distribution has fatter tails (more extreme observations, kurtosis > 3) than you would see in a normal distribution but still have the "central hump" of observations around the mean. – RobertF Mar 04 '20 at 16:42
  • @RobertF I know how a Student's-t distribution looks, I want to know why it looks that way. – Q.P. Mar 04 '20 at 17:09
  • @whuber I don't understand what you mean by "sampling distributions of a mean". By this do you mean, if I have some parent Gaussian distribution and I take samples of length $N$ from this distribution and calculate the mean of that sample. Then repeating this, my distribution of calculated means will then be Student t-distributed? – Q.P. Mar 04 '20 at 17:11
  • Close (and the concept's right): the distribution of calculated means will have a *Gaussian* distribution, but the distribution of means minus the parent mean, all divided by the calculated standard error, will have a Student $t$ distribution. – whuber Mar 04 '20 at 17:21
  • @whuber Brilliant I think I understand! How does that link in with, if one has two random variables both generated from a Gaussian distribution and takes the ratio, then repeating many times -- why does this then produce a student t-distribution? This comes back to me trying to understand why some recorded data may come out looking t-distributed. I know I impose my own "thoughts" on the data here, as you said, but experimentally sometimes this is all you have to go on. The physics of the problem and what your statistics look like! – Q.P. Mar 04 '20 at 17:29
  • In short it's because you're using the sample variance, not population variance, to find the standardized scores of your sample observations. For small samples your sample variance estimate is unstable and jumps around from sample to sample, resulting in fatter tails for the distribution of z-scores. – RobertF Mar 04 '20 at 17:34
  • The link is this: the particular Student $t$ in question has "one degree of freedom:" it is the distribution of the $t$ statistic for a sample of size $2.$ In this case the standard error is a multiple of the size of the difference between the two numbers in the sample. Because the difference follows a Gaussian distribution, its size follows a "half-Gaussian." Thus, the Student $t(1)$ arises by dividing one Gaussian by an (independent) half-Gaussian. However, because the numerator is equally likely to be negative as positive, you get the same distribution as dividing by the full Gaussian. – whuber Mar 04 '20 at 17:39
  • @whuber Okay now I really have it! Thank you! And also thanks to the other users for trying to help me out! – Q.P. Mar 04 '20 at 17:47
  • @whuber do you know of anywhere, where a mathematical proof is written for what you said in your last comment. I simulated the behaviour and replicated exactly what you said! – Q.P. Mar 05 '20 at 21:11
  • Historically, the first proof would have appeared in a paper by R. Fisher around the first World War. Any theoretical or sufficiently advanced stats textbook includes one. The result in a disguised form is proven at https://stats.stackexchange.com/questions/85916. I presented the argument in my previous comment, with more details, at https://stats.stackexchange.com/a/437424/919, and the same thread contains other demonstrations, including the one at https://mathworld.wolfram.com/NormalRatioDistribution.html. – whuber Mar 05 '20 at 21:19
  • @whuber Brilliant, thanks for taking the time to teach! – Q.P. Mar 05 '20 at 22:37
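The two claims worked out in the comments — that the $t$ statistic for a sample of size $2$ follows $t(1)$, and that the same distribution arises as a ratio of two independent Gaussians — can be checked numerically. A sketch in Python (standard library only; the cutoff $3$ is arbitrary, chosen just to compare tail frequencies):

```python
import math
import random

random.seed(1)

N = 100_000

def t_stat_n2(mu=0.0, sigma=1.0):
    """t statistic (xbar - mu) / (s / sqrt(n)) for a sample of size n = 2."""
    x1 = random.gauss(mu, sigma)
    x2 = random.gauss(mu, sigma)
    xbar = (x1 + x2) / 2
    s = abs(x1 - x2) / math.sqrt(2)  # sample standard deviation when n = 2
    return (xbar - mu) / (s / math.sqrt(2))

def gauss_ratio():
    """Ratio of two independent standard Gaussians (a Cauchy = t(1) variate)."""
    return random.gauss(0, 1) / random.gauss(0, 1)

# Tail frequency P(|X| > 3) under each recipe, and under a plain Gaussian.
frac_t = sum(abs(t_stat_n2()) > 3 for _ in range(N)) / N
frac_ratio = sum(abs(gauss_ratio()) > 3 for _ in range(N)) / N
frac_norm = sum(abs(random.gauss(0, 1)) > 3 for _ in range(N)) / N

# For a Cauchy, P(|X| > 3) = 1 - (2/pi)*atan(3), about 0.205, versus
# about 0.0027 for a standard Gaussian: the first two should agree with
# each other and dwarf the third.
print(frac_t, frac_ratio, frac_norm)
```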

1 Answer

If your data look as if they were sampled from a $t$-distribution rather than from a normal distribution, that means they have fatter tails while still being symmetric. You cannot tell from this how the data were generated; there are many possibilities. But it must be some process that, for some reason, tends to produce many atypical values (outliers).

One way to learn about this is simulation: in R, try `x <- rt(100, df=5)` and experiment, plot, and so on.
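If R is not at hand, the same experiment can be sketched in Python with the standard library alone: a Student $t$ variate with $\nu$ degrees of freedom can be built as $Z / \sqrt{V/\nu}$, where $Z$ is standard normal and $V$ is chi-squared with $\nu$ degrees of freedom. (The helper name `rt` below just mirrors the R function; the cutoff $2.57$ is roughly the 97.5% point of $t(5)$.)

```python
import math
import random

random.seed(7)

def rt(df):
    """One Student t variate: Z / sqrt(V / df), with V chi-squared on df d.o.f."""
    z = random.gauss(0, 1)
    v = sum(random.gauss(0, 1) ** 2 for _ in range(df))
    return z / math.sqrt(v / df)

N = 100_000
t_sample = [rt(5) for _ in range(N)]
normal_sample = [random.gauss(0, 1) for _ in range(N)]

def tail(xs, c):
    """Fraction of the sample lying beyond +-c."""
    return sum(abs(x) > c for x in xs) / len(xs)

# Fatter tails: the t(5) sample should put about 5% of its mass beyond
# +-2.57, versus only about 1% for the Gaussian sample.
print(tail(t_sample, 2.57), tail(normal_sample, 2.57))
```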

Sometimes a $t$-distribution is assumed in order to make a model more robust; see "Why should we use t errors instead of normal errors?" and "Fitting t-distribution in R: scaling parameter".

The conclusion in your last paragraph is wrong, as whuber said in a comment:

Your logic is inverted: when your data are the norms of Gaussian vectors, it's reasonable to use a Rayleigh distribution to model them; but when a Rayleigh model fits your data, that does not imply your data were generated by norms of some (hidden) Gaussian vectors.

Similarly, there's very little you can say about the process by which $t$-distributed data were generated.

kjetil b halvorsen