Is it possible to determine if a dataset is real or randomly generated?

Question

I've been tasked with developing regression and classification models for time series data. For each observation I have a continuous target for regression and a discrete target for classification.

I've fitted a LASSO model which works stupendously well. It has very low MAE and RMSE values, as well as low relative errors (on the order of 0.4%). R^2 is 0.99, but I never trust it too much. The residuals are very regular and appear to follow a normal distribution with zero mean and variance 1/2:

Finally, there's virtually no autocorrelation between the residuals at any meaningful lag.
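For reference, the kind of autocorrelation check described above can be sketched as follows. This is only a minimal illustration using the standard library: the function names are made up for this example, the residuals are a synthetic stand-in (white noise with variance 1/2) rather than the actual data, and the 1.96/sqrt(n) band is the usual large-sample approximation for white noise.

```python
import math
import random


def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at a positive integer lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag))
    return numer / denom


def significant_lags(values, max_lag):
    """Lags whose autocorrelation lies outside the approximate 95% white-noise band."""
    band = 1.96 / math.sqrt(len(values))
    return [lag for lag in range(1, max_lag + 1)
            if abs(autocorrelation(values, lag)) > band]


# Synthetic stand-in for the model residuals (assumption: not the real data).
rng = random.Random(0)
residuals = [rng.gauss(0.0, math.sqrt(0.5)) for _ in range(2000)]
print("lags outside the band:", significant_lags(residuals, max_lag=20))
```

With 20 lags tested at the 5% level, roughly one lag can fall outside the band by chance alone, so an isolated flag is not evidence of structure.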

On the other hand, the classification task seems impossible. There are three classes which are almost perfectly balanced in the dataset, but the labels appear to have been assigned at random. I ran PCA to reduce the data to two components in order to plot it, and I just observed a garbled mess of colored points with no visible pattern.
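One way to put a number on the "labels look random" impression, as a complement to the PCA plot, is a permutation test on a single feature. The sketch below is an assumption-laden illustration: the function names are hypothetical, the statistic (between-class spread of the class means) is just one simple choice, and the toy data at the bottom is invented. A large p-value for every feature would be consistent with labels unrelated to the features, though it would not prove it.

```python
import random


def between_class_spread(feature, labels):
    """Sum over classes of class size * (class mean - grand mean)^2."""
    grand = sum(feature) / len(feature)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(y, []).append(x)
    return sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())


def permutation_p_value(feature, labels, n_permutations=1000, seed=0):
    """Share of label shuffles whose spread is at least the observed one."""
    rng = random.Random(seed)
    observed = between_class_spread(feature, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if between_class_spread(feature, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Toy data (invented): one feature that separates three balanced classes.
toy_feature = [0.0] * 30 + [10.0] * 30 + [20.0] * 30
toy_labels = [0] * 30 + [1] * 30 + [2] * 30
print("p-value:", permutation_p_value(toy_feature, toy_labels))
```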

My distinct impression is that the dataset is contrived and does not represent real observations. Are there techniques to determine statistically how likely such a thing is?

I know about Benford's law, but I'm not sure if it applies to real values as well as integers. Out of curiosity, I computed the frequencies for the integer-valued columns in my dataset, and these are the results:

Those certainly do not seem to match Benford's law.
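A comparison of leading-digit frequencies against Benford's law can be sketched like this. It is a rough illustration, not the exact computation used for the results above: the helper names are hypothetical, the example inputs are invented, and the chi-square statistic is compared against 15.507, the 5% critical value for 8 degrees of freedom. The same code works for real-valued columns, since only the first significant digit is used.

```python
import math
from collections import Counter

# Benford's expected probability for leading digit d is log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """First significant digit of a nonzero int or float."""
    # Scientific notation puts the first significant digit at position 0.
    return int(format(abs(x), ".15e")[0])


def benford_chi_square(values):
    """Chi-square statistic of observed leading digits against Benford's law."""
    digits = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(digits.values())
    return sum((digits.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items())


CRITICAL_5_PERCENT = 15.507  # chi-square, 8 degrees of freedom

# Invented examples: powers of 2 span many orders of magnitude,
# whereas the integers 100..999 have uniform leading digits.
powers = [2 ** k for k in range(200)]
uniform = list(range(100, 1000))
print("powers of 2:", benford_chi_square(powers))
print("100..999:   ", benford_chi_square(uniform))
```

A statistic above the critical value only says the digits deviate from Benford's law; whether the law should hold for a given column in the first place is a separate question.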

I also observed that some columns approximate certain distributions very well (e.g. I have three columns whose values closely follow bimodal distributions).

Unfortunately, the dataset has been anonymized, and it might come from a financial context. I don't know anything else about it.

EDIT: I plotted histograms for all the columns in the dataset. Some of them really look like well-known distributions. Maybe someone with more experience than me can recognize them? I'm not looking for a definitive judgement here, since I know that's not possible. In the end, it's just curiosity. Here is the result:

This is a fundamentally difficult question. Google "telling if series is randomly generated" (no quotes) for hundreds of links, including many on stackexchange (especially math.stackexchange), producers of mathematical software, and even Dilbert :) — , May 14 '17 at 16:56
Benford's law can apply to leading digits of real values but will generally only have any reasonable hope of holding (and then only approximately) if the values you can observe range across several orders of magnitude and small values are more likely than large ones. — Glen_b, May 14 '17 at 17:25
Benford's law is not so useful for detecting whether data were generated with a good random number generator; there, too good a fit to theoretical distributions is more of a hint. — Björn, May 14 '17 at 17:34
@barrycarter Yeah, I realized it's pretty much an impossible task, especially with the scarce information I possess. — rubik, May 15 '17 at 08:40
@Björn I see. I added histograms of all the columns. I don't really have much experience with weird probability distributions, what do you think? Just curious. — rubik, May 15 '17 at 08:40
Benford's Law applies to real values as well as integers. But this is not definitive proof of anything. For example, house prices in euros do not follow Benford's Law at all. Also, someone with the ability to fake the data can also make it follow the law. In short, this is quite a complicated question. — David, Jun 20 '19 at 11:25
