Estimating probability distribution function of a data stream

Question

I have a very large number of observations. Observations arrive sequentially. Each observation is an $n$-dimensional vector (with $n \ge 100$), is independent from the others and is drawn from the same unknown distribution. Is there an optimal policy to estimate the unknown distribution, given some space bounds on the number of observations that can be stored? I would leave the estimation criteria open-ended, (in terms of expected or minimax error, asymptotic consistency, aysmptotic efficiency etc.).

the answer will depend on what kind of distribution is estimated. Is your goal only to estimate the distribution, or you intend to estimate something else using this distribution? What about the structure of this $n$-dimensional vector? Is it also a sample, is it time-series, or something else? Does the distribution change with time? — mpiktas, Jan 17 '11 at 18:55
Just *how* large is 'very large'? Are you hoping to estimate a 100-dimensional multivariate distribution with no constraints on its form? You might indeed need a *very* large number of observations, maybe of the order of 10^100. — onestop, Jan 17 '11 at 19:06
As mentioned in the post: the sequence is iid, hence stationary. The goal is to estimate the PDF, not something else. I am open to learn about any know results for specific classes, e.g., PDFs of given smoothness, with finite support, with finite n-th moment, etc. I suspect that the literature is not very large. — gappy, Jan 17 '11 at 23:26
it is still not clear for me, are the elements of the $n$-dimensional vector independent and distributed identicaly? — mpiktas, Jan 18 '11 at 07:49
Surely you know SOMETHING about the nature of these 100+ variable observations. — cespinoza, Jan 18 '11 at 20:42

score 2 · Answer 1 · answered Feb 06 '11 at 01:23

If you have no reason to suspect more than one mode in your data, then a multivariate normal distribution is not a bad first go. You just calculate the mean vector, and covariance matrix, and there's your PDF.

However this is a very rough answer, and it seems to me that with 100 observations, you will probably have multi-modal data. Normal just looks like one big smooth multi-dimensional mountain. You data would probably look more like a multi-dimensional mountain range (with lots of local humps).

Estimating probability distribution function of a data stream

1 Answers1