I have frequently been faced with analyzing CSV files that are limited to approximately 32,000 records apiece (the record count overflows at 32,768, i.e. the signed-integer limit of a 16-bit system).
As a result, I often have data that spans several hundred files, each containing up to 32,000 records. Even that 32,000-record length is variable and effectively random, because the automated data collection halts whenever the process being monitored stops.
Here is the issue -- I can very easily open these two hundred plus files algorithmically and sequentially determine how many data points each contains, along with some desired summary statistics. The statistics of interest are each file's sample size, mean, and standard deviation.
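For concreteness, here is roughly how I do that per-file pass (a minimal Python sketch; the glob pattern and the "value" column name are placeholders for my actual file layout):

```python
# Per-file pass: collect (n, mean, stdev) for each CSV.
# Assumes a single numeric column named "value"; adjust to the real layout.
import csv
import glob
import statistics

per_file_stats = []  # list of (n, mean, stdev) tuples, one per file

for path in sorted(glob.glob("data_*.csv")):
    with open(path, newline="") as f:
        values = [float(row["value"]) for row in csv.DictReader(f)]
    n = len(values)
    if n >= 2:  # statistics.stdev needs at least two points
        per_file_stats.append((n, statistics.mean(values), statistics.stdev(values)))
```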
Armed with these two-hundred-some-odd sets of summary statistics, I wish to recover the mean and standard deviation of the entire collected sample (up to 200+ files times a maximum of 32,000 points each). It seems to me that moments about the mean are somehow involved, in that the standard deviation of a hypothetical subset of, say, 16,000 data points should carry less weight (and therefore have less influence on the overall standard deviation) than that of a subset of 32,000 points. For that reason, I suspect that the variances of the individual subsets play a role, or that some root-mean-square procedure can recombine these summary statistics, suitably weighted by sample size, to recover the mean and standard deviation of the entire sample set. My current guess at such a weighting is sketched below; your assistance in confirming or correcting it is appreciated.
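Here is that guess, continuing from the per-file pass above. The step where I rebuild each file's sum of squares from its (n, mean, stdev) is my own assumption and is exactly what I would like checked:

```python
# Suspected weighted combination (unverified): rebuild each file's sum and
# sum of squares from its summary, pool them, then re-derive the overall
# mean and standard deviation.  statistics.stdev above uses the n-1
# (sample) convention, so that is undone here.
import math

N = sum(n for n, _, _ in per_file_stats)
overall_mean = sum(n * m for n, m, _ in per_file_stats) / N

# Sum of squared values in one file, recovered from n, mean m, stdev s:
#   sum(x^2) = (n - 1) * s^2 + n * m^2
total_sq = sum((n - 1) * s**2 + n * m**2 for n, m, s in per_file_stats)
overall_stdev = math.sqrt((total_sq - N * overall_mean**2) / (N - 1))

print(N, overall_mean, overall_stdev)
```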