Are two subsequences obtained by dropping elements from a random sequence still random and independent?

Question

This question is a follow-up to a previous question.

If one has a sequence of numbers generated by a PRNG that is assumed to be random and independent, what can be assumed of sequences obtained by dropping elements from the original? Namely:

$$ \begin{aligned} S_1 =& ~ (u_0, u_1, u_2, u_3, u_4, u_5, ...) \\ S_2 =& ~ (u_0, u_2, u_4, u_6, ...) \quad \text{(every second element)} \\ S_3 =& ~ (u_0, u_3, u_6, u_9, ...) \quad\text{(every third element)} \end{aligned} $$

1. Is there any assumption that makes $S_1$ random that will be broken if it is divided in this way? For example, would this introduce correlation between $S_2$ and $S_3$, or make either more autocorrelated?
2. Suppose two new sequences $S_4$ and $S_5$ are generated as random samples (with replacement) from $S_1$. Does any of the answer to 1 still hold?
3. What if $S_4$ and $S_5$ were random samples without replacement, meaning that they are disjoint?

As a side note, this is particularly interesting to me because if one knows the period of the PRNG and picks one element "every period", the resulting sequence will consist of identical elements. So there is definitely an effect when sampling that way; the question is whether there is an effect for any stride other than that specific value.
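
The "every period" effect can be illustrated with a minimal sketch. It assumes a toy linear congruential generator whose parameters (`a = 5`, `c = 3`, `m = 16`) are hypothetical and chosen only to give a short full period of 16; the function name `lcg` is likewise made up for this illustration and is not taken from the question.

```r
# Toy linear congruential generator: state <- (a * state + c) mod m.
# The parameters are hypothetical, picked so the period is exactly 16.
lcg <- function(n, seed = 1, a = 5, c = 3, m = 16) {
  x <- numeric(n)
  state <- seed
  for (i in seq_len(n)) {
    state <- (a * state + c) %% m
    x[i] <- state
  }
  x
}

s1 <- lcg(64)                                    # four full periods
every_period <- s1[seq(1, length(s1), by = 16)]  # stride equal to the period
every_second <- s1[seq(1, length(s1), by = 2)]   # stride sharing a factor with the period

# Number of distinct values that survive each kind of thinning
n_distinct_period <- length(unique(every_period))
n_distinct_second <- length(unique(every_second))
```

In this toy generator a stride equal to the period collapses the sequence to a single value, and a stride of 2 halves the number of distinct values. Well-designed PRNGs have periods far too long for this to matter at practical strides, but the sketch shows why the stride and the period should not share structure.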

score 1 · Accepted Answer · 2020-07-18T02:27:43.180

This update is based on your comments below, which completely change how I read your question.

A Pearson's correlation is for paired observations. If you make S2 a 50% random sample of S1, and make S5 the first half, in order, of S1 (so they are the same length) the correlation between S2 and S5 will be very close to zero, as you have completely clobbered the order. The correlation will also be very close to zero if you make S2 every other element of S1.

Here is the output of some R code demonstrating this.

# This makes the results repeatable (use the same seed)
> set.seed(1188)
# Choose a random, normally distributed sample (default mean and SD are 0 and 1) 
> S1 <- rnorm(1000)
# Take every other element in S1
> S2 <- S1[c(TRUE, FALSE)]
# Take the first 500 elements of S1
> S5 <- S1[1:500]
# Show the first few values in S2 and S5
> head(S2)
[1] -0.5583091  0.2582470 -0.6253171  1.2863448
[5] -0.7943670 -1.0510371
> head(S5)
[1] -0.5583091  1.2792432  0.2582470 -1.4063328
[5] -0.6253171 -0.3928849
# Perform a Pearson correlation (rcorr() is provided by the Hmisc package)
> library(Hmisc)
> rcorr(S2,S5, type="pearson")
     x    y
x 1.00 0.05
y 0.05 1.00
n= 500 
P      x        y     
x           0.2785
y   0.2785       
# The correlation is 0.05, very close to zero, and the p-value of the correlation
# test is 0.2785, which is > 0.05, which means we can't conclude that the correlation
# of 0.05 is actually different from 0.
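
The same kind of check can be sketched for the sequences in the question itself. This is only an illustration under stated assumptions: the sample size of 6000 is arbitrary, S2 and S3 are truncated to a common length so that a paired Pearson correlation is defined, and the lag-1 autocorrelation is used as one simple indicator of serial dependence.

```r
set.seed(1188)
S1 <- rnorm(6000)

# Every second and every third element of S1
S2 <- S1[seq(1, length(S1), by = 2)]
S3 <- S1[seq(1, length(S1), by = 3)]

# Truncate to a common length so observations can be paired by position
n <- min(length(S2), length(S3))
r23 <- cor(S2[1:n], S3[1:n])

# Lag-1 autocorrelation of each thinned sequence
# (element 1 of $acf is lag 0, element 2 is lag 1)
lag1_S2 <- acf(S2, lag.max = 1, plot = FALSE)$acf[2]
lag1_S3 <- acf(S3, lag.max = 1, plot = FALSE)$acf[2]
```

If the elements of S1 are independent, all three quantities should be close to zero, which is consistent with the rcorr() result above.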

My original answer is below.

I didn't read the "previous question" you refer to. I will read "correlation" in a way that makes sense given the question, meaning how similar the three sets are, assuming the order of the numbers is not important. (Pearson's correlation is defined for paired observations, which is not what you have, as the sets are different lengths.)

Assuming true random number generation:

Question 1: S2 will be 1/2 the size of S1, and S3 will be 1/3 the size of S1. There will be overlaps between the values chosen for S2 and S3 (every 6th element in S1).
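
A minimal sketch of that overlap, working on the indices rather than the values; the length of 60 is an arbitrary choice for illustration:

```r
n <- 60                          # arbitrary length of S1 for illustration
idx2 <- seq(0, n - 1, by = 2)    # indices of S1 that go into S2
idx3 <- seq(0, n - 1, by = 3)    # indices of S1 that go into S3

# Indices that appear in both subsequences
shared <- intersect(idx2, idx3)
share_of_S1 <- length(shared) / n
```

The shared indices are exactly the multiples of 6, so one sixth of the elements of S1 appear in both S2 and S3.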

Question 2 and 3: Assuming S2 and S3 will be 1/2 and 1/3 of S1 -- If you take S3 from [S1 - S2], S3 won't have any of the exact values S2 does (assuming random real numbers, with infinite fractional digits). There won't be any overlap of the values. (At 8 decimal places it's "possible" that two values in S1 could be the same, and one could end up in S2 and the other in S3.) With replacement, some of the same values will be chosen for S2 and S3 -- on average, 1/6 of the numbers from S1 will be found in both S2 and S3 (1/2 of 1/3).
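
The 1/6 figure can be checked with a small simulation. This sketch assumes one particular reading of "with replacement": S2 and S3 are drawn independently from S1, so an element picked for one remains available to the other. The sizes and the number of repetitions are arbitrary.

```r
set.seed(1)
n <- 6000
idx <- seq_len(n)   # positions in S1

# Repeatedly draw S2 (half of S1) and S3 (a third of S1) independently,
# and record what fraction of S1 ends up in both
fracs <- replicate(200, {
  in_S2 <- sample(idx, n / 2)
  in_S3 <- sample(idx, n / 3)
  length(intersect(in_S2, in_S3)) / n
})

mean_frac <- mean(fracs)
```

Under this reading, `mean_frac` should land close to 1/6, while any single draw will vary around that value.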

If you want exactly 1/6 of the numbers to be chosen for both S2 and S3 (every time), use the first method, picking every other number, then every third (with replacement - assumed). Using a systematic way of choosing the numbers, which is not based on the value of the numbers in any of the sets, will not affect the randomness of the sets.

These are the criteria I replied to:

S1 = (u0,u1,u2,u3,u4,u5,...un)
S2 = (u0,u2,u4,u6,...)(every second element)
S3 = (u0,u3,u6,u9,...)(every third element)

Is there any assumption that makes S1 random that will be broken if it is divided in this way? Something like introducing correlation between S2 and S3, or making either more autocorrelated.

Considering two new sequences S4 and S5 are generated as random samples (with replacement) from S1, does any of the answered in 1 holds?

What if S4 and S5 were random samples without replacement? Meaning that they are disjoint.

First some comments: For questions 2 and 3, one can assume that S2 and S3 are the same size since random sampling means the "every two elements" no longer holds. Your answer does not correspond to the question because I am not asking whether the sequences have elements in common but whether there is some special assumption that makes S1 random that is broken when the sequence is divided in this way. And with correlation I meant the Pearson's r, but any significant similarity that can bias the randomness between S2 and S3 will be equivalent for this discussion — Ezequiel Castaño, Jul 17 '20 at 18:35
I don't know why you think somebody would assume that set 2 and set 3 are 50% samples of set 1. Your original samples were 50% and 33.33%. You should update your post to say what you want. -- Selecting every other item for set 2, and every third item for set 3 does not make those two sets non-random, as I said in my answer. -- Pearson's r is defined for paired observations. That's not what you have with a 50% sample and a 33.33% sample, therefore I read "correlation" to mean "unordered set correlation" - "how similar are the unordered sets?" — , Jul 17 '20 at 22:39
If you want to do a Pearson correlation, the sets need to be the same lengths, and order matters - as you are looking at paired observations. If S2 is a 100% random sample of S1, the correlation between S1 and S2 will be very close to zero, as the order will be completely clobbered. — , Jul 17 '20 at 22:45
You are completely right, I have changed the question to better fit what I really wanted. In the case of calculating something like the Pearson's r, which requires equal lengths, I can think of two possible workarounds (I don't know whether they work or not): one would be to truncate one of the sequences so that they are of equal length; the second is to consider S1 infinite (like a real random number sequence) and ask what would then happen with S2 and S3, which would in turn be countably infinite. — Ezequiel Castaño, Jul 17 '20 at 23:37
The correlations will all be very close to zero. I added some R code to my answer, showing this. — , Jul 17 '20 at 23:56
