Check if a character string is not random

Question

Background
Let's say we have an alphabet of A, B, C, D. We then look through some data and find a "word", DDDDDDDDCDDDDDD. The chance of finding this random seems low to me whereas finding BABDCABCDACDBACD seems less random.

Question
How should I check whether the strings I encounter are not random?

I tried some things in R, e.g., encoding the letters numerically and then comparing these to permutations. But encoding beforehand is quite cumbersome. Likely there is a more direct approach for this?

You could make an n-bit state machine and then record how many mispredictions it makes. Sort of like a CPU’s branch predictor. — alexyorke, Oct 10 '18 at 14:27
You can compute the probability that a character string has been generated by some particular known process. Whether it's "random" cannot be known (and is probably not a meaningful question). — OrangeDog, Oct 10 '18 at 14:58
You could [check with accounting](http://dilbert.com/strip/2001-10-25). — hlovdal, Oct 10 '18 at 17:53
If you're focused on letter frequencies rather than sequences, the Chi-squared test of actual-versus-expected frequency is common. That is, if your first example is "odd" because it has way too many "D", while your second has a fairly equal number of "A", "B", "C", and "D", then you'd want to compare the number of each of A, B, C, and D versus whatever you'd expect. (Maybe roughly equal numbers of A, B, C, D, or maybe twice as many A's as B's and twice as many B's as C's or D's.) — Wayne, Oct 10 '18 at 19:54
Just because a result is low probability doesn't mean it's not random. The probability of any specific person winning the lottery is extremely low, but it's still the result of a random process. — Barmar, Oct 10 '18 at 21:10
**Any** string has an equal chance of being randomly created. Using the lottery as an example, if you played the numbers `1`,`2`,`3`,`4`,`5`,`6`, you have exactly the same chance of winning the jackpot as *any* other set of numbers. **Q:** If you flip a coin ten times and the first 9 times come up `Heads`, what will the next flip most likely be? (**A:** it's still a 50/50 chance on the tenth flip.) — ashleedawg, Oct 11 '18 at 04:23
randomness as lack of structure: https://cs.stackexchange.com/questions/14772/compressed-information-randomness — thebjorn, Oct 11 '18 at 09:55
`DDDDDDDDCDDDDDD` has 15 letters while `BABDCABCDACDBACD` has 16, making the former more likely to appear. — Henry, Oct 11 '18 at 18:45
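
The chi-squared comparison of observed and expected letter counts suggested in the comments could be sketched in R roughly as follows. This is only a sketch under stated assumptions: the null model treats A, B, C and D as equally likely, the helper name `letter_chisq` is hypothetical, and the p-value is simulated because the expected counts are small for words this short.

```r
# Hypothetical helper: goodness-of-fit test of letter counts against
# assumed null probabilities (equal probabilities by default).
letter_chisq <- function(word, alphabet = c("A", "B", "C", "D"),
                         probs = rep(1 / length(alphabet), length(alphabet))) {
  letters_in_word <- strsplit(word, "")[[1]]
  # Use a factor so letters that never occur still get a zero count.
  counts <- table(factor(letters_in_word, levels = alphabet))
  # Expected counts are small here, so simulate the p-value instead of
  # relying on the chi-squared approximation.
  chisq.test(counts, p = probs, simulate.p.value = TRUE, B = 10000)
}

set.seed(1)
res_d     <- letter_chisq("DDDDDDDDCDDDDDD")
res_mixed <- letter_chisq("BABDCABCDACDBACD")
res_d$statistic      # large: the counts are far from equal
res_mixed$statistic  # 0: every letter occurs exactly four times
```

This test looks only at letter frequencies, so it says nothing about the order in which the letters appear.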

score 19 · Accepted Answer · edited Oct 10 '18 at 15:08

the chance of finding this random seems low to me whereas finding BABDCABCDACDBACD seems less random.

Why would that be? If the overall proportion of letters A...D is equal to 0.25 for each letter, and each letter is independent of the other one, then both words are exactly equally probable. If the distribution of letters differ, then of course the probabilities of generating both words might be different.
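
As a small illustration of this point, the probability of a specific word under an independent-letters model is just the product of the per-letter probabilities. The helper name `word_prob`, the two four-letter example words and the skewed distribution below are assumptions chosen for illustration, not part of the original question.

```r
# Hypothetical helper: probability of one specific word when letters are
# drawn independently with the given per-letter probabilities.
word_prob <- function(word, probs) {
  letters_in_word <- strsplit(word, "")[[1]]
  prod(probs[letters_in_word])
}

uniform <- c(A = 0.25, B = 0.25, C = 0.25,  D = 0.25)
skewed  <- c(A = 0.5,  B = 0.25, C = 0.125, D = 0.125)  # assumed for illustration

word_prob("ABAA", uniform)  # 0.25^4
word_prob("CDBD", uniform)  # 0.25^4, identical to the line above
word_prob("ABAA", skewed)   # 0.5 * 0.25 * 0.5 * 0.5
word_prob("CDBD", skewed)   # 0.125 * 0.125 * 0.25 * 0.125
```

Under the uniform model any two words of the same length get the same probability; only a non-uniform model separates them.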

You can try to find "low complexity" words, for example words with an especially high proportion of one letter (you could use the Shannon information as suggested in the other response, and in biological sequence analysis there are many other approaches), but there is no test for "randomness", as without further assumptions or knowledge about what you are actually analyzing, the term "randomness" makes no sense.

Answered by January, Oct 10 '18 at 13:44 · edited by OrangeDog, Oct 10 '18 at 15:08

"both words are exactly equally probable" would be a great place for bold emphasis. – Tashus Oct 10 '18 at 16:35
"If the overall proportion of letters A...D is equal to 0.25 for each letter...". No, actually every possible word is as likely as any other, whatever proportion of letters is in the word. – DJClayworth Oct 10 '18 at 18:08
@DJClayworth I believe the intent of that line is to say that if instead of A:.25, B:.25, C:.25, D:.25, we have A:.5, B:.25, C:.125, D:.125, the word ABAA is far more likely in the second case than in the first, and CDBD is equally likely as ABAA in the first scenario, but less likely than ABAA in the second. Thus, the chance of a given word depends on the 'proportion' of letters relative to other possible proportions. – ale10ander Oct 11 '18 at 02:06

score 17 · Answer 2 · edited Oct 10 '18 at 13:52

You could try Shannon information:

$$ H = -\sum_{i = 1}^{k} {P_{i}\log_{2}(P_{i})} $$

where $P_{i} = \frac{c_{i}}{n}$, $c_{i}$ is the count of the $i$-th distinct letter in the word, $k$ is the number of distinct letters that occur in the word, and $n = |{\rm word}|$ is its length.

For the first word you have $H = 0.35$. In the second word you have $H = 2$.

A word with higher entropy can be thought of as more random than a word with lower entropy.
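
The two entropy values quoted above can be reproduced with a few lines of R. This is a minimal sketch; the function name `shannon_entropy` is hypothetical, and it only uses the letters that actually occur in the word.

```r
# Hypothetical helper: Shannon entropy (in bits) of the letter
# frequencies within a single word.
shannon_entropy <- function(word) {
  counts <- table(strsplit(word, "")[[1]])
  p <- as.numeric(counts) / sum(counts)
  -sum(p * log2(p))
}

round(shannon_entropy("DDDDDDDDCDDDDDD"), 2)   # 0.35
shannon_entropy("BABDCABCDACDBACD")            # 2
```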

Answered by Edvrsoft, Oct 10 '18 at 13:32 · edited by gung - Reinstate Monica, Oct 10 '18 at 13:52

This is a good way to go for detecting a string's unpredictability, and I upvoted, but your criterion would give the same results for both `bababbaabb` and `aaaabbbbbb`. The, admittedly very loose, notion of "randomness" used by OP would probably consider the former to be "more random" than the latter. – ymbirtt Oct 11 '18 at 15:54
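
The point made in this comment can be checked directly. The sketch below (the helper name `n_runs` is hypothetical) shows that the two words have identical letter counts, so any count-based measure treats them the same, while a simple run count based on `rle()` does distinguish them.

```r
# Both words contain four a's and six b's, so a measure based only on
# letter counts cannot tell them apart.
w1 <- strsplit("bababbaabb", "")[[1]]
w2 <- strsplit("aaaabbbbbb", "")[[1]]
identical(sort(w1), sort(w2))   # TRUE

# rle() collapses consecutive repeats, so the number of runs reflects
# the order of the letters.
n_runs <- function(x) length(rle(x)$lengths)
n_runs(w1)   # 7
n_runs(w2)   # 2
```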

Ben · Answer 3 · 2018-10-11T03:14:26.707

Other answers here have focused on the overall occurrence of different letters in the sequence, which may be one aspect of the "randomness" expected. However, another aspect of interest is the apparent randomness in the order of the letters in the sequence. At minimum, I would think that "randomness" entails the exchangeability of the vector of letters, which can be tested using a "runs test". The runs test counts the number of "runs" in the sequence and compares the total number of runs to its null distribution under the null hypothesis of exchangeability, for a vector with the same letters. The exact definition of what constitutes a "run" depends on the particular test (see e.g., a similar answer here), but in this case, with nominal categories, the natural definition is to count any consecutive sequence consisting of only one letter as a single "run".

For example, your sequence BABD-CABC-DACD-BACD looks prima facie non-random to me (no letter appears next to itself, which is probably unusual for a sequence this long).$^\dagger$ To test this formally, we can perform a runs test for exchangeability. In this sequence we have $n = 16$ letters (four of each letter) and there are $r = 16$ runs, each consisting of one single instance of a letter. The observed number of runs can be compared to its null distribution under the hypothesis of exchangeability. We can do this via simulation, which yields a simulated null distribution and a p-value for the test. The result for this sequence of characters is shown in the graph below.

For this sequence, the p-value for the runs test (under the null hypothesis of exchangeability) is $p=0.0537$. This is significant at the 10% significance level, but not at the 5% significance level. There is some evidence to suggest a non-exchangeable series (i.e., non-random order), but the evidence is not particularly strong. With a longer observed string, the runs test would have greater power to distinguish an exchangeable string from a non-exchangeable string. (As you can see, my initial prima facie judgment that this string is non-random may be wrong - the p-value is not actually as low as I expected it to be.)

Finally, it is important to note that this test only looks at the randomness of the order of the letters in the string - it takes the number of letters of each type as a fixed input. This test will detect non-randomness in the sense of non-exchangeability of the letters in the string, but it will not test "randomness" in the sense of overall probabilities of different letters. If the latter is also part of the specified meaning of "randomness" then this runs test could be augmented with another test that looks at the overall counts of the letters, and compares this to a hypothesised null distribution.

R code: The above plot and p-value were generated using the following R code:

```r
# Define the character string vector (as factors)
x <- factor(c(2,1,2,4, 3,1,2,3, 4,1,3,4, 2,1,3,4),
            labels = c('A', 'B', 'C', 'D'))

# Define a function to calculate the runs for an input vector
RUNS <- function(x) {
  n <- length(x)
  R <- 1
  for (i in 2:n) { R <- R + (x[i] != x[i-1]) }
  R
}

# Simulate the runs statistic for k permutations
k <- 10^5
set.seed(12345)
RR <- rep(0, k)
for (i in 1:k) {
  x_perm <- sample(x, length(x), replace = FALSE)
  RR[i]  <- RUNS(x_perm)
}

# Generate the frequency table for the simulated runs
FREQS <- as.data.frame(table(RR))

# Calculate the p-value of the runs test
R      <- RUNS(x)
R_FREQ <- FREQS$Freq[match(R, FREQS$RR)]
p      <- sum(FREQS$Freq*(FREQS$Freq <= R_FREQ))/k

# Plot estimated distribution of runs with test
library(ggplot2)
ggplot(data = FREQS, aes(x = RR, y = Freq/k, fill = (Freq <= R_FREQ))) +
  geom_bar(stat = 'identity') +
  geom_vline(xintercept = match(R, FREQS$RR)) +
  scale_fill_manual(values = c('Grey', 'Red')) +
  theme(legend.position = 'none',
        plot.title      = element_text(hjust = 0.5, face = 'bold'),
        plot.subtitle   = element_text(hjust = 0.5),
        axis.title.y    = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0))) +
  labs(title    = 'Runs Test - Plot of Distribution of Runs under Exchangeability',
       subtitle = paste0('(Observed runs is black line, p-value = ', p, ')'),
       x = 'Runs', y = 'Estimated Probability')
```

$^\dagger$ I have broken the sequence up with dashes solely to make it easier to read; the dashes have no significance to the analysis.

Interesting! Will definitely take a look at the R script — KingBoomie, Oct 13 '18 at 19:00

score 1 · Answer 4 · answered Oct 11 '18 at 16:08

Assuming the string of letters is long enough, you can apply randomness tests to the data.

One set of such tests is called the diehard tests:

The diehard tests are a battery of statistical tests for measuring the quality of a random number generator. They were developed by George Marsaglia over several years and first published in 1995 on a CD-ROM of random numbers.

They involve a, perhaps arbitrary, set of tests such as:

Birthday spacings
Overlapping permutations
Ranks of matrices
Monkey tests
Count the 1s
Parking lot test
Minimum distance test
Random spheres test
The squeeze test
Overlapping sums test
Runs test
The craps test

A good sequence of random data should pass these tests.

However, passing these tests isn't sufficient to prove the numbers don't actually encode a real signal. They could be the output from a high-quality encryption routine.
