
I have 14k tweets and I want to code these tweets (categorize them based on their topics), but since it is difficult to do the coding for the whole dataset, I decided to take a sample from it.

What I am thinking of is to take a randomly selected 20% of the whole dataset (although I am not sure why I decided on 20%) and then do the coding just for this 20% sample. My question here is: how do I check whether the random sample I picked is representative?

I am not an expert in statistics :) Thank you,

Anas
  • Seems better to use a well-vetted _method_ of choosing a random sample than to try to judge somehow whether the _result_ is random. For example, in R the code `sample(1:14000, 2800)` will give you indices of a random sample without replacement from your list. – BruceET Aug 27 '20 at 20:06
  • Well, I'd say some basic statistics. E.g. distribution of both the whole dataset and the 20 %, the mean, std.dev., median, etc. Those are some good estimators to start off with. – Thomas Aug 27 '20 at 20:57
  • (+1) Thanks for the review of my sampling theory! A bit of sampling theory is required to correctly answer this question, at least in my humble opinion. Per my edited answer, with educational references: those who still favour a simple random sampling scheme here, despite the non-existence of a 'list' of members for the sampling frame on which to base inferences about the parent population, should likely rethink. – AJKOER Aug 28 '20 at 04:10
  • *it is difficult to do the coding for the whole dataset* -- I kind of struggle to see how coding 20% of the tweets is going to be significantly easier than coding the whole dataset, as you still have the same order of magnitude of tweets to handle. Maybe your question has more to do with programming than with statistics? – dariober Aug 28 '20 at 08:36

3 Answers


Here I prefer the technique of systematic sampling, where one selects every kth individual from the population. Thus, from a list of n tweets ordered by arrival, every kth tweet is chosen to construct a sample of s tweets, such that k*s is close to n.
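As a minimal sketch in R, assuming the 14,000 tweets are listed in order of arrival and a 20% sample is wanted (variable names are illustrative only):

#Systematic sample: pick every kth tweet from the arrival-ordered list
set.seed(1)
n <- 14000                      # number of tweets in the list
s <- 2800                       # desired sample size (20% of n)
k <- floor(n / s)               # sampling interval, so k*s is close to n
start <- sample(1:k, 1)         # random starting point within the first interval
idx <- seq(from = start, to = n, by = k)   # indices of the selected tweets
length(idx)                     # roughly s

The random starting point keeps the scheme probabilistic, while the fixed interval spreads the sample over arrival time.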

Advantages:

  • Simple, statistically valid procedure

  • Accurate

  • Easier to implement and verify the correct tweets have been selected

  • Unbiased and representative, arguably even more so than a simple random sampling scheme in the current context, because it also sorts by time of arrival, a criterion that is likely material since it spreads the sample over the day. As such, it can, for example, help distinguish workers (largely inactive on Twitter during the 9 AM to 5 PM workday) from non-workers, including students active 3 PM - 8 PM (after school) and older adults active later in the evening.

Thus, the application of simple, easy-to-implement, unbiased and representative systematic sampling here likely also spreads the sample over important age demographics and income classes.

Note: How one arrives at the best sample size 's' is an important topic, best discussed separately.

[EDIT] An important point is duly noted by this educational reference, to quote:

You don't have a complete list, so simple random sampling doesn't apply...

So, technically, the employment of a simple random sampling scheme to assess characteristics of the parent population is valid when one has a complete list of the population from which to subsample. This is NOT the case with a continuously occurring series of generated tweets constituting a subset of the tweeting universe. So inferences about the parent population, and in particular the very question of whether the sample is representative of the 'whole sample' (implying the parent population), can only arguably be answered here by a simple random sampling scheme. However, the same source does affirm the validity of systematic sampling in such a context, to quote:

Since we don't have access to the complete list, just stand at a corner and pick every 10th* person walking by.

*Of course, choosing 10 here is just an example. It would depend on the number of students typically passing by that spot and what sample size was needed.

AJKOER
  • If tweets that are close to each other in time share some characteristic (as well they might), then choosing every kth tweet will under-sample these related tweets. On what do you rely to say that every-kth samples are unbiased, in general? On what assumptions is that statement based? – Joel W. Aug 27 '20 at 23:38
  • If tweets are driven by the occurrence of random events, systematic sub-sampling should produce a representative sample. Spreading the sample over arrival times also sub-samples by time, which is a logical dimension along which to achieve a representative sub-sample. Simple random sampling, on the other hand, is for me harder to argue as being more efficient at producing an accurate sample from the general population, due to the potential clustering of tweets on a particular topic. – AJKOER Aug 28 '20 at 00:26
  • Also, when tweets are produced can be indicative of different population segments, as I noted in my answer. This is ignored in a simple random sampling scheme. If you want to be sure that a particular segment is included, I would go with systematic random sampling, which is also far less annoying to implement accurately. Note, if an unusually large number of tweets arise on select topics, systematic sampling resembles importance sampling. – AJKOER Aug 28 '20 at 00:37
  • Joel: This statement is wrong: "choosing every kth tweet will under-sample these related tweets". If one advocates simple random sampling instead, it may not just under-sample a particularly hot topic; it could probabilistically ignore several hot-topic discussions altogether and over-represent others. Does that really sound more representative? – AJKOER Aug 28 '20 at 01:45
  • I have edited my answer based on a review of the literature. The statistical validity of employing a simple random sample here is open to question (as the parent population here is likely not discernible in nature, nor even its precise size, as required in simple random sampling). This comes from an education-based reference and, per my recollection, even my statistics professor. – AJKOER Aug 28 '20 at 13:17
  • 1
    An SRS is easy to do because one has the entire frame; none of your objections apply to that. On the other hand, the systematic sample--although *probably* ok in practice--suffers from several defects. For instance, it will never be able to detect local temporal clusters that are shorter than the sample stride. – whuber Aug 28 '20 at 18:19
  • Thank you all for your answers and discussion. In case I decide to go with systematic sampling, how do I identify the value of k? How do I choose a value of k that ensures the new, smaller sample is representative, given that I don't know the value of n (the size of the smaller sample)? – Anas Aug 30 '20 at 07:55

So long as you have no wish to incorporate covariate information into your sampling scheme (e.g., balancing tweets from males/females), the usual method is to take a simple random sample without replacement. This can be implemented in R using the sample.int function. In the code below I show you how to generate a simple random sample from $N$ population values. For convenience, the sample is sorted into ascending order, so it is a list of the numbers of the tweets to include in the sample. (Remember to set your seed for reproducible randomisation.)

#Generate simple random sample of tweets
set.seed(1)
N <- 14000
p <- 0.2
n <- ceiling(p*N)
SAMPLE <- sort(sample.int(N, size = n, replace = FALSE))

#Show the sample
SAMPLE

   [1]     8    13    17    18    21    25    27    42    59    64  ...
  [24]   126   128   129   149   152   155   157   172   173   179  ...
  [47]   237   241   244   262   267   274   277   289   308   311  ...
  ...
  ...
  ...
[2761] 13775 13777 13779 13780 13784 13785 13787 13788 13796 13798  ...
[2784] 13879 13880 13886 13896 13908 13918 13923 13927 13942 13944  ...
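
To then pull out just the sampled tweets for manual coding, a minimal sketch, assuming the tweets have been imported into a data frame called tweets_df (a hypothetical name) with one row per tweet:

#Extract the sampled rows for coding (tweets_df is a hypothetical data frame, one row per tweet)
tweets_to_code <- tweets_df[SAMPLE, ]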
Ben
  • The time and effort to match numbers to tweets is something that novices here have to experience. Then one has to check that the right tweets have, indeed, been selected! Further, after all that effort, I suspect that systematic random sampling will, with much more ease, believably produce a 'better' representative sample set (as I expounded upon in my answer). My acquaintance with this method arose from a textbook dedicated to sampling theory, based on a course I completed. – AJKOER Aug 28 '20 at 02:08
  • Ordinarily this would all be done directly using scripted coding, so that it can be done in a fraction of a second. All that would require is to import the tweets as an `R` data frame and then reference the rows corresponding to the sample numbers. – Ben Aug 28 '20 at 02:17
  • Thanks Ben for your reply. But even if I followed your procedure, some manual checking should be performed that all is well. Further, how does simple random sampling here actually give one comfort in answering the question: "How to make sure that the random sample is representative for the whole sample?" – AJKOER Aug 28 '20 at 02:24
  • Ben: Please note my comments above and my edited answer questioning whether a simple random sampling scheme is even valid for assessing attributes of the tweeting universe/parent population, as asked in the question: "random sample is representative for the whole sample?" – AJKOER Aug 28 '20 at 04:20
  • If machine learning here could extract the time of day from tweets, then, per my understanding, a systematic sampling scheme, which does not require prior precise specification of the entire parent population or its size, dividing the day into time intervals for sub-sampling, is apparently a statistically valid and likely valuable path for drawing inferences about the parent population. – AJKOER Aug 28 '20 at 13:04

What you want is a sample that is representative in terms of the topics you are going to manually code.

First of all, you want to be sure that your coding procedure is not biased. This is really important because a representative sample is useless if your coding procedure is biased. Thus you need at least two independent coders to code the tweets (usually just a part of the tweets you are going to code), and a test to evaluate the agreement between the independent coders' results (such as Krippendorff's alpha coefficient).
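
As a minimal sketch in R, assuming the irr package is available (the codes below are invented purely for illustration):

#Intercoder agreement via Krippendorff's alpha (invented example codes)
library(irr)
coder1 <- c(1, 2, 2, 3, 1, 1, 2, 3, 3, 1)   # topic codes assigned by coder 1
coder2 <- c(1, 2, 3, 3, 1, 2, 2, 3, 3, 1)   # topic codes assigned by coder 2
kripp.alpha(rbind(coder1, coder2), method = "nominal")   # raters in rows, tweets in columns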

Having said that, in your case the universe is composed of 14,000 tweets, and a random sample would by definition avoid systematic biases in the selection of tweets. However, you might consider a more systematic sampling scheme to be sure that every day of the week and every hour of the day is properly represented. For instance, you could sample a certain number of tweets per hour, for every hour of the day, for all the days in your dataset. In media studies there is also a procedure consisting in creating a 'constructed week', where the data for each day are sampled from the same weekday across many weeks. With regard to tweets, this method has been compared to simple random sampling, with the finding that the latter performs better.
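
As an illustration, a minimal sketch of per-hour sampling in R, assuming the tweets sit in a data frame called tweets with a POSIXct timestamp column created_at (both names are hypothetical):

#Per-hour sampling sketch (tweets / created_at are hypothetical names)
set.seed(1)
hour_label <- format(tweets$created_at, "%Y-%m-%d %H")   # day-hour label for each tweet
quota <- 5                                                # illustrative quota of tweets per hour
idx_by_hour <- split(seq_len(nrow(tweets)), hour_label)
sampled_idx <- unlist(lapply(idx_by_hour,
                             function(i) i[sample.int(length(i), min(quota, length(i)))]))
hourly_sample <- tweets[sampled_idx, ]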

In general, you can find a lot of examples in the literature based on media data and also Twitter data. If you want to be really sure of the appropriateness of your sampling strategy, you might consider a sort of cross-validation approach. Instead of picking just one sample, you pick two samples. Without forgetting to code the tweets with independent coders and verify the validity of the coding, you first code one sample, then the other, and finally compare the proportions of codes in the two samples. You could also use a statistical test to check that the code proportions in the two samples do not differ too much. However, such a detailed approach may be unusual; you should take into account the best practice in your field.
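
For example, a minimal sketch of such a comparison in R (the counts below are made up purely for illustration):

#Compare code proportions across the two independently coded samples (made-up counts)
codes_sample1 <- c(politics = 120, sports = 80, other = 100)
codes_sample2 <- c(politics = 110, sports = 95, other = 95)
chisq.test(rbind(codes_sample1, codes_sample2))   # tests whether the code distributions differ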

You might also want to try some supervised classification methods, which seem to work well even with a limited quantity of manually coded data.

N9