Sample weighting

Question

How do I work out an adequate/representative sample in the following scenario?

My intention is to use correlation and regression models to test the relationship between income level and money spent on a particular product by store type.

Scenario

There are 200 stores, with a total of 10,000 registered customers. The stores can be grouped in terms of their floor space as large (100 stores), medium (70 stores) and small (30 stores).

The large stores account for 60% of the registered customers, the medium stores account for 30% of the customers and the small stores account for 10% of the customers.

Sampling Method

Using the cluster sampling technique, I have randomly chosen 2 large, 2 medium and 2 small stores. (This is on the assumption that the stores within each size group are similar.)

Then, using stratified random sampling, I chose 50 customers from each of the six stores such that I have 100 large store customers, 100 medium store customers and 100 small store customers. (This was to ensure gender and age balance.)

My final sample is then 300 customers.

Is this a representative sample (given the scenario)?

Or, do I have to use some sort of weighting to ensure the final sample reflects:

the store size, i.e. 100 are large stores, 70 are medium stores and 30 are small stores;
the customer distribution, i.e. 60% of the customers are from large stores, 30% are from medium stores and 10% are from small stores?

Here is a method I propose to use: the Probability Proportionate to Size (PPS) sampling method. Please see my comment below.

How do you know the income level of customers? What about customers who don't buy the product - are you only interested in purchasers? — Michelle, Jan 26 '12 at 07:56
You've asked a number of obviously related questions http://stats.stackexchange.com/questions/21727/guidelines-for-calculating-sample-size, http://stats.stackexchange.com/questions/21658/is-it-necessary-for-a-sample-to-broadly-reflect-the-target-population-distributi, http://stats.stackexchange.com/questions/21595/what-is-a-representative-sample. It would definitely be worth your while studying some basic texts on sampling and weighting and if possible taking a course or getting specialist assistance. As Michelle says in an answer to another question, this is a highly specialized area. — Peter Ellis, Jan 26 '12 at 08:35
Please get some consultant survey statistician help with this. — Michelle, Jan 26 '12 at 18:52
What is the target parameter that you want to make inference for? What is the target population you want to conduct your analysis for? You would need to weight differently depending on whether you want to analyze the store-level characteristics or the consumer-level characteristics (and you'd have to assume that a customer can only go to one store to pin down the selection probabilities for customers). — StasK, Jan 30 '12 at 04:08

Peter Ellis · Answer 1 · 2012-01-29T18:39:45.823

This sampling strategy could work but it needs a bit of refinement.

You certainly need to use weighting. Any sampling strategy can be "representative" so long as every member of the population has a known, non-zero probability of selection. You can then set each respondent's weight to the inverse of their probability of selection. If the probabilities are not equal, you need weights. In your case individuals definitely have different chances of selection, so weights need to be calculated for them once the sampling is finished but before you start any analysis.
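As a rough illustration of "weights are the inverse of the probability of selection", here is a minimal Python sketch using only the figures from the question. It assumes every customer within a stratum has the same chance of selection, which is only true if all stores of a given size have the same number of customers, so treat the result as a first approximation. The function name `stratum_weights` is made up for this example.

```python
from fractions import Fraction

# Figures from the question: registered customers and sampled customers per stratum.
population = {"large": 6000, "medium": 3000, "small": 1000}
sampled = {"large": 100, "medium": 100, "small": 100}

def stratum_weights(population, sampled):
    """Return the inverse of the average selection probability in each stratum.

    Assumption: all customers in a stratum are equally likely to be selected.
    """
    weights = {}
    for stratum, n_pop in population.items():
        prob = Fraction(sampled[stratum], n_pop)  # average chance of selection
        weights[stratum] = 1 / prob               # weight = 1 / probability
    return weights

weights = stratum_weights(population, sampled)
for stratum, w in weights.items():
    print(stratum, float(w))

# Sanity check: the weighted sample should add up to the population size.
total = sum(weights[s] * sampled[s] for s in sampled)
print(total)
```

Under that assumption each sampled large-store customer stands for 60 customers, each medium-store customer for 30 and each small-store customer for 10, and the weights add back up to the 10,000 registered customers.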

I think you are using "cluster" and "strata" the wrong way around in your description, although it is reasonably clear what you are doing. Your strata are the store sizes. Within each stratum you first select two stores, and those stores are the clusters, from each of which you sample 50 customers. If you specify the survey design to statistical software, it is important to get this distinction right.

A challenge with your sampling strategy, as you point out in the comments, is that customers from large stores have a very low relative probability of selection. This means that you will end up with a relatively good idea of the behaviour of customers in small stores, but there are so few of them that you should ask whether it is worthwhile investing that much of your scarce sample in them. Perhaps you should select more people from the large stores. This is the sort of question that really needs specialist input to resolve: the best approach depends on your actual research question, the variance of your variables within each stratum, etc.
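For comparison, here is what the 300 interviews would look like if they were simply split in proportion to each stratum's share of customers. This is only one baseline, not a recommendation: as noted above, the best allocation depends on the research question and on the variance within each stratum. The function name `proportional_allocation` is made up for this sketch.

```python
# Customer shares from the question, as whole percentages.
customer_share_pct = {"large": 60, "medium": 30, "small": 10}

def proportional_allocation(share_pct, total_sample):
    """Split total_sample across strata in proportion to their customer share."""
    return {s: total_sample * pct // 100 for s, pct in share_pct.items()}

allocation = proportional_allocation(customer_share_pct, 300)
print(allocation)
```

With these figures the split is 180 large-store, 90 medium-store and 30 small-store customers, instead of 100 from each.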

I don't think your strategy does anything about age and gender balance, contrary to what you say. You could introduce these into your sampling strategy in some way (e.g. by setting quotas, if you are worried that interviewer bias is stopping interviewers from approaching people of a particular age or gender, while being careful to ensure that selection remains as random as possible and that you have not given interviewers more discretion in whom they choose) and also as weights after the sampling is over.

I recommend Thomas Lumley's survey package for R, which has a good website. However, I think you will need to purchase his book (or a similar one) and read it carefully before you are really in a position to know how best to collect and analyse your data. He has a good explanation of the issues you are asking about. Of course, there are other good books on sampling too, but his has the advantage of being linked to readily available, free and powerful software.

With the survey package, R is an excellent tool for the sort of analysis you want to do. SAS and Stata also work well with complex surveys. SPSS cannot work with surveys and weights properly unless you buy an expensive additional complex surveys module. I wouldn't even contemplate using something like Excel for analysing a survey with weights.

Thanks. I guess my key concern is that the final sample is made up of three sets of 100 customers, despite the fact that the first set represents large stores with 60% market share; the next set represents medium stores with 30% market share; and the final set represents small stores with 10% market share. In this regard the three sets do not have the same weight. Unsure if I am using the term "weights" correctly in this context. — Adhesh Josh, Jan 26 '12 at 15:18
"Weights" normally refers to numbers the analyst calculates after the sampling is complete, based on who actually ended up in the sample and how much weight needs to be given to each row of data so they represent the population. What you are calling "weight" when you say "the three sets do not have the same weight" is something closer to the probability of selection in the sample ie customers in small stores appear to be six times as likely to be selected as those in large stores. Normally the weights you give sample points will be inversely related to the probability of selection. — Peter Ellis, Jan 26 '12 at 18:34
Thanks. The probability of selection is the same within each set of 100 respondents because they are selected from two stores of each size (small, medium and large). Is there a way to resolve the above without resorting to a complex formula? — Adhesh Josh, Jan 26 '12 at 19:42
No, it can't be resolved without a reasonably complex formula. As well as the fact that the store is "large", you need to know how many people use the store and hence what each sampled person's chance of selection was from within that store. You need to know the store's chance of selection (stores are your primary selection unit), then the individual's chance of selection from within the store. The 100 respondents within each stratum don't have equal chances of selection unless each large store is exactly the same size in terms of customers. — Peter Ellis, Jan 26 '12 at 19:51
Thanks Peter. I think I have found a way out! It is Probability Proportional to Size Sampling (PPS) technique. I don't have the points to create a link here, so I am putting a paper I found on this method in my question above. — Adhesh Josh, Jan 27 '12 at 01:22
Adhesh - there is no "way out" of using what you call a complex formula. PPS is not a simple thing to analyse. I urge you to do some more reading as per my answer above, on both sampling and the analysis of the survey once complete. — Peter Ellis, Jan 29 '12 at 18:38
I thought the attached paper (in the question) gave adequate guidance when clusters are of different sizes. The stores are clusters in this case. — Adhesh Josh, Jan 30 '12 at 12:58
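
To make the two-stage calculation described in the comments above concrete, here is a minimal Python sketch. It assumes stores are drawn by simple random sampling within each size stratum, customers are drawn by simple random sampling within each chosen store, and each customer is registered at exactly one store. The per-store customer counts (90 and 55) are hypothetical, since the question does not give them, and `two_stage_weight` is a name made up for this example.

```python
from fractions import Fraction

def two_stage_weight(stores_in_stratum, stores_sampled,
                     customers_in_store, customers_sampled):
    """Weight = 1 / (P(store selected) * P(customer selected | store selected))."""
    p_store = Fraction(stores_sampled, stores_in_stratum)
    p_customer = Fraction(customers_sampled, customers_in_store)
    return 1 / (p_store * p_customer)

# Large stratum from the question: 2 of 100 stores, 50 customers per chosen store.
# Hypothetical store sizes: one store with 90 customers, one with 55.
w_bigger_store = two_stage_weight(100, 2, 90, 50)
w_smaller_store = two_stage_weight(100, 2, 55, 50)

print(float(w_bigger_store), float(w_smaller_store))
```

Even though both stores are "large", their respondents end up with different weights, which is why the 100 respondents in a stratum do not share a single weight unless the stores have identical customer counts.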
