4

I have a large dataset which is made up of responses from 600 shoppers on 15 variables.

When I tally the responses, it comes to 603. I know there were 600 respondents, so I must have recorded the responses of three people twice.

Instead of checking each questionnaire against the dataset, can I run a test in Excel that would point out the double entries.

Jeromy Anglim
  • 42,044
  • 23
  • 146
  • 250
Adhesh Josh
  • 2,935
  • 16
  • 50
  • 67
  • 6
    Not only were some responses recorded twice, but it's also likely some responses were not recorded at all and many more may have been recorded with an error in at least one field. You should treat this 603/600 discrepancy as a red flag that signals the possibility of pervasive quality problems and therefore not limit your actions to fixing only the errors that were detected: if you care about these data, you should take more comprehensive action to detect and fix the other errors that are likely present. See http://stats.stackexchange.com/q/11659. – whuber Sep 15 '11 at 16:42

2 Answers2

4

Suppose you have ID numbers in column A, from row 2 through 604, and a blank column B. In cell B2 type "=countif(\$a\$2:\$a\$604,a2)". Copy that formula all the way down. The number in each cell in column B will now be the number of times that its row's ID appears.

whuber
  • 281,159
  • 54
  • 637
  • 1,101
rolando2
  • 11,645
  • 1
  • 39
  • 60
  • That's a nice technique. It adopts a great simplification in assuming that the rows have unique ids, though! – whuber Sep 15 '11 at 22:20
  • @whuber I don't understand. If an ID appears 3 times, there will be a "3" in column B alongside each instance of that ID. – rolando2 Sep 16 '11 at 01:22
  • You are assuming that ids have been entered into the database! Most likely the database consists of 603 rows of responses and no unique identifiers. – whuber Sep 16 '11 at 14:21
  • Oh! I see. In that case one could simply replace the "a"s with "b"s in the formula. Or use XL's Data...Remove Duplicates command. Or use Data...Identify Duplicates in SPSS :-) – rolando2 Sep 16 '11 at 14:45
  • That's not quite right, as @Glen points out, but it's worth a try. If exactly three extra records are found in this way, they probably are the duplicates (assuming there are no other errors in data entry!). But if more than three are found, there's no way to tell the duplicates from real data where two or more people gave exactly the same responses. – whuber Sep 16 '11 at 14:54
2

Not sure about excel but you can import your data into R and use the duplicate() function to determine which rows are duplicates. However with 603 observations and 15 variables you will probably get plenty more observations than the 3 you typed in twice (maybe not, depends on the distribution of the variables). I don't know how you will determine which 3 are the true duplicates without knowing the 3 you are looking for.

Glen
  • 6,320
  • 4
  • 37
  • 59