If I want to fit a multiple linear regression model to a dataset with one or two thousand observations, and I find three or four repeated rows in it, how should I deal with those rows? Should I remove them before modeling, or is it fine to keep them? And how would the results differ between these two choices?
-
It depends on *why* those rows are repeated. If they represent observations with *independent* errors, then they are just as valid as any other row and there is little basis to remove them. Often a consideration of the potential sources of error is helpful in resolving this issue, because it can reveal the extent to which the errors might depart from the standard assumption of independence. – whuber Nov 11 '21 at 21:33
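To get a feel for how much the two choices differ in practice, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the sample size, the true coefficients, the helper name `ols`, and the choice to duplicate four randomly picked rows. It fits ordinary least squares once without the repeated rows and once with them, then compares coefficients and classical standard errors. It does not settle whether the repeats are legitimate; that still depends on why they occur.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Simulated data (an assumption for illustration): 1000 rows, 2 predictors.
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

# Append exact copies of 4 existing rows to mimic the repeated rows.
dup_idx = rng.choice(n, size=4, replace=False)
X_dup = np.vstack([X, X[dup_idx]])
y_dup = np.concatenate([y, y[dup_idx]])


def ols(X, y):
    """Return OLS coefficients and their classical standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]          # residual degrees of freedom
    sigma2 = resid @ resid / dof           # estimated error variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se


beta_dedup, se_dedup = ols(X, y)           # repeated rows removed
beta_keep, se_keep = ols(X_dup, y_dup)     # repeated rows kept

max_coef_shift = np.max(np.abs(beta_keep - beta_dedup))
print("coefficients, duplicates removed:", beta_dedup)
print("coefficients, duplicates kept:   ", beta_keep)
print("largest coefficient shift:       ", max_coef_shift)
print("standard errors, removed:        ", se_dedup)
print("standard errors, kept:           ", se_keep)
```

Keeping a repeated row is equivalent to giving that observation a weight of 2 in the fit. With only a handful of repeats among a thousand or so rows, the shift in the estimates should be small in a setup like this one, so the decision is better driven by whether the repeats are genuine independent observations or data-entry artifacts.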
-
@whuber Thank you! – Cary Nov 12 '21 at 15:11