This might be a basic question for some of you, but I am looking for the appropriate method to run a linear regression when observations are not independent.
Let's say I have the following data structure (it is not my data, but it works as an illustration):
Student | Teacher | Year | Class | Grade | Age of teacher | Household income of student | Number of Students in class |
---|---|---|---|---|---|---|---|
1 | A | 2020 | Math | 5 | 30 | 100 | 25 |
1 | A | 2020 | Geography | 4 | 30 | 125 | 15 |
1 | A | 2021 | Geography | 3 | 31 | 130 | 30 |
2 | A | 2020 | Math | 2 | 30 | 30 | 25 |
2 | B | 2021 | Chemistry | 1 | 40 | 35 | 18 |
3 | A | 2020 | Math | 5 | 30 | 50 | 25 |
3 | A | 2020 | Geography | 5 | 30 | 55 | 15 |
3 | C | 2021 | Chemistry | 4 | 60 | 75 | 15 |
3 | C | 2021 | French | 5 | 60 | 80 | 10 |
For the sake of argument, let us say that my theoretical argument is that a class-specific grade (dependent variable: Grade) is the outcome of student characteristics (Household income of student), teacher characteristics (Age of teacher), and class characteristics (Number of Students in class). I would also like to add year-specific effects (maybe some years are more difficult than others). How do I test this?
Since I am interested in class-specific grades rather than each student's average grade, students appear multiple times in the data (within a year and across years), so the observations are not independent. Students may also take different classes and different numbers of classes, so the data are unbalanced in this regard.
By the way, I am using R.
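In case it helps, here is the illustrative table as an R data frame, together with the naive pooled regression that I assume is not appropriate because it treats all nine rows as independent. The column names are my own shorthand for the table headers, and the model is only a sketch of what I am trying to estimate, not something I believe is correct.

```r
# Illustrative data from the table above (not my real data).
# Column names are shorthand for the table headers (my own choice).
grades <- data.frame(
  student     = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  teacher     = c("A", "A", "A", "A", "B", "A", "A", "C", "C"),
  year        = c(2020, 2020, 2021, 2020, 2021, 2020, 2020, 2021, 2021),
  class       = c("Math", "Geography", "Geography", "Math", "Chemistry",
                  "Math", "Geography", "Chemistry", "French"),
  grade       = c(5, 4, 3, 2, 1, 5, 5, 4, 5),
  teacher_age = c(30, 30, 31, 30, 40, 30, 30, 60, 60),
  income      = c(100, 125, 130, 30, 35, 50, 55, 75, 80),
  class_size  = c(25, 15, 30, 25, 18, 25, 15, 15, 10)
)

# Naive pooled OLS with year dummies.
# This ignores that the same student (and teacher) appears in several rows.
naive_fit <- lm(grade ~ income + teacher_age + class_size + factor(year),
                data = grades)

# Number of rows per student, showing the repeated and unbalanced structure.
rows_per_student <- table(grades$student)
print(rows_per_student)
```

What I am unsure about is how to change `naive_fit` so that the repeated appearance of students is accounted for.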
Thank you so much!
Warmest regards, Niels