I have a data frame, say df:
df <- data.frame(id = 1:12, country = rep(letters[1:4], each = 3),
chars = rep(c(LETTERS[24:25], "GDP"),4), year_1 = 11:22, year_2 = 31:42)
df
#   id country chars year_1 year_2
#1   1       a     X     11     31
#2   2       a     Y     12     32
#3   3       a   GDP     13     33
#4   4       b     X     14     34
#5   5       b     Y     15     35
#6   6       b   GDP     16     36
#7   7       c     X     17     37
#8   8       c     Y     18     38
#9   9       c   GDP     19     39
#10 10       d     X     20     40
#11 11       d     Y     21     41
#12 12       d   GDP     22     42
where
country - the country name.
chars - a characteristic (feature) of that country, such as education or employment. One of them is the country's GDP. All countries are compared on the same characteristics.
year_1 and year_2 - the values of that characteristic for that country in the respective years.
The goal is to build a simple linear regression model that predicts a country's GDP from the other characteristics provided (X, Y and many others).
My question is: what is the best way to arrange this data so that a linear regression can be built on it?
Should I have one row per country?
Should I have one row per country per year?
I tried reshaping the data frame so that each row holds one observation, using
library(reshape2)
melt(df, id.vars = c("id","country", "chars"))
but I am still confused about what the correct approach is.
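To make the second option concrete, here is a minimal sketch of what I think "one row per country per year" would look like, with each characteristic spread into its own column so that GDP can be the response. This is only my guess at a layout, not something I know to be correct. The names long, wide and fit are just for illustration, and I drop the id column on the assumption that it carries no information for the model.

```r
library(reshape2)

df <- data.frame(id = 1:12, country = rep(letters[1:4], each = 3),
                 chars = rep(c(LETTERS[24:25], "GDP"), 4),
                 year_1 = 11:22, year_2 = 31:42)

# Long format: one row per country / characteristic / year.
# The id column is dropped here (assumption: it is only a row counter).
long <- melt(df[, c("country", "chars", "year_1", "year_2")],
             id.vars = c("country", "chars"),
             variable.name = "year", value.name = "value")

# Wide format: one row per country per year,
# one column per characteristic (GDP, X, Y).
wide <- dcast(long, country + year ~ chars, value.var = "value")

# GDP as the response, the other characteristics as predictors.
fit <- lm(GDP ~ X + Y, data = wide)
```

With the toy data above, X and Y move in lockstep with GDP, so the fitted coefficients mean nothing; the point is only to check that the shape of wide is something lm() accepts.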