How to reshape data to apply linear regression?

Question

I have a data frame, say df:

df <- data.frame(id = 1:12, country = rep(letters[1:4], each = 3),
      chars = rep(c(LETTERS[24:25], "GDP"),4), year_1 = 11:22, year_2 = 31:42)

df
#   id country chars year_1 year_2
#1   1       a     X     11     31
#2   2       a     Y     12     32
#3   3       a   GDP     13     33
#4   4       b     X     14     34
#5   5       b     Y     15     35
#6   6       b   GDP     16     36
#7   7       c     X     17     37
#8   8       c     Y     18     38
#9   9       c   GDP     19     39
#10 10       d     X     20     40
#11 11       d     Y     21     41
#12 12       d   GDP     22     42

where

country is the country name.

chars is a characteristic (feature) of that country, such as education or employment. One of them is the country's GDP. All countries are compared on the same characteristics.

year_1 and year_2 are the values of that characteristic for that country in the respective years.

The goal is to build a simple linear regression model which can predict GDP of the country based on the characteristics provided (X, Y and many others).

My question is: what is the best way to arrange this data so that a linear regression can be built on it?

Should I have one row per country?
Should I have one row per country per year?

I tried reshaping the dataframe to make it one observation per row using

library(reshape2)
melt(df, id.vars = c("id","country", "chars"))

but I am still confused as to what is the correct approach.

Seems to me that you have __time series__ data - have a look [here](https://stats.stackexchange.com/questions/268721/multilinear-regression-vs-time-series) or [here](https://www.sas.upenn.edu/~fdiebold/Teaching104/Ch14_slides.pdf) — Xavier Bourret Sicotte, Jun 14 '18 at 20:48

score 2 · Accepted Answer · answered Jun 15 '18 at 08:29

I think you want one row per observation of y, i.e. GDP. This can be done as follows:

library(dplyr)
library(tidyr)
library(readr)

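df2 <- df %>%
  select(-id) %>%                          # the row id is not a predictor
  gather(time, value, year_1:year_2) %>%   # stack the year columns into long form
  spread(chars, value) %>%                 # one column per characteristic (GDP, X, Y)
  mutate(time = parse_number(time))        # "year_1" -> 1, "year_2" -> 2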

  country time GDP  X  Y
1       a    1  13 11 12
2       a    2  33 31 32
3       b    1  16 14 15
4       b    2  36 34 35
5       c    1  19 17 18
6       c    2  39 37 38
7       d    1  22 20 21
8       d    2  42 40 41
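In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). As a sketch only, assuming a recent tidyr is installed, the same reshape could be written as below. The name df2_alt is made up here for illustration, and the result is a tibble whose columns may come out in a different order than the table above.

```r
library(dplyr)
library(tidyr)
library(readr)

# Same toy data as in the question
df <- data.frame(id = 1:12, country = rep(letters[1:4], each = 3),
                 chars = rep(c(LETTERS[24:25], "GDP"), 4),
                 year_1 = 11:22, year_2 = 31:42)

df2_alt <- df %>%
  select(-id) %>%
  # stack year_1 and year_2 into a (time, value) pair of columns
  pivot_longer(year_1:year_2, names_to = "time", values_to = "value") %>%
  # give each characteristic (X, Y, GDP) its own column
  pivot_wider(names_from = chars, values_from = value) %>%
  # "year_1" -> 1, "year_2" -> 2
  mutate(time = parse_number(time))

df2_alt
```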

Now one can fit a model of the form:

GDP ~ time + X + Y + country

with appropriate interactions, random effects or autocorrelation, however you see fit.
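As a minimal sketch of that formula with base R's lm(), the reshaped data is typed out again below so the block runs on its own. The object name fit is just for illustration. Note that the toy numbers are perfectly collinear (here Y is always X + 1 and GDP is always X + 2), so lm() cannot estimate every term and reports some coefficients as NA. With real data the same call should give ordinary estimates.

```r
# Reshaped toy data, copied from the table above
df2 <- data.frame(
  country = rep(letters[1:4], each = 2),
  time    = rep(1:2, times = 4),
  GDP     = c(13, 33, 16, 36, 19, 39, 22, 42),
  X       = c(11, 31, 14, 34, 17, 37, 20, 40),
  Y       = c(12, 32, 15, 35, 18, 38, 21, 41)
)

# country is a character column, so lm() expands it into dummy variables
fit <- lm(GDP ~ time + X + Y + country, data = df2)

# Coefficients that cannot be estimated because of collinearity show up as NA
coef(fit)
```

Interactions can be added in the formula itself, for example time * X in place of time + X. Random effects or autocorrelated errors need a different fitting function than lm().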

Axeman


Thanks, this is helpful and makes sense. However, what is your opinion of aggregating all the values and having only one observation per country? We can take sum/mean of all the features. Do you think it would make a bad model? – Ronak Shah Jun 15 '18 at 08:37
That seems like a bad idea. What if GDP goes up with X and down with Y? Then the mean may make it seem there is no predictive value of X or Y. – Axeman Jun 15 '18 at 08:40
I mean one row per country. Number of columns would be the same (X, Y and GDP) but as there is only one row for each country, we take mean of X, Y and GDP values for that country individually. – Ronak Shah Jun 15 '18 at 08:43
Then you lose time. If you want to forecast to year 3, time is probably going to be important. I generally prefer to model the problem, instead of aggregating. – Axeman Jun 15 '18 at 08:46
yes, I agree. Thanks. I think I am going to go with this approach. – Ronak Shah Jun 15 '18 at 08:52
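
To make the trade-off discussed in these comments concrete, here is a small sketch of the one-row-per-country alternative using base R's aggregate(). The name df_agg is made up for illustration. It averages GDP, X and Y within each country, which leaves only four rows and drops the time column entirely.

```r
# Reshaped toy data (one row per country per year), copied from the answer
df2 <- data.frame(
  country = rep(letters[1:4], each = 2),
  time    = rep(1:2, times = 4),
  GDP     = c(13, 33, 16, 36, 19, 39, 22, 42),
  X       = c(11, 31, 14, 34, 17, 37, 20, 40),
  Y       = c(12, 32, 15, 35, 18, 38, 21, 41)
)

# Average each variable within country; time is not carried over
df_agg <- aggregate(cbind(GDP, X, Y) ~ country, data = df2, FUN = mean)

df_agg
nrow(df2)     # 8 observations before aggregating
nrow(df_agg)  # 4 observations after aggregating
```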
