Question
I'm aware that generating features from existing data can be a valid method for adding new features for a regression/ML algorithm*, but can you add observations generated from existing data?
*related SO question: Do combinations of existing features make new features?
Example
Language: R
Given a data frame df of three independent (predictor) variables (dv1, dv2, dv3) and one response variable (rv):
dv1 <- c("gr1", "gr2", "gr3", "gr3", "gr3", "gr3", "gr1", "gr2", "gr2", "gr1", "gr3", "gr2")
dv2 <- c("grA", "grA", "grB", "grB", "grB", "grA", "grB", "grA", "grB", "grA", "grB", "grB")
dv3 <- c(1,1,1,1,2,2,1,1,1,1,2,1)
rv <- c(1,2,3,3,2,1,1,2,3,3,2,1)
df <- data.frame(dv1, dv2, dv3, rv)
> head(df)
  dv1 dv2 dv3 rv
1 gr1 grA   1  1
2 gr2 grA   1  2
3 gr3 grB   1  3
4 gr3 grB   1  3
5 gr3 grB   2  2
6 gr3 grA   2  1
Does it make statistical sense to engineer observations by grouping the variables, finding the 'total' rv value for that group...
library(dplyr)
df_t <- df %>%
  group_by(dv1, dv2) %>%
  summarise(dv3 = sum(dv3),
            rv = sum(rv))
> head(df_t)
Source: local data frame [6 x 4]
Groups: dv1
  dv1 dv2 dv3 rv
1 gr1 grA   2  4
2 gr1 grB   1  1
3 gr2 grA   2  4
4 gr2 grB   2  4
5 gr3 grA   2  1
6 gr3 grB   6 10
... and then combining it with the original data...
df2 <- rbind(df, df_t)
> df2
   dv1 dv2 dv3 rv
1  gr1 grA   1  1
2  gr2 grA   1  2
3  gr3 grB   1  3
4  gr3 grB   1  3
....
13 gr1 grA   2  4
14 gr1 grB   1  1
15 gr2 grA   2  4
16 gr2 grB   2  4
17 gr3 grA   2  1
18 gr3 grB   6 10
... and then using that data to train a regression model?
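One way to probe this empirically is to fit the same model to the original and the augmented data and compare the two fits. The sketch below is only an illustration under assumptions: the linear formula rv ~ dv1 + dv2 + dv3 is my choice rather than something fixed by the question, and the group totals are computed with base R's aggregate() instead of dplyr so the snippet runs without extra packages.

```r
# Rebuild the example data from the question.
df <- data.frame(
  dv1 = c("gr1", "gr2", "gr3", "gr3", "gr3", "gr3",
          "gr1", "gr2", "gr2", "gr1", "gr3", "gr2"),
  dv2 = c("grA", "grA", "grB", "grB", "grB", "grA",
          "grB", "grA", "grB", "grA", "grB", "grB"),
  dv3 = c(1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1),
  rv  = c(1, 2, 3, 3, 2, 1, 1, 2, 3, 3, 2, 1)
)

# Group totals of dv3 and rv per (dv1, dv2) combination,
# using base R in place of the dplyr pipeline.
df_t <- aggregate(cbind(dv3, rv) ~ dv1 + dv2, data = df, FUN = sum)

# Original rows plus the engineered group-total rows.
df2 <- rbind(df, df_t)

# Same model on both data sets (the formula itself is an assumption).
fit_orig <- lm(rv ~ dv1 + dv2 + dv3, data = df)
fit_aug  <- lm(rv ~ dv1 + dv2 + dv3, data = df2)

# Coefficients side by side, then residual degrees of freedom.
print(cbind(original = coef(fit_orig), augmented = coef(fit_aug)))
print(c(original = df.residual(fit_orig), augmented = df.residual(fit_aug)))
```

The residual degrees of freedom rise with the six added rows, even though each added row is a deterministic sum of rows already in df. lm() has no way of knowing that, so it treats the totals as independent observations.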