I'm new to regression and unsure how to approach this problem.
I am using lm.ridge() from the MASS package to perform a regression on a continuous response variable and about 280 binary predictor variables. The predictors are correlated, hence the choice of ridge regression.
The dataset has about 13000 records.
I can formulate and run the functions to generate a model, but what I ultimately want is to figure out how much each predictor is contributing towards the response.
The model output is as follows:
> summary(mod)
#Length Class Mode
#coef 281 -none- numeric
#scales 281 -none- numeric
#Inter 1 -none- numeric
#lambda 1 -none- numeric
#ym 1 -none- numeric
#xm 281 -none- numeric
#GCV 1 -none- numeric
#kHKB 1 -none- numeric
#kLW 1 -none- numeric
and
> glimpse(mod)
#List of 9
# $ coef : Named num [1:281] -0.0063 -0.0063 -0.00477 -0.0063 -0.0063 ...
# ..- attr(*, "names")= chr [1:281] "id1" "id2" "id3" ....
# $ scales: Named num [1:281] 0.0595 0.0595 0.0595 0.0595 0.0595 ...
# ..- attr(*, "names")= chr [1:281] "id1" "id2" "id3" "id4" ...
# $ Inter : int 1
# $ lambda: num 0
# $ ym : num 0.105
# $ xm : Named num [1:281] 0.00356 0.00356 0.00356 0.00356 0.00356 ...
# ..- attr(*, "names")= chr [1:281] "id1"
# $ GCV : Named num Inf
# ..- attr(*, "names")= chr "0"
# $ kHKB : num -1.33e-24
# $ kLW : num -1.33e-24
# - attr(*, "class")= chr "ridgelm"
P.S. I have modified the ids from the actual output.
I would like to know how to use the coefficients to figure out the contribution of each id towards the response variable.
What I'm asking for is not help with the programming aspect, but help with the interpretation of the coefficients. I have found some really helpful material on implementation details, such as centering and scaling, and I broadly understand how to use coef(), model$coef and model$scales.
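To make the relationship between coef(), model$coef and model$scales concrete, here is a minimal sketch on simulated data. The data, the names id1 to id3, and lambda = 5 are made up for illustration and are not my real dataset. As far as I understand it, model$coef holds the coefficients for the centred and scaled predictors, while coef() converts them back to the original scale and adds the intercept.

```r
library(MASS)

set.seed(1)
n <- 500

# Hypothetical data: three correlated binary predictors that share a latent variable z
z <- rnorm(n)
X <- data.frame(
  id1 = as.integer(z + rnorm(n, sd = 0.5) > 0),
  id2 = as.integer(z + rnorm(n, sd = 0.5) > 0),
  id3 = as.integer(z + rnorm(n, sd = 0.5) > 0)
)
y <- 2 * X$id1 - 1 * X$id2 + 0.5 * X$id3 + rnorm(n)
dat <- cbind(y = y, X)

# lambda = 5 is an arbitrary illustrative penalty, not a tuned value
mod <- lm.ridge(y ~ ., data = dat, lambda = 5)

# mod$coef is on the scaled-predictor scale; dividing by mod$scales
# gives the coefficients per unit of the original (0/1) predictors
orig_scale <- mod$coef / mod$scales

# The intercept is recovered from the stored means of y and of the predictors
intercept <- mod$ym - sum(orig_scale * mod$xm)

# This should match coef(mod), which lists the intercept first
manual <- c(intercept, orig_scale)

# Fitted values computed by hand from the original-scale coefficients
fitted_manual <- drop(cbind(1, as.matrix(X)) %*% coef(mod))
```

On this reading, each entry of coef(mod) after the intercept would be the estimated change in the response when that binary predictor goes from 0 to 1 with the others held fixed, shrunk towards zero by the penalty.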
However, I would appreciate help and insights on the statistical side of interpreting the coefficients.
Can they be used directly and thought of as weights for the inputs (like in a neural net), or, because this is ridge regression, do they need to be processed somehow before they can be interpreted? What does it imply if the values are mostly negative? For context, the response variable is a measure of change in a value relative to a base value; it ranged between 0 and 10000 and was standardised.
Thank you.