r - odds ratio and glmnet

Question

I have this r code using caret and glmnet, and it's intended to export a table with values. I've been asked to include odds ratio. The problem is I'm uncertain how to explain the odds ratio in this context. How would I state the interpretation of the odds ratio in plain english?

This is what is printed:

Importance Mean Coefficient OddsRatio PClass.1st 0.15243 0.24162 0.15243 1.16466 PClass.2nd 0.00000 0.21726 0.00000 1.00000 PClass.3rd 0.20438 0.54112 -0.20438 0.81515 Age 0.00437 29.21214 -0.00437 0.99564 Sex.female 0.11873 0.35431 0.11873 1.12607 Sex.male 0.11849 0.64569 -0.11849 0.88826

And this is the code that makes it happen:

# titanicmodeling.R
# 
# based on https://amunategui.github.io/binary-outcome-modeling/

# load libraries
library(caret)
library(pROC)

# load data
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt',sep='\t')
titanicDF$Title <- ifelse(grepl('Mr ',titanicDF$Name),'Mr',ifelse(grepl('Mrs ',titanicDF$Name),'Mrs',ifelse(grepl('Miss',titanicDF$Name),'Miss','Nothing'))) 
titanicDF$Age[is.na(titanicDF$Age)] <- median(titanicDF$Age, na.rm=T)

# miso format
titanicDF <- titanicDF[c('PClass', 'Age',    'Sex',   'Title', 'Survived')]

# dummy variables for factors/characters
titanicDF$Title <- as.factor(titanicDF$Title)
titanicDummy <- dummyVars("~.",data=titanicDF, fullRank=F)
titanicDF <- as.data.frame(predict(titanicDummy,titanicDF))
print(names(titanicDF))


# what is the proportion of your outcome variable?
prop.table(table(titanicDF$Survived))

# save the outcome for the glmnet model
tempOutcome <- titanicDF$Survived  

# generalize outcome and predictor variables
outcomeName <- 'Survived'
predictorsNames <- names(titanicDF)[names(titanicDF) != outcomeName]

#################################################
# model it
#################################################
# get names of all caret supported models 
names(getModelInfo())

titanicDF$Survived <- ifelse(titanicDF$Survived==1,'yes','nope')

# pick model gbm and find out what type of model it is
getModelInfo()$gbm$type

# split data into training and testing chunks
set.seed(1234)
splitIndex <- createDataPartition(titanicDF[,outcomeName], p = .75, list = FALSE, times = 1)
trainDF <- titanicDF[ splitIndex,]
testDF  <- titanicDF[-splitIndex,]

# create caret trainControl object to control the number of cross-validations performed
objControl <- trainControl(method='cv', number=3, returnResamp='none', summaryFunction = twoClassSummary, classProbs = TRUE)


# run model
objModel <- train(trainDF[,predictorsNames], as.factor(trainDF[,outcomeName]), 
                  method='gbm', 
                  trControl=objControl,  
                  metric = "ROC",
                  preProc = c("center", "scale"))


# find out variable importance
summary(objModel)

# find out model details
objModel


################################################
# glmnet model
################################################

# pick model gbm and find out what type of model it is
getModelInfo()$glmnet$type

# save the outcome for the glmnet model
titanicDF$Survived  <- tempOutcome

# split data into training and testing chunks
set.seed(1234)
splitIndex <- createDataPartition(titanicDF[,outcomeName], p = .75, list = FALSE, times = 1)
trainDF <- titanicDF[ splitIndex,]
testDF  <- titanicDF[-splitIndex,]

# create caret trainControl object to control the number of cross-validations performed
objControl <- trainControl(method='cv', number=3, returnResamp='none')

# run model
objModel <- train(trainDF[,predictorsNames], trainDF[,outcomeName], method='glmnet',  metric = "RMSE")

# get predictions on your testing data
predictions <- predict(object=objModel, testDF[,predictorsNames])

# library(pROC)
# auc <- roc(testDF[,outcomeName], predictions)
# print(auc$auc)

postResample(pred=predictions, obs=testDF[,outcomeName])

# Creating a final dataframe ------------------------------------

impDF <- varImp(objModel, scale=F)
df <- impDF$importance
names(df)[names(df) == 'Overall'] <- 'Importance'

df$Mean <- colMeans(objModel$trainingData[1:ncol(objModel$trainingData)-1])

# calculate the coefficients for the best fit
bestcoef <- coef(objModel$finalModel, objModel$bestTune$lambda)
# before we transform coefficients, calculate the odds ratios for later
orDF <- data.frame(as.matrix(exp(bestcoef)))

# make it into a dataframe
coefDF <- data.frame(as.matrix(bestcoef))

# drop the intercept so coefficients are easily mapped to the dataframe
mask <- rownames(coefDF)[-(rownames(coefDF) == '(Intercept)')]
coefDF <- coefDF[mask, ]
df$Coefficient <- coefDF

# drop the intercept so odds ratios are easily mapped to the dataframe
mask <- rownames(orDF)[-(rownames(orDF) == '(Intercept)')]

# add odds ratio to the dataframe as a new col
df$OddsRatio <- orDF[mask, ]

# # temporarily add a magnitude for sorting
# df$magnitude <- (df$OddsRatio - 1)^2
# # sort by importance highest to lowest
# df <- df[with(df, order(-Importance, -magnitude, -Mean)), ]
# # drop the magnitude column in the final
# df <- df[ c('Mean', 'Importance', 'Coefficient', 'OddsRatio')]
df <- round(df, 5)
print(head(df))

It seems like there's something fundamentally wrong with this approach, but I can't see it. I would omit the odds ratio entirely, but that's not an option. I would need to replace it with something more robust, or just get confirmation that this is valid.

Please see https://stats.stackexchange.com/search?q=odds+ratio+interpretation. Since this search identifies hundreds of posts on interpreting odds ratios, you should have little trouble finding one that suits you. — whuber, Jun 08 '17 at 17:03

r - odds ratio and glmnet

0 Answers0