Stepwise not returning expected results in R

Question

Why would step() produce different outcomes when a better fit is available? I have two models that, as I understand it, should have the same relationships with a set of variables. That is, some of the independent variables used in each are complements of one another (data4a <- 1 - data4b), and the dependent variable for CaseA has the same relationship to the dependent variable for CaseB.

stepA: outcome is a model with a 0.73 adjusted R^2.
stepB: outcome is a model with a 0.05 R^2.

If I fit CaseB with the same formula that comes out of stepA, I get a 0.73 R^2, so the relationship is still the same. I understand why the two cases get the same R^2 under the same equation, but I don't understand why step() doesn't produce the same outcome for both. A generalized example of the code is shown below. I'm not sure if this is a statistical issue or a programming issue.

#These data are the same for CaseA and CaseB:
df$data1 <- log(explanatory variable)
df$data2 <- explanatory variable
df$data3 <- explanatory variable #(between 0 and 1)

#These data are specific to the case, i.e. CaseA or CaseB.
df$data4a <- measured value #(between 0 and 1)
df$data4b <- 1 - df$data4a  #(between 0 and 1)
df$data5a <- measured value #(between 0 and 1)
df$data5b <- 1 - df$data5a  #(between 0 and 1)


df$perc_a <- measured value #(between 0 and 1)
df$perc_b <- 1 - df$perc_a  #(between 0 and 1)

ModelA <- lm(perc_a~df$data4a+df$data5a+df$data1+df$data2+df$data3,data=df)
ModelB <- lm(perc_b~df$data4b+df$data5b+df$data1+df$data2+df$data3,data=df)

stepA <- step(ModelA)
stepB <- step(ModelB)

stepA$call$formula
RETURNS: 
perc_a~df$data4a+df$data1+df$data3

stepB$call$formula
RETURNS:
perc_b~df$data4b+df$data5b
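
One way to tell a statistical cause from a programming one is to rebuild the setup on simulated data and compare the two fits directly. With an intercept in the model, regressing 1 - y on 1 - x spans the same column space as regressing y on x, so the residual sum of squares and the AIC should match, and step() would be expected to keep corresponding terms. The sketch below assumes made-up data: every value, coefficient and the seed are hypothetical stand-ins, not the asker's data.

```r
# Simulated stand-in data; all values and coefficients here are hypothetical.
set.seed(1)
n <- 200
df <- data.frame(
  data1  = log(runif(n, 1, 10)),
  data2  = rnorm(n),
  data3  = runif(n),
  data4a = runif(n),
  data5a = runif(n)
)
# Response depends on data4a and data3 only (not clamped to [0, 1] here).
df$perc_a <- 0.2 + 0.5 * df$data4a + 0.1 * df$data3 + rnorm(n, sd = 0.05)

# Complement coding for CaseB, all stored in the same data frame.
df$data4b <- 1 - df$data4a
df$data5b <- 1 - df$data5a
df$perc_b <- 1 - df$perc_a

# Bare column names, so every variable is looked up in df.
ModelA <- lm(perc_a ~ data4a + data5a + data1 + data2 + data3, data = df)
ModelB <- lm(perc_b ~ data4b + data5b + data1 + data2 + data3, data = df)

# The full models should agree on residual sum of squares and AIC.
same_rss <- isTRUE(all.equal(deviance(ModelA), deviance(ModelB)))
same_aic <- isTRUE(all.equal(AIC(ModelA), AIC(ModelB)))

stepA <- step(ModelA, trace = 0)
stepB <- step(ModelB, trace = 0)

# Map the "b" names back to their "a" counterparts before comparing.
termsA <- attr(terms(stepA), "term.labels")
termsB <- attr(terms(stepB), "term.labels")
same_terms <- setequal(termsA, sub("b$", "a", termsB))

print(c(same_rss = same_rss, same_aic = same_aic, same_terms = same_terms))
```

If the same three checks disagree on the real data, the likelier explanation is that the two models are not seeing the variables they are assumed to see, for example a response picked up from the global environment instead of df, or rows dropped differently because of missing values.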

Actually, stepwise selection does not give you any guarantees to pick the best model and it often leads to bad models, see https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856 — Tim, Nov 22 '17 at 18:50
Stepwise is a bad method and, if your dependent variable is bounded, you shouldn't be using ordinary least squares. — Peter Flom, Nov 22 '17 at 18:52
@PeterFlom --Thanks for taking the time to respond -- why is OLS an inappropriate method for dependent variables that are bounded? It's my understanding that as long as the variables can infinitely vary between 0 and 1, it would still be continuous. Based on this [link](http://www.statisticshowto.com/continuous-variable/), I would think my value is a continuous variable. What do you think would be a more appropriate model to use? — bwp8nt, Nov 22 '17 at 19:52
It's inappropriate because the errors are supposed to be normally distributed and when the DV is bounded, they can't be. Also because the results can be outside the bounds. Beta regression is one good method. — Peter Flom, Nov 23 '17 at 02:26
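
The point about results falling outside the bounds can be illustrated with a small sketch. It assumes simulated data (the seed, sample size, slope and precision are all hypothetical) and uses the betareg package for the beta regression mentioned above, guarded so that the OLS part still runs when the package is not installed.

```r
# Simulated proportion data; every number below is a hypothetical choice.
set.seed(42)
n <- 300
d <- data.frame(x = runif(n, -2, 2))
mu  <- plogis(1.5 * d$x)   # true mean, strictly inside (0, 1)
phi <- 20                  # precision of the beta distribution
d$y <- rbeta(n, mu * phi, (1 - mu) * phi)

# Points to predict at, including values beyond the observed range of x.
new <- data.frame(x = c(-3, 0, 3))

# Ordinary least squares: nothing keeps the fitted line inside [0, 1].
ols <- lm(y ~ x, data = d)
ols_pred <- predict(ols, newdata = new)
ols_out_of_bounds <- any(ols_pred < 0 | ols_pred > 1)

# Beta regression (logit link by default) keeps the fitted mean in (0, 1).
beta_pred <- NULL
if (requireNamespace("betareg", quietly = TRUE)) {
  fit <- betareg::betareg(y ~ x, data = d)
  beta_pred <- predict(fit, newdata = new, type = "response")
}

print(ols_pred)
print(beta_pred)
```

Note that betareg needs the response strictly between 0 and 1, so data containing exact zeros or ones would have to be handled separately before fitting.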
