
I have 5 samples (each one contains ~380K records, 33 predictive variables and 1 binary Target):

  • one sample is used to train the models
  • the remaining 4 samples are used to validate the models

The following table compares the Gini coefficients of the Logistic Regression against those of the Multilayer Perceptron (MLP):

                      Logistic Regression    MLP
Train sample                 35.8            34.9
Validation sample 1          40.0            34.4
Validation sample 2          37.7            32.0
Validation sample 3          37.5            31.5
Validation sample 4          36.4            34.2

As you can see, the Gini coefficients of the Logistic Regression are consistently higher than those of the MLP.

Why could that be?

Before running both the Logistic Regression and the MLP, I categorized the categorical variables and scaled the numeric variables, roughly as in the sketch below.
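Something along these lines, in simplified form (cat_cols and num_cols are placeholders for the actual column lists; the exact preprocessing code isn't reproduced here):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Placeholder column lists; the 33 actual predictors are not named here.
cat_cols = ['cat_var_1', 'cat_var_2']
num_cols = ['num_var_1', 'num_var_2']

data = pd.get_dummies(data, columns=cat_cols)                    # one-hot encode categoricals
data[num_cols] = StandardScaler().fit_transform(data[num_cols])  # scale numerics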

The code of the Logistic Regression is really simple and straightforward:

import statsmodels.api as sm

Y = data['Target']   # the binary target
X = data[col_list]   # the list of 33 predictive features

X1 = sm.add_constant(X)   # add an intercept column

logit = sm.Logit(Y, X1)
result = logit.fit()
print(result.summary())

The code of the MLP is as follows:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_model():
    model = Sequential()
    model.add(Dense(5, input_dim=33, activation='relu'))   # hidden layer 1
    model.add(Dense(5, activation='sigmoid'))              # hidden layer 2
    model.add(Dense(1, activation='sigmoid'))              # output: P(Target = 1)
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = build_model()
model.fit(X, Y, epochs=4, batch_size=30, verbose=1)  # X = predictive features; Y = target
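For reference, the Gini here is the usual 2·AUC − 1. A minimal sketch of how it can be computed for both models (X_val and Y_val are placeholders for one of the four validation samples; this is not the exact evaluation code used):

from sklearn.metrics import roc_auc_score

# X_val, Y_val: placeholder arrays standing in for a validation sample
p_logit = result.predict(sm.add_constant(X_val))   # Logistic Regression probabilities
p_mlp = model.predict(X_val).ravel()               # MLP probabilities

print('Logit Gini:', 2 * roc_auc_score(Y_val, p_logit) - 1)
print('MLP Gini:  ', 2 * roc_auc_score(Y_val, p_mlp) - 1)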

I don't understand why the MLP underperforms the Logistic Regression.

  • How many observations do you have? If you have less than, say, $33 \cdot 100$, I would be surprised if a neural network did better than logistic regression! See [Minimum Training size for simple neural net](https://stats.stackexchange.com/questions/257292/minimum-training-size-for-simple-neural-net) and [Neural network modeling sample size](https://stats.stackexchange.com/questions/78289/neural-network-modeling-sample-size/78298) – kjetil b halvorsen Nov 10 '21 at 13:31
  • Actually I have 380,000 observations. Thanks for sharing the links: let me take a look. – Giampaolo Levorato Nov 10 '21 at 13:37
  • Is this in-sample or out-of-sample performance? – Dave Nov 10 '21 at 14:19
  • Both the in-sample and out-of-sample data have ~380,000 observations. – Giampaolo Levorato Nov 10 '21 at 14:38

1 Answer


If the response conditioned on your predictors roughly follows a logistic curve, then logistic regression will be superior. Despite the ML hype, deep learning / neural networks do not always outperform simpler models.
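To illustrate this point (a self-contained simulation sketch, not from the original data): if the log-odds of the response are exactly linear in the predictors, logistic regression is correctly specified, and a small MLP has no structural advantage:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 50_000, 33
Xs = rng.normal(size=(n, p))
beta = rng.normal(scale=0.3, size=p)
ys = rng.binomial(1, 1 / (1 + np.exp(-Xs @ beta)))   # log-odds linear in Xs

X_tr, X_te, y_tr, y_te = train_test_split(Xs, ys, test_size=0.5, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
mlp = MLPClassifier(hidden_layer_sizes=(5, 5), max_iter=200).fit(X_tr, y_tr)

for name, clf in [('logit', lr), ('mlp', mlp)]:
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(name, 'Gini =', 2 * auc - 1)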

Have you examined the output of the logistic model? What do the residuals look like? Some implementations (e.g., scikit-learn's LogisticRegression) automatically include an L2 penalty, which would probably help given the number of predictors; note that statsmodels' Logit.fit() does not apply a penalty by default.
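If you want to see whether a penalty matters, a quick cross-check is to fit an explicitly L2-penalized logistic regression (a sketch using scikit-learn, whose default penalty is L2; X_val and Y_val are placeholder validation arrays):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# C is the inverse regularization strength; smaller C means a stronger L2 penalty.
clf = LogisticRegression(penalty='l2', C=1.0, max_iter=1000).fit(X, Y)
print('Gini:', 2 * roc_auc_score(Y_val, clf.predict_proba(X_val)[:, 1]) - 1)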

I'm not sure what "categorized the categorical variables" means. If it's something like one-hot encoding, that is not something you need to do by hand for regression; the model implementation should handle categorical variables automatically.
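For example, with statsmodels' formula interface, C(...) dummy-codes a categorical variable for you (cat_var and num_var are hypothetical column names):

import statsmodels.formula.api as smf

# C(...) tells statsmodels to dummy-code the categorical variable itself
result = smf.logit('Target ~ C(cat_var) + num_var', data=data).fit()
print(result.summary())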

– Glen