I'm working on a project from Udacity's ML Nanodegree (Finding Donors).
I'm running an initial comparison of three algorithms:
LogisticRegression -> red
GaussianNB -> green
AdaBoostClassifier -> blue
This is the result I'm getting:
I wonder why Naive Bayes performs so poorly. Here is some information about the dataset:
1) Initial numerical features are not highly correlated:
2) There are categorical features that were one-hot encoded, which increases the number of features to around 100 and makes the dataset more sparse (a sketch of this step follows the list).
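For context, this is a minimal sketch of how the one-hot encoding can be reproduced with pandas.get_dummies; the file name census.csv, the 'income' target column, and get_dummies itself are assumptions on my part rather than necessarily my exact preprocessing:

import pandas as pd

# Assumed file name and target column; adjust to the actual project data.
data = pd.read_csv("census.csv")
income_raw = data['income']
features_raw = data.drop('income', axis=1)

# One-hot encode the categorical columns; numerical columns pass through unchanged.
features_final = pd.get_dummies(features_raw)
print("{} total features after one-hot encoding.".format(len(features_final.columns)))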
Edit:
I also tried using decision trees; these perform poorly too. The best case is an F-score of around 0.45 at depth 8, and beyond that depth they start showing high variance, which may explain why AdaBoost works well on this dataset, since one of its main advantages is its ability to reduce variance.
I still don't understand why NB and decision trees perform so poorly on this dataset compared with logistic regression, which is also a simple model.
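This is roughly how the depth observation can be reproduced; a minimal sketch that assumes the same X_train/y_train/X_test/y_test split used in the code below (the depth range and random_state are arbitrary choices of mine):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import fbeta_score

# Sweep max_depth and compare train vs. test F-scores to see where variance kicks in.
for depth in range(2, 16):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    f_train = fbeta_score(y_train, tree.predict(X_train), beta=0.5)
    f_test = fbeta_score(y_test, tree.predict(X_test), beta=0.5)
    print("depth={:2d}  f_train={:.3f}  f_test={:.3f}".format(depth, f_train, f_test))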
EDIT:
This is the code I have used (train_predict lives in a separate cell and is shown after the loop that calls it):
# Import the supervised learning models from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
# Imports used by train_predict below
from time import time
from sklearn.metrics import accuracy_score, fbeta_score
# Initialize the three models
clf_A = LogisticRegression(penalty='l2')
clf_B = GaussianNB()
clf_C = AdaBoostClassifier()
# Calculate the number of samples for 1%, 10%, and 100% of the training data
def get_sample_size(percentage):
    return int((float(percentage) / 100) * X_train.shape[0])
samples_1 = get_sample_size(1.0)
samples_10 = get_sample_size(10.0)
samples_100 = get_sample_size(100.0)
# Collect results on the learners
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = \
            train_predict(clf, samples, X_train, y_train, X_test, y_test)
# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results, accuracy, fscore)
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the size of samples (number) to be drawn from training set
       - X_train: features training set
       - y_train: income training set
       - X_test: features testing set
       - y_test: income testing set
    '''
    results = {}

    # Fit the learner to the training data, slicing with 'sample_size'
    start = time()  # Get start time
    learner.fit(X_train[:sample_size], y_train[:sample_size])
    end = time()  # Get end time

    # Calculate the training time
    results['train_time'] = end - start

    # Get the predictions on the test set (X_test),
    # then get predictions on the first 300 training samples (X_train)
    start = time()  # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time()  # Get end time

    # Calculate the total prediction time
    results['pred_time'] = end - start

    # Compute accuracy on the first 300 training samples
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)

    # Compute accuracy on the test set
    results['acc_test'] = accuracy_score(y_test, predictions_test)

    # Compute F-score on the first 300 training samples
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)

    # Compute F-score on the test set
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)

    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))

    # Return the results
    return results
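For what it's worth, this is how I would sanity-check that the low NB numbers come from the model itself rather than from the training/evaluation loop above; a minimal sketch assuming the same preprocessed X_train/y_train/X_test/y_test:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, fbeta_score

# Train and score GaussianNB in isolation on the full training set.
nb = GaussianNB()
nb.fit(X_train, y_train)
pred = nb.predict(X_test)
print("accuracy: {:.3f}".format(accuracy_score(y_test, pred)))
print("F0.5:     {:.3f}".format(fbeta_score(y_test, pred, beta=0.5)))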
The dataset is available here: