How to interpret stable and overlapping learning curves?

Question

I have a training data size of about 80k.

I plotted a learning curve to check how much of the training sample is required to train the model. However, the resulting training and cross-validation curves are stable and overlap each other.

From How to know if a learning curve from SVM model suffers from bias or variance?, I came to know two main points:

If the two curves are "close to each other" but both of them have a low score, the model suffers from an under-fitting problem (high bias).

But both curves have a high accuracy, so I am guessing it is not under-fitting.

If the training curve has a much better score than the testing curve, i.e., there is a large gap between the two curves, then the model suffers from an over-fitting problem (high variance).

It does not seem to be a problem of over-fitting either.

1) What can I infer from this graph? Is it normal for the curves to overlap each other?

2) What should I understand from this particular graph?

Edit: As suggested, I have started the iteration from a training sample size of 0 up to len(data).

The lines still overlap, though.

- I failed to mention that the data is highly skewed (80-20 imbalance). So I am guessing the model just predicts everything to be the majority class, and that is why the scores are high. I am not sure. Any suggestions?

@steffan: I have uploaded the training vector X (Train Vector X) and the respective target vector y (Train target y) as pickle files.

The code I have used is from the scikit-learn example:

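import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve, ShuffleSplit

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.01, 1.0, 5)):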

    plt.figure(figsize = (13,9))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

# raw string so that "\g" is not treated as an (invalid) escape sequence
title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"
# SVC is more expensive so we do a lower number of CV iterations:
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = SVC(kernel = 'rbf', C=10000, gamma=0.001, class_weight='balanced')
plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)
# X and y are the training vector and the target
plt.show()

The code is from here: scikit-learn example

I am not sure which score they are using in that code; I am sorry for my limited understanding here.

What is "score", which metric do you measure? Is it a binary classification problem? I should have asked that earlier ;). What is the minimum number of samples tested? The exact overlap at the far left looks fishy. Maybe you also want to add the code (yeah, we do not have access to the data (do we?), but let's check anyway) — mlwida, Mar 13 '18 at 10:05
Thanks for your reply. I have added the code and uploaded the train and target vector as pickle files in the question. I knew the graphs looked fishy, but I am really unable to figure out why :( — RPT, Mar 13 '18 at 11:13
I had to restart the download of the train vector, now the link does not work anymore ({"success":false,"error":404,"message":"Not Found"}). Is this a one-time link ? — mlwida, Mar 13 '18 at 11:34
Hi, I have changed the link. It is not a one-time link anymore. You will now be able to download it successfully — RPT, Mar 13 '18 at 11:42
@steffen, just making sure - was the download successful? — RPT, Mar 13 '18 at 13:51
yes, thank you. I had some trouble reading main_train, but [this](https://stackoverflow.com/questions/11305790/pickle-incompatibility-of-numpy-arrays-between-python-2-and-3) solved it. I am using python 3.6.3. — mlwida, Mar 13 '18 at 14:32
Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/74449/discussion-between-steffen-and-r-p-t). — mlwida, Mar 13 '18 at 17:15

mlwida · Accepted Answer · 2018-03-13T18:21:34.857

Unsurprisingly, the learning curve depends heavily on the capabilities of the learner, on the structure of the data set, and on the predictive power of its features.

It might be the case that there is only little variance in the combinations of feature values (predictors) and labels (response). In this case even a small sample can allow a capable learner to find all detectable patterns, resulting in an early high score. If not all patterns can be detected, no perfect score can be achieved.

Since the training score is slightly above the CV score at 10000 samples, I'd expect an even greater difference for fewer than 10000 samples, so I suggest testing that. If the overlap and the score remain (even for a small number of examples, let's say << 1000), you should double-check for an error (e.g. accidental row duplication, or an error in the CV calculation if you have done it yourself).
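
One way to look for accidental row duplication could be sketched as follows. This is only an illustration on toy data: `count_duplicate_rows` and `X_demo` are names made up here, and comparing rows for exact equality is an assumption that will miss near-duplicates.

```python
import numpy as np

def count_duplicate_rows(X):
    """Return how many rows of X repeat another row exactly (hypothetical helper)."""
    X = np.asarray(X)
    # np.unique with axis=0 collapses identical rows into one
    n_unique = np.unique(X, axis=0).shape[0]
    return X.shape[0] - n_unique

# Toy data (not the asker's X): rows 0 and 2 are identical
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [1.0, 2.0]])
print(count_duplicate_rows(X_demo))
```

A non-zero count on the real training matrix would not prove a bug, but it would be worth checking whether duplicated rows end up on both sides of the train/test split.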

Edit regarding class imbalance

The class distribution is

y
0    77623
1     5436

so guessing the majority class leads to a score of ~0.935, which is exactly what we see in the learning curve. By default, learning_curve uses the scoring function provided by the estimator, which is accuracy in the case of SVC (see the SVC documentation). The scoring function can be changed via the "scoring" parameter of learning_curve (see the learning_curve documentation); the parameter then also has to be passed through plot_learning_curve.
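
To make this concrete, here is a minimal sketch. It first recomputes the majority-class accuracy from the class counts above, then runs learning_curve on a classifier that always predicts the majority class, once with accuracy and once with balanced accuracy. The data (`X_demo`, `y_demo`, a 90/10 split) and the helper `mean_test_score` are hypothetical stand-ins, not the data from the question.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import ShuffleSplit, learning_curve

# Class counts quoted above: accuracy of always predicting class 0
counts = np.array([77623, 5436])
baseline_accuracy = counts.max() / counts.sum()
print(baseline_accuracy)  # about 0.935

# Hypothetical stand-in data with a 90/10 class split (not the real X, y)
rng = np.random.RandomState(0)
X_demo = rng.randn(600, 3)
y_demo = np.zeros(600, dtype=int)
y_demo[:60] = 1

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
# Stand-in for a model that has collapsed to the majority class
majority = DummyClassifier(strategy="most_frequent")

def mean_test_score(scoring):
    """Mean cross-validation score over all train sizes for a given metric."""
    _, _, test_scores = learning_curve(
        majority, X_demo, y_demo, cv=cv, scoring=scoring,
        train_sizes=np.linspace(0.3, 1.0, 3))
    return test_scores.mean()

acc = mean_test_score("accuracy")
bal_acc = mean_test_score("balanced_accuracy")
print(acc, bal_acc)
```

Under accuracy the majority-class predictor looks strong, while balanced accuracy (the mean of the per-class recalls) exposes it as no better than chance. Whether balanced accuracy, F1 or something else is the right metric depends on the application.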

Rerunning the code, but not going up to full train size (due to the time complexity of SVC)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
import pickle

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.01, 1.0, 5)):
    # calc curve first to avoid premature opening of figure
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # now create the plot
    plt.figure(figsize = (13,9))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")

    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt



# load the target vector; the context manager closes the file afterwards
with open('path-to-test', 'rb') as f:
    y = np.array(pickle.load(f))

# encoding='latin1' works around a Python 2 / Python 3 pickle incompatibility, see
# https://stackoverflow.com/questions/11305790/pickle-incompatibility-of-numpy-arrays-between-python-2-and-3
with open('path-to-train', 'rb') as f:
    X = np.array(pickle.load(f, encoding='latin1'))

gamma = 0.001
C = 10000
# raw f-string so that "\g" is not treated as an (invalid) escape sequence
title = rf"Learning Curves (SVM, RBF kernel, $\gamma={gamma}, C={C}$)"
# SVC is more expensive so we do a lower number of CV iterations:
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
estimator = SVC(kernel = 'rbf', C=C, gamma=gamma, class_weight='balanced')
plot_learning_curve(estimator, title, X, y, cv=cv, n_jobs=6, train_sizes=np.linspace(0.015,0.15,5))
plt.show()

leads to this graph

so there might be a code issue, maybe in the initial loading or preprocessing of X and y.

I did fail to mention one important point in the question: my data is highly skewed (80%-20% balance). I am guessing this is why the learning curve is not making much sense, because even if it predicts everything to be of the majority class, the scores are quite high. Am I right? — RPT, Mar 12 '18 at 19:52

score 2 · Answer 2 · answered Mar 12 '18 at 14:15

I would agree that, at first glance, there is neither over-fitting nor under-fitting.

One possibility is that the learning task is easy and the model did an excellent job.

Think about the linear regression case. Suppose both the training and the testing data are generated from

$$y=\beta^Tx+\epsilon$$

Then both sets have the same distribution, and we cannot achieve 100% "accuracy" (zero RMSE in the regression case) because of the irreducible error: the training and testing scores end up close to each other, but below a perfect score.
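
A small simulation of this situation, as a sketch under assumed values: the coefficients `beta`, the noise level `sigma` and the sample sizes are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([2.0, -1.0, 0.5])  # hypothetical true coefficients
sigma = 1.0                        # standard deviation of the noise term epsilon

def simulate(n):
    """Draw n samples from y = beta^T x + epsilon."""
    X = rng.normal(size=(n, beta.size))
    y = X @ beta + rng.normal(scale=sigma, size=n)
    return X, y

X_train, y_train = simulate(5000)
X_test, y_test = simulate(5000)

# Ordinary least squares fit on the training set only
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def rmse(X, y):
    """Root mean squared error of the fitted model on (X, y)."""
    return float(np.sqrt(np.mean((y - X @ beta_hat) ** 2)))

train_rmse = rmse(X_train, y_train)
test_rmse = rmse(X_test, y_test)
print(train_rmse, test_rmse)
```

Both errors should come out close to `sigma` and close to each other: the model is correctly specified, so the curves overlap, yet neither reaches zero error because the noise cannot be predicted.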

Haitao Du


I did fail to mention one important point in the question: my data is highly skewed (80%-20% balance). I am guessing this is why the learning curve is not making much sense, because even if it predicts everything to be of the majority class, the scores are quite high. Am I right? – RPT Mar 12 '18 at 19:51
