Is it correct to decrease the alpha parameter in sklearn SVM pipeline to improve performance?

Question

I'm trying to find the best parameters for a Fine-grained Sentiment Analysis of a dataset of movie reviews.

This is the current code:

class SVMSentiment(Base):
    """Predict sentiment scores using a linear Support Vector Machine (SVM).
    Uses a sklearn pipeline.
    """
    def __init__(self, model_file: str=None) -> None:
        super().__init__()
        # pip install scikit-learn
        from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
        from sklearn.linear_model import SGDClassifier
        from sklearn.svm import LinearSVC
        from sklearn.pipeline import Pipeline
        self.pipeline = Pipeline(
            [
 

                ('clf', SGDClassifier(
                    loss='hinge',
                    penalty='l2',
                    alpha=1e-4,
                    random_state=42,
                    max_iter=100,
                    learning_rate='optimal',
                    tol=None,

                )),
            ]
        )

    def predict(self, train_file: str, test_file: str, lower_case: bool) -> pd.DataFrame:
        "Train the model on the training data, then predict labels for the test data"
        train_df = self.read_data(train_file, lower_case)
        learner = self.pipeline.fit(train_df['text'], train_df['truth'])
        # Apply the fitted learner to the test data
        test_df = self.read_data(test_file, lower_case)
        test_df['pred'] = learner.predict(test_df['text'])
        return test_df

With alpha = 1e-4, accuracy improves by about 0.5 percentage points. Is this a legitimate change, and if so, why does it help? I have seen online that the default value is 1e-3.

score 1 · Accepted Answer · answered Sep 03 '20 at 15:55

In SGDClassifier, alpha is the regularization parameter: it multiplies the penalty term, so a larger alpha means stronger regularization. Note that scikit-learn's documented default for SGDClassifier is 1e-4, not 1e-3, so the value you switched to is the library default. Either way, a default value is rarely optimal in practice: you should try several values of alpha and compare them with cross-validation to get a better score. Be careful to use cross-validation or a similar technique to confirm that the new value really is better than the old one.
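As a minimal sketch of trying several values of alpha with cross-validation, the snippet below runs GridSearchCV over a text pipeline. The toy corpus, the word lists, the `vect`/`tfidf` steps, and the grid of alpha values are all assumptions made for illustration; they are not taken from the question's data or pipeline.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for train_df['text'] / train_df['truth'].
rnd = random.Random(0)
POSITIVE = ["great", "wonderful", "moving", "funny", "brilliant"]
NEGATIVE = ["boring", "awful", "dull", "clumsy", "tedious"]
NEUTRAL = ["film", "plot", "actor", "scene", "story", "ending"]


def make_review(sentiment_words):
    """Build a short fake review from two sentiment words and four neutral ones."""
    words = rnd.choices(sentiment_words, k=2) + rnd.choices(NEUTRAL, k=4)
    rnd.shuffle(words)
    return " ".join(words)


texts = [make_review(POSITIVE) for _ in range(100)] + [make_review(NEGATIVE) for _ in range(100)]
labels = [1] * 100 + [0] * 100

# Assumed pipeline: bag of words -> TF-IDF -> linear SVM trained with SGD.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", random_state=42,
                          max_iter=100, tol=None)),
])

# Log-spaced candidate values; "clf__alpha" targets the alpha of the "clf" step.
param_grid = {"clf__alpha": [1e-5, 1e-4, 1e-3, 1e-2]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)

print("best alpha:", search.best_params_["clf__alpha"])
print("best mean CV accuracy:", round(search.best_score_, 3))
```

On real data you would pass the training texts and labels in place of `texts` and `labels`, and keep the test file untouched until alpha has been chosen.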

In short, the regularization parameter sets the balance between overfitting and underfitting. Increasing it gives a smoother classifier. Depending on the problem at hand, the target may be very hard to approximate with a smooth function, or conversely you may need more smoothing to avoid overfitting. For more information on the choice of alpha in particular, see What exactly is overfitting?. In general, default hyperparameters are reasonable in many cases, but keep in mind that moving away from them can improve performance substantially.
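One rough way to see what "smoother" means for a linear model is to look at the norm of the learned weight vector: stronger L2 regularization pulls the weights toward zero. The sketch below assumes synthetic numeric data from `make_classification` as a stand-in for vectorized reviews, and the three alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data (assumption): 500 samples, 20 numeric features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

weight_norms = {}
for alpha in (1e-5, 1e-3, 1e-1):
    clf = SGDClassifier(loss="hinge", penalty="l2", alpha=alpha,
                        random_state=42, max_iter=100, tol=None)
    clf.fit(X, y)
    # The L2 norm of the coefficients is a simple proxy for model complexity.
    weight_norms[alpha] = float(np.linalg.norm(clf.coef_))
    print(f"alpha={alpha:g}: ||w|| = {weight_norms[alpha]:.3f}, "
          f"train accuracy = {clf.score(X, y):.3f}")
```

A small weight norm alone does not say which alpha generalizes best; that still has to be checked on held-out data.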

Thank you! So in this case, I read that an SVM has a low risk of overfitting because it uses L2 regularization. Since it is just a small increase (but still significant), should I go with it without worrying? — Anna, Sep 03 '20 at 16:30
For a definite answer you need to split your dataset into a training set and a validation set, or use cross-validation. L2 regularization does not by itself prevent overfitting; you can still overfit. I don't know how you measured the 0.5 percent accuracy improvement. If you used cross-validation or a held-out split, you don't have to worry; it should hold. If not, I would advise cross-validation, as it is the easiest and most practical option in scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html — TMat, Sep 03 '20 at 16:57
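
A hedged sketch of the check suggested in the comment above: score both alpha values with `cross_val_score` on the same folds and look at the per-fold differences, not at a single number. The synthetic three-class data is an assumption standing in for the vectorized reviews; with the real pipeline you would pass the pipeline object and the training texts and labels instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): 600 samples, 3 classes.
X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           n_classes=3, random_state=0)

# A fixed random_state makes both runs use identical folds, so scores are paired.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def cv_accuracy(alpha):
    """Return the per-fold accuracy of a linear SVM trained with the given alpha."""
    clf = SGDClassifier(loss="hinge", penalty="l2", alpha=alpha,
                        random_state=42, max_iter=100, tol=None)
    return cross_val_score(clf, X, y, cv=folds, scoring="accuracy")


baseline = cv_accuracy(1e-3)
candidate = cv_accuracy(1e-4)
diff = candidate - baseline

print("alpha=1e-3 per-fold accuracy:", baseline.round(3))
print("alpha=1e-4 per-fold accuracy:", candidate.round(3))
print(f"mean difference: {diff.mean():+.4f} (std across folds {diff.std():.4f})")
```

If the mean difference is small compared with its spread across folds, a gain of about 0.5 points may just be noise.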
