Questions tagged [scikit-learn]

A machine-learning library for Python. Use this tag for any on-topic question that (a) involves scikit-learn either as a critical part of the question or the expected answer, and (b) is not just about how to use scikit-learn.

A machine learning framework for Python.

scikit-learn is a machine-learning library for Python that provides simple and efficient tools for data analysis and data mining. It is accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib. The project is open source and commercially usable (BSD license).

1653 questions
75
votes
1 answer

How to split the dataset for cross validation, learning curve, and final evaluation?

What is an appropriate strategy for splitting the dataset? I ask for feedback on the following approach (not on the individual parameters like test_size or n_iter, but on whether I used X, y, X_train, y_train, X_test, and y_test appropriately and whether the…
tobip
  • 1,450
  • 4
  • 14
  • 11
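A minimal sketch of the splitting strategy the question describes, with a generic estimator and the bundled iris data as placeholders (neither is from the question): hold out a test set once for the final evaluation, and run cross-validation only on the remaining training portion.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Final evaluation set: held out once, never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Model selection and learning curves use only the training portion.
model = SVC(kernel="linear", C=1.0)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Only after settling on the model: refit on all training data, score once on the test set.
final_score = model.fit(X_train, y_train).score(X_test, y_test)
print(cv_scores.mean(), final_score)
```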
73
votes
3 answers

One-hot vs dummy encoding in Scikit-learn

There are two different ways of encoding categorical variables. Say one categorical variable has n values: one-hot encoding converts it into n variables, while dummy encoding converts it into n-1 variables. If we have k categorical variables, each…
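For concreteness, a small sketch of the two encodings on an invented color column; in scikit-learn, OneHotEncoder(drop="first") reproduces dummy coding (the sparse_output argument assumes scikit-learn >= 1.2; older releases use sparse=False instead).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

one_hot = pd.get_dummies(df["color"])                  # n columns
dummy = pd.get_dummies(df["color"], drop_first=True)   # n - 1 columns

# scikit-learn equivalent: drop=None gives one-hot, drop="first" gives dummy coding.
enc = OneHotEncoder(drop="first", sparse_output=False)
dummy_sk = enc.fit_transform(df[["color"]])
print(one_hot.shape, dummy.shape, dummy_sk.shape)
```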
60
votes
5 answers

How does one interpret SVM feature weights?

I am trying to interpret the variable weights given by fitting a linear SVM (I'm using scikit-learn): from sklearn import svm; svm = svm.SVC(kernel='linear'); svm.fit(features, labels); svm.coef_. I cannot find anything in the documentation that…
Austin Richardson
  • 928
  • 1
  • 8
  • 10
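A minimal sketch of pulling out and ranking the weights on synthetic data: for a binary problem, coef_ holds one weight per feature, and it is the magnitude (not the raw value) that indicates how strongly a feature influences the decision, with the sign indicating which class it pushes towards.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SVC(kernel="linear")
clf.fit(X, y)

# coef_ has shape (1, n_features) for a binary problem: one weight per feature.
weights = clf.coef_.ravel()
ranking = np.argsort(np.abs(weights))[::-1]  # most influential features first
print(weights, ranking)
```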
56
votes
3 answers

Logistic Regression: Scikit Learn vs Statsmodels

I am trying to understand why the logistic regression output from these two libraries gives different results. I am using the dataset from the UCLA idre tutorial, predicting admit based on gre, gpa and rank. rank is treated as a categorical variable,…
hurrikale
  • 853
  • 1
  • 8
  • 7
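The usual culprit is regularization: scikit-learn's LogisticRegression applies an L2 penalty by default, while statsmodels' Logit fits the plain maximum-likelihood model. A small sketch on synthetic data (not the UCLA set from the question) showing the coefficients line up once the penalty is effectively switched off:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Effectively unregularized scikit-learn fit (very large C weakens the L2 penalty).
sk_coefs = LogisticRegression(C=1e9).fit(X, y).coef_.ravel()

# statsmodels needs the intercept added explicitly; skip it when comparing slopes.
sm_coefs = sm.Logit(y, sm.add_constant(X)).fit(disp=0).params[1:]

print(np.round(sk_coefs, 4))
print(np.round(sm_coefs, 4))
```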
53
votes
2 answers

Pandas / Statsmodels / Scikit-learn

Are Pandas, Statsmodels and Scikit-learn different implementations of machine learning/statistical operations, or are these complementary to one another? Which of these has the most comprehensive functionality? Which one is actively developed…
Nik
  • 1,279
  • 2
  • 13
  • 19
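In practice the three are complementary rather than competing: pandas supplies the data structures, statsmodels the inferential summaries, and scikit-learn the prediction-oriented API. A tiny illustrative sketch with invented data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"x": np.arange(20, dtype=float)})
df["y"] = 2.0 * df["x"] + np.random.default_rng(0).normal(size=20)

# statsmodels: coefficients with standard errors, p-values, R^2, etc.
print(smf.ols("y ~ x", data=df).fit().summary())

# scikit-learn: the same fit, but oriented towards predict/score on new data.
model = LinearRegression().fit(df[["x"]], df["y"])
print(model.coef_, model.score(df[["x"]], df["y"]))
```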
44
votes
1 answer

What do the numbers in sklearn's classification report mean?

Below is an example I pulled from sklearn's sklearn.metrics.classification_report documentation. What I don't understand is why there are f1-score, precision and recall values for each class, where I believe class is the predictor label. I…
jxn
  • 749
  • 2
  • 7
  • 15
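A short sketch with made-up labels, for reference: the report prints one row per true class (not per predictor), with precision, recall, F1 and support computed one-vs-rest for that class.

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 2, 1, 0, 2]
y_pred = [0, 0, 2, 2, 1, 1, 0, 2]

# For each class c: precision = TP / (TP + FP), recall = TP / (TP + FN),
# f1 = harmonic mean of the two, support = number of true samples of class c.
print(classification_report(y_true, y_pred,
                            target_names=["class 0", "class 1", "class 2"]))
```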
43
votes
2 answers

Mean absolute percentage error (MAPE) in Scikit-learn

How can we calculate the mean absolute percentage error (MAPE) of our predictions using Python and scikit-learn? From the docs, we have only these 4 metric functions for regression: metrics.explained_variance_score(y_true,…
Nyxynyx
  • 885
  • 3
  • 9
  • 15
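Older scikit-learn releases indeed ship no MAPE metric, so it is usually written by hand; recent versions (>= 0.24) also provide sklearn.metrics.mean_absolute_percentage_error. A hand-rolled sketch:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; assumes no zero targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(mape([100, 200, 300], [110, 190, 330]))  # ~8.33
```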
43
votes
2 answers

Area under Precision-Recall Curve (AUC of PR-curve) and Average Precision (AP)

Is Average Precision (AP) the Area under the Precision-Recall Curve (AUC of PR-curve)? EDIT: here is some comment about the difference between PR AUC and AP. The AUC is obtained by trapezoidal interpolation of the precision. An alternative and usually…
mrgloom
  • 1,687
  • 4
  • 25
  • 33
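A small sketch contrasting the two quantities on synthetic scores: average_precision_score uses the step-wise summation described in the docs, while auc(recall, precision) applies trapezoidal interpolation, so the two numbers are close but generally not identical.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = y_true * 0.5 + rng.normal(scale=0.5, size=200)

# Step-wise summation over the PR curve.
ap = average_precision_score(y_true, y_score)

# Trapezoidal area under the same curve.
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)

print(ap, pr_auc)
```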
40
votes
4 answers

Polynomial regression using scikit-learn

I am trying to use scikit-learn for polynomial regression. From what I read, polynomial regression is a special case of linear regression. I was hoping that maybe one of scikit's generalized linear models can be parameterised to fit higher order…
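Since polynomial regression is linear regression on expanded features, the usual recipe is PolynomialFeatures followed by LinearRegression; a minimal sketch with an arbitrary degree and toy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X.ravel() ** 3 - X.ravel() + rng.normal(scale=0.5, size=100)

# PolynomialFeatures builds [1, x, x^2, x^3]; the model stays linear in those columns.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
print(model.predict([[2.0]]))
```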
35
votes
4 answers

Ensemble of different kinds of regressors using scikit-learn (or any other python framework)

I am trying to solve a regression task. I found out that 3 models are working nicely for different subsets of data: LassoLARS, SVR and Gradient Tree Boosting. I noticed that when I make predictions using all three models and then make a table of…
Maksim Khaitovich
  • 658
  • 1
  • 7
  • 12
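A minimal sketch of one way to combine heterogeneous regressors: let a linear meta-model learn how much to trust each base model via StackingRegressor (available in scikit-learn >= 0.22; VotingRegressor or a plain average of predictions are simpler alternatives). The data and base models are placeholders echoing the question.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LassoLarsCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The meta-model (RidgeCV) is trained on out-of-fold predictions of the base models.
ensemble = StackingRegressor(
    estimators=[("lasso", LassoLarsCV()),
                ("svr", SVR()),
                ("gbt", GradientBoostingRegressor(random_state=0))],
    final_estimator=RidgeCV())
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```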
32
votes
2 answers

PCA in numpy and sklearn produces different results

Am I misunderstanding something? This is my code using sklearn: import numpy as np; import matplotlib.pyplot as plt; from mpl_toolkits.mplot3d import Axes3D; from sklearn import decomposition; from sklearn import datasets; from sklearn.preprocessing…
aceminer
  • 813
  • 1
  • 9
  • 20
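A common source of the mismatch is centering and the arbitrary sign of each component. A minimal sketch, using iris as a stand-in dataset, that reproduces scikit-learn's components from numpy's SVD up to sign flips:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
Xc = X - X.mean(axis=0)          # PCA assumes centered data

# numpy route: SVD of the centered matrix; rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
np_components = Vt[:2]

# scikit-learn route.
sk_components = PCA(n_components=2).fit(X).components_

# Equal up to a per-component sign flip.
print(np.allclose(np.abs(np_components), np.abs(sk_components)))
```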
28
votes
3 answers

XGBoost vs Python Sklearn gradient boosted trees

I am trying to understand how XGBoost works. I already understand how gradient boosted trees work in Python sklearn. What is not clear to me is whether XGBoost works the same way, but faster, or if there are fundamental differences between it and the…
Fairly Nerdy
  • 877
  • 1
  • 8
  • 16
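Both implement gradient boosted trees; XGBoost adds explicit regularization on the leaf weights and a second-order approximation of the loss, on top of heavy engineering for speed. A side-by-side sketch with roughly analogous parameters, assuming the separate xgboost package is installed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Roughly comparable settings for the two implementations.
sk_gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)

for model in (sk_gbt, xgb):
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))
```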
27
votes
4 answers

Multilabel classification metrics on scikit

I am trying to build a multi-label classifier to assign topics to existing documents using scikit. I am processing my documents by passing them through the TfidfVectorizer and the labels through the MultiLabelBinarizer, and created a…
mobius
  • 271
  • 1
  • 3
  • 7
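A compact sketch of the evaluation step with invented documents and topics: binarize the label sets, predict an indicator matrix, then choose an averaging mode (micro, macro, samples) for the metrics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, hamming_loss
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["price of oil rises", "new python release",
        "oil companies adopt python scripts", "stock markets fall"]
labels = [["economy"], ["tech"], ["economy", "tech"], ["economy"]]

X = TfidfVectorizer().fit_transform(docs)
Y = MultiLabelBinarizer().fit_transform(labels)   # documents x topics indicator matrix

clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
Y_pred = clf.predict(X)

# "micro" pools all label decisions; "macro" averages per-label scores.
print(f1_score(Y, Y_pred, average="micro"),
      f1_score(Y, Y_pred, average="macro"),
      hamming_loss(Y, Y_pred))
```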
27
votes
3 answers

How to systematically remove collinear variables (pandas columns) in Python?

Thus far, I have removed collinear variables as part of the data preparation process by looking at correlation tables and eliminating variables that are above a certain threshold. Is there a more accepted way of doing this? Additionally, I am aware…
orange1
  • 557
  • 1
  • 4
  • 9
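One more systematic option than eyeballing a correlation table is to iteratively drop the column with the highest variance inflation factor (VIF), computed here with statsmodels; the threshold of 5 is a common but arbitrary choice, and the DataFrame is invented.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = df["a"] * 0.95 + rng.normal(scale=0.1, size=100)   # nearly collinear with a
df["c"] = rng.normal(size=100)

def drop_collinear(X, threshold=5.0):
    """Iteratively drop the column with the largest VIF until all are below threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns)
        if vifs.max() <= threshold:
            break
        X = X.drop(columns=vifs.idxmax())   # drop the worst offender and re-check
    return X

print(drop_collinear(df).columns.tolist())
```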
26
votes
2 answers

Why is Python's scikit-learn LDA not working correctly and how does it compute LDA via SVD?

I was using the Linear Discriminant Analysis (LDA) from the scikit-learn machine learning library (Python) for dimensionality reduction and was a little bit curious about the results. I am wondering now what the LDA in scikit-learn is doing so that…
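One quick sanity check, sketched below on iris as a stand-in dataset: the 'svd' and 'eigen' solvers should project onto the same subspace, with each component agreeing up to sign and scale, which is often where apparently "incorrect" results come from.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

Z_svd = LinearDiscriminantAnalysis(solver="svd", n_components=2).fit_transform(X, y)
Z_eig = LinearDiscriminantAnalysis(solver="eigen", n_components=2).fit_transform(X, y)

# Per-component correlation should be close to 1 even if sign and scale differ.
for i in range(2):
    print(abs(np.corrcoef(Z_svd[:, i], Z_eig[:, i])[0, 1]))
```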