Python library for data manipulation, implementing R-style data frames.
Questions tagged [pandas]
151 questions
53 votes, 2 answers
Pandas / Statsmodel / Scikit-learn
Are Pandas, Statsmodels and Scikit-learn different implementations of machine learning/statistical operations, or are these complementary to one another?
Which of these has the most comprehensive functionality?
Which one is actively developed…

Nik
21 votes, 1 answer
What does (pandas) autocorrelation graph show?
I am a beginner and I am trying to understand what an autocorrelation graph shows.
I have read several explanations from different sources such as this page or the related Wikipedia page among others that I am not citing here.
I have this very…

Koray Tugay
13 votes, 1 answer
Using iloc to set values
This line returns the first 4 rows of column feature_a in the dataframe combined:
combined.iloc[0:4]["feature_a"]
As expected, this next line returns the 2nd, 4th, and 16th rows in the dataframe for column…

Doug7
12 votes, 5 answers
How do I interpret this Scatter Plot?
I have a scatter plot with sample size (the number of people) on the x axis and median salary on the y axis. I am trying to find out whether the sample size has any effect on the median salary.
This is the plot:
How do I interpret this plot…

Sameed
10 votes, 1 answer
How to calculate mutual information?
I am a bit confused. Can someone explain to me how to calculate mutual information between two terms based on a term-document matrix with binary term occurrence as weights?
$$
\begin{matrix}
& 'Why' & 'How' & 'When' & 'Where' \\
Document1…

user18075
9 votes, 3 answers
What is the 'no information rate' algorithm?
I plan to implement the 'no information rate' as part of summary statistics. This statistic is implemented in R (Optimise SVM to avoid false-negative in binary classification) but not in Python (at least I cannot find a reference).
Is there a…

blue-sky
8 votes, 1 answer
Incremental learning with decision trees (scikit-learn)
I'm trying to train a regression tree on a very large dataset: approximately 3 TB.
I'm using scikit-learn, and of course there is no way I can load that amount of data into memory.
Doing some online research I found that some scikit-learn algorithms…

Ambesh
7 votes, 1 answer
Multi-Seasonal Time Series function in Python
I am wondering whether Python has any implementations of multi-seasonal time series, like the msts method in the forecast package for R.
So far, I have found only the following packages for time series in Python:
FBprophet
TSFresh (feature engineering for…

SpanishBoy
7 votes, 2 answers
matched pairs in Python (Propensity score matching)
Is there a function in Python to create a matched-pairs dataset?
e.g.
df_matched = construct_matched_pairs(df_users_who_did_something,
df_all_other_users,
…

volodymyr
7 votes, 1 answer
Inferring likely dates based on other related dates in incomplete data set
I'm taking my first steps in data science and machine learning. I'm experimenting with a project where I have no idea even what approaches I might start with, so I'd appreciate any leads:
I have a dataset (for explanation's sake) of student…

somewhatoff
5 votes, 1 answer
Method for a hypothesis testing non normal distribution number of retweets
I am a stats beginner. Using pandas, I am analysing a small dataset. There are 60 data points, 22 of which are from Group A and 38 from Group B. The dataset is made up of the number of retweets gained by a single tweet. The Null Hypothesis is…

five_inshallah's
4 votes, 2 answers
Getting very large coefficients from linear regression
I'm currently looking at rates for a study that vary between 0 and 100 with most of the rates falling between 0 and 1. I am running a linear regression on 70 dummy variables (coded 0-1) and nearly 100,000 lines of observations. When I run the…

Michael Paolucci
4 votes, 1 answer
Engineering features using pandas
I have a dataframe with around 37,000 rows and 54 columns. Of these 54 columns, two, namely 'user_id' and 'mail_id', are provided in a very creepy format, as shown below:
user_id mail_id …

enterML
4 votes, 1 answer
GridSearchCV and KFold
I noticed that in some cases, a GridSearchCV is applied on the output of KFold. For example, like in the code below. Why is it needed? I thought that something equivalent to KFold is already applied as part of GridSearchCV, by specifying the…

Alex
3 votes, 1 answer
What are good methods to deal with outliers when calculating mean of data?
I have a dataframe with the yearly energy use of buildings over 5 years. To get a representative yearly energy use for data modelling, I have to take the mean of those data. As the data can contain outliers, I want to deal with outliers…

Matthi9000