Highest Voted 'pandas' Questions - Statistical Analysis Stack Exchange

53 votes, 2 answers

Pandas / Statsmodels / Scikit-learn

Are Pandas, Statsmodels and Scikit-learn different implementations of machine learning/statistical operations, or are these complementary to one another? Which of these has the most comprehensive functionality? Which one is actively developed…

asked Jan 17 '13 at 01:02

Nik


21 votes, 1 answer

What does (pandas) autocorrelation graph show?

I am a beginner and I am trying to understand what an autocorrelation graph shows. I have read several explanations from different sources such as this page or the related Wikipedia page among others that I am not citing here. I have this very…

python autocorrelation pandas

asked Jul 16 '18 at 00:21

Koray Tugay


13 votes, 1 answer

Using iloc to set values

This line returns the first 4 rows in the dataframe combined for feature_a: combined.iloc[0:4]["feature_a"]. As expected, this next line returns the 2nd, 4th, and 16th rows in the dataframe for column…

python pandas

asked Jun 04 '17 at 22:48

Doug7


12 votes, 5 answers

How do I interpret this Scatter Plot?

I have a scatter plot with sample size (the number of people) on the x-axis and median salary on the y-axis. I am trying to find out whether the sample size has any effect on the median salary. This is the plot: How do I interpret this plot…

data-visualization median scatterplot pandas

asked Sep 05 '17 at 04:59

Sameed


10 votes, 1 answer

How to calculate mutual information?

I am a bit confused. Can someone explain to me how to calculate mutual information between two terms based on a term-document matrix with binary term occurrence as weights? $$ \begin{matrix} & 'Why' & 'How' & 'When' & 'Where' \\ Document1…

python information-theory mutual-information numpy pandas

asked Dec 29 '12 at 14:29

user18075


9 votes, 3 answers

What is 'no information rate' algorithm?

I plan to implement the 'no information rate' as part of summary statistics. This statistic is implemented in R (Optimise SVM to avoid false-negative in binary classification) but not in Python (at least I cannot find a reference). Is there a…

machine-learning python model confusion-matrix pandas

asked May 19 '17 at 17:14

blue-sky


8 votes, 1 answer

Incremental learning with decision trees (scikit-learn)

I'm trying to train a regression tree with some very large data I have: approximately 3 TB. I'm using scikit-learn, and of course there is no way I can load that amount of data into memory. Doing some online research, I found that some scikit-learn algorithms…

python cart scikit-learn pandas ram

asked Sep 26 '17 at 12:25

Ambesh


7 votes, 1 answer

Multi-Seasonal Time Series function in Python

I am wondering if Python has any implementation of multi-seasonal time series like the msts method in the forecast library for R. So far, I have found only the following packages for time series in Python: FBprophet, TSFresh (feature engineering for…

r time-series python arima pandas

asked Apr 12 '17 at 17:29

SpanishBoy


7 votes, 2 answers

matched pairs in Python (Propensity score matching)

Is there a function in Python to create a matched pairs dataset? e.g. df_matched = construct_matched_pairs(df_users_who_did_something, df_all_other_users, …

python matching propensity-scores pandas

asked Apr 12 '16 at 10:26

volodymyr


7 votes, 1 answer

Inferring likely dates based on other related dates in incomplete data set

I'm taking my first steps in data science and machine learning. I'm experimenting with a project where I have no idea even what approaches I might start with, so I'd appreciate any leads: I have a dataset (for explanation's sake) of student…

machine-learning time-series survival interval-censoring pandas

asked Dec 29 '15 at 16:15

somewhatoff


5 votes, 1 answer

Hypothesis testing method for a non-normally distributed number of retweets

I am a stats beginner. Using pandas, I am analysing a small dataset. There are 60 data points, 22 of which are from Group A and 38 from Group B. The dataset is made up of the number of retweets gained by a single tweet. The null hypothesis is…

hypothesis-testing python pandas

asked Aug 30 '15 at 15:58

five_inshallah's


4 votes, 2 answers

Getting very large coefficients from linear regression

I'm currently looking at rates for a study that vary between 0 and 100 with most of the rates falling between 0 and 1. I am running a linear regression on 70 dummy variables (coded 0-1) and nearly 100,000 lines of observations. When I run the…

regression python pandas

asked Aug 09 '19 at 21:41

Michael Paolucci


4 votes, 1 answer

Engineering features using pandas

I have a dataframe with around 37,000 rows and 54 columns. Out of these 54 columns, two columns, namely 'user_id' and 'mail_id', are provided in a very creepy format, as shown below: user_id mail_id …

categorical-data feature-engineering pandas

asked Aug 31 '16 at 18:18

enterML


4 votes, 1 answer

GridSearchCV and KFold

I noticed that in some cases GridSearchCV is applied to the output of KFold, as in the code below. Why is it needed? I thought that something equivalent to KFold is already applied as part of GridSearchCV, by specifying the…

machine-learning cross-validation python pandas

asked Apr 18 '16 at 03:07

Alex


3 votes, 1 answer

What are good methods to deal with outliers when calculating mean of data?

I have a dataframe with yearly energy uses of buildings over 5 years. In order to have a representative yearly energy use for data modelling, I'll have to take the mean of those data. As the data can contain outliers, I want to deal with outliers…

python mean outliers pandas

asked May 30 '20 at 20:12

Matthi9000

