Questions tagged [pandas]

Python library for data manipulation, implementing R-style data frames.

151 questions
53
votes
2 answers

Pandas / Statsmodel / Scikit-learn

Are Pandas, Statsmodels and Scikit-learn different implementations of machine learning/statistical operations, or are these complementary to one another? Which of these has the most comprehensive functionality? Which one is actively developed…
Nik
21
votes
1 answer

What does (pandas) autocorrelation graph show?

I am a beginner and I am trying to understand what an autocorrelation graph shows. I have read several explanations from different sources such as this page or the related Wikipedia page among others that I am not citing here. I have this very…
Koray Tugay
13
votes
1 answer

Using iloc to set values

This line returns the first 4 rows in the dataframe combined for feature_a: combined.iloc[0:4]["feature_a"]. As expected, this next line returns the 2nd, 4th, and 16th rows in the dataframe for column…
Doug7
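A minimal sketch of setting values by position, related to the iloc question above. The frame contents and the value -1 are hypothetical; only the name combined, the column feature_a, and the row positions come from the excerpt. It assumes the usual pandas advice that a single .iloc call with both row and column positions is the reliable way to assign, since chained indexing may write to a temporary copy.

```python
import pandas as pd

# Hypothetical stand-in for the question's `combined` dataframe.
combined = pd.DataFrame({"feature_a": range(20), "feature_b": range(20, 40)})

# Reading: positional row slice, then column selection (as in the excerpt).
first_four = combined.iloc[0:4]["feature_a"]

# Writing: give row positions and the column's integer position in ONE
# .iloc call, so the assignment targets the original frame.
col_pos = combined.columns.get_loc("feature_a")
combined.iloc[[1, 3, 15], col_pos] = -1  # 2nd, 4th and 16th rows
```

Looking up the column position with columns.get_loc keeps the indexing purely positional while still referring to the column by name.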
12
votes
5 answers

How do I interpret this Scatter Plot?

I have a scatter plot with sample size (the number of people) on the x axis and median salary on the y axis. I am trying to find out whether the sample size has any effect on the median salary. This is the plot: How do I interpret this plot…
Sameed
10
votes
1 answer

How to calculate mutual information?

I am a bit confused. Can someone explain to me how to calculate mutual information between two terms based on a term-document matrix with binary term occurrence as weights? $$ \begin{matrix} & 'Why' & 'How' & 'When' & 'Where' \\ Document1…
user18075
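A rough sketch of the calculation asked about in the mutual information question above, assuming each term is represented as a binary occurrence vector over the documents. The function name and the example vectors are hypothetical; they are not taken from the question's (truncated) matrix. It uses the standard definition, the sum over all value pairs of p(x, y) · log2(p(x, y) / (p(x) · p(y))), with probabilities estimated from document counts.

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete vectors."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))   # counts of each (x, y) value pair
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # Pairs that never occur contribute nothing, so they are simply skipped.
        mi += p_ab * log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

# Hypothetical occurrence of two terms across four documents.
term_a = [1, 1, 0, 0]
term_b = [1, 0, 1, 0]
mi_ab = mutual_information(term_a, term_b)
```

Two terms that always co-occur give a positive value, while terms whose occurrences are statistically independent give zero.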
9
votes
3 answers

What is the 'no information rate' algorithm?

I plan to implement 'no information rate' as part of summary statistics. This statistic is implemented in R (Optimise SVM to avoid false-negative in binary classification) but not in Python (at least I cannot find a reference). Is there a…
blue-sky
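A minimal sketch for the question above, assuming the usual definition of the no information rate: the proportion of the most frequent class, i.e. the accuracy obtained by always predicting the majority class. The function name is hypothetical and this is not a port of any particular R implementation.

```python
from collections import Counter

def no_information_rate(labels):
    """Share of the most frequent class among the observed labels."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / sum(counts.values())
```

A classifier is usually only considered useful if its accuracy clearly exceeds this baseline.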
8
votes
1 answer

Incremental learning with decision trees (scikit-learn)

I'm trying to train a regression tree with some very large data I have: approx. 3 TB. I'm using scikit-learn and of course there is no way I can load that amount of data into memory. Doing some online research I found that some scikit-learn algorithms…
Ambesh
7
votes
1 answer

Multi-Seasonal Time Series function in Python

I am wondering if Python has any implementations of multi-seasonal time series like the msts method in the forecast library for R. So far, I have found only the following packages for time series in Python: FBprophet, TSFresh (feature engineering for…
SpanishBoy
7
votes
2 answers

matched pairs in Python (Propensity score matching)

Is there a function in Python to create a matched pairs dataset? e.g. df_matched = construct_matched_pairs(df_users_who_did_something, df_all_other_users, …
volodymyr
7
votes
1 answer

Inferring likely dates based on other related dates in incomplete data set

I'm taking my first steps in data science and machine learning. I'm experimenting with a project where I have no idea even what approaches I might start with, so I'd appreciate any leads: I have a dataset (for explanation's sake) of student…
5
votes
1 answer

Method for a hypothesis testing non normal distribution number of retweets

I am a stats beginner. Using pandas, I am analysing a small dataset. There are 60 data points, 22 of which are from Group A and 38 from Group B. The dataset is made up of the number of retweets gained by a single tweet. The null hypothesis is…
4
votes
2 answers

Getting very large coefficients from linear regression

I'm currently looking at rates for a study that vary between 0 and 100 with most of the rates falling between 0 and 1. I am running a linear regression on 70 dummy variables (coded 0-1) and nearly 100,000 lines of observations. When I run the…
4
votes
1 answer

Engineering features using pandas

I have a dataframe with around 37,000 rows and 54 columns. Of these 54 columns, two columns, namely 'user_id' and 'mail_id', are provided in a very creepy format as shown below: user_id mail_id …
enterML
4
votes
1 answer

GridSearchCV and KFold

I noticed that in some cases, a GridSearchCV is applied on the output of KFold. For example, like in the code below. Why is it needed? I thought that something equivalent to KFold is already applied as part of GridSearchCV, by specifying the…
Alex
3
votes
1 answer

What are good methods to deal with outliers when calculating mean of data?

I have a dataframe with yearly energy uses of buildings over 5 years. In order to have a representative yearly energy use for data modelling, I'll have to take the mean of those data. As the data can contain outliers, I want to deal with outliers…
Matthi9000
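One possible starting point for the outlier question above, sketched with the standard library only: compare the plain mean against the median and a trimmed mean. The yearly figures and the trim fraction are hypothetical, and whether trimming suits real energy data is a judgment call rather than a recommendation.

```python
from statistics import mean, median

def trimmed_mean(values, trim_fraction=0.2):
    """Mean after dropping the lowest and highest `trim_fraction` of values."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # values to drop at each end
    return mean(ordered[k:len(ordered) - k])

# Hypothetical yearly energy use of one building over 5 years, with one outlier.
yearly_use = [100, 102, 98, 101, 500]
plain = mean(yearly_use)            # pulled upward by the outlier
robust_median = median(yearly_use)
robust_trimmed = trimmed_mean(yearly_use)
```

With only five values per building, the median or a lightly trimmed mean is much less sensitive to a single extreme year than the plain mean.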