Most Popular

1500 questions
56
votes
5 answers

Training a decision tree against unbalanced data

I'm new to data mining and I'm trying to train a decision tree against a data set which is highly unbalanced. However, I'm having problems with poor predictive accuracy. The data consists of students studying courses, and the class variable is the…
chrisb
  • 715
  • 1
  • 7
  • 8
56
votes
4 answers

What is the proper usage of scale_pos_weight in xgboost for imbalanced datasets?

I have a very imbalanced dataset. I'm trying to follow the tuning advice and use scale_pos_weight, but I'm not sure how I should tune it. I can see that RegLossObj.GetGradient does: if (info.labels[i] == 1.0f) w *= param_.scale_pos_weight so a gradient…
ihadanny
  • 2,596
  • 3
  • 19
  • 31
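
A note on the scale_pos_weight excerpt above: the weighting it quotes can be sketched in a few lines. This is only an illustration of the idea under stated assumptions, not xgboost's actual implementation. The function name `weighted_logloss_grad` is hypothetical, and the loss is assumed to be plain logistic loss on raw margins.

```python
import math


def weighted_logloss_grad(margins, labels, scale_pos_weight=1.0):
    """Gradient and Hessian of logistic loss w.r.t. each raw margin.

    Hypothetical helper for illustration: positive examples (label 1.0)
    have their gradient and Hessian multiplied by scale_pos_weight,
    mirroring the `w *= scale_pos_weight` step quoted in the excerpt.
    """
    grads, hess = [], []
    for margin, y in zip(margins, labels):
        p = 1.0 / (1.0 + math.exp(-margin))  # predicted probability
        w = scale_pos_weight if y == 1.0 else 1.0
        grads.append((p - y) * w)        # weighted first derivative
        hess.append(p * (1.0 - p) * w)   # weighted second derivative
    return grads, hess


# One positive and one negative example, both with margin 0 (p = 0.5).
grads, hess = weighted_logloss_grad([0.0, 0.0], [1.0, 0.0], scale_pos_weight=3.0)
```

With both margins at zero, the positive example's gradient and Hessian are three times as large in magnitude as the negative example's, which is why the setting acts like up-weighting the positive class.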
56
votes
9 answers

How do R and Python complement each other in data science?

In many tutorials or manuals the narrative seems to imply that R and Python coexist as complementary components of the analysis process. To my untrained eye, however, it seems that both languages sort of do the same thing. So my question is if there…
BioHazZzZard
  • 319
  • 1
  • 4
  • 5
56
votes
2 answers

What is the difference between random-effects, fixed-effects, and marginal models?

I am trying to expand my knowledge of statistics. I come from a physical sciences background with a "recipe based" approach to statistical testing, where we ask: is it continuous, is it normally distributed? Then use OLS regression. In my reading I have…
N26
  • 1,705
  • 3
  • 18
  • 22
56
votes
2 answers

Neural Network: For Binary Classification use 1 or 2 output neurons?

Assume I want to do binary classification (something belongs to class A or class B). There are some possibilities to do this in the output layer of a neural network: Use 1 output node. Output 0 (<0.5) is considered class A and 1 (>=0.5) is…
robert
  • 881
  • 1
  • 9
  • 12
56
votes
7 answers

Effect of switching response and explanatory variable in simple linear regression

Let's say that there exists some "true" relationship between $y$ and $x$ such that $y = ax + b + \epsilon$, where $a$ and $b$ are constants and $\epsilon$ is i.i.d. normal noise. When I randomly generate data from that R code: x <- 1:100; y <- ax + b…
Greg Aponte
  • 663
  • 1
  • 6
  • 6
56
votes
3 answers

Logistic Regression: Scikit Learn vs Statsmodels

I am trying to understand why the output from logistic regression of these two libraries gives different results. I am using the dataset from UCLA idre tutorial, predicting admit based on gre, gpa and rank. rank is treated as categorical variable,…
hurrikale
  • 853
  • 1
  • 8
  • 7
56
votes
2 answers

Cross-Entropy or Log Likelihood in Output layer

I read this page: http://neuralnetworksanddeeplearning.com/chap3.html and it said that a sigmoid output layer with cross-entropy is quite similar to a softmax output layer with log-likelihood. What happens if I use sigmoid with log-likelihood or…
malioboro
  • 851
  • 1
  • 11
  • 19
56
votes
16 answers

Recommended books on experiment design?

What are the panel's recommendations for books on design of experiments? Ideally, books should be still in print or available electronically, although that may not always be feasible. If you feel moved to add a few words on what's so good about the…
walkytalky
  • 1,857
  • 2
  • 22
  • 24
56
votes
2 answers

A/B tests: z-test vs t-test vs chi square vs fisher exact test

I'm trying to understand the reasoning behind choosing a specific test approach when dealing with a simple A/B test (i.e. two variations/groups with a binary response: converted or not). As an example I will be using the data below Version Visits …
56
votes
13 answers

Software for drawing bayesian networks (graphical models)

I am searching for [free] software that can produce nice-looking graphical models. Any suggestions would be appreciated.
C. Reed
  • 537
  • 1
  • 8
  • 14
56
votes
4 answers

Can a random forest be used for feature selection in multiple linear regression?

Since RF can handle non-linearity but can't provide coefficients, would it be wise to use random forest to gather the most important features and then plug those features into a multiple linear regression model in order to obtain their coefficients?…
56
votes
6 answers

Practical hyperparameter optimization: Random vs. grid search

I'm currently going through Bengio's and Bergstra's Random Search for Hyper-Parameter Optimization [1] where the authors claim random search is more efficient than grid search in achieving approximately equal performance. My question is: Do people…
Bar
  • 2,492
  • 3
  • 19
  • 31
56
votes
9 answers

Are we exaggerating the importance of model assumptions and evaluation in an era when analyses are often carried out by laymen?

Bottom line, the more I learn about statistics, the less I trust published papers in my field; I simply believe that researchers are not doing their statistics well enough. I'm a layman, so to speak. I'm trained in biology but I have no formal…
Adam Robinsson
  • 2,083
  • 3
  • 19
  • 39
56
votes
4 answers

Logistic Regression - Error Term and its Distribution

On whether an error term exists in logistic regression (and its assumed distribution), I have read in various places that: no error term exists; the error term has a binomial distribution (in accordance with the distribution of the response…