Most Popular
1500 questions
56 votes, 5 answers
Training a decision tree against unbalanced data
I'm new to data mining and I'm trying to train a decision tree against a data set which is highly unbalanced. However, I'm having problems with poor predictive accuracy.
The data consists of students studying courses, and the class variable is the…

chrisb
56 votes, 4 answers
What is the proper usage of scale_pos_weight in xgboost for imbalanced datasets?
I have a very imbalanced dataset. I'm trying to follow the tuning advice and use scale_pos_weight, but I'm not sure how I should tune it.
I can see that RegLossObj.GetGradient does:
if (info.labels[i] == 1.0f) w *= param_.scale_pos_weight
so a gradient…

ihadanny
56 votes, 9 answers
How do R and Python complement each other in data science?
In many tutorials or manuals the narrative seems to imply that R and Python coexist as complementary components of the analysis process. To my untrained eye, however, it seems that both languages more or less do the same thing.
So my question is if there…

BioHazZzZard
56 votes, 2 answers
What is the difference between random-effects, fixed-effects, and marginal models?
I am trying to expand my knowledge of statistics. I come from a physical sciences background with a "recipe based" approach to statistical testing, where we say is it continuous, is it normally distributed -- OLS regression.
In my reading I have…

N26
56 votes, 2 answers
Neural Network: For Binary Classification use 1 or 2 output neurons?
Assume I want to do binary classification (something belongs to class A or class B). There are some possibilities to do this in the output layer of a neural network:
Use 1 output node. Output 0 (<0.5) is considered class A and 1 (>=0.5) is…

robert
56 votes, 7 answers
Effect of switching response and explanatory variable in simple linear regression
Let's say that there exists some "true" relationship between $y$ and $x$ such that $y = ax + b + \epsilon$, where $a$ and $b$ are constants and $\epsilon$ is i.i.d. normal noise. When I randomly generate data from this R code: x <- 1:100; y <- ax + b…

Greg Aponte
56 votes, 3 answers
Logistic Regression: Scikit Learn vs Statsmodels
I am trying to understand why the output from logistic regression in these two libraries differs.
I am using the dataset from the UCLA IDRE tutorial, predicting admit based on gre, gpa and rank. rank is treated as a categorical variable,…

hurrikale
56 votes, 2 answers
Cross-Entropy or Log Likelihood in Output layer
I read this page:
http://neuralnetworksanddeeplearning.com/chap3.html
and it said that a sigmoid output layer with cross-entropy is quite similar to a softmax output layer with log-likelihood.
What happens if I use sigmoid with log-likelihood or…

malioboro
56 votes, 16 answers
Recommended books on experiment design?
What are the panel's recommendations for books on design of experiments?
Ideally, books should be still in print or available electronically, although that may not always be feasible. If you feel moved to add a few words on what's so good about the…

walkytalky
56 votes, 2 answers
A/B tests: z-test vs t-test vs chi square vs fisher exact test
I'm trying to understand the reasoning behind choosing a specific test approach when dealing with a simple A/B test (i.e., two variations/groups with a binary response: converted or not). As an example I will be using the data below:
Version Visits …

L Xandor
56 votes, 13 answers
Software for drawing Bayesian networks (graphical models)
I am searching for [free] software that can produce nice looking graphical models, e.g.
Any suggestions would be appreciated.

C. Reed
56 votes, 4 answers
Can a random forest be used for feature selection in multiple linear regression?
Since RF can handle non-linearity but can't provide coefficients, would it be wise to use random forest to gather the most important features and then plug those features into a multiple linear regression model in order to obtain their coefficients?…

Hidden Markov Model
56 votes, 6 answers
Practical hyperparameter optimization: Random vs. grid search
I'm currently going through Bengio's and Bergstra's Random Search for Hyper-Parameter Optimization [1] where the authors claim random search is more efficient than grid search in achieving approximately equal performance.
My question is: Do people…

Bar
56 votes, 9 answers
Are we exaggerating the importance of model assumptions and evaluation in an era when analyses are often carried out by laymen?
Bottom line, the more I learn about statistics, the less I trust published papers in my field; I simply believe that researchers are not doing their statistics well enough.
I'm a layman, so to speak. I'm trained in biology but I have no formal…

Adam Robinsson
56 votes, 4 answers
Logistic Regression - Error Term and its Distribution
On whether an error term exists in logistic regression (and its assumed distribution), I have read in various places that:
no error term exists
the error term has a binomial distribution (in accordance with the distribution of the response…

user61124