Questions tagged [missing-data]

Use this tag when the data contain gaps, i.e., are incomplete. Missingness needs to be taken into account when performing an analysis or test.

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.
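As a minimal sketch of why missing values affect conclusions, the following Python snippet uses made-up measurements, with NaN standing in for a value that was never recorded. The data and the helper name `complete_cases` are assumptions for illustration only. It shows that a naive summary silently becomes NaN, while dropping incomplete entries changes the sample the summary is based on.

```python
import math

def complete_cases(values):
    """Keep only entries that are not NaN (complete-case analysis for one variable)."""
    return [v for v in values if not math.isnan(v)]

# Hypothetical measurements; NaN marks a value that was never recorded.
observed = [2.0, 4.0, math.nan, 6.0, math.nan, 8.0]

kept = complete_cases(observed)
n_missing = len(observed) - len(kept)

# Arithmetic with NaN propagates NaN, so the naive mean is undefined.
naive_mean = sum(observed) / len(observed)

# The complete-case mean is computed on fewer observations.
complete_mean = sum(kept) / len(kept)

print(f"missing: {n_missing} of {len(observed)}")
print(f"naive mean: {naive_mean}")
print(f"complete-case mean: {complete_mean}")
```

Whether the complete-case estimate is trustworthy depends on why the values are missing; the sketch only shows the mechanics, not a recommended analysis.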

Tag wiki reference: Wikipedia

1427 questions
62
votes
7 answers

Why doesn't Random Forest handle missing values in predictors?

What are the theoretical reasons for not handling missing values? Gradient boosting machines and regression trees handle missing values. Why doesn't Random Forest do that?
Fedorenko Kristina
  • 723
  • 1
  • 6
  • 6
38
votes
3 answers

How does R handle missing values in lm?

I'd like to regress a vector B against each of the columns in a matrix A. This is trivial if there are no missing data, but if matrix A contains missing values, then my regression against A is constrained to include only rows where all values are…
David Quigley
  • 483
  • 1
  • 4
  • 7
35
votes
5 answers

Why do some people use -999 or -9999 to replace missing values?

I have a dataset with lots of missing values. In some columns, the missing values were replaced with -999, but in other columns they were marked as 'NA'. Why would we use -999 to replace a missing value?
qqqwww
  • 493
  • 1
  • 4
  • 8
34
votes
3 answers

Propensity score matching after multiple imputation

I refer to this paper: Hayes JR, Groner JI. "Using multiple imputation and propensity scores to test the effect of car seats and seat belt usage on injury severity from trauma registry data." J Pediatr Surg. 2008 May;43(5):924-7. In this study,…
Joe King
  • 3,024
  • 6
  • 32
  • 58
31
votes
2 answers

Why is the Expectation Maximization algorithm guaranteed to converge to a local optimum?

I have read a couple of explanations of the EM algorithm (e.g., from Bishop's Pattern Recognition and Machine Learning and from Rogers and Girolami's A First Course in Machine Learning). The derivation of EM is OK; I understand it. I also understand why the…
michal
  • 1,138
  • 3
  • 11
  • 14
30
votes
3 answers

R caret and NAs

I very much prefer caret for its parameter tuning ability and uniform interface, but I have observed that it always requires complete datasets (i.e., without NAs) even if the applied "naked" model allows NAs. That is very bothersome, regarding that…
Fredrik
  • 671
  • 1
  • 5
  • 8
29
votes
1 answer

How do decision tree learning algorithms deal with missing values (under the hood)

What methods do decision tree learning algorithms use to deal with missing values? Do they simply fill the slot in using a value called missing? Thanks.
user1172468
  • 1,505
  • 5
  • 21
  • 36
28
votes
5 answers

Imputation of missing values for PCA

I used the prcomp() function to perform a PCA (principal component analysis) in R. However, there's a bug in that function such that the na.action parameter does not work. I asked for help on Stack Overflow; two users there offered two different ways…
user969113
  • 611
  • 1
  • 5
  • 8
27
votes
2 answers

Full information maximum likelihood for missing data in R

Context: Hierarchical regression with some missing data. Question: How do I use full information maximum likelihood (FIML) estimation to address missing data in R? Is there a package you would recommend, and what are typical steps? Online…
Sootica
  • 1,178
  • 1
  • 14
  • 24
27
votes
5 answers

A statistical approach to determine if data are missing at random

I have a large set of feature vectors which I will use to attack a binary classification problem (using scikit learn in Python). Before I start to think about imputation, I am interested in trying to determine from the remaining parts of the data if…
graffe
  • 1,799
  • 1
  • 22
  • 34
26
votes
1 answer

Difference between missing data and sparse data in machine learning algorithms

What are the main differences between sparse data and missing data? And how do they influence machine learning? More specifically, what effect do sparse data and missing data have on classification algorithms and regression (predicting numbers) type of…
tired and bored dev
  • 855
  • 2
  • 9
  • 17
26
votes
5 answers

Machine learning algorithms to handle missing data

I am trying to develop a predictive model using high-dimensional clinical data including laboratory values. The data space is sparse with 5k samples and 200 variables. The idea is to rank the variables using a feature selection method (IG, RF etc)…
Khader Shameer
  • 663
  • 1
  • 7
  • 14
25
votes
4 answers

EM maximum likelihood estimation for Weibull distribution

Note: I am posting a question from a former student of mine unable to post on his own for technical reasons. Given an iid sample $x_1,\ldots,x_n$ from a Weibull distribution with pdf $$ f_k(x) = k x^{k-1} e^{-x^k} \quad x>0 $$ is there a useful…
24
votes
6 answers

What are the disadvantages of using mean for missing values?

I have an assignment (Data Mining course) and there is a part which asks: "What are the disadvantages of using mean for missing values?" in Missing Value section. So I searched a little bit and the most common answer was: "Because it reduces the…
23
votes
1 answer

How the 'NA' values are treated in glm in R

I have a data table T1 that contains nearly a thousand variables (V1) and around 200 million data points. The data are sparse and most of the entries are NA. Each data point has a unique id and date pair to distinguish it from the others. I have another…
user1140126
  • 789
  • 4
  • 12
  • 20