Questions tagged [binary-data]

A binary variable takes one of two values, typically coded as "0" and "1".

In a broader sense, "binary variable" is a synonym of "dichotomous variable": any variable that can take only one of two values. In a narrower sense, it refers to dichotomous data coded as "1" or "0". Sometimes "1" means "is present" and "0" means "is absent", which may require treating the two values asymmetrically in some statistical analyses (see, e.g., Jaccard indices).
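
The asymmetric treatment of "1" and "0" mentioned above can be illustrated with a small sketch. It compares the simple matching coefficient, which counts 0-0 agreements, with the Jaccard index, which ignores them. The function names and the example vectors are illustrative assumptions, not part of the tag description.

```python
def simple_matching(a, b):
    """Share of positions where two 0/1 vectors agree; 0-0 matches count."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard(a, b):
    """Share of 1-1 matches among positions where at least one vector has a 1."""
    if len(a) != len(b):
        raise ValueError("vectors must be of equal length")
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    # Undefined when neither vector contains a 1.
    return both / either if either else float("nan")


# Hypothetical presence/absence data: the attribute is rare in both objects.
a = [1, 0, 0, 0, 0, 1]
b = [1, 0, 0, 0, 0, 0]

sm = simple_matching(a, b)  # 5 of 6 positions agree, mostly shared absences
jc = jaccard(a, b)          # 1 shared presence out of 2 positions with any presence
```

When shared absences carry little information (for example, two users both lacking a rare tag), the Jaccard index is often the more cautious choice, because it does not reward objects for being jointly empty.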

A binary response variable occurs as a result of Bernoulli trials, whose analysis commonly involves contingency tables or logistic/probit regression.

The term 'binary' also refers to data stored as machine-readable binary numbers rather than numbers recorded in strings of ASCII (or Unicode, or other human-readable) numerals.
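
As a minimal sketch of this second meaning, the snippet below stores the same integer once as ASCII numerals and once as a fixed-width machine-readable value, using Python's standard `struct` module. The value and the little-endian 32-bit layout are arbitrary choices made for illustration.

```python
import struct

value = 1234567

# Human-readable form: one byte per ASCII digit.
as_text = str(value).encode("ascii")

# Machine-readable form: little-endian signed 32-bit integer, always 4 bytes.
as_binary = struct.pack("<i", value)

# Unpacking the binary form recovers the original number.
recovered = struct.unpack("<i", as_binary)[0]
```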

In econometrics binary variables are also called dummy variables.

1249 questions
69
votes
4 answers

Reduce Classification Probability Threshold

I have a question regarding classification in general. Let $f$ be a classifier, which outputs a set of probabilities given some data D. Normally, one would say: well, if $P(c|D) > 0.5$, we will assign a class 1, otherwise 0 (let this be a binary…
sdgaw erzswer
64
votes
5 answers

Is it meaningful to calculate Pearson or Spearman correlation between two Boolean vectors?

There are two Boolean vectors, which contain only 0 and 1. If I calculate the Pearson or Spearman correlation between them, is the result meaningful or reasonable?
Zhilong Jia
59
votes
7 answers

Binary classification with strongly unbalanced classes

I have a data set in the form of (features, binary output 0 or 1), but 1 happens pretty rarely, so just by always predicting 0, I get accuracy between 70% and 90% (depending on the particular data I look at). The ML methods give me about the same…
59
votes
10 answers

Measuring entropy/ information/ patterns of a 2d binary matrix

I want to measure the entropy/ information density/ pattern-likeness of a two-dimensional binary matrix. Let me show some pictures for clarification: This display should have a rather high entropy: A) This should have medium entropy: B) These…
46
votes
4 answers

Would PCA work for boolean (binary) data types?

I want to reduce the dimensionality of higher order systems and capture most of the covariance on a preferably 2 dimensional or 1 dimensional field. I understand this can be done via principal component analysis, and I have used PCA in many…
44
votes
5 answers

Should you ever standardise binary variables?

I have a data set with a set of features. Some of them are binary $(1=$ active or fired, $0=$ inactive or dormant), and the rest are real valued, e.g. $4564.342$. I want to feed this data to a machine learning algorithm, so I $z$-score all the…
siamii
38
votes
1 answer

Doing principal component analysis or factor analysis on binary data

I have a dataset with a large number of Yes/No responses. Can I use principal components (PCA) or any other data reduction analyses (such as factor analysis) for this type of data? Please advise how I go about doing this using SPSS.
Cathy
35
votes
2 answers

How to use both binary and continuous variables together in clustering?

I need to use binary variables (values 0 & 1) in k-means. But k-means only works with continuous variables. I know some people still use these binary variables in k-means ignoring the fact that k-means is only designed for continuous variables. This…
GeorgeOfTheRF
34
votes
1 answer

Is there Factor analysis or PCA for ordinal or binary data?

I have completed principal component analysis (PCA), exploratory factor analysis (EFA), and confirmatory factor analysis (CFA), treating data on a Likert scale (5-level responses: none, a little, some,..) as a continuous variable. Then, using…
user116948
30
votes
2 answers

Clustering a binary matrix

I have a semi-small matrix of binary features of dimension 250k x 100. Each row is a user and the columns are binary "tags" of some user behavior e.g. "likes_cats". user 1 2 3 4 5 ... ------------------------- A 1 0 1 0 1 B …
wije
28
votes
7 answers

Why is gender typically coded 0/1 rather than 1/2, for example?

I understand the logic of coding for data analysis. My question below is on the use of a specific code. Is there a reason why gender is often coded as 0 for female and 1 for male? Why is this coding considered 'standard'? Compare this with Female…
Adhesh Josh
26
votes
2 answers

Similarity Coefficients for binary data: Why choose Jaccard over Russell and Rao?

From Encyclopedia of Statistical Sciences I understand that given $p$ dichotomous (binary: 1=present; 0=absent) attributes (variables), we can form a contingency table for any two objects i and j of a sample: j 1 0 ------- …
wflynny
23
votes
3 answers

Visualizing the calibration of predicted probability of a model

Suppose I have a predictive model that produces, for each instance, a probability for each class. Now I recognize that there are many ways to evaluate such a model if I want to use those probabilities for classification (precision, recall, etc.). …
23
votes
3 answers

Generate random correlated data between a binary and a continuous variable

I want to generate two variables. One is binary outcome variable (say success / failure) and the other is age in years. I want age to be positively correlated with success. For example there should be more successes in the higher age segments than…
user333
19
votes
2 answers

optimizing auc vs logloss in binary classification problems

I am performing a binary classification task where the outcome probability is fairly low (around 3%). I am trying to decide whether to optimize by AUC or log-loss. As far as I understand, AUC maximizes the model's ability to discriminate between…
Giorgio Spedicato