Here is an investigation of how adding constant attributes to (or, equivalently, removing them from) a dataset of binary attributes affects the distances computed between cases. I tested it for the various popular binary-data distance measures using SPSS.
This answer should help you decide whether you may delete attributes that are constant (or, to extend tentatively, almost constant, i.e. extremely skewed) before computing proximities between cases in binary data - the resulting matrix then being input to procedures such as hierarchical agglomerative clustering (HAC) or multidimensional scaling (MDS).
I repeatedly generated 15 random binary variables (with random, moderate skewnesses, drawn both from uncorrelated and from correlated populations; this did not change the results) and computed a proximity measure between cases. I then added 5 more variables, this time constants - either all equal to 1 or all equal to 0 - and computed the proximity matrix again. Finally, I inspected the scatterplot of the 15-variable proximity values against the 20-variable ones.
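For concreteness, here is a minimal Python sketch of that experiment. The original runs were done in SPSS; the Jaccard measure, the sample size, and the skewness range here are merely illustrative choices:

```python
import numpy as np
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_cases = 60

# 15 random binary variables with moderate, varying skewness
p = rng.uniform(0.2, 0.8, size=15)            # per-variable P(attribute = 1)
X15 = rng.random((n_cases, 15)) < p           # boolean case-by-variable matrix

# add 5 constant variables, all equal to 1 (use zeros for the "absent" scenario)
X20 = np.hstack([X15, np.ones((n_cases, 5), dtype=bool)])

# proximities between cases, before vs after (scipy's jaccard is a *distance*)
d15 = pdist(X15, metric='jaccard')
d20 = pdist(X20, metric='jaccard')

plt.scatter(d15, d20, s=8)
plt.xlabel('distances from 15 variables')
plt.ylabel('distances from 15 variables + 5 constants')
plt.show()
```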
All proximity measures for binary data currently available in SPSS were examined. In a binary variable, 1 means "the attribute is present" and 0 means "the attribute is absent". The results are shown below.
- equal: constant variables do not affect the measure at all (they are simply ignored by it)
- proportional: exact proportional relation
- linear: exact linear relation (i.e. with an intercept term)
- monotonic: exact but somewhat curved relation
- scatter: no exact relation; the scatterplot is a cloud (whose shape depends on the measure: it can be an ellipsoid, a half-ellipsoid, or something more peculiar)
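To see where these categories come from, use the standard 2x2 notation for a pair of cases: $a$ = number of attributes present in both cases, $b$ and $c$ = present in one case only, $d$ = absent in both, and $n = a+b+c+d$. Adding $k$ constant attributes equal to 1 replaces $a$ with $a+k$ and $n$ with $n+k$, so the Russell and Rao similarity $a/n$ becomes

$$\frac{a+k}{n+k} = \frac{n}{n+k}\cdot\frac{a}{n} + \frac{k}{n+k},$$

an exact linear relation, while Jaccard $a/(a+b+c)$ becomes $(a+k)/(a+b+c+k)$, which depends on $a$ and $a+b+c$ separately and so forms only a cloud. With $k$ constants equal to 0, only $d$ and $n$ grow: Jaccard is untouched ("equal"), and Russell and Rao becomes $a/(n+k)$, an exact proportional relation.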
The acronym next to each measure's name is its SPSS syntax keyword. The relationships found are:
1) When the 5 constant variables are 1 (the attribute is present)

**Similarities**

| Measure | SPSS keyword | Relation |
|---|---|---|
| Russell and Rao (simple joint probability) | RR | linear |
| Simple matching (Rand) | SM | linear |
| Jaccard | JACCARD | scatter |
| Dice (Czekanowski, Sørensen) | DICE | scatter |
| Sokal and Sneath 1 | SS1 | monotonic, almost linear |
| Rogers and Tanimoto | RT | monotonic, almost linear |
| Sokal and Sneath 2 | SS2 | scatter |
| Kulczynski 1 | K1 | scatter |
| Sokal and Sneath 3 | SS3 | linear, except that the proximity can equal 1 in both datasets (points off the line) |
| Kulczynski 2 | K2 | scatter |
| Sokal and Sneath 4 | SS4 | scatter |
| Hamann | HAMANN | linear |
| Ochiai (cosine) | OCHIAI | scatter |
| Sokal and Sneath 5 | SS5 | scatter |
| Phi (Pearson) correlation | PHI | scatter |
| Goodman and Kruskal’s lambda | LAMBDA | scatter |
| Anderberg’s D | D | scatter |
| Yule’s Y | Y | scatter |
| Yule’s Q (Goodman and Kruskal’s gamma) | Q | scatter |
| Dispersion similarity | DISPER | scatter |

**Dissimilarities**

| Measure | SPSS keyword | Relation |
|---|---|---|
| Euclidean distance | BEUCLID | equal |
| Squared Euclidean distance | BSEUCLID | equal |
| Size difference | SIZE | proportional |
| Pattern difference | PATTERN | proportional |
| Shape difference | BSHAPE | scatter |
| Variance dissimilarity | VARIANCE | proportional |
| Lance-and-Williams dissimilarity | BLWMN | scatter |
2) When the 5 constant variables are 0 (the attribute is absent)

**Similarities**

| Measure | SPSS keyword | Relation |
|---|---|---|
| Russell and Rao (simple joint probability) | RR | proportional |
| Simple matching (Rand) | SM | linear |
| Jaccard | JACCARD | equal |
| Dice (Czekanowski, Sørensen) | DICE | equal |
| Sokal and Sneath 1 | SS1 | monotonic, almost linear |
| Rogers and Tanimoto | RT | monotonic, almost linear |
| Sokal and Sneath 2 | SS2 | equal |
| Kulczynski 1 | K1 | equal |
| Sokal and Sneath 3 | SS3 | linear, except that the proximity can equal 1 in both datasets (points off the line) |
| Kulczynski 2 | K2 | equal |
| Sokal and Sneath 4 | SS4 | scatter |
| Hamann | HAMANN | linear |
| Ochiai (cosine) | OCHIAI | equal |
| Sokal and Sneath 5 | SS5 | scatter |
| Phi (Pearson) correlation | PHI | scatter |
| Goodman and Kruskal’s lambda | LAMBDA | scatter |
| Anderberg’s D | D | scatter |
| Yule’s Y | Y | scatter |
| Yule’s Q (Goodman and Kruskal’s gamma) | Q | scatter |
| Dispersion similarity | DISPER | scatter |

**Dissimilarities**

| Measure | SPSS keyword | Relation |
|---|---|---|
| Euclidean distance | BEUCLID | equal |
| Squared Euclidean distance | BSEUCLID | equal |
| Size difference | SIZE | proportional |
| Pattern difference | PATTERN | proportional |
| Shape difference | BSHAPE | scatter |
| Variance dissimilarity | VARIANCE | proportional |
| Lance-and-Williams dissimilarity | BLWMN | equal |
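These relationships are easy to spot-check outside SPSS. A minimal sketch on hypothetical data: scipy happens to implement both Jaccard and Russell and Rao (the latter as the dissimilarity $1-a/n$), so two of the entries above can be verified directly.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.random((50, 15)) < rng.uniform(0.2, 0.8, size=15)   # 15 binary variables
zeros = np.zeros((50, 5), dtype=bool)
ones = np.ones((50, 5), dtype=bool)

# Jaccard with constant 0s added: 'equal' (distances unchanged)
assert np.allclose(pdist(X, 'jaccard'),
                   pdist(np.hstack([X, zeros]), 'jaccard'))

# Russell and Rao similarity with constant 1s added: exact linear relation
s15 = 1 - pdist(X, 'russellrao')                      # a / 15
s20 = 1 - pdist(np.hstack([X, ones]), 'russellrao')   # (a + 5) / 20
assert np.allclose(s20, s15 * 15 / 20 + 5 / 20)
```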
The practical upshot is straightforward. You should not remove (or add) constant attributes, provided they are meaningful to you, with any of the "scatter" measures, because doing so alters the distances in the matrix in a nonsystematic way. In the "proportional" or "linear" cases, the decision to remove or keep them should take into account the nature of the specific clustering or MDS method. For example, in complete- or single-linkage$^1$ HAC, any proportional or linear transformation of the distances leaves the results unchanged; even a monotonic transformation does (though it may influence the decision about how many clusters to retain) - see the sketch below. In MDS, results may differ depending on whether you treat your distances as ratio-, interval-, or ordinal-level data. Hence, depending on that choice, the proportional or linear effect of deleting constant (or almost constant) attributes will or will not show up in the results of your analysis.
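Here is a minimal sketch of that invariance for single linkage, on hypothetical data, with squaring standing in for an arbitrary monotone transform of the distances:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(2)
X = rng.random((30, 15)) < 0.5               # 30 cases, 15 binary variables

d = pdist(X, 'jaccard')
Z1 = linkage(d, method='single')
Z2 = linkage(d ** 2, method='single')        # monotone transform of the distances

# identical merge sequence (tree topology); only the merge heights differ
assert np.array_equal(Z1[:, :2], Z2[:, :2])
```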
$^1$ Actually, only these two linkage methods in HAC are fully warranted theoretically for binary data. Average-linkage methods already involve some heuristics, and "geometric" methods such as centroid or Ward should be avoided with binary data altogether.