R package for combining factor levels for datamining?

Question

Has anyone come across a package or function in R that combines the levels of a factor whose share of all observations is below some threshold? Specifically, one of the first data preparation steps I carry out is to collapse sparse factor levels (say into a level called 'Other') when they do not make up at least, say, 2% of the total. This is done unsupervised, and only when the objective is to model some activity in marketing (not fraud detection, where those very small occurrences could be extremely important). I am looking for a function that will collapse levels until some threshold proportion is met.

UPDATE:

Thanks to these great suggestions, I wrote a function pretty easily. I did realize, though, that collapsing all levels with proportion < the minimum can still leave the recoded level < the minimum; in that case the lowest level with proportion > the minimum has to be added as well. It can likely be made more efficient, but it appears to work. The next enhancement would be to figure out how to capture the "rules" for applying the collapse logic to new data (a validation set or future data).

collapseFactors <- function(tableName, minPercent = 5, fillIn = "RECODED")
{
    for (i in seq_len(ncol(tableName)))
    {
        if (is.factor(tableName[[i]])) # process just factors
        {
            # level proportions, smallest first
            sortedTable <- sort(prop.table(table(tableName[[i]])))
            numberToCollapse <- sum(sortedTable < (minPercent / 100))

            # add the next level if the pooled level would still be < minPercent
            # (only when there is something to pool and a next level exists)
            if (numberToCollapse > 0 && numberToCollapse < length(sortedTable) &&
                sum(sortedTable[seq_len(numberToCollapse)]) < (minPercent / 100))
            {
                numberToCollapse <- numberToCollapse + 1
            }

            if (numberToCollapse > 1) # if not > 1 then nothing to collapse
            {
                lf <- names(sortedTable)[seq_len(numberToCollapse)]
                levels(tableName[[i]])[levels(tableName[[i]]) %in% lf] <- fillIn
            }
        } # end if a factor
    } # end for loop

    return(tableName)
} # end function
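
One possible way to capture the "rules" for new data is sketched below. It assumes the rule is simply the set of levels to keep for each factor column, learned on the training data. The names learn_keep_levels() and apply_keep_levels() are hypothetical, and unlike collapseFactors() this sketch does not add the next-smallest level when the pooled level is still below the threshold.

```r
# Hypothetical helper: for each factor column of the training data, record
# the levels whose share of rows is at least min_prop.
learn_keep_levels <- function(train, min_prop = 0.05) {
  is_fac <- vapply(train, is.factor, logical(1))
  lapply(train[is_fac], function(f) {
    props <- prop.table(table(f))
    names(props)[props >= min_prop]
  })
}

# Hypothetical helper: recode every level not in the stored keep-list to
# fill_in, including levels that never appeared in the training data.
apply_keep_levels <- function(newdata, keep, fill_in = "RECODED") {
  for (col in names(keep)) {
    f <- newdata[[col]]
    levels(f)[!(levels(f) %in% keep[[col]])] <- fill_in
    newdata[[col]] <- f
  }
  newdata
}

train <- data.frame(g = factor(rep(c("a", "b", "c"), times = c(48, 50, 2))))
valid <- data.frame(g = factor(c("a", "c", "d", "b")))

rules <- learn_keep_levels(train, min_prop = 0.05)  # list(g = c("a", "b"))
valid_recoded <- apply_keep_levels(valid, rules)
as.character(valid_recoded$g)
```

Storing the levels to keep, rather than the levels to drop, means a level seen only in the new data ("d" here) is also sent to the fill-in level instead of passing through unchanged.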

For another approach: https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 — kjetil b halvorsen, May 16 '17 at 23:33

score 11 · Accepted Answer · answered Dec 21 '10 at 10:16

It seems it's just a matter of "releveling" the factor; no need to compute partial sums or make a copy of the original vector. E.g.,

set.seed(101)
a <- factor(LETTERS[sample(5, 150, replace=TRUE, 
                           prob=c(.1, .15, rep(.75/3,3)))])
p <- 1/5
lf <- names(which(prop.table(table(a)) < p))
levels(a)[levels(a) %in% lf] <- "Other"

Here, the original factor levels are distributed as follows:

 A  B  C  D  E 
18 23 35 36 38

and then it becomes

Other     C     D     E 
   41    35    36    38

It may be conveniently wrapped into a function. There is a combine_factor() function in the reshape package, so I guess it could be useful too.
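
As a minimal sketch of such a wrapper (the name collapse_rare_levels() and its arguments are made up for illustration; it only repackages the relevelling idiom above):

```r
# Hypothetical wrapper around the relevelling idiom: pool every level whose
# share of observations is below p into a single level named `other`.
collapse_rare_levels <- function(f, p, other = "Other") {
  stopifnot(is.factor(f))
  rare <- names(which(prop.table(table(f)) < p))
  levels(f)[levels(f) %in% rare] <- other
  f
}

# Levels "a" (2%) and "b" (3%) fall below the 5% threshold and are pooled.
x <- factor(rep(c("a", "b", "c", "d"), times = c(2, 3, 45, 50)))
table(collapse_rare_levels(x, p = 0.05))
```

Because only the levels attribute is modified, the order of the observations is left untouched.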

Also, as you seem interested in data mining, you might have a look at the caret package. It has a lot of useful features for data preprocessing, including functions like nearZeroVar() that flag predictors with a very imbalanced distribution of observed values (see the vignette, example data, pre-processing functions, visualizations and other functions, p. 5, for an example of use).
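
A small illustration of nearZeroVar(), assuming the caret package is installed (the block does nothing otherwise). The toy data frame and its column names are invented for the example.

```r
# Assumes the caret package is installed; the block is skipped otherwise.
if (requireNamespace("caret", quietly = TRUE)) {
  df <- data.frame(
    rare     = factor(rep(c("a", "b"), times = c(98, 2))),   # very imbalanced
    balanced = factor(rep(c("a", "b"), times = c(50, 50)))
  )
  # saveMetrics = TRUE returns one row per predictor with the frequency
  # ratio, the percentage of unique values, and the zeroVar / nzv flags.
  print(caret::nearZeroVar(df, saveMetrics = TRUE))
}
```

Note that this flags whole predictors rather than individual levels, so it complements the level-collapsing step instead of replacing it.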

chl

50,972
18
205
364

@chl Thanks. I have studied the caret package and have used it to tune meta-parameters. Very useful! – B_Miner Dec 21 '10 at 13:03
@chl +1, nice one. I wrote my function solely because code a[levels(a) %in% lf] – mpiktas Dec 21 '10 at 13:04
@mpiktas Thx. You can work at the vector level with e.g., `a[as.character(a) %in% lf] – chl Dec 21 '10 at 15:27
+1. a[levels(a) %in% lf] – Christopher Aden Dec 21 '10 at 18:40
But note that a[a=="a"] – mpiktas Dec 21 '10 at 21:04
@Christopher, a[levels(a) %in% lf] – mpiktas Dec 21 '10 at 21:06
@mpiktas `a[a=="a"]` is _subsetting_; `a[a=="A"] – chl Dec 21 '10 at 22:44
Put differently, you can't change the levels of an existing factor by changing its (vector) elements.[1] I think factors are a bit harder to grasp precisely because attributes usually don't play such a big role in working with R, but with factors, they often need to be explicitly manipulated. [1] Ok, maybe `drop=TRUE` for subsetting counts. – caracal Dec 22 '10 at 00:39
@chl, do not follow, how a=="a" is different from as.character(a) %in% "a"? They both create boolean vectors of length length(a). – mpiktas Dec 22 '10 at 07:20
@mpiktas Right. Wrote too quickly (neither is of length `length(a)`): `a[a=="a"]` is _subsetting_, `a=="a"` is _indexing_. The point was about assigning an unknown level to `a` which is a factor. The `%in%` operator or simply `intersect()` is cool when you want to match more than one case, e.g. `as.character(a) %in% c("a","c")` is equivalent to `a=="a" | a=="c"`. – chl Dec 22 '10 at 09:52
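
The distinction discussed in these comments can be illustrated with a toy factor (a rough sketch; the variable names are arbitrary): assigning an unknown value to the elements of a factor gives NA, whereas renaming the level recodes the elements.

```r
a <- factor(c("A", "B", "A", "C"))

# Assigning a value that is not an existing level yields NA (with a warning).
b <- a
b[b == "A"] <- "Other"   # warning: invalid factor level, NA generated
is.na(b)

# Renaming the level itself recodes every matching element at once.
d <- a
levels(d)[levels(d) == "A"] <- "Other"
as.character(d)
```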

score 5 · Answer 2 · answered Dec 21 '10 at 04:51

The only problem with Christopher's answer is that it will mix up the original ordering of the factor. Here is my fix:

 Merge.factors <- function(x, p) {
     t <- table(x)
     levt <- cbind(names(t), names(t)) 
     levt[t/sum(t)<p, 2] <- "Other"
     change.levels(x, levt)
 }

where change.levels is the following function. I wrote it some time ago, so I suspect there might be better ways of achieving what it does.

 change.levels <- function(f, levt) {
     ## Change the names of the levels of factor f using the
     ## substitution table levt.
     ## In the first column there are the original levels, in
     ## the second column -- the substitutes
     lv <- levels(f)
     if(sum(sort(lv) != sort(levt[, 1]))>0)
     stop ("The names from substitution table do not match given level names")
     res <- rep(NA, length(f))

     for(i in lv) {
          res[f==i] <- as.character(levt[levt[, 1]==i, 2])
     }
     factor(res)
}
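
One possible alternative to the loop, as a sketch only: rename the levels through a named lookup vector. The name change_levels_lookup() is hypothetical. It relies on the fact that assigning the same name to several levels merges them, and the resulting levels stay in the order of the original levels rather than being re-sorted as factor(res) does.

```r
# Sketch of an alternative to change.levels(): rename levels through a named
# lookup vector instead of looping over elements. Levels that receive the
# same new name are merged by `levels<-`.
change_levels_lookup <- function(f, levt) {
  lookup <- setNames(levt[, 2], levt[, 1])   # original level -> substitute
  stopifnot(all(levels(f) %in% names(lookup)))
  levels(f) <- unname(lookup[levels(f)])
  f
}

f <- factor(c("b", "a", "c", "a", "b", "b"))
levt <- cbind(c("a", "b", "c"), c("a", "b", "Other"))
as.character(change_levels_lookup(f, levt))
```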

Christopher Aden · Answer 3 · 2010-12-21T06:58:11.473

I wrote a quick function that will accomplish this goal. I'm a novice R user, so it may be slow with large tables.

Merge.factors <- function(x, p) { 
    #Combines factor levels in x that are less than a specified proportion, p.
    t <- table(x)
    y <- subset(t, prop.table(t) < p)
    z <- subset(t, prop.table(t) >= p)
    other <- rep("Other", sum(y))
    new.table <- c(z, table(other))
    new.x <- as.factor(rep(names(new.table), new.table))
    return(new.x)
}

As an example of it in action:

> a <- rep("a", 100)
> b <- rep("b", 1000)
> c <- rep("c", 1000)
> d <- rep("d", 1000)
> e <- rep("e", 400)
> f <- rep("f", 100)
> x <- factor(c(a, b, c, d, e, f))
> summary(x)
   a    b    c    d    e    f 
 100 1000 1000 1000  400  100 
> prop.table(table(x))
x
         a          b          c          d          e          f 
0.02777778 0.27777778 0.27777778 0.27777778 0.11111111 0.02777778 
> 
> w <- Merge.factors(x, .05)
> summary(w)
    b     c     d     e Other 
 1000  1000  1000   400   200 
> class(w)
[1] "factor"

Thanks for the observation, John. I have changed it a little to make it a factor. All I did was remake the original vector from the table though, so if there's a way to skip that step, this will be faster. — Christopher Aden, Dec 21 '10 at 03:11
Thanks to everyone who responded. My R is weak but the ability to do this with so few lines of code is testament to how powerful it is and makes me want to learn. — B_Miner, Dec 21 '10 at 14:16
