
The mission

I am trying to find a way to do Iterative Proportional Fitting (IPF) in R. The logic of the procedure is as follows: one has a table with, for example, the sample distribution of some variables. Let us say it is this one:

sample1 <- structure(c(6L, 14L, 46L, 16L, 6L, 21L, 62L, 169L, 327L, 174L, 
44L, 72L, 43L, 100L, 186L, 72L, 23L, 42L), .Dim = c(6L, 3L), .Dimnames = list(
    c("Primary", "Lowersec", "Highersec", "Highershort", "Higherlong", 
    "University"), c("B", "F", "W")))

Another table is from some other source, say another sample:

sample2 <- structure(c(171796L, 168191L, 240671L, 69168L, 60079L, 168169L, 
954045L, 1040981L, 1872732L, 726410L, 207366L, 425786L, 596239L, 
604826L, 991640L, 323215L, 134066L, 221696L), .Dim = c(6L, 3L
), .Dimnames = list(c("Primary", "Lowersec", "Highersec", "Highershort", 
"Higherlong", "University"), c("B", "F", "W")))

Now, we want to preserve the relationships between the variables found in sample1, but apply these relationships to the marginal distributions found in sample2. Iterative Proportional Fitting does this, as described here (I could not possibly offer a better explanation). I have tried doing it in LEM, with the following result:

                    B         F        W
Primary     124204.64  960173.6 637701.7
Lowersec    119749.12 1081459.0 612789.9
Highersec   336934.21 1792001.6 976107.2
Highershort  90512.27  736464.1 291816.6
Higherlong   43486.91  238593.0 119431.0
University  163186.85  418628.6 233835.5

I am not 100% sure that this result is correct, but chances are (say 99%) that it is: the odds ratios from the first table are preserved in the resulting table, while the marginal distributions (row and column sums) are identical to those of the second input table.
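A quick spot-check of the odds-ratio claim in R, using the upper-left 2x2 block (the fitted values are copied from the LEM table above):

# Odds ratio of the upper-left 2x2 block of sample1...
(6 * 169) / (14 * 62)                              # ~1.1682
# ...and the same odds ratio computed from the LEM output
(124204.64 * 1081459.0) / (119749.12 * 960173.6)   # ~1.1682 as well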

The problem

Strangely, this quite useful algorithm is not readily available in R, at least not in a user-friendly form. One function that is likely to be relevant is cat::ipf(). However, I cannot figure out how to use its margins= argument, and I am certainly not alone in asking this question. The help example uses a three-dimensional table, which makes things even more confusing.
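For what it is worth, my reading of the cat::ipf() help page is that margins= lists the factors defining each margin, with separate margins delimited by zeros, so that for a two-way table c(1, 0, 2) would request both one-way margins. The sketch below rests on that reading, and on the assumption that start= seeds the iteration the way I expect; I have not verified it against the LEM output:

library(cat)
# ASSUMPTION: c(1, 0, 2) means "fit the row margin (factor 1) and the
# column margin (factor 2)", and start= supplies the seed table whose
# associations are to be preserved.
fit <- ipf(table = sample2, margins = c(1, 0, 2), start = sample1)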

In addition, there exist some user-written functions: one is here and another is to be found here. Unfortunately, the first one gives an erroneous result. The second one is rather non-transparent, requiring specially pre-formatted CSV files as input instead of R matrix objects.

The question

  1. Can anyone please explain how to actually use the cat::ipf() function?
  2. Are there any alternative functions to achieve the IPF adjustment task, using matrices as input?
  3. (SOLVED) Can this function be fixed to deliver a proper result?

Thank you.

ADDENDUM: I was able to get a proper output from the function in (3). After some consideration it turned out that the function does not accept a matrix as input for the marginal distributions, but simply a list of these marginal distributions. So practically the question is solved. However, a proper answer for 1 and 2 would be of use to the larger community, since IPF is quite essential in loglinear models.
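Regarding question 2, one base-R candidate worth mentioning (an assumption on my part, not something I have verified against the LEM output): stats::loglin() fits log-linear models by IPF and accepts a start= table, so raking sample1 to the margins of sample2 should reduce to a single call:

# loglin() runs IPF on 'table', starting from 'start'; fitting both
# one-way margins of sample2 while seeding with sample1 should preserve
# sample1's odds ratios while matching sample2's row and column sums.
fit <- loglin(sample2, margin = list(1, 2), start = sample1, fit = TRUE)
fit$fit   # the raked table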

Also, reproducing a working IPF function here seems like a good idea, for those who search for it in the future:

ipf <- function(Margins_, seedAry, maxiter = 100, closure = 0.001) {
    # Check to see if the sum of each margin is equal
    MarginSums. <- unlist(lapply(Margins_, sum))
    if (any(MarginSums. != MarginSums.[1])) warning("sum of each margin not equal")

    # Replace margin values of zero with 0.001
    Margins_ <- lapply(Margins_, function(x) {
        if (any(x == 0)) warning("zeros in marginsMtx replaced with 0.001")
        x[x == 0] <- 0.001
        x
    })

    # Check to see if number of dimensions in seed array equals the number of
    # margins specified in the marginsMtx
    numMargins <- length(dim(seedAry))
    if (length(Margins_) != numMargins) {
        stop("number of margins in marginsMtx not equal to number of margins in seedAry")
    }

    # Set initial values
    resultAry <- seedAry
    iter <- 0
    marginChecks <- rep(1, numMargins)
    margins <- seq(1, numMargins)

    # Iteratively proportion margins until closure or iteration criteria are met
    while (any(marginChecks > closure) && iter < maxiter) {
        for (margin in margins) {
            marginTotal <- apply(resultAry, margin, sum)
            marginCoeff <- Margins_[[margin]] / marginTotal
            marginCoeff[is.infinite(marginCoeff)] <- 0
            resultAry <- sweep(resultAry, margin, marginCoeff, "*")
            marginChecks[margin] <- sum(abs(1 - marginCoeff))
        }
        iter <- iter + 1
    }

    # If IPF stopped due to number of iterations then output info
    if (iter == maxiter) cat("IPF stopped due to number of iterations\n")

    # Return balanced array
    resultAry
}

Note that this function will not accept a matrix for the marginal distributions, only a list of margin vectors. The following will help:

m1 <- rowSums(sample2)
m2 <- colSums(sample2)
m <- list(m1, m2)

And then supply m as the first argument.
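A quick check that the output behaves as advertised (the tolerance is loosened to match the function's default closure of 0.001):

res <- ipf(m, sample1)

# Margins should match sample2 up to the closure tolerance...
all.equal(rowSums(res), rowSums(sample2), tolerance = 1e-3)
all.equal(colSums(res), colSums(sample2), tolerance = 1e-3)

# ...and the odds ratios of sample1 should be preserved, e.g.
(res[1, 1] * res[2, 2]) / (res[1, 2] * res[2, 1])
(sample1[1, 1] * sample1[2, 2]) / (sample1[1, 2] * sample1[2, 1])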

Maxim.K
  • Another function for IPF is rake from the survey package. – Henrico May 15 '13 at 09:26
  • @Henrico I have seen it, but it is for survey weights, requires `surveydesign` object as input, and thus works in an entirely different fashion. – Maxim.K May 15 '13 at 09:36
  • 1
  • @Maxim.K The glm function uses an iteratively reweighted least squares algorithm for its fits. When the error family is `binomial` or `poisson`, the result is a log-linear model. It's not clear what advantage you saw in making your own function. – DWin Aug 09 '16 at 15:27

2 Answers


This is old, but here we go:

As @Henrico wrote, it seems that what you are trying to achieve is indeed raking. Instead of using survey::rake you might fit the distribution to the marginals "by hand" using a Poisson GLM, as suggested by @DWin. To get the right frequencies you need to use an offset.

Let

  • $n_{ij}$ be your sample1
  • $N_{ij}$ the expected frequencies for the adjusted table
  • $\hat{N}_{ij}$ the fitted values for the adjusted table

We need to fit the following model (see Little & Wu, 1991, "Models for contingency tables with known margins when target and sampled populations differ", JASA 86, 87–95):

$$\log \left( \frac{N_{ij}}{n_{ij}} \right) = \lambda + \lambda^1_i + \lambda^2_j$$

thus we have

$$\log \hat{N}_{ij} - \log n_{ij} = \hat{\lambda} + \hat{\lambda^1_i} + \hat{\lambda^2_j}$$

where $\log n_{ij}$ is the mentioned offset.

You can estimate it with any GLM software by

  1. Creating an artificial table of $N_{ij}$ that satisfies independence and has the desired marginals.
  2. Fitting a main-effects log-linear/Poisson model to $N_{ij}$ with the $n_{ij}$ (observed frequencies, sample1) as an offset.
  3. Taking the fitted values.

For example, this will get you the target frequencies f:

# Your data
sample1 <- structure(c(6L, 14L, 46L, 16L, 6L, 21L, 62L, 169L, 327L, 174L, 
                       44L, 72L, 43L, 100L, 186L, 72L, 23L, 42L), .Dim = c(6L, 3L), .Dimnames = list(
                         c("Primary", "Lowersec", "Highersec", "Highershort", "Higherlong", 
                           "University"), c("B", "F", "W")))

sample2 <- structure(c(171796L, 168191L, 240671L, 69168L, 60079L, 168169L, 
                       954045L, 1040981L, 1872732L, 726410L, 207366L, 425786L, 596239L, 
                       604826L, 991640L, 323215L, 134066L, 221696L), .Dim = c(6L, 3L
                       ), .Dimnames = list(c("Primary", "Lowersec", "Highersec", "Highershort", 
                                             "Higherlong", "University"), c("B", "F", "W")))

library(dplyr)

# Turn to a data frame
d1 <- as_data_frame( as.table(sample1), stringsAsFactors = FALSE)

# Create artificial freqs based on sample2 and join with d1
N <- sum(sample2)
d <- outer(rowSums(sample2)/N, colSums(sample2)/N) %>%
  as.table() %>%
  as_data_frame() %>%
  mutate(
    p = n / sum(n),
    N = round(p * sum(sample2))
    ) %>%
  select(Var1, Var2, p, N) %>%
  left_join(d1)
#> Joining, by = c("Var1", "Var2")

# Fit the model
mod <- glm( N ~ Var1 + Var2 + offset(log(n)), data=d, family=poisson("log") )

# Get the fitted values
d$f <- predict(mod, type="response")
d
#> # A tibble: 18 x 6
#>           Var1  Var2           p       N     n          f
#>          <chr> <chr>       <dbl>   <dbl> <int>      <dbl>
#> 1      Primary     B 0.018763534  168442     6  124197.33
#> 2     Lowersec     B 0.019765059  177432    14  119743.66
#> 3    Highersec     B 0.033832098  303713    46  336937.75
#> 4  Highershort     B 0.012190206  109432    16   90514.26
#> 5   Higherlong     B 0.004374806   39273     6   43486.57
#> 6   University     B 0.008887215   79781    21  163193.43
#> 7      Primary     F 0.111702426 1002761    62  960181.53
#> 8     Lowersec     F 0.117664672 1056285   169 1081463.51
#> 9    Highersec     F 0.201408086 1808056   327 1792009.27
#> 10 Highershort     F 0.072570318  651469   174  736456.24
#> 11  Higherlong     F 0.026043943  233798    44  238592.76
#> 12  University     F 0.052907063  474951    72  418616.69
#> 13     Primary     W 0.061364877  550877    43  637701.15
#> 14    Lowersec     W 0.064640297  580281   100  612790.82
#> 15   Highersec     W 0.110645603  993274   186  976095.99
#> 16 Highershort     W 0.039867250  357891    72  291821.50
#> 17  Higherlong     W 0.014307508  128440    23  119431.67
#> 18  University     W 0.029065039  260919    42  233840.87
Michał
  • Illustrating the advantages of `predict(..., type = "response")` is very useful. – DWin Aug 09 '16 at 15:29
  • Thank you, excellent answer! The problem is long solved, but your solution can certainly be useful to other people. Your answer is clear, refers to theory, and leads to correct results (same as the function I mention in my own edits). I think it is only appropriate to accept it as final. Perhaps you could provide a full reference to Little & Wu (1991)? – Maxim.K Aug 10 '16 at 09:20

Make a glm fit to the marginals with Poisson errors (yielding a log-linear model) and then use predict on an expand.grid data.frame built from the row and column values of the second sample. (There's no particular advantage that I can see in using IPF to estimate a log-linear model of this sort.)

> require(reshape2)
Loading required package: reshape2
> melt(sample1)
          Var1 Var2 value
1      Primary    B     6
2     Lowersec    B    14
3    Highersec    B    46
4  Highershort    B    16
5   Higherlong    B     6
6   University    B    21
7      Primary    F    62
8     Lowersec    F   169
9    Highersec    F   327
10 Highershort    F   174
11  Higherlong    F    44
12  University    F    72
13     Primary    W    43
14    Lowersec    W   100
15   Highersec    W   186
16 Highershort    W    72
17  Higherlong    W    23
18  University    W    42
> m_sample1 <- melt(sample1)
> glm(value ~ Var1 + Var2, data = m_sample1)

Call:  glm(formula = value ~ Var1 + Var2, data = m_sample1)

Coefficients:
    (Intercept)    Var1Highersec  Var1Highershort     Var1Lowersec      Var1Primary  
         -36.56           162.00            63.00            70.00            12.67  
 Var1University            Var2F            Var2W  
          20.67           123.17            59.50  

Degrees of Freedom: 17 Total (i.e. Null);  10 Residual
Null Deviance:      121200 
Residual Deviance: 22510    AIC: 197.4 

That was the linear model. This is the multiplicative (log-linear) model:

> glm( value ~ Var1+Var2, data= m_sample1, family="poisson")

Call:  glm(formula = value ~ Var1 + Var2, family = "poisson", data = m_sample1)

Coefficients:
    (Intercept)    Var1Highersec  Var1Highershort     Var1Lowersec      Var1Primary  
         1.7213           2.0357           1.2779           1.3550           0.4191  
 Var1University            Var2F            Var2W  
         0.6148           2.0515           1.4528  

Degrees of Freedom: 17 Total (i.e. Null);  10 Residual
Null Deviance:      1287 
Residual Deviance: 21.05    AIC: 139.1 

> predict(glm(value ~ Var1 + Var2, data = m_sample1, family = "poisson"),
          data.frame(Var1 = "Lowersec", Var2 = "B"))
       1 
3.076272 

This is on the link (log) scale: exp(3.076272) ≈ 21.7 is the fitted count; passing type = "response" to predict returns it directly.

Edit; More detail requested:

Multiply the grand sum by combinations of the appropriate entries from exp(coef(fit)). The non-intercept entries in coef(fit) let you compute the estimated ratios of proportions in "non-corner" cells to the "corner" cell. The University/F cell, for example, has the estimate exp(1.7213 + 0.6148 + 2.0515) in the Poisson model, which is what predict(fit, newdata, type = "response") on an expand.grid of the row and column names should give you. You then need to multiply by the ratio of the grand sum of the new data to the grand sum of the fit data.
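In code, a minimal sketch of that recipe (using predict(..., type = "response") and the grand-sum rescaling described above; note that this scales the independence fit of sample1 to sample2's grand total):

library(reshape2)

m_sample1 <- melt(sample1)
fit <- glm(value ~ Var1 + Var2, data = m_sample1, family = "poisson")

# All row/column combinations, predicted on the response (count) scale,
# then rescaled by the ratio of the grand sums
newd <- expand.grid(Var1 = rownames(sample1), Var2 = colnames(sample1))
newd$est <- predict(fit, newd, type = "response") * sum(sample2) / sum(sample1)
newd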

DWin
  • Would you please care to expand your answer to use `expand.grid()` as you mention? I don't see how prediction in its current form leads to the derivation of the resulting table above. Thanks. – Maxim.K May 16 '13 at 10:29
  • The "more details" are REALLY hard to understand.... – Michał Aug 08 '16 at 13:16
  • @Michal: Downvoting an answer to a three-year-old question is not an effective tactic for motivating me to make alterations. – DWin Aug 08 '16 at 15:51
  • @DWin, pity... :) I think it is an indicator for people looking for answers, whatever the age of the question... – Michał Aug 08 '16 at 22:48
  • No, No, No. You need to encourage people. Not discourage them. Downvotes are negative ... by definition. – DWin Aug 09 '16 at 04:56