The mission
I am trying to find a way to do Iterative Proportional Fitting in R. The logic of the procedure is like this: one has a table with e.g. sample distribution of some variables. Let us say it is this one:
sample1 <- structure(c(6L, 14L, 46L, 16L, 6L, 21L, 62L, 169L, 327L, 174L,
44L, 72L, 43L, 100L, 186L, 72L, 23L, 42L), .Dim = c(6L, 3L), .Dimnames = list(
c("Primary", "Lowersec", "Highersec", "Highershort", "Higherlong",
"University"), c("B", "F", "W")))
Another table is from some other source, say another sample:
sample2 <- structure(c(171796L, 168191L, 240671L, 69168L, 60079L, 168169L,
954045L, 1040981L, 1872732L, 726410L, 207366L, 425786L, 596239L,
604826L, 991640L, 323215L, 134066L, 221696L), .Dim = c(6L, 3L
), .Dimnames = list(c("Primary", "Lowerse", "Highersec", "Highershort",
"Higherlong", "University"), c("B", "F", "W")))
Now, we want to preserve the relationships between variables found in sample1
, but we want to apply these relationships to the marginal distribution we find in sample2
. Iterative Proportional Fitting does this as is described here (I could not possibly offer a better explanation). I have tried doing it in LEM, with the following result:
B F W
Primary 124204.64 960173.6 637701.7
Lowerse 119749.12 1081459.0 612789.9
Highersec 336934.21 1792001.6 976107.2
Highershort 90512.27 736464.1 291816.6
Higherlong 43486.91 238593.0 119431.0
University 163186.85 418628.6 233835.5
I am not 100% sure that this result is, but chances are (say 99%) that it is. The odds ratios from the first table are preserved in the resulting table, while the marginal distributions (row and column sums) are identical to the second input table.
The problem
Strangely, this quite useful algorithm is not readily available in R, at least not in a user-friendly form. One function that is likely to be relevant is cat::ipf()
. However, I cannot figure out how to use the margins=
argument. I am certainly not alone in this question. The help example uses a 3-dimensional table, which makes things even more confusing.
In addition, there exist some user-written functions, one is here and the other is to be found here. The first one gives an erroneous result, unfortunately. The second one is also very non-transparent, requiring specifically pre-formatted CSV files as input, instead of R matrix objects.
The question
- Can anyone please explain how to actually use the
cat::ipf()
function? - Are there any alternative functions to achieve the IPF adjustment task, using matrices as input?
- (SOLVED) Can this function be fixed to deliver a proper result?
Thank you.
ADDENDUM: I was able to get a proper output from the function in (3). After some consideration in turned out that the function does not accept matrix as input for marginal distribution, but simply a list of these marginal distributions. So practically the question is solved. However, a proper answer for 1 and 2 would be of use to the larger community, since IPF is quite essential in loglinear models.
Also reproducing a working IPF function here seems like a good idea, for those who search in the future:
ipf <- function(Margins_, seedAry, maxiter=100, closure=0.001) {
#Check to see if the sum of each margin is equal
MarginSums. <- unlist(lapply(Margins_, sum))
if(any(MarginSums. != MarginSums.[1])) warning("sum of each margin
not equal")
#Replace margin values of zero with 0.001
Margins_ <- lapply(Margins_, function(x) {
if(any(x == 0)) warning("zeros in marginsMtx replaced with
0.001")
x[x == 0] <- 0.001
x
})
#Check to see if number of dimensions in seed array equals the number of
#margins specified in the marginsMtx
numMargins <- length(dim(seedAry))
if(length(Margins_) != numMargins) {
stop("number of margins in marginsMtx not equal to number of
margins in seedAry")
}
#Set initial values
resultAry <- seedAry
iter <- 0
marginChecks <- rep(1, numMargins)
margins <- seq(1, numMargins)
#Iteratively proportion margins until closure or iteration criteria are met
while((any(marginChecks > closure)) & (iter < maxiter)) {
for(margin in margins) {
marginTotal <- apply(resultAry, margin, sum)
marginCoeff <- Margins_[[margin]]/marginTotal
marginCoeff[is.infinite(marginCoeff)] <- 0
resultAry <- sweep(resultAry, margin, marginCoeff, "*")
marginChecks[margin] <- sum(abs(1 - marginCoeff))
}
iter <- iter + 1
}
#If IPF stopped due to number of iterations then output info
if(iter == maxiter) cat("IPF stopped due to number of iterations\n")
#Return balanced array
resultAry
}
Note that this function will not accept a matrix for marginal distribution. The following will help:
m1 <- rowSums(sample2)
m2 <- colSums(sample2)
m <- list(m1,m2)
And then supply m
as the first argument.