9

I have three variables:

  • distance (continuous, range negative infinity to positive infinity)
  • isLand (discrete categorical/Boolean, values 1 or 0)
  • occupants (discrete categorical, range 0-7)

I want to answer the following statistical questions:

  • How do I compare distributions that involve both categorical and continuous variables? For example, I would like to determine whether the distribution of distance vs. occupants varies depending on the value of isLand.
  • Given two of the three variables, can I predict the third using some equation?
  • How can I determine independence with more than two variables?
Elpezmuerto
  • 1
    I would recommend that you split this across three separate questions. – Shane Sep 13 '10 at 15:20
  • Actually, now that I read this a little closer, I see that the answer for each is very closely related. – Shane Sep 13 '10 at 15:25
  • I felt that the heart of the question is comparing two different distributions; I just happened to list three different ways to do it. – Elpezmuerto Sep 13 '10 at 16:14
  • For `occupants` what you've got is an ordinal variable, so I wouldn't think of it as categorical. Especially with 8 values, it's almost continuous. – Mike Dunlavey Sep 14 '10 at 00:07

2 Answers

5

I would recommend reading about logistic or log-linear models in particular, and methods of categorical data analysis in general. The notes on the following course are pretty good for a start: Analysis of Discrete Data. The textbook by Agresti is quite good. You might also consider Kleinbaum for a quick start.
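
As a rough illustration of the log-linear idea (not taken from any of the references above), here is a minimal Python sketch that tests mutual independence of three categorical variables. It computes the likelihood-ratio statistic G² for the log-linear model with main effects only, where the expected count in each cell is the product of the three marginal totals divided by n². The function name `mutual_independence_g2`, the binning of distance into "near"/"far", and the data are all hypothetical; a continuous variable has to be categorized before it can enter such a table.

```python
import math
from collections import Counter


def mutual_independence_g2(rows):
    """Likelihood-ratio test statistic for mutual independence.

    rows: iterable of (a, b, c) tuples of category labels.
    Returns (G2, df) for the main-effects-only log-linear model.
    """
    rows = list(rows)
    n = len(rows)
    cells = Counter(rows)
    # One-way marginal totals for each of the three variables.
    margins = [Counter(r[i] for r in rows) for i in range(3)]

    g2 = 0.0
    for (a, b, c), observed in cells.items():
        # Expected count under mutual independence.
        expected = margins[0][a] * margins[1][b] * margins[2][c] / n ** 2
        # Empty cells contribute 0, so iterating over observed cells suffices.
        g2 += 2 * observed * math.log(observed / expected)

    i, j, k = (len(m) for m in margins)
    df = i * j * k - i - j - k + 2
    return g2, df


# Hypothetical observations: (binned distance, isLand, occupants)
data = [
    ("near", 1, 0), ("near", 1, 1), ("near", 0, 3), ("far", 0, 5),
    ("far", 0, 7), ("far", 1, 2), ("near", 1, 1), ("far", 0, 5),
]
g2, df = mutual_independence_g2(data)
print(f"G2 = {g2:.3f} on {df} degrees of freedom")
```

G² would then be compared against a chi-square distribution with `df` degrees of freedom. With real data, a statistics package that fits log-linear models directly is a better choice, since it also handles the intermediate models between mutual independence and the saturated model.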

ars
  • I actually have the Agresti textbook on my desk right now and I have been using it. The problem is that I didn't know what specific methodology I should be using. – Elpezmuerto Sep 13 '10 at 16:32
  • 2
    @Elpezmuerto Very briefly, to complement @ars answer, question 1 can be answered with a conditional or trellis plot, e.g. something like `dist ~ occ | isLand` using Lattice, or see the `coplot()` function in the `vcd` package -- this is for exploratory purposes; question 2 calls for a prediction model; depending on the variable you consider as your outcome, it may be logistic regression (e.g. if Y=isLand), a linear regression (e.g. if Y=distance), or directly a log-linear model provided you categorize your continuous measurement; question 3 is clearly a log-linear model as suggested by @ars. – chl Sep 13 '10 at 19:10
  • 1
    @Elpezmuerto @ars Thanks to the work of Laura Thompson, Agresti's book is available in R too, http://j.mp/9fXheu :-) – chl Sep 13 '10 at 19:12
  • 2
    @chl: that's a great find! Thank you. @Elpezmuerto: There's a series of examples in Agresti concerning crabs -- I'm pretty sure there's a continuous variable (size of crab?) along with a color (range) and a boolean (can't recall). So fairly close to your case -- it's probably instructive to read through those examples which span at least 2 chapters (one chapter is logistic regression I believe). – ars Sep 13 '10 at 19:33
  • @ars These are esp. chapters 4 and 5, with carapace width and weight as continuous variables and spine condition as another categorical (ordinal) variable, used in Poisson and Logistic regression :) – chl Sep 13 '10 at 19:51
  • @Ars, I had actually been using the crab examples before you mentioned them, so at least I know I am on the right track. – Elpezmuerto Sep 13 '10 at 20:18
  • @chl: totally impressed by your consistent thoroughness in all responses! You're setting a high bar though. :) @Elpezmuerto: totally on the right track. I'm guessing you're more or less aware of the methodology and just need to start applying it to your data. Then ask specific questions if you hit a wall or are unsure. It's the only way to learn modeling, I think. – ars Sep 14 '10 at 01:25
  • @Ars, I felt I had hit an information wall, but you guys are getting me on the right track. I am also considering a generalized linear model, because max occupants is a discrete response variable whose possible outcomes are counts. Agresti (page 74) suggests a generalized linear model in that case. – Elpezmuerto Sep 14 '10 at 13:51
2
  1. To examine the relationship between a continuous variable and a categorical factor, a good start is side-by-side box plots, with the continuous variable on the vertical axis and the categorical factor on the horizontal axis. Are the means different? Use ANOVA to check.

  2. To examine the relationship between categorical factors, a good start is to use a mosaic plot, as well as a contingency table. You could group first then make separate plots.

  3. To predict occupants, ordinal logistic regression is probably the best way to go.

  4. To predict isLand, (binomial) logistic regression should do the trick.

  5. To predict distance, OLS regression will work.
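
To make the ANOVA check in item 1 concrete, here is a minimal Python sketch that computes the one-way ANOVA F statistic by hand, comparing distance across the two levels of isLand. The function name `one_way_anova_f` and the data values are made up for illustration; in practice a statistics package would do this and report a p-value as well.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical distance values, grouped by isLand (0 or 1).
distance_by_is_land = {
    0: [-1.2, 0.4, -0.7, 0.1, -0.3],
    1: [1.9, 2.4, 1.1, 2.8, 1.6],
}
f_stat, df_b, df_w = one_way_anova_f(list(distance_by_is_land.values()))
print(f"F = {f_stat:.2f} on ({df_b}, {df_w}) degrees of freedom")
```

A large F relative to the F distribution with those degrees of freedom suggests the group means differ. With only two groups, as with isLand, this is equivalent to a two-sample t-test assuming equal variances; the same function works unchanged if distance is grouped by the eight levels of occupants instead.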

Neil McGuigan