Can I use logistic regression if the distribution of proportions is skewed & lies in the middle of the [0,1] interval?

Question

I am conducting a logistic regression in order to predict the service point win percentage of a tennis player.

In terms of data - I have (for each player A) approx 300 matches - for each match I have the total number of player A service points (points where he is the server), total number of player A service point wins and total number of player A service point losses.

To do so, I have service point win percentage as the DV, and my independent variables are:

+Average service win percentage of last 3 matches
+ln(player's ranking points)
+ln(opposition's ranking points)
+surface the match was played on

My dependent variable, service win percentage, usually lies in the range 0.4-0.8. There are very few values greater than 0.8 (about 2.8% of values, dropping to < 1% at around 0.84), and there are no values less than 0.22. In addition, my data is much more concentrated above 0.5 than below it.

Thus, I worry that since my data doesn't have points close to 0 or 1, and is not symmetrical around 0.5 (like the sigmoidal curve of logistic regression), I am wasting my time with this model type. The results from my preliminary model outlined above are, although not shocking, pretty volatile.

I am conducting this in R and using the weights argument to allow me to use a proportion as the DV, giving the total number of trials as the weights. I use ln(points) because ranking points are exponential in nature.

The goal is to predict / forecast the service point win percentage of the player based on the IVs. Considering my data distribution and my goal, does logistic regression make sense? If not, is there any other type of model that makes more sense?

Do you know how many serves there were for each player? Ie, when you have 80% wins, do you know if that was 8/10 or 40/50? Do you have multiple data points for players (say from different matches)? — gung - Reinstate Monica, Aug 12 '15 at 23:58
@gung yes, I have the total number of service points for each match, total service points won and total service points lost. From this I get my service win %. I should probably make that clearer in the post, thanks for asking. Yes, I have multiple data points for each player - for example, I would have 300 Rafael Nadal matches. Each match contributes an observation of service point win % (or if you like, total number of service points, service points won, service points lost etc.) — Stevie Kvothe, Aug 13 '15 at 00:05
@Stevie Kvothe: maybe take a look at the answers to http://stats.stackexchange.com/questions/164120/interesting-logistic-regression-idea-problem-data-not-currently-in-0-1-form/164127#164127. The weights parameter can be used if you have a priori knowledge about the number of successes and if this prior value is different from the number of successes in the sample that you use to estimate the parameters. — , Aug 13 '15 at 07:14
@gung - they are volatile in terms of predictions. Some are pretty accurate, and then some of them are 10%+ off the mark relative to the actual observations. — Stevie Kvothe, Aug 13 '15 at 09:08
@Glen_b, "skewed" is my term. It was an effort to give a more descriptive title in accordance with the text in the body of the question. (See the revision history.) Feel free to re-edit if you think best. — gung - Reinstate Monica, Aug 13 '15 at 13:21
@StevieKvothe, unfortunately it's impossible to diagnose that without a lot more information, but the fact that your observed %s are in the middle shouldn't be a problem for a logistic regression model. — gung - Reinstate Monica, Aug 13 '15 at 13:26
@gung I don't mind which term is used to describe it... I'm just not certain whether I have correctly understood what is being described and figured a picture would convey more clearly whatever it might be. — Glen_b, Aug 13 '15 at 13:56
@Glen_b skewed is actually pretty accurate, I suppose. Basically, the data lies mostly in the 0.3-0.8 range. Within that range, there is a far heavier concentration of data in the 0.6-0.8 range. I'll add a plot in the morning; it's just a little tricky considering there are 17000 data points or so — Stevie Kvothe, Aug 13 '15 at 21:49
The marginal distribution of y is not very useful, since that depends on the distribution of x. How does y vs x look? How does a smooth of it look? What's the distribution of y for small values of x vs the distribution for large x? (E.g. split the x range into slices and give the distribution of y in one near the left and one near the right.) If you think that 17K points is too many, you could always randomly sample, say, 5% of them and plot that, but if you split the range up so we can see the conditional distribution, that shouldn't matter anyway, since the number in each slice should be low — Glen_b, Aug 13 '15 at 23:51

Placidia · Answer 1 · 2015-08-13T01:45:48.773

Logistic regression looks like a good choice here. You don't need responses centered on 0.5. I'm not so sure about the weights. If you have a column of successes (say r) and a column of trials (total service points), you can do

glm(cbind(r, n - r) ~ IV1 + IV2 + IV3, family = binomial(), data = tennisData)

and the estimation takes care of things.
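On the question of the weights: R's binomial family accepts either a two-column matrix of successes and failures, or a proportion with the number of trials supplied through weights. The sketch below, on simulated data, suggests the two forms give the same coefficient estimates. The data frame, the single predictor IV1 and the coefficient values are hypothetical stand-ins, not the real tennis data.

```r
# Simulated stand-in for the tennis data; all names and values are hypothetical.
set.seed(1)
n_matches <- 200
tennisData <- data.frame(
  n   = sample(40:120, n_matches, replace = TRUE),  # service points played
  IV1 = rnorm(n_matches)                            # placeholder predictor
)
p_true <- plogis(0.5 + 0.2 * tennisData$IV1)
tennisData$r    <- rbinom(n_matches, size = tennisData$n, prob = p_true)  # points won
tennisData$prop <- tennisData$r / tennisData$n                            # win proportion

# Form 1: successes and failures as a two-column response.
fit_counts <- glm(cbind(r, n - r) ~ IV1, family = binomial(), data = tennisData)

# Form 2: proportion as the response, number of trials as weights.
fit_weights <- glm(prop ~ IV1, family = binomial(), weights = n, data = tennisData)

# Compare the coefficient vectors of the two fits.
same_fit <- isTRUE(all.equal(coef(fit_counts), coef(fit_weights), tolerance = 1e-6))
print(same_fit)
```

If the two fits agree on your own data as well, the choice between the forms is a matter of convenience.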

If each player has several matches and each match has several service points, you might want to include a random effect for matches. A random effect for players is also possible, but you are probably better off including past player outcomes in the model. In other words, I doubt you would be able to estimate a player effect in a model that also contains the "average of previous matches" variable. If there is a match effect, it is important to include it so as to obtain appropriate confidence intervals for your predictions.
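One rough way to see whether a match effect might be present is to check a plain binomial fit for overdispersion. This is a diagnostic, not the random-effects model itself. The sketch below simulates matches with an unobserved match-level effect and computes the Pearson dispersion statistic. All variable names and parameter values are hypothetical.

```r
# Simulated matches with extra match-to-match variation (hypothetical values).
set.seed(2)
n_matches <- 300
n_points  <- sample(40:120, n_matches, replace = TRUE)  # service points per match
x         <- rnorm(n_matches)                           # placeholder predictor
match_eff <- rnorm(n_matches, sd = 0.3)                 # unobserved match-level effect
wins      <- rbinom(n_matches, size = n_points,
                    prob = plogis(0.5 + 0.2 * x + match_eff))

# Plain binomial fit that ignores the match-level effect.
fit <- glm(cbind(wins, n_points - wins) ~ x, family = binomial())

# Pearson dispersion: roughly 1 under pure binomial variation,
# noticeably larger when matches differ beyond what the predictors explain.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)
```

A value well above 1 would suggest that intervals from the plain binomial fit are too narrow, which is the concern raised above.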

score 1 · Answer 2 · answered Aug 13 '15 at 01:26

You should be fine. Logistic regression assumes that your response variable is binomial, which yours is. Your data do not need to span a wide enough range of any independent variable to reproduce the entire sigmoid shape.
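A small simulation can illustrate this point. In the sketch below, the true win probabilities all sit in a narrow band above 0.5, yet a binomial fit still recovers the coefficients used to generate the data. The predictor, sample sizes and coefficient values are hypothetical.

```r
# Simulate matches whose true win probabilities stay in the middle of [0, 1].
set.seed(3)
n_matches <- 2000
x     <- runif(n_matches, -1, 1)  # placeholder predictor
beta0 <- 0.6                      # hypothetical intercept on the logit scale
beta1 <- 0.4                      # hypothetical slope on the logit scale
p     <- plogis(beta0 + beta1 * x)
n_points <- sample(40:120, n_matches, replace = TRUE)  # service points per match
wins     <- rbinom(n_matches, size = n_points, prob = p)

# Binomial logistic regression on counts of wins and losses.
fit <- glm(cbind(wins, n_points - wins) ~ x, family = binomial())

print(range(p))   # span of the true probabilities
print(coef(fit))  # estimates to compare with beta0 and beta1
```

Only a short, nearly linear stretch of the sigmoid is used here, and the estimates should still land close to the generating values.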

A different issue is that logistic regression makes no assumptions regarding the distribution of your independent variables. Thus, you do not need to take the log of points, for example (you certainly can if you want to for other reasons, though).

On the other hand, ordinary logistic regression assumes the observations are independent. Since you have multiple observations from the same player, that won't be true. To account for this, you need to fit a mixed effects logistic regression, i.e., a generalized linear mixed model (GLMM).
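A minimal sketch of such a model, assuming the lme4 package is installed and using its glmer function with a random intercept per player. The data are simulated, and the column names (player, x, n, wins) and parameter values are hypothetical.

```r
library(lme4)  # assumption: the lme4 package is installed

# Simulated data: several matches per player, with player-level deviations.
set.seed(4)
n_players <- 30
matches_per_player <- 20
d <- data.frame(
  player = factor(rep(seq_len(n_players), each = matches_per_player)),
  x      = rnorm(n_players * matches_per_player),  # placeholder predictor
  n      = sample(40:120, n_players * matches_per_player, replace = TRUE)
)
player_eff <- rnorm(n_players, sd = 0.3)  # hypothetical player-level effects
d$wins <- rbinom(nrow(d), size = d$n,
                 prob = plogis(0.5 + 0.2 * d$x + player_eff[as.integer(d$player)]))

# Random intercept per player: matches from one player share a baseline.
fit_glmm <- glmer(cbind(wins, n - wins) ~ x + (1 | player),
                  family = binomial, data = d)

print(fixef(fit_glmm))    # fixed-effect estimates
print(VarCorr(fit_glmm))  # estimated between-player standard deviation
```

A per-match random intercept could be added in the same way, given a hypothetical match identifier column.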
