I am trying to fit a Poisson regression to some soccer matches. I want to be able to predict matches of the first league for a new season, which means that there will be some new teams that have been promoted from last season's second league. Hence I am creating a data frame with all matches from last year's League 1 and League 2, and then fitting a Poisson regression. My problem is that there is a team whose coefficient cannot be estimated (NA). How can I overcome this?

My code:

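prepare_data <- function(dataframe) {
  # Keep only the teams and the full-time home/away goals
  dataframe <- dataframe[c("HomeTeam", "AwayTeam", "FTHG", "FTAG")]

  # Away side's goals: swap the team columns so that "HomeTeam" holds the
  # scoring team and "AwayTeam" the conceding team
  dataframe.temp <- dataframe[, c(2, 1, 4)]
  names(dataframe.temp) <- c("HomeTeam", "AwayTeam", "Goals")

  # Home side's goals
  dataframe <- dataframe[c("HomeTeam", "AwayTeam", "FTHG")]
  names(dataframe) <- c("HomeTeam", "AwayTeam", "Goals")

  # Stack both halves: one row per team per match
  dataframe <- rbind(dataframe, dataframe.temp)

  # Home = 1 for the first half (home side scoring), 0 for the second half
  dataframe$Home <- rep(c(1, 0), each = nrow(dataframe) / 2)

  dataframe
}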

# try to train model using two leagues
mydata3a <- read.csv("https://www.football-data.co.uk/mmz4281/1718/E0.csv", header = TRUE, stringsAsFactors = TRUE)
mydata3b <- read.csv("https://www.football-data.co.uk/mmz4281/1718/E1.csv", header = TRUE, stringsAsFactors = TRUE)

mydata3a = prepare_data(mydata3a)
mydata3b = prepare_data(mydata3b)

mydata3 = rbind(mydata3a, mydata3b)

model <- glm(Goals ~ Home + HomeTeam + AwayTeam, family=poisson, data=mydata3)

My problem is this coefficient in the output:

  AwayTeamWolves  
        NA 

I cannot predict matches of Wolves.

Fierce82
  • How many match results do you have with that team? – kjetil b halvorsen Aug 06 '19 at 21:30
  • 23 as Home team and 23 as Away team – Fierce82 Aug 06 '19 at 21:43
  • My guess is perfect collinearity, but I cannot confirm because I am not familiar with R. You can try fitting a linear model with ``AwayTeamWolves`` as the response variable and the other covariates used in the Poisson model as covariates to see if you get a perfect fit. – user158565 Aug 07 '19 at 01:52
  • I am voting to leave this open. I don't think it's really about R (see my answer). – Peter Flom Aug 07 '19 at 12:12
  • Your model is basically estimating attack (and defence) abilities of each team, _relative_ to the team selected as the reference team (the first one in the alphabet by default). However, when you include data on League 2, the problem is that none of the teams in League 2 plays against any of the teams in League 1, and hence the data contain no information on the abilities of any team in League 2 relative to the reference team (in League 1). So the model becomes unidentifiable and, as pointed out by @user158565, you'll see this as perfect collinearity between the dummy variables. – Jarle Tufto Aug 07 '19 at 13:03
  • @whuber I don't agree that this is a duplicate. The issue is something else; see my comment above. – Jarle Tufto Aug 07 '19 at 13:05
  • @Jarle You're likely right--but since at present there is no evidence explicit in the question that it differs from the apparent duplicate, I would prefer to see the question updated before voting to reopen it. – whuber Aug 07 '19 at 13:09
  • Could you save the dataset ``mydata3`` in ASCII format somewhere and give a link, so that I can use other software to explore the problem? I got ``mydata3`` in R, but I do not know how to get it out of R. – user158565 Aug 07 '19 at 18:18
  • @JarleTufto What you suggest makes sense. If I use more data from previous years I manage to get results, probably because linking paths between all teams can then be found. – Fierce82 Aug 07 '19 at 19:21
  • @TomZinger Yes, combining overlapping sets of teams should work. But many years of data may also require dynamic models; see e.g. https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9876.2012.01046.x – Jarle Tufto Aug 13 '19 at 13:10
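
A minimal sketch of the identifiability point raised in the comments above, using made-up data rather than the real results. The team names A, B, C, X, Y, Z, the helpers ``make_league`` and ``add_goals``, and the goal counts are all hypothetical. The idea is that two leagues that never meet should leave the design matrix one rank short, and that adding one season in which a League 1 team plays in League 2 should restore full rank.

```r
# Hypothetical helper: a double round robin for one league, in the long format
# produced by prepare_data ("HomeTeam" = scoring side, "AwayTeam" = conceding side).
make_league <- function(teams) {
  pairs <- expand.grid(HomeTeam = teams, AwayTeam = teams,
                       stringsAsFactors = FALSE)
  pairs <- pairs[pairs$HomeTeam != pairs$AwayTeam, ]
  rbind(cbind(pairs, Home = 1), cbind(pairs, Home = 0))
}

# Hypothetical helper: attach made-up goal counts (NOT real results).
add_goals <- function(d) {
  d$Goals <- rep(c(1, 2, 0, 1, 3, 1, 2, 1), length.out = nrow(d))
  d
}

# Two leagues whose teams never meet each other
split_frame <- rbind(make_league(c("A", "B", "C")),
                     make_league(c("X", "Y", "Z")))
# Same data plus an assumed extra season in which C plays in the second league
linked_frame <- rbind(split_frame, make_league(c("C", "X", "Y")))

split_leagues  <- add_goals(split_frame)
linked_leagues <- add_goals(linked_frame)

fit_split  <- glm(Goals ~ Home + HomeTeam + AwayTeam,
                  family = poisson, data = split_leagues)
fit_linked <- glm(Goals ~ Home + HomeTeam + AwayTeam,
                  family = poisson, data = linked_leagues)

# Compare the number of design-matrix columns with the numerical rank
X_split     <- model.matrix(fit_split)
X_linked    <- model.matrix(fit_linked)
rank_split  <- qr(X_split)$rank
rank_linked <- qr(X_linked)$rank
c(columns = ncol(X_split), rank_split = rank_split, rank_linked = rank_linked)

# Coefficients that glm() could not estimate
na_split  <- names(coef(fit_split))[is.na(coef(fit_split))]
na_linked <- names(coef(fit_linked))[is.na(coef(fit_linked))]

# user158565's check: regress the dropped dummy on the remaining columns;
# residuals that are numerically zero indicate perfect collinearity.
dropped   <- X_split[, na_split]
others    <- X_split[, colnames(X_split) != na_split, drop = FALSE]
max_resid <- max(abs(resid(lm(dropped ~ others - 1))))
```

If the rank argument holds, ``fit_split`` has exactly one NA coefficient and ``fit_linked`` has none, which matches the observation that adding earlier seasons makes the problem go away.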

1 Answer

I'm not 100% sure, but this seems like the same issue you have with any categorical independent variable: one of the levels is the reference category. One clue that this is happening is that "Wolves" is last alphabetically. If I am right, you are now getting, for each away team other than Wolves, how many goals would be predicted compared to AwayTeamWolves.

There are various parameterizations of categorical variables; these have been discussed here before, and you may also find help on R-specific lists. See, e.g., here.

This isn't unique to R - it's an issue whenever you have a categorical IV.
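
For comparison, a minimal sketch of how R's default treatment coding handles a reference level, on made-up data (the goal counts and the ``Team2`` column are hypothetical, and only three team names are used). Under this coding the reference level gets no coefficient of its own instead of an NA entry, and ``relevel`` changes which team plays that role without changing the fitted values.

```r
# Made-up data: one three-level factor and invented goal counts
d <- data.frame(
  Goals = c(0, 1, 2, 1, 3, 1, 2, 0, 1),
  Team  = factor(rep(c("Arsenal", "Burnley", "Wolves"), times = 3))
)

# Default coding: the first level alphabetically (Arsenal) is the reference
fit <- glm(Goals ~ Team, family = poisson, data = d)
coef_names <- names(coef(fit))   # no "TeamArsenal" entry
has_na     <- anyNA(coef(fit))   # reference coding alone gives no NA

# Make Wolves the reference level instead
d$Team2 <- relevel(d$Team, ref = "Wolves")
fit2 <- glm(Goals ~ Team2, family = poisson, data = d)
coef_names2 <- names(coef(fit2))

# Same model, different parameterization: fitted values agree
same_fit <- isTRUE(all.equal(unname(fitted(fit)), unname(fitted(fit2))))
```

An NA coefficient, as opposed to a missing reference level, usually indicates that a column of the design matrix is a linear combination of the others.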

Peter Flom
  • Since this is an FAQ, it would be better to find a duplicate. – whuber Aug 07 '19 at 12:37
  • I tried searching on "categorical variable parameterization" and didn't see a duplicate. Then I tried "reference level categorical variable" and saw 333 responses. I paged through the first few pages and didn't see a duplicate. I'm sure it's there. Maybe someone can find it. – Peter Flom Aug 07 '19 at 12:45
  • Thank you for looking. I found many promising hits with https://stats.stackexchange.com/search?q=dummy+reference. – whuber Aug 07 '19 at 13:00
  • Actually there is one team missing, Arsenal I believe, which acts as a baseline for the remaining ones. – Fierce82 Aug 07 '19 at 19:22