I'm working on a project where I have to predict sewer overflow caused by rain at 5 sewer locations. I have a file that tells me whether there was an overflow (=1) or not (=0) on a given date for a given sewer. My data consists of daily rain measurements from 5 rain stations, plus other information for 170 sewers. As variables, for each of the 5 sewer locations I took the 3 closest rain stations and averaged their rain quantity for each day, and I also took the max rain measurement for each day.
My concern is that I don't use the information from the other 165 sewers at all in my model. Each of my 5 models only uses information about its own sewer. Using SVM, I got an F1 score of 81 on average across the 5 models.
To improve my model, I merged the 5 models into 1 and added a categorical variable that identifies the sewer. I also added one extra sewer to the model (though I don't have to predict overflow for it). But my accuracy decreased to 60.
I would like some advice. Does this approach inherently decrease accuracy, am I doing something wrong, or do I need to add more sewers to the model?
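A minimal sketch of the merged setup described above, assuming scikit-learn and purely synthetic data. The column names (`sewer_id`, `rain_mean`, `rain_max`, `overflow`), the per-sewer thresholds, and the default `SVC` settings are hypothetical and only for illustration. It one-hot encodes the sewer identifier, scales the rain variables, and reports the F1 score per sewer so the pooled model can be compared with the separate models on the same metric.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: each sewer overflows above its own (made-up) rain threshold.
rng = np.random.default_rng(0)
n_days = 400
sewers = ["s1", "s2", "s3"]
frames = []
for i, sewer in enumerate(sewers):
    rain_mean = rng.gamma(shape=1.0, scale=5.0, size=n_days)
    rain_max = rain_mean + rng.gamma(shape=1.0, scale=3.0, size=n_days)
    overflow = (rain_mean > 6.0 + 3.0 * i).astype(int)
    frames.append(pd.DataFrame({
        "sewer_id": sewer,
        "rain_mean": rain_mean,
        "rain_max": rain_max,
        "overflow": overflow,
    }))
data = pd.concat(frames, ignore_index=True)

X = data[["sewer_id", "rain_mean", "rain_max"]]
y = data["overflow"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=data["sewer_id"]
)

# One pooled model: one-hot encode the sewer identifier, scale the numeric variables.
preprocess = ColumnTransformer([
    ("sewer", OneHotEncoder(handle_unknown="ignore"), ["sewer_id"]),
    ("rain", StandardScaler(), ["rain_mean", "rain_max"]),
])
model = Pipeline([("prep", preprocess), ("svm", SVC())])
model.fit(X_train, y_train)

# Score each sewer separately with the same metric used for the individual models.
pred = pd.Series(model.predict(X_test), index=X_test.index)
per_sewer_f1 = {}
for sewer in sewers:
    mask = X_test["sewer_id"] == sewer
    per_sewer_f1[sewer] = f1_score(y_test[mask], pred[mask])
print(per_sewer_f1)
```

Scoring per sewer keeps the comparison like for like: a single pooled accuracy mixes all sewers together, including any extra sewer that does not need to be predicted.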
Edit:
I ended up using all five stations as variables for each of the 5 sewers. Here are some pictures:
- Column 3 indicates whether there is a sewer overflow (1) for a given sewer ID and a given date:
- Rain measurements for each hour of each day from 2015 to 2018 at 5 different rain stations:
So for each of the 5 sewers I made a model with 15 variables, 3 per rain station:
- The sum of the amount of rain for a day
- The max rain measurement for a given hour
- The number of hours where it rained
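The three per-station variables listed above could be computed from the hourly measurements roughly as follows. This is a minimal sketch assuming pandas and a table with an hourly `DatetimeIndex` and one column per rain station; the function name `daily_rain_features`, the station column names, and the sample values are made up for illustration.

```python
import pandas as pd

def daily_rain_features(hourly: pd.DataFrame) -> pd.DataFrame:
    """Build daily variables from hourly rain amounts (one column per station)."""
    day = hourly.index.normalize()  # midnight timestamp of each hourly row
    total = hourly.groupby(day).sum().add_suffix("_sum")           # rain per day
    peak = hourly.groupby(day).max().add_suffix("_max")            # wettest hour
    wet = (hourly > 0).groupby(day).sum().add_suffix("_wet_hours") # hours with rain
    return pd.concat([total, peak, wet], axis=1)

# Made-up example: two stations, two days of hourly data.
idx = pd.date_range("2015-01-01", periods=48, freq=pd.Timedelta(hours=1))
rain = pd.DataFrame(0.0, index=idx, columns=["station_A", "station_B"])
rain.iloc[1, 0] = 1.0   # station_A, day 1
rain.iloc[2, 0] = 3.0   # station_A, day 1
rain.iloc[30, 0] = 2.0  # station_A, day 2
rain.iloc[5, 1] = 0.5   # station_B, day 1

features = daily_rain_features(rain)
print(features)
```

With 5 station columns this yields the 15 variables per day, which can then be joined to the overflow labels on the date.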