Rain overflow modeling: Categorical variables or separated models?

Question

I'm working on a project where I have to predict rain-induced overflow at 5 sewer locations. I have a file that records, for a given date and a given sewer, whether an overflow occurred (=1) or not (=0). My data consist of daily rain measurements from 5 rain stations, plus other information on 170 sewers. As predictors, I took the 3 rain stations closest to each of the 5 sewer locations and averaged their rain amounts for each day; I also took the maximum rain measurement for each day.

My concern is that my model doesn't use the information from the other 165 sewers at all. In fact, each of my 5 models uses only data about its own sewer. Using SVM, I got an average F1 score of 81 across the 5 models.

To improve my model, I thought I would merge these 5 models into 1 and add a categorical variable that identifies the sewer. I also added an extra sewer to the model (though I don't have to predict overflow for it). But my accuracy decreased to 60.

I would like some advice. Does this approach decrease the accuracy, am I doing something wrong, or do I have to add more sewers to my model?

Edit:

I ended up using all five stations as variables for each of the 5 sewers. Here are some pictures:

Column 3 indicates whether there is a sewer overflow (1) for a given sewer id and a given date:

Rain measurements for each hour of each day from 2015 to 2018 at 5 different rain stations:

So for each of the 5 sewers I made a model with 15 variables: 3 variables per rain station, namely:

The sum of the amount of rain for a day
The max rain measurement for a given hour
The number of hours where it rained
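
A minimal sketch of how these three per-station features could be computed from hourly readings. The record layout `(date, station, rain_mm)` and the function name `daily_features` are assumptions made for illustration, not the actual format of the data:

```python
from collections import defaultdict

def daily_features(hourly_records):
    """Aggregate hourly rain readings into per-day, per-station features.

    hourly_records: iterable of (date, station, rain_mm) tuples, one per hour
    (a hypothetical layout). Returns a dict keyed by (date, station).
    """
    grouped = defaultdict(list)
    for date, station, rain_mm in hourly_records:
        grouped[(date, station)].append(rain_mm)

    features = {}
    for key, values in grouped.items():
        features[key] = {
            "sum": sum(values),                            # total rain that day
            "max": max(values),                            # wettest single hour
            "wet_hours": sum(1 for v in values if v > 0),  # hours with any rain
        }
    return features

# Made-up readings: four hourly values for one station on one day.
records = [
    ("2015-01-01", "A", 0.0),
    ("2015-01-01", "A", 2.5),
    ("2015-01-01", "A", 1.0),
    ("2015-01-01", "A", 0.0),
]
result = daily_features(records)
```

With 5 stations this gives 3 x 5 = 15 values per day, which would form one row of the design matrix for a given sewer.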

Do you also have the amount of overflow? Why did you average the rain for three stations, why not use the five individual rain gauges as variables? Could you specify the models you fitted with equations, please? And share (a link to) the data? — kjetil b halvorsen, Dec 14 '19 at 15:08
The data is not public, so I shared some pictures (please see the updated post) — dah fox, Dec 14 '19 at 20:50

score 3 · Answer 1 · answered Dec 15 '19 at 00:20

You have spatio-temporal data; there may be some ideas under spatio-temporal. To use information from other sewers, first investigate whether there is some spatial correlation; there might also be useful temporal correlation.

So, here is what I would try as a start: use logistic regression, but formulated via a latent variable, as in How is Logistic Regression related to Logistic Distribution?. This latent variable could be modeled both spatially and in time. There are some interesting papers in this stored Google search.
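
To make the latent-variable formulation concrete, here is a small simulation sketch. It assumes an overflow is recorded whenever a latent quantity `eta + noise` exceeds zero, where `eta` is the linear predictor and the noise follows a standard logistic distribution; under that assumption the overflow probability equals the usual logistic function of `eta`. The value of `eta` and all function names are made up for illustration:

```python
import math
import random

def sigmoid(x):
    """Standard logistic function, the CDF of the logistic distribution."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_noise(rng):
    """Draw from the standard logistic distribution by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return math.log(u / (1.0 - u))

def simulate_overflow_rate(eta, n, seed=0):
    """Fraction of n draws in which the latent variable eta + noise exceeds 0."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if eta + logistic_noise(rng) > 0)
    return hits / n

eta = 0.8  # hypothetical linear predictor for one sewer on one day
empirical = simulate_overflow_rate(eta, n=100_000)
theoretical = sigmoid(eta)
```

The two numbers should agree closely. Spatial and temporal structure would then be introduced by letting `eta` (or an extra correlated term added to it) depend on location and time.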

score 1 · Answer 2 · answered Dec 15 '19 at 18:28

The recommendations of @kjetil_b_halvorsen to take advantage of spatio-temporal correlations and to model as a latent variable are valuable (+1). I have a few additional suggestions on details of how to put that into practice.

The principles of storm-drain overflow (I'm assuming that's what you mean by sewer)* are pretty simple in outline. Either more water is presented to the drain per minute than the grate over the drain can accommodate (e.g., because the covering grate is clogged by leaves or ice), or the inflow minus outflow (into a storm sewer system or into the ground with a local system) integrated over time has exceeded the capacity of the associated local catch basin.
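
These two mechanisms can be sketched as a toy water-balance model. This is only an illustration under simplifying assumptions (constant outflow rate, fixed inlet and basin capacities, arbitrary units); the function and parameter names are hypothetical:

```python
def simulate_overflows(inflow, inlet_capacity, outflow_rate, basin_capacity):
    """Flag each time step in which either overflow mechanism is triggered.

    inflow: water presented to the drain in each time step.
    All quantities share the same (arbitrary) volume units.
    """
    stored = 0.0
    flags = []
    for water_in in inflow:
        # Mechanism 1: more water arrives in this step than the inlet admits.
        inlet_exceeded = water_in > inlet_capacity
        admitted = min(water_in, inlet_capacity)

        # Mechanism 2: inflow minus outflow, accumulated over time,
        # exceeds what the catch basin can hold.
        stored = max(0.0, stored + admitted - outflow_rate)
        basin_exceeded = stored > basin_capacity
        if basin_exceeded:
            stored = basin_capacity  # the excess spills; the basin stays full

        flags.append(inlet_exceeded or basin_exceeded)
    return flags

# Made-up inflow series: a sustained moderate rain, then one intense burst.
hourly_inflow = [0, 5, 5, 5, 0, 12]
flags = simulate_overflows(
    hourly_inflow, inlet_capacity=10, outflow_rate=2, basin_capacity=5
)
```

In this example the third and fourth steps overflow because storage has built up, while the last step overflows because the burst exceeds the inlet capacity, even though the daily total is unremarkable.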

These principles suggest that you should be focusing on appropriate scales both of time and space. Rainfall sums over a day, or maximum hourly rainfall, or numbers of hours of rain per day do not tell the whole story. Consider looking in detail at the patterns of local rainfall before individual overflow events; you do seem to have hourly rainfall data for this purpose. Plots of local rainfall versus time before overflow events should provide a good deal of insight into the problem.
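
As a starting point for that kind of inspection, here is a minimal sketch that extracts the rainfall leading up to each event and computes trailing sums over a chosen window. It assumes hourly rainfall in a plain list and event times given as hour indices, which is a hypothetical layout (the overflow labels described in the question are daily):

```python
def trailing_sums(hourly_rain, window):
    """Rain summed over the `window` hours ending at each hour (inclusive)."""
    sums = []
    running = 0.0
    for i, value in enumerate(hourly_rain):
        running += value
        if i >= window:
            running -= hourly_rain[i - window]  # drop the hour that left the window
        sums.append(running)
    return sums

def rain_before_events(hourly_rain, event_hours, lookback):
    """For each event hour, return the `lookback` hourly values ending at it."""
    return {
        h: hourly_rain[max(0, h - lookback + 1): h + 1]
        for h in event_hours
    }

# Made-up hourly series with one overflow event at hour index 3.
hourly_rain = [0, 1, 4, 2, 0, 3]
three_hour_totals = trailing_sums(hourly_rain, window=3)
profiles = rain_before_events(hourly_rain, event_hours=[3], lookback=3)
```

Plotting each entry of `profiles` against time, and comparing trailing sums for several window lengths, is one way to see which time scale separates overflow days from non-overflow days.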

Other things to consider for choosing scales of space and time: The rate at which a drain can empty into a sewer system will depend on what is being presented to the other drains in the system. The rate of emptying of a locally draining catch basin into the ground might depend on the season and on prior rainfalls integrated over days or weeks. The probability of inflow blockage by leaves or ice will depend on the season.
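
One common way to summarize prior rainfall integrated over days or weeks is an antecedent precipitation index, in which each day's value is the previous value multiplied by a decay factor plus that day's rain. The sketch below assumes daily totals in a list; the decay factor is an arbitrary illustrative choice that would need to be tuned, not a recommended value:

```python
def antecedent_precipitation_index(daily_rain, decay):
    """Exponentially decayed running total of past rainfall.

    api[t] = decay * api[t - 1] + daily_rain[t], starting from zero.
    A decay closer to 1 gives the index a longer memory.
    """
    api = 0.0
    series = []
    for rain in daily_rain:
        api = decay * api + rain
        series.append(api)
    return series

# Made-up example: one wet day followed by two dry days.
api_series = antecedent_precipitation_index([10, 0, 0], decay=0.5)
```

Such an index could be added as a predictor next to the same-day rainfall features, together with a seasonal indicator such as the month.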

Modern machine-learning capabilities can lead to over-reliance on automated or semi-automated modeling to solve problems like yours. When you know a fair amount about the underlying principles, however, starting from those principles in deciding how to formulate the model in the first place might be a better way to proceed. Considering estimates of time-integrated local rainfall as the input to the sewer, some longer-term temporal and broader spatial integration (along with seasonal characteristics) setting the outflow capacity, and perhaps some additional consideration of a seasonally-associated probability of inflow blockage, might give you a more useful model than simply throwing all your data into a logistic regression and seeing what comes out.

*If you are considering larger-scale systems than individual storm drains, you need to scale up these principles accordingly.
