I have a machine learning problem and have been working in scikit-learn/pandas with Python to build an accurate model. I find myself deep in a rabbit hole trying to determine the best approach and how many variables are too many while avoiding overfitting.

Each model is for a different area with the variables indicated below:

x = monthly precipitation departure. This can be used as an overall monthly average over an area, or broken down into sub-areas of the overall area of interest to add additional variables. For example, Kansas Group 1 can be treated as a whole or separated into sub-areas with monthly averages for each.

y = monthly availability of a resource (e.g., Jan = 0.003827)

n = 16 years or 192 months of data

I have tried many different approaches.

The first approach was modeling each month separately, so a model for January (n=16), a model for February (n=16), etc., using the following modeling techniques:

  • LinearRegression using my own assigned weighting variables as a weighted running-mean analysis
  • RandomTrees with tuning variables
  • RandomForest with tuning variables
  • ExtraTrees with tuning variables
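The per-month approach above can be sketched roughly as follows. This is an illustrative sketch, not the original code: the column names and the synthetic data are assumptions, and with only n=16 rows per month, cross-validation gives a more honest error estimate than a single train/test split.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Illustrative data: 192 monthly rows (16 years x 12 months).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "month": np.tile(np.arange(1, 13), 16),
    "precip_departure": rng.normal(0, 1, 192),
})
df["availability"] = 0.004 + 0.001 * df["precip_departure"] + rng.normal(0, 0.0005, 192)

# One model per calendar month, so each model only ever sees n=16 rows.
for month, grp in df.groupby("month"):
    X = grp[["precip_departure"]]
    y = grp["availability"]
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    # With 16 rows, 4-fold CV holds out 4 points per fold; r2 on so few
    # points is noisy and can easily go negative.
    scores = cross_val_score(model, X, y, cv=4, scoring="r2")
    print(month, scores.mean())
```

Negative test-set r2 here is a symptom of the tiny per-month sample, which is one reason to consider pooling the data into a single series.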

Then, to try to improve the model, I most recently employed a time series model (n=192): SARIMAX with tuning parameters (p, d, q)(P, D, Q, 12).

Any advice or resources are greatly appreciated.

Ashley B.
  • (1) How do you calculate accuracy for a numerical target? (2) Can you edit your post to include your data? (3) The "accuracy needed" is not necessarily attainable. [How to know that your machine learning problem is hopeless?](https://stats.stackexchange.com/q/222179/1352) – Stephan Kolassa Apr 11 '19 at 14:40
  • (1) Great point, I was also using r2 of the training and r2 of the testing dataset, and many of my testing r2 are so bad they are not between 0 and 1. (2) I will add example data, yes! (3) Thank you for the resource! – Ashley B. Apr 11 '19 at 14:51

1 Answer

Time series analysis (ARIMA) is often flawed when the original series contains deterministic structure; see @AdamO's wise reflections in Interrupted Time Series Analysis - ARIMAX for High Frequency Biological Data?. The ARIMA model building/identification process, as part of a transfer function analysis, is not one-and-done but rather a sequence of steps to form a useful model, following the paradigm presented in Is it possible to automate time series forecasting?. This is easily extended to include user-specified X series, leading to a transfer function.

In general, your mission is to form a model following the thread (particularly the Tsay article) discussed in Why to use ARMA model as a time series is either over-differenced or under-differenced?, and more broadly here: https://stats.stackexchange.com/search?q=user%3A3382+intervention+detection

Finally, https://autobox.com/pdfs/vegas_ibf_09a.pdf (slide 16) illustrates the flaw of not dealing with pulses, as one can be led down the path of unwarranted power transformations, as with the now-famous AIRLINE SERIES (144 monthly values).

Using lagged explanatory variables to forecast future values of the dependent variable speaks to your issue of how to use pre-specified helping X's and how to identify latent structure reflecting/proxying OMITTED series.

EDITED AFTER RECEIPT OF DATA:

I took your 192 values, partitioned them into 156 and 36, and obtained a weighted MAPE of 21.2%.

The model included a log transform and an ARIMA(1,0,0) component, along with 7 seasonal pulses reflecting a monthly deterministic structure, while incorporating your predictor variable contemporaneously and adjusting for 7 anomalies (pulses).

The residuals from this model suggested randomness.

The Actual/Fit and Forecast graph is below:

[Actual/Fit and Forecast plot]

Finally, a detailed look at the 36-period forecasts vis-à-vis the actuals is below:

[forecast vs. actual comparison]

IrishStat
  • I just found out the data I received for Kansas was previously normalized. Does this mean I should normalize the input as well (precip departure) - could this improve the model? – Ashley B. Apr 17 '19 at 19:29
  • normalization has no effect on model formulation .. it just scales the coefficients – IrishStat Apr 17 '19 at 19:42