I have a dataset that looks like this, and I have already tried supervised learning on it.
Author_ID  First_author  Second_author  Total_citations  Year  Articles_per_year  Author_Impact
111        0             0              12               2017  2                  10.591
111        0             0              12               2018  1                  4.743
111        0             0              12               2019  0                  0
222        2             2              44               2017  4                  14.682
222        2             2              44               2018  7                  79.04
222        2             2              44               2019  3                  14.487
I am trying to predict an author's impact for 2020 (one year ahead), using only data from 2017-2019. I have the y variable for 2020, but no independent variables for 2020. I can include y for 2020.
First_author, Second_author, and Total_citations are aggregated over the 3 years, which is why their values are the same in every row for a given author.
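To make the data layout concrete, here is a minimal sketch in plain Python of one way the author-year rows could be collapsed into one row per author. The helper name `to_wide` and the `Articles_<year>` / `Impact_<year>` feature names are made up for illustration, and it only uses the six sample rows shown above.

```python
from collections import defaultdict

# Sample rows copied from the table above:
# (Author_ID, First_author, Second_author, Total_citations,
#  Year, Articles_per_year, Author_Impact)
ROWS = [
    (111, 0, 0, 12, 2017, 2, 10.591),
    (111, 0, 0, 12, 2018, 1, 4.743),
    (111, 0, 0, 12, 2019, 0, 0.0),
    (222, 2, 2, 44, 2017, 4, 14.682),
    (222, 2, 2, 44, 2018, 7, 79.04),
    (222, 2, 2, 44, 2019, 3, 14.487),
]


def to_wide(rows):
    """Hypothetical helper: collapse author-year rows into one row per author."""
    by_author = defaultdict(list)
    for row in rows:
        by_author[row[0]].append(row)

    wide = {}
    for author_id, author_rows in by_author.items():
        # The aggregated columns should be identical across an author's rows.
        static = {r[1:4] for r in author_rows}
        if len(static) != 1:
            raise ValueError(f"aggregated columns vary for author {author_id}")
        first, second, citations = static.pop()
        features = {
            "First_author": first,
            "Second_author": second,
            "Total_citations": citations,
        }
        # Year-varying columns become one feature per year.
        for r in sorted(author_rows, key=lambda r: r[4]):
            year = r[4]
            features[f"Articles_{year}"] = r[5]
            features[f"Impact_{year}"] = r[6]
        wide[author_id] = features
    return wide


wide = to_wide(ROWS)
print(wide[222])
```

In this layout each author is a single observation, so the 2020 impact could be attached as the target of an ordinary cross-sectional regression. Whether that is preferable to a time-series model is part of what I am asking.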
I am considering regression with ARIMA errors (example here) and ARIMAX (but pyflux does not have good evaluation metrics). Hyndman recommends the first option.
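For reference, this is my understanding of the regression with ARIMA errors formulation. The notation is my own and not taken from any particular package: $y_t$ is Author_Impact in year $t$, $x_{1,t}, \dots, x_{k,t}$ are the predictors, $\eta_t$ is the ARIMA error, $B$ is the backshift operator, and $\varepsilon_t$ is white noise.

```latex
y_t = \beta_0 + \beta_1 x_{1,t} + \dots + \beta_k x_{k,t} + \eta_t,
\qquad
\phi(B)\,(1 - B)^d\,\eta_t = \theta(B)\,\varepsilon_t
```

If I read this correctly, a predictor that is constant over $t$ for a given author (as my aggregated columns are) cannot be separated from the intercept $\beta_0$ within that author's own series. It could only carry information across authors.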
Questions:
- Do these models make sense for such data, with only 3 years available to predict the fourth? If not, what else can I try?
- Is including aggregated exogenous variables (with the same aggregated value in every row for an author) the right approach?
- Are there any good tutorials in R/Python for implementing these models? How can I use Hyndman's packages here?