I have a dataset that looks like the one below, and I have already tried supervised learning on it.

Author_ID   First_author Second_author  Total_citations Year Articles_per_year  Author_Impact
111         0            0              12              2017    2               10.591
111         0            0              12              2018    1               4.743
111         0            0              12              2019    0               0
222         2            2              44              2017    4               14.682
222         2            2              44              2018    7               79.04
222         2            2              44              2019    3               14.487

I am trying to predict each author's impact for 2020 (one year ahead) using only data for 2017-2019. I have the y variable for 2020 but no independent variables for 2020, so the 2020 y values can be included.

First_author, Second_author and Total_citations are aggregated over the 3 years, which is why their values are the same in every row for a given author.
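To make the structure concrete, here is a minimal sketch of the data as a panel: the aggregated columns are constant per author, while Articles_per_year and Author_Impact vary by year. It uses pandas (my assumption, not something I am tied to), and the names `static`, `yearly` and `wide` are only illustrative. It reshapes the sample rows above into one row per author, which is one possible layout and not necessarily the right one for the models mentioned below.

```python
import pandas as pd

# Sample rows copied from the table above.
rows = [
    (111, 0, 0, 12, 2017, 2, 10.591),
    (111, 0, 0, 12, 2018, 1, 4.743),
    (111, 0, 0, 12, 2019, 0, 0.0),
    (222, 2, 2, 44, 2017, 4, 14.682),
    (222, 2, 2, 44, 2018, 7, 79.04),
    (222, 2, 2, 44, 2019, 3, 14.487),
]
columns = ["Author_ID", "First_author", "Second_author", "Total_citations",
           "Year", "Articles_per_year", "Author_Impact"]
df = pd.DataFrame(rows, columns=columns)

# Aggregated columns repeat on every row of an author, so keep one copy.
static_cols = ["First_author", "Second_author", "Total_citations"]
static = df.groupby("Author_ID")[static_cols].first()

# Year-varying columns become one column per (variable, year) pair.
yearly = df.pivot(index="Author_ID", columns="Year",
                  values=["Articles_per_year", "Author_Impact"])
yearly.columns = [f"{name}_{year}" for name, year in yearly.columns]

# One row per author: 3 aggregated columns + 2 variables x 3 years.
wide = static.join(yearly)
print(wide)
```

In this layout the 2020 impact would simply be joined on as one more column per author and used as the target.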

I am considering regression with ARIMA errors (example here) and ARIMAX (although pyflux does not have good evaluation metrics). Hyndman suggests the first option.

Questions:

  • Do these models make sense for such data, with only 3 years available to predict the fourth? If not, what else can I try?
  • Is including aggregated exogenous variables (with the same aggregated value in each row for an author) the right approach?
  • Are there any good tutorials in R/Python for implementing these models? How can I use Hyndman's packages here?