
There is a dataset of the top 1000 streamers on Twitch in 2020. I'm working on a challenge problem: predicting the number of Followers gained from the other features of each channel.

X (the features) contains all columns except Followers gained; y (the target) is the Followers gained column.
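
A minimal sketch of that split, assuming the data sits in a pandas DataFrame and the target column is named exactly `Followers gained`. The tiny frame and the other two column names are made up for illustration and are not the real data:

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the values and the first two
# column names are invented for illustration only.
df = pd.DataFrame({
    "Watch time": [1000, 2500, 400],
    "Followers": [50000, 120000, 8000],
    "Followers gained": [3000, 15000, 500],
})

TARGET = "Followers gained"
X = df.drop(columns=[TARGET])  # every column except the target
y = df[TARGET]                 # the regression target
```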

Broadly, I have tried three approaches so far:

The first approach, the most basic one.

  • After some simple EDA, I trained a neural network with 3 hidden layers to make predictions.
    -> Unsurprisingly, this did not go well; it gave the highest RMSE of all attempts.

The second approach.

  • The dataset contains only 1000 rows, which is too few to train an ML model for accurate regression. So I augmented it with randomly generated artificial rows to around 10000 rows, then trained a neural network with 7 hidden layers to make predictions.
    -> This gave a significantly lower RMSE than the first attempt, but the loss was still too large.

The third approach.

  • I applied a natural log transformation to the numeric values in the dataset, then removed the outliers. Following the same procedure as in the second attempt, I generated random artificial data, this time only around 2500 rows, because the models were bigger and more varied. On this data I trained a neural network with 7 hidden layers and a stacked model containing XGBRegressor, LGBMRegressor, and RandomForestRegressor; the final prediction is the average of the two models' predictions.
    -> I expected this to give a lower RMSE, but it did not: the RMSE was slightly higher than in the second approach.
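
A rough sketch of the "log-transform, fit several models, average the predictions" idea, assuming scikit-learn is available. It is not the exact pipeline described above: the data is synthetic, there is no neural network or outlier removal, and `GradientBoostingRegressor` stands in for the XGBoost and LightGBM models. It also assumes the RMSE is computed after inverting the log transform, so that it is on the same scale as Followers gained:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic, right-skewed data standing in for the real features and target.
X = rng.lognormal(mean=10.0, sigma=1.0, size=(300, 3))
y = 0.05 * X[:, 0] + 0.01 * X[:, 1] + rng.lognormal(8.0, 0.5, size=300)

# log1p compresses the heavy right tail of both the features and the target.
X_log, y_log = np.log1p(X), np.log1p(y)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_log, y_log, test_size=0.2, random_state=0
)

models = [
    RandomForestRegressor(n_estimators=100, random_state=0),
    GradientBoostingRegressor(random_state=0),  # stand-in for XGB/LGBM
]
# Fit each model and average their predictions (still on the log scale).
pred_log = np.mean([m.fit(X_tr, y_tr).predict(X_te) for m in models], axis=0)

# Undo the log transform before scoring, so the RMSE is in original units.
pred = np.expm1(pred_log)
rmse = float(np.sqrt(np.mean((np.expm1(y_te) - pred) ** 2)))
print(f"RMSE on the original scale: {rmse:.1f}")
```

An RMSE computed on the log scale is not comparable with one computed on the raw values, so both approaches need to be scored on the same scale.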

In general, is there a commonly used ML approach for situations where accurate regression predictions must be made from a small amount of data? Or are there any recommendations for what to try next?

Looking over the dataset, the Followers gained values generally have 6 or 7 digits. The best RMSE I have obtained so far has 6 digits, though it is close to a 5-digit value. In short, how can I reduce this RMSE further?
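
For reference, a minimal NumPy sketch of the RMSE metric, which is expressed in the same units as the target. The function name `rmse` and the numbers are made up for illustration and are not taken from the dataset:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Invented 6- and 7-digit targets with errors of 50000, 50000 and 100000.
y_true = [100_000, 250_000, 1_200_000]
y_pred = [150_000, 200_000, 1_100_000]
score = rmse(y_true, y_pred)
print(round(score, 3))  # 70710.678
```

Because the errors are squared before averaging, the single 100000 miss contributes more to the score than the two 50000 misses combined.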

  • The best result from an analysis conducted by someone else has an RMSE of around 81000, while my model's best is around 101000. I do think there is room for improvement, but I have not found it yet. Anyway, thanks for sharing! – Centauri_42 Sep 22 '21 at 09:07
