I have a machine learning problem with a many-to-one relationship from samples to targets: ~3k samples but only 11 targets, joined on a shared key YEAR.
My first approach was to reshape the targets into a one-to-one relationship, but then I get many samples with the same target value. The attached figure shows this as many vertical lines, and XGBoost seems to struggle with that.
import itertools
import numpy as np
import pandas as pd

# 3 CAT1 x 3 CAT2 x 10 YEARs = 90 samples
dummy_data = pd.DataFrame(
    list(itertools.product(["A", "B", "C"], ["R", "S", "T"], range(2006, 2016))),
    columns=["CAT1", "CAT2", "YEAR"])
dummy_data["F1"] = np.random.rand(90)
dummy_data["F2"] = np.random.rand(90)
dummy_data["F3"] = np.random.rand(90)

# one target value per YEAR (2006-2016)
target = pd.DataFrame(np.random.rand(11), index=range(2006, 2017), columns=["Target"])
target.index.name = "YEAR"

# left merge: every sample of a given YEAR gets the same target value
train = pd.merge(dummy_data, target.reset_index(), how="left", on="YEAR")
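To make the duplication explicit (just an illustrative check on the dummy frames above): every YEAR group contains several samples but only one distinct target value, which is what shows up as the vertical lines in the attached figure.

# 9 samples per YEAR, but only 1 unique target value per YEAR
print(train.groupby("YEAR").size())
print(train.groupby("YEAR")["Target"].nunique())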
Also, I would like to learn how the different CAT1, CAT2 combinations contribute to the target. Is there a better Python library I can use for that?
I've been looking into structured learning and parsimony, but another problem with my data is that the number of CAT1, CAT2 combinations is not consistent.
Another approach was to concatenate all CAT1, CAT2 combinations into one sample per YEAR, but then I have plenty of NaNs for missing values and my feature size becomes incredibly high (> 15k features); a rough sketch of that reshaping is below.
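For illustration only, this is roughly the kind of pivot I mean, shown on the dummy data above (the pivot_table call and column naming are just my sketch, not the real pipeline):

# reshape to one row per YEAR: every (feature, CAT1, CAT2) combination becomes its own column
wide = train.pivot_table(index="YEAR", columns=["CAT1", "CAT2"], values=["F1", "F2", "F3"])
wide.columns = ["_".join(col) for col in wide.columns]  # flatten MultiIndex columns, e.g. "F1_A_R"
wide = wide.join(target)  # one target per row; CAT1/CAT2 combinations absent in a year would become NaNs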
What machine learning approach can I use?