I have a machine learning problem with a many-to-one relationship from samples to targets: ~3k samples but only 11 targets, joined on a shared key YEAR.
My first approach was to reshape the targets into a one-to-one relationship, but then I get many samples with the same target value. The attached figure shows this as many vertical lines, and XGBoost seems to struggle with that.
import itertools
import numpy as np
import pandas as pd

# 3 CAT1 x 3 CAT2 x 10 YEARs = 90 samples
dummy_data = pd.DataFrame(
    list(itertools.product(["A", "B", "C"], ["R", "S", "T"], range(2006, 2016))),
    columns=["CAT1", "CAT2", "YEAR"])
dummy_data["F1"] = np.random.rand(90)
dummy_data["F2"] = np.random.rand(90)
dummy_data["F3"] = np.random.rand(90)

# one target value per YEAR (2006-2016)
target = pd.DataFrame(np.random.rand(11), index=range(2006, 2017), columns=["Target"])
target.index.name = "YEAR"

# left merge: every sample of a given YEAR gets the same target value
train = pd.merge(dummy_data, target.reset_index(), how="left", on="YEAR")
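To make the duplication explicit (just an illustrative check on the dummy frames above): every YEAR group contains several samples but only one distinct target value, which is what shows up as the vertical lines in the attached figure.

# 9 samples per YEAR, but only 1 unique target value per YEAR
print(train.groupby("YEAR").size())
print(train.groupby("YEAR")["Target"].nunique())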
Also, I would like to learn how the different CAT1, CAT2 combinations contribute to the target. Is there a better Python library I can use for that?
I've been looking into structured learning and parsimony, but another problem with my data is that the number of CAT1, CAT2 combinations is not consistent.
Another approach was to concatenate all CAT1, CAT2 combinations into one sample per YEAR, but then I have plenty of NaNs for missing values and my feature size becomes incredibly high (> 15k features); a rough sketch of that reshaping is below.
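For illustration only, this is roughly the kind of pivot I mean, shown on the dummy data above (the pivot_table call and column naming are just my sketch, not the real pipeline):

# reshape to one row per YEAR: every (feature, CAT1, CAT2) combination becomes its own column
wide = train.pivot_table(index="YEAR", columns=["CAT1", "CAT2"], values=["F1", "F2", "F3"])
wide.columns = ["_".join(col) for col in wide.columns]  # flatten MultiIndex columns, e.g. "F1_A_R"
wide = wide.join(target)  # one target per row; CAT1/CAT2 combinations absent in a year would become NaNs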
What machine learning approach can I use?