I'm currently working on a predictive modeling project. I have to predict $Y$ given variables $X_1, X_2, X_3$ and $X_4$, which are not necessarily independent. Our first idea was to propose a linear regression model defined as $$Y = \beta_0+\beta_1 X_1 + \beta_2 X_2+ \beta_3 X_3 + \beta_4 X_4.$$
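For concreteness, this is roughly how I fit the model (the file name and column names below are placeholders for my actual data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder: a DataFrame with columns X1, X2, X3, X4 and Y
df = pd.read_csv("data.csv")

# Ordinary least squares fit of Y on X1..X4 with an intercept
model = smf.ols("Y ~ X1 + X2 + X3 + X4", data=df).fit()
print(model.summary())
```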
In my dataset ($10^5$ observations), I have noticed that a lot of the data is 'grouped'. To clarify what I mean by 'grouped': I have pairs of observations $(x_{1i}, x_{2i},x_{3i},x_{4i},y_{i})$ and $(x_{1j},x_{2j},x_{3j},x_{4j},y_{j})$ with $$x_{1i} = x_{1j}, \quad x_{2i} = x_{2j}, \quad x_{3i} = x_{3j}, \quad x_{4i} \neq x_{4j}, \quad y_i \neq y_j,$$
where $1 \leq i,j \leq 10^5$ and $x_{kl}$ denotes the $l$th observation of variable $X_k$, $k \in \{1,2,3,4\}$.
In other words, there are many observations where $X_1$, $X_2$ and $X_3$ coincide while the $X_4$'s and the $Y$'s differ substantially. After fitting the model, the predictive performance was very poor. I believe this 'grouped' data strongly affects the goodness of fit, since the model tries to accommodate all of these conflicting observations at once, which leads to overfitting.
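To illustrate, this is roughly how I checked for these groups (again with placeholder column names): within many combinations of $(X_1, X_2, X_3)$ the responses $Y$ still vary considerably.

```python
# Group observations that share the same (X1, X2, X3) triple
groups = df.groupby(["X1", "X2", "X3"])["Y"]

# Count observations and measure the spread of Y within each group
summary = groups.agg(n="size", y_std="std")

# Groups with more than one observation but non-trivial variation in Y
print(summary[(summary["n"] > 1) & (summary["y_std"] > 0)].head())
```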
Is there a standard way to deal with this kind of data?
Thanks in advance!