I have a dataset with approximately 4000 rows and 150 columns. I want to predict the values of a single column (= target).
The data describes cities (demographic, social, economic, ... indicators). Many of these indicators are highly correlated, so I want to run a principal component analysis (PCA).
The problem is that about 40% of the values are missing.
My current approach is: Remove the target indicator and run PCA with mean/median imputation of the missing values. Select x principal components (PCs). Append the target indicator to these PCs. Use the PCs as predictors for the target variable and try common regression techniques, e.g. k-NN, linear regression, random forest.
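For concreteness, here is a minimal sketch of that workflow, assuming Python with NumPy and scikit-learn. The synthetic data, the choice of column 0 as target, the standardization step, the 5 retained components and the 75/25 split are all illustrative assumptions, not part of the real dataset. The sketch also fits the imputer and the PCA on the training rows only (via a pipeline), which may differ from running PCA once on the full table.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the city data (assumption): 400 rows, 20 correlated
# indicators driven by 3 latent factors, with about 40% of entries missing.
n_rows, n_cols, n_latent = 400, 20, 3
latent = rng.normal(size=(n_rows, n_latent))
loadings = rng.normal(size=(n_latent, n_cols))
data = latent @ loadings + 0.1 * rng.normal(size=(n_rows, n_cols)) + 10.0
data[rng.random(data.shape) < 0.4] = np.nan

# Separate the target indicator from the predictors.
target_col = 0  # hypothetical choice of target column
y = data[:, target_col]
X = np.delete(data, target_col, axis=1)

# Train and evaluate only on rows where the target is actually known.
known = ~np.isnan(y)
X_train, X_test, y_train, y_test = train_test_split(
    X[known], y[known], test_size=0.25, random_state=0
)

# Mean imputation -> standardization (assumption) -> PCA -> regression.
model = make_pipeline(
    SimpleImputer(strategy="mean"),  # use strategy="median" for median imputation
    StandardScaler(),
    PCA(n_components=5),             # "x" principal components; 5 is arbitrary
    LinearRegression(),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

Swapping `LinearRegression()` for `KNeighborsRegressor()` or `RandomForestRegressor()` would cover the other regression techniques mentioned, with the rest of the pipeline unchanged.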
With this approach, I'm getting quite good results. My metric is RMSE%, the root mean squared relative prediction error. I tried this with each column of the dataset as the target in turn, and the RMSE% is between 0.5% and 8% (depending on the column). These errors are computed only on values I actually know, NOT on imputed values.
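To make the metric explicit, here is a small sketch of RMSE% under one plausible reading: the square root of the mean squared relative error, expressed in percent, and evaluated only where the true value is known. The function name `rmse_percent` and the toy numbers are assumptions for illustration, and the definition presumes the true values are nonzero.

```python
import numpy as np

def rmse_percent(y_true, y_pred):
    """Root mean squared relative error in percent, over known targets only."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    known = ~np.isnan(y_true)  # skip rows whose true value is missing
    rel_err = (y_pred[known] - y_true[known]) / y_true[known]
    return 100.0 * np.sqrt(np.mean(rel_err ** 2))

# Toy example: the third true value is missing and is therefore ignored.
y_true = np.array([100.0, 200.0, np.nan, 50.0])
y_pred = np.array([110.0, 190.0, 123.0, 50.0])
score = rmse_percent(y_true, y_pred)  # roughly 6.45
```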
So, here's my problem: I'm not sure how much my data is distorted by replacing the missing values with the column mean/median. Is there any other way of imputing the missing values with minimal effect on the PCA results?