According to IrishStat's answer to this question, the reason to pre-whiten X is to identify a filter that transforms Y and X into y and x, where x is white noise.
Assume that X follows an ARIMA(p, d, q) process:
$\left( 1 - \sum_{i=1}^p \phi_i L^i \right) (1-L)^d X_t = \left( 1 + \sum_{i=1}^q \theta_i L^i \right) \varepsilon_t $
The left-hand side is the ARI(p, d) filter applied to X, and the right-hand side is a linear combination of i.i.d. white-noise terms, which should also be white noise. So by applying the ARI(p, d) filter, we achieve the objective: transforming X into white noise. Is my understanding correct?
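
One way I could check this numerically is with a small simulation. The sketch below is only a minimal NumPy example under my own assumptions: an ARMA(1,1) process with d = 0, illustrative values phi = 0.7 and theta = 0.5, and hypothetical helper names (`simulate_arma11`, `apply_ar_filter`, `lag1_autocorr`) that do not come from the linked answer. It applies the AR filter (1 - phi L) to the simulated X and prints the lag-1 sample autocorrelation of the result. A value near zero would support the white-noise claim; a clearly non-zero value would suggest the moving-average part still has to be handled.

```python
import numpy as np


def simulate_arma11(n, phi, theta, seed=0):
    """Simulate x_t = phi * x_{t-1} + eps_t + theta * eps_{t-1} (assumed model)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n + 1)  # i.i.d. N(0, 1) innovations
    x = np.zeros(n + 1)               # x[0] = 0 is an arbitrary starting value
    for t in range(1, n + 1):
        x[t] = phi * x[t - 1] + eps[t] + theta * eps[t - 1]
    return x[1:]


def apply_ar_filter(x, phi):
    """Apply the AR(1) filter (1 - phi L): returns x_t - phi * x_{t-1}."""
    return x[1:] - phi * x[:-1]


def lag1_autocorr(z):
    """Sample autocorrelation at lag 1."""
    z = z - z.mean()
    return float(np.dot(z[1:], z[:-1]) / np.dot(z, z))


# Illustrative parameter values (assumptions, not estimates from real data).
x = simulate_arma11(20000, phi=0.7, theta=0.5)
u = apply_ar_filter(x, phi=0.7)
print("lag-1 autocorrelation after AR filtering:", lag1_autocorr(u))
```

Setting `theta=0.0` in the same script gives a pure AR(1) series, which is a useful comparison case for the same filter.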