Cameron and Trivedi (2005) (*Microeconometrics: Methods and Applications*, p. 834) give a very informative description of the variance estimator for a linear model with clustered errors:
$$\widehat{\text{var}} \left( \hat{\beta} \right)_{\text{cluster}} = \left( \sum_{c=1}^C x'_cx_c \right)^{-1} \left[ \sum_{c = 1}^C \sum_{j = 1}^{N_c} \sum_{k = 1}^{N_c} \hat{u}_{jc} \hat{u}_{kc} x_{jc} x'_{kc} \right] \left( \sum_{c=1}^C x'_cx_c \right)^{-1}$$
Here there are $C$ clusters, the $c$-th containing $N_c$ observations, and $x_c$ is the $N_c \times K$ matrix of regressors for cluster $c$ (so $\sum_c x'_c x_c = X'X$). $\hat{u}_{jc}$ is the residual of observation $j$ in cluster $c$, and $x_{jc}$ is the vector of regressor values for that observation.
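A useful way to see the structure of the middle term: the inner double sum for a given cluster is just the outer product of that cluster's score vector with itself, $ \sum_{j} \sum_{k} \hat{u}_{jc} \hat{u}_{kc} x_{jc} x'_{kc} = \left( \sum_{j} \hat{u}_{jc} x_{jc} \right) \left( \sum_{k} \hat{u}_{kc} x_{kc} \right)' $. Here is a minimal numpy sketch of the formula; the function name and interface are my own invention, not from Cameron and Trivedi or any particular package:

```python
import numpy as np

def cluster_robust_vcov(X, resid, cluster_ids):
    """Cluster-robust sandwich estimator for OLS, per the formula above.

    X           : (N, K) regressor matrix
    resid       : (N,) OLS residuals u-hat
    cluster_ids : (N,) cluster label for each observation
    """
    K = X.shape[1]
    bread = np.linalg.inv(X.T @ X)           # (sum_c x_c' x_c)^{-1} = (X'X)^{-1}
    meat = np.zeros((K, K))
    for c in np.unique(cluster_ids):
        in_c = cluster_ids == c
        score_c = X[in_c].T @ resid[in_c]    # sum_j u_jc * x_jc, a K-vector
        meat += np.outer(score_c, score_c)   # equals the double sum over j and k
    return bread @ meat @ bread
```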
Now compare this to the Huber-White (HW) robust variance estimator, which does not account for clustering:
$$\widehat{\text{var}} \left( \hat{\beta} \right)_{HW} = (X'X)^{-1} \left[ \sum_{i=1}^N \hat{u}_i^2 x_ix_i' \right] (X'X)^{-1}$$
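And a matching sketch for the HW form (the HC0 variant, with no small-sample correction); same caveat that this is a bare-bones illustration rather than production code:

```python
import numpy as np

def hw_robust_vcov(X, resid):
    """Huber-White (HC0) sandwich estimator for OLS."""
    bread = np.linalg.inv(X.T @ X)
    meat = X.T @ (resid[:, None] ** 2 * X)   # sum_i u_i^2 x_i x_i' = X' diag(u^2) X
    return bread @ meat @ bread
```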
There are several instructive observations that relate to your situation:
- Note that if each observation were its own cluster ($N_c = 1$ for every $c$), then the cluster estimator would be identical to the HW estimator (the simulation sketch after this list checks this numerically).
- Focus on the middle term (the three nested summations) in the cluster estimator. Wherever $ j = k $ in the inner two summations, a residual gets multiplied by itself, which is always nonnegative. So we always retain the standard sum of $ \hat{u}_{jc}^2 x_{jc}x'_{jc} $ terms that appears in the basic HW estimator.
- If the errors within a cluster are indeed independent of each other, then we should have $ \mathbf{E} \left[ \hat{u}_{jc} \hat{u}_{kc} \right] = 0 $ for $ j \neq k $, since independent zero-mean errors are uncorrelated.
- Likewise we'd then have $ \mathbf{E} \left[ \hat{u}_{jc} \hat{u}_{kc} x_{jc} x'_{kc} \right] = 0 $ for $ j \neq k $, since the $ x $ values are treated as fixed (non-random).
- Thus, if all of the observations within a cluster are indeed completely independent of each other, then this estimator will be identical (at least asymptotically) to the basic HW estimator.
- But, to the extent that we have $ \mathbf{E} \left[ \hat{u}_{jc} \hat{u}_{kc} \right] \neq 0 $ for $ j \neq k $ within the cluster, then the two will differ.
- Generally speaking, when the two differ it will be because $ \mathbf{E} \left[ \hat{u}_{jc} \hat{u}_{kc} \right] > 0 $, i.e. the errors are positively correlated within the cluster. In this case the cluster estimator will be greater than the HW estimator, because the cross terms in the inner summation add positive contributions (illustrated in the sketch after this list).
- But it is in theory possible to have $ \mathbf{E} \left[ \hat{u}_{jc} \hat{u}_{kc} \right] < 0 $, i.e. negative correlation of errors within a cluster. In that case the cluster variance estimator would be less than the basic HW one.
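To illustrate these points, here is a small simulation sketch reusing the two functions above (the data-generating process and all numbers are invented purely for illustration). A shared cluster-level shock in both the regressor and the error makes observations positively correlated within clusters, so the cluster SEs come out larger than the HW SEs, while assigning every observation to its own cluster reproduces the HW numbers exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
C, Nc = 200, 5                                 # 200 clusters of 5 observations each
N = C * Nc
ids = np.repeat(np.arange(C), Nc)

# Both the regressor and the error contain a cluster-level component,
# so errors are positively correlated within clusters.
x = np.repeat(rng.normal(size=C), Nc) + 0.5 * rng.normal(size=N)
u = np.repeat(rng.normal(size=C), Nc) + rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = X @ np.array([1.0, 2.0]) + u

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

se_hw = np.sqrt(np.diag(hw_robust_vcov(X, resid)))
se_cl = np.sqrt(np.diag(cluster_robust_vcov(X, resid, ids)))
se_1 = np.sqrt(np.diag(cluster_robust_vcov(X, resid, np.arange(N))))

print(se_cl > se_hw)              # cluster SEs larger under positive correlation
print(np.allclose(se_1, se_hw))   # True: singleton clusters reproduce HW exactly
```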
So, what does this mean for your situation in particular? As one of the comments to your question noted, clustering at the larger level is more conservative (and likely better). It allows for the possibility of correlation between the errors of observations within a given country, since the residual of each observation in a country gets multiplied by the residuals of all the other observations in that country.

To the extent that the residuals within a country actually are independent of each other (e.g. the residuals from two different regions in a country), we'll have, at least in expectation, $ \mathbf{E} \left[ \hat{u}_{jc} \hat{u}_{kc} \right] = 0 $ for $ j \neq k $, and nothing is lost. But to the extent that there actually is correlation across regions in a country, you'll have $ \mathbf{E} \left[ \hat{u}_{jc} \hat{u}_{kc} \right] \neq 0 $ for $ j \neq k $. In that case, those terms add to the variance of your estimator, but that is the appropriate thing in such a situation: if your observations and errors are correlated, then you don't actually have as many independent observations as your raw sample size would indicate. As they say, the innocent (i.e. genuinely independent observations) have nothing to hide.