Suppose you have a population and some measurement that you could make on each member of the population (e.g. the population could be all the people in the world, and the measurement could be height). One can regard this measurement as a random variable $X$ on the population, with some mean $\mu$ and variance $\sigma^2$; $\mu$ is known, while $\sigma^2$ may or may not be known.
Now suppose you have a subset of the population, a sample of size $N$, and you wish to know whether these people are significantly different from the overall population with respect to this measurement. You can measure them and compute the mean $\bar{x}$ and variance $s^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2$, where the $x_i$ are the individual measurements of the people in your sample. One way to determine the significance of your measurements is the following:
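For concreteness, here is a minimal NumPy sketch of these sample statistics (the height values are made up for illustration); note the $\frac{1}{N}$ divisor, matching the definition above:

```python
import numpy as np

# Hypothetical sample of N = 5 height measurements (values made up)
x = np.array([172.0, 165.5, 180.2, 158.9, 169.4])
N = len(x)

x_bar = x.mean()                      # sample mean \bar{x}
s2 = np.sum((x - x_bar) ** 2) / N     # variance with the 1/N divisor used above
# np.var(x) with its default ddof=0 computes the same quantity
print(x_bar, s2)
```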
Let $X_i \sim_{\mathrm{iid}} X$ for $i = 1, 2, \dots, N$ and let $Y = \frac{1}{N}\sum_{i=1}^{N}X_i$. Estimate a distribution for $Y$. Based on this estimate, compute the probability $P(|Y - E[Y]| > |\bar{x} - E[Y]|)$, and if this probability is smaller than some predetermined threshold (the significance level), reject the null hypothesis (which in this case roughly captures the claim that your sample is not different from the overall population).
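A small sketch of this procedure, assuming for the moment that $Y$ is approximated as normal with mean $\mu$ and variance $\frac{\sigma^2}{N}$ (the function name and the example numbers are hypothetical):

```python
import numpy as np
from scipy import stats

def two_sided_p(x_bar, mu, sigma2, N):
    """P(|Y - E[Y]| > |x_bar - E[Y]|) under the approximation
    Y ~ Normal(mu, sigma2 / N), where E[Y] = mu."""
    se = np.sqrt(sigma2 / N)
    z = abs(x_bar - mu) / se
    return 2 * stats.norm.sf(z)   # sf(z) = 1 - cdf(z); doubled for two tails

# Example with made-up numbers: reject the null at level alpha = 0.05
# exactly when p < alpha.
p = two_sided_p(x_bar=169.2, mu=170.0, sigma2=81.0, N=5)
print(p)
```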
My questions are about the estimated distribution for $Y$. The Central Limit Theorem says that if $N$ is large, then we may treat $Y$ as approximately normally distributed. But if $N$ is small, we're supposed to use Student's t-distribution. [Disclaimer: I'm sure it's more complicated than that, but this is what I'm supposed to teach my students, so I need to know why this might be a reasonable thing to teach them.] So my first (multi-part) question is: What is the conventional cutoff between small $N$ and large $N$, why is that cutoff conventionally accepted, and why wouldn't we just always use Student's t-distribution, even for large $N$?
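This isn't an answer, but a quick SciPy comparison makes the question concrete: the two-sided critical values of $t_{N-1}$ approach the normal ones as $N$ grows, so the choice only matters numerically for small $N$.

```python
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)              # ~1.96, independent of N
for N in (5, 10, 30, 100, 1000):
    t_crit = stats.t.ppf(1 - alpha / 2, df=N - 1)   # t critical value, N-1 dof
    print(f"N={N:5d}  t_crit={t_crit:.4f}  z_crit={z_crit:.4f}")
```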
Once we know what kind of distribution to use, we still need to know its parameters. It's not hard to see that $Y$ has mean $\mu$ and variance $\frac{\sigma^2}{N}$. Now if $\sigma^2$ isn't known, we estimate it by $\hat{s}^2 = \frac{N}{N-1}s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2$. So my next (multi-part) question is: Why precisely the factor $\frac{N}{N-1}$, and is there ever a case where we would use $\hat{s}^2$ even if $\sigma^2$ were known?
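As a sanity check rather than an answer, here is a Monte Carlo sketch (assuming a normal population, with made-up parameter values) comparing the two estimators; $s^2$ comes out low on average by roughly the factor $\frac{N-1}{N}$, which is what the correction compensates for:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, N, trials = 0.0, 4.0, 5, 200_000

# trials independent samples of size N from a Normal(mu, sigma2) population
samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))

s2 = samples.var(axis=1, ddof=0)       # the 1/N estimator s^2
s2_hat = samples.var(axis=1, ddof=1)   # the 1/(N-1) estimator \hat{s}^2

print(s2.mean())      # close to (N-1)/N * sigma2 = 3.2 (biased low)
print(s2_hat.mean())  # close to sigma2 = 4.0 (unbiased)
```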