There are several issues here.
(1) The model needs to be explicitly probabilistic. In almost all cases there will be no set of parameters for which the lhs matches the rhs for all your data: there will be residuals. You need to make assumptions about those residuals. Do you expect them to be zero on the average? To be symmetrically distributed? To be approximately normally distributed?
Here are two models that agree with the one specified but allow drastically different residual behavior (and therefore will typically result in different parameter estimates). You can vary these models by varying assumptions about the joint distribution of the $\epsilon_{i}$:
$$\text{A:}\ y_{i} =\beta_{0} \exp{\left(\beta_{1}x_{1i}+\ldots+\beta_{k}x_{ki} + \epsilon_{i}\right)}$$
$$\text{B:}\ y_{i} =\beta_{0} \exp{\left(\beta_{1}x_{1i}+\ldots+\beta_{k}x_{ki}\right)} + \epsilon_{i}.$$
(Note that these are models for the data $y_i$; there usually is no such thing as an estimated data value $\hat{y}_i$.)
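To make the difference concrete, here is a minimal simulation sketch in Python (the single regressor and all coefficient values are made-up assumptions, purely for illustration): under (A) the spread of $y$ about its conditional mean grows with that mean, while under (B) it stays fixed at the error standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(0, 3, size=n)              # one regressor, for simplicity
beta0, beta1, sigma = 0.5, 0.8, 0.5        # hypothetical parameter values

eps = rng.normal(0, sigma, size=n)
y_A = beta0 * np.exp(beta1 * x + eps)      # model A: error inside the exponential
y_B = beta0 * np.exp(beta1 * x) + eps      # model B: additive error

mu = beta0 * np.exp(beta1 * x)             # the "signal" common to both models
resid_A = y_A - mu * np.exp(sigma**2 / 2)  # E[exp(eps)] = exp(sigma^2/2) for Normal eps
resid_B = y_B - mu                         # under model B this is exactly eps

lo, hi = x < 1, x > 2
print("A: residual sd at low x vs high x:", resid_A[lo].std(), resid_A[hi].std())
print("B: residual sd at low x vs high x:", resid_B[lo].std(), resid_B[hi].std())
```

A fitting procedure tuned to one error structure will mis-weight data generated by the other, which is why the two models typically lead to different parameter estimates.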
(2) The need to handle zero values for the $y$'s implies the stated model (A) is both wrong and inadequate: because the exponential is strictly positive, $\beta_0 \exp(\cdots)$ can equal zero only when $\beta_0 = 0$, which would force every $y_i$ to be zero, no matter what the random error equals. The second model above (B) does allow zero (or even negative) values of the $y$'s. However, one should not choose a model solely on such a basis. To reiterate #1: it is important to model the errors reasonably well.
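The same hypothetical setup as before (repeated so the snippet stands alone) confirms this: (A) stays strictly positive for every realization of the error, while (B) readily produces zeros and negative values.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=100_000)
eps = rng.normal(0, 0.5, size=x.size)      # same made-up parameters as above

y_A = 0.5 * np.exp(0.8 * x + eps)          # model A
y_B = 0.5 * np.exp(0.8 * x) + eps          # model B

print("model A, smallest simulated y:", y_A.min())          # strictly positive, always
print("model B, share of y <= 0:     ", (y_B <= 0).mean())  # zeros and negatives occur
```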
(3) Linearization changes the model. Typically, it results in models like (A) but not like (B). It is used by people who have analyzed their data enough to know this change will not appreciably affect the parameter estimates, and by people who are ignorant of what is happening. (It is often hard to tell the difference.)
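To see how much this can matter, here is a hedged sketch (made-up parameters again, with data generated from model (B)): it estimates the coefficients once by ordinary least squares on $\log y$ (the linearized model, which implicitly assumes (A)), and once by nonlinear least squares on the original scale. The two sets of estimates generally disagree, noticeably so when the error is not small relative to the signal.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)
n = 2_000
x = rng.uniform(0, 2, size=n)
beta0, beta1, sigma = 2.0, 1.0, 1.0        # hypothetical true values

y = beta0 * np.exp(beta1 * x) + rng.normal(0, sigma, size=n)   # data from model (B)

keep = y > 0                               # the log transform forces us to drop y <= 0
x, y = x[keep], y[keep]

# 1) Linearized fit: regress log(y) on x by OLS, which presumes model (A).
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print("linearized OLS:          beta0 =", np.exp(coef[0]), " beta1 =", coef[1])

# 2) Nonlinear least squares on the original scale, which presumes model (B).
popt, _ = curve_fit(lambda t, b0, b1: b0 * np.exp(b1 * t), x, y, p0=[1.0, 0.5])
print("nonlinear least squares: beta0 =", popt[0], " beta1 =", popt[1])
```

(Note, too, that the linearized fit had to discard any non-positive $y$'s, which ties back to issue (2) and is itself a source of bias.)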
(4) A common way to handle the possibility of a zero value is to propose that $y$ (or some re-expression thereof, such as the square root) has a strictly positive chance of equaling zero. Mathematically, we are mixing a point mass (a "delta function") in with some other distribution. These models look like this:
$$\begin{aligned}
f(y_i) &\sim F(\mathbf{\theta}); \\
\theta_j &= \beta_{j0} + \beta_{j1} x_{1i} + \cdots + \beta_{jk} x_{ki}
\end{aligned}$$
where $\Pr_{F_\theta}[f(Y) = 0] = \theta_{j+1} \gt 0$ is one of the parameters implicit in the vector $\mathbf{\theta}$, $F$ is some family of distributions parameterized by $\theta_1, \ldots, \theta_j$, and $f$ is the re-expression of the $y$'s (the "link" function of a generalized linear model: see onestop's reply). (Of course, then, $\Pr_{F_\theta}[f(Y) \le t] = (1 - \theta_{j+1})F_\theta(t)$ when $t \ne 0$.) Examples are the zero-inflated Poisson and negative binomial models.
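For instance, a zero-inflated Poisson regression of exactly this mixture form can be simulated and fit in a few lines. The sketch below leans on statsmodels' `ZeroInflatedPoisson`; the coefficient values, the mixing probability, and the intercept-only inflation model are illustrative assumptions, not anything specified above.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(3)
n = 5_000
x = rng.uniform(0, 1, size=n)
X = sm.add_constant(x)                      # count-model design: intercept and x

lam = np.exp(0.5 + 1.2 * x)                 # Poisson mean on the log-link scale
counts = rng.poisson(lam)
counts[rng.uniform(size=n) < 0.3] = 0       # mix in a point mass at zero (probability 0.3)

# Logit model for the extra zeros (intercept only) plus a Poisson regression for the counts.
zip_model = ZeroInflatedPoisson(counts, X, exog_infl=np.ones((n, 1)), inflation='logit')
zip_fit = zip_model.fit(method='bfgs', maxiter=500, disp=False)
print(zip_fit.params)   # inflation intercept (on the logit scale), then the Poisson coefficients
```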
(5) The issues of constructing a model and fitting it are related but different. As a simple example, even an ordinary regression model $Y = \beta_0 + \beta_1 X + \epsilon$ can be fit in many ways: by means of least squares (which, for Normal errors, gives the same parameter estimates as Maximum Likelihood and almost the same standard errors), iteratively reweighted least squares, various other forms of "robust least squares," etc. The choice of fitting method is often based on convenience, expedience (e.g., availability of software), familiarity, habit, or convention, but at least some thought should be given to what is appropriate for the assumed distribution of the error terms $\epsilon_i$, to what the loss function for the problem might reasonably be, and to the possibility of exploiting additional information (such as a prior distribution for the parameters).
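As a small illustration, here is one and the same straight-line model fit three ways with statsmodels (the heavy-tailed error distribution is an assumption chosen purely to make the methods disagree): ordinary least squares, Huber's M-estimator computed by iteratively reweighted least squares, and least absolute deviations.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.standard_t(df=2, size=n)       # heavy-tailed errors, for contrast
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()                               # least squares (= ML under Normal errors)
huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit() # Huber M-estimator via IRLS
lad = sm.QuantReg(y, X).fit(q=0.5)                     # least absolute deviations (median fit)

print("OLS:  ", ols.params)
print("Huber:", huber.params)
print("LAD:  ", lad.params)
```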