I have a model $f$ that predicts human performance on a simple perceptual task (performance is quantified as $Y$) as a function of some information about the stimuli ($X$) and parameters $\theta$. The model is nonlinear and deterministic (edit: by deterministic I mean that $Y$ is fully determined given $X$ and $\theta$):
$Y = f(X,\theta)$
I would like to infer the parameters of this model from a dataset I have collected, $(x_1,y_1),\ldots,(x_n,y_n)$. I can think of two ways to do this:
1) Pick the parameters that minimize the sum of squared errors between the predicted and measured performance scores, $\sum_{i=1}^n(\hat y_i-y_i)^2$ (equivalently, the mean squared error).
2) Add a noise term to the model, $Y = f(X,\theta) + N(0,\sigma)$, and maximize the log likelihood of $\theta$. My logic is that with the noise term added, the log likelihood can be computed as the sum of the log probabilities of the residuals $(\hat y_i-y_i)$ under $N(0,\sigma)$. (Both approaches are sketched in code after this list.)
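For concreteness, here is a minimal sketch of what I mean by the two approaches. Everything below is hypothetical: the particular $f$, the simulated data, and the starting values are just stand-ins for my actual model and measurements.

```python
# Sketch of the two fitting approaches (hypothetical model and data).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

def f(x, theta):
    # Hypothetical nonlinear, deterministic model: a saturating function of the stimulus.
    a, b = theta
    return a * (1.0 - np.exp(-b * x))

# Hypothetical data standing in for (x_1, y_1) ... (x_n, y_n).
x = np.linspace(0.1, 5.0, 50)
y = f(x, (0.9, 1.2)) + rng.normal(0.0, 0.05, size=x.size)

# Approach 1: minimize the sum of squared errors.
def sse(theta):
    return np.sum((f(x, theta) - y) ** 2)

fit_lsq = minimize(sse, x0=[0.5, 0.5], method="Nelder-Mead")

# Approach 2: maximize the Gaussian log likelihood of the residuals,
# with sigma treated as a free parameter (optimized on a log scale to keep it positive).
def negloglik(params):
    theta, log_sigma = params[:-1], params[-1]
    resid = y - f(x, theta)
    return -np.sum(norm.logpdf(resid, loc=0.0, scale=np.exp(log_sigma)))

fit_mle = minimize(negloglik, x0=[0.5, 0.5, np.log(0.1)], method="Nelder-Mead")

print("least squares theta:", fit_lsq.x)
print("MLE theta, sigma:   ", fit_mle.x[:-1], np.exp(fit_mle.x[-1]))
```

In the second approach I have simply let the optimizer fit $\sigma$ alongside $\theta$, which is exactly the choice my third question below is about.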
I have three questions:
1) Are either of these methods clearly incorrect?
2) Is one of these methods more correct than the other?
3) If maximizing the log likelihood is the way to go, what is the best way to choose $\sigma$? My intuition is that this parameter shouldn't be fit.
A similar topic has come up before, but it didn't address exactly this question.