
If I have a Weibull distribution with a PDF of: $P_{\lambda ,k}(x_i) = \frac{k}{\lambda}\left(\frac{x_i}{\lambda}\right)^{k-1}e^{-\left(\frac{x_i}{\lambda}\right)^k}$ for $x_i > 0$

and an MSE of:

$J_{\lambda ,k}(x) = \frac{1}{2m}\sum^{m}_{i = 1}\left(P_{\lambda ,k}(x_i) - y_i\right)^2$

I was wondering if my $x_i$ values should be the data I have and $y_i$ should be given by $y_i = \frac{x_i}{m}$. Am I thinking about this correctly?

I need to know these values for the MSE function so that I can try to optimize the parameters $k, \lambda$.

rmiller415
  • 1
    Unless you have strong evidence that the distribution is actually Weibull, why not just test an entire collection of PDFs and see which one works best? There are routines that do just that. If you don't find one that you are comfortable with, you might just use an empirical distribution. Why use machine learning? What relationship does that have to regression or multiple other choices? – Carl Nov 04 '21 at 18:02
  • I did that, but this is for a project that I need to use an ML algorithm for. Otherwise I would just generate a vector of lambdas and k's, find the best fit, and maybe do some fine-tuning, but that isn't the point of the project I am working on. – rmiller415 Nov 04 '21 at 19:07
  • ML can be used for regression. It is not clear to me how you want to use ML and why you need machine learning to do so. Please clarify enough so that the reader does not have to guess which of a very large list of possible goals you are aiming for. – Carl Nov 04 '21 at 20:10
  • Thanks for the advice, I've removed the extraneous information on the question. I'm really just asking if my reasoning for $x_i$ to $y_i$ is correct. – rmiller415 Nov 04 '21 at 21:09
  • _"I was wondering if my $x_i$ values should be the data I have and $y_i$ should be given by $y_i = \frac{x_i}{m}$"_ : For what purpose? Without stating it, there is no way of answering. – Firebug Nov 04 '21 at 21:28
  • I guess I'm being really bad at communicating today, sorry for that. I have an MSE function for the distribution that I would like to optimize and I need the values $(x_i,y_i)$ so that I can find the best parameters. – rmiller415 Nov 04 '21 at 22:09
  • Mean squared error is usually an inappropriate way to assess the fit of a distribution to data. Are you really trying to ask about effective ways to estimate Weibull parameters based on a random sample from a distribution you suspect to be Weibull? – whuber Nov 04 '21 at 22:19
  • Yes, that's effectively what I am doing. Is there a more suitable way to determine fitness of data? – rmiller415 Nov 04 '21 at 22:34
  • There are several in common use, of which perhaps the best known (and very powerful) is Maximum Likelihood. [Searching our site](https://stats.stackexchange.com/search?tab=votes&q=weibull%20maximum%20likelihood) turns up explanations, software, and more. One of the hits, https://stats.stackexchange.com/questions/8960, looks exactly like what you are trying to ask. – whuber Nov 04 '21 at 23:47
  • @whuber Doesn't that depend on what the loss function is? Actually I would use root mean squared error for $\frac{1}{y^2}$ weighted least squares, but certainly not OLS. I think it depends on why one uses regression, which makes it, as I said, critical to know what the goal of the investigation is before commenting. – Carl Nov 04 '21 at 23:49
  • @Carl This question appears to be one about distribution fitting rather than regression. The negative log likelihood is the loss for MLE. – whuber Nov 05 '21 at 00:36
  • 1
    @whuber If it is an rv, and if it is not right truncated, then OK. Otherwise, no. I would really like to hear what it is from the OP, wouldn't you? – Carl Nov 05 '21 at 01:14
  • @Carl I asked for clarification above and received it: please review the comment thread. – whuber Nov 05 '21 at 14:06
  • @whuber I did and do not see what $x_i$ measures explicitly enough to venture a guess as to whether ML usage would be physically correct in this case. It may be that this is infrequently an issue for you, but each of us may have arrived at our respective opinions honestly. – Carl Nov 05 '21 at 18:33
  • The values of x are accelerations of vehicles. If I randomly selected a vehicle, then y would be the probability of selecting a vehicle with acceleration x. – rmiller415 Nov 08 '21 at 15:06
  • @whuber I think what they did in that post is pretty much what I have coded in Python at the moment, and I was going to use a gradient descent method to find the optimal parameters. To do that I just need to know if I can take my x_i and divide by the total number of data points to get my y_i. – rmiller415 Nov 08 '21 at 20:15
  • That would be incorrect: the fitting must be done on the basis of the $x_i.$ It's a relatively easy numerical problem, because $\lambda$ (a scale variable) is easily estimated for any candidate value of the shape parameter $k.$ We have several threads that discuss various different approaches. An approach close to yours is presented at https://stats.stackexchange.com/questions/230109. – whuber Nov 08 '21 at 20:55
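
The last comment's point, that $\lambda$ is easy to estimate for any candidate $k$ and that the fit uses only the $x_i$, can be sketched as follows. This is a minimal illustration, not code from the thread or from the linked posts: the function names `weibull_nll` and `weibull_mle` are hypothetical, the data are assumed to be strictly positive and not all equal, and bisection is just one simple way to solve the one-dimensional equation in $k$. It relies on the standard result that, for fixed $k$, the likelihood is maximized at $\lambda^k = \frac{1}{m}\sum_i x_i^k$.

```python
import math


def weibull_nll(xs, k, lam):
    """Negative log-likelihood of a Weibull(shape=k, scale=lam) sample."""
    m = len(xs)
    sum_log = sum(math.log(x) for x in xs)
    log_lik = (m * math.log(k) - m * k * math.log(lam)
               + (k - 1.0) * sum_log
               - sum((x / lam) ** k for x in xs))
    return -log_lik


def weibull_mle(xs, tol=1e-10, max_iter=200):
    """Maximum-likelihood (k, lam) via the profile likelihood in k.

    Assumes every x > 0 and that the sample is not constant.
    """
    if min(xs) <= 0:
        raise ValueError("all observations must be positive")
    if min(xs) == max(xs):
        raise ValueError("sample must contain at least two distinct values")

    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / len(logs)
    log_max = max(logs)

    def weights(k):
        # x_i^k divided by max(x)^k, so large k cannot overflow.
        return [math.exp(k * (lx - log_max)) for lx in logs]

    def score(k):
        # Derivative condition in k after substituting lam^k = mean(x_i^k):
        # sum(x^k ln x) / sum(x^k) - 1/k - mean(ln x) = 0, increasing in k.
        w = weights(k)
        weighted = sum(wi * lx for wi, lx in zip(w, logs)) / sum(w)
        return weighted - 1.0 / k - mean_log

    # Bracket the root, then bisect.
    lo, hi = 1e-6, 1.0
    while score(hi) < 0:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    k = 0.5 * (lo + hi)

    # Closed-form scale for this k, computed in log space.
    w = weights(k)
    lam = math.exp(log_max + math.log(sum(w) / len(w)) / k)
    return k, lam
```

Calling `weibull_mle(accelerations)` on a list of positive observations returns the pair `(k, lam)`; no $y_i$ values are needed. If a gradient-based optimizer is required for the project, `weibull_nll` is the loss it would minimize in place of the MSE.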

0 Answers