
I ran an artificial-data experiment using y = a + a*x1 + a*x2 + e, where x1 is generated from a Normal distribution and e is generated from either a Cauchy or a Normal distribution. The models I want to compare are ANN and SVM. When the Cauchy is used as the disturbance in the artificial data, the models tend to produce very high RMSE. Why does that happen?
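For reference, a minimal sketch of the setup described above (the coefficient value, sample size, and seed are assumptions, and ordinary least squares stands in for the ANN/SVM fit):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a = 1.0  # assumed value for the single coefficient in y = a + a*x1 + a*x2 + e

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

def rmse_with_noise(e):
    """Fit the linear model by least squares and return the in-sample RMSE."""
    y = a + a * x1 + a * x2 + e
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sqrt(np.mean((y - X @ beta) ** 2))

rmse_normal = rmse_with_noise(rng.normal(size=n))
rmse_cauchy = rmse_with_noise(rng.standard_cauchy(size=n))
print(rmse_normal, rmse_cauchy)  # Cauchy disturbances typically give far larger RMSE
```

The RMSE under Cauchy noise is driven by the extreme draws in the tails, which is the behaviour the question is about.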

bbadyalina
  • Answering with a question: why did you choose Cauchy at the first place..? – Tim Jan 21 '17 at 10:55
  • Because from my real data, the regression model residual fit with Cauchy distribution – bbadyalina Jan 21 '17 at 11:03
  • 1
    Possible duplicate of [Model fitting when errors take a Cauchy distribution](http://stats.stackexchange.com/questions/68596/model-fitting-when-errors-take-a-cauchy-distribution) – Tim Jan 21 '17 at 11:15
  • 1
    [Read more about Cauchy distribution](https://stats.stackexchange.com/questions/188126/why-do-i-get-worse-regression-metrics-when-i-add-more-instances-to-the-problem/188132#188132) it has undefined mean, so *if* your data followed Cauchy distribution then classical regression would be a hopeless approach, this is why your simulation returns huge errors. – Tim Jan 21 '17 at 11:15
  • Can you explain a tiny bit about the Cauchy distribution? I mean from the perspective of the random numbers a Cauchy distribution generates. – bbadyalina Jan 21 '17 at 11:17
  • Have you checked both links? What is unclear to you? The second link provides a worked example. – Tim Jan 21 '17 at 11:19
  • Yes, I read about the heavy tail, but I am not clear about it. So if it is a heavy-tailed distribution, will the minimum and maximum of generated Cauchy random numbers differ by a very large amount? Correct me if I am wrong. – bbadyalina Jan 21 '17 at 11:35
  • One more thing, Tim: this applies to all models, not only linear regression. In my case study I used an ANN. – bbadyalina Jan 21 '17 at 11:47
  • I don't understand your question, but the main point is that Cauchy distribution has no mean (see the example in the second link), it's not only that it has fat tails. If you do not understand this distribution, then I wouldn't recommend using it in your simulation, since it won't help you understand the problem better if you don't understand what you've simulated... – Tim Jan 21 '17 at 11:58
  • Sorry for my lack of understanding. I am modelling extreme values of flood quantiles, where both the maximum and minimum observations are very far from the mean. I fit a linear regression, and the residuals fit a Cauchy distribution. So I want to simulate this problem to see which model can handle data that exhibit a Cauchy distribution. – bbadyalina Jan 21 '17 at 13:23

1 Answer


If memory serves, flood data are known to follow a Cauchy law; I am working from memory of a book by either Mandelbrot or Sornette. You cannot use SVM and you probably cannot use an ANN. The problem is that if you have a Cauchy likelihood function, or density function if you are using a null-hypothesis method, then $$\lim_{n\to\infty}RMSE(n)=\infty.$$

Your precision falls as your data grow. The reason is that a sample statistic should point to, or in nicer mathematical terms converge to, a population parameter. Here there is no population parameter to converge to. In least-squares-style models, the slope estimator is a variant of the sample mean. Because of this, SVM is not usable.

A different problem exists for neural networks. Many neural networks use sums and means explicitly or implicitly. Some of these work out okay because of the transformations that happen in the process, but some will not. If you decide to use a neural network, then you really will want to perform formal verification and validation of your model as a mathematical construction. Even if you find one that "works" with the data, that does not mean it really "works," or will work out of sample. You will have to go at this through first principles.

The problem with sums, averages, and by inheritance, slopes, can be seen in the distribution of the sample mean of the standard Cauchy distribution. If $$p(x)=\frac{1}{\pi}\frac{1}{1+x^2},$$ then the sum of $n$ Cauchy random variates is $$S_n=\sum_{i=1}^nx_i.$$ The sample mean is simply $$\bar{x}=\frac{S_n}{n}.$$

The sampling distribution of the sample mean can be found through its characteristic function, which is $$\phi(\tau)=e^{-|\tau|}.$$

The characteristic function of $n$ independent draws is the product of their individual characteristic functions, resulting in a joint characteristic function of $$\phi(\tau)^n.$$ When you invert the process, you get a sampling distribution for the sum of $$p(S_n)=\frac{1}{2\pi}\int_{-\infty}^\infty\exp(-iS_n\tau-n|\tau|)\mathrm{d}\tau.$$ This in turn is resolved as $$p(S_n)=\frac{1}{\pi}\frac{n}{n^2+S_n^2}.$$
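The inversion above can be checked by simulation: the sum of $n$ standard Cauchy draws should itself be Cauchy with scale $n$, so its quartiles sit near $\pm n$. A sketch (the sample size, replication count, and seed here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 100_000

# Sums of n standard Cauchy draws, replicated many times
s = rng.standard_cauchy(size=(reps, n)).sum(axis=1)

# p(S_n) = (1/pi) * n / (n^2 + S_n^2) is a Cauchy density with scale n,
# whose quartiles are at -n and +n
q25, q75 = np.percentile(s, [25, 75])
print(q25, q75)  # close to -50 and +50
```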

The solution for the sample mean is $$p(\bar{x})=p(S_n)\frac{\mathrm{d}S_n}{\mathrm{d}\bar{x}}=\frac{1}{\pi}\frac{1}{1+\bar{x}^2}.$$ Notice that your sampling distribution for $\bar{x}$ is not dependent upon $n$. Adding data does not improve your information, unlike the sampling distribution of the Gaussian sample mean which improves at a rate of $\sqrt{n}$. Likewise, sums worsen their precision at a rate of $n$.
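This non-convergence is easy to see numerically. A sketch comparing the spread (interquartile range, which exists even when the variance does not) of sample means under Gaussian and Cauchy sampling, with replication count and seed chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
reps = 2000

def iqr_of_means(sampler, n):
    """Interquartile range of `reps` sample means, each of size n."""
    means = sampler(size=(reps, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

for n in (10, 1000):
    g = iqr_of_means(rng.normal, n)
    c = iqr_of_means(rng.standard_cauchy, n)
    print(n, g, c)
# The Gaussian IQR shrinks like 1/sqrt(n); the Cauchy IQR stays near the
# IQR of a single standard Cauchy draw (= 2), independent of n.
```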

Fortunately, there is a known solution. For a simple linear regression, the Bayesian posterior is $$\Pr(\beta,\alpha,\gamma|(x_1,y_1)\dots(x_i,y_i))\propto\prod_{i=1}^n\frac{\gamma}{\gamma^2+(y_i-\beta{x_i}-\alpha)^2},\forall(\beta,\alpha,\gamma)\in\Theta,$$ where $\Theta$ is the parameter space. If you put a proper prior distribution over $\{\beta,\alpha,\gamma\},$ then you are assured that a valid decision rule can be constructed.

If your goal is predictive, rather than inferential, then you can show that no admissible Frequentist solution exists, though an admissible maximum likelihood solution may exist. If your goal is inferential and your null is a sharp null hypothesis, that is $\beta=k,$ for example, as opposed to $\beta\le{k},$ then you can have a valid Frequentist or Likelihoodist solution because you can condition on $\gamma.$ If you have a binary hypothesis, but it is not sharp, and you either intend to form a prediction or need to assure the validity of your decision rule, then you should use the Bayesian method above.

If you do choose to use the above method with a good, proper prior, then I suggest using a Metropolis–Hastings algorithm, although if you only have $\beta,\alpha,\gamma$ and no more dimensions, you really could get away with acceptance/rejection sampling. With the number of computations possible on modern hardware, acceptance/rejection sampling would be less grief.
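A minimal random-walk Metropolis sketch for the posterior above. The priors (weak Normal on $\alpha,\beta$, Cauchy-type on $\gamma$), step size, chain length, and synthetic-data values are all assumptions for illustration, not part of the answer's prescription:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from y = alpha + beta*x + Cauchy(scale=gamma) noise
n = 200
true_alpha, true_beta, true_gamma = 1.0, 2.0, 1.0
x = rng.normal(size=n)
y = true_alpha + true_beta * x + true_gamma * rng.standard_cauchy(size=n)

def log_post(alpha, beta, gamma):
    """Log posterior: Cauchy log-likelihood plus weak proper priors (assumed)."""
    if gamma <= 0:
        return -np.inf
    resid = y - beta * x - alpha
    loglik = np.sum(np.log(gamma) - np.log(gamma**2 + resid**2))
    logprior = -(alpha**2 + beta**2) / (2 * 100.0) - np.log(1 + gamma**2)
    return loglik + logprior

# Random-walk Metropolis over (alpha, beta, gamma)
theta = np.array([0.0, 0.0, 1.0])
cur = log_post(*theta)
draws = []
for _ in range(20000):
    prop = theta + 0.1 * rng.normal(size=3)
    cand = log_post(*prop)
    if np.log(rng.uniform()) < cand - cur:  # accept with prob min(1, ratio)
        theta, cur = prop, cand
    draws.append(theta.copy())
draws = np.array(draws[5000:])  # discard burn-in
print(draws.mean(axis=0))  # posterior means of (alpha, beta, gamma)
```

Because the Cauchy likelihood is bounded in the residuals, the posterior is well behaved even though the error distribution has no mean, which is exactly why this route works where least squares fails.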

Dave Harris