
I have a dataset of solar panels' output power. After visually inspecting the data distribution, I found that it is not normally distributed; it is right-skewed with many zeroes. I used the interquartile range rule to detect outliers and found that nearly 9 percent of the data is out of range. Is it possible for a dataset to have this percentage of outliers?

  • Outliers of right-skewed distributions are a tricky topic. – Michael M Jan 30 '22 at 14:22
  • OK, so zero output relates to no or marginal light conditions. High values relate to perfect days with panel positioning, clarity of the atmosphere, time of year, and hour of the day... If that is the case, that is not an outlier but a predictable, repeating, high optimal-output day. – AJKOER Jan 30 '22 at 21:01
  • As @AJKOER has addressed the problem for zeros, I would like to say the problem is not actually with the zeros. Currently, I have removed all values greater than 4 kWh because, based on the building and the number of solar panels, it is impossible for five solar panels to produce more than 4 kWh. So I don't know how to treat these values as outliers. Should I remove them or replace them with the mean? – graphicart86 Jan 30 '22 at 21:26
  • The best way to address the problem of an output that clearly is not reflective of solar data (too large) is to further research a sample of instances. It could be a reporting issue of mixed data, or pure solar data from more recently installed solar panels, or even fraud. Having a valid explanation is important for several reasons, including data analysis. – AJKOER Jan 30 '22 at 22:50
  • @graphicart86 Your comment asks a substantially different question than the main body of your Question. Perhaps you could ask a new Question about these circumstances; however, for it to be answerable, you'll need to provide more information about what, specifically, you want to learn from the data and why detecting and potentially removing outliers is important for that goal. – Sycorax Feb 01 '22 at 17:46

2 Answers


Paraphrasing the specific question

Is it possible for a dataset to have this percentage [9%] of outliers?

Of course it is possible. Here's a simple example.

Imagine a Bernoulli r.v. that takes on the value 0 with probability 0.91 and the value 1 with probability 0.09. In a large sample, the central quartiles are both 0 with high probability (because about 91% of the data are 0), so the IQR is 0 and every nonzero value falls outside the fences; the remaining roughly 9% of the data, the 1s, are therefore "outliers" according to this rule. We can contrive a real-world context in which these "outliers" might arise; perhaps the 1s are defects in some delicate manufacturing process.
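For concreteness, here is a minimal simulation of this contrived example (the sample size, seed, and the 1.5 × IQR fences are my choices for the sketch, not anything fixed by the argument above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli example: 0 with probability 0.91, 1 with probability 0.09
x = rng.binomial(n=1, p=0.09, size=100_000)

# Interquartile range rule of thumb with the usual 1.5 * IQR fences
q1, q3 = np.percentile(x, [25, 75])      # both quartiles are 0 here
iqr = q3 - q1                            # so the IQR is 0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = (x < lower) | (x > upper)

print(outliers.mean())                   # roughly 0.09: every 1 is flagged
```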


If your feeling is that labeling 9% of the data as outliers is too many, then it's fruitful to consider whether your procedure for detecting outliers makes sense in the context of the problem you're trying to solve. Instead of naively applying a "rule of thumb" to the problem of outlier detection, I would suggest thinking carefully about what your problem is and how outlier detection purports to help solve it.

There is not a single correct outlier detection method because there is little agreement about what an "outlier" is! See: Rigorous definition of an outlier?

Sycorax

As the log-normal distribution is right-skewed, consider applying a log transform to your data. This is particularly appropriate if the data relate to percent changes, as commonly occurs with economic data.

Now, apply the usual normal distribution based test.

So, if you applied the natural log transform, namely ln(x), it is reversed by applying the exp(x) function. Note, with respect to interpretation, that the center of a confidence interval derived on the log scale corresponds, after back-transformation, to the median of the untransformed data rather than its mean.
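As a minimal sketch of this workflow (assuming the zeros are set aside first and taking mean ± 3 SD on the log scale as the "usual" normal-distribution-based rule; the simulated data are only a stand-in for the real readings):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in right-skewed positive data; replace with the nonzero power readings
power = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)

log_power = np.log(power)                      # ln(x) transform

# Normal-distribution-based rule on the log scale: mean +/- 3 standard deviations
m, s = log_power.mean(), log_power.std()
lo, hi = m - 3 * s, m + 3 * s
outliers = (log_power < lo) | (log_power > hi)

# Reverse the transform with exp(); note that exp(m) estimates the *median*
# of the untransformed data, not its mean
print(np.exp([lo, m, hi]), outliers.mean())
```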

You might wish to review the literature on data transformation (start with the Box-Cox method, or the more advanced "Bayesian analysis of the Box-Cox transformation model based on left-truncated and right-censored data") before removing data points from possibly non-normal data and thereby losing information content.
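For illustration, a minimal sketch of a Box-Cox fit with SciPy's scipy.stats.boxcox; the simulated data, the 50% share of zeros, and the choice to shift by adding 1 (echoing the comments below) are assumptions of mine, not part of the method itself:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Stand-in data: many zeros plus a right-skewed positive part
power = np.where(rng.random(10_000) < 0.5, 0.0,
                 rng.lognormal(mean=0.0, sigma=0.5, size=10_000))

# Box-Cox requires strictly positive input, so shift by a constant first.
# A value of 1 transforms to 0 for any lambda, so the original zeros stay at 0.
shifted = power + 1.0
transformed, lam = stats.boxcox(shifted)   # lambda estimated by maximum likelihood

print(lam, transformed.min(), transformed.max())
```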

I hope this helps.

AJKOER
  • These techniques will not be helpful in the present instance. No transformation of a dataset with "many zeros" is going to make it look anywhere near Normal. – whuber Jan 30 '22 at 15:40
  • Actually, there is a form of Box-Cox, which I programmed myself, that allows adding a constant to the data before applying the transform. I would suggest just adding 1 here (otherwise one can search for a more optimal value), as zeros become 1 and its log value is 0. Note: I am only first suggesting the exploration of a transform so as not to reduce potentially interesting information content. – AJKOER Jan 30 '22 at 20:15
  • I would also note that a zero here is likely a distinct value (from the rest of the data), relating in this solar power case simply to the absence of any significant light. The data when there is light may be non-normal, being the product of a photo-chemical process subject to varying cloud coverage. Why not best assess the non-zero power distribution with a data transform? – AJKOER Jan 30 '22 at 20:32
  • Adding a constant in the Box-Cox transformation won't fix the problem. Perhaps a good solution is to change the model, as you suggest in your comment--but there's no way that a transformation is going to make these data look approximately Normal until that spike at the zeros is separately dealt with. – whuber Jan 31 '22 at 01:54
  • Thanks @whuber, and what is your suggestion for all the zero values? More than 50 percent of the data is zero. – graphicart86 Jan 31 '22 at 08:51
  • You haven't described your data or your objectives well enough to provide any advice. You haven't said whether this is an explanatory or response variable, nor given any indication of your objectives or planned analyses. – whuber Jan 31 '22 at 16:28