Violin plot of 2 numpy arrays with seaborn

Question

I would like to compare the distribution of 2 numpy arrays using a violin plot made with seaborn.

The maximal value in both arrays is 1. The plot suggests a higher maximum.

Am I misunderstanding the violin plot?

import numpy as np
import seaborn as sns

# 2 numpy arrays, numpy version 1.19.1

a0 = np.array([0.9875, 1., 0.9989, 0.9314, 0.9955, 0.8229, 0.9875, 1., 1., 0.9984, 
               0.8838, 0.8446, 1., 0.9989, 1., 0.9896, 1., 0.9912, 0.9871, 1., 
               0.9733, 0.9984, 0.9873, 0.9964, 0.9907, 1., 0.9948, 0.9851, 0.9984, 1., 
               0.9915, 1., 0.9984, 0.8637, 1.])

a1 = np.array([0.9867, 1, 0.9989, 0.9263, 0.9951, 0.807 , 0.9873, 1, 1, 0.9984, 
               0.879 , 0.7893, 1, 0.9989, 1, 0.9867, 1, 0.9908, 0.9807, 1, 
               0.9732, 0.9984, 0.9873, 0.9954, 0.936 , 1, 0.9932, 0.9838, 0.9984, 1, 
               0.9914, 1, 0.9984, 0.859 , 1])

# make violin plot with seaborn 0.11.0
sns.violinplot(data=[a0, a1])

Violin plot of the two arrays:

[Figure: violin plot of the 2 arrays]

Using matplotlib the result is quite different. Is there a bug in seaborn?

    import matplotlib.pyplot as plt
    plt.violinplot(dataset=[a0, a1])

[Figure: matplotlib.pyplot.violinplot]

I have never heard the term "Violin plot" before, but according to Wikipedia, *A violin plot is a method of plotting numeric data. It is similar to a box plot, **with the addition of a rotated kernel density plot on each side.*** So I guess it is to be expected that the plot extends beyond the minimum and maximum values in the data. (I guess "on each side" refers to the left and the right in the picture, but the kernel density smoothing extends the plot to the bottom and the top, which I think is what you are asking about) — , Sep 26 '20 at 12:22
Your actual data is the tiny black box in the middle of each Violin plot, I guess. — , Sep 26 '20 at 12:25
@mkrieger1- The tiny black box represents the interquartile range, as in a whisker box plot, and not all data. As you can see some values are smaller than 0.9. — , Sep 27 '20 at 14:10
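
To make the point from the comments concrete, here is a minimal NumPy-only sketch of a Gaussian kernel density estimate for `a0`. The helper `gaussian_kde_1d` and the Scott's-rule bandwidth are assumptions made for illustration, not seaborn's actual code. The sketch only shows that the smoothed curve places part of its mass above the largest observed value.

```python
import numpy as np

a0 = np.array([0.9875, 1., 0.9989, 0.9314, 0.9955, 0.8229, 0.9875, 1., 1., 0.9984,
               0.8838, 0.8446, 1., 0.9989, 1., 0.9896, 1., 0.9912, 0.9871, 1.,
               0.9733, 0.9984, 0.9873, 0.9964, 0.9907, 1., 0.9948, 0.9851, 0.9984, 1.,
               0.9915, 1., 0.9984, 0.8637, 1.])


def gaussian_kde_1d(samples, grid, bandwidth):
    """Hypothetical helper: average of Gaussian bumps centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)


# Scott's rule of thumb for the bandwidth (an assumption for this sketch).
bandwidth = a0.std(ddof=1) * len(a0) ** (-1 / 5)

# Evaluate the estimate a few bandwidths past the observed extremes.
grid = np.linspace(a0.min() - 3 * bandwidth, a0.max() + 3 * bandwidth, 400)
density = gaussian_kde_1d(a0, grid, bandwidth)

# Riemann-sum approximation of the estimated probability above the data maximum.
step = grid[1] - grid[0]
mass_above_max = density[grid > a0.max()].sum() * step

print(f"data max       : {a0.max():.4f}")
print(f"bandwidth      : {bandwidth:.4f}")
print(f"mass above max : {mass_above_max:.3f}")
```

Every Gaussian bump centred on a value at or near 1 spills over the boundary, so the estimate is positive above 1 even though no observation is. Seaborn's `violinplot` draws the density past the extreme data points by default (controlled by its `cut` parameter), which is why the violin reaches above the data maximum.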

Demetri Pananos · Answer 1 · 2020-09-28T14:32:10.310

Violin plots are drawn from kernel density estimates. These are in essence a tiny model, and if the data are bounded above or below and lie close to that boundary, then the model is a poor representation of the data. My recommendation is not to use violin plots and instead to plot the data with jitter and an alpha value, so that overlapping data points remain visible.

Here is an example using swarmplots. You can see a little more clearly (although not completely so) that the data have a ceiling effect. Below is the code to produce this figure.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

a0 = np.array([0.9875, 1., 0.9989, 0.9314, 0.9955, 0.8229, 0.9875, 1., 1., 0.9984, 
               0.8838, 0.8446, 1., 0.9989, 1., 0.9896, 1., 0.9912, 0.9871, 1., 
               0.9733, 0.9984, 0.9873, 0.9964, 0.9907, 1., 0.9948, 0.9851, 0.9984, 1., 
               0.9915, 1., 0.9984, 0.8637, 1.])

a1 = np.array([0.9867, 1, 0.9989, 0.9263, 0.9951, 0.807 , 0.9873, 1, 1, 0.9984, 
               0.879 , 0.7893, 1, 0.9989, 1, 0.9867, 1, 0.9908, 0.9807, 1, 
               0.9732, 0.9984, 0.9873, 0.9954, 0.936 , 1, 0.9932, 0.9838, 0.9984, 1, 
               0.9914, 1, 0.9984, 0.859 , 1])


# Reshape to long form: 'x' holds the array label, 'y' the value
df = pd.melt(pd.DataFrame({"A": a0, "B": a1}), var_name="x", value_name="y")

sns.swarmplot(data=df, x="x", y="y", hue="x")
plt.show()
```

Thank you for your explanation and recommendation. Could you please add an example of the (scatter) plot with jitter and alpha based on the 2 arrays? — Guido Cattani, Sep 28 '20 at 11:08
@GuidoCattani Please see my edits. If you think this post has answered your question, please consider accepting and upvoting. If this post has not answered your question, please advise how it could be improved. — Demetri Pananos, Sep 28 '20 at 14:32
Thank you for your valuable help. Your plots are indeed very clear. I gave you positive feedback, but I have too little reputation so far. My votes are recorded, but do not change the publicly displayed post score. — Guido Cattani, Sep 28 '20 at 18:39

score 1 · Answer 2 · answered Sep 26 '20 at 12:27

I think this plot tells you that the two datasets are very similar for the feature you are using. The spread of the blue one looks a little smaller, but the difference is not large enough to matter much. A violin plot basically shows you the median, the percentiles, and the spread, and if you read it vertically it also shows how much the distributions overlap (basically the whole univariate analysis in one graph).

In your case it is just telling you that both datasets are nearly equal, with a large overlap.

Raghav Agarwal


I agree with you that the 2 arrays have the same distribution. The point is that the range on the y axis of the violin plot is larger than the range of the data. – Sep 27 '20 at 14:38
I think the plot you are looking for is a box plot. A violin plot also contains a box plot: the range of the distribution is not the whole plot but, in your case, the black dot in the middle of each violin. That tells you how much the data are spread, while the whole shape, read vertically, shows the univariate PDF. If you want to check the spread (i.e. the median and percentiles), use a box plot; it is in the sns library. – Raghav Agarwal Sep 28 '20 at 03:00
