I am building a machine learning model for a binary classification task in Python (Jupyter Notebook). I am currently in the exploratory data analysis phase and am trying to create several plots/graphs for my data set.
My data set consists of 20 columns (19 features and 1 labeled target). Each row represents a person. Most of the features are categorical (nominal); only a few are numerical (continuous). Unfortunately, I cannot upload the real data set, so I have created a dummy one:
| personID | age | car | TARGET_happiness |
|---|---|---|---|
| 1 | 27 | ford | 0 |
| 2 | 41 | tesla | 1 |
| 3 | 55 | bmw | 0 |
| 4 | 34 | tesla | 1 |
| 5 | 62 | ford | 1 |
| 6 | 38 | ford | 1 |
| 7 | 51 | bmw | 0 |
| 8 | 46 | tesla | 1 |
| 9 | 72 | bmw | 0 |
| 10 | 59 | tesla | 0 |
| 11 | 48 | ford | 0 |
| 12 | 51 | bmw | 1 |
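For anyone who wants to reproduce this, the dummy table can be loaded roughly as follows. This is a minimal sketch that assumes pandas; the variable name `df` is only illustrative, and the values are copied from the table above, not from the real data set.

```python
import pandas as pd

# Dummy data copied from the table above (not the real data set).
df = pd.DataFrame({
    "personID": list(range(1, 13)),
    "age": [27, 41, 55, 34, 62, 38, 51, 46, 72, 59, 48, 51],
    "car": ["ford", "tesla", "bmw", "tesla", "ford", "ford",
            "bmw", "tesla", "bmw", "tesla", "ford", "bmw"],
    "TARGET_happiness": [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1],
})

# Treat car as a nominal (unordered) categorical feature.
df["car"] = df["car"].astype("category")

print(df.dtypes)
```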
My aim is to create a plot/graph to visualize the relationship between the binary variable `TARGET_happiness` (meaning "is the person happy?") and the categorical variable `car` (meaning "which car does this person own?").
The plot I've used for binary `TARGET_happiness` vs. continuous `age` is a box plot.
This seems fine. I have also tried a box plot for binary `TARGET_happiness` vs. categorical `car`.
I'm not sure whether this plot is useful or appropriate. Sure, you can see that Tesla owners seem to be happier than BMW owners, but the box for Ford owners looks strange.
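To make the numbers behind that impression explicit, the counts and the share of happy people per car brand can be tabulated directly. This is a minimal sketch that assumes pandas and uses only the dummy rows from the table; the names `counts` and `happy_share` are illustrative.

```python
import pandas as pd

# Dummy rows from the table above (car and target columns only).
car = ["ford", "tesla", "bmw", "tesla", "ford", "ford",
       "bmw", "tesla", "bmw", "tesla", "ford", "bmw"]
happy = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1]
df = pd.DataFrame({"car": car, "TARGET_happiness": happy})

# Counts of unhappy (0) and happy (1) people per car brand.
counts = pd.crosstab(df["car"], df["TARGET_happiness"])

# Share of happy people per brand: the mean of a 0/1 column.
happy_share = df.groupby("car")["TARGET_happiness"].mean()

print(counts)
print(happy_share)
```

On the dummy data the Ford owners are split evenly (2 happy, 2 not), so the quartiles of their 0/1 values fall at 0 and 1. That is probably why the Ford box stretches across the whole range.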
Which type of plot/graph can I use to better visualize the relationship between a binary variable and a categorical variable?