I have two data sets: one called user data and the other called end song data.
User data has the following information:
- gender of user (m or f)
- age range of user (7 different age ranges and some are nan)
- country of user (over 50 countries)
- account age in weeks
- user id
End song data has the following information:
- song the user played
- milliseconds the song was played
- context in which the song was played (playlist, album, collection, etc)
- track_id
- product (open or closed)
- end_timestamp
I am new to data science and come from more of a computer science background, so I need some help with the statistical aspect. This is a lot more data, in many more dimensions, than I have ever worked with, and I don't know where to start analyzing it.
I am trying to learn how to go about analyzing this data set. What I do know are basic statistical measures for comparing two numerical columns, such as their correlation coefficient. I am looking for a higher-level approach that compares correlations across many columns at once. I am also confused about how to include the qualitative (categorical) data in my calculations.
Does anyone have any ideas for how I can use Python to look for correlations between user demographic features (or user behavior) and overall listening or average session length? Thank you for your help.
Ideas I had:
1. Split the data set by country and gender, then compute a correlation matrix over the remaining numerical columns for males and females separately within each country.
2. Is there some way to do clustering on this?
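To make idea 1 concrete, here is a rough sketch of what I have in mind using pandas. The toy data and all column names (`user_id`, `gender`, `country`, `acct_age_weeks`, `ms_played`) are made up for illustration, and it assumes the end song data also has a `user_id` column so the two tables can be joined. I am not sure this is the right approach.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions, not the real schema.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "gender": ["m", "m", "m", "f", "f", "f"],
    "country": ["US", "US", "US", "US", "US", "US"],
    "acct_age_weeks": [10, 52, 104, 5, 30, 80],
})
end_song = pd.DataFrame({
    # Assumes each play record can be tied back to a user.
    "user_id": [1, 1, 2, 3, 3, 4, 5, 5, 6],
    "ms_played": [30000, 60000, 120000, 200000, 50000,
                  10000, 90000, 30000, 240000],
})

# Aggregate listening per user: total milliseconds and number of plays.
per_user = (
    end_song.groupby("user_id")["ms_played"]
    .agg(total_ms="sum", plays="count")
    .reset_index()
)

# Join the per-user listening summary onto the demographic table.
merged = users.merge(per_user, on="user_id", how="left")

# One correlation matrix per (country, gender) group over the numeric columns.
numeric_cols = ["acct_age_weeks", "total_ms", "plays"]
corr_by_group = merged.groupby(["country", "gender"])[numeric_cols].corr()
print(corr_by_group)
```

The result is indexed by country, gender, and column name, so each group's matrix can be pulled out with something like `corr_by_group.loc[("US", "m")]`. With over 50 countries, some groups would presumably be too small for the coefficients to mean much.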