Analyzing the Spotify dataset to gain insights into the music industry
Building a strong foundation in the pandas library by working on the 'Spotify' dataset. We will discuss some very basic tools that pandas provides to help gain insights into any dataset in the music domain.
In the past two posts of our Pandas series, we analyzed data from the Chipotle restaurant chain and the Flipkart online store. Today, we're going to look at the Spotify dataset from the perspective of a recording studio start-up.
Imagine wanting to get started in the music industry. You believe you have a knack for spotting great talent and helping artists become stars. Yet, apart from that instinct, you need to be familiar with the facts.
You want to know:
- What tracks are most popular amongst Spotify users?
- How many tracks have a popularity score above 90 out of 100?
- Which tracks released in March 2020 reached a popularity above 80?
- Is there a correlation between popularity and a track's traits?
- How long should an average track last according to today's standards?
- What is the correlation between tracks' different features?
- Who is currently most popular and what genres do they represent?
Data to download: https://www.kaggle.com/lehaknarnauli/spotify-datasets
For a detailed explanation of the dataset: https://developer.spotify.com/documentation/web-api/reference/#category-tracks
Pre-processing the data
Today we'll only need the pandas library to get started.
import pandas as pd
The Spotify dataset is quite large, and it comes as several files containing slightly different data. Today we'll use the tracks and artists datasets. We'll start with the tracks dataset.
# Loading the dataset
df_tracks = pd.read_csv('/content/drive/MyDrive/tracks.csv')
df_tracks
You'll see that this dataset consists of 122860 rows and 20 columns. To check whether we can trust this dataset, it's important to see if any values are missing.
pd.isnull(df_tracks).sum().sum()
71
pd.isnull() returns a DataFrame of booleans, True where a value is missing and False otherwise. Calling sum() twice on it gives the total number of missing values in the dataset. If we only called it once, we'd get the number of missing values in each column.
There are just 71 missing values in the entire dataset, a negligible share for a dataset of this size, so they should not distort our conclusions. Such a dataset is a Pythonista's delight!
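To make the difference between one and two calls to sum() concrete, here is a minimal sketch on a small made-up DataFrame. The column values are illustrative assumptions and are not taken from tracks.csv.

```python
import numpy as np
import pandas as pd

# Toy data standing in for tracks.csv; all values are made up for illustration.
toy = pd.DataFrame({
    "name": ["Song A", None, "Song C", "Song D"],
    "popularity": [90, 75, np.nan, 60],
    "tempo": [120.0, 98.5, 140.2, np.nan],
})

# One sum(): number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Two sum() calls: total number of missing values in the whole frame.
missing_total = int(toy.isnull().sum().sum())

# Share of rows that contain at least one missing value.
share_rows_affected = toy.isnull().any(axis=1).mean()

print(missing_per_column)
print(missing_total, share_rows_affected)
```

The per-column view is often the more useful one in practice, because it shows whether the gaps sit in a column you actually plan to analyze.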
To make things easier, we should change release_date to a date type and then put months and years into separate columns. We did a similar operation last time, so you might already be familiar with this method. Let's practice.
# Convert release_date from text to datetime so that year and month can be extracted
df_tracks['release_date'] = pd.to_datetime(df_tracks['release_date'])
year = df_tracks['release_date'].dt.year
month = df_tracks['release_date'].dt.month
df_tracks.insert(loc=8, column='year', value=year)
df_tracks.insert(loc=9, column='month', value=month)
This time, instead of assigning a new column with df_tracks['year'], we used the insert() function. This method allows us to choose the exact position of the new column (loc). If we had done it the old-fashioned way, the new columns would have been appended at the end.
Exploring the dataset
The dataset we're using right now consists of 20 columns. Here we'll find information such as name, popularity, duration_ms, explicit, artists, release_date, and track traits like danceability, speechiness, loudness, etc.
Popularity is measured on a scale between 0 and 100, where 100 is the best. Given our knowledge of the music industry, let's check if what we feel is true.
- What are the most popular songs right now?
To check this, let's use a great pandas function, query(). It filters the rows of a DataFrame using a boolean expression.
Note that a column name containing spaces cannot be written directly inside a DataFrame.query() expression. Either wrap such a name in backticks or replace the spaces with '_' before applying the method.
Let's learn how this function is used.
most_popular = df_tracks.query('popularity>90', inplace=False).sort_values('popularity', ascending=False)
most_popular[:10]
It's important to note that the whole function's expression is passed in quotation marks.
| | id | name | popularity | duration_ms | explicit | artists | id_artists | release_date | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | year | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
93802 | 4iJyoBOLtHqaGxP12qzhQI | Peaches (feat. Daniel Caesar & Giveon) | 100 | 198082 | 1 | ['Justin Bieber', 'Daniel Caesar', 'Giveon'] | ['1uNFoZAHBGtllmzznpCI3s', '20wkVLutqVOYrc0kxF... | 2021-03-19 | 0.677 | 0.696 | 0 | -6.181 | 1 | 0.1190 | 0.32100 | 0.000000 | 0.4200 | 0.464 | 90.030 | 4 | 2021 | 3 |
93803 | 7lPN2DXiMsVn7XUKtOW1CS | drivers license | 99 | 242014 | 1 | ['Olivia Rodrigo'] | ['1McMsnEElThX1knmY4oliG'] | 2021-01-08 | 0.585 | 0.436 | 10 | -8.761 | 1 | 0.0601 | 0.72100 | 0.000013 | 0.1050 | 0.132 | 143.874 | 4 | 2021 | 1 |
93804 | 3Ofmpyhv5UAQ70mENzB277 | Astronaut In The Ocean | 98 | 132780 | 0 | ['Masked Wolf'] | ['1uU7g3DNSbsu0QjSEqZtEd'] | 2021-01-06 | 0.778 | 0.695 | 4 | -6.865 | 0 | 0.0913 | 0.17500 | 0.000000 | 0.1500 | 0.472 | 149.996 | 4 | 2021 | 1 |
92810 | 5QO79kh1waicV47BqGRL3g | Save Your Tears | 97 | 215627 | 1 | ['The Weeknd'] | ['1Xyo4u8uXC1ZmMpatF05PJ'] | 2020-03-20 | 0.680 | 0.826 | 0 | -5.487 | 1 | 0.0309 | 0.02120 | 0.000012 | 0.5430 | 0.644 | 118.051 | 4 | 2020 | 3 |
92811 | 6tDDoYIxWvMLTdKpjFkc1B | telepatía | 97 | 160191 | 0 | ['Kali Uchis'] | ['1U1el3k54VvEUzo3ybLPlM'] | 2020-12-04 | 0.653 | 0.524 | 11 | -9.016 | 0 | 0.0502 | 0.11200 | 0.000000 | 0.2030 | 0.553 | 83.970 | 4 | 2020 | 12 |
92813 | 0VjIjW4GlUZAMYd2vXMi3b | Blinding Lights | 96 | 200040 | 0 | ['The Weeknd'] | ['1Xyo4u8uXC1ZmMpatF05PJ'] | 2020-03-20 | 0.514 | 0.730 | 1 | -5.934 | 1 | 0.0598 | 0.00146 | 0.000095 | 0.0897 | 0.334 | 171.005 | 4 | 2020 | 3 |
93805 | 7MAibcTli4IisCtbHKrGMh | Leave The Door Open | 96 | 242096 | 0 | ['Bruno Mars', 'Anderson .Paak', 'Silk Sonic'] | ['0du5cEVh5yTK9QJze8zA0C', '3jK9MiCrA42lLAdMGU... | 2021-03-05 | 0.586 | 0.616 | 5 | -7.964 | 1 | 0.0324 | 0.18200 | 0.000000 | 0.0927 | 0.719 | 148.088 | 4 | 2021 | 3 |
92814 | 6f3Slt0GbA2bPZlz0aIFXN | The Business | 95 | 164000 | 0 | ['Tiësto'] | ['2o5jDhtHVPhrJdv3cEQ99Z'] | 2020-09-16 | 0.798 | 0.620 | 8 | -7.079 | 0 | 0.2320 | 0.41400 | 0.019200 | 0.1120 | 0.235 | 120.031 | 4 | 2020 | 9 |
91866 | 60ynsPSSKe6O3sfwRnIBRf | Streets | 94 | 226987 | 1 | ['Doja Cat'] | ['5cj0lLjcoR7YOSnhnX0Po5'] | 2019-11-07 | 0.749 | 0.463 | 11 | -8.433 | 1 | 0.0828 | 0.20800 | 0.037100 | 0.3370 | 0.190 | 90.028 | 4 | 2019 | 11 |
92816 | 3FAJ6O0NOHQV8Mc5Ri6ENp | Heartbreak Anniversary | 94 | 198371 | 0 | ['Giveon'] | ['4fxd5Ee7UefO4CUXgwJ7IP'] | 2020-03-27 | 0.449 | 0.465 | 0 | -8.964 | 1 | 0.0791 | 0.52400 | 0.000001 | 0.3030 | 0.543 | 89.087 | 3 | 2020 | 3 |
At first sight, we can see that nearly all of the 10 most popular songs were released in 2020 or 2021 and that almost half of them contain explicit content, indicated by the binary 1 in the explicit column.
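As a side note on column names with spaces in query(), here is a small sketch of the two usual workarounds. The frame and its "track name" column are made up for illustration; tracks.csv itself has no such column.

```python
import pandas as pd

# Toy frame with a column name that contains a space (illustrative only).
toy = pd.DataFrame({
    "track name": ["Song A", "Song B", "Song C"],
    "popularity": [95, 40, 91],
})

# Option 1: wrap the awkward column name in backticks inside the expression.
hits = toy.query("popularity > 90 and `track name` != 'Song C'")

# Option 2: normalise the column names once, then query without backticks.
renamed = toy.rename(columns=lambda c: c.replace(" ", "_"))
hits_renamed = renamed.query("popularity > 90 and track_name != 'Song C'")

print(hits)
print(hits_renamed)
```

Renaming once up front tends to pay off when many expressions follow; backticks are handy for a one-off filter.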
To see if our conclusions are right, let's sort the filtered values and show the columns of interest.
pop_date = most_popular.sort_values('release_date', ascending=False)
pop_date[['name', 'popularity', 'explicit','release_date']][:20]
| | name | popularity | explicit | release_date |
---|---|---|---|---|
93802 | Peaches (feat. Daniel Caesar & Giveon) | 100 | 1 | 2021-03-19 |
93805 | Leave The Door Open | 96 | 0 | 2021-03-05 |
93815 | What’s Next | 91 | 1 | 2021-03-05 |
93811 | Hold On | 92 | 0 | 2021-03-05 |
93816 | We're Good | 91 | 0 | 2021-02-11 |
93813 | 911 | 91 | 1 | 2021-02-05 |
93809 | Up | 92 | 1 | 2021-02-05 |
93806 | Fiel | 94 | 0 | 2021-02-04 |
93808 | Ella No Es Tuya - Remix | 92 | 0 | 2021-02-03 |
93812 | Wellerman - Sea Shanty / 220 KID x Billen Ted ... | 92 | 0 | 2021-01-21 |
93810 | Goosebumps - Remix | 92 | 1 | 2021-01-15 |
93814 | Your Love (9PM) | 91 | 0 | 2021-01-15 |
93807 | Friday (feat. Mufasa & Hypeman) - Dopamine Re-... | 94 | 0 | 2021-01-15 |
93803 | drivers license | 99 | 1 | 2021-01-08 |
93804 | Astronaut In The Ocean | 98 | 0 | 2021-01-06 |
92823 | Good Days | 93 | 1 | 2020-12-25 |
92819 | Bandido | 94 | 0 | 2020-12-10 |
92811 | telepatía | 97 | 0 | 2020-12-04 |
92821 | LA NOCHE DE ANOCHE | 93 | 0 | 2020-11-27 |
92830 | Dynamite | 91 | 0 | 2020-11-20 |
We know which songs are the most popular in general, but as a good producer, you need to understand human emotions and how they shape the market.
In times of crisis, both artists and audiences have different tastes. In March 2020, much of the world went into lockdown. It's natural to wonder which top songs were released back then.
We're going to use query() again, but this time we go one step further and tackle a slightly more complex problem by combining several conditions.
The conditions are as follows:-
- Songs with popularity greater than 80.
- Songs that were released in March 2020.
Let's see how we'll approach this problem.
most_popular_march_20 = df_tracks.query('(popularity > 80) and (year == 2020) and (month == 3)')
most_popular_march_20
| | id | name | popularity | duration_ms | explicit | artists | id_artists | release_date | year | month | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
92810 | 5QO79kh1waicV47BqGRL3g | Save Your Tears | 97 | 215627 | 1 | ['The Weeknd'] | ['1Xyo4u8uXC1ZmMpatF05PJ'] | 2020-03-20 | 2020 | 3 | 0.680 | 0.826 | 0 | -5.487 | 1 | 0.0309 | 0.02120 | 0.000012 | 0.5430 | 0.644 | 118.051 | 4 | |
92813 | 0VjIjW4GlUZAMYd2vXMi3b | Blinding Lights | 96 | 200040 | 0 | ['The Weeknd'] | ['1Xyo4u8uXC1ZmMpatF05PJ'] | 2020-03-20 | 2020 | 3 | 0.514 | 0.730 | 1 | -5.934 | 1 | 0.0598 | 0.00146 | 0.000095 | 0.0897 | 0.334 | 171.005 | 4 | |
92816 | 3FAJ6O0NOHQV8Mc5Ri6ENp | Heartbreak Anniversary | 94 | 198371 | 0 | ['Giveon'] | ['4fxd5Ee7UefO4CUXgwJ7IP'] | 2020-03-27 | 2020 | 3 | 0.449 | 0.465 | 0 | -8.964 | 1 | 0.0791 | 0.52400 | 0.000001 | 0.3030 | 0.543 | 89.087 | 3 | |
92853 | 4xqrdfXkTW4T0RauPLv3WA | Heather | 89 | 198040 | 0 | ['Conan Gray'] | ['4Uc8Dsxct0oMqx0P6i60ea'] | 2020-03-20 | 2020 | 3 | 0.357 | 0.425 | 5 | -7.301 | 1 | 0.0333 | 0.58400 | 0.000000 | 0.3220 | 0.270 | 102.078 | 3 | |
92867 | 5nujrmhLynf4yMoMtj8AQF | Levitating (feat. DaBaby) | 89 | 203064 | 0 | ['Dua Lipa', 'DaBaby'] | ['6M2wZ9GZgrQXHCFfjv46we', '4r63FhuTkUYltbVAg5... | 2020-03-27 | 2020 | 3 | 0.702 | 0.825 | 6 | -3.787 | 0 | 0.0601 | 0.00883 | 0.000000 | 0.0674 | 0.915 | 102.977 | 4 | |
92927 | 7szuecWAPwGoV1e5vGu8tl | In Your Eyes | 86 | 237520 | 1 | ['The Weeknd'] | ['1Xyo4u8uXC1ZmMpatF05PJ'] | 2020-03-20 | 2020 | 3 | 0.667 | 0.719 | 7 | -5.371 | 0 | 0.0346 | 0.00285 | 0.000081 | 0.0736 | 0.717 | 100.021 | 4 | |
92951 | 6KfoDhO4XUWSbnyKjNp9c4 | Maniac | 86 | 185773 | 0 | ['Conan Gray'] | ['4Uc8Dsxct0oMqx0P6i60ea'] | 2020-03-20 | 2020 | 3 | 0.628 | 0.639 | 8 | -5.460 | 1 | 0.0435 | 0.00162 | 0.000000 | 0.3540 | 0.493 | 108.045 | 4 | |
92961 | 3PfIrDoz19wz7qK7tYeu62 | Don't Start Now | 85 | 183290 | 0 | ['Dua Lipa'] | ['6M2wZ9GZgrQXHCFfjv46we'] | 2020-03-27 | 2020 | 3 | 0.793 | 0.793 | 11 | -4.521 | 0 | 0.0830 | 0.01230 | 0.000000 | 0.0951 | 0.679 | 123.950 | 4 | |
92995 | 5m5aY6S9ttfIG157xli2Rs | Alô Ambev (Segue Sua Vida) - Ao Vivo | 84 | 169593 | 0 | ['Zé Neto & Cristiano'] | ['487N2T9nIPEHrlTZLL3SQs'] | 2020-03-26 | 2020 | 3 | 0.695 | 0.872 | 9 | -3.650 | 1 | 0.0868 | 0.33400 | 0.000000 | 0.9540 | 0.646 | 121.843 | 4 | |
93021 | 527k23H0A4Q0UJN3vGs0Da | After Party | 84 | 167916 | 1 | ['Don Toliver'] | ['4Gso3d4CscCijv0lmajZWs'] | 2020-03-13 | 2020 | 3 | 0.629 | 0.692 | 5 | -8.045 | 1 | 0.0376 | 0.00981 | 0.331000 | 0.6030 | 0.453 | 162.948 | 4 | |
93025 | 017PF4Q3l4DBUiWoXk4OWT | Break My Heart | 84 | 221820 | 0 | ['Dua Lipa'] | ['6M2wZ9GZgrQXHCFfjv46we'] | 2020-03-27 | 2020 | 3 | 0.730 | 0.729 | 4 | -3.434 | 0 | 0.0883 | 0.16700 | 0.000001 | 0.3490 | 0.467 | 113.013 | 4 | |
93071 | 1jaTQ3nqY3oAAYyCTbIvnM | WHATS POPPIN | 83 | 139741 | 1 | ['Jack Harlow'] | ['2LIk90788K0zvyj2JJVwkJ'] | 2020-03-13 | 2020 | 3 | 0.923 | 0.604 | 11 | -6.671 | 0 | 0.2450 | 0.01700 | 0.000000 | 0.2720 | 0.826 | 145.062 | 4 | |
93079 | 7AzlLxHn24DxjgQX73F9fU | No Idea | 83 | 154424 | 0 | ['Don Toliver'] | ['4Gso3d4CscCijv0lmajZWs'] | 2020-03-13 | 2020 | 3 | 0.652 | 0.631 | 6 | -5.718 | 0 | 0.0893 | 0.52400 | 0.000579 | 0.1650 | 0.350 | 127.998 | 4 | |
93139 | 39LLxExYz6ewLAcYrzQQyP | Levitating | 82 | 203808 | 0 | ['Dua Lipa'] | ['6M2wZ9GZgrQXHCFfjv46we'] | 2020-03-27 | 2020 | 3 | 0.695 | 0.884 | 6 | -2.278 | 0 | 0.0753 | 0.05610 | 0.000000 | 0.2130 | 0.914 | 103.014 | 4 | |
93149 | 4lsHZ92XCFOQfzJFBTluk8 | You Got It | 82 | 203145 | 1 | ['Vedo'] | ['3wVXTWabe3viT0jF7DfjOL'] | 2020-03-27 | 2020 | 3 | 0.762 | 0.433 | 5 | -8.937 | 1 | 0.1870 | 0.14300 | 0.000000 | 0.1180 | 0.394 | 122.074 | 4 | |
93187 | 2lCkncy6bIB0LTMT7kvrD1 | Azul | 81 | 205933 | 0 | ['J Balvin'] | ['1vyhD5VmyZ7KMfW5gqLgo5'] | 2020-03-19 | 2020 | 3 | 0.843 | 0.836 | 11 | -2.474 | 0 | 0.0695 | 0.08160 | 0.001380 | 0.0532 | 0.650 | 94.018 | 4 | |
93191 | 6qBFSepqLCuh5tehehc1bd | Like I Want You | 81 | 260776 | 0 | ['Giveon'] | ['4fxd5Ee7UefO4CUXgwJ7IP'] | 2020-03-27 | 2020 | 3 | 0.678 | 0.355 | 10 | -7.757 | 0 | 0.0627 | 0.75900 | 0.000071 | 0.1140 | 0.438 | 119.772 | 3 | |
93221 | 6bnF93Rx87YqUBLSgjiMU8 | Heartless | 81 | 198267 | 1 | ['The Weeknd'] | ['1Xyo4u8uXC1ZmMpatF05PJ'] | 2020-03-20 | 2020 | 3 | 0.537 | 0.746 | 10 | -5.507 | 0 | 0.1500 | 0.02360 | 0.000001 | 0.1560 | 0.252 | 170.062 | 4 |
The titles of the songs seem to mirror the world's mood back then: 'Save Your Tears', 'Heartbreak Anniversary', 'Maniac', 'Break My Heart', and more.
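A compound filter like the March 2020 one can also be written with Python variables, which query() picks up through the @ prefix. The sketch below uses made-up rows, and the variable names min_popularity, target_year and target_month are assumptions chosen for illustration. It also shows the equivalent boolean mask for comparison.

```python
import pandas as pd

# Toy data; the values are illustrative, not taken from tracks.csv.
toy = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C", "Song D"],
    "popularity": [97, 85, 79, 90],
    "year": [2020, 2020, 2020, 2021],
    "month": [3, 3, 3, 3],
})

min_popularity = 80  # hypothetical threshold, referenced with @ below
target_year, target_month = 2020, 3

# query(): conditions joined with 'and'; @name pulls in a Python variable.
via_query = toy.query(
    "popularity > @min_popularity and year == @target_year and month == @target_month"
)

# The same filter written as a boolean mask.
mask = (
    (toy["popularity"] > min_popularity)
    & (toy["year"] == target_year)
    & (toy["month"] == target_month)
)
via_mask = toy[mask]

print(via_query)
```

Both forms select the same rows. The query() version reads closer to plain language, while the mask version makes the column types explicit, which helps when a comparison unexpectedly returns nothing.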
Features and Popularity
Let's move on to our next problem. We know that different features of a song can affect its popularity in different ways, but we want to dig deeper and see how. This is one of the most important questions we should ask.
- How do different features of a song impact its popularity?
Our intuition is that large audiences like songs that are easy to dance to. Let's see if we can back this up with data.
df_1=df_tracks.groupby('popularity')['danceability'].mean().sort_values(ascending=False).reset_index()
df_1.head()
| | popularity | danceability |
---|---|---|
0 | 95 | 0.798000 |
1 | 98 | 0.778000 |
2 | 91 | 0.751091 |
3 | 88 | 0.727105 |
4 | 85 | 0.712600 |
We have created a new dataframe, df_1. It holds the mean danceability score of all songs at each popularity level.
Now, this makes it easier for us to analyze the correlation between these two features.
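When reading a grouped table like df_1, it helps to know how many tracks stand behind each mean. For instance, the 0.798 at popularity 95 equals the danceability of 'The Business', the popularity-95 track in the top-10 table shown earlier. The sketch below adds a count next to the mean on made-up data; the column names mean_danceability and n_tracks are assumptions for illustration.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "popularity": [90, 90, 50, 50, 10],
    "danceability": [0.8, 0.6, 0.5, 0.3, 0.2],
})

# Mean danceability per popularity level, plus the number of tracks behind it.
summary = (
    toy.groupby("popularity")["danceability"]
    .agg(mean_danceability="mean", n_tracks="size")
    .sort_values("mean_danceability", ascending=False)
    .reset_index()
)

print(summary)
```

A mean based on one or two tracks is far less stable than one based on thousands, so the count column is worth a glance before drawing conclusions from the top rows.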
We will use the basics of Plotly, a plotting library for Python. It is an interactive, open-source plotting library that supports various charts.
It's important to import the library before moving ahead.
import plotly.express as px #importing plotly
fig2 = px.scatter(df_tracks, x="popularity", y="danceability", color="danceability",size='popularity')
fig2.show()
Bingo! We have plotted a scatter plot. In this case, since the area of each circle corresponds to the popularity score (and its color to the danceability score), we can call this chart a Bubble Chart as well. The higher the popularity score, the larger the bubble, and vice versa.
The graph is interactive, and with one quick look at it we realize that 'popularity' and 'danceability' are positively correlated: as the popularity of a song increases, its danceability score tends to increase too.
We don't always need to plot a graph to check the correlation between two features. The same can be achieved with the few lines of code given below.
We will use a module called scipy.stats for this. This module contains a large number of probability distributions as well as a growing library of statistical functions.
Next, we import the pearsonr function from this module, which calculates Pearson's correlation coefficient 'r' for two features.
Let's learn how.
from scipy.stats import pearsonr #importing the library
data1 = df_1['popularity']
data2 = df_1['danceability']
# calculate Pearson's correlation
corr, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % corr)
Pearsons correlation: 0.888
A quick revision. Pearson's correlation coefficient 'r' is read as follows:-
- r>0 implies a positive correlation
- r=0 implies no linear correlation
- r<0 implies a negative correlation.
Since the correlation coefficient r = 0.888 (>0), the two features are positively correlated: an increase in one tends to go with an increase in the other, and vice versa. Keep in mind that this value was computed on df_1, that is, on the mean danceability at each popularity level rather than on individual tracks, so it is likely to overstate how tightly the two are linked for any single song.
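Because the coefficient above is calculated from df_1, it describes averages per popularity level, not single tracks. The sketch below uses synthetic data (an assumption, built so that the effect is easy to see) to compare the two views. It relies on pandas' Series.corr(), which computes Pearson's r by default.

```python
import pandas as pd

# Synthetic data: a weak trend (0.1 per popularity step) hidden in large,
# symmetric track-to-track scatter (+1 / -1 around the trend).
rows = []
for pop in range(5):
    for offset in (-1.0, 1.0):
        rows.append({"popularity": pop, "feature": 0.1 * pop + offset})
toy = pd.DataFrame(rows)

# Pearson r across individual tracks.
r_tracks = toy["popularity"].corr(toy["feature"])

# Pearson r across per-popularity means, as done with df_1 in the text.
means = toy.groupby("popularity")["feature"].mean().reset_index()
r_means = means["popularity"].corr(means["feature"])

print(f"per-track r: {r_tracks:.3f}, per-group-mean r: {r_means:.3f}")
```

Averaging removes the scatter inside each group, so the group-level coefficient comes out much higher than the track-level one. Both numbers are legitimate, but they answer different questions.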
Let's visualize the same for some other features as well.
How about "instrumentalness"? A value close to 1 indicates that the track most likely contains no vocals; the lower the value, the more likely it is to contain sung or spoken words.
Following a similar procedure, we plot a Bubble Chart for Popularity v/s Instrumentalness.
Looking at the graph, it is easy to notice that the two features are negatively correlated: an increase in one goes with a decrease in the other, and vice versa.
So, do we need to check the correlation for each feature one by one? Not at all.
Thanks to pandas and plotly, we can easily check the correlation between any two given features. How? Let's find out.
We will use a combination of the corr() function in pandas and heatmap in plotly. So what exactly do these two do?
- corr() - This pandas function computes the pairwise correlation between the columns of the dataset (excluding NA/null values).
- Heatmap - This plotly function shows the magnitude of a phenomenon as colour in two dimensions.
To see how the combination of these two can help us achieve the desired result, we first need to import another plotly module called Graph Objects.
import plotly.graph_objects as go #importing the library
#numeric features to compare; restricting to them keeps the matrix aligned with the axis labels
x_list=['popularity','duration_ms','explicit',
'danceability','energy','key','loudness',
'mode','speechiness','acousticness','instrumentalness',
'liveness','valence','tempo','time_signature']
matrix=df_tracks[x_list].corr() #returns a matrix with the pairwise correlation of these features
fig_heatmap = go.Figure(data=go.Heatmap(
z=matrix,
x=x_list,
y=x_list,
hoverongaps = False))
fig_heatmap.update_layout(margin = dict(t=200,r=200,b=200,l=200),
width = 800, height = 650,
autosize = False )
fig_heatmap.show()
Here, the color scale next to the graph shows how the color gets lighter as the correlation increases. We observe that there is no strong positive correlation between popularity and any single feature of a song. The most positive correlations with popularity are those of danceability, loudness, and energy.
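A heatmap gives the overall picture, but for one target column it can be quicker to read the numbers directly. The sketch below, on made-up values, pulls the 'popularity' column out of a correlation matrix and ranks the other features by it. Selecting numeric columns with select_dtypes() is one way to keep text columns out of the calculation.

```python
import pandas as pd

# Toy numeric data; values are made up for illustration.
toy = pd.DataFrame({
    "popularity": [10, 20, 30, 40, 50],
    "loudness": [-12.0, -10.0, -9.0, -7.0, -5.0],
    "instrumentalness": [0.9, 0.7, 0.6, 0.2, 0.1],
    "name": ["a", "b", "c", "d", "e"],  # text column, excluded below
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
matrix = toy.select_dtypes(include="number").corr()

# Correlation of every other feature with popularity, strongest positive first.
ranked = matrix["popularity"].drop("popularity").sort_values(ascending=False)

print(ranked)
```

The same two lines applied to the real correlation matrix would list which track features move most closely with popularity.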
Let's just quickly check one more question.
- How long does a song last on average today? Has it always been like that?
Before we visualize our result, it's important to change the unit of duration from milliseconds to seconds.
df_tracks['duration']=df_tracks['duration_ms']//1000 #Floor division to get whole seconds
df_tracks=df_tracks.drop(['duration_ms'], axis = 1) #drop() returns a new dataframe, so assign it back
We observe that the average duration of a song increased until around 1969 and has remained more or less the same ever since. More information can be observed in the graph.
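The aggregation behind such a duration-over-time chart can be sketched as follows. The rows are made up, and expressing the duration in minutes is an assumption made here for readability (dividing by 1000, as in the snippet above, yields seconds instead).

```python
import pandas as pd

# Toy data; years and durations are illustrative only.
toy = pd.DataFrame({
    "year": [1960, 1960, 1990, 1990, 2020],
    "duration_ms": [150_000, 170_000, 240_000, 260_000, 200_000],
})

# Milliseconds to minutes (60,000 ms per minute), kept as a float.
toy["duration_min"] = toy["duration_ms"] / 60_000

# Average track length for each release year.
avg_by_year = toy.groupby("year")["duration_min"].mean().reset_index()

print(avg_by_year)
```

A frame shaped like avg_by_year can be passed to a line chart, for example px.line(avg_by_year, x="year", y="duration_min") in Plotly Express, to see how the average changes over time.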
Most popular artists
To check the most popular artists we'll use the artists' dataset.
We will start by importing the dataset into our python environment.
df_artists = pd.read_csv('/content/drive/MyDrive/artists.csv')
df_artists.head()
| | id | followers | genres | name | popularity |
---|---|---|---|---|---|
0 | 0DheY5irMjBUeLybbCUEZ2 | 0.0 | [] | Armid & Amir Zare Pashai feat. Sara Rouzbehani | 0.0 |
1 | 0DlhY15l3wsrnlfGio2bjU | 5.0 | [] | ปูนา ภาวิณี | 0.0 |
2 | 0DmRESX2JknGPQyO15yxg7 | 0.0 | [] | Sadaa | 0.0 |
3 | 0DmhnbHjm1qw6NCYPeZNgJ | 0.0 | [] | Tra'gruda | 0.0 |
4 | 0Dn11fWM7vHQ3rinvWEl4E | 2.0 | [] | Ioannis Panoutsopoulos | 0.0 |
Our dataset has 1104349 rows and 5 columns.
- Analyzing the Artists:-
artists_popular = df_artists.sort_values(by=['popularity'], ascending=False).reset_index()
artists_popular[:10]
| | index | id | followers | genres | name | popularity |
---|---|---|---|---|---|---|
0 | 144481 | 1uNFoZAHBGtllmzznpCI3s | 44606973.0 | ['canadian pop', 'pop', 'post-teen pop'] | Justin Bieber | 100 |
1 | 115489 | 4q3ewBCX7sLwd24euuV69X | 32244734.0 | ['latin', 'reggaeton', 'trap latino'] | Bad Bunny | 98 |
2 | 126338 | 06HL4z0CvFAxyc27GXpf02 | 38869193.0 | ['pop', 'post-teen pop'] | Taylor Swift | 98 |
3 | 313676 | 3TVXtAsR1Inumwj472S9r4 | 54416812.0 | ['canadian hip hop', 'canadian pop', 'hip hop', 'pop rap', 'rap', 'toronto rap'] | Drake | 98 |
4 | 144484 | 3Nrfpe0tUJi4K4DXYWgMUX | 31623813.0 | ['k-pop', 'k-pop boy group'] | BTS | 96 |
5 | 115490 | 4MCBfE4596Uoi2O4DtmEMz | 16996777.0 | ['chicago rap', 'melodic rap'] | Juice WRLD | 96 |
6 | 144483 | 1Xyo4u8uXC1ZmMpatF05PJ | 31308207.0 | ['canadian contemporary r&b', 'canadian pop', 'pop'] | The Weeknd | 96 |
7 | 144485 | 66CXWjxzNUsdJxJ2JdwvnR | 61301006.0 | ['pop', 'post-teen pop'] | Ariana Grande | 95 |
8 | 144486 | 1vyhD5VmyZ7KMfW5gqLgo5 | 27286822.0 | ['latin', 'reggaeton', 'reggaeton colombiano', 'trap latino'] | J Balvin | 95 |
9 | 115491 | 7iK8PXO48WeuP03g8YR51W | 5001808.0 | ['trap latino'] | Myke Towers | 95 |
We notice that the top ten songs and the top ten artists differ. Justin Bieber leads both lists, while Bad Bunny, Taylor Swift, and Drake are clearly listeners' favorites too, even though none of their songs appear among the ten most popular tracks above.
- Analyzing the Genres:-
Looking at the head of the dataframe, we also observe that for many rows the column 'genres' is an empty list. These can be treated as NA values as well. To handle this type of situation, it's important to see what proportion of the dataset such rows make up.
df_artists[df_artists["genres"]=='[]']
We see that there are 59742 rows with empty lists out of the total 89336 rows of the dataset.
For this particular case, we will create and perform our analysis on a new dataframe with only those rows that contain some value under the column 'genres'.
df_genre=df_artists[df_artists["genres"]!='[]']
df_genre.head()
| | id | followers | genres | name | popularity |
---|---|---|---|---|---|
45 | 0VLMVnVbJyJ4oyZs2L3Yl2 | 71.0 | ['carnaval cadiz'] | Las Viudas De Los Bisabuelos | 6 |
46 | 0dt23bs4w8zx154C5xdVyl | 63.0 | ['carnaval cadiz'] | Los De Capuchinos | 5 |
47 | 0pGhoB99qpEJEsBQxgaskQ | 64.0 | ['carnaval cadiz'] | Los “Pofesionales” | 7 |
48 | 3HDrX2OtSuXLW5dLR85uN3 | 53.0 | ['carnaval cadiz'] | Los Que No Paran De Rajar | 6 |
136 | 22mLrN5fkppmuUPsHx6i2G | 59.0 | ['classical harp', 'harp'] | Vera Dulova | 3 |
We have successfully segregated the rows we require for our analysis.
We observe that each value in the 'genres' column looks like a list. Let's split these lists into individual values. For this task, we will use a very handy pandas function called explode().
- The explode() function splits a list into its elements and creates a new row for each element.
df_sort_genres=pd.DataFrame(df_genre.assign(genres=df_genre.genres.str.split(",")).explode('genres'))
df_sort_genres.tail()
| | id | followers | genres | name | popularity |
---|---|---|---|---|---|
1104328 | 1q9C5XlekzXbRLIuLCDTre | 90087.0 | 'teen pop'] | Brent Rivera | 33 |
1104331 | 4fh2BIKYPFvXFsQLhaeVJp | 309.0 | ['la indie'] | Lone Kodiak | 20 |
1104334 | 7akMsd2vb4xowNTehv3gsY | 774.0 | ['indie rockism'] | The Str!ke | 0 |
1104336 | 35m7AJrUCtHYHyIUhCzmgi | 205.0 | ['indie rockism'] | Hunter Fraser | 6 |
1104345 | 1ljurfXKPlGncNdW3J8zJ8 | 2123.0 | ['deep acoustic pop'] | Right the Stars | 18 |
Replacing the square brackets with blank spaces to get only the keywords.
df_sort_genres['genres']=df_sort_genres.genres.str.replace('[',' ', regex=False) #regex=False treats '[' as a literal character
df_sort_genres['genres']=df_sort_genres.genres.str.replace(']',' ', regex=False)
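Splitting on commas and swapping brackets for spaces leaves quotes and stray spaces around the genre names, so the same genre may show up under slightly different spellings. As an alternative sketch, assuming the genres column stores Python-style list literals as the sample rows suggest, the text can be parsed into real lists with ast.literal_eval before calling explode(). The toy rows are made up for illustration.

```python
import ast
import pandas as pd

# Toy data mimicking the text format of the 'genres' column.
toy = pd.DataFrame({
    "name": ["Artist A", "Artist B", "Artist C"],
    "genres": ["['pop', 'dance pop']", "['pop']", "[]"],
})

# Turn the text "['pop', 'dance pop']" into a real Python list.
toy["genres"] = toy["genres"].apply(ast.literal_eval)

# One row per (artist, genre); an empty list becomes NaN and is dropped.
exploded = toy.explode("genres").dropna(subset=["genres"])

# Clean genre names, ready for counting.
counts = exploded["genres"].value_counts()

print(counts)
```

With this approach the empty-list rows do not need to be filtered out beforehand, and the genre names come out without quotes or surrounding spaces.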
Let's analyze the top 30 genres now.
# get top 30 most common genres
n = 30
# name both columns explicitly; the default names produced by reset_index() differ between pandas versions
top_30=df_sort_genres['genres'].value_counts()[:n].rename_axis('Genres').reset_index(name='Total_Count')
top_30
| | Genres | Total_Count |
---|---|---|
0 | 'dance pop' | 551 |
1 | 'latin' | 483 |
2 | 'electro house' | 478 |
3 | 'pop' | 461 |
4 | 'edm' | 455 |
5 | 'hip hop' | 455 |
6 | 'electropop' | 432 |
7 | 'indie rock' | 411 |
8 | 'classical performance' | 407 |
9 | 'tropical' | 402 |
10 | 'latin rock' | 401 |
11 | 'french hip hop' | 400 |
12 | 'lo-fi beats' | 393 |
13 | 'urban contemporary' | 386 |
14 | 'rap' | 366 |
15 | 'pop rap' | 365 |
16 | 'funk' | 365 |
17 | 'modern rock' | 361 |
18 | 'indie folk' | 353 |
19 | 'adult standards' | 349 |
20 | 'pop dance' | 347 |
21 | 'country rock' | 346 |
22 | 'uk hip hop' | 343 |
23 | 'corrido' | 339 |
24 | 'stomp and holler' | 338 |
25 | 'art rock' | 336 |
26 | 'alternative rock' | 333 |
27 | 'alternative metal' | 328 |
28 | 'indie pop' | 325 |
29 | 'alternative r&b' | 325 |
Should we visualize this output using a pie chart? Or wait, let's make it more interesting and visualize it using a Donut Chart.
We observe that the top 30 genres have more or less similar counts, with 'dance pop' at the top. More information can be seen in the graph above.
What we learned
- For tracks.csv file:-
  - We found the first 10 most popular songs released in 2020 & 2021 using the query() function of pandas.
  - We found the titles of the most popular songs of March 2020.
  - We saw how different features are correlated with each other using the corr() and the heatmap functions of pandas and plotly respectively.
  - We learned to visualize our result using Bubble Charts.
- For artists.csv:-
  - We found the top 10 artists.
  - We learned the use of the explode() function and found the top 30 genres.
  - We visualized our result using a Donut Chart.
Now you know what to do to be a successful producer and use data and analytics to your advantage to pick the next big hit.
Go to https://datascience.fm/tag/pandas/ for more tutorials where we take popular datasets and analyze them with Pandas.