Analyzing the Spotify dataset to gain insights in the music industry

Build a strong foundation in the pandas library by working on the Spotify dataset. We will cover some basic tools that pandas provides to help gain insights into any dataset, here in the music domain.

In the past two posts in our Pandas series, we analyzed data from the Chipotle restaurant chain and the Flipkart online store. Today, we're going to look at the Spotify dataset from the perspective of a recording studio start-up.

Imagine wanting to get started in the music industry. You believe you have a knack for spotting great talent and turning it into stars. Yet instinct alone is not enough: you also need to know the facts.

You want to know:

  1. What tracks are most popular among Spotify users?
  2. How many tracks have a popularity over 90 out of 100?
  3. Which tracks released in March 2020 reached a popularity over 80?
  4. Is there a correlation between popularity and a track's traits?
  5. How long should an average track last by today's standards?
  6. How do a track's different features correlate with each other?
  7. Which artists are currently the most popular, and what genres do they represent?

Data to download: https://www.kaggle.com/lehaknarnauli/spotify-datasets

For a detailed explanation of the dataset: https://developer.spotify.com/documentation/web-api/reference/#category-tracks

Pre-processing the data

We'll start with just the pandas library.

import pandas as pd

The Spotify dataset is quite large and is split across several files containing slightly different data. Today we'll use the tracks and artists datasets, starting with tracks.

# Loading the dataset
df_tracks = pd.read_csv('/content/drive/MyDrive/tracks.csv')
df_tracks

You'll see that this dataset consists of 122860 rows and 20 columns. To judge whether we can trust this dataset, it's important to check whether any values are missing.

pd.isnull(df_tracks).sum().sum()
71

pd.isnull() returns a DataFrame of booleans: True where a value is missing and False otherwise. Calling sum() twice on it gives the total number of missing values in the dataset. Calling it only once would give the number of missing values in each column.
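To make the difference between one and two sum() calls concrete, here is a minimal sketch on a tiny made-up DataFrame. The frame `toy` and its values are hypothetical and only stand in for df_tracks.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_tracks
toy = pd.DataFrame({
    "name": ["a", None, "c"],
    "tempo": [120.0, np.nan, np.nan],
    "popularity": [10, 20, 30],
})

per_column = toy.isnull().sum()    # Series: one missing-value count per column
total = toy.isnull().sum().sum()   # scalar: missing values in the whole frame

print(per_column)
print(total)
```

The per-column view is usually the more useful one in practice, because it shows which columns the gaps are concentrated in.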

Only 71 values are missing, a negligible share of a dataset this size, so the gaps should not distort our conclusions. Such a dataset is a Pythonista's delight!

To make things easier, we should change the release_date to a date type and then put months and years into separate columns. We did a similar operation last time, so you might already be familiar with this method. Let's practice.

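# Convert release_date to a datetime type first so that .dt.year and .dt.month are available
# (if the column mixes date formats, pandas 2.x may need format='mixed' here)
df_tracks['release_date'] = pd.to_datetime(df_tracks['release_date'])
year = df_tracks['release_date'].dt.year
month = df_tracks['release_date'].dt.month
df_tracks.insert(loc=8, column='year', value=year)
df_tracks.insert(loc=9, column='month', value=month)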

This time, instead of creating a new column by assigning to df_tracks['year'], we used the insert() method. It lets us choose the exact position of the new column through the loc argument. If we had done it the old-fashioned way, the new columns would have been appended at the end of the dataframe.
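The difference in column placement can be seen in a small sketch. The frame `toy` below is hypothetical and much smaller than df_tracks; only the column positions matter here.

```python
import pandas as pd

# Hypothetical toy frame; the real df_tracks has many more columns
toy = pd.DataFrame({
    "name": ["x", "y"],
    "release_date": ["2020-03-20", "2021-01-08"],
    "tempo": [118.0, 143.9],
})
toy["release_date"] = pd.to_datetime(toy["release_date"])

# insert() places the new column at an explicit position (here index 2)
toy.insert(loc=2, column="year", value=toy["release_date"].dt.year)

# plain assignment appends the new column at the end
toy["month"] = toy["release_date"].dt.month

print(list(toy.columns))
```

Here 'year' lands right after 'release_date', while 'month' ends up as the last column.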

Exploring the dataset

The dataset we're using right now consists of 20 columns. It includes information such as name, popularity, duration, explicit, artist, and release_date, as well as track traits like danceability, speechiness, loudness, etc.

Popularity is measured on a scale between 0 and 100, where 100 is the best. Let's check whether our intuitions about the music industry hold up.

  • What are the most popular songs right now?

To check this, let's use a great pandas function, query(). It is a filtering function that selects the rows of a DataFrame for which a boolean expression is true.

It is important to note that DataFrame.query() refers to columns by name, so names containing spaces cannot be used directly. Before applying the method, either replace the spaces in column names with '_' or wrap such names in backticks inside the expression.
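As a small illustration of the note on column names, the sketch below runs query() on a hypothetical frame with a column called 'track name'. Backticks quote the name with a space, and the @ prefix refers to a Python variable; the frame `toy` and the variable `threshold` are made up for this example.

```python
import pandas as pd

# Hypothetical toy frame with a column name that contains a space
toy = pd.DataFrame({
    "track name": ["a", "b", "c"],
    "popularity": [95, 70, 91],
    "year": [2021, 2020, 2020],
})

threshold = 90  # hypothetical cutoff, referenced inside the expression with @

# Backticks quote a column name containing a space
hits = toy.query("popularity > @threshold and `track name` != 'a'")

# Several conditions can be chained with `and` / `or`
hits_2020 = toy.query("popularity > @threshold and year == 2020")

print(hits)
print(hits_2020)
```

Keeping thresholds in variables rather than hard-coding them in the string makes the same expression easy to reuse with different cutoffs.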

Let's learn how this function is used.

most_popular = df_tracks.query('popularity>90', inplace=False).sort_values('popularity', ascending=False)
most_popular[:10]

It's important to note that the whole function's expression is passed in quotation marks.

id name popularity duration_ms explicit artists id_artists release_date danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature year month
93802 4iJyoBOLtHqaGxP12qzhQI Peaches (feat. Daniel Caesar & Giveon) 100 198082 1 ['Justin Bieber', 'Daniel Caesar', 'Giveon'] ['1uNFoZAHBGtllmzznpCI3s', '20wkVLutqVOYrc0kxF... 2021-03-19 0.677 0.696 0 -6.181 1 0.1190 0.32100 0.000000 0.4200 0.464 90.030 4 2021 3
93803 7lPN2DXiMsVn7XUKtOW1CS drivers license 99 242014 1 ['Olivia Rodrigo'] ['1McMsnEElThX1knmY4oliG'] 2021-01-08 0.585 0.436 10 -8.761 1 0.0601 0.72100 0.000013 0.1050 0.132 143.874 4 2021 1
93804 3Ofmpyhv5UAQ70mENzB277 Astronaut In The Ocean 98 132780 0 ['Masked Wolf'] ['1uU7g3DNSbsu0QjSEqZtEd'] 2021-01-06 0.778 0.695 4 -6.865 0 0.0913 0.17500 0.000000 0.1500 0.472 149.996 4 2021 1
92810 5QO79kh1waicV47BqGRL3g Save Your Tears 97 215627 1 ['The Weeknd'] ['1Xyo4u8uXC1ZmMpatF05PJ'] 2020-03-20 0.680 0.826 0 -5.487 1 0.0309 0.02120 0.000012 0.5430 0.644 118.051 4 2020 3
92811 6tDDoYIxWvMLTdKpjFkc1B telepatía 97 160191 0 ['Kali Uchis'] ['1U1el3k54VvEUzo3ybLPlM'] 2020-12-04 0.653 0.524 11 -9.016 0 0.0502 0.11200 0.000000 0.2030 0.553 83.970 4 2020 12
92813 0VjIjW4GlUZAMYd2vXMi3b Blinding Lights 96 200040 0 ['The Weeknd'] ['1Xyo4u8uXC1ZmMpatF05PJ'] 2020-03-20 0.514 0.730 1 -5.934 1 0.0598 0.00146 0.000095 0.0897 0.334 171.005 4 2020 3
93805 7MAibcTli4IisCtbHKrGMh Leave The Door Open 96 242096 0 ['Bruno Mars', 'Anderson .Paak', 'Silk Sonic'] ['0du5cEVh5yTK9QJze8zA0C', '3jK9MiCrA42lLAdMGU... 2021-03-05 0.586 0.616 5 -7.964 1 0.0324 0.18200 0.000000 0.0927 0.719 148.088 4 2021 3
92814 6f3Slt0GbA2bPZlz0aIFXN The Business 95 164000 0 ['Tiësto'] ['2o5jDhtHVPhrJdv3cEQ99Z'] 2020-09-16 0.798 0.620 8 -7.079 0 0.2320 0.41400 0.019200 0.1120 0.235 120.031 4 2020 9
91866 60ynsPSSKe6O3sfwRnIBRf Streets 94 226987 1 ['Doja Cat'] ['5cj0lLjcoR7YOSnhnX0Po5'] 2019-11-07 0.749 0.463 11 -8.433 1 0.0828 0.20800 0.037100 0.3370 0.190 90.028 4 2019 11
92816 3FAJ6O0NOHQV8Mc5Ri6ENp Heartbreak Anniversary 94 198371 0 ['Giveon'] ['4fxd5Ee7UefO4CUXgwJ7IP'] 2020-03-27 0.449 0.465 0 -8.964 1 0.0791 0.52400 0.000001 0.3030 0.543 89.087 3 2020 3

At first glance, we can see that all but one of the 10 most popular songs were released in 2020 or 2021, and that almost half of them contain explicit content, indicated by the binary 1 in the explicit column.

To see if our conclusions are right, let's sort the filtered values and show the columns of interest.

pop_date = most_popular.sort_values('release_date', ascending=False)
pop_date[['name', 'popularity', 'explicit','release_date']][:20]
name popularity explicit release_date
93802 Peaches (feat. Daniel Caesar & Giveon) 100 1 2021-03-19
93805 Leave The Door Open 96 0 2021-03-05
93815 What’s Next 91 1 2021-03-05
93811 Hold On 92 0 2021-03-05
93816 We're Good 91 0 2021-02-11
93813 911 91 1 2021-02-05
93809 Up 92 1 2021-02-05
93806 Fiel 94 0 2021-02-04
93808 Ella No Es Tuya - Remix 92 0 2021-02-03
93812 Wellerman - Sea Shanty / 220 KID x Billen Ted ... 92 0 2021-01-21
93810 Goosebumps - Remix 92 1 2021-01-15
93814 Your Love (9PM) 91 0 2021-01-15
93807 Friday (feat. Mufasa & Hypeman) - Dopamine Re-... 94 0 2021-01-15
93803 drivers license 99 1 2021-01-08
93804 Astronaut In The Ocean 98 0 2021-01-06
92823 Good Days 93 1 2020-12-25
92819 Bandido 94 0 2020-12-10
92811 telepatía 97 0 2020-12-04
92821 LA NOCHE DE ANOCHE 93 0 2020-11-27
92830 Dynamite 91 0 2020-11-20

We know which songs are the most popular in general, but as a good producer, you need to understand human emotions and how they shape the market.

In times of crisis, the tastes of both artists and audiences shift. In March 2020, much of the world went into lockdown. It's natural to wonder what some of the top songs released back then were.

We're going to use query() again, but this time we go one step further and combine several conditions.

The conditions are as follows:

  1. Songs with popularity greater than 80.
  2. Songs released in March 2020.

Let's see how we'll approach this problem.

most_popular_march_20 = df_tracks.query('(popularity > 80) and (year == 2020) and (month == 3)')
most_popular_march_20
id name popularity duration_ms explicit artists id_artists release_date year month danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature
92810 5QO79kh1waicV47BqGRL3g Save Your Tears 97 215627 1 ['The Weeknd'] ['1Xyo4u8uXC1ZmMpatF05PJ'] 2020-03-20 2020 3 0.680 0.826 0 -5.487 1 0.0309 0.02120 0.000012 0.5430 0.644 118.051 4
92813 0VjIjW4GlUZAMYd2vXMi3b Blinding Lights 96 200040 0 ['The Weeknd'] ['1Xyo4u8uXC1ZmMpatF05PJ'] 2020-03-20 2020 3 0.514 0.730 1 -5.934 1 0.0598 0.00146 0.000095 0.0897 0.334 171.005 4
92816 3FAJ6O0NOHQV8Mc5Ri6ENp Heartbreak Anniversary 94 198371 0 ['Giveon'] ['4fxd5Ee7UefO4CUXgwJ7IP'] 2020-03-27 2020 3 0.449 0.465 0 -8.964 1 0.0791 0.52400 0.000001 0.3030 0.543 89.087 3
92853 4xqrdfXkTW4T0RauPLv3WA Heather 89 198040 0 ['Conan Gray'] ['4Uc8Dsxct0oMqx0P6i60ea'] 2020-03-20 2020 3 0.357 0.425 5 -7.301 1 0.0333 0.58400 0.000000 0.3220 0.270 102.078 3
92867 5nujrmhLynf4yMoMtj8AQF Levitating (feat. DaBaby) 89 203064 0 ['Dua Lipa', 'DaBaby'] ['6M2wZ9GZgrQXHCFfjv46we', '4r63FhuTkUYltbVAg5... 2020-03-27 2020 3 0.702 0.825 6 -3.787 0 0.0601 0.00883 0.000000 0.0674 0.915 102.977 4
92927 7szuecWAPwGoV1e5vGu8tl In Your Eyes 86 237520 1 ['The Weeknd'] ['1Xyo4u8uXC1ZmMpatF05PJ'] 2020-03-20 2020 3 0.667 0.719 7 -5.371 0 0.0346 0.00285 0.000081 0.0736 0.717 100.021 4
92951 6KfoDhO4XUWSbnyKjNp9c4 Maniac 86 185773 0 ['Conan Gray'] ['4Uc8Dsxct0oMqx0P6i60ea'] 2020-03-20 2020 3 0.628 0.639 8 -5.460 1 0.0435 0.00162 0.000000 0.3540 0.493 108.045 4
92961 3PfIrDoz19wz7qK7tYeu62 Don't Start Now 85 183290 0 ['Dua Lipa'] ['6M2wZ9GZgrQXHCFfjv46we'] 2020-03-27 2020 3 0.793 0.793 11 -4.521 0 0.0830 0.01230 0.000000 0.0951 0.679 123.950 4
92995 5m5aY6S9ttfIG157xli2Rs Alô Ambev (Segue Sua Vida) - Ao Vivo 84 169593 0 ['Zé Neto & Cristiano'] ['487N2T9nIPEHrlTZLL3SQs'] 2020-03-26 2020 3 0.695 0.872 9 -3.650 1 0.0868 0.33400 0.000000 0.9540 0.646 121.843 4
93021 527k23H0A4Q0UJN3vGs0Da After Party 84 167916 1 ['Don Toliver'] ['4Gso3d4CscCijv0lmajZWs'] 2020-03-13 2020 3 0.629 0.692 5 -8.045 1 0.0376 0.00981 0.331000 0.6030 0.453 162.948 4
93025 017PF4Q3l4DBUiWoXk4OWT Break My Heart 84 221820 0 ['Dua Lipa'] ['6M2wZ9GZgrQXHCFfjv46we'] 2020-03-27 2020 3 0.730 0.729 4 -3.434 0 0.0883 0.16700 0.000001 0.3490 0.467 113.013 4
93071 1jaTQ3nqY3oAAYyCTbIvnM WHATS POPPIN 83 139741 1 ['Jack Harlow'] ['2LIk90788K0zvyj2JJVwkJ'] 2020-03-13 2020 3 0.923 0.604 11 -6.671 0 0.2450 0.01700 0.000000 0.2720 0.826 145.062 4
93079 7AzlLxHn24DxjgQX73F9fU No Idea 83 154424 0 ['Don Toliver'] ['4Gso3d4CscCijv0lmajZWs'] 2020-03-13 2020 3 0.652 0.631 6 -5.718 0 0.0893 0.52400 0.000579 0.1650 0.350 127.998 4
93139 39LLxExYz6ewLAcYrzQQyP Levitating 82 203808 0 ['Dua Lipa'] ['6M2wZ9GZgrQXHCFfjv46we'] 2020-03-27 2020 3 0.695 0.884 6 -2.278 0 0.0753 0.05610 0.000000 0.2130 0.914 103.014 4
93149 4lsHZ92XCFOQfzJFBTluk8 You Got It 82 203145 1 ['Vedo'] ['3wVXTWabe3viT0jF7DfjOL'] 2020-03-27 2020 3 0.762 0.433 5 -8.937 1 0.1870 0.14300 0.000000 0.1180 0.394 122.074 4
93187 2lCkncy6bIB0LTMT7kvrD1 Azul 81 205933 0 ['J Balvin'] ['1vyhD5VmyZ7KMfW5gqLgo5'] 2020-03-19 2020 3 0.843 0.836 11 -2.474 0 0.0695 0.08160 0.001380 0.0532 0.650 94.018 4
93191 6qBFSepqLCuh5tehehc1bd Like I Want You 81 260776 0 ['Giveon'] ['4fxd5Ee7UefO4CUXgwJ7IP'] 2020-03-27 2020 3 0.678 0.355 10 -7.757 0 0.0627 0.75900 0.000071 0.1140 0.438 119.772 3
93221 6bnF93Rx87YqUBLSgjiMU8 Heartless 81 198267 1 ['The Weeknd'] ['1Xyo4u8uXC1ZmMpatF05PJ'] 2020-03-20 2020 3 0.537 0.746 10 -5.507 0 0.1500 0.02360 0.000001 0.1560 0.252 170.062 4

The song titles seem to mirror the world's mood back then: 'Save Your Tears', 'Heartbreak Anniversary', 'Maniac', 'Break My Heart', and more.

Features and Popularity

Let's move on to our next problem. We know that different features of a song can affect its popularity in different ways, but we want to dig deeper and see how. This is one of the most important questions we should ask.

  • How do different features of a song impact its popularity?

Our intuition is that large audiences like songs they can dance to. Let's see if we can back this up with data.

df_1 = df_tracks.groupby('popularity')['danceability'].mean().sort_values(ascending=False).reset_index()
df_1.head()
popularity danceability
0 95 0.798000
1 98 0.778000
2 91 0.751091
3 88 0.727105
4 85 0.712600

We have created a new dataframe, df_1. It holds the mean danceability score of all songs at each popularity score, sorted from the highest mean danceability to the lowest.

This makes it easier for us to analyze the relationship between these two features.

We will use the basics of Plotly, a plotting library for Python. It is an interactive, open-source plotting library that supports various charts.

It's important to import the library before moving ahead.

import plotly.express as px   #importing plotly
fig2 = px.scatter(df_tracks, x="popularity", y="danceability", color="danceability",size='popularity')
fig2.show()

Bingo! We have plotted a scatter plot. Since the area of each circle corresponds to the popularity score, we can also call this chart a Bubble Chart: the higher the popularity score, the larger the bubble, and vice versa.

The graph is interactive, and with one quick look at it we realize that 'popularity' and 'danceability' are positively correlated: as the popularity of a song increases, its danceability score tends to increase as well.

We don't always need to plot a graph to check the correlation between two features. The same can be achieved with the few lines of code given below.

We will use the scipy.stats module for this. It contains a large number of probability distributions as well as a growing library of statistical functions.

Next, we import the pearsonr function from this module, which calculates Pearson's correlation coefficient 'r' for two features.

Let's learn how.

from scipy.stats import pearsonr    #importing the library
data1 = df_1['popularity']
data2 = df_1['danceability']

# calculate Pearson's correlation
corr, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % corr)
Pearsons correlation: 0.888

As a quick revision, Pearson's correlation coefficient 'r' ranges from -1 to 1 and is read as follows:

  1. r > 0 implies a positive correlation
  2. r = 0 implies no linear correlation
  3. r < 0 implies a negative correlation

Since the correlation coefficient here is r = 0.888 (> 0), the two features are positively correlated: an increase in one tends to go along with an increase in the other. Keep in mind that df_1 holds one averaged danceability value per popularity score, so this coefficient describes those averages and is typically higher than the correlation computed over individual tracks.
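To show what the coefficient actually measures, here is a minimal sketch that computes Pearson's r by hand and compares it with the pandas Series.corr() method, which uses Pearson by default. The two series `x` and `y` are made-up numbers, not values from the Spotify data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a nearly linear relationship
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.1, 5.9, 8.2, 9.8])

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity (method='pearson' is the default)
r_pandas = x.corr(y)

print(round(r_manual, 3), round(r_pandas, 3))
```

Both routes should agree up to floating-point rounding, so scipy's pearsonr is mainly convenient when the p-value is needed as well.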

Let's visualize the same for some other features as well.

How about "instrumentalness"? A value of 1 indicates that the track contains no vocals at all, and the lower the value, the more likely the song is to contain vocals.

We follow a similar procedure and plot a Bubble Chart for popularity versus instrumentalness.

Looking at the graph, it is easily noticeable that the two features are negatively correlated, which implies that, an increase in one leads to a decrease in another and vice versa.

So, do we need to check the correlation for each feature one by one? Not at all.

Thanks to pandas and Plotly, we can check the correlation between every pair of features at once. How? Let's find out.

We will combine the corr() function in pandas with a heatmap in Plotly. What exactly do these two do?

  1. corr() - This pandas function computes the pairwise correlation between the columns of the dataset (excluding NA/null values).
  2. Heatmap - This Plotly chart shows the magnitude of a value as colour in two dimensions.
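Before applying corr() to the full dataset, it can help to see its output on a few rows. The sketch below uses a hypothetical frame `toy` and keeps only the numeric columns, since text columns such as the track name have no correlation to compute.

```python
import pandas as pd

# Hypothetical toy frame; the values are chosen so the relationships are exactly linear
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "popularity": [10, 20, 30, 40],
    "energy": [0.2, 0.4, 0.6, 0.8],
    "acousticness": [0.9, 0.7, 0.5, 0.3],
})

# Keep numeric columns only, then compute the pairwise Pearson correlations
matrix = toy.select_dtypes("number").corr()

print(matrix.round(2))
```

The result is a square, symmetric table with 1.0 on the diagonal, which is the shape a heatmap expects.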

To see how the combination of these two can help us achieve the desired result, we first need to import another Plotly module called graph_objects.

import plotly.graph_objects as go    # importing the library

# Numeric features to compare; the same list labels both heatmap axes
x_list=['popularity','duration_ms','explicit',
        'danceability','energy','key','loudness',
        'mode','speechiness','acousticness','instrumentalness',
        'liveness','valence','tempo','time_signature']

# Restrict to x_list so the matrix rows and columns line up with the axis labels
matrix=df_tracks[x_list].corr()  # returns a matrix with the correlation of the listed features

fig_heatmap = go.Figure(data=go.Heatmap(
                   z=matrix.values,
                   x=x_list,
                   y=x_list,
                   hoverongaps = False))
fig_heatmap.update_layout(margin = dict(t=200,r=200,b=200,l=200),
    width = 800, height = 650,
    autosize = False )

fig_heatmap.show()

Here, the color scale of the graph shows how the color gets lighter as the correlation increases. We observe that there is no strong positive correlation between popularity and any single feature of a song. The most positive correlations with popularity are those of danceability, loudness, and energy.

Let's just quickly check one more question.

  • How long does a song last on average today? Has it always been like that?

Before we visualize our result, let's convert the duration from milliseconds to seconds.

df_tracks['duration'] = df_tracks['duration_ms'] // 1000  # floor division keeps whole seconds
df_tracks = df_tracks.drop(['duration_ms'], axis=1)  # drop() returns a new dataframe, so reassign it

We observe that the average song duration rose until around 1969 and has remained more or less the same ever since. The graph above shows the trend in more detail.
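The code behind the duration chart is not shown in this post. One way the per-year average could be computed is sketched below; this is an assumption about the approach, and the frame `toy` with its values is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with the two columns needed for the question
toy = pd.DataFrame({
    "year": [1969, 1969, 2020, 2020],
    "duration_ms": [150000, 210000, 200000, 220000],
})

# Convert milliseconds to whole seconds
toy["duration"] = toy["duration_ms"] // 1000

# Average duration of the tracks released in each year
avg_by_year = toy.groupby("year")["duration"].mean()

print(avg_by_year)
```

The resulting Series is indexed by year, so it can be passed straight to a line or bar chart.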

Most popular artists

To check the most popular artists we'll use the artists' dataset.

We will start by importing the dataset into our python environment.

df_artists = pd.read_csv('/content/drive/MyDrive/artists.csv')
df_artists.head()
id followers genres name popularity
0 0DheY5irMjBUeLybbCUEZ2 0.0 [] Armid & Amir Zare Pashai feat. Sara Rouzbehani 0.0
1 0DlhY15l3wsrnlfGio2bjU 5.0 [] ปูนา ภาวิณี 0.0
2 0DmRESX2JknGPQyO15yxg7 0.0 [] Sadaa 0.0
3 0DmhnbHjm1qw6NCYPeZNgJ 0.0 [] Tra'gruda 0.0
4 0Dn11fWM7vHQ3rinvWEl4E 2.0 [] Ioannis Panoutsopoulos 0.0

Our dataset has 1104349 rows and 5 columns.

  • Analyzing the artists:

artists_popular = df_artists.sort_values(by=['popularity'], ascending=False).reset_index()
artists_popular[:10]
index id followers genres name popularity
0 144481 1uNFoZAHBGtllmzznpCI3s 44606973.0 ['canadian pop', 'pop', 'post-teen pop'] Justin Bieber 100
1 115489 4q3ewBCX7sLwd24euuV69X 32244734.0 ['latin', 'reggaeton', 'trap latino'] Bad Bunny 98
2 126338 06HL4z0CvFAxyc27GXpf02 38869193.0 ['pop', 'post-teen pop'] Taylor Swift 98
3 313676 3TVXtAsR1Inumwj472S9r4 54416812.0 ['canadian hip hop', 'canadian pop', 'hip hop', 'pop rap', 'rap', 'toronto rap'] Drake 98
4 144484 3Nrfpe0tUJi4K4DXYWgMUX 31623813.0 ['k-pop', 'k-pop boy group'] BTS 96
5 115490 4MCBfE4596Uoi2O4DtmEMz 16996777.0 ['chicago rap', 'melodic rap'] Juice WRLD 96
6 144483 1Xyo4u8uXC1ZmMpatF05PJ 31308207.0 ['canadian contemporary r&b', 'canadian pop', 'pop'] The Weeknd 96
7 144485 66CXWjxzNUsdJxJ2JdwvnR 61301006.0 ['pop', 'post-teen pop'] Ariana Grande 95
8 144486 1vyhD5VmyZ7KMfW5gqLgo5 27286822.0 ['latin', 'reggaeton', 'reggaeton colombiano', 'trap latino'] J Balvin 95
9 115491 7iK8PXO48WeuP03g8YR51W 5001808.0 ['trap latino'] Myke Towers 95

We notice that the top ten songs and the top ten artists differ. Justin Bieber is the unquestionable king of pop, leading both lists, and The Weeknd also appears in both. Bad Bunny, Taylor Swift, and Drake are the listeners' favorites too, even though none of their songs is in the current top ten.

  • Analyzing the genres:

Looking at the head of the dataframe, we also observe that for many rows the 'genres' column is an empty list, which can be treated as a missing (NA) value. To handle this situation, it's important to see what proportion of the dataset such rows make up.

df_artists[df_artists["genres"]=='[]']

We see that 59742 of the 89336 rows in the dataset have an empty list in 'genres'.
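The count and the share of empty rows can also be computed directly from a boolean mask. This is a minimal sketch on a hypothetical frame `toy`, assuming, as in the filter above, that empty genre lists are stored as the text '[]'.

```python
import pandas as pd

# Hypothetical toy frame; genres are stored as text, as in artists.csv
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "genres": ["[]", "['pop']", "[]", "['rap', 'trap']"],
})

empty = toy["genres"] == "[]"   # boolean mask: True where the genre list is empty
n_empty = int(empty.sum())      # number of rows with an empty list
share_empty = empty.mean()      # fraction of rows with an empty list

print(n_empty, share_empty)
```

Taking the mean of a boolean mask is a compact way to get a proportion without dividing by the row count by hand.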

For this particular case, we will create and perform our analysis on a new dataframe with only those rows that contain some value under the column 'genres'.

df_genre=df_artists[df_artists["genres"]!='[]']
df_genre.head()
id followers genres name popularity
45 0VLMVnVbJyJ4oyZs2L3Yl2 71.0 ['carnaval cadiz'] Las Viudas De Los Bisabuelos 6
46 0dt23bs4w8zx154C5xdVyl 63.0 ['carnaval cadiz'] Los De Capuchinos 5
47 0pGhoB99qpEJEsBQxgaskQ 64.0 ['carnaval cadiz'] Los “Pofesionales” 7
48 3HDrX2OtSuXLW5dLR85uN3 53.0 ['carnaval cadiz'] Los Que No Paran De Rajar 6
136 22mLrN5fkppmuUPsHx6i2G 59.0 ['classical harp', 'harp'] Vera Dulova 3

We have successfully segregated the rows we require for our analysis.

Each value in the 'genres' column is a list stored as text. Let's split these lists into individual values. For this task, we will use a very handy pandas function called explode().

  • explode() splits a list into its elements and creates a new row for each element.

df_sort_genres=pd.DataFrame(df_genre.assign(genres=df_genre.genres.str.split(",")).explode('genres'))
df_sort_genres.tail()
id followers genres name popularity
1104328 1q9C5XlekzXbRLIuLCDTre 90087.0 'teen pop'] Brent Rivera 33
1104331 4fh2BIKYPFvXFsQLhaeVJp 309.0 ['la indie'] Lone Kodiak 20
1104334 7akMsd2vb4xowNTehv3gsY 774.0 ['indie rockism'] The Str!ke 0
1104336 35m7AJrUCtHYHyIUhCzmgi 205.0 ['indie rockism'] Hunter Fraser 6
1104345 1ljurfXKPlGncNdW3J8zJ8 2123.0 ['deep acoustic pop'] Right the Stars 18

Next, we replace the square brackets with blank spaces to keep only the keywords.

df_sort_genres['genres']=df_sort_genres.genres.str.replace('[',' ', regex=False)  # regex=False treats '[' as a literal character
df_sort_genres['genres']=df_sort_genres.genres.str.replace(']',' ', regex=False)
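An alternative to splitting on commas and then stripping brackets is to parse each text value into a real Python list first. The sketch below assumes the genres strings are valid Python list literals, as the sample rows suggest; the frame `toy` is hypothetical, and this is a different route from the one used in this post.

```python
import ast
import pandas as pd

# Hypothetical toy frame; genres are lists stored as text
toy = pd.DataFrame({
    "name": ["a", "b"],
    "genres": ["['pop', 'dance pop']", "['latin']"],
})

# ast.literal_eval safely turns the text "['pop', 'dance pop']" into a Python list
parsed = toy.assign(genres=toy["genres"].apply(ast.literal_eval))

# explode() then creates one row per list element, with no brackets or quotes left over
exploded = parsed.explode("genres")

print(exploded["genres"].tolist())
```

Because the quotes and brackets are gone, the same genre is always spelled the same way, which keeps later counts from being split across variants.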

Let's analyze the top 30 genres now.

# get the top 30 most common genres
n = 30
# rename_axis() and reset_index(name=...) set the column names directly, whatever pandas' default labels are
top_30 = df_sort_genres['genres'].value_counts().head(n).rename_axis('Genres').reset_index(name='Total_Count')
top_30
Genres Total_Count
0 'dance pop' 551
1 'latin' 483
2 'electro house' 478
3 'pop' 461
4 'edm' 455
5 'hip hop' 455
6 'electropop' 432
7 'indie rock' 411
8 'classical performance' 407
9 'tropical' 402
10 'latin rock' 401
11 'french hip hop' 400
12 'lo-fi beats' 393
13 'urban contemporary' 386
14 'rap' 366
15 'pop rap' 365
16 'funk' 365
17 'modern rock' 361
18 'indie folk' 353
19 'adult standards' 349
20 'pop dance' 347
21 'country rock' 346
22 'uk hip hop' 343
23 'corrido' 339
24 'stomp and holler' 338
25 'art rock' 336
26 'alternative rock' 333
27 'alternative metal' 328
28 'indie pop' 325
29 'alternative r&b' 325

Should we visualize this output using a pie chart? Or wait, let's make it more interesting and visualize it using a Donut Chart.

We observe that the top 30 genres have more or less the same count, with 'dance pop' at the top. The graph above shows the distribution in more detail.

What we learned

  • For the tracks.csv file:
  1. We found the 10 most popular songs, almost all released in 2020 and 2021, using the query() function of pandas.
  2. We found the titles of the most popular songs released in March 2020.
  3. We saw how different features are correlated with each other using the corr() function of pandas and a heatmap in Plotly.
  4. We learned to visualize our results using Bubble Charts.
  • For the artists.csv file:
  1. We found the top 10 artists.
  2. We learned how to use the explode() function and found the top 30 genres.
  3. We visualized our results using a Donut Chart.

Now you know what to do to be a successful producer and use data and analytics to your advantage to pick the next big hit.

Go to https://datascience.fm/tag/pandas/ for more tutorials where we take popular datasets and analyze them with Pandas.

Feedback is important to us. Write to us at hello@datascience.fm

Tell us what articles you want to see more of and the kinds of YouTube videos we should create for you.