
Analyzing the Spotify dataset to gain insights in the music industry

Build a strong foundation in the pandas library by working on the Spotify dataset. We will discuss some very basic tools that pandas provides to help gain insights into any dataset in the music domain.

In the past two posts in our Pandas series, we analyzed data from a Chipotle restaurant and the Flipkart online store. Today, we're going to look at the Spotify dataset from the perspective of a recording studio start-up.


Imagine wanting to get started in the music industry. You believe you have a skill for spotting great talent and promoting them to become stars. Yet, apart from your ingenious ability, you need to be familiar with facts.

You want to know:

What tracks are most popular amongst Spotify users?
How many tracks have a popularity score above 90 out of 100?
Which tracks released in March 2020 have a popularity score above 80?
Is there a correlation between popularity and a track's traits?
How long should an average track last according to today's standards?
What is the correlation between tracks' different features?
Who is currently most popular and what genres do they represent?

Data to download: https://www.kaggle.com/lehaknarnauli/spotify-datasets

For a detailed explanation of the dataset: https://developer.spotify.com/documentation/web-api/reference/#category-tracks


Pre-processing the data

Today we'll only start with the Pandas library.

import pandas as pd

The Spotify dataset is quite large, and it is split into several files containing slightly different data. Today we'll use the tracks and artists datasets. We'll start with the tracks dataset.

# Loading the dataset
df_tracks = pd.read_csv('/content/drive/MyDrive/tracks.csv')
df_tracks

You'll see that this dataset consists of 122860 rows and 20 columns. To be sure we can trust this dataset, it's important to check whether any values are missing.

pd.isnull(df_tracks).sum().sum()
71

The pandas function pd.isnull() returns a DataFrame of booleans, True or False, saying whether each value is missing. Calling sum() twice on this gives us the total number of missing values in the dataset. If we only called it once, we'd get the number of missing values for each column.
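To make the difference between one and two sum() calls concrete, here is a minimal sketch on a toy DataFrame. The frame and its values are made up for illustration and are not part of the Spotify data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_tracks (values are made up for illustration)
toy = pd.DataFrame({
    "name": ["a", None, "c"],
    "popularity": [10, 20, np.nan],
    "tempo": [120.0, 98.5, 101.2],
})

per_column = toy.isnull().sum()   # one sum(): missing values per column
total = per_column.sum()          # second sum(): grand total for the whole frame

print(per_column)
print(total)
```

The per-column view is handy when you want to know where the gaps are, not just how many there are.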

There are just 71 missing values in the entire dataset. That's not bad at all, and our conclusions will still be valid. Such a dataset is a Pythonista's delight!

To make things easier, we should change the release_date to a date type and then put months and years into separate columns. We did a similar operation last time, so you might already be familiar with this method. Let's practice.

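# Convert release_date from string to datetime before extracting its parts
# (year-only or otherwise mixed date strings may need extra handling)
df_tracks['release_date'] = pd.to_datetime(df_tracks['release_date'])
year = df_tracks['release_date'].dt.year
month = df_tracks['release_date'].dt.month
df_tracks.insert(loc=8, column='year', value=year)
df_tracks.insert(loc=9, column='month', value=month)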

This time, instead of assigning a new column with df_tracks['year'], we used the insert() method. This method allows us to choose the exact position of the new column (loc). If we had done it the old-fashioned way, the new columns would have been appended at the end of the DataFrame.
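Here is a small sketch of the difference between insert() and plain assignment. The toy DataFrame and its values are assumptions made up for this example.

```python
import pandas as pd

# Toy frame (made-up values) with dates stored as strings
toy = pd.DataFrame({
    "name": ["x", "y"],
    "release_date": ["2020-03-20", "2021-01-08"],
    "tempo": [118.0, 143.9],
})
toy["release_date"] = pd.to_datetime(toy["release_date"])

# insert() places the new column at position loc (0-based)
toy.insert(loc=2, column="year", value=toy["release_date"].dt.year)

# Plain assignment always appends the new column at the end
toy["month"] = toy["release_date"].dt.month

print(list(toy.columns))
```

Note that 'year' lands right after 'release_date', while 'month' ends up last.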

Exploring the dataset

The dataset we're using right now consists of 20 columns. Here we'll find information such as name, popularity, duration, explicit, artist, and release_date, as well as track traits like danceability, speechiness, loudness, etc.

Popularity is measured on a scale between 0 and 100, where 100 is the best. Given our knowledge of the music industry, let's check if what we feel is true.

What are the most popular songs right now?

To check this, let's use a great pandas function, query(). This is a filtering function that selects the rows of a DataFrame that satisfy a boolean expression.

It is important to note that DataFrame.query() can reference a column by its bare name only if the name doesn't contain any spaces. So before applying the method, either replace spaces in column names with '_' or, in pandas 0.25 and later, wrap the name in backticks inside the expression.
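A quick sketch of how column names with spaces can be handled in query(). The column "track name" is a hypothetical example, since the Spotify columns have no spaces. Besides renaming, pandas 0.25 and later also accept backtick-quoted names inside the expression.

```python
import pandas as pd

# Toy frame with a hypothetical column name that contains a space
toy = pd.DataFrame({
    "track name": ["a", "b", "c"],
    "popularity": [95, 40, 91],
})

# Option 1: rename the columns so they contain no spaces
renamed = toy.rename(columns=lambda c: c.replace(" ", "_"))
hits_renamed = renamed.query("popularity > 90 and track_name != 'c'")

# Option 2: keep the name and quote it with backticks inside the expression
hits_backtick = toy.query("popularity > 90 and `track name` != 'c'")

print(hits_renamed)
print(hits_backtick)
```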

Let's learn how this function is used.

most_popular = df_tracks.query('popularity>90', inplace=False).sort_values('popularity', ascending=False)
most_popular[:10]

It's important to note that the whole function's expression is passed in quotation marks.

	id	name	popularity	duration_ms	explicit	artists	id_artists	release_date	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	time_signature	year	month
93802	4iJyoBOLtHqaGxP12qzhQI	Peaches (feat. Daniel Caesar & Giveon)	100	198082	1	['Justin Bieber', 'Daniel Caesar', 'Giveon']	['1uNFoZAHBGtllmzznpCI3s', '20wkVLutqVOYrc0kxF...	2021-03-19	0.677	0.696	0	-6.181	1	0.1190	0.32100	0.000000	0.4200	0.464	90.030	4	2021	3
93803	7lPN2DXiMsVn7XUKtOW1CS	drivers license	99	242014	1	['Olivia Rodrigo']	['1McMsnEElThX1knmY4oliG']	2021-01-08	0.585	0.436	10	-8.761	1	0.0601	0.72100	0.000013	0.1050	0.132	143.874	4	2021	1
93804	3Ofmpyhv5UAQ70mENzB277	Astronaut In The Ocean	98	132780	0	['Masked Wolf']	['1uU7g3DNSbsu0QjSEqZtEd']	2021-01-06	0.778	0.695	4	-6.865	0	0.0913	0.17500	0.000000	0.1500	0.472	149.996	4	2021	1
92810	5QO79kh1waicV47BqGRL3g	Save Your Tears	97	215627	1	['The Weeknd']	['1Xyo4u8uXC1ZmMpatF05PJ']	2020-03-20	0.680	0.826	0	-5.487	1	0.0309	0.02120	0.000012	0.5430	0.644	118.051	4	2020	3
92811	6tDDoYIxWvMLTdKpjFkc1B	telepatía	97	160191	0	['Kali Uchis']	['1U1el3k54VvEUzo3ybLPlM']	2020-12-04	0.653	0.524	11	-9.016	0	0.0502	0.11200	0.000000	0.2030	0.553	83.970	4	2020	12
92813	0VjIjW4GlUZAMYd2vXMi3b	Blinding Lights	96	200040	0	['The Weeknd']	['1Xyo4u8uXC1ZmMpatF05PJ']	2020-03-20	0.514	0.730	1	-5.934	1	0.0598	0.00146	0.000095	0.0897	0.334	171.005	4	2020	3
93805	7MAibcTli4IisCtbHKrGMh	Leave The Door Open	96	242096	0	['Bruno Mars', 'Anderson .Paak', 'Silk Sonic']	['0du5cEVh5yTK9QJze8zA0C', '3jK9MiCrA42lLAdMGU...	2021-03-05	0.586	0.616	5	-7.964	1	0.0324	0.18200	0.000000	0.0927	0.719	148.088	4	2021	3
92814	6f3Slt0GbA2bPZlz0aIFXN	The Business	95	164000	0	['Tiësto']	['2o5jDhtHVPhrJdv3cEQ99Z']	2020-09-16	0.798	0.620	8	-7.079	0	0.2320	0.41400	0.019200	0.1120	0.235	120.031	4	2020	9
91866	60ynsPSSKe6O3sfwRnIBRf	Streets	94	226987	1	['Doja Cat']	['5cj0lLjcoR7YOSnhnX0Po5']	2019-11-07	0.749	0.463	11	-8.433	1	0.0828	0.20800	0.037100	0.3370	0.190	90.028	4	2019	11
92816	3FAJ6O0NOHQV8Mc5Ri6ENp	Heartbreak Anniversary	94	198371	0	['Giveon']	['4fxd5Ee7UefO4CUXgwJ7IP']	2020-03-27	0.449	0.465	0	-8.964	1	0.0791	0.52400	0.000001	0.3030	0.543	89.087	3	2020	3

At first sight, we can see that nearly all of the 10 most popular songs were released in 2020 or 2021 (the exception, 'Streets', is from November 2019) and that four of them contain explicit content, indicated by the binary 1 in the explicit column.

To see if our conclusions are right, let's sort the filtered values and show the columns of interest.

pop_date = most_popular.sort_values('release_date', ascending=False)
pop_date[['name', 'popularity', 'explicit','release_date']][:20]

	name	popularity	explicit	release_date
93802	Peaches (feat. Daniel Caesar & Giveon)	100	1	2021-03-19
93805	Leave The Door Open	96	0	2021-03-05
93815	What’s Next	91	1	2021-03-05
93811	Hold On	92	0	2021-03-05
93816	We're Good	91	0	2021-02-11
93813	911	91	1	2021-02-05
93809	Up	92	1	2021-02-05
93806	Fiel	94	0	2021-02-04
93808	Ella No Es Tuya - Remix	92	0	2021-02-03
93812	Wellerman - Sea Shanty / 220 KID x Billen Ted ...	92	0	2021-01-21
93810	Goosebumps - Remix	92	1	2021-01-15
93814	Your Love (9PM)	91	0	2021-01-15
93807	Friday (feat. Mufasa & Hypeman) - Dopamine Re-...	94	0	2021-01-15
93803	drivers license	99	1	2021-01-08
93804	Astronaut In The Ocean	98	0	2021-01-06
92823	Good Days	93	1	2020-12-25
92819	Bandido	94	0	2020-12-10
92811	telepatía	97	0	2020-12-04
92821	LA NOCHE DE ANOCHE	93	0	2020-11-27
92830	Dynamite	91	0	2020-11-20

We know which songs are the most popular in general, but as a good producer, you need to understand human emotions and how they shape the market.

In times of crisis, both artists and audiences have different tastes. In March 2020, much of the world went into lockdown. It's natural to wonder which top songs were released back then.

We're going to use query() again, but this time going one step further and dealing with a bit more complex problem since we are defining two conditions.

The conditions are as follows:-

Songs with popularity greater than 80.
Songs that were released in March 2020.

Let's see how we'll approach this problem.

# year and month are integer columns, so compare them with numbers, not strings
most_popular_march_20 = df_tracks.query('popularity > 80 and year == 2020 and month == 3')
most_popular_march_20

	id	name	popularity	duration_ms	explicit	artists	id_artists	release_date	year	month	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	time_signature
92810	5QO79kh1waicV47BqGRL3g	Save Your Tears	97	215627	1	['The Weeknd']	['1Xyo4u8uXC1ZmMpatF05PJ']	2020-03-20	2020	3	0.680	0.826	0	-5.487	1	0.0309	0.02120	0.000012	0.5430	0.644	118.051	4
92813	0VjIjW4GlUZAMYd2vXMi3b	Blinding Lights	96	200040	0	['The Weeknd']	['1Xyo4u8uXC1ZmMpatF05PJ']	2020-03-20	2020	3	0.514	0.730	1	-5.934	1	0.0598	0.00146	0.000095	0.0897	0.334	171.005	4
92816	3FAJ6O0NOHQV8Mc5Ri6ENp	Heartbreak Anniversary	94	198371	0	['Giveon']	['4fxd5Ee7UefO4CUXgwJ7IP']	2020-03-27	2020	3	0.449	0.465	0	-8.964	1	0.0791	0.52400	0.000001	0.3030	0.543	89.087	3
92853	4xqrdfXkTW4T0RauPLv3WA	Heather	89	198040	0	['Conan Gray']	['4Uc8Dsxct0oMqx0P6i60ea']	2020-03-20	2020	3	0.357	0.425	5	-7.301	1	0.0333	0.58400	0.000000	0.3220	0.270	102.078	3
92867	5nujrmhLynf4yMoMtj8AQF	Levitating (feat. DaBaby)	89	203064	0	['Dua Lipa', 'DaBaby']	['6M2wZ9GZgrQXHCFfjv46we', '4r63FhuTkUYltbVAg5...	2020-03-27	2020	3	0.702	0.825	6	-3.787	0	0.0601	0.00883	0.000000	0.0674	0.915	102.977	4
92927	7szuecWAPwGoV1e5vGu8tl	In Your Eyes	86	237520	1	['The Weeknd']	['1Xyo4u8uXC1ZmMpatF05PJ']	2020-03-20	2020	3	0.667	0.719	7	-5.371	0	0.0346	0.00285	0.000081	0.0736	0.717	100.021	4
92951	6KfoDhO4XUWSbnyKjNp9c4	Maniac	86	185773	0	['Conan Gray']	['4Uc8Dsxct0oMqx0P6i60ea']	2020-03-20	2020	3	0.628	0.639	8	-5.460	1	0.0435	0.00162	0.000000	0.3540	0.493	108.045	4
92961	3PfIrDoz19wz7qK7tYeu62	Don't Start Now	85	183290	0	['Dua Lipa']	['6M2wZ9GZgrQXHCFfjv46we']	2020-03-27	2020	3	0.793	0.793	11	-4.521	0	0.0830	0.01230	0.000000	0.0951	0.679	123.950	4
92995	5m5aY6S9ttfIG157xli2Rs	Alô Ambev (Segue Sua Vida) - Ao Vivo	84	169593	0	['Zé Neto & Cristiano']	['487N2T9nIPEHrlTZLL3SQs']	2020-03-26	2020	3	0.695	0.872	9	-3.650	1	0.0868	0.33400	0.000000	0.9540	0.646	121.843	4
93021	527k23H0A4Q0UJN3vGs0Da	After Party	84	167916	1	['Don Toliver']	['4Gso3d4CscCijv0lmajZWs']	2020-03-13	2020	3	0.629	0.692	5	-8.045	1	0.0376	0.00981	0.331000	0.6030	0.453	162.948	4
93025	017PF4Q3l4DBUiWoXk4OWT	Break My Heart	84	221820	0	['Dua Lipa']	['6M2wZ9GZgrQXHCFfjv46we']	2020-03-27	2020	3	0.730	0.729	4	-3.434	0	0.0883	0.16700	0.000001	0.3490	0.467	113.013	4
93071	1jaTQ3nqY3oAAYyCTbIvnM	WHATS POPPIN	83	139741	1	['Jack Harlow']	['2LIk90788K0zvyj2JJVwkJ']	2020-03-13	2020	3	0.923	0.604	11	-6.671	0	0.2450	0.01700	0.000000	0.2720	0.826	145.062	4
93079	7AzlLxHn24DxjgQX73F9fU	No Idea	83	154424	0	['Don Toliver']	['4Gso3d4CscCijv0lmajZWs']	2020-03-13	2020	3	0.652	0.631	6	-5.718	0	0.0893	0.52400	0.000579	0.1650	0.350	127.998	4
93139	39LLxExYz6ewLAcYrzQQyP	Levitating	82	203808	0	['Dua Lipa']	['6M2wZ9GZgrQXHCFfjv46we']	2020-03-27	2020	3	0.695	0.884	6	-2.278	0	0.0753	0.05610	0.000000	0.2130	0.914	103.014	4
93149	4lsHZ92XCFOQfzJFBTluk8	You Got It	82	203145	1	['Vedo']	['3wVXTWabe3viT0jF7DfjOL']	2020-03-27	2020	3	0.762	0.433	5	-8.937	1	0.1870	0.14300	0.000000	0.1180	0.394	122.074	4
93187	2lCkncy6bIB0LTMT7kvrD1	Azul	81	205933	0	['J Balvin']	['1vyhD5VmyZ7KMfW5gqLgo5']	2020-03-19	2020	3	0.843	0.836	11	-2.474	0	0.0695	0.08160	0.001380	0.0532	0.650	94.018	4
93191	6qBFSepqLCuh5tehehc1bd	Like I Want You	81	260776	0	['Giveon']	['4fxd5Ee7UefO4CUXgwJ7IP']	2020-03-27	2020	3	0.678	0.355	10	-7.757	0	0.0627	0.75900	0.000071	0.1140	0.438	119.772	3
93221	6bnF93Rx87YqUBLSgjiMU8	Heartless	81	198267	1	['The Weeknd']	['1Xyo4u8uXC1ZmMpatF05PJ']	2020-03-20	2020	3	0.537	0.746	10	-5.507	0	0.1500	0.02360	0.000001	0.1560	0.252	170.062	4

The titles of the songs seem to mirror the world's mood back then: 'Save Your Tears', 'Heartbreak Anniversary', 'Maniac', 'Break My Heart', and more.
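The multi-condition filter used for March 2020 can be written in more than one way. The sketch below, on a made-up toy frame, shows query() reading local variables with the @ prefix and the equivalent boolean mask. The variable names min_pop, target_year, and target_month are hypothetical.

```python
import pandas as pd

# Toy frame (made-up values) with the columns used in the filter
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "popularity": [97, 85, 60, 90],
    "year": [2020, 2020, 2020, 2021],
    "month": [3, 4, 3, 3],
})

min_pop, target_year, target_month = 80, 2020, 3  # hypothetical thresholds

# query() can read local Python variables with the @ prefix
via_query = toy.query(
    "popularity > @min_pop and year == @target_year and month == @target_month"
)

# The same filter written as a boolean mask
mask = (
    (toy["popularity"] > min_pop)
    & (toy["year"] == target_year)
    & (toy["month"] == target_month)
)
via_mask = toy[mask]

print(via_query)
```

Both versions select the same rows; which one reads better is mostly a matter of taste.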

Features and Popularity

Let's hop on to our next problem. We know that different features of a song can impact its popularity in different ways. However, we want to dig deeper and see how. This is one of the most important questions that we should ask.

How do different features of a song impact its popularity?

Our intuition is that large audiences like songs that are good for dancing. Let's see if we can back this up with data as well.

df_1=df_tracks.groupby('popularity')['danceability'].mean().sort_values(ascending=False).reset_index()
df_1.head()

	popularity	danceability
0	95	0.798000
1	98	0.778000
2	91	0.751091
3	88	0.727105
4	85	0.712600

We have created a new DataFrame, df_1. It holds the mean danceability score for each popularity value, sorted from the highest mean to the lowest.
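To see exactly what the grouped frame contains, here is the same chain of calls on a tiny made-up frame where the means are easy to check by hand.

```python
import pandas as pd

# Toy frame (made-up values): two songs with popularity 90, three with popularity 50
toy = pd.DataFrame({
    "popularity": [90, 90, 50, 50, 50],
    "danceability": [0.8, 0.6, 0.3, 0.5, 0.4],
})

mean_dance = (
    toy.groupby("popularity")["danceability"]
    .mean()                          # one mean per popularity value
    .sort_values(ascending=False)    # highest mean first
    .reset_index()                   # turn the result back into a DataFrame
)

print(mean_dance)
```

Each row of the result is one popularity value together with the average danceability of all songs that share it.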

Now, this makes it easier for us to analyze the correlation between these two features.

We will use the basics of a Python plotting library called Plotly. It is an interactive, open-source plotting library that supports various charts.

It's important to import the library before moving ahead.

import plotly.express as px   #importing plotly
fig2 = px.scatter(df_tracks, x="popularity", y="danceability", color="danceability",size='popularity')
fig2.show()

Bingo! We have plotted a scatter plot. Since the area of each circle corresponds to the popularity score (and its color to the danceability score), we can call this chart a Bubble Chart as well: the higher the popularity, the larger the bubble.

The graph is interactive, and a quick look suggests that 'popularity' and 'danceability' are positively correlated: as the popularity of a song increases, its danceability score tends to increase as well.

We don't always need to plot a graph to check the correlation between two features. The same can be achieved with the few lines of code given below.

We will use a module called scipy.stats for this. This module contains a large number of probability distributions as well as a growing library of statistical functions.

Next, we import the pearsonr function from this module, which calculates Pearson's correlation coefficient 'r' for two features.

Let's learn how.

from scipy.stats import pearsonr    #importing the library
data1 = df_1['popularity']
data2 = df_1['danceability']

# calculate Pearson's correlation
corr, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: 0.888

A quick refresher. The following are the three cases for Pearson's correlation coefficient 'r':-

r>0, implies, positive correlation
r=0, implies, no correlation
r<0, implies, negative correlation.

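Since the correlation coefficient is r = 0.888 (> 0), the two features are positively correlated: an increase in one tends to go along with an increase in the other. Keep in mind that this value was computed on df_1, the mean danceability per popularity score, and not on individual tracks, so it describes the trend of the averages.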
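If you are curious what the coefficient actually measures, here is a minimal sketch that computes Pearson's r by hand with NumPy. The function name pearson_r and the sample numbers are made up for illustration. It follows the standard definition: the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient from its textbook definition (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean of x
    dy = y - y.mean()   # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]      # tends to rise with x
y_down = [10, 8, 6, 4, 2]   # falls linearly as x rises

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)

print(round(r_up, 3), round(r_down, 3))
```

A value close to 1 or -1 means the points lie close to a straight line, and the sign tells you the direction of the relationship.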

Let's visualize the same for some other features as well.

How about "instrumentalness"? A value close to 1 indicates that the track most likely contains no vocals at all, and the lower the value, the more vocal content the song is likely to contain.

Following a similar procedure, we can plot a Bubble Chart of popularity versus instrumentalness.

Looking at the graph, it is easy to notice that the two features are negatively correlated, which means that an increase in one tends to go along with a decrease in the other.
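The "similar procedure" can be wrapped in a small helper so it works for any feature. This is only a sketch: mean_feature_by_popularity is a hypothetical function name, and the toy numbers are made up, so the resulting coefficient says nothing about the real dataset.

```python
import pandas as pd

def mean_feature_by_popularity(df, feature):
    """Mean of `feature` for each popularity score (hypothetical helper)."""
    return df.groupby("popularity")[feature].mean().reset_index()

# Toy frame (made-up values) where instrumentalness drops as popularity rises
toy = pd.DataFrame({
    "popularity": [10, 10, 50, 50, 90, 90],
    "instrumentalness": [0.9, 0.7, 0.4, 0.2, 0.0, 0.1],
})

grouped = mean_feature_by_popularity(toy, "instrumentalness")

# Series.corr() computes Pearson's r by default
r = grouped["popularity"].corr(grouped["instrumentalness"])

print(grouped)
print(r)
```

On the real data you would pass df_tracks and the feature name of your choice instead of the toy frame.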

So, do we need to check the correlation for each feature one by one? Not at all.

Thanks to pandas and plotly, we can easily check the correlation between any two given features. How? Let's find out.

We will combine the corr() function in pandas with a heatmap in plotly. So what exactly do these two do?

corr() - This pandas function computes the pairwise correlation between the columns of the dataset (excluding NA/null values).
Heatmap - This plotly trace shows the magnitude of a phenomenon as colour in two dimensions.

To see how the combination of these two can help us achieve the desired result, we first need to import another plotly module called Graph Objects.

import plotly.graph_objects as go    #importing the library

# Numeric columns to include in the correlation matrix
x_list=['popularity','duration_ms','explicit',
        'danceability','energy','key','loudness',
        'mode','speechiness','acousticness','instrumentalness',
        'liveness','valence','tempo','time_signature']

# Restrict corr() to these columns so the matrix lines up with the axis labels
matrix=df_tracks[x_list].corr()

fig_heatmap = go.Figure(data=go.Heatmap(
                   z=matrix.values,
                   x=x_list,
                   y=x_list,
                   hoverongaps = False))
fig_heatmap.update_layout(margin = dict(t=200,r=200,b=200,l=200),
    width = 800, height = 650,
    autosize = False )

fig_heatmap.show()

Here, the color bar of the graph shows us how the color gets lighter as the correlation increases. We observe that there is no strong positive correlation between popularity and any single feature of a song. The highest positive correlations with popularity are those of danceability, loudness, and energy.
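Reading exact values off a heatmap can be tricky, so it may help to rank the features by their correlation with popularity directly. Here is a minimal sketch on a made-up toy frame (the numbers are not from the Spotify data).

```python
import pandas as pd

# Toy frame (made-up values) with two numeric features and one text column
toy = pd.DataFrame({
    "popularity":   [10, 30, 50, 70, 90],
    "danceability": [0.2, 0.4, 0.5, 0.7, 0.8],
    "acousticness": [0.9, 0.6, 0.7, 0.3, 0.1],
    "name":         ["a", "b", "c", "d", "e"],   # non-numeric, left out below
})

numeric_cols = ["popularity", "danceability", "acousticness"]
matrix = toy[numeric_cols].corr()

# Take the popularity column, drop its self-correlation, and sort
ranked = matrix["popularity"].drop("popularity").sort_values(ascending=False)

print(ranked)
```

The top of the ranking shows the features that rise with popularity, and the bottom shows those that fall.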

Let's just quickly check one more question.

How long does a song last on average today? Has it always been like that?

Before we visualize our result, it's important to change the unit of duration from milliseconds to seconds.

df_tracks['duration']=df_tracks['duration_ms']//1000 #Floor division to keep only whole seconds
df_tracks=df_tracks.drop(['duration_ms'], axis = 1) #drop() returns a new DataFrame, so assign it back

We observe that the average duration of a song increased until around 1969 and has remained more or less the same ever since. More information can be observed through the above graph.
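The average duration per year can be computed with a groupby on the year column. The sketch below uses a made-up toy frame, so the numbers are purely illustrative. Integer division by 1000 turns milliseconds into whole seconds, and dividing the seconds by 60 gives minutes.

```python
import pandas as pd

# Toy frame (made-up values) with the two columns needed for this question
toy = pd.DataFrame({
    "year": [1960, 1960, 1990, 1990, 2020],
    "duration_ms": [150_000, 170_000, 240_000, 260_000, 200_000],
})

toy["duration"] = toy["duration_ms"] // 1000   # milliseconds to whole seconds

# Mean duration in seconds for each release year
avg_duration = toy.groupby("year")["duration"].mean().reset_index()

# Optional: the same average expressed in minutes
avg_duration["duration_min"] = avg_duration["duration"] / 60

print(avg_duration)
```

A frame like avg_duration is what you would hand to a line or bar chart to see the trend over the years.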

Most popular artists

To check the most popular artists we'll use the artists' dataset.

We will start by importing the dataset into our python environment.

df_artists = pd.read_csv('/content/drive/MyDrive/artists.csv')
df_artists.head()

	id	followers	genres	name
0	0DheY5irMjBUeLybbCUEZ2	0.0	[]	Armid & Amir Zare Pashai feat. Sara Rouzbehani
1	0DlhY15l3wsrnlfGio2bjU	5.0	[]	ปูนา ภาวิณี
2	0DmRESX2JknGPQyO15yxg7	0.0	[]	Sadaa
3	0DmhnbHjm1qw6NCYPeZNgJ	0.0	[]	Tra'gruda
4	0Dn11fWM7vHQ3rinvWEl4E	2.0	[]	Ioannis Panoutsopoulos

Our dataset has 1104349 rows and 5 columns.

Analysis was done on Artists:-

artists_popular = df_artists.sort_values(by=['popularity'], ascending=False).reset_index()
artists_popular[:10]

	index	id	followers	genres	name	popularity
0	144481	1uNFoZAHBGtllmzznpCI3s	44606973.0	['canadian pop', 'pop', 'post-teen pop']	Justin Bieber	100
1	115489	4q3ewBCX7sLwd24euuV69X	32244734.0	['latin', 'reggaeton', 'trap latino']	Bad Bunny	98
2	126338	06HL4z0CvFAxyc27GXpf02	38869193.0	['pop', 'post-teen pop']	Taylor Swift	98
3	313676	3TVXtAsR1Inumwj472S9r4	54416812.0	['canadian hip hop', 'canadian pop', 'hip hop', 'pop rap', 'rap', 'toronto rap']	Drake	98
4	144484	3Nrfpe0tUJi4K4DXYWgMUX	31623813.0	['k-pop', 'k-pop boy group']	BTS	96
5	115490	4MCBfE4596Uoi2O4DtmEMz	16996777.0	['chicago rap', 'melodic rap']	Juice WRLD	96
6	144483	1Xyo4u8uXC1ZmMpatF05PJ	31308207.0	['canadian contemporary r&b', 'canadian pop', 'pop']	The Weeknd	96
7	144485	66CXWjxzNUsdJxJ2JdwvnR	61301006.0	['pop', 'post-teen pop']	Ariana Grande	95
8	144486	1vyhD5VmyZ7KMfW5gqLgo5	27286822.0	['latin', 'reggaeton', 'reggaeton colombiano', 'trap latino']	J Balvin	95
9	115491	7iK8PXO48WeuP03g8YR51W	5001808.0	['trap latino']	Myke Towers	95

We notice that the top ten songs and artists differ. Justin Bieber is an unquestionable king of pop, and although their songs are not the most popular right now, The Weeknd, Taylor Swift, and Drake are the listeners' favorites too.

Analyzing the Genres:-

Looking at the head of the dataframe, we also observe that for many rows the column 'genres' is an empty list. These can be treated as NA values as well. To handle this type of situation, it's important to see what proportion of the dataset such rows make up.

df_artists[df_artists["genres"]=='[]']

We see that 59742 rows out of the total 89336 rows of the dataset contain an empty list.
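The count and the share of empty-genre rows can be computed directly from a boolean Series. A minimal sketch on a made-up toy frame, assuming (as in the filter above) that the lists are stored as strings:

```python
import pandas as pd

# Toy frame (made-up values); genres are stored as strings, as in artists.csv
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "genres": ["[]", "['pop']", "[]", "['rap', 'trap']"],
})

is_empty = toy["genres"] == "[]"   # True where the genre list is empty

n_empty = int(is_empty.sum())      # number of rows with an empty list
share_empty = is_empty.mean()      # fraction of rows with an empty list

print(n_empty, share_empty)
```

The mean of a boolean Series is simply the fraction of True values, which makes it a handy shortcut for proportions.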

For this particular case, we will create and perform our analysis on a new dataframe with only those rows that contain some value under the column 'genres'.

df_genre=df_artists[df_artists["genres"]!='[]']
df_genre.head()

	id	followers	genres	name	popularity
45	0VLMVnVbJyJ4oyZs2L3Yl2	71.0	['carnaval cadiz']	Las Viudas De Los Bisabuelos	6
46	0dt23bs4w8zx154C5xdVyl	63.0	['carnaval cadiz']	Los De Capuchinos	5
47	0pGhoB99qpEJEsBQxgaskQ	64.0	['carnaval cadiz']	Los “Pofesionales”	7
48	3HDrX2OtSuXLW5dLR85uN3	53.0	['carnaval cadiz']	Los Que No Paran De Rajar	6
136	22mLrN5fkppmuUPsHx6i2G	59.0	['classical harp', 'harp']	Vera Dulova	3

We have successfully segregated the rows we require for our analysis.

We observe that each value in the column 'genres' is a list. Let's split these lists into individual values. For this task, we will use a very handy pandas function called explode().

The explode() function takes a list-like value and creates a new row for each of its elements. Since the lists are stored as strings here, we first split them on commas.

df_sort_genres=pd.DataFrame(df_genre.assign(genres=df_genre.genres.str.split(",")).explode('genres'))
df_sort_genres.tail()

	id	followers	genres	name	popularity
1104328	1q9C5XlekzXbRLIuLCDTre	90087.0	'teen pop']	Brent Rivera	33
1104331	4fh2BIKYPFvXFsQLhaeVJp	309.0	['la indie']	Lone Kodiak	20
1104334	7akMsd2vb4xowNTehv3gsY	774.0	['indie rockism']	The Str!ke	0
1104336	35m7AJrUCtHYHyIUhCzmgi	205.0	['indie rockism']	Hunter Fraser	6
1104345	1ljurfXKPlGncNdW3J8zJ8	2123.0	['deep acoustic pop']	Right the Stars	18

Next, we replace the square brackets with blank spaces to keep only the keywords.

df_sort_genres['genres']=df_sort_genres.genres.str.replace('[',' ', regex=False)  # regex=False treats '[' as a literal character
df_sort_genres['genres']=df_sort_genres.genres.str.replace(']',' ', regex=False)
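Because each genre list is stored as a string that looks like a Python list, another option is to parse it with ast.literal_eval from the standard library and then explode the real lists. This is an alternative sketch on made-up toy data, not the approach used in this post, and it would give cleaner labels (no quotes or stray spaces) than the counts shown here.

```python
import ast
import pandas as pd

# Toy frame (made-up values) with list-like strings in the genres column
toy = pd.DataFrame({
    "name": ["a", "b"],
    "genres": ["['pop', 'dance pop']", "['pop']"],
})

# Turn each string into a real Python list
parsed = toy.assign(genres=toy["genres"].apply(ast.literal_eval))

# One row per (artist, genre) pair
exploded = parsed.explode("genres")

counts = exploded["genres"].value_counts()
print(counts)
```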

Let's analyze the top 30 genres now.

# get the top 30 most common genres
n = 30
# rename_axis/reset_index(name=...) give the same column names across pandas versions
top_30 = (df_sort_genres['genres'].value_counts().head(n)
          .rename_axis('Genres').reset_index(name='Total_Count'))
top_30

	Genres	Total_Count
0	'dance pop'	551
1	'latin'	483
2	'electro house'	478
3	'pop'	461
4	'edm'	455
5	'hip hop'	455
6	'electropop'	432
7	'indie rock'	411
8	'classical performance'	407
9	'tropical'	402
10	'latin rock'	401
11	'french hip hop'	400
12	'lo-fi beats'	393
13	'urban contemporary'	386
14	'rap'	366
15	'pop rap'	365
16	'funk'	365
17	'modern rock'	361
18	'indie folk'	353
19	'adult standards'	349
20	'pop dance'	347
21	'country rock'	346
22	'uk hip hop'	343
23	'corrido'	339
24	'stomp and holler'	338
25	'art rock'	336
26	'alternative rock'	333
27	'alternative metal'	328
28	'indie pop'	325
29	'alternative r&b'	325

Should we visualize this output using a pie chart? Or wait, let's make it more interesting and visualize it using a Donut Chart.

We observe that the top 30 genres have more or less the same count, with 'dance pop' at the top. More information can be seen using the graph above.
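A Donut Chart is just a pie chart with a hole in the middle. Here is a minimal sketch, assuming plotly express is used as elsewhere in this post. The helper name make_donut is hypothetical, and the small frame only holds the first three rows of the table above.

```python
import pandas as pd

# First three rows of the top-30 table, used here as sample input
top = pd.DataFrame({
    "Genres": ["dance pop", "latin", "electro house"],
    "Total_Count": [551, 483, 478],
})

def make_donut(df):
    """Build a donut chart from a Genres/Total_Count frame (hypothetical helper)."""
    # Imported inside the function so the data preparation runs even without plotly
    import plotly.express as px
    # hole > 0 turns the pie into a donut
    return px.pie(df, names="Genres", values="Total_Count", hole=0.4)

# In a notebook, display the chart with:
# make_donut(top).show()
```

On the full result you would pass top_30 instead of the three-row sample.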

What we learned

For tracks.csv file:-

We found the 10 most popular songs, most of them released in 2020 and 2021, using the query() function of pandas.
We found the titles of the most popular songs of March 2020.
We saw how different features are correlated with each other using the corr() function of pandas and the heatmap of plotly.
We learned to visualize our result using Bubble Charts.

For artists.csv:-

We found the top 10 artists
We learned the use of the explode() function and found the top 30 genres.
We visualized our result using a Donut Chart.

Now you know what to do to be a successful producer and use data and analytics to your advantage to pick the next big hit.

Go to https://datascience.fm/tag/pandas/ for more tutorials where we take popular datasets and analyze them with Pandas.

