pandas

Video Game Sales Analysis

Analyzing the 'Video Game Sales Analysis' dataset with the help of pandas library and learn the fundamentals of plotly by visualizing the result using simple yet interactive graphs.

Welcome back to another pandas posts with us. We will be analyzing the video games sales analysis using pandas and visualizing the results using plotly.

Who doesn't like video games? Video games have become a favourite pastime for most of us. Video games are a billion-dollar business and have been for many years. Now imagine being able to analyze the sales trends for the popular Nintendo games and many more such platforms. How cool would it be!

In this module, we will be laying the foundation for our analysis by processing and exploring a large amount of data on video game sales. The dataset contains information regarding the sales of video games across various regions like North America, Europe, Japan and also globally, while also giving information regarding the Names, Publishers and Platforms. This dataset has been made available thanks to Kaggle which is the home for many such datasets and competitions.

Get the data here.

You are a data scientist in the gaming industry and want to analyze the sales pattern for different features. You want to investigate different features and approach the game producer with a thorough report. You want to find the answers to the following questions:

Which region has performed the best in terms of sales?
The top gaming consoles are Microsoft (Xbox), Sony (Playstation) and Nintendo, with Google acting as a new competitor. Does the dataset also back this information? Analyze with respect to different regions.
What are the top 10 games currently making the most sales globally?
What are the top games for different regions?
Are there any games with release year older than 2000 that are still making high sales? What are they?
What are the top gaming genres that are making high sales?
Does the publisher have any impact on the regional sales?
Is there any region that has out-performed global average sales?

We will be exploring the questions above using data, analytics and visualization techniques one at a time. So let's get started!

Before you continue reading, check out our new ChatGPT Tool for Job Seekers

Job hunting can be a demanding and time-consuming process. We believe our tool can take some of the stress out of it by leveraging AI to help you write compelling, personalized cover letters. We're excited to share this tool with you and look forward to hearing about your experiences and successes.

Experience the future of job application preparation. Give our ChatGPT-powered tool a try today!

https://jobseekerai.netlify.app/

Learning Objectives

After this module, we should be able to do the following:

Use pandas library to analyze different features of the dataset, which includes,

Read the dataset.
Use functions like df.query and df.groupby.
Analyze the sales for different regions with respect to various features.

2. Use plotly library to visualize the given results.

Plot graphs like bar graphs and pie charts.

Understanding the dataset

This dataset contains a list of video games with sales greater than 100,000 copies. Each row in the dataset represents a game, whereas, each column contains different attributes.

The data set includes information about:

Rank - The overall rank of the game.
Name - The name of the game.
Platform- The platform on which the game was released.
Year- The year in which the game was released.
Genre-The genre of the game.
Publisher-The publisher of the game.
NA_Sales, EU_Sales, JP_Sales , Other_Sales, Global_Sales - Sales made by a particular game in North America, Europe, Japan, Other regions and Globally (in millions).

Pre-processing the dataset

To get started, we need to import some useful libraries that will help us import the dataset into our python environment, manipulate and analyze the same and later help us to visualize it.

#Importing libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings("ignore")
print("The modules are imported")

#Importing dataset
df=pd.read_csv("/content/vgsales.csv")
df.head()

	Rank	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales
0	1	Wii Sports	Wii	2006.0	Sports	Nintendo	41.49	29.02	3.77	8.46	82.74
1	2	Super Mario Bros.	NES	1985.0	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24
2	3	Mario Kart Wii	Wii	2008.0	Racing	Nintendo	15.85	12.88	3.79	3.31	35.82
3	4	Wii Sports Resort	Wii	2009.0	Sports	Nintendo	15.75	11.01	3.28	2.96	33.00
4	5	Pokemon Red/Pokemon Blue	GB	1996.0	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37

Using df.shape function, we see that our dataset has 16598 rows and 11 columns.

Now, we will look for missing values using the df.isnull().sum() function. We observe that the column Year has 271 missing values and the column Publisher has 58 missing values. We will treat the missing values as follows:

ForYear, we use the code df["Year"].fillna(df["Year"].quantile(0.50),inplace = True)
ForPublisher, we use the codedf["Publisher"].dropna(inplace = True)

Now that we are done, it's time to jump to the questions.

💡

Before that, follow us on Twitter to stay updated about more such content.

Follow @DatascienceFM

1. Which region has performed the best in terms of sales?

We will utilize the average sales made per region and compare the results. Before we do that, let's make sure we know how to calculate the average. We will be using a very simple line of code for this, i.e., df['Region'].mean(), where 'region' depicts the different regions in the dataset and mean() function is used to calculate the mean. We observe that our output is coming in decimals, to convert the values in millions, let's multiple the result with 10,00,000. The final code should look like this.

x=(df['NA_Sales'].mean()*1000000)
y=(df['EU_Sales'].mean()*1000000)
z=(df['JP_Sales'].mean()*1000000)
q=(df['Other_Sales'].mean()*1000000)
p=(df['Global_Sales'].mean()*1000000)

print("The average sales in North America =", (f"${x:,.3f}")) #comma separated values till 3 decimal place and $ sign
print("The average sales in Europe =",(f"${y:,.3f}"))
print("The average sales in Japan =",(f"${z:,.3f}"))
print("The average sales in other regions =",(f"${q:,.3f}"))
print("The average sales globally =",(f"${p:,.3f}"))

The average sales in North America = $264,667.430
The average sales in Europe = $146,652.006
The average sales in Japan = $77,781.660
The average sales in other regions = $48,063.020
The average sales globally = $537,440.656

Let us now plot our findings on a bar chart.

colors = ['lightslategray',] * 4
colors[1]='darkgray'
colors[2]='grey'
colors[3]='dimgrey'
colors[0] = 'crimson'

bar1 = go.Figure(data=[go.Bar(
    y=['Global','North America', 'Europe', 'Japan',
       'Other'],
    x=[537440.656,264667.430, 146652.006, 77781.660, 48063.020],
    orientation='h',
    marker_color=colors # marker color can be a single color value or an iterable
)])
bar1.update_layout(title_text='Region with highest sales on an average')
bar1.update_xaxes(title='Average Sales')
bar1.update_yaxes(title='Regions')

We have successfully used a bar graph to demonstrate the results. It is clear now which region has made the most sales (expect global sales) in terms of video game. We observe that North America has the highest average sales of $264,667.430. It's clear now that we can use the bar graph to answer questions like, which country sells the most video games in the world? This will be help us make some strategy changes.

The above result also answers our 8th questions, i.e., Is there any region that has out-performed global average sales?

The answer is 'No'. There is no region that has out-performed the global average sales. The average global sales is $537,440.656 million.

2. The top gaming consoles are Microsoft (Xbox), Sony (Playstation) and Nintendo, with Google acting as a new competitor.

Does the dataset also back this information? Analyze with respect to different regions and also, globally.

The world is getting more connected every day, and more gaming trends and preferences are emerging around the globe. We want to know which console is preferred the most based on different regions. We will compile results from all over the world and determine if any regional trends might shape the gaming industry as a whole.

We will use the df.groupby function of pandas for our analysis.

groupby(): This function is used to group or combine large amounts of data and compute operations on these groups.

Have a look at the code given below.

#Grouping the north america sales based on each platform
data2 = pd.DataFrame(df.groupby("Platform")[["NA_Sales"]].sum().sort_values(by=['NA_Sales'],ascending=[False]).reset_index())
data2.rename(columns = {'Platform':'Platform_NA'}, inplace = True)

#Grouping the europe sales based on each platform
data3 = pd.DataFrame(df.groupby("Platform")[["EU_Sales"]].sum().sort_values(by=['EU_Sales'],ascending=[False]).reset_index())
data3.rename(columns = {'Platform':'Platform_EU'}, inplace = True)

#Grouping the japan sales based on each platform
data4 = pd.DataFrame(df.groupby("Platform")[["JP_Sales"]].sum().sort_values(by=['JP_Sales'],ascending=[False]).reset_index())
data4.rename(columns = {'Platform':'Platform_JP'}, inplace = True)

#Grouping the other region sales based on each platform
data5 = pd.DataFrame(df.groupby("Platform")[["Other_Sales"]].sum().sort_values(by=['Other_Sales'],ascending=[False]).reset_index())
data5.rename(columns = {'Platform':'Platform_other'}, inplace = True)

#Concatenating our datasets
data=pd.concat([data1,data2,data3,data4,data5],axis=1)
data.head(3)

The dataframe data we created will provide us with the total sales (global, NA, EU, JP, Others) based on different platforms. The dataframe will look like this:

	Platform_glob	Global_Sales	Platform_NA	NA_Sales	Platform_EU	EU_Sales	Platform_JP	JP_Sales	Platform_other	Other_Sales
0	PS2	1,255.64	X360	601.05	PS3	343.71	DS	175.57	PS2	193.44
1	X360	979.96	PS2	583.84	PS2	339.29	PS	139.82	PS3	141.93
2	PS3	957.84	Wii	507.71	X360	280.58	PS2	139.20	X360	85.54

We realize that the dataframe alone will make it very hard for us to analyze the result. Therefore, we plot the result using a line chart.

from plotly.subplots import make_subplots #import new library

subplot1 = make_subplots(rows=4, cols=1, shared_yaxes=True,subplot_titles=("North American top platforms","Europe top platforms","Japan top platforms","Other regions top platforms"))

#Subplot for North America
subplot1.add_trace(go.Bar(x=data['Platform_NA'], y=data['NA_Sales'],
                    marker=dict(color=[1, 2, 3],coloraxis="coloraxis")),1, 1)

#Subplot for Europe
subplot1.add_trace(go.Bar(x=data['Platform_EU'], y=data['EU_Sales'],
                    marker=dict(color=[4, 5, 6], coloraxis="coloraxis")),                         2, 1)
                   
#Subplot for Japan
subplot1.add_trace(go.Bar(x=data['Platform_JP'], y=data['JP_Sales'],
                    marker=dict(color=[7, 8, 9], coloraxis="coloraxis")),
                    3, 1)

##Subplot for Other Regions
subplot1.add_trace(go.Bar(x=data['Platform_other'], y=data['Other_Sales'],
                    marker=dict(color=[10, 11, 12], coloraxis="coloraxis")),
                   4, 1)
                   
subplot1.update_layout(height=900,width=500,coloraxis=dict(colorscale='Magenta'), showlegend=False)
subplot1.show()

The graph shows us the top platforms preferred by users in different regions. We observe the following:

X360 (Microsoft) is the top preferred console by users in North America making a total of $601.05 million.
PS3 (Sony) is the top preferred console by users in Europe making a total of $343.71 million.
DS (Nintendo) is the top preferred console by users in Japan making a total of $175.57 million.
PS2 (Sony) is the top preferred console by users in other regions making a total of $193.44 million.

This implies that our assumption about Sony, Nintendo and Microsoft among the top consoles was correct.

We can use the same method to compare the top games and publishers for different regions.

Moving on to the next question.

3. What are the top 10 games currently making the most sales globally?

We will use a similar approach by grouping the games with respect to the global sales and observe the top 10 games.

top = pd.DataFrame(df.groupby("Name")[["Global_Sales"]].sum().sort_values(by=['Global_Sales'],ascending=[False]).reset_index())
top.head(10) #Printing the top 10 results

	Name	Global_Sales
0	Wii Sports	82.74
1	Grand Theft Auto V	55.92
2	Super Mario Bros.	45.31
3	Tetris	35.84
4	Mario Kart Wii	35.82
5	Wii Sports Resort	33.00
6	Pokemon Red/Pokemon Blue	31.37
7	Call of Duty: Black Ops	31.03
8	Call of Duty: Modern Warfare 3	30.83
9	New Super Mario Bros.	30.01

We see the most played game is Wii Sports making a total of $82.74 million globally.

We will plot the above using a pie chart.

pie1 = px.pie(top, values=top['Global_Sales'][:10], names=top['Name'][:10],title='Top 10 games globally', 
              color_discrete_sequence=px.colors.sequential.Purp_r)
pie1.update_traces(textposition='inside', textinfo='percent+label',showlegend=False)

pie1.show()

The pie chart also shows the proportion of sales each game holds, while also depicting the results.

4. What are the top games for different regions?

We will have to compare the sales made by different games regionally. We will use the same approach we did while analyzing the platform.

Try answering this on your own, if you get stuck you can always refer to the lines of code below.

name2 = pd.DataFrame(df.groupby("Name")[["NA_Sales"]].mean().sort_values(by=['NA_Sales'],ascending=[False]).reset_index())
name2.rename(columns = {'Name':'Name_NA'}, inplace = True)

name3 = pd.DataFrame(df.groupby("Name")[["EU_Sales"]].mean().sort_values(by=['EU_Sales'],ascending=[False]).reset_index())
name3.rename(columns = {'Name':'Name_EU'}, inplace = True)

name4 = pd.DataFrame(df.groupby("Name")[["JP_Sales"]].mean().sort_values(by=['JP_Sales'],ascending=[False]).reset_index())
name4.rename(columns = {'Name':'Name_JP'}, inplace = True)

name5 = pd.DataFrame(df.groupby("Name")[["Other_Sales"]].mean().sort_values(by=['Other_Sales'],ascending=[False]).reset_index())
name5.rename(columns = {'Name':'Name_other'}, inplace = True)

#Concatenating the results.
name_df=pd.concat([name2,name3,name4,name5],axis=1)

The dataframe name_df will give us the required result.

Plotting it in a similar fashion to understand the results obtained.

subplot_name1 = make_subplots(rows=4, cols=1, shared_yaxes=True,subplot_titles=("North American top games","Europe top games", "Japan top games","Other regions top games",'Top games globally'))

#Subplot for North America
subplot_name1.add_trace(go.Bar(x=name_df['Name_NA'][:5], y=name_df['NA_Sales'][:5],marker=dict(color=[1, 2, 3],coloraxis="coloraxis")),1, 1)

#Subplot for Europe
subplot_name1.add_trace(go.Bar(x=name_df['Name_EU'][:5], y=name_df['EU_Sales'][:5],marker=dict(color=[4, 5, 6], coloraxis="coloraxis")), 2, 1)

#Subplot for Japan
subplot_name1.add_trace(go.Bar(x=name_df['Name_JP'][:5], y=name_df['JP_Sales'][:5],marker=dict(color=[7, 8, 9], coloraxis="coloraxis")),3, 1)

#Subplot for other regions
subplot_name1.add_trace(go.Bar(x=name_df['Name_other'][:5], y=name_df['Other_Sales'][:5],marker=dict(color=[10, 11, 12], coloraxis="coloraxis")),4, 1)

subplot_name1.update_layout(height=1000,width=500,coloraxis=dict(colorscale='Mint_r'), showlegend=False)
subplot_name1.update_xaxes(tickangle=45)
subplot_name1.show()

The graph shows us the top games preferred by users in different regions and also globally. We observe the following:

Wii Sports has been the top game in North America, Europe, other regions.
Pokemon Red/Pokemon Blue is the top game in Japan.

5. Are there any games with release year older than 2000 that are still making high sales? What are they?

To achieve this we will first segregate the Year less than 2000 from the rest of the dataframe. Doing this using the df.query() function.

query()- This is a filtering function that enables to select and filter the columns of a dataFrame with a boolean expression.

old_games = pd.DataFrame(df.query('Year<2000', inplace=False))

The above code returns a dataframe old_games that has all the games with release date older than 2000. We can also see the first 5 rows using the function df.head(). Moreover, using the function df.shape we observe that the total number of such games are 1974. However, to answer this particular question, we will only focus on the old games making the most sales.

We are considering Global Sales to analyze the overall sales. Remember that you can use the same approach to find the sales for any other region as well.

We will target the data exceeding the 99th percentile to get only the games making the highest sales.

import numpy as np #import library
a=old_games['Global_Sales']
np.percentile(a, [99]) 

>>>array([7.8235])

Let's consider global sales greater than $7.82 million as high and proceed with the analysis.

old_games.query('Global_Sales>7.8235', inplace=False)

Here, we have only targeted those games which makes global sales more than $7.82 million.
Note that we have used this value to help us simplify our analysis. This can change according to your definition of high sales.
The final output should look like this:

	Rank	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales
1	2	Super Mario Bros.	NES	1985	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24
4	5	Pokemon Red/Pokemon Blue	GB	1996	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37
5	6	Tetris	GB	1989	Puzzle	Nintendo	23.20	2.26	4.22	0.58	30.26
9	10	Duck Hunt	NES	1984	Shooter	Nintendo	26.93	0.63	0.28	0.47	28.31
12	13	Pokemon Gold/Pokemon Silver	GB	1999	Role-Playing	Nintendo	9.00	6.18	7.20	0.71	23.10
18	19	Super Mario World	SNES	1990	Platform	Nintendo	12.78	3.75	3.54	0.55	20.61
21	22	Super Mario Land	GB	1989	Platform	Nintendo	10.83	2.71	4.18	0.42	18.14
22	23	Super Mario Bros. 3	NES	1988	Platform	Nintendo	9.54	3.44	3.84	0.46	17.28
30	31	Pokémon Yellow: Special Pikachu Edition	GB	1998	Role-Playing	Nintendo	5.89	5.04	3.12	0.59	14.64
46	47	Super Mario 64	N64	1996	Platform	Nintendo	6.91	2.85	1.91	0.23	11.89
50	51	Super Mario Land 2: 6 Golden Coins	GB	1992	Adventure	Nintendo	6.16	2.04	2.69	0.29	11.18
52	53	Gran Turismo	PS	1997	Racing	Sony Computer Entertainment	4.02	3.87	2.54	0.52	10.95
57	58	Super Mario All-Stars	SNES	1993	Platform	Nintendo	5.99	2.15	2.12	0.29	10.55
63	64	Mario Kart 64	N64	1996	Racing	Nintendo	5.55	1.94	2.23	0.15	9.87
66	67	Final Fantasy VII	PS	1997	Role-Playing	Sony Computer Entertainment	3.01	2.47	3.28	0.96	9.72
69	70	Gran Turismo 2	PS	1999	Racing	Sony Computer Entertainment	3.88	3.42	1.69	0.50	9.49
71	72	Donkey Kong Country	SNES	1994	Platform	Nintendo	4.36	1.71	3.00	0.23	9.30
76	77	Super Mario Kart	SNES	1992	Racing	Nintendo	3.54	1.24	3.81	0.18	8.76
84	85	GoldenEye 007	N64	1997	Shooter	Nintendo	5.80	2.01	0.13	0.15	8.09
87	88	Final Fantasy VIII	PS	1999	Role-Playing	SquareSoft	2.28	1.72	3.63	0.23	7.86

Super Mario Bros. released in 1985 is the top old game still making global sales of $40.24 million. This is followed by Pokemon Red/Pokemon Blue (1996), Tetris (1989) and more.

Let's quickly hop on to the next question.

6. What are the top gaming genres that are making high sales?

The result can help game developers find out which genre is preferred in the market. It can also be used to compare gameplay across different games as one tool. These can be analyzed and used to develop strategies for future game releases.

To analyze the sales with respect to genre, we will consider global sales.

Give this question a try before looking at the code below. We will simply use the groupby function on global sales and later visualize our result using a bar graph.

This is how the code should look like:

genre_df = df.groupby("Genre")[["Global_Sales"]].sum().sort_values(by=['Global_Sales'],ascending=[False]).reset_index()
genre_df #print the dataframe

The dataframe genre_df will return the most preferred genres.

	Genre	Global_Sales
0	Action	1,751.18
1	Sports	1,330.93
2	Shooter	1,037.37
3	Role-Playing	927.37
4	Platform	831.37
5	Misc	809.96
6	Racing	732.04
7	Fighting	448.91
8	Simulation	392.20
9	Puzzle	244.95
10	Adventure	239.04
11	Strategy	175.12

bar_genre= px.bar(genre_df, x='Genre', y='Global_Sales',color='Global_Sales',color_continuous_scale='Burgyl')
bar_genre.show()

The bar graph clearly shows the preference of users for different genres. We observe that Action games are the most preferred across the globe, making a total of $1751.18 million.

7. Does the publisher have any impact on the regional sales?

We can quickly extract the top publishers for different regions using the same method we used in games and platforms. We can take the top 10 publishers for our analysis. Try comparing the publishers for different regions yourself and compare!

We observe the following from the bar graphs:

Nintendo is the top publisher in North America, Europe and Japan making a total of $816.87 million, $418.74 million and $455.42 million respectively.
For other regions, Electronic Arts is the top publisher making $129.77 million.

8. Is there any region that was out-performed global average sales?

We have already answered the above while answering the 1st question. We found that no region has out-performed global average sales.

You can look at the detailed explanation in the 1st question above.

And with that, we come to an end of our analysis.

Conclusions

We have managed to find the answer to each question, and we can now share our report with the game producer. This report will contain the following information:

North America is the top region has the highest average sales of $264,667.430.
The most preferred platforms are as follows:

X360 (Microsoft) is the top preferred console by users in North America making a total of $601.05 million.
PS3 (Sony) is the top preferred console by users in Europe making a total of $343.71 million.
DS (Nintendo) is the top preferred console by users in Japan making a total of $175.57 million.
PS2 (Sony) is the top preferred console by users in other regions making a total of $193.44 million.

3. The top most played game is Wii Sports making a total of $82.74 million globally.

4. The top games are as follows:

Wii Sports has been the top game in North America, Europe, other regions.
Pokemon Red/Pokemon Blue is the top game in Japan.

5. Super Mario Bros. released in 1985 is the top old game still making global sales of $40.24 million. This is followed by Pokemon Red/Pokemon Blue (1996), Tetris (1989) and more.

6. Action games are the most preferred across the globe, making a total of $1751.18 million.

7. The top publishers are as follows:

Nintendo is the top publisher in North America, Europe and Japan making a total of $816.87 million, $418.74 million and $455.42 million respectively.
For other regions, Electronic Arts is the top publisher making $129.77 million.

8. There is no region that has out-performed the global average sales. The average global sales are $537,440.656 million.

This report summarizes the top games, platforms, publishers, genres are more attributes for different regions. The results will be beneficial in improving both sales and customer satisfaction.

We hope you find this analysis useful in addressing your plans for improving the next version of your game and have a clearer vision of how to work on datasets in similar sectors while also handling graphs using plotly.

Stay tuned for more pandas series.

If you are looking for jobs in AI and DS check out Deep Learning Careers

See you next time!