Video Game Sales Analysis
Analyzing the 'Video Game Sales Analysis' dataset with the help of pandas library and learn the fundamentals of plotly by visualizing the result using simple yet interactive graphs.
Welcome back to another pandas posts with us. We will be analyzing the video games sales analysis using pandas and visualizing the results using plotly.
Who doesn't like video games? Video games have become a favourite pastime for most of us. Video games are a billion-dollar business and have been for many years. Now imagine being able to analyze the sales trends for the popular Nintendo games and many more such platforms. How cool would it be!
In this module, we will be laying the foundation for our analysis by processing and exploring a large amount of data on video game sales. The dataset contains information regarding the sales of video games across various regions like North America, Europe, Japan and also globally, while also giving information regarding the Names, Publishers and Platforms. This dataset has been made available thanks to Kaggle which is the home for many such datasets and competitions.
Get the data here.
You are a data scientist in the gaming industry and want to analyze the sales pattern for different features. You want to investigate different features and approach the game producer with a thorough report. You want to find the answers to the following questions:
- Which region has performed the best in terms of sales?
- The top gaming consoles are Microsoft (Xbox), Sony (Playstation) and Nintendo, with Google acting as a new competitor. Does the dataset also back this information? Analyze with respect to different regions.
- What are the top 10 games currently making the most sales globally?
- What are the top games for different regions?
- Are there any games with release year older than 2000 that are still making high sales? What are they?
- What are the top gaming genres that are making high sales?
- Does the publisher have any impact on the regional sales?
- Is there any region that has out-performed global average sales?
We will be exploring the questions above using data, analytics and visualization techniques one at a time. So let's get started!
Before you continue reading, check out our new ChatGPT Tool for Job Seekers
Job hunting can be a demanding and time-consuming process. We believe our tool can take some of the stress out of it by leveraging AI to help you write compelling, personalized cover letters. We're excited to share this tool with you and look forward to hearing about your experiences and successes.
Experience the future of job application preparation. Give our ChatGPT-powered tool a try today!
https://jobseekerai.netlify.app/
Learning Objectives
After this module, we should be able to do the following:
- Use pandas library to analyze different features of the dataset, which includes,
- Read the dataset.
- Use functions like
df.query
anddf.groupby
. - Analyze the sales for different regions with respect to various features.
2. Use plotly library to visualize the given results.
- Plot graphs like bar graphs and pie charts.
Understanding the dataset
This dataset contains a list of video games with sales greater than 100,000 copies. Each row in the dataset represents a game, whereas, each column contains different attributes.
The data set includes information about:
Rank
- The overall rank of the game.Name
- The name of the game.Platform
- The platform on which the game was released.Year
- The year in which the game was released.Genre
-The genre of the game.Publisher
-The publisher of the game.NA_Sales
,EU_Sales
,JP_Sales
,Other_Sales
,Global_Sales
- Sales made by a particular game in North America, Europe, Japan, Other regions and Globally (in millions).
Pre-processing the dataset
To get started, we need to import some useful libraries that will help us import the dataset into our python environment, manipulate and analyze the same and later help us to visualize it.
#Importing libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings("ignore")
print("The modules are imported")
#Importing dataset
df=pd.read_csv("/content/vgsales.csv")
df.head()
Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
Using df.shape
function, we see that our dataset has 16598 rows and 11 columns.
Now, we will look for missing values using the df.isnull().sum()
function. We observe that the column Year
has 271 missing values and the column Publisher
has 58 missing values. We will treat the missing values as follows:
- For
Year
, we use the codedf["Year"].fillna(df["Year"].quantile(0.50),inplace = True)
- For
Publisher
, we use the codedf["Publisher"].dropna(inplace = True)
Now that we are done, it's time to jump to the questions.
1. Which region has performed the best in terms of sales?
We will utilize the average sales made per region and compare the results. Before we do that, let's make sure we know how to calculate the average. We will be using a very simple line of code for this, i.e., df['Region'].mean()
, where 'region' depicts the different regions in the dataset and mean()
function is used to calculate the mean. We observe that our output is coming in decimals, to convert the values in millions, let's multiple the result with 10,00,000
. The final code should look like this.
x=(df['NA_Sales'].mean()*1000000)
y=(df['EU_Sales'].mean()*1000000)
z=(df['JP_Sales'].mean()*1000000)
q=(df['Other_Sales'].mean()*1000000)
p=(df['Global_Sales'].mean()*1000000)
print("The average sales in North America =", (f"${x:,.3f}")) #comma separated values till 3 decimal place and $ sign
print("The average sales in Europe =",(f"${y:,.3f}"))
print("The average sales in Japan =",(f"${z:,.3f}"))
print("The average sales in other regions =",(f"${q:,.3f}"))
print("The average sales globally =",(f"${p:,.3f}"))
The average sales in North America = $264,667.430
The average sales in Europe = $146,652.006
The average sales in Japan = $77,781.660
The average sales in other regions = $48,063.020
The average sales globally = $537,440.656
Let us now plot our findings on a bar chart.
colors = ['lightslategray',] * 4
colors[1]='darkgray'
colors[2]='grey'
colors[3]='dimgrey'
colors[0] = 'crimson'
bar1 = go.Figure(data=[go.Bar(
y=['Global','North America', 'Europe', 'Japan',
'Other'],
x=[537440.656,264667.430, 146652.006, 77781.660, 48063.020],
orientation='h',
marker_color=colors # marker color can be a single color value or an iterable
)])
bar1.update_layout(title_text='Region with highest sales on an average')
bar1.update_xaxes(title='Average Sales')
bar1.update_yaxes(title='Regions')
We have successfully used a bar graph to demonstrate the results. It is clear now which region has made the most sales (expect global sales) in terms of video game. We observe that North America has the highest average sales of $264,667.430. It's clear now that we can use the bar graph to answer questions like, which country sells the most video games in the world? This will be help us make some strategy changes.
The above result also answers our 8th questions, i.e., Is there any region that has out-performed global average sales?
The answer is 'No'. There is no region that has out-performed the global average sales. The average global sales is $537,440.656 million.
2. The top gaming consoles are Microsoft (Xbox), Sony (Playstation) and Nintendo, with Google acting as a new competitor.
Does the dataset also back this information? Analyze with respect to different regions and also, globally.
The world is getting more connected every day, and more gaming trends and preferences are emerging around the globe. We want to know which console is preferred the most based on different regions. We will compile results from all over the world and determine if any regional trends might shape the gaming industry as a whole.
We will use the df.groupby
function of pandas for our analysis.
- groupby(): This function is used to group or combine large amounts of data and compute operations on these groups.
Have a look at the code given below.
#Grouping the north america sales based on each platform
data2 = pd.DataFrame(df.groupby("Platform")[["NA_Sales"]].sum().sort_values(by=['NA_Sales'],ascending=[False]).reset_index())
data2.rename(columns = {'Platform':'Platform_NA'}, inplace = True)
#Grouping the europe sales based on each platform
data3 = pd.DataFrame(df.groupby("Platform")[["EU_Sales"]].sum().sort_values(by=['EU_Sales'],ascending=[False]).reset_index())
data3.rename(columns = {'Platform':'Platform_EU'}, inplace = True)
#Grouping the japan sales based on each platform
data4 = pd.DataFrame(df.groupby("Platform")[["JP_Sales"]].sum().sort_values(by=['JP_Sales'],ascending=[False]).reset_index())
data4.rename(columns = {'Platform':'Platform_JP'}, inplace = True)
#Grouping the other region sales based on each platform
data5 = pd.DataFrame(df.groupby("Platform")[["Other_Sales"]].sum().sort_values(by=['Other_Sales'],ascending=[False]).reset_index())
data5.rename(columns = {'Platform':'Platform_other'}, inplace = True)
#Concatenating our datasets
data=pd.concat([data1,data2,data3,data4,data5],axis=1)
data.head(3)
The dataframe data
we created will provide us with the total sales (global, NA, EU, JP, Others) based on different platforms. The dataframe will look like this:
Platform_glob | Global_Sales | Platform_NA | NA_Sales | Platform_EU | EU_Sales | Platform_JP | JP_Sales | Platform_other | Other_Sales | |
---|---|---|---|---|---|---|---|---|---|---|
0 | PS2 | 1,255.64 | X360 | 601.05 | PS3 | 343.71 | DS | 175.57 | PS2 | 193.44 |
1 | X360 | 979.96 | PS2 | 583.84 | PS2 | 339.29 | PS | 139.82 | PS3 | 141.93 |
2 | PS3 | 957.84 | Wii | 507.71 | X360 | 280.58 | PS2 | 139.20 | X360 | 85.54 |
We realize that the dataframe alone will make it very hard for us to analyze the result. Therefore, we plot the result using a line chart.
from plotly.subplots import make_subplots #import new library
subplot1 = make_subplots(rows=4, cols=1, shared_yaxes=True,subplot_titles=("North American top platforms","Europe top platforms","Japan top platforms","Other regions top platforms"))
#Subplot for North America
subplot1.add_trace(go.Bar(x=data['Platform_NA'], y=data['NA_Sales'],
marker=dict(color=[1, 2, 3],coloraxis="coloraxis")),1, 1)
#Subplot for Europe
subplot1.add_trace(go.Bar(x=data['Platform_EU'], y=data['EU_Sales'],
marker=dict(color=[4, 5, 6], coloraxis="coloraxis")), 2, 1)
#Subplot for Japan
subplot1.add_trace(go.Bar(x=data['Platform_JP'], y=data['JP_Sales'],
marker=dict(color=[7, 8, 9], coloraxis="coloraxis")),
3, 1)
##Subplot for Other Regions
subplot1.add_trace(go.Bar(x=data['Platform_other'], y=data['Other_Sales'],
marker=dict(color=[10, 11, 12], coloraxis="coloraxis")),
4, 1)
subplot1.update_layout(height=900,width=500,coloraxis=dict(colorscale='Magenta'), showlegend=False)
subplot1.show()
The graph shows us the top platforms preferred by users in different regions. We observe the following:
- X360 (Microsoft) is the top preferred console by users in North America making a total of $601.05 million.
- PS3 (Sony) is the top preferred console by users in Europe making a total of $343.71 million.
- DS (Nintendo) is the top preferred console by users in Japan making a total of $175.57 million.
- PS2 (Sony) Â is the top preferred console by users in other regions making a total of $193.44 million.
This implies that our assumption about Sony, Nintendo and Microsoft among the top consoles was correct.
We can use the same method to compare the top games and publishers for different regions.
Moving on to the next question.
3. What are the top 10 games currently making the most sales globally?
We will use a similar approach by grouping the games with respect to the global sales and observe the top 10 games.
top = pd.DataFrame(df.groupby("Name")[["Global_Sales"]].sum().sort_values(by=['Global_Sales'],ascending=[False]).reset_index())
top.head(10) #Printing the top 10 results
Name | Global_Sales | |
---|---|---|
0 | Wii Sports | 82.74 |
1 | Grand Theft Auto V | 55.92 |
2 | Super Mario Bros. | 45.31 |
3 | Tetris | 35.84 |
4 | Mario Kart Wii | 35.82 |
5 | Wii Sports Resort | 33.00 |
6 | Pokemon Red/Pokemon Blue | 31.37 |
7 | Call of Duty: Black Ops | 31.03 |
8 | Call of Duty: Modern Warfare 3 | 30.83 |
9 | New Super Mario Bros. | 30.01 |
We see the most played game is Wii Sports making a total of $82.74 million globally.
We will plot the above using a pie chart.
pie1 = px.pie(top, values=top['Global_Sales'][:10], names=top['Name'][:10],title='Top 10 games globally',
color_discrete_sequence=px.colors.sequential.Purp_r)
pie1.update_traces(textposition='inside', textinfo='percent+label',showlegend=False)
pie1.show()
The pie chart also shows the proportion of sales each game holds, while also depicting the results.
4. What are the top games for different regions?
We will have to compare the sales made by different games regionally. We will use the same approach we did while analyzing the platform.
Try answering this on your own, if you get stuck you can always refer to the lines of code below.
name2 = pd.DataFrame(df.groupby("Name")[["NA_Sales"]].mean().sort_values(by=['NA_Sales'],ascending=[False]).reset_index())
name2.rename(columns = {'Name':'Name_NA'}, inplace = True)
name3 = pd.DataFrame(df.groupby("Name")[["EU_Sales"]].mean().sort_values(by=['EU_Sales'],ascending=[False]).reset_index())
name3.rename(columns = {'Name':'Name_EU'}, inplace = True)
name4 = pd.DataFrame(df.groupby("Name")[["JP_Sales"]].mean().sort_values(by=['JP_Sales'],ascending=[False]).reset_index())
name4.rename(columns = {'Name':'Name_JP'}, inplace = True)
name5 = pd.DataFrame(df.groupby("Name")[["Other_Sales"]].mean().sort_values(by=['Other_Sales'],ascending=[False]).reset_index())
name5.rename(columns = {'Name':'Name_other'}, inplace = True)
#Concatenating the results.
name_df=pd.concat([name2,name3,name4,name5],axis=1)
The dataframe name_df
will give us the required result.
Plotting it in a similar fashion to understand the results obtained.
subplot_name1 = make_subplots(rows=4, cols=1, shared_yaxes=True,subplot_titles=("North American top games","Europe top games", "Japan top games","Other regions top games",'Top games globally'))
#Subplot for North America
subplot_name1.add_trace(go.Bar(x=name_df['Name_NA'][:5], y=name_df['NA_Sales'][:5],marker=dict(color=[1, 2, 3],coloraxis="coloraxis")),1, 1)
#Subplot for Europe
subplot_name1.add_trace(go.Bar(x=name_df['Name_EU'][:5], y=name_df['EU_Sales'][:5],marker=dict(color=[4, 5, 6], coloraxis="coloraxis")), 2, 1)
#Subplot for Japan
subplot_name1.add_trace(go.Bar(x=name_df['Name_JP'][:5], y=name_df['JP_Sales'][:5],marker=dict(color=[7, 8, 9], coloraxis="coloraxis")),3, 1)
#Subplot for other regions
subplot_name1.add_trace(go.Bar(x=name_df['Name_other'][:5], y=name_df['Other_Sales'][:5],marker=dict(color=[10, 11, 12], coloraxis="coloraxis")),4, 1)
subplot_name1.update_layout(height=1000,width=500,coloraxis=dict(colorscale='Mint_r'), showlegend=False)
subplot_name1.update_xaxes(tickangle=45)
subplot_name1.show()
The graph shows us the top games preferred by users in different regions and also globally. We observe the following:
- Wii Sports has been the top game in North America, Europe, other regions.
- Pokemon Red/Pokemon Blue is the top game in Japan.
5. Are there any games with release year older than 2000 that are still making high sales? What are they?
To achieve this we will first segregate the Year
less than 2000 from the rest of the dataframe. Doing this using the df.query()
function.
- query()- This is a filtering function that enables to select and filter the columns of a dataFrame with a boolean expression.
old_games = pd.DataFrame(df.query('Year<2000', inplace=False))
The above code returns a dataframe old_games
that has all the games with release date older than 2000. We can also see the first 5 rows using the function df.head()
. Moreover, using the function df.shape
we observe that the total number of such games are 1974. However, to answer this particular question, we will only focus on the old games making the most sales.
We are considering Global Sales to analyze the overall sales. Remember that you can use the same approach to find the sales for any other region as well.
We will target the data exceeding the 99th percentile to get only the games making the highest sales.
import numpy as np #import library
a=old_games['Global_Sales']
np.percentile(a, [99])
>>>array([7.8235])
Let's consider global sales greater than $7.82 million as high and proceed with the analysis.
old_games.query('Global_Sales>7.8235', inplace=False)
Here, we have only targeted those games which makes global sales more than $7.82 million.
Note that we have used this value to help us simplify our analysis. This can change according to your definition of high sales.
The final output should look like this:
Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | Super Mario Bros. | NES | 1985 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
5 | 6 | Tetris | GB | 1989 | Puzzle | Nintendo | 23.20 | 2.26 | 4.22 | 0.58 | 30.26 |
9 | 10 | Duck Hunt | NES | 1984 | Shooter | Nintendo | 26.93 | 0.63 | 0.28 | 0.47 | 28.31 |
12 | 13 | Pokemon Gold/Pokemon Silver | GB | 1999 | Role-Playing | Nintendo | 9.00 | 6.18 | 7.20 | 0.71 | 23.10 |
18 | 19 | Super Mario World | SNES | 1990 | Platform | Nintendo | 12.78 | 3.75 | 3.54 | 0.55 | 20.61 |
21 | 22 | Super Mario Land | GB | 1989 | Platform | Nintendo | 10.83 | 2.71 | 4.18 | 0.42 | 18.14 |
22 | 23 | Super Mario Bros. 3 | NES | 1988 | Platform | Nintendo | 9.54 | 3.44 | 3.84 | 0.46 | 17.28 |
30 | 31 | Pokémon Yellow: Special Pikachu Edition | GB | 1998 | Role-Playing | Nintendo | 5.89 | 5.04 | 3.12 | 0.59 | 14.64 |
46 | 47 | Super Mario 64 | N64 | 1996 | Platform | Nintendo | 6.91 | 2.85 | 1.91 | 0.23 | 11.89 |
50 | 51 | Super Mario Land 2: 6 Golden Coins | GB | 1992 | Adventure | Nintendo | 6.16 | 2.04 | 2.69 | 0.29 | 11.18 |
52 | 53 | Gran Turismo | PS | 1997 | Racing | Sony Computer Entertainment | 4.02 | 3.87 | 2.54 | 0.52 | 10.95 |
57 | 58 | Super Mario All-Stars | SNES | 1993 | Platform | Nintendo | 5.99 | 2.15 | 2.12 | 0.29 | 10.55 |
63 | 64 | Mario Kart 64 | N64 | 1996 | Racing | Nintendo | 5.55 | 1.94 | 2.23 | 0.15 | 9.87 |
66 | 67 | Final Fantasy VII | PS | 1997 | Role-Playing | Sony Computer Entertainment | 3.01 | 2.47 | 3.28 | 0.96 | 9.72 |
69 | 70 | Gran Turismo 2 | PS | 1999 | Racing | Sony Computer Entertainment | 3.88 | 3.42 | 1.69 | 0.50 | 9.49 |
71 | 72 | Donkey Kong Country | SNES | 1994 | Platform | Nintendo | 4.36 | 1.71 | 3.00 | 0.23 | 9.30 |
76 | 77 | Super Mario Kart | SNES | 1992 | Racing | Nintendo | 3.54 | 1.24 | 3.81 | 0.18 | 8.76 |
84 | 85 | GoldenEye 007 | N64 | 1997 | Shooter | Nintendo | 5.80 | 2.01 | 0.13 | 0.15 | 8.09 |
87 | 88 | Final Fantasy VIII | PS | 1999 | Role-Playing | SquareSoft | 2.28 | 1.72 | 3.63 | 0.23 | 7.86 |
Super Mario Bros. released in 1985 is the top old game still making global sales of $40.24 million. This is followed by Pokemon Red/Pokemon Blue (1996), Tetris (1989) and more.
Let's quickly hop on to the next question.
6. What are the top gaming genres that are making high sales?
The result can help game developers find out which genre is preferred in the market. It can also be used to compare gameplay across different games as one tool. These can be analyzed and used to develop strategies for future game releases.
To analyze the sales with respect to genre, we will consider global sales.
Give this question a try before looking at the code below. We will simply use the groupby
function on global sales and later visualize our result using a bar graph.
This is how the code should look like:
genre_df = df.groupby("Genre")[["Global_Sales"]].sum().sort_values(by=['Global_Sales'],ascending=[False]).reset_index()
genre_df #print the dataframe
The dataframe genre_df
will return the most preferred genres.
Genre | Global_Sales | |
---|---|---|
0 | Action | 1,751.18 |
1 | Sports | 1,330.93 |
2 | Shooter | 1,037.37 |
3 | Role-Playing | 927.37 |
4 | Platform | 831.37 |
5 | Misc | 809.96 |
6 | Racing | 732.04 |
7 | Fighting | 448.91 |
8 | Simulation | 392.20 |
9 | Puzzle | 244.95 |
10 | Adventure | 239.04 |
11 | Strategy | 175.12 |
bar_genre= px.bar(genre_df, x='Genre', y='Global_Sales',color='Global_Sales',color_continuous_scale='Burgyl')
bar_genre.show()
The bar graph clearly shows the preference of users for different genres. We observe that Action games are the most preferred across the globe, making a total of $1751.18 million.
7. Does the publisher have any impact on the regional sales?
We can quickly extract the top publishers for different regions using the same method we used in games and platforms. We can take the top 10 publishers for our analysis. Try comparing the publishers for different regions yourself and compare!
We observe the following from the bar graphs:
- Nintendo is the top publisher in North America, Europe and Japan making a total of $816.87 million, $418.74 million and $455.42 million respectively.
- For other regions, Electronic Arts is the top publisher making $129.77 million.
8. Is there any region that was out-performed global average sales?
We have already answered the above while answering the 1st question. We found that no region has out-performed global average sales.
You can look at the detailed explanation in the 1st question above.
And with that, we come to an end of our analysis.
Conclusions
We have managed to find the answer to each question, and we can now share our report with the game producer. This report will contain the following information:
- North America is the top region has the highest average sales of $264,667.430.
- The most preferred platforms are as follows:
- X360 (Microsoft) is the top preferred console by users in North America making a total of $601.05 million.
- PS3 (Sony) is the top preferred console by users in Europe making a total of $343.71 million.
- DS (Nintendo) is the top preferred console by users in Japan making a total of $175.57 million.
- PS2 (Sony) is the top preferred console by users in other regions making a total of $193.44 million.
3. The top most played game is Wii Sports making a total of $82.74 million    globally.
4. The top games are as follows:
- Wii Sports has been the top game in North America, Europe, other regions.
- Pokemon Red/Pokemon Blue is the top game in Japan.
5. Super Mario Bros. released in 1985 is the top old game still making global sales of $40.24 million. This is followed by Pokemon Red/Pokemon Blue (1996), Tetris (1989) and more.
6. Action games are the most preferred across the globe, making a total of $1751.18 million.
7. The top publishers are as follows:
- Nintendo is the top publisher in North America, Europe and Japan making a total of $816.87 million, $418.74 million and $455.42 million respectively.
- For other regions, Electronic Arts is the top publisher making $129.77 million.
8. There is no region that has out-performed the global average sales. The average global sales are $537,440.656 million.
This report summarizes the top games, platforms, publishers, genres are more attributes for different regions. The results will be beneficial in improving both sales and customer satisfaction.
We hope you find this analysis useful in addressing your plans for improving the next version of your game and have a clearer vision of how to work on datasets in similar sectors while also handling graphs using plotly.
Stay tuned for more pandas series.
If you are looking for jobs in AI and DS check out Deep Learning Careers
See you next time!