Video Game Sales Analysis

Analyzing the 'Video Game Sales Analysis' dataset with the help of pandas library and learn the fundamentals of plotly by visualizing the result using simple yet interactive graphs.

Video Game Sales Analysis

Welcome back to another pandas posts with us. We will be analyzing the video games sales analysis using pandas and visualizing the results using plotly.

Who doesn't like video games? Video games have become a favourite pastime for most of us. Video games are a billion-dollar business and have been for many years. Now imagine being able to analyze the sales trends for the popular Nintendo games and many more such platforms. How cool would it be!

In this module, we will be laying the foundation for our analysis by processing and exploring a large amount of data on video game sales. The dataset contains information regarding the sales of video games across various regions like North America, Europe, Japan and also globally, while also giving information regarding the Names, Publishers and Platforms. This dataset has been made available thanks to Kaggle which is the home for many such datasets and competitions.

Get the data here.

You are a data scientist in the gaming industry and want to analyze the sales pattern for different features. You want to investigate different features and approach the game producer with a thorough report. You want to find the answers to the following questions:

  1. Which region has performed the best in terms of sales?
  2. The top gaming consoles are Microsoft (Xbox), Sony (Playstation) and Nintendo, with Google acting as a new competitor. Does the dataset also back this information? Analyze with respect to different regions.
  3. What are the top 10 games currently making the most sales globally?
  4. What are the top games for different regions?
  5. Are there any games with release year older than 2000 that are still making high sales? What are they?
  6. What are the top gaming genres that are making high sales?
  7. Does the publisher have any impact on the regional sales?
  8. Is there any region that has out-performed global average sales?

We will be exploring the questions above using data, analytics and visualization techniques one at a time. So let's get started!

Before you continue reading, check out our new ChatGPT Tool for Job Seekers

Job hunting can be a demanding and time-consuming process. We believe our tool can take some of the stress out of it by leveraging AI to help you write compelling, personalized cover letters. We're excited to share this tool with you and look forward to hearing about your experiences and successes.

Experience the future of job application preparation. Give our ChatGPT-powered tool a try today!

https://jobseekerai.netlify.app/

Learning Objectives

After this module, we should be able to do the following:

  1. Use pandas library to analyze different features of the dataset, which includes,
  • Read the dataset.
  • Use functions like df.query and df.groupby.
  • Analyze the sales for different regions with respect to various features.

2. Use plotly library to visualize the given results.

  • Plot graphs like bar graphs and pie charts.

Understanding the dataset

This dataset contains a list of video games with sales greater than 100,000 copies. Each row in the dataset represents a game, whereas, each column contains different attributes.

The data set includes information about:

  • Rank - The overall rank of the game.
  • Name - The name of the game.
  • Platform- The platform on which the game was released.
  • Year- The year in which the game was released.
  • Genre-The genre of the game.
  • Publisher-The publisher of the game.
  • NA_Sales, EU_Sales, JP_Sales , Other_Sales, Global_Sales - Sales made by a particular game in North America, Europe, Japan, Other regions and Globally (in millions).

Pre-processing the dataset

To get started, we need to import some useful libraries that will help us import the dataset into our python environment, manipulate and analyze the same and later help us to visualize it.

#Importing libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings("ignore")
print("The modules are imported")

#Importing dataset
df=pd.read_csv("/content/vgsales.csv")
df.head()
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
0 1 Wii Sports Wii 2006.0 Sports Nintendo 41.49 29.02 3.77 8.46 82.74
1 2 Super Mario Bros. NES 1985.0 Platform Nintendo 29.08 3.58 6.81 0.77 40.24
2 3 Mario Kart Wii Wii 2008.0 Racing Nintendo 15.85 12.88 3.79 3.31 35.82
3 4 Wii Sports Resort Wii 2009.0 Sports Nintendo 15.75 11.01 3.28 2.96 33.00
4 5 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing Nintendo 11.27 8.89 10.22 1.00 31.37

Using df.shape function, we see that our dataset has 16598 rows and 11 columns.

Now, we will look for missing values using the df.isnull().sum() function. We observe that the column Year has 271 missing values and the column Publisher has 58 missing values. We will treat the missing values as follows:

  • ForYear, we use the code df["Year"].fillna(df["Year"].quantile(0.50),inplace = True)
  • ForPublisher, we use the codedf["Publisher"].dropna(inplace = True)

Now that we are done, it's time to jump to the questions.

💡
Before that, follow us on Twitter to stay updated about more such content.

1. Which region has performed the best in terms of sales?

We will utilize the average sales made per region and compare the results. Before we do that, let's make sure we know how to calculate the average. We will be using a very simple line of code for this, i.e., df['Region'].mean(), where 'region' depicts the different regions in the dataset and mean() function is used to calculate the mean. We observe that our output is coming in decimals, to convert the values in millions, let's multiple the result with 10,00,000. The final code should look like this.

x=(df['NA_Sales'].mean()*1000000)
y=(df['EU_Sales'].mean()*1000000)
z=(df['JP_Sales'].mean()*1000000)
q=(df['Other_Sales'].mean()*1000000)
p=(df['Global_Sales'].mean()*1000000)

print("The average sales in North America =", (f"${x:,.3f}")) #comma separated values till 3 decimal place and $ sign
print("The average sales in Europe =",(f"${y:,.3f}"))
print("The average sales in Japan =",(f"${z:,.3f}"))
print("The average sales in other regions =",(f"${q:,.3f}"))
print("The average sales globally =",(f"${p:,.3f}"))
The average sales in North America = $264,667.430
The average sales in Europe = $146,652.006
The average sales in Japan = $77,781.660
The average sales in other regions = $48,063.020
The average sales globally = $537,440.656

Let us now plot our findings on a bar chart.

colors = ['lightslategray',] * 4
colors[1]='darkgray'
colors[2]='grey'
colors[3]='dimgrey'
colors[0] = 'crimson'

bar1 = go.Figure(data=[go.Bar(
    y=['Global','North America', 'Europe', 'Japan',
       'Other'],
    x=[537440.656,264667.430, 146652.006, 77781.660, 48063.020],
    orientation='h',
    marker_color=colors # marker color can be a single color value or an iterable
)])
bar1.update_layout(title_text='Region with highest sales on an average')
bar1.update_xaxes(title='Average Sales')
bar1.update_yaxes(title='Regions')

We have successfully used a bar graph to demonstrate the results. It is clear now which region has made the most sales (expect global sales) in terms of video game. We observe that North America has the highest average sales of $264,667.430. It's clear now that we can use the bar graph to answer questions like, which country sells the most video games in the world? This will be help us make some strategy changes.

The above result also answers our 8th questions, i.e., Is there any region that has out-performed global average sales?

The answer is 'No'. There is no region that has out-performed the global average sales. The average global sales is $537,440.656 million.

2. The top gaming consoles are Microsoft (Xbox), Sony (Playstation) and Nintendo, with Google acting as a new competitor.

Does the dataset also back this information? Analyze with respect to different regions and also, globally.

The world is getting more connected every day, and more gaming trends and preferences are emerging around the globe. We want to know which console is preferred the most based on different regions. We will compile results from all over the world and determine if any regional trends might shape the gaming industry as a whole.

We will use the df.groupby function of pandas for our analysis.

  • groupby(): This function is used to group or combine large amounts of data and compute operations on these groups.

Have a look at the code given below.

#Grouping the north america sales based on each platform
data2 = pd.DataFrame(df.groupby("Platform")[["NA_Sales"]].sum().sort_values(by=['NA_Sales'],ascending=[False]).reset_index())
data2.rename(columns = {'Platform':'Platform_NA'}, inplace = True)

#Grouping the europe sales based on each platform
data3 = pd.DataFrame(df.groupby("Platform")[["EU_Sales"]].sum().sort_values(by=['EU_Sales'],ascending=[False]).reset_index())
data3.rename(columns = {'Platform':'Platform_EU'}, inplace = True)

#Grouping the japan sales based on each platform
data4 = pd.DataFrame(df.groupby("Platform")[["JP_Sales"]].sum().sort_values(by=['JP_Sales'],ascending=[False]).reset_index())
data4.rename(columns = {'Platform':'Platform_JP'}, inplace = True)

#Grouping the other region sales based on each platform
data5 = pd.DataFrame(df.groupby("Platform")[["Other_Sales"]].sum().sort_values(by=['Other_Sales'],ascending=[False]).reset_index())
data5.rename(columns = {'Platform':'Platform_other'}, inplace = True)

#Concatenating our datasets
data=pd.concat([data1,data2,data3,data4,data5],axis=1)
data.head(3)

The dataframe data we created will provide us with the total sales (global, NA, EU, JP, Others) based on different platforms. The dataframe will look like this:

Platform_glob Global_Sales Platform_NA NA_Sales Platform_EU EU_Sales Platform_JP JP_Sales Platform_other Other_Sales
0 PS2 1,255.64 X360 601.05 PS3 343.71 DS 175.57 PS2 193.44
1 X360 979.96 PS2 583.84 PS2 339.29 PS 139.82 PS3 141.93
2 PS3 957.84 Wii 507.71 X360 280.58 PS2 139.20 X360 85.54

We realize that the dataframe alone will make it very hard for us to analyze the result. Therefore, we plot the result using a line chart.

from plotly.subplots import make_subplots #import new library

subplot1 = make_subplots(rows=4, cols=1, shared_yaxes=True,subplot_titles=("North American top platforms","Europe top platforms","Japan top platforms","Other regions top platforms"))

#Subplot for North America
subplot1.add_trace(go.Bar(x=data['Platform_NA'], y=data['NA_Sales'],
                    marker=dict(color=[1, 2, 3],coloraxis="coloraxis")),1, 1)

#Subplot for Europe
subplot1.add_trace(go.Bar(x=data['Platform_EU'], y=data['EU_Sales'],
                    marker=dict(color=[4, 5, 6], coloraxis="coloraxis")),                         2, 1)
                   
#Subplot for Japan
subplot1.add_trace(go.Bar(x=data['Platform_JP'], y=data['JP_Sales'],
                    marker=dict(color=[7, 8, 9], coloraxis="coloraxis")),
                    3, 1)

##Subplot for Other Regions
subplot1.add_trace(go.Bar(x=data['Platform_other'], y=data['Other_Sales'],
                    marker=dict(color=[10, 11, 12], coloraxis="coloraxis")),
                   4, 1)
                   
subplot1.update_layout(height=900,width=500,coloraxis=dict(colorscale='Magenta'), showlegend=False)
subplot1.show()

The graph shows us the top platforms preferred by users in different regions. We observe the following:

  • X360 (Microsoft) is the top preferred console by users in North America making a total of $601.05 million.
  • PS3 (Sony) is the top preferred console by users in Europe making a total of $343.71 million.
  • DS (Nintendo) is the top preferred console by users in Japan making a total of $175.57 million.
  • PS2 (Sony)  is the top preferred console by users in other regions making a total of $193.44 million.

This implies that our assumption about Sony, Nintendo and Microsoft among the top consoles was correct.

We can use the same method to compare the top games and publishers for different regions.

Moving on to the next question.

3. What are the top 10 games currently making the most sales globally?

We will use a similar approach by grouping the games with respect to the global sales and observe the top 10 games.

top = pd.DataFrame(df.groupby("Name")[["Global_Sales"]].sum().sort_values(by=['Global_Sales'],ascending=[False]).reset_index())
top.head(10) #Printing the top 10 results
Name Global_Sales
0 Wii Sports 82.74
1 Grand Theft Auto V 55.92
2 Super Mario Bros. 45.31
3 Tetris 35.84
4 Mario Kart Wii 35.82
5 Wii Sports Resort 33.00
6 Pokemon Red/Pokemon Blue 31.37
7 Call of Duty: Black Ops 31.03
8 Call of Duty: Modern Warfare 3 30.83
9 New Super Mario Bros. 30.01

We see the most played game is Wii Sports making a total of $82.74 million globally.

We will plot the above using a pie chart.

pie1 = px.pie(top, values=top['Global_Sales'][:10], names=top['Name'][:10],title='Top 10 games globally', 
              color_discrete_sequence=px.colors.sequential.Purp_r)
pie1.update_traces(textposition='inside', textinfo='percent+label',showlegend=False)

pie1.show()

The pie chart also shows the proportion of sales each game holds, while also depicting the results.

4. What are the top games for different regions?

We will have to compare the sales made by different games regionally. We will use the same approach we did while analyzing the platform.

Try answering this on your own, if you get stuck you can always refer to the lines of code below.

name2 = pd.DataFrame(df.groupby("Name")[["NA_Sales"]].mean().sort_values(by=['NA_Sales'],ascending=[False]).reset_index())
name2.rename(columns = {'Name':'Name_NA'}, inplace = True)

name3 = pd.DataFrame(df.groupby("Name")[["EU_Sales"]].mean().sort_values(by=['EU_Sales'],ascending=[False]).reset_index())
name3.rename(columns = {'Name':'Name_EU'}, inplace = True)

name4 = pd.DataFrame(df.groupby("Name")[["JP_Sales"]].mean().sort_values(by=['JP_Sales'],ascending=[False]).reset_index())
name4.rename(columns = {'Name':'Name_JP'}, inplace = True)

name5 = pd.DataFrame(df.groupby("Name")[["Other_Sales"]].mean().sort_values(by=['Other_Sales'],ascending=[False]).reset_index())
name5.rename(columns = {'Name':'Name_other'}, inplace = True)

#Concatenating the results.
name_df=pd.concat([name2,name3,name4,name5],axis=1)

The dataframe name_df will give us the required result.

Plotting it in a similar fashion to understand the results obtained.

subplot_name1 = make_subplots(rows=4, cols=1, shared_yaxes=True,subplot_titles=("North American top games","Europe top games", "Japan top games","Other regions top games",'Top games globally'))

#Subplot for North America
subplot_name1.add_trace(go.Bar(x=name_df['Name_NA'][:5], y=name_df['NA_Sales'][:5],marker=dict(color=[1, 2, 3],coloraxis="coloraxis")),1, 1)

#Subplot for Europe
subplot_name1.add_trace(go.Bar(x=name_df['Name_EU'][:5], y=name_df['EU_Sales'][:5],marker=dict(color=[4, 5, 6], coloraxis="coloraxis")), 2, 1)

#Subplot for Japan
subplot_name1.add_trace(go.Bar(x=name_df['Name_JP'][:5], y=name_df['JP_Sales'][:5],marker=dict(color=[7, 8, 9], coloraxis="coloraxis")),3, 1)

#Subplot for other regions
subplot_name1.add_trace(go.Bar(x=name_df['Name_other'][:5], y=name_df['Other_Sales'][:5],marker=dict(color=[10, 11, 12], coloraxis="coloraxis")),4, 1)

subplot_name1.update_layout(height=1000,width=500,coloraxis=dict(colorscale='Mint_r'), showlegend=False)
subplot_name1.update_xaxes(tickangle=45)
subplot_name1.show()

The graph shows us the top games preferred by users in different regions and also globally. We observe the following:

  • Wii Sports has been the top game in North America, Europe, other regions.
  • Pokemon Red/Pokemon Blue is the top game in Japan.

5. Are there any games with release year older than 2000 that are still making high sales? What are they?

To achieve this we will first segregate the Year less than 2000 from the rest of the dataframe. Doing this using the df.query() function.

  • query()- This is a filtering function that enables to select and filter the columns of a dataFrame with a boolean expression.
old_games = pd.DataFrame(df.query('Year<2000', inplace=False))

The above code returns a dataframe old_games that has all the games with release date older than 2000. We can also see the first 5 rows using the function df.head(). Moreover, using the function df.shape we observe that the total number of such games are 1974. However, to answer this particular question, we will only focus on the old games making the most sales.

We are considering Global Sales to analyze the overall sales. Remember that you can use the same approach to find the sales for any other region as well.

We will target the data exceeding the 99th percentile to get only the games making the highest sales.

import numpy as np #import library
a=old_games['Global_Sales']
np.percentile(a, [99]) 

>>>array([7.8235])

Let's consider global sales greater than $7.82 million as high and proceed with the analysis.

old_games.query('Global_Sales>7.8235', inplace=False)

Here, we have only targeted those games which makes global sales more than $7.82 million.
Note that we have used this value to help us simplify our analysis. This can change according to your definition of high sales.
The final output should look like this:

Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
1 2 Super Mario Bros. NES 1985 Platform Nintendo 29.08 3.58 6.81 0.77 40.24
4 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo 11.27 8.89 10.22 1.00 31.37
5 6 Tetris GB 1989 Puzzle Nintendo 23.20 2.26 4.22 0.58 30.26
9 10 Duck Hunt NES 1984 Shooter Nintendo 26.93 0.63 0.28 0.47 28.31
12 13 Pokemon Gold/Pokemon Silver GB 1999 Role-Playing Nintendo 9.00 6.18 7.20 0.71 23.10
18 19 Super Mario World SNES 1990 Platform Nintendo 12.78 3.75 3.54 0.55 20.61
21 22 Super Mario Land GB 1989 Platform Nintendo 10.83 2.71 4.18 0.42 18.14
22 23 Super Mario Bros. 3 NES 1988 Platform Nintendo 9.54 3.44 3.84 0.46 17.28
30 31 Pokémon Yellow: Special Pikachu Edition GB 1998 Role-Playing Nintendo 5.89 5.04 3.12 0.59 14.64
46 47 Super Mario 64 N64 1996 Platform Nintendo 6.91 2.85 1.91 0.23 11.89
50 51 Super Mario Land 2: 6 Golden Coins GB 1992 Adventure Nintendo 6.16 2.04 2.69 0.29 11.18
52 53 Gran Turismo PS 1997 Racing Sony Computer Entertainment 4.02 3.87 2.54 0.52 10.95
57 58 Super Mario All-Stars SNES 1993 Platform Nintendo 5.99 2.15 2.12 0.29 10.55
63 64 Mario Kart 64 N64 1996 Racing Nintendo 5.55 1.94 2.23 0.15 9.87
66 67 Final Fantasy VII PS 1997 Role-Playing Sony Computer Entertainment 3.01 2.47 3.28 0.96 9.72
69 70 Gran Turismo 2 PS 1999 Racing Sony Computer Entertainment 3.88 3.42 1.69 0.50 9.49
71 72 Donkey Kong Country SNES 1994 Platform Nintendo 4.36 1.71 3.00 0.23 9.30
76 77 Super Mario Kart SNES 1992 Racing Nintendo 3.54 1.24 3.81 0.18 8.76
84 85 GoldenEye 007 N64 1997 Shooter Nintendo 5.80 2.01 0.13 0.15 8.09
87 88 Final Fantasy VIII PS 1999 Role-Playing SquareSoft 2.28 1.72 3.63 0.23 7.86

Super Mario Bros. released in 1985 is the top old game still making global sales of $40.24 million. This is followed by Pokemon Red/Pokemon Blue (1996), Tetris (1989) and more.

Let's quickly hop on to the next question.

6. What are the top gaming genres that are making high sales?

The result can help game developers find out which genre is preferred in the market. It can also be used to compare gameplay across different games as one tool. These can be analyzed and used to develop strategies for future game releases.

To analyze the sales with respect to genre, we will consider global sales.

Give this question a try before looking at the code below. We will simply use the groupby function on global sales and later visualize our result using a bar graph.

This is how the code should look like:

genre_df = df.groupby("Genre")[["Global_Sales"]].sum().sort_values(by=['Global_Sales'],ascending=[False]).reset_index()
genre_df #print the dataframe

The dataframe genre_df will return the most preferred genres.

Genre Global_Sales
0 Action 1,751.18
1 Sports 1,330.93
2 Shooter 1,037.37
3 Role-Playing 927.37
4 Platform 831.37
5 Misc 809.96
6 Racing 732.04
7 Fighting 448.91
8 Simulation 392.20
9 Puzzle 244.95
10 Adventure 239.04
11 Strategy 175.12
bar_genre= px.bar(genre_df, x='Genre', y='Global_Sales',color='Global_Sales',color_continuous_scale='Burgyl')
bar_genre.show()

The bar graph clearly shows the preference of users for different genres. We observe that Action games are the most preferred across the globe, making a total of $1751.18 million.

7. Does the publisher have any impact on the regional sales?

We can quickly extract the top publishers for different regions using the same method we used in games and platforms. We can take the top 10 publishers for our analysis. Try comparing the publishers for different regions yourself and compare!

We observe the following from the bar graphs:

  • Nintendo is the top publisher in North America, Europe and Japan making a total of $816.87 million, $418.74 million and $455.42 million respectively.
  • For other regions, Electronic Arts is the top publisher making $129.77 million.

8. Is there any region that was out-performed global average sales?

We have already answered the above while answering the 1st question. We found that no region has out-performed global average sales.

You can look at the detailed explanation in the 1st question above.

And with that, we come to an end of our analysis.

Conclusions

We have managed to find the answer to each question, and we can now share our report with the game producer. This report will contain the following information:

  1. North America is the top region has the highest average sales of $264,667.430.
  2. The most preferred platforms are as follows:
  • X360 (Microsoft) is the top preferred console by users in North America making a total of $601.05 million.
  • PS3 (Sony) is the top preferred console by users in Europe making a total of $343.71 million.
  • DS (Nintendo) is the top preferred console by users in Japan making a total of $175.57 million.
  • PS2 (Sony) is the top preferred console by users in other regions making a total of $193.44 million.

3. The top most played game is Wii Sports making a total of $82.74 million       globally.

4. The top games are as follows:

  • Wii Sports has been the top game in North America, Europe, other regions.
  • Pokemon Red/Pokemon Blue is the top game in Japan.

5. Super Mario Bros. released in 1985 is the top old game still making global sales of $40.24 million. This is followed by Pokemon Red/Pokemon Blue (1996), Tetris (1989) and more.

6. Action games are the most preferred across the globe, making a total of $1751.18 million.

7. The top publishers are as follows:

  • Nintendo is the top publisher in North America, Europe and Japan making a total of $816.87 million, $418.74 million and $455.42 million respectively.
  • For other regions, Electronic Arts is the top publisher making $129.77 million.

8. There is no region that has out-performed the global average sales. The average global sales are $537,440.656 million.

This report summarizes the top games, platforms, publishers, genres are more attributes for different regions. The results will be beneficial in improving both sales and customer satisfaction.

We hope you find this analysis useful in addressing your plans for improving the next version of your game and have a clearer vision of how to work on datasets in similar sectors while also handling graphs using plotly.

Stay tuned for more pandas series.

If you are looking for jobs in AI and DS check out Deep Learning Careers


See you next time!