Analyzing the Flipkart Sales Dataset to gain business insights
Let's learn to do a deep analysis of the Flipkart sales dataset using the pandas and plotly libraries. We will take into account various features to gain key insights into our data.
As one of Flipkart's analysts, you are asked to present a detailed report for the management on various aspects. You are provided with a dataset consisting of 20,000 purchases. We will use the Pandas library to analyze the dataset. Moreover, it is recommended to visualize the results using various graphs and charts.
Data Visualization of the result helps the decision-makers understand how the business data is being interpreted to determine business decisions. It also focuses on business insights to discover areas that require attention. Therefore, in addition to pandas, we will make use of the Plotly library as well.
You want to provide a detailed and easy-to-understand report of the dataset covering the following areas:
- What are the top performing products and brands?
- Which brands are offering the highest discount?
- What are some of the target customers that spend the most money on flipkart?
- A thorough analysis of product ratings:
- What is the proportion of 5 star rated products to the total products?
- What are the best and worst performing products and brands in terms of ratings?
- How many number of products are sold for different types of ratings?
- Is there any trend associated with the retail price and discount price over the months?
- When are customers the most active during the day?
Link to the dataset: https://www.kaggle.com/PromptCloudHQ/flipkart-products
Pre-processing the data
Let's import the necessary libraries.
import numpy as np
import pandas as pd
import ast
import plotly.express as px
from plotly import graph_objects as go
To load the dataset, we're going to use pd.read_csv()
.
df = pd.read_csv("flipkart_com-ecommerce_sample.csv")
We can inspect the dataframe using the head()
function.
We see that that dataset consists of columns like uniq_id, timestamp, item name, retail price and discounted price, and more.
There are many columns containing very precise and detailed information on each product that was ordered. This lets us analyze the data in more detail. However, as detailed as the columns are, we may not have all of the cells filled in with these useful values.
Let's see if we have any missing values.
df.isnull().sum()
uniq_id 0
crawl_timestamp 0
product_url 0
product_name 0
product_category_tree 0
pid 0
retail_price 78
discounted_price 78
image 3
is_FK_Advantage_product 0
description 2
product_rating 0
overall_rating 0
brand 5864
product_specifications 14
dtype: int64
We see that there are quite a few missing values, so we have to be careful with our analysis. We will treat the missing values for retail_price
and discounted_price
first using the median imputation method (mean imputation can only be used for variables with no outliers).
Let's see how.
df["retail_price"].fillna(df["retail_price"].median(),inplace=True)
df["discounted_price"].fillna(df["discounted_price"].median(),inplace=True)
We will add another column to our dataset discount_percentage
. This will give us the percentage of discount being offered by different brands. The discount percentage is a useful component that can be used later to draw very important insights.
x=df['retail_price']-df['discounted_price']
y=(x/df['retail_price'])*100
df['discount_percentage']=y
To make working with this dataset easy, extract some more columns like Time
date
and lastly main_category
with the help of the columns crawl_timestamp
and product_category_tree
respectively. Refer to the following lines of code if you get stuck.
df['timestamp']=pd.to_datetime(df['crawl_timestamp']) #converting into datetime to extract date and time easily
df['Time']=df['timestamp'].apply(lambda x : x.time) #extracting time
df['date']=df['timestamp'].apply(lambda x : x.date) #extracting date
df.drop(['crawl_timestamp'], axis = 1,inplace=True) #dropping the column
df['main_category']=df['product_category_tree'].apply(lambda x :x.split('>>')[0][2:len(x.split('>>')[0])-1]) #new column using product_category_tree
We are ready to start with our analysis!
Analyzing the dataset
- What are the top performing products and brands?
It's important to note that out of the total product_rating
in our dataset, only 1849 rows have relevant ratings in them. The rest of the rows have the value 'No rating available'. The same can be checked using the codes df['product_rating'].value_counts()
or len(df[df['product_rating']!='No rating available'])
. This leaves out a huge chunk of data. Therefore, it's important to take into consideration all the products and brands, irrespective of the ratings.
We will analyze the effect of ratings on products/brands in the later section of this post.
Let's find out the top products and brands that customers buy the most.
# Top 10 main products being purchased
n = 10
top_products=pd.DataFrame(df['main_category'].value_counts() [:n]).reset_index()
top_products.rename(columns = {'index':'Top_Products','main_category':'Total_Count'}, inplace = True)
#Top 10 main brands being purchased
n = 10
top_brands=pd.DataFrame(df['brand'].value_counts()[:n]).reset_index()
top_brands.rename(columns = {'index':'Top_Brands','brand':'Total_Count'}, inplace = True)
We have successfully created two separate dataframes, namely, top_products
and top_brands
which contain the first 10 most frequently purchased products and brands that Flipkart customers prefer over other products and brands.
Now we will visualize and see what these dataframes tell us.
from plotly.subplots import make_subplots #plotly library to create subplots
label1 = top_products['Top_Products']
value1=top_products['Total_Count']
label2=top_brands['Top_Brands']
value2=top_brands['Total_Count']
# Create subplots
fig_both = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig_both.add_trace(go.Pie(labels=label1, values=value1, name="Top Products",pull=[0.3, 0, 0, 0]),
1, 1)
fig_both.add_trace(go.Pie(labels=label2, values=value2, name="Top Brands",pull=[0.3, 0, 0, 0]),
1, 2)
# Use `hole` to create a donut-like pie chart
fig_both.update_traces(hole=.4, hoverinfo="label+percent+name")
#fig_both.update_traces(hoverinfo="label+percent+name")
fig_both.update_layout(
title_text="Top products and brands distribution",
#Add annotations in the center of the donut pies
annotations=[dict(text='Product', x=0.18, y=0.5, font_size=20, showarrow=False),
dict(text='Brand', x=0.82, y=0.5, font_size=20, showarrow=False)])
fig_both.show()
Bazinga! Our first graph is ready!
Plotly has made it very easy to analyze and understand the result using a Pie Chart with a hole (also known as a donut chart). The chart can easily be interpreted and important conclusions can be drawn.
We observe that the top product customers are purchasing is 'Clothing', whereas, the top brand is 'Regular'. The other top performing products and brands can be seen in the pie chart.
Note that the same result can be analyzed by printing the newly created dataframes top_products
and top_brands
.
Moving on to our next question.
- Which brands are offering the highest discount?
One of the major reasons why brands offering the highest discount is important for our analysis is because we can promote and recommend these target brands to our active customers. The high discount in itself is attractive, which can draw more customers to the shopping website. Anytime you tell a customer that they can save money, you’re likely to get their attention, which will eventually lead to an increase in sales.
The column discount_percentage that we created will come in handy while working on this problem. Let's learn how.
df_discount=df.query('discount_percentage > 90') #targeting brands giving high discounts
df_discount=df_discount.dropna() #dropping rows with NA values
df_discount["brand"].replace('FashBlush','Fash Blush',inplace=True) #handling spelling errors
max_discount=pd.DataFrame(df_discount.groupby('brand')[['discount_percentage']].mean().sort_values(by=['discount_percentage'],ascending=False).reset_index()) #creating a dataframe
Once we print the dataframe max_discount
, we will be shown with a list of all the brands that are giving 90% and higher discounts on the Flipkart website.
The code to visualize the result is as follows:
px.bar(max_discount, x= 'brand', y='discount_percentage',color='brand',color_discrete_sequence=px.colors.qualitative.Dark2) #plotting a bar graph
The bar graph has returned a very insightful result into the brands offering the highest discounts. Some of these brands are 'Rajcrafts', 'Bling', 'Fash Blush' and more.
We have successfully identified the target brands.
- What are some of the target customers that spend the most money on Flipkart?
It is important to take into consideration the most active customers on the platform. We assume that the customers that have spent the highest on the platform, shop the most from Flipkart. These potential customers are our target audience.
The column uniq_id
in our dataset is the unique ID that the system assigns to each individual customer using the website. We can make use of these unique IDs and treat them like unique customers. By grouping these customers with the sum of discount_price
spent by them, we can get a list of our potential buyers.
df_customer=df.groupby("uniq_id")[["discounted_price"]].sum().sort_values(by=['discounted_price'],ascending=[False]).reset_index()
#Top 20 customers spending the most
list1=df_customer[:20]
#plotting a bar graph
px.bar(list1, x= 'uniq_id', y="discounted_price",color='discounted_price',color_continuous_scale=px.colors.diverging.BrBG)
Well done! We have found out the top 20 customers that are spending the highest on Flipkart. The Flipkart accounts associated with these unique IDs are our areas of focus.
Note that we can find out the top 50, 100 or even 1000 customers spending the most using the same process.
Let's hop on to the next question.
A thorough analysis of product ratings:
- What is the proportion of 5 star rated products to the total products?
- What are the best and worst-performing products and brands in terms of ratings?
- How many products are sold for different types of ratings?
We will answer these questions one by one.
Before we answer the first question, it's important to understand the advantage of the star rating system. A star rating is a rating question that lets people rate a product/brand with a number of stars. They create standardization which allows the customers to compare different products/brands easily.
This is a great benefit of the rating system and can be used to segregate the top-quality products and brands with the rest. It goes without saying that the 5-star products/brands are the most trusted for the customers and should be recommended and promoted more.
It's important to note that not all products have a rating score. The information of the products with different ratings is coming from a very small proportion of the population. Let's first understand that out of the total products, how many are actually rated, and out of those rated products, how many are 5-star rated.
- What is the proportion of 5 star rated products to the total products?
# 5 star rating
total_prod=len(df['pid']) #total products using pid variable
total_ratings=len(df[df['product_rating']!='No rating available']) #total rated products
top_ratings=len(df[df['product_rating']=='5']) #5 star rated products
df_funnel_1 = dict(
number=[total_prod,total_ratings,top_ratings],
stage=["Total Products","Products with ratings","Products with 5 star rating"])
funnel_1_fig = px.funnel(df_funnel_1, x='number', y='stage')
funnel_1_fig.show()
The above graph is called a Funnel Chart. Funnel charts are often used to represent data in different stages of a business process.
The above graph has made it very easy for us to visualize the different levels, starting from the total products and ending at the 5 star products. We observe that out of 20,000 products, only 1849 have product ratings and only 620 products have 5 star ratings. This is a very small proportion of the total population. This answers our first question.
- What are the best and worst-performing products and brands in terms of ratings?
For the best performing brands, we will look at the 5-star products/brands and for the worst-performing, we will look at the 1-star products/brands. It's important to also look at the 1-star products/brands because these are particularly the problem areas. These products/brands can be further improved to meet customer requirements and increase the quality of the items on the website
The code for the same is given below.
#5 star products/brands
rating_5=pd.DataFrame(df.loc[df['product_rating'] == '5'])
top_product_type=rating_5['main_category'].value_counts() #top products
top_brand_type=rating_5['brand'].value_counts() #top brands
#top 5 products
df_top_product=pd.DataFrame(top_product_type[:5].reset_index()) #first 5
df_top_product.rename(columns = {'index':'top_prod'}, inplace = True)
df_top_product.drop('main_category', inplace=True, axis=1)
#top 5 brands
df_top_brand=pd.DataFrame(top_brand_type[:5].reset_index())
df_top_brand.rename(columns = {'index':'top_brands'}, inplace = True)
df_top_brand.drop('brand', inplace=True, axis=1)
df_top_brand.head()
#concatenating the 2 tables
df_product_brand_rate5=pd.concat([df_top_product,df_top_brand],axis=1)
We will do a similar procedure for the bottom products/brands (1 star).
Once we combine these two final dataframes, we will get the following dataframe.
top_prod | top_brands | bottom_prod | bottom_brand | |
---|---|---|---|---|
0 | Clothing | Regular | Home Improvement | Legrand |
1 | Jewellery | Slim | Health & Personal Care Appliances | Zobello |
2 | Footwear | Black | Kitchen & Dining | Shop Rajasthan |
3 | Watches | Bosch | Sports & Fitness | VR Designers |
4 | Kitchen & Dining | JDX | Baby Care | Trident |
The final table returns the top product and brand in the 5-star category is the same for the overall best performing product and brand. Also, the product 'Baby Care' and the brand 'Trident' are the worst product and brand in the 1-star category. Now, it is clear what are some of the categories that need improvement. We can also compare the top and bottom products/brands and upgrade them accordingly.
Before ending our analysis on ratings, let's find out the total number of products for each type of rating.
df.drop(df.index[df['product_rating'] == 'No rating available'], inplace = True)
ratings=pd.DataFrame(df['product_rating'].value_counts().reset_index())
ratings['index'] = ratings['index'].astype(float)
ratings.head().sort_values(by=['index'],ascending=[False])
ratings.rename(columns = {'index':'Ratings','product_rating':'Counts'}, inplace = True)
3plotting the result
data=ratings
x=ratings['Ratings']
y=ratings['Counts']
figdot2 = go.Figure()
figdot2.add_trace(go.Scatter(
x=x,
y=y,
marker=dict(color="crimson", size=12),
mode="markers",
name="ratings",
))
figdot2.update_layout(title="Ratings v/s Count",
xaxis_title="Ratings",
yaxis_title="Count",
)
figdot2.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
figdot2.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
figdot2.show()
The above is called a Dot Chart. One of the simplest ways to visualize data.
We observe that 5-star products have the highest count, followed by 4-star, 1 star, 3-star, and 2-star. We can look at the count by placing our cursor on different dots.
We have completed our analysis on ratings thoroughly. Let's see what our next question is.
- Is there any trend associated with the retail price and discount price over the months?
One of the reasons to analyze the trend of retail and discount prices is to understand during what time of the year the prices have spiked up or down. Doing this analysis gives us some valuable insight into the workings of different features of the dataset. We can use this data to make informed decisions like when to raise or lower prices on products. These decisions shouldn't always be a “gut” feeling.
df_date_retail = pd.DataFrame(df.groupby("date")[["retail_price"]].mean().reset_index())
df_date_discount = pd.DataFrame(df.groupby("date")[["discounted_price"]].mean().reset_index())
df_date_price=pd.concat([df_date_retail,df_date_discount],axis=1)
df_date_price = df_date_price.loc[:,~df_date_price.columns.duplicated()] #remove duplicate columns
#Plot
x=df_date_price['date']
y1=df_date_price['retail_price']
y2=df_date_price['discounted_price']
fig_area2 = go.Figure()
fig_area2.add_trace(go.Scatter(x=x, y=y1, fill='tozeroy',name='retail price',
line=dict(width=0.5, color='crimson'))) # fill down to xaxis
fig_area2.add_trace(go.Scatter(x=x, y=y2, fill='tozeroy',name='discount price',
line=dict(width=0.5, color='darkslategray')
)) # fill to trace0 y
fig_area2.update_layout(
xaxis_title="Dates",
yaxis_title="Price (in 1000s)",
plot_bgcolor='white'
)
fig_area2.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
fig_area2.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
fig_area2.show()
The above plot is an example of an Area Chart. We observe that in the months of December, February, April, and May the prices had spiked up, which can be due to various reasons.
- When are customers the most active during the day?
Let's find out around what time during the day are customers the most active.
We can relate the customer activity with the different product URLs being clicked using the column product_url
. This will tell us when different URLs are being clicked during the day.
One of the reasons why it is an important thing to analyze is to understand when the rush hours are more and we can improve the Flipkart platform performance during these hours. Moreover, ads for different brands can also run during these active hours.
We will use the date column with the product_url
for the analysis. Below, we have plotted a scatter plot to visualize the result and gain key insights from the same.
scat2 = px.scatter(x=df['Time'].sort_values(ascending=True), y=df['product_url'])
scat2.update_layout(
title_text='No. of clicks vs time', # title of plot
xaxis_title_text='Time', # xaxis label
yaxis_title_text='No. of Clicks', # yaxis label
)
#scat.update_xaxes(showticklabels=False)
scat2.update_yaxes(showticklabels=False)
scat2.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
scat2.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
scat2.show()
We see that user activity keeps increasing throughout the day.
With this, we come to an end of our analysis.
You can think more features in this dataset that can provide with significant information and work on them in a similar fashion.
Conclusion:
Hence, we conclude that:
- The top product customers are purchasing is 'Clothing', whereas, the top brand is 'Regular'.
- Brands giving highest discounts are 'Rajcrafts', 'Bling', 'Fash Blush' and more.
- Learnt how to target active customers.
- Out of 20,000 products only 620 products have 5 star ratings.
- The product 'Baby Care' and the brand 'Trident' are the lowest performing product and brand in the 1-star category.
- In the months of December, February, April, and May the prices spiked up.
We have found the answers to all the questions, and we can now share our report with the team, together with the processed data for further analysis.
Hope that now you feel more confident to handle pandas and plotly libraries for your analysis.
Waiting for you in our next pandas series.
See you then!