Data Analysis and Visualization in the Retail/FMCG Sector
We have worked with datasets from various sectors in our previous pandas series. Let us now learn how to analyze a dataset from the retail and FMCG sector, but this time let's take things to the next level and bring a plotting library into the mix. We will work with Pandas and Plotly.
The Plotly Python library is an interactive, open-source plotting library that supports over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use cases.
Let's get our hands dirty and experiment with Plotly for visualization in this article.
To understand how data is handled in the retail/FMCG sector, the 'BigMart Sales' dataset is a perfect fit. While working with it, we can learn to manipulate the data to our advantage and draw key insights while also visualizing the results.
Imagine you are working on the procurement and logistics team of a company that wants to launch a chain of supermarkets/grocery stores across the country. Your team wants to understand purchasing patterns across different aspects of the dataset.
You want to know how the following factors affect sales:
- Does the outlet type (grocery store or supermarket) have any impact on the overall sales?
- Which type of city has the most overall sales? Or which outlet location makes the most overall sales?
- Does the outlet size have any impact on the overall sales?
- Which category of products sells the most and the least?
- Do the product visibility and weight have any impact on the sales of the product?
- What is the average MRP of the product that sells the most and the least? Which category do these products fall under?
- What are some products that sell better in Tier 1 cities as compared to Tier 2 and Tier 3 cities?
- Are there any products selling better in Tier 2 and 3 cities as compared to Tier 1 cities?
Data to download: https://www.kaggle.com/brijbhushannanda1979/bigmart-sales-data?select=Train.csv
The training dataset, 'Train.csv', will be sufficient to answer all of these questions, so we will use it for our analysis.
Pre-Processing the data
To get started, we need to import some useful libraries that will help us bring the dataset into our Python environment, manipulate and analyze it, and later visualize it.
import pandas as pd
import numpy as np
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
print("The modules are imported")
While using platforms like Google Colab, we can mount our Google Drive and keep the data there; otherwise, we give Pandas the direct path to the file on our system.
Assuming 'Train.csv' sits in the working directory, we import the dataset:
df = pd.read_csv('Train.csv')
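If you are working on Google Colab with the file kept in Drive, a minimal sketch looks like this (the path below is hypothetical; adjust it to wherever you saved the file):
from google.colab import drive # only needed on Colab
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/Train.csv') # hypothetical Drive path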
Before moving forward, let's have a quick look at some of the important attributes of the dataset.
df.head()
| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.300 | Low Fat | 0.016 | Dairy | 249.809 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.920 | Regular | 0.019 | Soft Drinks | 48.269 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.423 |
| 2 | FDN15 | 17.500 | Low Fat | 0.017 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.270 |
| 3 | FDX07 | 19.200 | Regular | 0.000 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.380 |
| 4 | NCD19 | 8.930 | Low Fat | 0.000 | Household | 53.861 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.705 |
Our dataset has 8523 rows and 12 columns.
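You can verify these dimensions yourself with the shape attribute:
df.shape # returns (8523, 12): 8523 rows and 12 columns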
Moreover, the columns 'Item_Identifier' and 'Outlet_Identifier' are of no use to us, so we will remove them from the dataset.
df.drop(['Item_Identifier','Outlet_Identifier'],axis=1,inplace=True)
Before we can answer any questions, it's important to do some general EDA on our dataset.
We will begin with missing-value imputation.
df.isnull().sum()
Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 2410
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64
Looking at the output, we observe that only 2 columns have missing values: Item_Weight and Outlet_Size.
We can use mean or median imputation for numerical variables and mode imputation for categorical variables.
Starting with Item_Weight, which is a numerical variable, we need to choose between mean and median imputation. To decide, let's first look at a boxplot of the column to check for the presence of outliers.
Plotting the Boxplot using Plotly:
fig = px.box(df, y="Item_Weight")
fig.show()
Since the boxplot shows no outliers, it is safe to use mean imputation.
df["Item_Weight"].fillna(df["Item_Weight"].mean(),inplace=True)
Checking if all the missing values are filled.
df["Item_Weight"].isnull().sum()
0
We have successfully filled all missing values for Item_Weight.
We follow the same procedure for 'Outlet_Size', but since it is a categorical variable, we use mode imputation.
df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0],inplace=True)
df["Outlet_Size"].isnull().sum()
Item_Weight 0
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64
Voila! We have successfully filled in all of the missing values.
Now we are ready to go!
Analysis of Sales Corresponding to Each Factor
Outlet Type v/s Sales
To analyze whether and how the outlet type is correlated with sales, we will again use the groupby() function of pandas, as we did while analyzing the 'Chipotle Orders' dataset.
As a quick refresher, the groupby() function is used to group or combine large amounts of data and compute operations on these groups.
Here, we combine groupby() with some other simple yet useful pandas functions like sort_values(), reset_index() and set_option().
outlet_type_sales = df.groupby("Outlet_Type")[["Item_Outlet_Sales"]].sum()
pd.set_option('display.float_format', lambda x: '%.3f' % x) #converting from scientific notation to numerical format
outlet_type_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
In the above snippet, we group the data by 'Outlet_Type' and sum up 'Item_Outlet_Sales' for each outlet type using the sum() function, storing the result in a variable named 'outlet_type_sales'.
Before reading off the overall sales per outlet type, we convert the values from scientific notation to plain numeric format using pandas' set_option() function. We then arrange the resulting dataframe in descending order (outlet type with the most sales on top) using sort_values(), and finally reset the index.
| | Outlet_Type | Item_Outlet_Sales |
|---|---|---|
| 0 | Supermarket Type1 | 12917342.263 |
| 1 | Supermarket Type3 | 3453926.051 |
| 2 | Supermarket Type2 | 1851822.830 |
| 3 | Grocery Store | 368034.266 |
From the output, we note that the outlet type 'Supermarket Type1' makes the most overall sales across all cities, whereas 'Grocery Store' makes the least.
Let's take this one step further and visualize the result.
One of the many benefits of visualizing results is that patterns make more sense when represented graphically: visuals and diagrams make it easier for us to spot strongly correlated parameters.
So, let's learn how to use Plotly to our advantage. We already imported the library at the beginning of our session.
data = df.groupby("Outlet_Type")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
px.bar(data, x= 'Outlet_Type', y="Item_Outlet_Sales",color='Outlet_Type',color_discrete_sequence=px.colors.qualitative.Pastel1)
The variable data holds the grouped overall sales per outlet type (computed the same way as above). We put Outlet_Type on the x-axis; it has 4 values: Supermarket Type1, Supermarket Type2, Supermarket Type3, and Grocery Store. The y-axis shows the summed Item_Outlet_Sales for each value on the x-axis.
The color_discrete_sequence argument helps us customize the colour sequence for the plot. Here, we use the 'Pastel1' sequence via px.colors.qualitative.Pastel1.
You can preview all of the built-in qualitative colour sequences in the px.colors.qualitative module with the following code:
fig = px.colors.qualitative.swatches()
fig.show()
Make sure the Plotly library is already imported before running the above code.
Note that the color argument chooses which variable is used to colour the bars and populate the legend displayed beside the plot; here, we pass 'Outlet_Type'.
This is how the final plot looks.
The plot is self-explanatory, thanks to plotly, we can observe the outlet type v/s sales in the hover data wherever we place our cursor.
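If you want more control over what the tooltip shows, one option (a small sketch, not part of the original walkthrough) is Plotly's hovertemplate:
fig = px.bar(data, x='Outlet_Type', y='Item_Outlet_Sales', color='Outlet_Type', color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.update_traces(hovertemplate='%{x}: %{y:,.0f}<extra></extra>') # e.g. "Supermarket Type1: 12,917,342"
fig.show()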
We will follow a similar procedure while analyzing the overall sales for:
- Outlet_Location_Type
- Outlet_Size
- Item_Visibility and Item_Weight
- Item_Type
- Item_MRP
Later, we will compare the sales trend for Tier 1 cities v/s Tier 2 & 3 cities.
Outlet location v/s Sales
We use the same functions to see the sales per outlet location, grouping the data by 'Outlet_Location_Type' and aggregating the corresponding sales. This time, we convert the scientific notation at the very beginning.
pd.set_option('display.float_format', lambda x: '%.3f' % x)
location_type_sales = df.groupby("Outlet_Location_Type")[["Item_Outlet_Sales"]].sum().reset_index()
location_type_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False])
| | Outlet_Location_Type | Item_Outlet_Sales |
|---|---|---|
| 0 | Tier 3 | 7131575.550 |
| 1 | Tier 2 | 6066611.805 |
| 2 | Tier 1 | 4223880.474 |
The output shows that Tier 3 cities perform the best in terms of sales, followed by Tier 2 and lastly, Tier 1 cities.
This is a very important inference, since it contradicts the common hypothesis that Tier 1 cities would generate the most sales, followed by Tier 2 and Tier 3 cities.
Visualizing the same using plotly.
data = df.groupby("Outlet_Location_Type")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
px.bar(data, x= 'Outlet_Location_Type', y="Item_Outlet_Sales",color='Outlet_Location_Type',color_discrete_sequence=px.colors.qualitative.T10)
Using the color sequence T10 in this case.
Outlet Size v/s Sales
Let’s do the same analysis for outlet size v/s corresponding sales.
pd.set_option('display.float_format', lambda x: '%.3f' % x)
outlet_size_sales = df.groupby("Outlet_Size")[["Item_Outlet_Sales"]].sum().reset_index()
outlet_size_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False])
| | Outlet_Size | Item_Outlet_Sales |
|---|---|---|
| 0 | Medium | 11119470.539 |
| 1 | Small | 4300311.651 |
| 2 | High | 2002285.640 |
'Medium' outlets have outperformed 'Small' and 'High' outlets.
The result is even clearer when visualized with Plotly.
data = df.groupby("Outlet_Size")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
px.bar(data, x= 'Outlet_Size', y="Item_Outlet_Sales",color='Outlet_Size',color_discrete_sequence=px.colors.qualitative.Set3)
Using the colour sequence Set3 in this case.
Let’s move on to the next factor analysis.
Item Category v/s Sales
Try to analyze the data on your own for the variables 'Item_Type' and 'Item_Outlet_Sales' before looking at the code below.
Don’t worry, you already know how to do it!
If you are done, evaluate your code against the following reference (note that we take the maximum sales value recorded for each item type):
df.groupby("Item_Type")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
| | Item_Type | Item_Outlet_Sales |
|---|---|---|
| 0 | Household | 13086.965 |
| 1 | Fruits and Vegetables | 12117.560 |
| 2 | Snack Foods | 10993.690 |
| 3 | Canned | 10306.584 |
| 4 | Dairy | 10256.649 |
| 5 | Health and Hygiene | 9779.936 |
| 6 | Frozen Foods | 9678.069 |
| 7 | Soft Drinks | 9554.230 |
| 8 | Meat | 9390.443 |
| 9 | Breads | 8958.339 |
| 10 | Breakfast | 8209.314 |
| 11 | Starchy Foods | 8132.081 |
| 12 | Baking Goods | 7931.675 |
| 13 | Hard Drinks | 7843.124 |
| 14 | Seafood | 6503.534 |
| 15 | Others | 6008.845 |
We observe that the 'Household' item category records the highest sales, followed by 'Fruits and Vegetables', whereas the 'Others' category records the lowest. This analysis aggregates over all cities, outlet sizes, and outlet types.
Give Plotly a try on your own too, using any Plotly colour sequence of your choice.
Use the lines of code below as a reference:
data=df.groupby("Item_Type")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
px.bar(data, x= 'Item_Type', y="Item_Outlet_Sales",color='Item_Type',color_discrete_sequence=px.colors.qualitative.Bold)
Congratulations! You have successfully used plotly to visualize the result all by yourself!
The plot looks vibrant and equally informative. We have used the color sequence Bold in this case.
Moving on to our next factor analysis.
Item Visibility and Weight v/s Sales
Analyzing these 2 factors one by one will give us a better understanding of the result. Let's start with the item visibility.
First, we will have a look at the unique values for Item_Visibility using the pandas function unique(), which returns an array of all the unique values in a column.
df['Item_Visibility'].unique()
array([0.0000000, 0.0160473, 0.01927822, 0.01676007, ..., 0.03518627, 0.14522065, 0.04487828])
We see that Item_Visibility = 0 does not make sense for a product that is actually on the shelf, so those rows should be removed.
df.drop(df.loc[df['Item_Visibility']==0].index, inplace=True)
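An equivalent, arguably more readable way to achieve the same filtering is a boolean mask:
df = df[df['Item_Visibility'] > 0] # keep only rows with non-zero visibility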
Now, we are ready for the analysis.
pd.set_option('display.float_format', lambda x: '%.3f' % x)
item_visibility_sales = df.groupby("Item_Visibility")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
item_visibility_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False])
It is important to note that Item_Visibility has 7880 distinct values, so the grouped result has 7880 rows. In such cases, bar graphs are not a suitable plotting option.
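If you do want a bar-style summary of a continuous column like this, one common workaround (a sketch, not part of the original analysis) is to bin the values first with pd.cut:
visibility_bins = pd.cut(df['Item_Visibility'], bins=10) # 10 equal-width intervals
binned = df.groupby(visibility_bins)[['Item_Outlet_Sales']].sum().reset_index()
binned['Item_Visibility'] = binned['Item_Visibility'].astype(str) # Plotly cannot plot Interval objects directly
px.bar(binned, x='Item_Visibility', y='Item_Outlet_Sales')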
We will visualize the result with the help of a scatter plot. Let's see how.
fig = px.scatter(df,x='Item_Visibility', y='Item_Outlet_Sales',color='Item_Visibility')
fig.show()
We observe that items with visibility between 0.005 and 0.18 generate the most sales, whereas items with visibility above 0.2 generate the least. In other words, products with very high visibility are not making many sales. There can be several reasons for this: for example, big items (with more shelf visibility) may be more expensive and therefore sell less, or they may be harder to carry around and so are not purchased daily or in bulk.
Since Item_Visibility may (but need not) be correlated with Item_Weight, let's check whether an Item_Weight analysis gives the same result.
First, we check that Item_Weight is greater than zero by simply looking at the minimum value of the variable.
df['Item_Weight'].min()
4.555
We see that the minimum value is 4.555 units, so we can proceed with the next steps.
pd.set_option('display.float_format', lambda x: '%.3f' % x)
item_visibility_sales = df.groupby("Item_Weight")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
item_visibility_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False])
We see that products with a weight of 12.85 units have made the most sales, while those weighing around 9.1 units have the least.
Visualizing the same using a scatter plot to get a clear picture.
fig = px.scatter(df,x='Item_Weight', y='Item_Outlet_Sales',color='Item_Weight')
fig.show()
The scatter plot also shows that products weighing around 12.85 units make the most sales; the total sales for other item weights are scattered across the plot.
Item MRP v/s Sales
Here we answer the remaining parts of our question list: what is the MRP of the products that sell the most and the least, and which category do these products fall under?
Checking the sales for different item MRPs below:
df.groupby("Item_MRP")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
The output will show that the product with MRP 234.9958 has the highest sales, followed by the product with MRP 253.0356 with the second highest.
The product with MRP 31.290 has the least sales. Let's find out what these items are.
Finding the item type for the Item_MRP making the highest sales:
df.loc[df['Item_MRP'] ==234.9958]
| | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 7061 | 20.350 | Regular | 0.015 | Fruits and Vegetables | 234.996 | 1987 | High | Tier 3 | Supermarket Type1 | 5842.395 |
| 7189 | 12.858 | Low Fat | 0.011 | Household | 234.996 | 1985 | Medium | Tier 3 | Supermarket Type3 | 13086.965 |
The item types for items with MRP 234.996 are 'Fruits and Vegetables' and 'Household'.
Finding the item type for the Item_MRP making the lowest sales:
df.loc[df['Item_MRP'] ==31.290]
| | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 154 | 12.858 | Low Fat | 0.042 | Soft Drinks | 31.290 | 1985 | Medium | Tier 3 | Supermarket Type3 | 898.830 |
The item type for items with MRP 31.290 is 'Soft Drinks'.
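Our question list also asked about the average MRP per category. A quick sketch (not part of the original walkthrough) to compute it for the categories we just found:
avg_mrp = df.groupby('Item_Type')['Item_MRP'].mean() # mean MRP per item category
print(avg_mrp[['Fruits and Vegetables', 'Household', 'Soft Drinks']])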
Visualize the above analysis using a scatter plot by following the same procedure that we discussed before.
Give it a try!
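If you would like to check your attempt, one possible version looks just like the earlier scatter plots:
fig = px.scatter(df, x='Item_MRP', y='Item_Outlet_Sales', color='Item_MRP')
fig.show()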
Products performing better and worse in Tier 1 cities v/s Tier 2 & 3 cities in terms of Sales
In this last section, we will answer our final two questions:
- What are some products that sell better in Tier 1 cities as compared to Tier 2 and Tier 3 cities?
- Are there any products selling better in Tier 2 and 3 cities as compared to Tier 1 cities?
We will follow a different approach to reach our target.
First, we will split our data into two categories:
- Tier 1 cities
- Tier 2 & 3 cities
This will result in two different dataframes. The code is as follows:
df_tier1 = df.loc[df['Outlet_Location_Type'] == 'Tier 1'] # Tier 1
df_tier2 = df.loc[df['Outlet_Location_Type'] == 'Tier 2'] # Tier 2
df_tier3 = df.loc[df['Outlet_Location_Type'] == 'Tier 3'] # Tier 3
df_cities = pd.concat([df_tier2, df_tier3], axis=0) # combining dfs for Tier 2 & 3
We have saved the Tier 1 cities data in the variable "df_tier1" and the Tier 2 & 3 data in the variable named "df_cities".
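As a side note, the Tier 2 & 3 subset can also be built in a single step with isin(), which is equivalent to the concat above:
df_cities = df[df['Outlet_Location_Type'].isin(['Tier 2', 'Tier 3'])] # Tier 2 and Tier 3 rows in one go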
Now, we will analyze how each product performs, first for Tier 1 cities and then for Tier 2 & 3 cities.
- For Tier 1 cities:
df_tier1_list=df_tier1.groupby("Item_Type")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
df_tier1_list.rename(columns = {'Item_Type':'Items_Tier_1','Item_Outlet_Sales':'Sales_tier_1'}, inplace = True)
| | Items_Tier_1 | Sales_tier_1 |
|---|---|---|
| 0 | Health and Hygiene | 9779.936 |
| 1 | Starchy Foods | 8132.081 |
| 2 | Dairy | 7833.803 |
| 3 | Canned | 7590.120 |
| 4 | Fruits and Vegetables | 7549.506 |
| 5 | Frozen Foods | 7370.406 |
| 6 | Household | 7148.029 |
| 7 | Soft Drinks | 7017.532 |
| 8 | Breads | 6972.258 |
| 9 | Breakfast | 6930.978 |
| 10 | Snack Foods | 6692.622 |
| 11 | Seafood | 6503.534 |
| 12 | Meat | 6331.758 |
| 13 | Hard Drinks | 6308.455 |
| 14 | Baking Goods | 6093.402 |
| 15 | Others | 6008.845 |
- For Tier 2 & 3 cities:
df_cities_list=df_cities.groupby("Item_Type")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
df_cities_list.rename(columns = {'Item_Type':'Items_Tier2_3','Item_Outlet_Sales':'Sales_tier2_3'}, inplace = True)
| | Items_Tier2_3 | Sales_tier2_3 |
|---|---|---|
| 0 | Household | 13086.965 |
| 1 | Fruits and Vegetables | 12117.560 |
| 2 | Snack Foods | 10993.690 |
| 3 | Canned | 10306.584 |
| 4 | Dairy | 10256.649 |
| 5 | Frozen Foods | 9678.069 |
| 6 | Soft Drinks | 9554.230 |
| 7 | Meat | 9390.443 |
| 8 | Breads | 8958.339 |
| 9 | Breakfast | 8209.314 |
| 10 | Baking Goods | 7931.675 |
| 11 | Hard Drinks | 7843.124 |
| 12 | Starchy Foods | 7443.644 |
| 13 | Health and Hygiene | 7142.702 |
| 14 | Seafood | 5992.200 |
| 15 | Others | 5546.114 |
Now we combine these two dataframes into one named 'df_sales'. One caveat: since each list is sorted by its own sales column, a plain pd.concat(axis=1) would pair up rows that refer to different item types. Merging on the item type instead keeps every row aligned to a single category:
df_sales = df_tier1_list.merge(df_cities_list, left_on='Items_Tier_1', right_on='Items_Tier2_3') # align rows by item type
df_sales
The output shows the combined dataframe, which lets us compare the sales for each item type across the two categories of cities.
Let's plot the same data. Intuitively, a line plot is a better fit here: it will help us compare the sales trend for Tier 1 v/s Tier 2 & 3 cities.
Before we proceed, it's important to import another Plotly module, Graph Objects.
import plotly.graph_objects as go #importing the library
x=df_sales['Items_Tier_1']
y1=df_sales['Sales_tier_1']
y2=df_sales['Sales_tier2_3']
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y1,
mode='lines+markers',line_color='rgb(0,100,80)',
name='tier 1 sales'))
fig.add_trace(go.Scatter(x=x, y=y2,
mode='lines+markers',
name='tier 2&3 sales'))
fig.show()
One quick look at the graph shows that the Tier 2 & 3 line sits above the Tier 1 line for most item types. Matching the two tables above item by item, Tier 1 comes out ahead only for a handful of categories, 'Health and Hygiene', 'Starchy Foods', 'Seafood', and 'Others', some of them by only a slight margin.
Overall, sales generated by Tier 2 and 3 cities clearly outnumber those of Tier 1 cities.
And with that, we come to the end of our analysis.
Conclusions
Hence we conclude:
- Sales were highest for:
  - Supermarket Type1
  - Tier 3 cities
  - Medium-sized outlets
  - Household items
  - Items weighing around 12.85 units
- Tier 2 and 3 cities have outperformed Tier 1 cities in terms of sales for most item types; only a few categories ('Health and Hygiene', 'Starchy Foods', 'Seafood', and 'Others') did better in Tier 1.
We have managed to find an answer to each question, and we can now share our conclusions with the team, together with the processed data for further analysis.
Hopefully you now have a clearer picture of how to work with datasets in the retail/FMCG sector and can build simple graphs using Plotly.
Stay tuned for more pandas series.
See you next time!