Data Analysis and Visualization in the retail/FMCG sector

We have worked with datasets of various sectors in our previous pandas series. Let us now learn how to analyze a dataset in the retail and FMCG sector using Pandas, but this time let’s take this to the next level and include some plotting libraries as well. We will work with Pandas and Plotly.

The Plotly python library is an interactive, open-source plotting library that supports over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use-cases.

Let’s get our hands dirty and experiment with plotly for visualization purposes in this article.

To understand how data is handled in the retail/FMCG sector, the ‘BigMart Sales’ dataset is a perfect fit. While working on the same, we can learn to manipulate the data to our advantage and draw key insights while also visualizing the results.

Imagine you are working in the procurement and logistics team of a company that wants to launch a chain of supermarkets/grocery stores across the country. Your team wants to optimize the purchasing patterns for different aspects of the dataset.

You want to know how the following factors affect the sales:-

  1. Does the outlet type (grocery store or supermarket) have any impact on the overall sales?
  2. Which type of city has the most overall sales? Or which outlet location makes the most overall sales?
  3. Does the outlet size have any impact on the overall sales?
  4. Which category of products sells the most and the least?
  5. Do the product visibility and weight have any impact on the sales of the product?
  6. What is the average MRP of the product that sells the most and the least? Which category do these products fall under?
  7. What are some products that sell better in Tier 1 cities as compared to Tier 2 and Tier 3 cities?
  8. Are there any products selling better in Tier 2 and 3 cities as compared to Tier 1 cities?

Data to download: https://www.kaggle.com/brijbhushannanda1979/bigmart-sales-data?select=Train.csv

The training dataset, ‘Train.csv’ will be sufficient to answer all of the questions, therefore, we will be using it for our analysis.

✔️
Check us out on Twitter for more such cool articles

Pre-Processing the data

To get started, we need to import some useful libraries that will help us import the dataset into our python environment, manipulate and analyze the same and later help us to visualize it.

import pandas as pd
import numpy as np
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
print("The modules are imported")

While using platforms like Google Colab, we need to connect our google drive and store the data. Or else, we’ll need to give the direct path to Pandas of the file location on your system.

Using the former method to import the dataset:-

df = pd.read_csv('Train.csv')

Before moving forward, let’s have a quick look at some of the important attributes of the datasets.

df.head()
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type Item_Outlet_Sales
0 FDA15 9.300 Low Fat 0.016 Dairy 249.809 OUT049 1999 Medium Tier 1 Supermarket Type1 3735.138
1 DRC01 5.920 Regular 0.019 Soft Drinks 48.269 OUT018 2009 Medium Tier 3 Supermarket Type2 443.423
2 FDN15 17.500 Low Fat 0.017 Meat 141.618 OUT049 1999 Medium Tier 1 Supermarket Type1 2097.270
3 FDX07 19.200 Regular 0.000 Fruits and Vegetables 182.095 OUT010 1998 NaN Tier 3 Grocery Store 732.380
4 NCD19 8.930 Low Fat 0.000 Household 53.861 OUT013 1987 High Tier 3 Supermarket Type1 994.705

Our dataset has 8523 rows and 12 columns.

Moreover, we observe that the columns ‘Item_Identifier” and “Outlet_Identifier” have no use to us, so we will remove them from the dataset.

df.drop(['Item_Identifier','Outlet_Identifier'],axis=1,inplace=True)

Before we could answer any question, it’s important to do a general EDA on our dataset.

We will begin with missing value imputation.

df.isnull().sum()
Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

Looking at the output, we observe that only 2 columns have missing values in them, namely, Item_Weight and Outlet_Size.

We can do the mean or median imputation for numerical variables and mode imputation for categorical variables.

Starting with Item_Weight, knowing that it’s a numerical variable, we will choose between either mean or median imputation. To do so, first look at the boxplot for the same to check for the presence of outliers.

Plotting the Boxplot using Plotly:

fig = px.box(df, y="Item_Weight")
fig.show()

Since we are clear of all outliers, it will be safe to say that we can use the mean imputation.

df["Item_Weight"].fillna(df["Item_Weight"].mean(),inplace=True)

Checking if all the missing values are filled.

df["Item_Weight"].isnull().sum()
0

We have successfully, filled all missing values for Item_Weight.

Following the same procedure for ‘Outlet_Size’, but since it is a categorical variable, we will use the mode imputation.

df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0],inplace=True)
df["Outlet_Size"].isnull().sum()
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

Voila! We have successfully filled in all of the missing values.

Now we are ready to go!

Analysis of sales corresponding to each factor.

Outlet Type v/s Sales

While moving forward to analyze if and how the outlet type is correlated with the sales, we will again be using the 'groupby()' function of pandas, like we did while analyzing the ‘Chipotle Orders’ datasets.

Doing a quick revision, groupby() function is used to group or combine large amounts of data and compute operations on these groups.

Here, we are combining the groupby() function with some other simple yet interesting functions of pandas like sort_values(), reset_index() and set_option.

outlet_type_sales = df.groupby("Outlet_Type")[["Item_Outlet_Sales"]].sum()
pd.set_option('display.float_format', lambda x: '%.3f' % x) #converting from scientific notation to numerical format
outlet_type_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()

In the above snippet of code we have grouped the data based on the ‘Outlet_Type’ and summed up the 'Item_Outlet_Sales' for each of the Outlet Type using the sum() function. Storing the same in a variable named ‘outlet_type_sales’.

Now before we could read the overall sales per outlet type, we had to convert the values from scientific notation to their numerical format using the set_option function of pandas, after which we firstly arrange the obtained dataframe in descending order (Outlet type with the most sales on top) using the sort_values() function of pandas and finally resetting the index.

Outlet_Type Item_Outlet_Sales
0 Supermarket Type1 12917342.263
1 Supermarket Type3 3453926.051
2 Supermarket Type2 1851822.830
3 Grocery Store 368034.266

From the output we obtained from the above code, we note that the outlet type 'Supermarket Type 1' makes the most overall sales across all cities, whereas, 'Grocery Store' makes the least sales.

Taking this one step further and visualizing the same result.

One of the many benefits of visualizing the result is that the plotted patterns make more sense when graphically represented because visuals and diagrams make it easier for us to identify strongly correlated parameters.

So, let’s learn today how to use plotly to our advantage. We have already imported the library at the beginning of our session.

data = df.groupby("Outlet_Type")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
px.bar(data, x= 'Outlet_Type', y="Item_Outlet_Sales",color='Outlet_Type',color_discrete_sequence=px.colors.qualitative.Pastel1)

The variable data has the grouped data of the overall sales corresponding to different outlet type (the same way we did above). Here, we take the Outlet_type on the x-axis. It will have 4 values, namely, Supermarket Type1, Supermarket Type 2, Supermarket Type 3, and lastly, Grocery Store. Y-axis will have the sum of variable Item_Outlet_sales corresponding to different values on the x-axis.

The function color_discrete_sequence helps us customize the colour sequence for the plot. Here, we have used the colour sequence “Pastel1” using the input colors.qualitative.Pastel1.

You can choose any of the built-in qualitative colour sequences from the px.colors.qualitative module using the following code:-

fig = px.colors.qualitative.swatches()
fig.show()

Make sure the plotly library is already imported before implementing the above code.

It is important to note that the input corresponding to color only helps choose the variable for the legend to be displayed besides the plot. We have taken “Outlet_Type” to be shown as an input for legend.

This is how the final plot looks.

The plot is self-explanatory, thanks to plotly, we can observe the outlet type v/s sales in the hover data wherever we place our cursor.

We will follow a similar procedure while analyzing the overall sales for

  • Outlet_location
  • Outlet_Size
  • Item_visibility and Item_weight
  • Item_Type
  • Item_ MRP.

And later we will go on comparing the sales trend for Tier 1 cities v/s Tier 2&3 cities.

Outlet location v/s Sales

Using the same functions to see the sales per outlet location by grouping the data with respect to the Outlet_Location and aggregating the corresponding sales. This time we will  convert the scientific values into numbers in the very beginning.

pd.set_option('display.float_format', lambda x: '%.3f' % x)
location_type_sales = df.groupby("Outlet_Location_Type")[["Item_Outlet_Sales"]].sum().reset_index()
location_type_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False])
Outlet_Location_Type Item_Outlet_Sales
0 Tier 3 7131575.550
1 Tier 2 6066611.805
2 Tier 1 4223880.474

The output shows that Tier 3 cities perform the best in terms of sales, followed by Tier 2 and lastly, Tier 1 cities.

This is a very important inference we have drawn since this contradicts the most general hypothesis that Tier 1 cities might generate the most sales followed by Tier 2 and Tier 3 cities.

Visualizing the same using plotly.

data = df.groupby("Outlet_Location_Type")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
px.bar(data, x= 'Outlet_Location_Type', y="Item_Outlet_Sales",color='Outlet_Location_Type',color_discrete_sequence=px.colors.qualitative.T10)

Using the color sequence T10 in this case.

Outlet Size v/s Sales

Let’s do the same analysis for outlet size v/s corresponding sales.

pd.set_option('display.float_format', lambda x: '%.3f' % x)
outlet_size_sales = df.groupby("Outlet_Size")[["Item_Outlet_Sales"]].sum().reset_index()
outlet_size_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False])
Outlet_Size Item_Outlet_Sales
0 Medium 11119470.539
1 Small 4300311.651
2 High 2002285.640

'Medium-sized' outlets have outperformed 'Small' and 'High-sized' outlets.

The above result is more clear using plotly.

data = df.groupby("Outlet_Size")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
px.bar(data, x= 'Outlet_Size', y="Item_Outlet_Sales",color='Outlet_Size',color_discrete_sequence=px.colors.qualitative.Set3)

Using the colour sequence Set3 in this case.
Let’s move on to the next factor analysis.

Item Category v/s Sales

Try to analyze the data on your own for the variables 'Item_type' and 'Item_Outlet_Sales' before looking at the code below.

Don’t worry, you already know how to do it!
If you are done, evaluate your code using the following lines of code as reference:-

df.groupby("Item_Type")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
Item_Type Item_Outlet_Sales
0 Household 13086.965
1 Fruits and Vegetables 12117.560
2 Snack Foods 10993.690
3 Canned 10306.584
4 Dairy 10256.649
5 Health and Hygiene 9779.936
6 Frozen Foods 9678.069
7 Soft Drinks 9554.230
8 Meat 9390.443
9 Breads 8958.339
10 Breakfast 8209.314
11 Starchy Foods 8132.081
12 Baking Goods 7931.675
13 Hard Drinks 7843.124
14 Seafood 6503.534
15 Others 6008.845

We observe that the 'Household' item category has sold the most followed by 'Fruits and Vegetables', whereas, 'Others'  item category have sold the least. The above analysis is achieved by keeping all cities, outlet size, and outlet type constant.

Give plotly a try on your own too. Use any of the plotly color sequence of your choice.

Use the below lines of code as reference:-

data=df.groupby("Item_Type")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
px.bar(data, x= 'Item_Type', y="Item_Outlet_Sales",color='Item_Type',color_discrete_sequence=px.colors.qualitative.Bold)

Congratulations! You have successfully used plotly to visualize the result all by yourself!

The plot looks vibrant and equally informative. We have used the color sequence Bold in this case.
Moving on to our next factor analysis.

Item Visibility and Weight v/s Sales

Analyzing these 2 factors one by one will give us a better understanding of the result. Let's start with the item visibility.

First, we will have a look at the unique values for Item_Visibility using the pandas function unique(), which returns an array of all the unique values in a column.

df['Item_Visibility'].unique()
array([0.0000000,0.0160473 , 0.01927822, 0.01676007, ..., 0.03518627, 0.14522065,0.04487828])

We see that Item_Visibility=0 does not make sense and should be removed.

df.drop(df.loc[df['Item_Visibility']==0].index, inplace=True)

Now, we are ready for the analysis.

pd.set_option('display.float_format', lambda x: '%.3f' % x)
item_visibility_sales = df.groupby("Item_Visibility")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
item_visibility_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False])

It is important to note that the Item_Visibility has a total of 7880 rows. In such cases, bar graphs are not a suitable plotting option.

We will visualize the result with the help of a scatter plot. Let's see how.

fig = px.scatter(df,x='Item_Visibility', y='Item_Outlet_Sales',color='Item_Visibility')
fig.show()

We observe that items with visibility between 0.005 to 0.18 generate the most sales, whereas, items with visibility of more than 0.2 generate the least sales. We can also say that products with a very high visibility are not making a lot of sales. This can be due to many reasons like, big items (or items with more visibility) are more costly, so they are sold less, or because big items are hard to carry around and therefore not purchased daily or in bulk.

Since Item_Visibility can (not necessarily) be correlated with Item_Weight, we will check if Item_Weight analysis gives us the same result.

First, checking to see if the Item_Weight is greater than or equal to 0. We can simply do so by checking the minimum value for this variable.

df['Item_Weight'].min()
4.555

We see that the minimum value is 4.555 units. Now, we shall proceed with the further steps.

pd.set_option('display.float_format', lambda x: '%.3f' % x)
item_visibility_sales = df.groupby("Item_Weight")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
item_visibility_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False])

We see that product with the a weight of 12.85 units has made the most sales and Item_Weight around 9.1 units has the least sales.

Visualizing the same using a scatter plot to get a clear picture.

fig = px.scatter(df,x='Item_Weight', y='Item_Outlet_Sales',color='Item_Weight')
fig.show()

The scatter plot also shows that products with weight 12.6 units are making the most sales. The total sales corresponding to other item weight are scattered across the plot.

Item MRP v/s Sales

Which category do these products fall under?

Checking the sales for different item MRP below:-

df.groupby("Item_MRP")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()

The output will show that the product with MRP 234.9958 has the most sales, followed by MRP 253.0356, with the 2nd highest sales.
Whereas, the product with MRP 31.290 has the least sales. Let's find out what these items are.

Finding out the item type for the Item_MRP making the highest sales.

df.loc[df['Item_MRP'] ==234.9958]
Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type Item_Outlet_Sales
7061 20.350 Regular 0.015 Fruits and Vegetables 234.996 1987 High Tier 3 Supermarket Type1 5842.395
7189 12.858 Low Fat 0.011 Household 234.996 1985 Medium Tier 3 Supermarket Type3 13086.965

The item types for items with MRP 244.996 are 'Fruits and Vegetables' and 'Household'

Finding out the item type for the Item_MRP making the lowest sales.

df.loc[df['Item_MRP'] ==31.290]
Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type Item_Outlet_Sales
154 12.858 Low Fat 0.042 Soft Drinks 31.290 1985 Medium Tier 3 Supermarket Type3 898.830

The item type for items with MRP 31.290 is 'Soft Drinks'.

Visualize the above analysis using a scatter plot by following the same procedure that we discussed before.

Give it a try!

Products performing better and worse in Tier 1 cities v/s Tier 2 & 3 cities in terms of Sales

In this last section, we will answer our last two questions. They are:-

  • What are some products that sell better in Tier 1 cities as compared to Tier 2 and Tier 3 cities?
  • Are there any products selling better in Tier 2 and 3 cities as compared to Tier 1 cities?

We will follow a different approach to reach our target.

First, we will split our data into two categories:-

  1. Tier 1 cities
  2. Tier 2 & 3 cities

This will result in two different dataframes.

The code for the same is as follows:-

df_tier1=pd.DataFrame(df.loc[df['Outlet_Location_Type'] =='Tier 1'])#for tier 1
df_tier2=pd.DataFrame(df.loc[df['Outlet_Location_Type'] =='Tier 2']) #for tier 2
df_tier3=pd.DataFrame(df.loc[df['Outlet_Location_Type'] =='Tier 3']) #for tier 3
df_cities=pd.concat([df_tier2,df_tier3],axis=0) #combining dfs for tier 2 & 3

We have saved the Tier 1 cities data in the variable "df_tier1" and the same for Tier 2 & 3 in the variable named "df_cities"

Now, we will analyze how each product is performing for firstly Tier 1 cities and then for Tier 2 and 3 cities.

  • For Tier 1 cities :-
df_tier1_list=df_tier1.groupby("Item_Type")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
df_tier1_list.rename(columns = {'Item_Type':'Items_Tier_1','Item_Outlet_Sales':'Sales_tier_1'}, inplace = True)
Items_Tier_1 Sales_tier_1
0 Health and Hygiene 9779.936
1 Starchy Foods 8132.081
2 Dairy 7833.803
3 Canned 7590.120
4 Fruits and Vegetables 7549.506
5 Frozen Foods 7370.406
6 Household 7148.029
7 Soft Drinks 7017.532
8 Breads 6972.258
9 Breakfast 6930.978
10 Snack Foods 6692.622
11 Seafood 6503.534
12 Meat 6331.758
13 Hard Drinks 6308.455
14 Baking Goods 6093.402
15 Others 6008.845
  • For Tier 2 & 3 cities:-
df_cities_list=df_cities.groupby("Item_Type")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
df_cities_list.rename(columns = {'Item_Type':'Items_Tier2_3','Item_Outlet_Sales':'Sales_tier2_3'}, inplace = True)
Items_Tier2_3 Sales_tier2_3
0 Household 13086.965
1 Fruits and Vegetables 12117.560
2 Snack Foods 10993.690
3 Canned 10306.584
4 Dairy 10256.649
5 Frozen Foods 9678.069
6 Soft Drinks 9554.230
7 Meat 9390.443
8 Breads 8958.339
9 Breakfast 8209.314
10 Baking Goods 7931.675
11 Hard Drinks 7843.124
12 Starchy Foods 7443.644
13 Health and Hygiene 7142.702
14 Seafood 5992.200
15 Others 5546.114

Concatenating these two dataframes into one variable named 'df_sales'

df_sales=pd.concat([df_tier1_list,df_cities_list],axis=1)
df_sales

The output will show a combined dataframe. This helps us to see the sales corresponding to each item type for the two categories of cities.

Let's plot the same data. We get an intuition that a line plot is a better fit. This will help us compare the sales trend for Tier 1 v/s Tier 2 & 3 cities.

Before we proceed with the above, it's important to import another plotly library, Graph Objects.

import plotly.graph_objects as go   #importing the library
x=df_sales['Items_Tier_1']
y1=df_sales['Sales_tier_1']
y2=df_sales['Sales_tier2_3']

# Create traces

fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y1,
                    mode='lines+markers',line_color='rgb(0,100,80)',
                    name='tier 1 sales'))
fig.add_trace(go.Scatter(x=x, y=y2,
                    mode='lines+markers',
                    name='tier 2&3 sales'))

fig.show()

With only one quick look at the graph, we can clearly see that no item type has performed better in Tier 1 cities than that for Tier 2& 3 except for the category 'Others', that too with a very slight margin.

Sales generated by Tier 2 and 3 cities have outnumbered that for Tier 1 cities.

And with that, we come to an end of our analysis.

Conclusions

Hence we conclude:-

  1. Sales have been made the most for:

    • Supermarket Type 1
    • Tier 3 cities
    • Medium sized outlet
    • Household Items
    • Item weight 12.6 units
  2. Tier 2 and 3 cities have outperformed Tier 1 cities in terms of sales for all types of items except for the category 'Others'.

We have managed to find the answer to each question, and we can now share our conclusions with the team, together with the processed data for further analysis.

Hoping now you have a clearer vision of how to work on datasets in the retail/FMCG sector and can handle simple graphs using plotly.

Stay tuned for more pandas series.
See you next time!