Data Analysis and Visualization in the Retail/FMCG Sector
We have worked with datasets from various sectors in our previous pandas series. Let us now learn how to analyze a dataset from the retail and FMCG sector, but this time let's take things to the next level and bring a plotting library into the mix. We will work with Pandas and Plotly.
The Plotly Python library is an interactive, open-source plotting library that supports over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use cases.
Let's get our hands dirty and experiment with Plotly for visualization in this article.
To understand how data is handled in the retail/FMCG sector, the 'BigMart Sales' dataset is a perfect fit. While working with it, we can learn to manipulate the data to our advantage and draw key insights while also visualizing the results.
Imagine you are working on the procurement and logistics team of a company that wants to launch a chain of supermarkets/grocery stores across the country. Your team wants to understand purchasing patterns across different aspects of the dataset.
You want to know how the following factors affect sales:
- Does the outlet type (grocery store or supermarket) have any impact on the overall sales?
- Which type of city has the most overall sales? Or which outlet location makes the most overall sales?
- Does the outlet size have any impact on the overall sales?
- Which category of products sells the most and the least?
- Do the product visibility and weight have any impact on the sales of the product?
- What is the average MRP of the product that sells the most and the least? Which category do these products fall under?
- What are some products that sell better in Tier 1 cities as compared to Tier 2 and Tier 3 cities?
- Are there any products selling better in Tier 2 and 3 cities as compared to Tier 1 cities?
Data to download: https://www.kaggle.com/brijbhushannanda1979/bigmart-sales-data?select=Train.csv
The training dataset, 'Train.csv', will be sufficient to answer all of these questions, so we will use it for our analysis.
Pre-Processing the data
To get started, we need to import some useful libraries that will help us bring the dataset into our Python environment, manipulate and analyze it, and later visualize it.
import pandas as pd
import numpy as np
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
print("The modules are imported")
While using platforms like Google Colab, we can mount our Google Drive and keep the data there; otherwise, we give Pandas the direct path to the file on our system.
Assuming 'Train.csv' sits in the working directory, we import the dataset:
df = pd.read_csv('Train.csv')
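If you are working on Google Colab with the file kept in Drive, a minimal sketch looks like this (the path below is hypothetical; adjust it to wherever you saved the file):
from google.colab import drive # only needed on Colab
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/Train.csv') # hypothetical Drive path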
Before moving forward, let's have a quick look at some of the important attributes of the dataset.
df.head()
| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.300 | Low Fat | 0.016 | Dairy | 249.809 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.920 | Regular | 0.019 | Soft Drinks | 48.269 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.423 |
| 2 | FDN15 | 17.500 | Low Fat | 0.017 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.270 |
| 3 | FDX07 | 19.200 | Regular | 0.000 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.380 |
| 4 | NCD19 | 8.930 | Low Fat | 0.000 | Household | 53.861 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.705 |
Our dataset has 8523 rows and 12 columns.
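You can verify these dimensions yourself with the shape attribute:
df.shape # returns (8523, 12): 8523 rows and 12 columns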
Moreover, the columns 'Item_Identifier' and 'Outlet_Identifier' are of no use to us, so we will remove them from the dataset.
df.drop(['Item_Identifier','Outlet_Identifier'],axis=1,inplace=True)
Before we can answer any questions, it's important to do some general EDA on our dataset.
We will begin with missing-value imputation.
df.isnull().sum()
Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 2410
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64
Looking at the output, we observe that only 2 columns have missing values: Item_Weight and Outlet_Size.
We can use mean or median imputation for numerical variables and mode imputation for categorical variables.
Starting with Item_Weight, which is a numerical variable, we need to choose between mean and median imputation. To decide, let's first look at a boxplot of the column to check for the presence of outliers.
Plotting the Boxplot using Plotly:
fig = px.box(df, y="Item_Weight")
fig.show()
Since the boxplot shows no outliers, it is safe to use mean imputation.
df["Item_Weight"].fillna(df["Item_Weight"].mean(),inplace=True)
Checking if all the missing values are filled.
df["Item_Weight"].isnull().sum()
0
We have successfully filled all missing values for Item_Weight.
We follow the same procedure for 'Outlet_Size', but since it is a categorical variable, we use mode imputation.
df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0],inplace=True)
df["Outlet_Size"].isnull().sum()
Item_Weight 0
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64
Voila! We have successfully filled in all of the missing values.
Now we are ready to go!
Analysis of Sales Corresponding to Each Factor
Outlet Type v/s Sales
To analyze whether and how the outlet type is correlated with sales, we will again use the groupby() function of pandas, as we did while analyzing the 'Chipotle Orders' dataset.
As a quick refresher, the groupby() function is used to group or combine large amounts of data and compute operations on these groups.
Here, we combine groupby() with some other simple yet useful pandas functions like sort_values(), reset_index() and set_option().
outlet_type_sales = df.groupby("Outlet_Type")[["Item_Outlet_Sales"]].sum()
pd.set_option('display.float_format', lambda x: '%.3f' % x) #converting from scientific notation to numerical format
outlet_type_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
In the above snippet, we group the data by 'Outlet_Type' and sum up 'Item_Outlet_Sales' for each outlet type using the sum() function, storing the result in a variable named 'outlet_type_sales'.
Before reading off the overall sales per outlet type, we convert the values from scientific notation to plain numeric format using pandas' set_option() function. We then arrange the resulting dataframe in descending order (outlet type with the most sales on top) using sort_values(), and finally reset the index.
| | Outlet_Type | Item_Outlet_Sales |
|---|---|---|
| 0 | Supermarket Type1 | 12917342.263 |
| 1 | Supermarket Type3 | 3453926.051 |
| 2 | Supermarket Type2 | 1851822.830 |
| 3 | Grocery Store | 368034.266 |
From the output, we note that the outlet type 'Supermarket Type1' makes the most overall sales across all cities, whereas 'Grocery Store' makes the least.
Let's take this one step further and visualize the result.
One of the many benefits of visualizing results is that patterns make more sense when represented graphically: visuals and diagrams make it easier for us to spot strongly correlated parameters.
So, let's learn how to use Plotly to our advantage. We already imported the library at the beginning of our session.
data = df.groupby("Outlet_Type")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
px.bar(data, x= 'Outlet_Type', y="Item_Outlet_Sales",color='Outlet_Type',color_discrete_sequence=px.colors.qualitative.Pastel1)
The variable data holds the grouped overall sales per outlet type (computed the same way as above). We put Outlet_Type on the x-axis; it has 4 values: Supermarket Type1, Supermarket Type2, Supermarket Type3, and Grocery Store. The y-axis shows the summed Item_Outlet_Sales for each value on the x-axis.
The color_discrete_sequence argument helps us customize the colour sequence for the plot. Here, we use the 'Pastel1' sequence via px.colors.qualitative.Pastel1.
You can preview all of the built-in qualitative colour sequences in the px.colors.qualitative module with the following code:
fig = px.colors.qualitative.swatches()
fig.show()
Make sure the Plotly library is already imported before running the above code.
Note that the color argument chooses which variable is used to colour the bars and populate the legend displayed beside the plot; here, we pass 'Outlet_Type'.
This is how the final plot looks.
The plot is self-explanatory, thanks to plotly, we can observe the outlet type v/s sales in the hover data wherever we place our cursor.
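If you want more control over what the tooltip shows, one option (a small sketch, not part of the original walkthrough) is Plotly's hovertemplate:
fig = px.bar(data, x='Outlet_Type', y='Item_Outlet_Sales', color='Outlet_Type', color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.update_traces(hovertemplate='%{x}: %{y:,.0f}<extra></extra>') # e.g. "Supermarket Type1: 12,917,342"
fig.show()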
We will follow a similar procedure while analyzing the overall sales for:
- Outlet_Location_Type
- Outlet_Size
- Item_Visibility and Item_Weight
- Item_Type
- Item_MRP
Later, we will compare the sales trend for Tier 1 cities v/s Tier 2 & 3 cities.
Outlet location v/s Sales
We use the same functions to see the sales per outlet location, grouping the data by 'Outlet_Location_Type' and aggregating the corresponding sales. This time, we convert the scientific notation at the very beginning.
pd.set_option('display.float_format', lambda x: '%.3f' % x)
location_type_sales = df.groupby("Outlet_Location_Type")[["Item_Outlet_Sales"]].sum().reset_index()
location_type_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False])
| | Outlet_Location_Type | Item_Outlet_Sales |
|---|---|---|
| 0 | Tier 3 | 7131575.550 |
| 1 | Tier 2 | 6066611.805 |
| 2 | Tier 1 | 4223880.474 |
The output shows that Tier 3 cities perform the best in terms of sales, followed by Tier 2 and lastly, Tier 1 cities.
This is a very important inference, since it contradicts the common hypothesis that Tier 1 cities would generate the most sales, followed by Tier 2 and Tier 3 cities.
Visualizing the same using plotly.
data = df.groupby("Outlet_Location_Type")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
px.bar(data, x= 'Outlet_Location_Type', y="Item_Outlet_Sales",color='Outlet_Location_Type',color_discrete_sequence=px.colors.qualitative.T10)
Using the color sequence T10 in this case.
Outlet Size v/s Sales
Let’s do the same analysis for outlet size v/s corresponding sales.
pd.set_option('display.float_format', lambda x: '%.3f' % x)
outlet_size_sales = df.groupby("Outlet_Size")[["Item_Outlet_Sales"]].sum().reset_index()
outlet_size_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False])
| | Outlet_Size | Item_Outlet_Sales |
|---|---|---|
| 0 | Medium | 11119470.539 |
| 1 | Small | 4300311.651 |
| 2 | High | 2002285.640 |
'Medium' outlets have outperformed 'Small' and 'High' outlets.
The result is even clearer when visualized with Plotly.
data = df.groupby("Outlet_Size")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
px.bar(data, x= 'Outlet_Size', y="Item_Outlet_Sales",color='Outlet_Size',color_discrete_sequence=px.colors.qualitative.Set3)
Using the colour sequence Set3 in this case.
Let’s move on to the next factor analysis.
Item Category v/s Sales
Try to analyze the data on your own for the variables 'Item_Type' and 'Item_Outlet_Sales' before looking at the code below.
Don’t worry, you already know how to do it!
If you are done, evaluate your code against the following reference (note that we take the maximum sales value recorded for each item type):
df.groupby("Item_Type")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
| | Item_Type | Item_Outlet_Sales |
|---|---|---|
| 0 | Household | 13086.965 |
| 1 | Fruits and Vegetables | 12117.560 |
| 2 | Snack Foods | 10993.690 |
| 3 | Canned | 10306.584 |
| 4 | Dairy | 10256.649 |
| 5 | Health and Hygiene | 9779.936 |
| 6 | Frozen Foods | 9678.069 |
| 7 | Soft Drinks | 9554.230 |
| 8 | Meat | 9390.443 |
| 9 | Breads | 8958.339 |
| 10 | Breakfast | 8209.314 |
| 11 | Starchy Foods | 8132.081 |
| 12 | Baking Goods | 7931.675 |
| 13 | Hard Drinks | 7843.124 |
| 14 | Seafood | 6503.534 |
| 15 | Others | 6008.845 |
We observe that the 'Household' item category records the highest sales, followed by 'Fruits and Vegetables', whereas the 'Others' category records the lowest. This analysis aggregates over all cities, outlet sizes, and outlet types.
Give Plotly a try on your own too, using any Plotly colour sequence of your choice.
Use the lines of code below as a reference:
data=df.groupby("Item_Type")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
px.bar(data, x= 'Item_Type', y="Item_Outlet_Sales",color='Item_Type',color_discrete_sequence=px.colors.qualitative.Bold)
Congratulations! You have successfully used plotly to visualize the result all by yourself!
The plot looks vibrant and equally informative. We have used the color sequence Bold in this case.
Moving on to our next factor analysis.
Item Visibility and Weight v/s Sales
Analyzing these 2 factors one by one will give us a better understanding of the result. Let's start with the item visibility.
First, we will have a look at the unique values for Item_Visibility using the pandas function unique(), which returns an array of all the unique values in a column.
df['Item_Visibility'].unique()
array([0.0000000, 0.0160473, 0.01927822, 0.01676007, ..., 0.03518627, 0.14522065, 0.04487828])
We see that Item_Visibility = 0 does not make sense for a product that is actually on the shelf, so those rows should be removed.
df.drop(df.loc[df['Item_Visibility']==0].index, inplace=True)
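An equivalent, arguably more readable way to achieve the same filtering is a boolean mask:
df = df[df['Item_Visibility'] > 0] # keep only rows with non-zero visibility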
Now, we are ready for the analysis.
pd.set_option('display.float_format', lambda x: '%.3f' % x)
item_visibility_sales = df.groupby("Item_Visibility")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
item_visibility_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False])
It is important to note that Item_Visibility has 7880 distinct values, so the grouped result has 7880 rows. In such cases, bar graphs are not a suitable plotting option.
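If you do want a bar-style summary of a continuous column like this, one common workaround (a sketch, not part of the original analysis) is to bin the values first with pd.cut:
visibility_bins = pd.cut(df['Item_Visibility'], bins=10) # 10 equal-width intervals
binned = df.groupby(visibility_bins)[['Item_Outlet_Sales']].sum().reset_index()
binned['Item_Visibility'] = binned['Item_Visibility'].astype(str) # Plotly cannot plot Interval objects directly
px.bar(binned, x='Item_Visibility', y='Item_Outlet_Sales')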
We will visualize the result with the help of a scatter plot. Let's see how.
fig = px.scatter(df,x='Item_Visibility', y='Item_Outlet_Sales',color='Item_Visibility')
fig.show()
We observe that items with visibility between 0.005 and 0.18 generate the most sales, whereas items with visibility above 0.2 generate the least. In other words, products with very high visibility are not making many sales. There can be several reasons for this: for example, big items (with more shelf visibility) may be more expensive and therefore sell less, or they may be harder to carry around and so are not purchased daily or in bulk.
Since Item_Visibility may (but need not) be correlated with Item_Weight, let's check whether an Item_Weight analysis gives the same result.
First, we check that Item_Weight is greater than zero by simply looking at the minimum value of the variable.
df['Item_Weight'].min()
4.555
We see that the minimum value is 4.555 units, so we can proceed with the next steps.
pd.set_option('display.float_format', lambda x: '%.3f' % x)
item_visibility_sales = df.groupby("Item_Weight")[["Item_Outlet_Sales"]].sum().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
item_visibility_sales.sort_values(by=['Item_Outlet_Sales'],ascending=[False])
We see that products with a weight of 12.85 units have made the most sales, while those weighing around 9.1 units have the least.
Visualizing the same using a scatter plot to get a clear picture.
fig = px.scatter(df,x='Item_Weight', y='Item_Outlet_Sales',color='Item_Weight')
fig.show()
The scatter plot also shows that products weighing around 12.85 units make the most sales; the total sales for other item weights are scattered across the plot.
Item MRP v/s Sales
Here we answer the remaining parts of our question list: what is the MRP of the products that sell the most and the least, and which category do these products fall under?
Checking the sales for different item MRPs below:
df.groupby("Item_MRP")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
The output will show that the product with MRP 234.9958 has the highest sales, followed by the product with MRP 253.0356 with the second highest.
The product with MRP 31.290 has the least sales. Let's find out what these items are.
Finding the item type for the Item_MRP making the highest sales:
df.loc[df['Item_MRP'] ==234.9958]
| | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 7061 | 20.350 | Regular | 0.015 | Fruits and Vegetables | 234.996 | 1987 | High | Tier 3 | Supermarket Type1 | 5842.395 |
| 7189 | 12.858 | Low Fat | 0.011 | Household | 234.996 | 1985 | Medium | Tier 3 | Supermarket Type3 | 13086.965 |
The item types for items with MRP 234.996 are 'Fruits and Vegetables' and 'Household'.
Finding the item type for the Item_MRP making the lowest sales:
df.loc[df['Item_MRP'] ==31.290]
| | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 154 | 12.858 | Low Fat | 0.042 | Soft Drinks | 31.290 | 1985 | Medium | Tier 3 | Supermarket Type3 | 898.830 |
The item type for items with MRP 31.290 is 'Soft Drinks'.
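Our question list also asked about the average MRP per category. A quick sketch (not part of the original walkthrough) to compute it for the categories we just found:
avg_mrp = df.groupby('Item_Type')['Item_MRP'].mean() # mean MRP per item category
print(avg_mrp[['Fruits and Vegetables', 'Household', 'Soft Drinks']])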
Visualize the above analysis using a scatter plot by following the same procedure that we discussed before.
Give it a try!
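If you would like to check your attempt, one possible version looks just like the earlier scatter plots:
fig = px.scatter(df, x='Item_MRP', y='Item_Outlet_Sales', color='Item_MRP')
fig.show()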
Products performing better and worse in Tier 1 cities v/s Tier 2 & 3 cities in terms of Sales
In this last section, we will answer our final two questions:
- What are some products that sell better in Tier 1 cities as compared to Tier 2 and Tier 3 cities?
- Are there any products selling better in Tier 2 and 3 cities as compared to Tier 1 cities?
We will follow a different approach to reach our target.
First, we will split our data into two categories:
- Tier 1 cities
- Tier 2 & 3 cities
This will result in two different dataframes. The code is as follows:
df_tier1 = df.loc[df['Outlet_Location_Type'] == 'Tier 1'] # Tier 1
df_tier2 = df.loc[df['Outlet_Location_Type'] == 'Tier 2'] # Tier 2
df_tier3 = df.loc[df['Outlet_Location_Type'] == 'Tier 3'] # Tier 3
df_cities = pd.concat([df_tier2, df_tier3], axis=0) # combining dfs for Tier 2 & 3
We have saved the Tier 1 cities data in the variable "df_tier1" and the Tier 2 & 3 data in the variable named "df_cities".
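As a side note, the Tier 2 & 3 subset can also be built in a single step with isin(), which is equivalent to the concat above:
df_cities = df[df['Outlet_Location_Type'].isin(['Tier 2', 'Tier 3'])] # Tier 2 and Tier 3 rows in one go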
Now, we will analyze how each product performs, first for Tier 1 cities and then for Tier 2 & 3 cities.
- For Tier 1 cities:
df_tier1_list=df_tier1.groupby("Item_Type")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
df_tier1_list.rename(columns = {'Item_Type':'Items_Tier_1','Item_Outlet_Sales':'Sales_tier_1'}, inplace = True)
| | Items_Tier_1 | Sales_tier_1 |
|---|---|---|
| 0 | Health and Hygiene | 9779.936 |
| 1 | Starchy Foods | 8132.081 |
| 2 | Dairy | 7833.803 |
| 3 | Canned | 7590.120 |
| 4 | Fruits and Vegetables | 7549.506 |
| 5 | Frozen Foods | 7370.406 |
| 6 | Household | 7148.029 |
| 7 | Soft Drinks | 7017.532 |
| 8 | Breads | 6972.258 |
| 9 | Breakfast | 6930.978 |
| 10 | Snack Foods | 6692.622 |
| 11 | Seafood | 6503.534 |
| 12 | Meat | 6331.758 |
| 13 | Hard Drinks | 6308.455 |
| 14 | Baking Goods | 6093.402 |
| 15 | Others | 6008.845 |
- For Tier 2 & 3 cities:
df_cities_list=df_cities.groupby("Item_Type")[["Item_Outlet_Sales"]].max().sort_values(by=['Item_Outlet_Sales'],ascending=[False]).reset_index()
df_cities_list.rename(columns = {'Item_Type':'Items_Tier2_3','Item_Outlet_Sales':'Sales_tier2_3'}, inplace = True)
| | Items_Tier2_3 | Sales_tier2_3 |
|---|---|---|
| 0 | Household | 13086.965 |
| 1 | Fruits and Vegetables | 12117.560 |
| 2 | Snack Foods | 10993.690 |
| 3 | Canned | 10306.584 |
| 4 | Dairy | 10256.649 |
| 5 | Frozen Foods | 9678.069 |
| 6 | Soft Drinks | 9554.230 |
| 7 | Meat | 9390.443 |
| 8 | Breads | 8958.339 |
| 9 | Breakfast | 8209.314 |
| 10 | Baking Goods | 7931.675 |
| 11 | Hard Drinks | 7843.124 |
| 12 | Starchy Foods | 7443.644 |
| 13 | Health and Hygiene | 7142.702 |
| 14 | Seafood | 5992.200 |
| 15 | Others | 5546.114 |
Now we combine these two dataframes into one named 'df_sales'. One caveat: since each list is sorted by its own sales column, a plain pd.concat(axis=1) would pair up rows that refer to different item types. Merging on the item type instead keeps every row aligned to a single category:
df_sales = df_tier1_list.merge(df_cities_list, left_on='Items_Tier_1', right_on='Items_Tier2_3') # align rows by item type
df_sales
The output shows the combined dataframe, which lets us compare the sales for each item type across the two categories of cities.
Let's plot the same data. Intuitively, a line plot is a better fit here: it will help us compare the sales trend for Tier 1 v/s Tier 2 & 3 cities.
Before we proceed, it's important to import another Plotly module, Graph Objects.
import plotly.graph_objects as go #importing the library
x=df_sales['Items_Tier_1']
y1=df_sales['Sales_tier_1']
y2=df_sales['Sales_tier2_3']
# Create traces
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y1,
mode='lines+markers',line_color='rgb(0,100,80)',
name='tier 1 sales'))
fig.add_trace(go.Scatter(x=x, y=y2,
mode='lines+markers',
name='tier 2&3 sales'))
fig.show()
One quick look at the graph shows that the Tier 2 & 3 line sits above the Tier 1 line for most item types. Matching the two tables above item by item, Tier 1 comes out ahead only for a handful of categories, 'Health and Hygiene', 'Starchy Foods', 'Seafood', and 'Others', some of them by only a slight margin.
Overall, sales generated by Tier 2 and 3 cities clearly outnumber those of Tier 1 cities.
And with that, we come to the end of our analysis.
Conclusions
Hence we conclude:
- Sales were highest for:
  - Supermarket Type1
  - Tier 3 cities
  - Medium-sized outlets
  - Household items
  - Items weighing around 12.85 units
- Tier 2 and 3 cities have outperformed Tier 1 cities in terms of sales for most item types; only a few categories ('Health and Hygiene', 'Starchy Foods', 'Seafood', and 'Others') did better in Tier 1.
We have managed to find an answer to each question, and we can now share our conclusions with the team, together with the processed data for further analysis.
Hopefully you now have a clearer picture of how to work with datasets in the retail/FMCG sector and can build simple graphs using Plotly.
Stay tuned for more pandas series.
See you next time!