Pandas Group By - Key areas you should watch out for

We surveyed Stack Overflow questions about Pandas Group By and found a few areas that come up repeatedly:

  • Computing multiple statistics for each group
  • Sorting within groups
  • Extracting the first row in each group

Computing multiple statistics for each group

Common aggregation functions include mean, median, and count, among others.

Function    Description
mean()      Mean of values
median()    Median of values
min()       Minimum
max()       Maximum
mode()      Mode of values (not a built-in GroupBy method; apply pd.Series.mode per group instead)
std()       Standard deviation
var()       Variance
count()     Number of non-null data points
sum()       Sum of values

E.g.,

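import pandas as pd

df = pd.DataFrame({'X': ['A', 'A', 'B', 'B'],
                   'Y': [20, 25, 35, 45],
                   'Z': [34, 56, 45, 76]})

# Use agg() to compute the mean and median of Y and Z for each group in X.
# String names are used because recent pandas versions warn when
# NumPy callables such as np.mean are passed to agg().
df.groupby('X').agg(['mean', 'median'])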

The output will look like the following:

      Y            Z
   mean median  mean median
X
A  22.5   22.5  45.0   45.0
B  40.0   40.0  60.5   60.5

We recommend the following SO questions as exercises to learn more about computing multiple aggregations:

Get statistics for each group such as count, mean, etc., using pandas groupby.

Get the rows which have the maximum value in groups using groupby.

Group by one columns and find sum and max value for another in pandas.

How to get sum of multiple columns in pandas dataframe group by?
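Several of the questions above ask for a different statistic per column, or for the rows that hold a group's maximum. The following is a minimal sketch of one way to approach both, assuming the same example DataFrame as before; the output column names (y_sum, z_max, y_count) are illustrative choices, not part of any pandas API.

```python
import pandas as pd

df = pd.DataFrame({'X': ['A', 'A', 'B', 'B'],
                   'Y': [20, 25, 35, 45],
                   'Z': [34, 56, 45, 76]})

# Named aggregation: each keyword is an output column name (our choice),
# and each value is a (source column, aggregation name) pair.
summary = df.groupby('X').agg(
    y_sum=('Y', 'sum'),
    z_max=('Z', 'max'),
    y_count=('Y', 'count'),
)
print(summary)

# Rows holding the maximum Z within each group: idxmax() returns the
# index label of the maximum per group, and loc selects those rows.
max_rows = df.loc[df.groupby('X')['Z'].idxmax()]
print(max_rows)
```

Named aggregation keeps the result's columns flat, which is often easier to work with than the two-level columns produced by passing a list of functions to agg().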

Sorting within groups

Sorting within a group is a very common need and the following questions will help you learn how to get this right.

How to do Pandas Group By sort within groups?

How to do pandas sort within groupby on a particular column?

Pandas groupby sort within groups retaining multiple aggregates?

Pandas sorting observations within groupby groups.
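As a starting point for these questions, sorting within groups often does not need groupby at all: sorting on the group key and then on the value column is usually enough. A minimal sketch, assuming a small illustrative DataFrame whose column names X and Y are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({'X': ['B', 'A', 'B', 'A'],
                   'Y': [35, 20, 45, 25]})

# Sort by the group key first (ascending), then by Y inside each
# group (descending). The original index labels are preserved.
sorted_df = df.sort_values(['X', 'Y'], ascending=[True, False])
print(sorted_df)
```

If the groups must stay in their original order rather than sorted by key, a different approach is needed; the questions above discuss several.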

Extracting the first row of each group

E.g.,

import pandas as pd

df = pd.DataFrame({'Name': ['Ben', 'Adi', 'Ben', 'Adi'],
                   'Age': [20, 50, 23, 45]})

# Group the DataFrame by the Name column
grouped = df.groupby('Name')

# first() returns the first non-null value of each column per group
first_values = grouped.first()

# Move Name from the index back to a regular column
first_values = first_values.reset_index()
print(first_values)

To practice, try the following SO questions:

How to get first row of each group in pandas dataframe?

Get topmost n records within each group in pandas?
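One detail worth knowing when working through these questions: first() returns the first non-null value in each column, which is not always the first row of the group. The sketch below illustrates the difference and one possible way to take the top n records per group; the NaN in the data and the value of n are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Ben', 'Adi', 'Ben', 'Adi'],
                   'Age': [np.nan, 50, 23, 45]})

# first() skips nulls, so Ben's Age comes from his second row
first_values = df.groupby('Name').first()
print(first_values)

# head(1) keeps the first physical row of each group, NaN included,
# and preserves the original index and row order
first_rows = df.groupby('Name').head(1)
print(first_rows)

# Top n records per group by Age: sort descending, then take head(n)
n = 1  # illustrative value
top_n = df.sort_values('Age', ascending=False).groupby('Name').head(n)
print(top_n)
```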

The SO questions selected above cover areas that many Pandas users find challenging.

Bookmark this blog and come back soon to learn more and improve your data science skills.