pandas

Mega Guide to Pandas for Data Scientists

In summary, we have taken a high-level rundown about python pandas, what it has to offer, various resources to learn it, and more.

Introduction

Pandas is the foundational library required in any analytics project, making a wide variety of tasks such as data cleaning, data manipulation, and handling missing data very easy.

In this article, we will explore certain aspects of Pandas that will help you gain a deeper insight as a Data Scientist.

Let's get started!

What is Pandas?

Pandas stand for Python Data Analysis Library, and as the name suggests, it allows to explore, clean, and process a dataset. Pandas is a fast, powerful, flexible, easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

pandas - Python Data Analysis Library

Python Data Analysis Library

Let's take a look at the timeline of Pandas;

2008: The development of pandas started
2009: pandas becomes open source
2012: The first edition of Python for Data Analysis is published
2015: pandas becomes a NumFOCUS sponsored project
2018: First in-person core developer sprint

Why should a Data Scientist care about Pandas ?

Steve Lohr of The New York Times said: "Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."

Datasets come in various sizes and are different in nature. Therefore, it is undeniable that 80% of a data scientist’s time and effort is spent cleaning and preparing the data for analysis.

Pandas offers a variety of features that helps you to load the raw data, clean the raw data, the transformation of raw data into a further high-quality dataset i.e. free from errors and missing values. With the help of pandas, you can create multidimensional structure data, that is pandas will store the data in 2D formats i.e. in the form of rows and columns (much like spreadsheets ).

This process of generating a usable dataset by reshaping and refining the raw data is mandatory for further hypothesis testing and machine learning model building.

What can Pandas do?

To summarize everything, Pandas is capable of the further mentioned tasks:

Load the raw dataset

The workhorse function for reading text files (a.k.a. flat files) is read_csv().
read_excel takes a path to an Excel file, and the sheet_name indicating which sheet to parse.
pd.read_excel("path_to_files.xls",sheet_name="Sheet1")
Reading data from a SQL databases
So far we have talked about reading data from CSV files and excel files. Pandas can read from HTML, JSON and SQL files! Let's take a look at how to read data from SQL files.

You can read data from a SQL database using the pd.read_sql function. read_sql will automatically convert SQL column names to DataFrame column names.

read_sql features 2 arguments: a SELECT statement, and a database connection object. You from any type of SQL database be it SQLite, MySQL.

Writing to a Database:

The Series and DataFrame objects have an instance method to_csv which allows storing the contents of the object as a comma-separated-values file.
To write a DataFrame object to a sheet of an Excel file, you can use the to_excel instance method. The arguments are largely the same as to_csv.
Writing to a SQLite database
Pandas has a write_frame function which creates a database table from a dataframe, this only works for SQLite databases.

Create the summary statistics of the data :

Summary statistics in Pandas, can be obtained by using describe function – describe(). Describe Function gives the mean, standard deviation, and Inter Quartile Range values.

Group & sort data:

By “group by” we are referring to a process involving one or more of the following steps:
1.Splitting the data into groups based on some criteria
2.Applying a function to each group independently
3.Combining the results into a data structure

Finding interesting bits of data in a DataFrame is often easier if you change the rows' order. You can sort the rows by passing a column name to .sort_values().

Inspection of data:

With a number of user-friendly methods and attributes, Pandas makes data inspection fairly hassle-free.

Handling Missing data:

Easy handling of missing data (represented as NaN, NA, or NaT) in floating point as well as non-floating point data

Add, modify, select and delete data:

The major functions to add, modify, select and delete data respectively are mentioned further :
df.append()
df.update()
df.iloc[[, ], [, ]]
df.drop()

Filtration of Data:

Data Filtering is one of the most frequent data manipulation operations. Pandas offers an efficient way to perform filtering especially when it comes to speed.

Merge and join data:

Pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

Pivot and reshape data:

To reshape the data into a pivot table format, we use the DataFrame.pivot() method (also implemented as a top-level function pivot())

Data visualization:

Pandas provide the basic blocks to easily create decent-looking plots. On DataFrame, plot() is a convenience to plot all of the columns with labels.

Check out the ecosystem documentation to check visualization libraries beyond basics;

pandas - Python Data Analysis Library

Python Data Analysis Library

Time series data:

Pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.

These are some of all the tasks Pandas can help you accomplish. To get a better look into what else Pandas can do, check out their User Guide which discusses how pandas approaches the problem, with many examples throughout.

User Guide — pandas 1.4.2 documentation

How to Scale to large datasets in Pandas?

Using Pandas to scale large datasets (larger than memory datasets) is somewhat difficult because Pandas provides data structures for in-memory analytics.

Here are a few ways how you can use Pandas even while working with large datasets:

Load Less Data :

Loading only the columns that you need from the raw dataset using
df.pd.read_parquet("datasetname",columns=columns)

Another way is to load the entire dataset and then filter in the necessary columns using df.pd.read_parquet("datasetname")[columns=columns]

If we were to measure the memory usage of the two calls, we’d see that specifying columns uses about 1/10th of the memory in this case.

Usage of efficient datatypes :

The default Pandas datatypes are not very memory efficient, especially for text columns with less unique values. For example, converting the before mentioned column into categorical using df["name"]=df["name"].astype("category"), we can store each unique name once and use space-efficient integers to know which specific name is used in each row. This will cut down memory usage.

Use Chunking :

Chunking refers to splitting large datasets into smaller more usable datasets. But, Chunking works well when the operation you’re performing requires zero or minimal coordination between chunks. For more complicated workflows, other options must be revised.

Using other libraries :

Pandas is a library that offers a Dataframe API. Check out their list of libraries implementing a Dataframe API.

pandas ecosystem — pandas 1.4.2 documentation

Dask, a parallel computing library, has dask.dataframe, a pandas-like API for working with larger than memory datasets in parallel. Dask can use multiple threads or processes on a single machine, or a cluster of machines to process data in parallel.

We can import data frames by using import dask.dataframe as dd. The API is pretty similar to Pandas, we can see that:

There are familiar attributes like .columns and .dtypes
There are familiar methods like .groupby, .sum, etc.
There are new attributes like .npartitions and .divisions

The partitions and divisions are how Dask parallelizes computation. A Dask DataFrame is made up of many pandas DataFrames. A single method call on a Dask DataFrame ends up making many pandas method calls, and Dask knows how to coordinate everything to get the result.

You can look at a few Dask examples to see how to use Dask in variety of situations:

Dask Examples — Dask Examples documentation

Dask Examples documentation

Datasets you can play with if you want to learn Pandas

Unquestionably, Pandas is one of the most important libraries to master on the path to becoming a Data Scientist, and what better way of doing that than to practice your skills on a real-life dataset. Here is a list of a few tutorials that feature a real-life dataset that will help you build a strong foundation.

This post helps you answer some business questions, using the Spotify dataset. It helps you build knowledge to use pandas to gain insight from the data. Some of the questions are:

What tracks are most popular amongst Spotify users?

Is there a correlation between popularity and a track's traits? and more

Learn to analyze a dataset using pandas and visualize the results in plotly and answer some business questions to find out the sales pattern and different features. For Example ;

What are the top 10 games currently making the most sales globally?

What are the top games for different regions?

This post discusses the customer churn dataset and lways to analyze and visualize it. Some of the important questions to be asked while analyzing consumer churn data are :

What is the correlation between customer churn and the total tenure period?

Which business services have seen the maximum amount of customer churn?

Other resources:

Awesome-pandas is an unofficial collection of resources for learning Pandas. Here you will find videos, cheat sheets, tutorials, and books/papers.

Here are the links to a few Pandas cheatsheets you can quickly take a look at:

Python for DataScience Cheat Sheet - Pandas Basics

Cheat Sheet: The pandas DataFrame Object

Python For Data Science Cheat Sheet Pandas

For more , check out our collection of cheat sheets that cover all the necessary data science techniques that will help you to get the most out of your Python code.

Find Pandas Help here

How to iterate over rows in a DataFrame in Pandas

I have a DataFrame from Pandas: import pandas as pdinp = [{‘c1’:10, ‘c2’:100}, {‘c1’:11,‘c2’:110}, {‘c1’:12,‘c2’:120}]df = pd.DataFrame(inp)print df Output: c1 c20 10 1001 11 1102...

Stack OverflowRoman

“Large data” workflows using pandas

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for it’s out-of-core support. However, SAS is horribl...

Stack OverflowZelazny7

Pandas Merging 101

How can I perform a (INNER| (LEFT|RIGHT|FULL) OUTER) JOIN with pandas?How do I add NaNs for missing rows after a merge?How do I get rid of NaNs after merging?Can I merge on the index?How do I m...

Stack Overflowcs95

These are a few of many frequently asked questions related to Pandas. Getting doubts while performing tasks is completely normal for a beginner as well as for an experienced professional. Stack Overflow has an extensive collection of such doubts along with verified answers. The below link lists all the questions related to pandas, find your solution here:

Highest scored ‘pandas’ questions - Page 1

Stack Overflow | The World’s Largest Online Community for Developers

Stack Overflow

Conclusion

In summary, we have taken a high-level rundown about python pandas, what it has to offer, various resources to learn it, and more. It is hoped that this article provides the ultimate layout to refer to know anything about Pandas.

If you found this article interesting and wish to get such content in your mailbox directly - subscribe.

✔️

@DatascienceFM

See you next time!