Pandas is the foundational library required in any analytics project, making a wide variety of tasks such as data cleaning, data manipulation, and handling missing data very easy.
In this article, we will explore certain aspects of Pandas that will help you gain a deeper insight as a Data Scientist.
Let's get started!
What is Pandas?
Pandas stand for Python Data Analysis Library, and as the name suggests, it allows to explore, clean, and process a dataset. Pandas is a fast, powerful, flexible, easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.
Let's take a look at the timeline of Pandas;
- 2008: The development of pandas started
- 2009: pandas becomes open source
- 2012: The first edition of Python for Data Analysis is published
- 2015: pandas becomes a NumFOCUS sponsored project
- 2018: First in-person core developer sprint
Why should a Data Scientist care about Pandas ?
Steve Lohr of The New York Times said: "Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."
Datasets come in various sizes and are different in nature. Therefore, it is undeniable that 80% of a data scientist’s time and effort is spent cleaning and preparing the data for analysis.
Pandas offers a variety of features that helps you to load the raw data, clean the raw data, the transformation of raw data into a further high-quality dataset i.e. free from errors and missing values. With the help of pandas, you can create multidimensional structure data, that is pandas will store the data in 2D formats i.e. in the form of rows and columns (much like spreadsheets ).
This process of generating a usable dataset by reshaping and refining the raw data is mandatory for further hypothesis testing and machine learning model building.
What can Pandas do?
To summarize everything, Pandas is capable of the further mentioned tasks:
Load the raw dataset
- The workhorse function for reading text files (a.k.a. flat files) is
read_exceltakes a path to an Excel file, and the
sheet_nameindicating which sheet to parse.
- Reading data from a SQL databases
So far we have talked about reading data from CSV files and excel files. Pandas can read from HTML, JSON and SQL files! Let's take a look at how to read data from SQL files.
You can read data from a SQL database using the
read_sqlwill automatically convert SQL column names to DataFrame column names.
read_sqlfeatures 2 arguments: a
SELECTstatement, and a database connection object. You from any type of SQL database be it SQLite, MySQL.
Writing to a Database:
DataFrameobjects have an instance method
to_csvwhich allows storing the contents of the object as a comma-separated-values file.
- To write a
DataFrameobject to a sheet of an Excel file, you can use the
to_excelinstance method. The arguments are largely the same as
- Writing to a SQLite database
Pandas has a
write_framefunction which creates a database table from a dataframe, this only works for SQLite databases.
Create the summary statistics of the data :
Summary statistics in Pandas, can be obtained by using describe function – describe(). Describe Function gives the mean, standard deviation, and Inter Quartile Range values.
Group & sort data:
By “group by” we are referring to a process involving one or more of the following steps:
1.Splitting the data into groups based on some criteria
2.Applying a function to each group independently
3.Combining the results into a data structure
Finding interesting bits of data in a DataFrame is often easier if you change the rows' order. You can sort the rows by passing a column name to
Inspection of data:
With a number of user-friendly methods and attributes, Pandas makes data inspection fairly hassle-free.
Handling Missing data:
Easy handling of missing data (represented as
NaT) in floating point as well as non-floating point data
Add, modify, select and delete data:
The major functions to add, modify, select and delete data respectively are mentioned further :
df.iloc[[, ], [, ]]
Filtration of Data:
Data Filtering is one of the most frequent data manipulation operations. Pandas offers an efficient way to perform filtering especially when it comes to speed.
Merge and join data:
Pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
Pivot and reshape data:
Pandas provide the basic blocks to easily create decent-looking plots. On DataFrame,
plot() is a convenience to plot all of the columns with labels.
Check out the ecosystem documentation to check visualization libraries beyond basics;
Time series data:
Pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.
These are some of all the tasks Pandas can help you accomplish. To get a better look into what else Pandas can do, check out their User Guide which discusses how pandas approaches the problem, with many examples throughout.
How to Scale to large datasets in Pandas?
Using Pandas to scale large datasets (larger than memory datasets) is somewhat difficult because Pandas provides data structures for in-memory analytics.
Here are a few ways how you can use Pandas even while working with large datasets:
Load Less Data :
Loading only the columns that you need from the raw dataset using
Another way is to load the entire dataset and then filter in the necessary columns using
If we were to measure the memory usage of the two calls, we’d see that specifying
columns uses about 1/10th of the memory in this case.
Usage of efficient datatypes :
The default Pandas datatypes are not very memory efficient, especially for text columns with less unique values. For example, converting the before mentioned column into categorical using
df["name"]=df["name"].astype("category"), we can store each unique name once and use space-efficient integers to know which specific name is used in each row. This will cut down memory usage.
Use Chunking :
Chunking refers to splitting large datasets into smaller more usable datasets. But, Chunking works well when the operation you’re performing requires zero or minimal coordination between chunks. For more complicated workflows, other options must be revised.
Using other libraries :
Pandas is a library that offers a Dataframe API. Check out their list of libraries implementing a Dataframe API.
Dask, a parallel computing library, has dask.dataframe, a pandas-like API for working with larger than memory datasets in parallel. Dask can use multiple threads or processes on a single machine, or a cluster of machines to process data in parallel.
We can import data frames by using import dask.dataframe as dd. The API is pretty similar to Pandas, we can see that:
- There are familiar attributes like
- There are familiar methods like
- There are new attributes like
The partitions and divisions are how Dask parallelizes computation. A Dask DataFrame is made up of many pandas DataFrames. A single method call on a Dask DataFrame ends up making many pandas method calls, and Dask knows how to coordinate everything to get the result.
You can look at a few Dask examples to see how to use Dask in variety of situations:
Datasets you can play with if you want to learn Pandas
Unquestionably, Pandas is one of the most important libraries to master on the path to becoming a Data Scientist, and what better way of doing that than to practice your skills on a real-life dataset. Here is a list of a few tutorials that feature a real-life dataset that will help you build a strong foundation.
This post helps you answer some business questions, using the Spotify dataset. It helps you build knowledge to use pandas to gain insight from the data. Some of the questions are:
What tracks are most popular amongst Spotify users?
Is there a correlation between popularity and a track's traits? and more
Learn to analyze a dataset using pandas and visualize the results in plotly and answer some business questions to find out the sales pattern and different features. For Example ;
What are the top 10 games currently making the most sales globally?
What are the top games for different regions?
This post discusses the customer churn dataset and lways to analyze and visualize it. Some of the important questions to be asked while analyzing consumer churn data are :
What is the correlation between customer churn and the total tenure period?
Which business services have seen the maximum amount of customer churn?
Awesome-pandas is an unofficial collection of resources for learning Pandas. Here you will find videos, cheat sheets, tutorials, and books/papers.
Here are the links to a few Pandas cheatsheets you can quickly take a look at:
For more , check out our collection of cheat sheets that cover all the necessary data science techniques that will help you to get the most out of your Python code.
Find Pandas Help here
These are a few of many frequently asked questions related to Pandas. Getting doubts while performing tasks is completely normal for a beginner as well as for an experienced professional. Stack Overflow has an extensive collection of such doubts along with verified answers. The below link lists all the questions related to pandas, find your solution here:
In summary, we have taken a high-level rundown about python pandas, what it has to offer, various resources to learn it, and more. It is hoped that this article provides the ultimate layout to refer to know anything about Pandas.
If you found this article interesting and wish to get such content in your mailbox directly - subscribe.
See you next time!