Analyzing Startup Investments
Welcome back to another pandas series with us. This time we will be addressing a very different category. As the heading of the post suggests, we will analyze the startup investment dataset. If you have been following our pandas series, you know that this topic is completely different from the previous datasets we have worked with. If you are new here, then there is no better dataset to work on together. So let's get started!
The 'Startup Investment (Crunchbase)' dataset will be our center of study (link given below, keep reading!). The dataset has various features giving information like the name of the startup, homepage URL, total funding received, and a lot more. However, for this particular post, our focus will be on a few very specific areas only.
We will analyze this dataset from the perspective of a VC analyst. If you are working in a new VC firm, you want to analyze the relation of different startups with respect to the following areas:
- Different sectors
- Funding received
- Status (operating, closed, etc.)
- Regions in the USA with the most startups
- Regions in India with the most startups
- Top countries with the most startups
- Year of founding
- Startups that:
a. got funding in less than 1 year
b. got funded after more than 20 years
c. got funded before they were founded
Dataset to download: https://www.kaggle.com/arindam235/startup-investments-crunchbase
Pre-processing the data
Before we begin with our work, it's important to import some useful libraries into our python environment.
import pandas as pd
import plotly.express as px
import datetime as dt
import warnings
warnings.filterwarnings("ignore")
print("The modules are imported")
We have imported all the useful modules and now we are ready to read our dataset.
df = pd.read_csv('investments_VC.csv', encoding = "ISO-8859-1")
df.head(2)
permalink | name | homepage_url | category_list | market | funding_total_usd | status | country_code | state_code | region | city | funding_rounds | founded_at | founded_month | founded_quarter | founded_year | first_funding_at | last_funding_at | seed | venture | equity_crowdfunding | undisclosed | convertible_note | debt_financing | angel | grant | private_equity | post_ipo_equity | post_ipo_debt | secondary_market | product_crowdfunding | round_A | round_B | round_C | round_D | round_E | round_F | round_G | round_H | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | /organization/waywire | #waywire | http://www.waywire.com | |Entertainment|Politics|Social Media|News| | News | 17,50,000 | acquired | USA | NY | New York City | New York | $1.00 | 2012-06-01 | 2012-06 | 2012-Q2 | $2,012.00 | 2012-06-30 | 2012-06-30 | $1,750,000.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 |
1 | /organization/tv-communications | &TV Communications | http://enjoyandtv.com | |Games| | Games | 40,00,000 | operating | USA | CA | Los Angeles | Los Angeles | $2.00 | NaN | NaN | NaN | NaN | 2010-06-04 | 2010-09-23 | $0.00 | $4,000,000.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 | $0.00 |
Once we execute the above codes, we see the different columns that are present in our dataset. We can check the shape and the different columns also using the command df.shape
and df.columns
. The dataset has a total of 54294 rows and 39 columns.
It's important to understand what information each column is providing before we could drop the least useful ones. To do that, we will create a table with the column names and descriptions for each column present.
description = ['Static hyperlink for the startup on Crunchbase\'s website','name of the startup','Website address of the startup',
'in which category the startups fall','which market the startup caters to','total funding received(in USD)',
'current operating status','country of origin','state of origin','region','city of origin','total rounds of funding',
'date of founding','month of founding','quarter of founding','year of founding','date of first funding','date of last funding',
'seed funding received(in USD)','venture funding received(in USD)','funding received by diluting equity',
'other undisclosed funding sources','funding received from convertible notes','funding received from debts',
'funding received from angel investors','funding from grants','funding from private equity',
'funding from equity dilution after IPO','funding from debts after IPO','funding from secondary markets',
'funding from crowdfunding','round A funding','round B funding','round C funding','round D funding','round E funding',
'round F funding','round G funding','round H funding']
df_desc = pd.DataFrame(list(zip(df.columns, description)), columns =['Column', 'Description'])
df_desc
We have created a list description
with the description for each of the columns. Next, we have combined this with the column names using the zip() function of python. So, what exactly does the zip() function do?
- zip(): This function returns a zip object, which is an iterator of tuples where the first item in each passed iterator is paired together, and then the second item in each passed iterator are paired together and so on.If the passed iterators have different lengths, the iterator with the least items decides the length of the new iterator.
In the end, we converted this table into a dataframe using the pd.DataFrame
function of pandas. Now we know how exactly the code will be executed.
Column | Description | |
---|---|---|
permalink | Static hyperlink for the startup on Crunchbase's website | |
name | name of the startup | |
homepage_url | Website of the startup | |
category_list | Different categories of startups | |
market | Different markets of startup | |
funding_total_usd | total funding received(in USD) | |
status | Current operating status | |
country_code | Country of startup | |
state_code | State of statup | |
region | region | |
city | city | |
funding_rounds | total rounds of funding | |
founded_at | date of founding | |
founded_month | month of founding | |
founded_quarter | quarter of founding | |
founded_year | year of founding | |
first_funding_at | date of first funding | |
last_funding_at | date of last funding | |
seed | seed funding received(in USD) | |
venture | venture funding received(in USD) | |
equity_crowdfunding | funding received by diluting equity | |
undisclosed | other undisclosed funding sources | |
convertible_note | funding received from convertible notes | |
debt_financing | funding received from debts | |
angel | funding received from angel investors | |
grant | funding from grants | |
private_equity | funding from private equity | |
post_ipo_equity | funding from equity dilution after IPO | |
post_ipo_debt | funding from debts after IPO | |
secondary_market | funding from secondary markets | |
product_crowdfunding | funding from crowdfunding | |
round_A | round A funding | |
round_B | round B funding | |
round_C | round C funding | |
round_D | round D funding | |
round_E | round E funding | |
round_F | round F funding | |
round_G | round G funding | |
round_H | round H funding |
Once we understand what each column means, we realize that not all of these are useful to us, therefore, before we start with our analysis it's important to drop the columns we don't need and with that, drop the duplicate rows as well.
#deleting duplicate rows.
df = df.drop_duplicates()
print(df.shape)
#cleaning the dataframe by dropping uneccessary columns
df = df.drop(['permalink', 'homepage_url'], axis=1)
#Removing the row with no 'name'
df.dropna(how='any', subset=['name'], axis=0, inplace=True)
#Changing the values in column "funding_total_usd" from string to float
df[' funding_total_usd '] = df[' funding_total_usd '].str.strip().str.replace(",","")
df[' funding_total_usd '] = df[' funding_total_usd '].replace("-",0).astype("float")
#Replacing missing status with "unknown"
df['status'] = df['status'].replace(np.nan,"unknown")
#renaming columns
df.rename(columns = {' funding_total_usd ':'funding_total_usd',' market ':'market'}, inplace = True)
With the above lines of code, we have cleaned our way through the dataset. We can see the final dataframe using the df.head()
function.
We are ready to start with our analysis!
Analyzing the dataset
Relation of startups with different sectors
In this section, we will find out the top sectors/markets that are generating the most startups. We observe that the column market
provides the various sectors. Using the df['market].nunique()
function, we can see that the total number of unique markets is 753. In simpler words, we have 753 unique values for the column market
.
It's easier and more convenient to know the top sectors with the most startups, rather than knowing all the 753 different sectors out there. Therefore, for this particular piece of analysis, we are working with just the top 20 sectors.
So, let's see what these top sectors are.
#creating dataframe of top 20 markets with the most counts of startups
top_20_counts=pd.DataFrame(df['market'].value_counts()[:20].sort_values(ascending=False).reset_index())
#renaming columns
top_20_counts.rename(columns={'index':'market','market':'counts'},inplace=True)
#print the head of the dataframe
top_20_counts.head()
We have created a separate dataframe with the top 20 markets and the number of startups they generate. Let's check what this dataframe looks like.
market | counts | |
---|---|---|
Software | 4619 | |
Biotechnology | 3680 | |
Mobile | 1983 | |
E-Commerce | 1803 | |
Curated Web | 1653 |
The head()
function has returned the first 5 rows of the dataframe, with 'Software' being on top of the list with the count of over 4619, followed by, 'Biotechnology', 'Mobile', and more.
Let's plot the graph for the same. This will help us to grasp the information more easily while also conveying a story to the decision-makers.
The bar graph has made it very easy for us to see these top sectors for startups across the world.
Let's see what our next section is.
Relation of startups with the funding received
Here, we will analyze the sectors with the most funding received. Moreover, we will find the names of the top startups that received the highest funding across the world.
- Highest funded sectors
The code to find the highest funded sectors is given below.
#creating a new dataframe
df_market=pd.DataFrame(df[['market','funding_total_usd']])
#drop funding = 0
df_market.drop(df_market.loc[df_market['funding_total_usd']==0.0].index, inplace=True)
#this adds the dollar sign and commas and limits decimals to 2 places
pd.options.display.float_format = '${:,.2f}'.format
df_market_=df_market.groupby('market')[['funding_total_usd']].sum().sort_values(by=['funding_total_usd'],ascending=False).reset_index()
market | funding_total_usd | |
---|---|---|
Biotechnology | $73,163,695,349.00 | |
Mobile | $49,470,108,988.00 | |
Software | $42,221,976,814.00 | |
Clean Technology | $37,208,938,855.00 | |
Health Care | $29,349,498,090.00 |
The see the first 5 rows for the highest funded sectors across the globe in the above dataframe. 'Biotechnology' is the highest funded amongst all the other sectors with a total of around $73 Billion. This is followed by 'Mobile' and 'Software'.
Let's plot our top 20 highest funded sectors to visualize how each sector is performing in comparison to the other.
data = df_market_
y=df_market_['funding_total_usd'][:20]
x=df_market_['market'][:20]
fig_2 = go.Figure(data=[
go.Bar(name='total funding', x=x, y=y, text= y,textposition='auto')])
# Customize aspect
fig_2.update_traces(marker_color='rgb(179,226,205)', marker_line_color='rgb(102,194,165)',
marker_line_width=1.5, opacity=0.6)
fig_2.update_layout(title_text='Top 20 markets')
fig_2.show()
We can see how much each sector is funded all over the world.
Let's see what are the names of the top-funded startups.
- Highest funded startups
Try finding the top 10 highest funded startups across the world. Follow a similar procedure as we did above. Don't worry, if you get stuck, you can refer to the codes below. We always have your back!
#startup names and funding received
name_funds=pd.DataFrame(df[['name','funding_total_usd']])
name_funds=name_funds.sort_values(by=['funding_total_usd'],ascending=False).reset_index()
name_funds.drop(['index'], axis=1)
#dataframe consisting of top 20 startups with highest funding
pd.options.display.float_format = '${:,.2f}'.format
startups_10=pd.DataFrame(name_funds[['name','funding_total_usd']][:10])
startups_10
name | funding_total_usd | |
---|---|---|
Verizon Communications | $30,079,503,000.00 | |
Sberbank | $5,800,000,000.00 | |
Clearwire | $5,700,000,000.00 | |
Charter Communications | $5,162,513,431.00 | |
First Data Corporation | $3,500,000,000.00 | |
COFCO | $3,200,000,000.00 | |
sigmacare | $2,600,000,000.00 | |
$2,425,700,000.00 | ||
Carestream | $2,400,000,000.00 | |
Flipkart | $2,351,140,000.00 |
The top highest funded startups are 'Verizon Communications', 'Sberbank', 'Clearwire'.
Let's jump to the next section, where we will discuss the status of different startups.
Status of different startups
If we look closely, we notice that the column status
has 4 unique values, namely, 'Acquired', 'Operating', 'Unknown', and 'Closed', which simply means that the status of a startup can be either of these 4.
We will first find out how many startups does each category has.
status=pd.DataFrame(df['status'].value_counts().reset_index())
status
The above code will count the number of startups in each category using the value_counts()
function.
status | count | |
---|---|---|
operating | 41680 | |
acquired | 3682 | |
closed | 2602 | |
unknown | 1313 |
A very high number of startups are operating well, that's a good sign!
Plotting the same information to understand the proportion each category holds.
colors = ['gold', 'mediumturquoise', 'darkorange', 'lightgreen']
fig_4 = go.Figure(data=[go.Pie(labels=status['index'],
values=status['status'],pull=[0.3, 0, 0, 0])])
fig_4.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
marker=dict(colors=colors))
fig_4.show()
Now, the proportion for each category is clear to us. Out of the total, 41680 startups are operating, 3682 are acquired, 2602 are closed and the status of 1313 startups is unknown to us.
Before we proceed to the next section, let's see how many rounds of funding each of these categories received. We will make use of the columns round_A
.... round_H
for this. Scroll up to see what each of these columns was in the description table we created.
#segregating the dataframe on the basis of status
acquired_df=df.query('status == "acquired"')
operating_df=df.query('status == "operating"')
unknown_df=df.query('status == "unknown"')
closed_df=df.query('status == "closed"')
#Plotting a funnel chart to understand the proportion
from plotly import graph_objects as go
fig_5 = go.Figure()
fig_5.add_trace(go.Funnel(
name = 'acquired',
y = ["Round A", "Round B","Round C","Round D",
"Round E","Round F","Round G","Round H"],
x = [acquired_df['round_A'].mean(),acquired_df['round_B'].mean(),acquired_df['round_C'].mean(),
acquired_df['round_D'].mean(),acquired_df['round_E'].mean(),
acquired_df['round_F'].mean(),acquired_df['round_G'].mean(),acquired_df['round_H'].mean()],
textinfo = "percent initial"))
fig_5.add_trace(go.Funnel(
name = 'operating',
orientation = "h",
y = ["Round A", "Round B","Round C","Round D",
"Round E","Round F","Round G","Round H"],
x = [operating_df['round_A'].mean(),operating_df['round_B'].mean(),operating_df['round_C'].mean(),
operating_df['round_D'].mean(),operating_df['round_E'].mean(),
operating_df['round_F'].mean(),operating_df['round_G'].mean(),operating_df['round_H'].mean()],
textposition = "inside",
textinfo = "percent previous"))
fig_5.add_trace(go.Funnel(
name = 'unknown',
orientation = "h",
y = ["Round A", "Round B","Round C","Round D",
"Round E","Round F","Round G","Round H"],
x = [unknown_df['round_A'].mean(),unknown_df['round_B'].mean(),unknown_df['round_C'].mean(),
unknown_df['round_D'].mean(),unknown_df['round_E'].mean(),
unknown_df['round_F'].mean(),unknown_df['round_G'].mean(),unknown_df['round_H'].mean()],
textposition = "outside",
textinfo = "percent total"))
fig_5.add_trace(go.Funnel(
name = 'closed',
orientation = "h",
y = ["Round A", "Round B","Round C","Round D",
"Round E","Round F","Round G","Round H"],
x = [closed_df['round_A'].mean(),closed_df['round_B'].mean(),closed_df['round_C'].mean(),
closed_df['round_D'].mean(),closed_df['round_E'].mean(),
closed_df['round_F'].mean(),closed_df['round_G'].mean(),closed_df['round_H'].mean()],
textposition = "outside",
textinfo = "percent total"))
fig_5.show()
We see that round_B
funding has the most startups under it, followed by round_A
funding.
Let's find out what the next section is about.
Regions in USA with the most startups
Looking at the startup distribution in different states of the USA. We will directly visualize the results for this part using a Choropleth Map.
Choropleth maps can be used while plotting data points that include country or city names or coordinates. So, what is a choropleth map?
- Choropleth Map:- It is a type of thematic map in which a set of pre-defined areas is colored or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area
#dataframe with country code USA
USA = pd.DataFrame(df.query('country_code == "USA"')['state_code'].value_counts(),columns=['state_code'])
USA.reset_index(inplace=True)
#plotting a choropleth map
fig_6 = go.Figure(data=go.Choropleth(
locations=USA['index'],
z = USA['state_code'].astype(int), # Data to be color-coded
locationmode = 'USA-states', # set of locations match entries in `locations`
colorscale = 'purpor',
colorbar_title = "startup counts",
))
fig_6.update_layout(
title_text = 'Startup investment USA state counting',
geo_scope='usa',height=500
)
fig_6.show()
One of the areas with the highest number of startups is California with over 9.902K startups.
Regions in India with the most startups
We will follow a similar approach to find the regions in India with the most startups, but this time, let's visualize our result using a Treemap.
- Treemap:- Treemapping is a method for displaying hierarchical data using nested figures, usually rectangles.
Code for the same is as follows:-
#dataframe with country as india
india=pd.DataFrame(df_india[['name','region','funding_total_usd','status','city']])
#plotting a treemap
india=df.query('country_code=="IND"')
city =pd.DataFrame(india['region'].value_counts()[:30],columns=['region'])
city.reset_index(inplace=True)
fig_7 = px.treemap(city, path=['index'], values='region',title='Top 30 regions in india')
fig_7.show()
'Banglore', 'New Delhi' and 'Mumbai' are the top cities in India with the highest startup count.
- Top countries and cities in the world with the highest startup count
Try to plot a treemap for this part on your own. Follow the steps given above. Tally your results with the answers given below, do they match?
1. Top countries are USA, Great Britain and Canada.
2. Top cities are San Francisco, New York and London.
Year of founding
We have come to the last section now, in which we will visualize the founding years of startups. Has there been a spike in the formation of new startups over the years? Let's find out.
We will use the founded_at
column for this analysis. The column goes back to the year 1785. We will only observe the years after 1950. Anything earlier than that has little to no relevance to us.
# change the datatype from string to datetime
df['founded_at'] = pd.to_datetime(df['founded_at'],errors='coerce')
# drop Null value
df['founded_at'].dropna(inplace=True)
foundation = pd.DataFrame(df['founded_at'])
foundation['year'] = foundation['founded_at'].dt.year
#plotting
foundation['year'].value_counts().iplot(kind='bar',yTitle='counting', linecolor='black',opacity=0.7,bargap=0.1,gridcolor='white',color='red',title='Year of founding')
2012 has the most number of startups. We can say that the number of new startups has spiked from 2000 to 2014.
Startups that
-
Got funding in less than 1 year
We will discuss about the startups that got funded in less than 1 year of being founded. To find the same, we will create a new column in our dataframe which gives the difference between first_funding_at
and founded_at
. The value for this difference will tell us how much time it took for different startups to get funded. Let's learn how.
# change the datatype from string to datetime
df['first_funding_at'] = pd.to_datetime(df['first_funding_at'],errors='coerce')
# drop Null value
df['first_funding_at'].dropna(inplace=True)
# Define the differrence between the date of foundation and first funding
df['difference'] = df['first_funding_at'] - df['founded_at']
# extract only numeric value
df['difference'] = df['difference'].dt.days
#extracting positive values only and saving in one_yr
one_yr = df[df['difference']>0]
one_yr['difference_level'] = 0
#extracting difference less than 365 days
one_yr=pd.DataFrame(one_yr.loc[one_yr['difference']<365])
one_year=pd.DataFrame(one_yr[['name','market','funding_total_usd','status','country_code']].reset_index())
one_year.drop('index',axis=1,inplace=True)
one_year
name | market | funding_total_usd | status | country_code | |
---|---|---|---|---|---|
Bharat Matrimony | Match-Making | $11,750,000.00 | operating | IND | |
Cincinnati State Technical and Community College | Education | $2,750,000.00 | operating | USA | |
Society of Cable Telecommunications Engineers (SCTE) | Hardware + Software | $2,200,000.00 | operating | USA |
We see that 'Bharat Matrimony', 'Cincinnati State Technical and Community College' and 'Society of Cable Telecommunications Engineers (SCTE)' are the only three startups that got funded in less than a year. We can also see the market type and total funding received for the same.
-
Got funding in more than 20 years
Looking at the different startups that took more than 2o years to get funded. Note that the idea to find this will remain same like we did above except for the number of days as input. We can find out that the number of days in 20 years is 7300 ( we can find this using a single google search also). Therefore, in place of 365 days, we will input 7300 days and observe the results.
over_20_=pd.DataFrame(over_20_.loc[over_20_['difference']>=7300])
over_20=pd.DataFrame(over_20_[['name','market','funding_total_usd','status','country_code']].reset_index())
over_20.drop('index',axis=1,inplace=True)
over_20
name | market | funding_total_usd | status | country_code | |
---|---|---|---|---|---|
Skinfix | Health and Wellness | $676,393.00 | operating | CAN | |
Winters Bros. Waste Systems | Local Businesses | $28,500,000.00 | operating | USA |
-
Got funded before they were founded
When we look at the new column difference
, we observe that many values are negative, which implies that many startups got funded even before they were founded. In this section, we will study more about the same.
For this, we will simply take difference < 0 (difference less than zero) in the code.
negative = df[df['difference']<0]
negative
The above code returns a dataframe will all the startups that got funded before the founding date. We have 38,269 startups that fall under the same category.
In the end, let's visualize the number of startups that got funded after the date of founding for different range of years.
# Range setting for funding level
positive = df[df['difference']>0]
positive['difference_level'] = 0
positive.loc[positive['difference']<365,'difference_level'] ='under 1year'
positive.loc[ (positive['difference']>=365) & (positive['difference']<1095),'difference_level' ] ='1-3years'
positive.loc[ (positive['difference']>=1095) & (positive['difference']<1825),'difference_level'] ='3-5years'
positive.loc[(positive['difference']>=1825) & (positive['difference']<3650) ,'difference_level'] ='5-10years'
positive.loc[(positive['difference']>=3650) & (positive['difference']<7300) ,'difference_level'] ='10-20years'
positive.loc[positive['difference']>=7300,'difference_level'] ='over 20years'
#Plotting the values
positive['difference_level'].value_counts().iplot(kind='bar',
yTitle='counts',
linecolor='black',
opacity=0.3,
color='turquoise',
theme='pearl',
bargap=0.5,
gridcolor='white',
title='No. of startups and years')
We observe that most startups got funded between 10-20 years of being founded and the only 2 were funded after 20 years of being founded.
It's important to note that we have not incorporated the startups that got funded before being founded in this graph because the count for that is very high, it will make the other values negligible and we will not be able to observe the count for the years with a lesser count. However, if you wish to visualize how that graph might look like, remove the line of code df[df['difference']>0]
and see how the graph turns out to be.
With this, we come to the end of our analysis.
Conclusions
We gained some interesting insights into the dataset. They are as follows:-
- 'Software' is the sector with the highest startup count
- 'Biotechnology' is the highest funded sector, and 'Verizon Communications' is the highest funded startup.
- 41,680 startups are operating across the world, while the rest being either acquired, closed, or unknown.
- California has the highest number of startups across the USA.
- Banglore has the highest number of startups across India.
- The year 2012 had the newest startups.
- There are 38,269 startups that were funded before being founded, 3 startups got funded within 1 year of being founded and only 2 startups took more than 20 years to get funded.
We have found the answers to all the questions, and we can now share our report with the team, together with the processed data for further analysis.
Hope that now you feel more confident to handle datasets of a similar kind.
Waiting for you in our next pandas series.
See you then!