pandas

Analyzing Startup Investments

Analyzing the startup investment dataset to gain insights about how different features are related to startups across the globe.

Welcome back to another pandas series with us. This time we will be addressing a very different category. As the heading of the post suggests, we will analyze the startup investment dataset. If you have been following our pandas series, you know that this topic is completely different from the previous datasets we have worked with. If you are new here, then there is no better dataset to work on together. So let's get started!

The 'Startup Investment (Crunchbase)' dataset will be our center of study (link given below, keep reading!). The dataset has various features giving information like the name of the startup, homepage URL, total funding received, and a lot more. However, for this particular post, our focus will be on a few very specific areas only.

We will analyze this dataset from the perspective of a VC analyst. If you are working in a new VC firm, you want to analyze the relation of different startups with respect to the following areas:

Different sectors
Funding received
Status (operating, closed, etc.)
Regions in the USA with the most startups
Regions in India with the most startups
Top countries with the most startups
Year of founding
Startups that:
a. got funding in less than 1 year
b. got funded after more than 20 years
c. got funded before they were founded

Dataset to download: https://www.kaggle.com/arindam235/startup-investments-crunchbase

💡

Before that, follow us on Twitter to stay updated about more such content.

Follow @DatascienceFM

Pre-processing the data

Before we begin with our work, it's important to import some useful libraries into our python environment.

import pandas as pd
import plotly.express as px
import datetime as dt
import warnings
warnings.filterwarnings("ignore")
print("The modules are imported")

We have imported all the useful modules and now we are ready to read our dataset.

df = pd.read_csv('investments_VC.csv', encoding = "ISO-8859-1")
df.head(2)

	permalink	name	homepage_url	category_list	market	funding_total_usd	status	country_code	state_code	region	city	funding_rounds	founded_at	founded_month	founded_quarter	founded_year	first_funding_at	last_funding_at	seed	venture	equity_crowdfunding	undisclosed	convertible_note	debt_financing	angel	grant	private_equity	post_ipo_equity	post_ipo_debt	secondary_market	product_crowdfunding	round_A	round_B	round_C	round_D	round_E	round_F	round_G	round_H
0	/organization/waywire	#waywire	http://www.waywire.com	\|Entertainment\|Politics\|Social Media\|News\|	News	17,50,000	acquired	USA	NY	New York City	New York	$1.00	2012-06-01	2012-06	2012-Q2	$2,012.00	2012-06-30	2012-06-30	$1,750,000.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00
1	/organization/tv-communications	&TV Communications	http://enjoyandtv.com	\|Games\|	Games	40,00,000	operating	USA	CA	Los Angeles	Los Angeles	$2.00	NaN	NaN	NaN	NaN	2010-06-04	2010-09-23	$0.00	$4,000,000.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00	$0.00

Once we execute the above codes, we see the different columns that are present in our dataset. We can check the shape and the different columns also using the command df.shape and df.columns. The dataset has a total of 54294 rows and 39 columns.

It's important to understand what information each column is providing before we could drop the least useful ones. To do that, we will create a table with the column names and descriptions for each column present.

description = ['Static hyperlink for the startup on Crunchbase\'s website','name of the startup','Website address of the startup',
       'in which category the startups fall','which market the startup caters to','total funding received(in USD)',
        'current operating status','country of origin','state of origin','region','city of origin','total rounds of funding',
        'date of founding','month of founding','quarter of founding','year of founding','date of first funding','date of last funding',
        'seed funding received(in USD)','venture funding received(in USD)','funding received by diluting equity',
        'other undisclosed funding sources','funding received from convertible notes','funding received from debts',
        'funding received from angel investors','funding from grants','funding from private equity',
        'funding from equity dilution after IPO','funding from debts after IPO','funding from secondary markets',
        'funding from crowdfunding','round A funding','round B funding','round C funding','round D funding','round E funding',
        'round F funding','round G funding','round H funding']
df_desc = pd.DataFrame(list(zip(df.columns, description)), columns =['Column', 'Description'])
df_desc

We have created a list description with the description for each of the columns. Next, we have combined this with the column names using the zip() function of python. So, what exactly does the zip() function do?

zip(): This function returns a zip object, which is an iterator of tuples where the first item in each passed iterator is paired together, and then the second item in each passed iterator are paired together and so on.If the passed iterators have different lengths, the iterator with the least items decides the length of the new iterator.

In the end, we converted this table into a dataframe using the pd.DataFrame function of pandas. Now we know how exactly the code will be executed.

	Column	Description
	permalink	Static hyperlink for the startup on Crunchbase's website
	name	name of the startup
	homepage_url	Website of the startup
	category_list	Different categories of startups
	market	Different markets of startup
	funding_total_usd	total funding received(in USD)
	status	Current operating status
	country_code	Country of startup
	state_code	State of statup
	region	region
	city	city
	funding_rounds	total rounds of funding
	founded_at	date of founding
	founded_month	month of founding
	founded_quarter	quarter of founding
	founded_year	year of founding
	first_funding_at	date of first funding
	last_funding_at	date of last funding
	seed	seed funding received(in USD)
	venture	venture funding received(in USD)
	equity_crowdfunding	funding received by diluting equity
	undisclosed	other undisclosed funding sources
	convertible_note	funding received from convertible notes
	debt_financing	funding received from debts
	angel	funding received from angel investors
	grant	funding from grants
	private_equity	funding from private equity
	post_ipo_equity	funding from equity dilution after IPO
	post_ipo_debt	funding from debts after IPO
	secondary_market	funding from secondary markets
	product_crowdfunding	funding from crowdfunding
	round_A	round A funding
	round_B	round B funding
	round_C	round C funding
	round_D	round D funding
	round_E	round E funding
	round_F	round F funding
	round_G	round G funding
	round_H	round H funding

Once we understand what each column means, we realize that not all of these are useful to us, therefore, before we start with our analysis it's important to drop the columns we don't need and with that, drop the duplicate rows as well.

#deleting duplicate rows.
df = df.drop_duplicates()
print(df.shape)

#cleaning the dataframe by dropping uneccessary columns
df = df.drop(['permalink', 'homepage_url'], axis=1)

#Removing the row with no 'name'
df.dropna(how='any', subset=['name'], axis=0, inplace=True)

#Changing the values in column "funding_total_usd" from string to float
df[' funding_total_usd '] = df[' funding_total_usd '].str.strip().str.replace(",","")
df[' funding_total_usd '] = df[' funding_total_usd '].replace("-",0).astype("float")

#Replacing missing status with "unknown"
df['status'] = df['status'].replace(np.nan,"unknown")

#renaming columns
df.rename(columns = {' funding_total_usd ':'funding_total_usd',' market ':'market'}, inplace = True)

With the above lines of code, we have cleaned our way through the dataset. We can see the final dataframe using the df.head() function.

We are ready to start with our analysis!

Analyzing the dataset

Relation of startups with different sectors

In this section, we will find out the top sectors/markets that are generating the most startups. We observe that the column market provides the various sectors. Using the df['market].nunique() function, we can see that the total number of unique markets is 753. In simpler words, we have 753 unique values for the column market.

It's easier and more convenient to know the top sectors with the most startups, rather than knowing all the 753 different sectors out there. Therefore, for this particular piece of analysis, we are working with just the top 20 sectors.

So, let's see what these top sectors are.

#creating dataframe of top 20 markets with the most counts of startups
top_20_counts=pd.DataFrame(df['market'].value_counts()[:20].sort_values(ascending=False).reset_index())

#renaming columns
top_20_counts.rename(columns={'index':'market','market':'counts'},inplace=True)

#print the head of the dataframe
top_20_counts.head()

We have created a separate dataframe with the top 20 markets and the number of startups they generate. Let's check what this dataframe looks like.

	market	counts
	Software	4619
	Biotechnology	3680
	Mobile	1983
	E-Commerce	1803
	Curated Web	1653

The head() function has returned the first 5 rows of the dataframe, with 'Software' being on top of the list with the count of over 4619, followed by, 'Biotechnology', 'Mobile', and more.

Let's plot the graph for the same. This will help us to grasp the information more easily while also conveying a story to the decision-makers.

The bar graph has made it very easy for us to see these top sectors for startups across the world.

Let's see what our next section is.

Relation of startups with the funding received

Here, we will analyze the sectors with the most funding received. Moreover, we will find the names of the top startups that received the highest funding across the world.

Highest funded sectors

The code to find the highest funded sectors is given below.

#creating a new dataframe
df_market=pd.DataFrame(df[['market','funding_total_usd']])

#drop funding = 0 
df_market.drop(df_market.loc[df_market['funding_total_usd']==0.0].index, inplace=True)

#this adds the dollar sign and commas and limits decimals to 2 places
pd.options.display.float_format = '${:,.2f}'.format  

df_market_=df_market.groupby('market')[['funding_total_usd']].sum().sort_values(by=['funding_total_usd'],ascending=False).reset_index()

	market	funding_total_usd
	Biotechnology	$73,163,695,349.00
	Mobile	$49,470,108,988.00
	Software	$42,221,976,814.00
	Clean Technology	$37,208,938,855.00
	Health Care	$29,349,498,090.00

The see the first 5 rows for the highest funded sectors across the globe in the above dataframe. 'Biotechnology' is the highest funded amongst all the other sectors with a total of around $73 Billion. This is followed by 'Mobile' and 'Software'.

Let's plot our top 20 highest funded sectors to visualize how each sector is performing in comparison to the other.

data = df_market_
y=df_market_['funding_total_usd'][:20]
x=df_market_['market'][:20]

fig_2 = go.Figure(data=[
    go.Bar(name='total funding', x=x, y=y, text= y,textposition='auto')])

# Customize aspect
fig_2.update_traces(marker_color='rgb(179,226,205)', marker_line_color='rgb(102,194,165)',
                  marker_line_width=1.5, opacity=0.6)
fig_2.update_layout(title_text='Top 20 markets')

fig_2.show()

We can see how much each sector is funded all over the world.

Let's see what are the names of the top-funded startups.

Highest funded startups

Try finding the top 10 highest funded startups across the world. Follow a similar procedure as we did above. Don't worry, if you get stuck, you can refer to the codes below. We always have your back!

#startup names and funding received
name_funds=pd.DataFrame(df[['name','funding_total_usd']])
name_funds=name_funds.sort_values(by=['funding_total_usd'],ascending=False).reset_index()
name_funds.drop(['index'], axis=1)

#dataframe consisting of top 20 startups with highest funding
pd.options.display.float_format = '${:,.2f}'.format
startups_10=pd.DataFrame(name_funds[['name','funding_total_usd']][:10])
startups_10

	name	funding_total_usd
	Verizon Communications	$30,079,503,000.00
	Sberbank	$5,800,000,000.00
	Clearwire	$5,700,000,000.00
	Charter Communications	$5,162,513,431.00
	First Data Corporation	$3,500,000,000.00
	COFCO	$3,200,000,000.00
	sigmacare	$2,600,000,000.00
	Facebook	$2,425,700,000.00
	Carestream	$2,400,000,000.00
	Flipkart	$2,351,140,000.00

The top highest funded startups are 'Verizon Communications', 'Sberbank', 'Clearwire'.

Let's jump to the next section, where we will discuss the status of different startups.

Status of different startups

If we look closely, we notice that the column status has 4 unique values, namely, 'Acquired', 'Operating', 'Unknown', and 'Closed', which simply means that the status of a startup can be either of these 4.

We will first find out how many startups does each category has.

status=pd.DataFrame(df['status'].value_counts().reset_index())
status

The above code will count the number of startups in each category using the value_counts() function.

	status	count
	operating	41680
	acquired	3682
	closed	2602
	unknown	1313

A very high number of startups are operating well, that's a good sign!

Plotting the same information to understand the proportion each category holds.

colors = ['gold', 'mediumturquoise', 'darkorange', 'lightgreen']

fig_4 = go.Figure(data=[go.Pie(labels=status['index'],
                             values=status['status'],pull=[0.3, 0, 0, 0])])
fig_4.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict(colors=colors))
fig_4.show()

Now, the proportion for each category is clear to us. Out of the total, 41680 startups are operating, 3682 are acquired, 2602 are closed and the status of 1313 startups is unknown to us.

Before we proceed to the next section, let's see how many rounds of funding each of these categories received. We will make use of the columns round_A .... round_H for this. Scroll up to see what each of these columns was in the description table we created.

#segregating the dataframe on the basis of status
acquired_df=df.query('status == "acquired"')
operating_df=df.query('status == "operating"')
unknown_df=df.query('status == "unknown"')
closed_df=df.query('status == "closed"')

#Plotting a funnel chart to understand the proportion
from plotly import graph_objects as go

fig_5 = go.Figure()

fig_5.add_trace(go.Funnel(
    name = 'acquired',
    y = ["Round A", "Round B","Round C","Round D",
         "Round E","Round F","Round G","Round H"],
    x = [acquired_df['round_A'].mean(),acquired_df['round_B'].mean(),acquired_df['round_C'].mean(),
         acquired_df['round_D'].mean(),acquired_df['round_E'].mean(),
         acquired_df['round_F'].mean(),acquired_df['round_G'].mean(),acquired_df['round_H'].mean()],
    textinfo = "percent initial"))

fig_5.add_trace(go.Funnel(
    name = 'operating',
    orientation = "h",
    y = ["Round A", "Round B","Round C","Round D",
         "Round E","Round F","Round G","Round H"],
    x = [operating_df['round_A'].mean(),operating_df['round_B'].mean(),operating_df['round_C'].mean(),
         operating_df['round_D'].mean(),operating_df['round_E'].mean(),
         operating_df['round_F'].mean(),operating_df['round_G'].mean(),operating_df['round_H'].mean()],
    textposition = "inside",
    textinfo = "percent previous"))

fig_5.add_trace(go.Funnel(
    name = 'unknown',
    orientation = "h",
    y = ["Round A", "Round B","Round C","Round D",
         "Round E","Round F","Round G","Round H"],
    x = [unknown_df['round_A'].mean(),unknown_df['round_B'].mean(),unknown_df['round_C'].mean(),
         unknown_df['round_D'].mean(),unknown_df['round_E'].mean(),
         unknown_df['round_F'].mean(),unknown_df['round_G'].mean(),unknown_df['round_H'].mean()],
    textposition = "outside",
    textinfo = "percent total"))

fig_5.add_trace(go.Funnel(
    name = 'closed',
    orientation = "h",
    y = ["Round A", "Round B","Round C","Round D",
         "Round E","Round F","Round G","Round H"],
    x = [closed_df['round_A'].mean(),closed_df['round_B'].mean(),closed_df['round_C'].mean(),
         closed_df['round_D'].mean(),closed_df['round_E'].mean(),
         closed_df['round_F'].mean(),closed_df['round_G'].mean(),closed_df['round_H'].mean()],
    textposition = "outside",
    textinfo = "percent total"))

fig_5.show()

We see that round_B funding has the most startups under it, followed by round_A funding.

Let's find out what the next section is about.

Regions in USA with the most startups

Looking at the startup distribution in different states of the USA. We will directly visualize the results for this part using a Choropleth Map.

Choropleth maps can be used while plotting data points that include country or city names or coordinates. So, what is a choropleth map?

Choropleth Map:- It is a type of thematic map in which a set of pre-defined areas is colored or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area

#dataframe with country code USA
USA = pd.DataFrame(df.query('country_code == "USA"')['state_code'].value_counts(),columns=['state_code'])
USA.reset_index(inplace=True)

#plotting a choropleth map
fig_6 = go.Figure(data=go.Choropleth(
    locations=USA['index'],
    z = USA['state_code'].astype(int), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'purpor',
    colorbar_title = "startup counts",
))

fig_6.update_layout(
    title_text = 'Startup investment USA state counting',
    geo_scope='usa',height=500
)
fig_6.show()

One of the areas with the highest number of startups is California with over 9.902K startups.

Regions in India with the most startups

We will follow a similar approach to find the regions in India with the most startups, but this time, let's visualize our result using a Treemap.

Treemap:- Treemapping is a method for displaying hierarchical data using nested figures, usually rectangles.

Code for the same is as follows:-

#dataframe with country as india
india=pd.DataFrame(df_india[['name','region','funding_total_usd','status','city']])

#plotting a treemap
india=df.query('country_code=="IND"')
city =pd.DataFrame(india['region'].value_counts()[:30],columns=['region'])
city.reset_index(inplace=True)
fig_7 = px.treemap(city, path=['index'], values='region',title='Top 30 regions in india')
fig_7.show()

'Banglore', 'New Delhi' and 'Mumbai' are the top cities in India with the highest startup count.

Top countries and cities in the world with the highest startup count

Try to plot a treemap for this part on your own. Follow the steps given above. Tally your results with the answers given below, do they match?

1. Top countries are USA, Great Britain and Canada.
2. Top cities are San Francisco, New York and London.

Year of founding

We have come to the last section now, in which we will visualize the founding years of startups. Has there been a spike in the formation of new startups over the years? Let's find out.

We will use the founded_at column for this analysis. The column goes back to the year 1785. We will only observe the years after 1950. Anything earlier than that has little to no relevance to us.

# change the datatype from string to datetime 
df['founded_at'] = pd.to_datetime(df['founded_at'],errors='coerce')

# drop Null value
df['founded_at'].dropna(inplace=True)

foundation = pd.DataFrame(df['founded_at'])

foundation['year'] = foundation['founded_at'].dt.year

#plotting
foundation['year'].value_counts().iplot(kind='bar',yTitle='counting', linecolor='black',opacity=0.7,bargap=0.1,gridcolor='white',color='red',title='Year of founding')

2012 has the most number of startups. We can say that the number of new startups has spiked from 2000 to 2014.

Startups that

Got funding in less than 1 year

We will discuss about the startups that got funded in less than 1 year of being founded. To find the same, we will create a new column in our dataframe which gives the difference between first_funding_at and founded_at. The value for this difference will tell us how much time it took for different startups to get funded. Let's learn how.

# change the datatype from string to datetime
df['first_funding_at'] = pd.to_datetime(df['first_funding_at'],errors='coerce')

# drop Null value
df['first_funding_at'].dropna(inplace=True)

# Define the differrence between the date of foundation and first funding
df['difference'] = df['first_funding_at'] - df['founded_at']

# extract only numeric value
df['difference'] = df['difference'].dt.days

#extracting positive values only and saving in one_yr
one_yr = df[df['difference']>0]
one_yr['difference_level'] = 0

#extracting difference less than 365 days
one_yr=pd.DataFrame(one_yr.loc[one_yr['difference']<365])
one_year=pd.DataFrame(one_yr[['name','market','funding_total_usd','status','country_code']].reset_index())
one_year.drop('index',axis=1,inplace=True)
one_year

name	market	funding_total_usd	status	country_code
Bharat Matrimony	Match-Making	$11,750,000.00	operating	IND
Cincinnati State Technical and Community College	Education	$2,750,000.00	operating	USA
Society of Cable Telecommunications Engineers (SCTE)	Hardware + Software	$2,200,000.00	operating	USA

We see that 'Bharat Matrimony', 'Cincinnati State Technical and Community College' and 'Society of Cable Telecommunications Engineers (SCTE)' are the only three startups that got funded in less than a year. We can also see the market type and total funding received for the same.

Got funding in more than 20 years

Looking at the different startups that took more than 2o years to get funded. Note that the idea to find this will remain same like we did above except for the number of days as input. We can find out that the number of days in 20 years is 7300 ( we can find this using a single google search also). Therefore, in place of 365 days, we will input 7300 days and observe the results.

over_20_=pd.DataFrame(over_20_.loc[over_20_['difference']>=7300])
over_20=pd.DataFrame(over_20_[['name','market','funding_total_usd','status','country_code']].reset_index())
over_20.drop('index',axis=1,inplace=True)
over_20

	name	market	funding_total_usd	status	country_code
	Skinfix	Health and Wellness	$676,393.00	operating	CAN
	Winters Bros. Waste Systems	Local Businesses	$28,500,000.00	operating	USA

Got funded before they were founded

When we look at the new column difference, we observe that many values are negative, which implies that many startups got funded even before they were founded. In this section, we will study more about the same.

For this, we will simply take difference < 0 (difference less than zero) in the code.

negative = df[df['difference']<0]
negative

The above code returns a dataframe will all the startups that got funded before the founding date. We have 38,269 startups that fall under the same category.

In the end, let's visualize the number of startups that got funded after the date of founding for different range of years.

# Range setting for funding level 

positive = df[df['difference']>0]
positive['difference_level'] = 0
positive.loc[positive['difference']<365,'difference_level'] ='under 1year'
positive.loc[ (positive['difference']>=365) & (positive['difference']<1095),'difference_level' ] ='1-3years'
positive.loc[ (positive['difference']>=1095) & (positive['difference']<1825),'difference_level'] ='3-5years'
positive.loc[(positive['difference']>=1825) & (positive['difference']<3650) ,'difference_level'] ='5-10years'
positive.loc[(positive['difference']>=3650) & (positive['difference']<7300) ,'difference_level'] ='10-20years'
positive.loc[positive['difference']>=7300,'difference_level'] ='over 20years'

#Plotting the values
positive['difference_level'].value_counts().iplot(kind='bar',
                                                      yTitle='counts', 
                                                      linecolor='black', 
                                                      opacity=0.3,
                                                      color='turquoise',
                                                      theme='pearl',
                                                      bargap=0.5,
                                                      gridcolor='white',
                                                      title='No. of startups and years')

We observe that most startups got funded between 10-20 years of being founded and the only 2 were funded after 20 years of being founded.

It's important to note that we have not incorporated the startups that got funded before being founded in this graph because the count for that is very high, it will make the other values negligible and we will not be able to observe the count for the years with a lesser count. However, if you wish to visualize how that graph might look like, remove the line of code df[df['difference']>0] and see how the graph turns out to be.

With this, we come to the end of our analysis.

Conclusions

We gained some interesting insights into the dataset. They are as follows:-

'Software' is the sector with the highest startup count
'Biotechnology' is the highest funded sector, and 'Verizon Communications' is the highest funded startup.
41,680 startups are operating across the world, while the rest being either acquired, closed, or unknown.
California has the highest number of startups across the USA.
Banglore has the highest number of startups across India.
The year 2012 had the newest startups.
There are 38,269 startups that were funded before being founded, 3 startups got funded within 1 year of being founded and only 2 startups took more than 20 years to get funded.

We have found the answers to all the questions, and we can now share our report with the team, together with the processed data for further analysis.

Hope that now you feel more confident to handle datasets of a similar kind.

Waiting for you in our next pandas series.
See you then!

Analyzing Startup Investments

Pre-processing the data

Analyzing the dataset

Relation of startups with different sectors

Relation of startups with the funding received

Status of different startups

Regions in USA with the most startups

Regions in India with the most startups

Year of founding

Startups that

Got funding in less than 1 year

Got funding in more than 20 years

Got funded before they were founded

Conclusions

Read next

Pandas Walkthrough E-book on 3 Most Important Concepts

Mega Guide to Pandas for Data Scientists

Pandas Group By - Key areas you should watch out for