
Starter Datasets for Data Science: A blog around the top 10 datasets for beginners

Understand different datasets from various domains, the role of data science in each one of them and learn to ask the right questions to get the best results from the given data.

In the last few tutorials of the pandas series, we worked on various datasets and used fundamental tools like dataframe.groupby, dataframe.query and many more to break down the datasets and get better results. We then went on to analyze them with another library called Plotly using different charts to visualize the results.

Today, we will look at datasets from different domains, learn a bit about each of them, and brainstorm the right questions to ask without getting too deep into the nitty-gritty of coding.

With the exponential rise in data collection and analysis, many businesses are looking for new ways to perform complex analysis of their data with the help of a Data Scientist. The job of a data scientist is to work towards finding solutions based on results that are generated from complex algorithms being run on datasets collected by their organization. The goal of this post is to learn and understand data science concepts from real world datasets.

In this post we will discuss some basic datasets from different domains for our data analysis task. If you are into data science or want to be a part of it, then this is the right resource for you.

We have curated a list of 10 datasets that you can use for analysis straight off the bat. Our focus will be on exploring datasets from different domains like Banking and Finance, Healthcare, Sales Forecasting, Sports, Gaming and more.

We will cover extensively the following datasets:

BigMart Sales Analysis
Video Game Sales Analysis
Heart Disease Analysis
Loan Prediction Analysis
Online Auctions Analysis
Customer Churn Analysis
Brazilian E-commerce Public Dataset Analysis
Fraud Detection Analysis
FIFA 19 Dataset Analysis
Startup Investment Crunchbase Analysis

Let's discuss them one by one.

Note: We have already provided elaborate and beginner-friendly tutorials (including code and concepts) on some of the above-mentioned topics in our pandas series. Check them out here.

1) BigMart Sales Dataset Analysis

Are you a data science enthusiast curious about retail and FMCG sales forecasting? Then this is the dataset for you!

Get the data here.

Understanding the dataset

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined.

This dataset comes under the sales forecast category for the retail and FMCG sectors. If we understand how to handle this dataset well, we can handle other sales forecast datasets in a similar fashion.

For the purpose of learning how to analyze the dataset, we can use the train.csv file only. This is what our dataset looks like:

	Item_Identifier	Item_Weight	Item_Fat_Content	Item_Visibility	Item_Type	Item_MRP	Outlet_Identifier	Outlet_Establishment_Year	Outlet_Size	Outlet_Location_Type	Outlet_Type	Item_Outlet_Sales
0	FDA15	9.30	Low Fat	0.016047	Dairy	249.8092	OUT049	1999	Medium	Tier 1	Supermarket Type1	3735.1380
1	DRC01	5.92	Regular	0.019278	Soft Drinks	48.2692	OUT018	2009	Medium	Tier 3	Supermarket Type2	443.4228

Total Features: 12
Target Variable: Item_Outlet_Sales
Total no. of observations: 8523 (Train.csv)

What questions to ask?

Using a data science approach we want to track what attributes might affect the sales. Therefore, it is important to ask the right questions and brainstorm as many attributes as we can.

What are the most important factors in determining the best sales figures? Here are a few questions that we should be asking. We want to know the following:

Does the outlet type (grocery store or supermarket) have any impact on the overall sales?
Which type of city has the most overall sales? Or which outlet location makes the most overall sales?
Does the outlet size have any impact on the overall sales?
Which category of products sells the most and the least?
Do the product visibility and weight have any impact on the sales of the product?
What is the average MRP of the product that sells the most and the least? Which category do these products fall under?
What are some products that sell better in Tier 1 cities as compared to Tier 2 and Tier 3 cities?
Are there any products selling better in Tier 2 and 3 cities as compared to Tier 1 cities?

These questions collectively cover all the important attributes of our dataset, and using them we can find out how each attribute affects the overall sales.
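As a minimal sketch of how the first few questions might be approached with pandas, the snippet below averages sales per outlet type. It uses only the two sample rows shown above, so the numbers mean nothing on their own; on the real data you would load the full file instead (the file path in the comment is an assumption). The helper name mean_sales_by is hypothetical and chosen just for this illustration.

```python
import pandas as pd

# Two sample rows copied from the preview above. In practice you would
# load the full file, e.g. pd.read_csv("Train.csv") (path is an assumption).
sales = pd.DataFrame(
    {
        "Item_Identifier": ["FDA15", "DRC01"],
        "Item_Type": ["Dairy", "Soft Drinks"],
        "Outlet_Size": ["Medium", "Medium"],
        "Outlet_Location_Type": ["Tier 1", "Tier 3"],
        "Outlet_Type": ["Supermarket Type1", "Supermarket Type2"],
        "Item_Outlet_Sales": [3735.1380, 443.4228],
    }
)


def mean_sales_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Average Item_Outlet_Sales for each value of `column`, highest first."""
    return (
        df.groupby(column)["Item_Outlet_Sales"]
        .mean()
        .sort_values(ascending=False)
    )


by_outlet_type = mean_sales_by(sales, "Outlet_Type")
print(by_outlet_type)
```

The same helper can be pointed at Outlet_Size, Outlet_Location_Type or Item_Type to explore the other questions in the list.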

A similar approach can be used in dealing with different sales forecasting datasets as well.

You can also check out our tutorial on the same dataset here, and learn how to answer each one of the above questions.

2) Video Game Sales Analysis

If you want to learn data science as a gamer, this is the dataset that might excite you. Video games are a billion-dollar business and have been for many years. Through this dataset, you will learn how to quantify different factors, analyze their effect on the sales of video games, and determine which video game system is the best one to invest in.

Get the data here.

Let's see what this dataset has to tell us.

Understanding the dataset

This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com. Many attributes are also given to analyze the effect on sales.

The following are the first two rows of our dataset:

	Rank	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales
0	1	Wii Sports	Wii	2006.0	Sports	Nintendo	41.49	29.02	3.77	8.46	82.74
1	2	Super Mario Bros.	NES	1985.0	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24

Total Features: 11
Total no. of observations: 16598

We can analyze the sales depending on the regions using the following columns:

NA_Sales : Sales in North America (in millions)

EU_Sales : Sales in Europe (in millions)

JP_Sales : Sales in Japan (in millions)

Other_Sales : Sales in the rest of the world (in millions)

Global_Sales : Total worldwide sales (in millions)

What questions to ask?

Since we have to analyze the sales, we can do so for different regions like North America, Europe and Japan, and also globally. Some of the questions that come to mind when looking at the dataset are:

Which region has performed the best in terms of sales?
The top gaming consoles are Microsoft (Xbox), Sony (Playstation) and Nintendo, with Google acting as a new competitor. Does the dataset also back this information? Analyze w.r.t. different regions and also, globally.
What are the top 10 games currently making the most sales globally?
Are there any games that have performed well regionally but not globally? What are they?
Are there any games with release year older than 2000 that are still making high sales? What are they?
What are the top gaming genres that are making high sales?
Does the publisher have any impact on the regional and global sales?
Is there any region that has out-performed global average sales?

These are some of the important questions that we can ask when looking at the dataset. The information, analysis, and methodology we apply to this data strongly influence whether a sales strategy succeeds or fails, so the answers to these questions matter a great deal.
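A rough sketch of how some of these questions could be answered with pandas is given below. It is built on the two sample rows shown earlier, so the results are only illustrative; the variable names are our own and not part of the dataset.

```python
import pandas as pd

# Sample rows from the preview above (sales figures are in millions).
games = pd.DataFrame(
    {
        "Name": ["Wii Sports", "Super Mario Bros."],
        "Platform": ["Wii", "NES"],
        "Year": [2006.0, 1985.0],
        "Genre": ["Sports", "Platform"],
        "NA_Sales": [41.49, 29.08],
        "EU_Sales": [29.02, 3.58],
        "JP_Sales": [3.77, 6.81],
        "Other_Sales": [8.46, 0.77],
        "Global_Sales": [82.74, 40.24],
    }
)

REGION_COLUMNS = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Total sales per region, largest first: which region sells the most?
region_totals = games[REGION_COLUMNS].sum().sort_values(ascending=False)

# Top games by global sales (10 rows at most; the sample only has two).
top_games = games.nlargest(10, "Global_Sales")[["Name", "Global_Sales"]]

# Games released before 2000, ordered by global sales.
classics = games[games["Year"] < 2000].sort_values(
    "Global_Sales", ascending=False
)

print(region_totals)
print(top_games)
```

Grouping by Genre or Publisher and summing the same sales columns follows the same pattern.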

3) Heart Disease Analysis

Data scientists are springing up in the healthcare sector every day as data drives the future of healthcare. This dataset helps to prepare you for the same. Cardiovascular diseases (CVDs) are the leading cause of death globally. An estimated 17.9 million people died from CVDs in 2019, representing 32% of all global deaths. Therefore, we chose the heart disease dataset to understand how to analyze a dataset in this domain, while also learning different factors that might lead to heart disease. If you are someone interested in the healthcare domain, then this is the perfect dataset to get started.

Get the data here.

Understanding the dataset

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them.

The following are the first two rows of the dataset.

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	ca	thal	target
0	63	1	3	145	233	1	0	150	0	2.3	0	0	1	1
1	37	1	2	130	250	0	1	187	0	3.5	0	0	2	1

Total Features: 14
Target Variable: target
Total no. of observations: 303

The columns give information on the following:

age: age
sex: sex
cp: chest pain type (4 values)
trestbps: resting blood pressure
chol: serum cholesterol in mg/dl
fbs: fasting blood sugar > 120 mg/dl
restecg: resting electrocardiographic results (values 0,1,2)
thalach: maximum heart rate achieved
exang: exercise induced angina. Stable angina is usually triggered by physical activity. When you climb stairs, exercise or walk, your heart demands more blood, but narrowed arteries slow down blood flow
oldpeak: oldpeak = ST depression induced by exercise relative to rest
slope: the slope of the peak exercise ST segment. The ST/heart rate slope (ST/HR slope) has been proposed as a more accurate ECG criterion for diagnosing significant coronary artery disease (CAD)
ca: number of major vessels (0-3) colored by fluoroscopy.
thal: Thalassemia is an inherited blood disorder characterised by less oxygen-carrying protein (haemoglobin) and fewer red blood cells in the body than normal. 3 = normal; 6 = fixed defect; 7 = reversible defect
target: Heart disease (0 = no, 1 = yes)

These attributes can give us some useful insights into the dataset.

What questions to ask?

It's important that we get the most information out of the attributes available to us and see how each of them affects the patient.

What is the percentage of patients who have heart disease?
What is the ratio of male to female patients who have heart disease?
Does age play a role in heart disease? Around what age do heart disease cases among the patients spike?
When there is too much cholesterol in your blood, it builds up in the walls of your arteries, causing heart disease. Can you back this up with the data? What is the average cholesterol level of patients with heart disease?
Compare the following factors of patients with and without heart diseases:

Resting blood pressure
Fasting blood sugar level
Max. heart rate achieved
Exercise induced angina
No. of major vessels
Thalassemia

We have covered the key attributes. The results are crucial for predicting a patient's health: they can help physicians take preventive action against future heart disease and validate their own assessments against the data.
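The comparison between patients with and without heart disease can be sketched as a simple group-by. The snippet below uses a few columns from the two sample rows shown earlier; both of them have target = 1, so only one group appears here, whereas the full data would give one row per group.

```python
import pandas as pd

# A few columns from the two sample rows in the preview above.
heart = pd.DataFrame(
    {
        "age": [63, 37],
        "sex": [1, 1],
        "trestbps": [145, 130],
        "chol": [233, 250],
        "thalach": [150, 187],
        "target": [1, 1],
    }
)

# Share of patients with heart disease, as a percentage.
# target is 0/1, so its mean is the fraction of positive cases.
disease_pct = heart["target"].mean() * 100

# Average of selected measurements per target group (0 = no, 1 = yes).
group_means = heart.groupby("target")[
    ["age", "trestbps", "chol", "thalach"]
].mean()

print(f"{disease_pct:.1f}% of patients have heart disease")
print(group_means)
```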

4) Loan Prediction Analysis

Loan prediction is a system that helps loan officers identify customers who are eligible for a loan. With this dataset, we want to automate the process of identifying customers who are eligible to get a loan. This is a great dataset for anyone who wants to get familiar with the banking and financial domain and take a first step into it.

Get the data here.

Since our focus for this post is on learning to analyze the dataset and not on building a model, we will use the train.csv file.

Understanding the dataset

We have some important attributes that might help us identify potential customers who are eligible for a loan. The following are the first two rows of the dataset:

	Loan_ID	Gender	Married	Dependents	Education	Self_Employed	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History	Property_Area	Loan_Status
0	LP001002	Male	No	0	Graduate	No	5849	0.0	NaN	360.0	1.0	Urban	Y
1	LP001003	Male	Yes	1	Graduate	No	4583	1508.0	128.0	360.0	1.0	Rural	N

Total Features: 13
Target Variable: Loan_Status
Total no. of observations: 614 (Train.csv)

What questions to ask?

This dataset has some important attributes and we can only master our data skills by first learning to ask the right questions to understand the data.

How are the applicant's and co-applicant's incomes correlated with the loan status?
Our intuition is that applicants requesting a higher loan amount have a lower chance of getting the loan. Does this dataset back our assumption?
What is the effect of an applicant's number of dependents on the loan status?
Does a higher level of education always mean a greater chance of getting the loan? Can we back this up using the data?
A credit history is a record of a borrower's responsible repayment of debts. Can we analyze the correlation between the credit history and loan status using our dataset?
How does the property area affect the loan status?

These are some areas we can focus on to look for potential applicants who are eligible to get a loan. We can brainstorm more factors as well, like EMI and loan term, and derive them from the existing columns in our dataset.
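Here is a small sketch of what deriving new factors might look like in pandas, using the two sample rows shown earlier. The column names Total_Income and Amount_Per_Term are our own inventions for illustration, and the second one is only a crude stand-in for EMI because it ignores interest.

```python
import pandas as pd

# Selected columns from the two sample rows in the preview above.
loans = pd.DataFrame(
    {
        "Loan_ID": ["LP001002", "LP001003"],
        "ApplicantIncome": [5849, 4583],
        "CoapplicantIncome": [0.0, 1508.0],
        "LoanAmount": [float("nan"), 128.0],
        "Loan_Amount_Term": [360.0, 360.0],
        "Credit_History": [1.0, 1.0],
        "Loan_Status": ["Y", "N"],
    }
)

# Derived factor 1: combined income of applicant and co-applicant.
loans["Total_Income"] = loans["ApplicantIncome"] + loans["CoapplicantIncome"]

# Derived factor 2: loan amount divided by term. This ignores interest,
# so it is only an illustrative proxy for EMI (our assumption).
# Missing loan amounts stay missing (NaN).
loans["Amount_Per_Term"] = loans["LoanAmount"] / loans["Loan_Amount_Term"]

# Approval rate for each credit-history value.
approval_by_history = (
    loans.assign(approved=loans["Loan_Status"].eq("Y"))
    .groupby("Credit_History")["approved"]
    .mean()
)

print(loans[["Loan_ID", "Total_Income", "Amount_Per_Term"]])
print(approval_by_history)
```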

5) Online Auctions Analysis

Are you someone who is interested in understanding how the auction industry works and wants to analyze it? Then this could be something you may be interested in.

An auction is usually a process of buying and selling goods or services by offering them up for bid, taking bids, and then selling the item to the highest bidder or buying the item from the lowest bidder. Remember, a bid is a binding contract. This dataset is about auction trends and the attributes to be aware of as a buyer.

Get the data here.

Understanding the dataset

The datasets are from a companion website for the book Modeling Online Auctions, by Wolfgang Jank and Galit Shmueli. The datasets contain eBay auction information on Cartier wristwatches, Palm Pilot M515 PDAs, Xbox game consoles, and Swarovski beads. We have two files, auction.csv and swarovski.csv. We will be discussing the former here.

Let's see what the first two rows of the dataset look like.

	auctionid	bid	bidtime	bidder	bidderrate	openbid	price	item	auction_type
0	1638893549	175.0	2.230949	schadenfreud	0.0	99.0	177.5	Cartier wristwatch	3 day auction
1	1638893549	100.0	2.600116	chuik	0.0	99.0	177.5	Cartier wristwatch	3 day auction

Total Features: 9
Total no. of observations: 10681 (auction.csv)

What questions to ask?

We can deal with this dataset in two different ways, depending on the purpose of the analysis. Questions can be asked from a buyer's perspective as well as from a seller's perspective.

Are there any auctions that start with the smallest opening bid? What are they?
What types of auctions start with the highest opening bid?
What is the most common type of auction?
Are there any auctions that are more profitable than others?
Should an auction be started with a higher or lower opening bid to attract more bidders?

The above are some of the questions that one can be interested in as a buyer or seller. We can always brainstorm more innovative questions depending on the types of attributes available to us.
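One thing to notice is that each row of auction.csv is a single bid, not a single auction, so most of these questions need the data collapsed to one row per auction first. A minimal sketch of that step, based on the two sample rows above, could look like this. The output column names (n_bids, n_bidders, price_to_openbid) are hypothetical.

```python
import pandas as pd

# Selected columns from the two sample rows in the preview above.
bids = pd.DataFrame(
    {
        "auctionid": [1638893549, 1638893549],
        "bid": [175.0, 100.0],
        "bidder": ["schadenfreud", "chuik"],
        "openbid": [99.0, 99.0],
        "price": [177.5, 177.5],
        "item": ["Cartier wristwatch", "Cartier wristwatch"],
        "auction_type": ["3 day auction", "3 day auction"],
    }
)

# Each row is one bid, so collapse to one row per auction.
auctions = bids.groupby("auctionid").agg(
    item=("item", "first"),
    auction_type=("auction_type", "first"),
    openbid=("openbid", "first"),
    price=("price", "first"),
    n_bids=("bid", "size"),
    n_bidders=("bidder", "nunique"),
)

# Final price relative to the opening bid.
auctions["price_to_openbid"] = auctions["price"] / auctions["openbid"]

print(auctions)
```

On the full data, comparing openbid against n_bidders (for example with auctions["openbid"].corr(auctions["n_bidders"])) is one way to start on the last question.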

6) Customer Churn Analysis

Businesses are always looking for ways to improve their customer acquisition and lifetime value. Churn rate is one of the most powerful customer analytics metrics for this purpose. It’s a measurement of how many existing customers are leaving your product or service. By measuring churn rate, you can dig into why customers left and take action to reduce churn.

A data scientist will help teams measure the rate at which customers quit using churn analytics. They’ll also use machine learning to recommend product improvements. Plus, they’ll act as the liaison between product and engineering to ensure a great customer experience.

If you are someone wanting to work in this domain, then this might be the right dataset for you.

Get the data here.

Understanding the dataset

We are using the Telecommunication Customer Churn dataset. Each row in the dataset represents a customer, and each column contains a different customer attribute.

Let's have a look at the first two rows of the dataset.

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	0	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	0	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.5	No

Total Features: 21
Target Variable: Churn
Total no. of observations: 7043

The dataset includes information about customers who left within the last month, services each customer signed up for, customer account information and demographic information about customers.

What questions to ask?

Data science can help you ask the right questions to get to the bottom of what factors are affecting the churn rate in your customers. Looking at the dataset, we want to know the relation of the following with the churn rate:

What is the average tenure of customers who stayed versus those who left the company?
Which services are preferred by customers who stayed compared to those who left?
Does a larger share of senior citizens increase the churn rate? What proportion of the customers who left are senior citizens?
Which contract types are customers who stayed and customers who left enrolled in?
What are the total and monthly charges of customers who left?

The answers to these questions are crucial to our analysis and can help us identify problem areas to work on and improve customer experience in the company.

Similar questions can be asked to analyze customer churn data for any company.
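A hedged sketch of how the tenure and contract questions could be expressed in pandas is shown below, again using the two sample rows from the preview. Both sample customers stayed, so the churn rates come out as zero here. We assume that TotalCharges may be read in as text and therefore coerce it to numbers; the helper column churned is our own addition.

```python
import pandas as pd

# Selected columns from the two sample rows in the preview above.
customers = pd.DataFrame(
    {
        "customerID": ["7590-VHVEG", "5575-GNVDE"],
        "SeniorCitizen": [0, 0],
        "tenure": [1, 34],
        "Contract": ["Month-to-month", "One year"],
        "MonthlyCharges": [29.85, 56.95],
        "TotalCharges": ["29.85", "1889.5"],
        "Churn": ["No", "No"],
    }
)

# Assumption: TotalCharges can arrive as text, so convert it to numbers
# and turn anything unparseable into NaN.
customers["TotalCharges"] = pd.to_numeric(
    customers["TotalCharges"], errors="coerce"
)

# Boolean helper column (hypothetical name) so that means become rates.
customers["churned"] = customers["Churn"].eq("Yes")

churn_rate = customers["churned"].mean()
tenure_by_churn = customers.groupby("Churn")["tenure"].mean()
churn_by_contract = customers.groupby("Contract")["churned"].mean()

print(f"Overall churn rate: {churn_rate:.1%}")
print(tenure_by_churn)
print(churn_by_contract)
```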

7) Brazilian E-commerce Public Dataset Analysis

Looking at shopping patterns on any platform allows us to make data-driven decisions. By analyzing this dataset with our favorite programming language, we can explore what types of spikes occur and when during the year. This is a must-have dataset for any data scientist analyzing e-commerce data.

Get the data here.

Understanding the dataset

This end-to-end dataset contains orders from Olist from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. We are using the olist_payments_dataset.csv file for our analysis.

The first two rows are given below:

	order_id	payment_sequential	payment_type	payment_installments	payment_value
0	b81ef226f3fe1789b1e8b2acac839d17	1	credit_card	8	99.33
1	a9810da82917af2d9aefd1278f1dcfa0	1	credit_card	1	24.39

Total Features: 5
Target Variable: payment_value
Total no. of observations: 103886

The dataset contains the payment information about all the orders like the order ID, payment type, payment value and more.

What questions to ask?

How many installments do customers usually choose to pay in?
What is the most preferred way of payment for customers?
What is the correlation between the installments and payment value?
What is the average, minimum and maximum payments made by customers?

These questions are a good starting point for analyzing this dataset. They help us extract useful information from the different columns and better understand customer purchasing patterns.

You can explore the different files given in the dataset and come up with more innovative questions to extract the most information from the given attributes.
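The four payment questions above map quite directly onto pandas one-liners. The sketch below runs them on the two sample rows from the preview, so treat the outputs as placeholders for what the full file would give; the variable names are ours.

```python
import pandas as pd

# The two sample rows from the preview above.
payments = pd.DataFrame(
    {
        "order_id": [
            "b81ef226f3fe1789b1e8b2acac839d17",
            "a9810da82917af2d9aefd1278f1dcfa0",
        ],
        "payment_sequential": [1, 1],
        "payment_type": ["credit_card", "credit_card"],
        "payment_installments": [8, 1],
        "payment_value": [99.33, 24.39],
    }
)

# Most preferred payment method.
most_common_type = payments["payment_type"].value_counts().idxmax()

# How often each number of installments is chosen.
installment_counts = (
    payments["payment_installments"].value_counts().sort_index()
)

# Average, minimum and maximum payment value.
value_summary = payments["payment_value"].agg(["mean", "min", "max"])

# Pearson correlation between installments and payment value
# (only meaningful on the full data, not on two rows).
installments_vs_value = payments["payment_installments"].corr(
    payments["payment_value"]
)

print(most_common_type)
print(installment_counts)
print(value_summary)
```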

8) Fraud Detection Analysis

What is fraud detection analysis? Whether you are a fraud analyst, statistician, data scientist or risk manager, work in the financial services industry, or are simply curious, this is the dataset for you.

Fraud involving cell phones, insurance claims, tax return claims, credit card transactions, government procurement and so on is a significant problem for governments and businesses, and specialized analysis techniques are required to detect it.

We will focus on credit card fraud detection.

Get the data here.

Understanding the dataset

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

Let's have a look at the dataset.

	Time	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17	V18	V19	V20	V21	V22	V23	V24	V25	V26	V27	V28	Amount	Class
0	0.0	-1.359807	-0.072781	2.536347	1.378155	-0.338321	0.462388	0.239599	0.098698	0.363787	0.090794	-0.551600	-0.617801	-0.991390	-0.311169	1.468177	-0.470401	0.207971	0.025791	0.403993	0.251412	-0.018307	0.277838	-0.110474	0.066928	0.128539	-0.189115	0.133558	-0.021053	149.62	0
1	0.0	1.191857	0.266151	0.166480	0.448154	0.060018	-0.082361	-0.078803	0.085102	-0.255425	-0.166974	1.612727	1.065235	0.489095	-0.143772	0.635558	0.463917	-0.114805	-0.183361	-0.145783	-0.069083	-0.225775	-0.638672	0.101288	-0.339846	0.167170	0.125895	-0.008983	0.014724	2.69	0

Total Features: 31
Target Variable: Class
Total no. of observations: 284807

Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. The feature 'Class' is the response variable, and it takes the value 1 in case of fraud and 0 otherwise.

What questions to ask?

Every fraud detection system is different, but data scientist jobs in this area all focus on digging deep into a database and determining whether there is a fraudulent pattern of behavior. But before we can do any of this, it's important to ask the right questions.

What is the proportion of fraud to normal transactions in the dataset?
What's the relation between the time of transaction and the amount?
What are the highest correlated columns?
What is the average time of transaction?
What is the average amount of transaction?

Using the above questions, we can find insightful answers that will help us analyze the dataset better and come up with useful conclusions.
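The first question, the proportion of fraud to normal transactions, is simple arithmetic on the counts quoted in the dataset description. The sketch below works it out and also shows a small helper for computing class shares from a label column. The helper name class_share and the four toy labels are made up purely to demonstrate the call; they are not taken from the dataset.

```python
import pandas as pd

# Counts quoted in the dataset description above.
n_fraud = 492
n_total = 284_807

# Fraud share as a percentage of all transactions (roughly 0.17%).
fraud_pct = n_fraud / n_total * 100


def class_share(labels: pd.Series) -> pd.Series:
    """Fraction of rows in each class (0 = normal, 1 = fraud)."""
    return labels.value_counts(normalize=True).sort_index()


# Tiny made-up label column, only to show the call; not real data.
toy_labels = pd.Series([0, 0, 0, 1], name="Class")
share = class_share(toy_labels)

print(f"Fraud share: {fraud_pct:.3f}%")
print(share)
```

With such a small positive class, plain accuracy is a poor yardstick, which is one reason checking the class balance first is worthwhile.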

9) FIFA 19 Dataset

It is the time of the year when everyone lays out their enthusiasm for football during the UEFA European Football Championship, and today you can join in on the fun! If you are a football enthusiast and want to analyze a dataset on the same, then you have come to the right place. Here, we are going to give you an insight into the Data Scientist roles in Football Analytics.

Get the data here.

Understanding the dataset

The dataset contains detailed attributes for every player registered in the latest edition of the FIFA 19 database.

Let's look at the dataset to understand it better.

	Unnamed: 0	ID	Name	Age	Photo	Nationality	Flag	Overall	Potential	Club	Club Logo	Value	Wage	Special	Preferred Foot	International Reputation	Weak Foot	Skill Moves	Work Rate	Body Type	Real Face	Position	Jersey Number	Joined	Loaned From	Contract Valid Until	Height	Weight	LS	ST	RS	LW	LF	CF	RF	RW	LAM	CAM	RAM	LM	LCM	CM	RCM	RM	LWB	LDM	CDM	RDM	RWB	LB	LCB	CB	RCB	RB	Crossing	Finishing	HeadingAccuracy	ShortPassing	Volleys	Dribbling	Curve	FKAccuracy	LongPassing	BallControl	Acceleration	SprintSpeed	Agility	Reactions	Balance	ShotPower	Jumping	Stamina	Strength	LongShots	Aggression	Interceptions	Positioning	Vision	Penalties	Composure	Marking	StandingTackle	SlidingTackle	GKDiving	GKHandling	GKKicking	GKPositioning	GKReflexes	Release Clause
0	0	158023	L. Messi	31	https://cdn.sofifa.org/players/4/19/158023.png	Argentina	https://cdn.sofifa.org/flags/52.png	94	94	FC Barcelona	https://cdn.sofifa.org/teams/2/light/241.png	€110.5M	€565K	2202	Left	5.0	4.0	4.0	Medium/ Medium	Messi	Yes	RF	10.0	Jul 1, 2004	NaN	2021	5'7	159lbs	88+2	88+2	88+2	92+2	93+2	93+2	93+2	92+2	93+2	93+2	93+2	91+2	84+2	84+2	84+2	91+2	64+2	61+2	61+2	61+2	64+2	59+2	47+2	47+2	47+2	59+2	84.0	95.0	70.0	90.0	86.0	97.0	93.0	94.0	87.0	96.0	91.0	86.0	91.0	95.0	95.0	85.0	68.0	72.0	59.0	94.0	48.0	22.0	94.0	94.0	75.0	96.0	33.0	28.0	26.0	6.0	11.0	15.0	14.0	8.0	€226.5M
1	1	20801	Cristiano Ronaldo	33	https://cdn.sofifa.org/players/4/19/20801.png	Portugal	https://cdn.sofifa.org/flags/38.png	94	94	Juventus	https://cdn.sofifa.org/teams/2/light/45.png	€77M	€405K	2228	Right	5.0	4.0	5.0	High/ Low	C. Ronaldo	Yes	ST	7.0	Jul 10, 2018	NaN	2022	6'2	183lbs	91+3	91+3	91+3	89+3	90+3	90+3	90+3	89+3	88+3	88+3	88+3	88+3	81+3	81+3	81+3	88+3	65+3	61+3	61+3	61+3	65+3	61+3	53+3	53+3	53+3	61+3	84.0	94.0	89.0	81.0	87.0	88.0	81.0	76.0	77.0	94.0	89.0	91.0	87.0	96.0	70.0	95.0	95.0	88.0	79.0	93.0	63.0	29.0	95.0	82.0	85.0	95.0	28.0	31.0	23.0	7.0	11.0	15.0	14.0	11.0	€127.1M

Total Features: 89
Total no. of observations: 18207

The dataset contains columns like name, age, nationality and more.

What questions to ask?

What is the answer to football's most pressing data questions? This is the story of how data can tell you which players will score more goals, who the best goalkeepers are, quickest strikers and toughest defenders and much much more.

However, with a dataset with so many attributes it's hard to know where to begin. Let's just highlight a few important columns and ask questions around them.

What is the age range in which players perform the best?
Which are the top 20 clubs by overall player performance?
Who are the top 10 players according to potential and overall rating?
How are the height and weight of a player correlated with overall performance?
Out of all the factors (shot power, stamina, strength etc.), which is the most effective in explaining a player's overall performance?

The above questions can help us bridge the gap between the raw data and the insights it contains.
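Before questions about value, height or weight can be answered, those columns have to be turned into numbers, because the preview shows them stored as text such as €110.5M, 5'7 and 159lbs. A minimal sketch of that cleaning step follows. The function and column names (parse_euro, height_to_cm, value_eur and so on) are hypothetical, and the functions assume every value is a non-missing string in the formats seen in the preview.

```python
import pandas as pd


def parse_euro(text: str) -> float:
    """Convert strings such as '€110.5M' or '€565K' to a number of euros."""
    multipliers = {"M": 1_000_000, "K": 1_000}
    number = text.replace("€", "")
    if number and number[-1] in multipliers:
        return float(number[:-1]) * multipliers[number[-1]]
    return float(number)


def height_to_cm(text: str) -> float:
    """Convert a feet'inches string such as "5'7" to centimetres."""
    feet, inches = text.split("'")
    return (int(feet) * 12 + int(inches)) * 2.54


# Selected columns from the two sample rows in the preview above.
players = pd.DataFrame(
    {
        "Name": ["L. Messi", "Cristiano Ronaldo"],
        "Overall": [94, 94],
        "Value": ["€110.5M", "€77M"],
        "Wage": ["€565K", "€405K"],
        "Height": ["5'7", "6'2"],
        "Weight": ["159lbs", "183lbs"],
    }
)

players["value_eur"] = players["Value"].map(parse_euro)
players["wage_eur"] = players["Wage"].map(parse_euro)
players["height_cm"] = players["Height"].map(height_to_cm)
players["weight_lbs"] = (
    players["Weight"].str.replace("lbs", "", regex=False).astype(float)
)

print(players[["Name", "value_eur", "wage_eur", "height_cm", "weight_lbs"]])
```

Once these columns are numeric, correlations with Overall can be computed in the usual way. On the full data, missing values would need to be handled before applying these functions.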

10) Startup Investment Crunchbase Analysis

Want to analyze various factors of your startup? Then this is the dataset that will give you some exposure to it.

The startup investment data, provided by Crunchbase, gives readers the various numbers and metrics of a startup. A data scientist will not only help find key metrics in any startup's data but will also analyze the data using various algorithms, giving the company an edge over other players by identifying and predicting trends in the dataset.

Get the data here.

Understanding the dataset

The dataset has various features giving information like the name of the startup, homepage URL, total funding received, and a lot more. However, for this particular post, our focus will be on a few very specific areas only.

The dataset looks like this:

	permalink	name	homepage_url	category_list	market	funding_total_usd	status	country_code	state_code	region	city	funding_rounds	founded_at	founded_month	founded_quarter	founded_year	first_funding_at	last_funding_at	seed	venture	equity_crowdfunding	undisclosed	convertible_note	debt_financing	angel	grant	private_equity	post_ipo_equity	post_ipo_debt	secondary_market	product_crowdfunding	round_A	round_B	round_C	round_D	round_E	round_F	round_G	round_H
0	/organization/waywire	#waywire	http://www.waywire.com	|Entertainment|Politics|Social Media|News|	News	17,50,000	acquired	USA	NY	New York City	New York	1.0	2012-06-01	2012-06	2012-Q2	2012.0	2012-06-30	2012-06-30	1750000.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	/organization/tv-communications	&TV Communications	http://enjoyandtv.com	|Games|	Games	40,00,000	operating	USA	CA	Los Angeles	Los Angeles	2.0	NaN	NaN	NaN	NaN	2010-06-04	2010-09-23	0.0	4000000.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

Total Features: 39
Total no. of observations: 54294

What questions to ask?

Everyone wants to know which factors affect startups the most. To identify those attributes, we frame questions that bring out the most useful results.

How are startups distributed across different sectors?
How does the funding received vary across startups?
How are startups distributed by status (operating, closed, etc.)?
What are the regions in the USA with the most startups?
What are the regions in India with the most startups?
What are the top countries with the most startups?
In which years were most startups founded?
Identify the startups that:
a. got funding in less than 1 year
b. got funded after more than 20 years
c. got funded before they were founded

Analyzing the above factors will help us understand what sets the better-performing startups apart from the rest.
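The funding-timing questions (a, b and c) depend on two preparation steps: turning funding_total_usd, which the preview shows as text with comma separators, into numbers, and turning the date columns into real dates. A rough sketch on the two sample rows is given below. The derived column names funding_usd and days_to_first_funding are our own.

```python
import pandas as pd

# Selected columns from the two sample rows in the preview above.
# The second startup has no founding date in the preview.
startups = pd.DataFrame(
    {
        "name": ["#waywire", "&TV Communications"],
        "market": ["News", "Games"],
        "funding_total_usd": ["17,50,000", "40,00,000"],
        "status": ["acquired", "operating"],
        "founded_at": ["2012-06-01", None],
        "first_funding_at": ["2012-06-30", "2010-06-04"],
    }
)

# Strip the comma separators; anything non-numeric becomes NaN.
startups["funding_usd"] = pd.to_numeric(
    startups["funding_total_usd"].str.replace(",", "", regex=False).str.strip(),
    errors="coerce",
)

# Parse the date columns; unparseable or missing dates become NaT.
for column in ["founded_at", "first_funding_at"]:
    startups[column] = pd.to_datetime(startups[column], errors="coerce")

# Days between founding and first funding; NaN when a date is missing.
startups["days_to_first_funding"] = (
    startups["first_funding_at"] - startups["founded_at"]
).dt.days

# a. funded within one year of founding
funded_within_a_year = startups[
    startups["days_to_first_funding"].between(0, 365)
]
# c. funded before they were founded
funded_before_founding = startups[startups["days_to_first_funding"] < 0]

print(startups[["name", "funding_usd", "days_to_first_funding"]])
```

Rows with a missing date drop out of both filters automatically, since comparisons with NaN are false.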

You can also check out our tutorial on the same dataset here, and learn how to answer each one of the above questions.

Summary

We have looked at 10 different datasets and how they can help us hone our skills. We have seen that there are challenges in all domains and that we need to be prepared for them. Along with this, we also discussed what it feels like to be a data scientist in different domains.

We are sure now you can find the right dataset that suits your work best among the given domains.

Always remember that a little work always pays off, so don't be afraid to jump into the unknown!
