Starter Datasets for Data Science: A blog around the top 10 datasets for beginners

Understand different datasets from various domains, the role of data science in each one of them and learn to ask the right questions to get the best results from the given data.

In the last few tutorials of the pandas series, we worked on various datasets and used fundamental tools like dataframe.groupby, dataframe.query and many more to break down the datasets and get better results. We then went on to analyze them with another library called Plotly using different charts to visualize the results.

Today, we will focus on datasets from different domains, learn a bit about them, and brainstorm the right questions to ask, without getting into the nitty-gritty of coding.

With the exponential rise in data collection and analysis, many businesses are looking for new ways to perform complex analysis of their data with the help of a Data Scientist. The job of a data scientist is to work towards finding solutions based on results that are generated from complex algorithms being run on datasets collected by their organization. The goal of this post is to learn and understand data science concepts from real world datasets.

In this post we will discuss some basic datasets from different domains that are well suited to data analysis practice. If you already work in data science or want to get into it, this is the right resource for you.

We have curated a list of 10 datasets that you can use for analysis straight off the bat. Our focus will be on exploring datasets from different domains like Banking and Finance, Healthcare, Sales Forecasting, Sports, Gaming and more.

We will cover extensively the following datasets:

  1. BigMart Sales Analysis
  2. Video Game Sales Analysis
  3. Heart Disease Analysis
  4. Loan Prediction Analysis
  5. Online Auctions Analysis
  6. Customer Churn Analysis
  7. Brazilian E-commerce Public Dataset Analysis
  8. Fraud Detection Analysis
  9. FIFA 19 Dataset Analysis
  10. Startup Investment Crunchbase Analysis

Let's discuss them one by one.

Note: We have already provided elaborate and beginner friendly tutorials (including codes and concepts) on some of the above-mentioned topics in our pandas series. Check them out here.

1) BigMart Sales Dataset Analysis

Are you a data science enthusiast curious about retail and FMCG sales forecasting? Then this is the dataset for you!

Get the data here.
  • Understanding the dataset

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined.

This dataset comes under the sales forecast category for the retail and FMCG sectors. If we understand how to handle this dataset well, we can handle other sales forecasting datasets in a similar fashion.

For the purpose of learning how to analyze the dataset, we can use the train.csv file only. This is what our dataset looks like:

Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type Item_Outlet_Sales
0 FDA15 9.30 Low Fat 0.016047 Dairy 249.8092 OUT049 1999 Medium Tier 1 Supermarket Type1 3735.1380
1 DRC01 5.92 Regular 0.019278 Soft Drinks 48.2692 OUT018 2009 Medium Tier 3 Supermarket Type2 443.4228

Total Features: 12
Target Variable: Item_Outlet_Sales
Total no. of observations: 8523 (Train.csv)

  • What questions to ask?

Using a data science approach we want to track what attributes might affect the sales. Therefore, it is important to ask the right questions and brainstorm as many attributes as we can.

What are the most important factors in determining the best sales figures? Here are a few questions that we should be asking. We want to know the following:

  1. Does the outlet type (grocery store or supermarket) have any impact on the overall sales?
  2. Which type of city has the most overall sales? Or which outlet location makes the most overall sales?
  3. Does the outlet size have any impact on the overall sales?
  4. Which category of products sells the most and the least?
  5. Do the product visibility and weight have any impact on the sales of the product?
  6. What is the average MRP of the product that sells the most and the least? Which category do these products fall under?
  7. What are some products that sell better in Tier 1 cities as compared to Tier 2 and Tier 3 cities?
  8. Are there any products selling better in Tier 2 and 3 cities as compared to Tier 1 cities?

These questions collectively cover all the important attributes of our dataset, and using them we can find out how each attribute affects the overall sales.

A similar approach can be used in dealing with different sales forecasting datasets as well.
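For readers who want a starting point, here is a minimal pandas sketch of questions 1, 2 and 4. Only the column names come from the dataset shown above. The four rows and their values are made up for illustration, and with the real file you would load train.csv instead.

```python
import pandas as pd

# Toy rows with made-up values; only the column names come from the post.
# With the real file you would use: sales = pd.read_csv("train.csv")
sales = pd.DataFrame({
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type2",
                    "Grocery Store", "Supermarket Type1"],
    "Outlet_Location_Type": ["Tier 1", "Tier 3", "Tier 3", "Tier 2"],
    "Item_Type": ["Dairy", "Soft Drinks", "Dairy", "Snack Foods"],
    "Item_Outlet_Sales": [3000.0, 500.0, 200.0, 1500.0],
})

# Question 1: total sales per outlet type.
sales_by_outlet_type = sales.groupby("Outlet_Type")["Item_Outlet_Sales"].sum()

# Question 2: total sales per city tier, highest first.
sales_by_tier = (
    sales.groupby("Outlet_Location_Type")["Item_Outlet_Sales"]
    .sum()
    .sort_values(ascending=False)
)

# Question 4: best- and worst-selling product categories.
sales_by_item_type = sales.groupby("Item_Type")["Item_Outlet_Sales"].sum()
best_category = sales_by_item_type.idxmax()
worst_category = sales_by_item_type.idxmin()

print(sales_by_outlet_type)
print(sales_by_tier)
print("Best:", best_category, "| Worst:", worst_category)
```

Swapping `sum()` for `mean()` gives average sales per item instead of totals, which is often the fairer comparison when some outlet types have far more rows than others.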

You can also check out our tutorial on the same dataset here, and learn how to answer each one of the above questions.

2) Video Game Sales Analysis

If you want to learn data science as a gamer, this is the dataset that might excite you. Video games are a billion-dollar business and have been for many years. Through this dataset, you will learn how to quantify different factors, analyze their effect on the sales of video games, and determine which video game system is the best to invest in.

Get the data here.

Let's see what this dataset has to tell us.

  • Understanding the dataset

This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a web scrape. Several attributes are also provided so that their effect on sales can be analyzed.

The following are the first two rows of our dataset:

Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
0 1 Wii Sports Wii 2006.0 Sports Nintendo 41.49 29.02 3.77 8.46 82.74
1 2 Super Mario Bros. NES 1985.0 Platform Nintendo 29.08 3.58 6.81 0.77 40.24

Total Features: 11
Total no. of observations: 16598

We can analyze the sales depending on the regions using the following columns:

NA_Sales : Sales in North America (in millions)

EU_Sales : Sales in Europe (in millions)

JP_Sales : Sales in Japan (in millions)

Other_Sales : Sales in the rest of the world (in millions)

Global_Sales : Total worldwide sales.

  • What questions to ask?

Since we have to analyze the sales, we can do so for different regions like North America, Europe and Japan, and also globally. Some of the questions that come to mind when looking at the dataset are:

  1. Which region has performed the best in terms of sales?
  2. The top console makers are Microsoft (Xbox), Sony (PlayStation) and Nintendo, with Google acting as a new competitor. Does the dataset back this up? Analyze this by region and also globally.
  3. What are the top 10 games currently making the most sales globally?
  4. Are there any games that have performed well regionally but not globally? What are they?
  5. Are there any games with release year older than 2000 that are still making high sales? What are they?
  6. What are the top gaming genres that are making high sales?
  7. Does the publisher have any impact on the regional and global sales?
  8. Is there any region that has out-performed global average sales?

These are some of the important questions that we can ask looking at the dataset. The information, analysis, and methodologies used to analyze this data can have a large impact on whether our sales succeed or fail, so the answers to these questions matter a great deal.
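As a rough illustration of questions 1, 3, 5 and 6, the sketch below uses the two sample rows shown earlier plus one clearly hypothetical row with made-up numbers. The variable names are our own choices, not part of the dataset.

```python
import pandas as pd

region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# First two rows are the sample rows from the post; the third is made up.
games = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Hypothetical Game"],
    "Year": [2006.0, 1985.0, 2010.0],
    "Genre": ["Sports", "Platform", "Sports"],
    "NA_Sales": [41.49, 29.08, 1.00],
    "EU_Sales": [29.02, 3.58, 2.00],
    "JP_Sales": [3.77, 6.81, 0.50],
    "Other_Sales": [8.46, 0.77, 0.25],
    "Global_Sales": [82.74, 40.24, 3.75],
})

# Question 1: which region has the highest total sales?
regional_totals = games[region_cols].sum()
best_region = regional_totals.idxmax()

# Question 3: top games by global sales (use 10 on the full dataset).
top_games = games.nlargest(2, "Global_Sales")["Name"].tolist()

# Question 5: games released before 2000, ordered by global sales.
older_games = games.query("Year < 2000").sort_values(
    "Global_Sales", ascending=False
)

# Question 6: total global sales per genre, highest first.
genre_totals = (
    games.groupby("Genre")["Global_Sales"].sum().sort_values(ascending=False)
)

print(best_region, top_games)
print(older_games[["Name", "Year", "Global_Sales"]])
print(genre_totals)
```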

3) Heart Disease Analysis

Data scientists are springing up in the healthcare sector every day as data drives the future of healthcare. This dataset helps to prepare you for the same. Cardiovascular diseases (CVDs) are the leading cause of death globally. An estimated 17.9 million people died from CVDs in 2019, representing 32% of all global deaths. Therefore, we chose the heart disease dataset to understand how to analyze a dataset in this domain, while also learning different factors that might lead to heart disease. If you are someone interested in the healthcare domain, then this is the perfect dataset to get started.  

Get the data here.

  • Understanding the dataset

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them.

The following are the first two rows of the dataset.

age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1

Total Features: 14
Target Variable: target
Total no. of observations: 303

The columns give information on the following:

  1. age: age
  2. sex: sex
  3. cp: chest pain type (4 values)
  4. trestbps: resting blood pressure
  5. chol: serum cholesterol in mg/dl
  6. fbs: fasting blood sugar > 120 mg/dl
  7. restecg: resting electrocardiographic results (values 0,1,2)
  8. thalach: maximum heart rate achieved
  9. exang: exercise induced angina. Stable angina is usually triggered by physical activity. When you climb stairs, exercise or walk, your heart demands more blood, but narrowed arteries slow down blood flow
  10. oldpeak: oldpeak = ST depression induced by exercise relative to rest
  11. slope: the slope of the peak exercise ST segment. The ST/heart rate slope (ST/HR slope) has been proposed as a more accurate ECG criterion for diagnosing significant coronary artery disease (CAD)
  12. ca: number of major vessels (0-3) colored by fluoroscopy
  13. thal: Thalassemia is an inherited blood disorder characterised by less oxygen-carrying protein (haemoglobin) and fewer red blood cells in the body than normal. 3 = normal; 6 = fixed defect; 7 = reversible defect
  14. target: Heart disease (0 = no, 1 = yes)

These attributes can give us some useful insights into the dataset.

  • What questions to ask?

It's important that we get the most information out of the attributes available to us and see how each of them affects the patient.

  1. What is the percentage of patients who have heart disease?
  2. What is the ratio of male to female patients who have heart disease?
  3. Does age play a role in heart disease? Around what average age do heart disease cases among the patients spike?
  4. When there is too much cholesterol in your blood, it builds up in the walls of your arteries, which can cause heart disease. Can you back this up with the data? What is the average cholesterol level of patients with heart disease?
  5. Compare the following factors of patients with and without heart diseases:
  • Resting blood pressure
  • Fasting blood sugar level
  • Max. heart rate achieved
  • Exercise induced angina
  • No. of major vessels
  • Thalassemia

We have covered the key attributes. The results are crucial for predicting a patient's health, which can enable physicians to take preventive action against future heart disease and to validate their assessments against the data.
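A minimal sketch of questions 1, 2 and 4 follows. The first two rows reuse age, sex, chol and target from the sample above, and the last two rows are made up so that both target classes appear. The meaning of the sex codes is whatever the source file defines, so the sketch only counts them.

```python
import pandas as pd

# First two rows come from the sample in the post; the last two are made up.
heart = pd.DataFrame({
    "age": [63, 37, 50, 58],
    "sex": [1, 1, 0, 1],
    "chol": [233, 250, 200, 280],
    "target": [1, 1, 0, 0],
})

# Question 1: target is coded 0/1, so its mean is the share with heart disease.
pct_with_disease = heart["target"].mean() * 100

# Question 2: number of patients with heart disease for each sex code.
disease_by_sex = heart.loc[heart["target"] == 1, "sex"].value_counts()

# Questions 3 to 5: compare average age and cholesterol with and without disease.
mean_by_target = heart.groupby("target")[["age", "chol"]].mean()

print(f"{pct_with_disease:.1f}% of patients have heart disease")
print(disease_by_sex)
print(mean_by_target)
```

The same `groupby("target")` pattern extends to trestbps, thalach, exang, ca and thal by adding those columns to the selection.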

4) Loan Prediction Analysis

Loan Prediction is a system that helps loan officers identify customers who are eligible for a loan. Through this dataset we want to automate the process of identifying eligible customers. This is a great dataset for anyone who wants to get familiar with the banking and financial domain and take a first step into it.

Get the data here.

Since our focus for this post is on learning to analyze the dataset and not on building a model, we will use the train.csv file.

  • Understanding the dataset

We have some important attributes that might help us identify potential customers eligible for a loan. The following are the first 2 rows of the dataset:

Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
0 LP001002 Male No 0 Graduate No 5849 0.0 NaN 360.0 1.0 Urban Y
1 LP001003 Male Yes 1 Graduate No 4583 1508.0 128.0 360.0 1.0 Rural N

Total Features: 13
Target Variable: Loan_Status
Total no. of observations: 614 (Train.csv)

  • What questions to ask?

This dataset has some important attributes and we can only master our data skills by first learning to ask the right questions to understand the data.

  1. How are the applicant's and co-applicant's incomes correlated with the loan status?
  2. Intuitively, applicants requesting a higher loan amount should have a lower chance of getting the loan. Does this dataset back our assumption?
  3. What is the effect of an applicant's number of dependents on the loan status?
  4. Does a higher level of education always mean a greater chance of getting the loan? Can we back this up using the data?
  5. A credit history is a record of a borrower's responsible repayment of debts. Can we analyze the correlation between the credit history and loan status using our dataset?
  6. How is property area affecting the loan status?

These are some areas we can focus on to look for potential applicants that are eligible to get a loan. We can brainstorm more factors as well, like EMI, loan term and more and can create them using pre-existing factors in our dataset.
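The sketch below illustrates questions 5 and 6 and one possible derived feature. All rows are made-up toy values, and only the column names match the dataset. The `Installment` column is a hypothetical, interest-free stand-in for an EMI (loan amount divided by term), which is our simplifying assumption rather than a formula from the post.

```python
import pandas as pd

# Toy rows with made-up values; column names match the dataset.
loans = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban", "Rural", "Urban"],
    "LoanAmount": [120.0, 128.0, 66.0, 150.0, 200.0, 90.0],
    "Loan_Amount_Term": [360.0, 360.0, 360.0, 180.0, 360.0, 360.0],
    "Loan_Status": ["Y", "N", "N", "Y", "N", "Y"],
})

# Question 5: share of approved (Y) and rejected (N) loans per credit history.
approval_by_credit = pd.crosstab(
    loans["Credit_History"], loans["Loan_Status"], normalize="index"
)

# Question 6: approval rate per property area.
approval_by_area = (
    loans.assign(approved=loans["Loan_Status"].eq("Y"))
    .groupby("Property_Area")["approved"]
    .mean()
)

# Derived feature (assumption): a rough per-period installment, ignoring interest.
loans["Installment"] = loans["LoanAmount"] / loans["Loan_Amount_Term"]

print(approval_by_credit)
print(approval_by_area)
print(loans[["LoanAmount", "Loan_Amount_Term", "Installment"]])
```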

5) Online Auctions Analysis

Are you interested in understanding how the auction industry works and in analyzing it? Then this dataset could be for you.

An auction is usually a process of buying and selling goods or services by offering them up for bid, taking bids, and then selling the item to the highest bidder or buying the item from the lowest bidder. Remember, a bid is a binding contract. This dataset is about analyzing auction trends and the attributes to be aware of as a buyer.

Get the data here.

  • Understanding the dataset

The datasets are from a companion website for the book Modeling Online Auctions, by Wolfgang Jank and Galit Shmueli. The datasets contain eBay auction information on Cartier wristwatches, Palm Pilot M515 PDAs, Xbox game consoles, and Swarovski beads. We have two files, auction.csv and swarovski.csv. We will be discussing the former here.

Let's see what the first two rows of the dataset look like.

auctionid bid bidtime bidder bidderrate openbid price item auction_type
0 1638893549 175.0 2.230949 schadenfreud 0.0 99.0 177.5 Cartier wristwatch 3 day auction
1 1638893549 100.0 2.600116 chuik 0.0 99.0 177.5 Cartier wristwatch 3 day auction

Total Features: 9
Total no. of observations: 10681 (auction.csv)

  • What questions to ask?

We can approach this dataset in two different ways, depending on the purpose of the analysis. Questions can be asked from a buyer's perspective as well as from a seller's perspective.

  1. Are there any auctions that start with the smallest opening bid? What are they?
  2. What types of auctions start with the highest opening bid?
  3. What is the most common type of auction?
  4. Are there any auctions that are more profitable than others?
  5. Should an auction be started with a higher or lower opening bid to attract more bidders?

The above are some of the questions that one can be interested in as a buyer or seller. We can always brainstorm more innovative questions depending on the types of attributes available to us.
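Because the file has one row per bid, most of these questions first need a one-row-per-auction summary. The sketch below shows one way to build it. The first two rows are the sample rows from the post, and the remaining rows (auction ids 1111111111 and 2222222222, bidders `bidder_a` to `bidder_c`) are hypothetical.

```python
import pandas as pd

# Rows 0-1 are the sample rows from the post; the rest are made up.
bids = pd.DataFrame({
    "auctionid": [1638893549, 1638893549, 1111111111,
                  1111111111, 1111111111, 2222222222],
    "bid": [175.0, 100.0, 5.0, 12.0, 20.0, 50.0],
    "bidder": ["schadenfreud", "chuik", "bidder_a",
               "bidder_b", "bidder_a", "bidder_c"],
    "openbid": [99.0, 99.0, 1.0, 1.0, 1.0, 45.0],
    "price": [177.5, 177.5, 20.0, 20.0, 20.0, 50.0],
    "auction_type": ["3 day auction", "3 day auction", "7 day auction",
                     "7 day auction", "7 day auction", "7 day auction"],
})

# Collapse the bid-level rows into one row per auction.
auctions = bids.groupby("auctionid").agg(
    auction_type=("auction_type", "first"),
    openbid=("openbid", "first"),
    price=("price", "first"),
    n_bids=("bid", "count"),
    n_bidders=("bidder", "nunique"),
)

# Question 1: auction with the smallest opening bid.
lowest_open_auction = auctions["openbid"].idxmin()

# Question 3: most common auction type.
most_common_type = auctions["auction_type"].value_counts().idxmax()

# Question 4: final price relative to the opening bid, as a rough profitability proxy.
price_ratio = auctions["price"] / auctions["openbid"]

# Question 5: do lower opening bids go along with more bids?
openbid_vs_bids = auctions["openbid"].corr(auctions["n_bids"])

print(auctions)
print(lowest_open_auction, most_common_type, round(openbid_vs_bids, 2))
```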

6) Customer Churn Analysis

Businesses are always looking for ways to improve their customer acquisition and lifetime value. Churn rate is one of the most powerful customer analytics metrics for this purpose. It measures how many existing customers are leaving your product or service. By measuring churn rate, you can dig into why customers left and take action to improve the customer experience.

A data scientist will help teams measure the rate at which customers quit using churn analytics. They’ll also use machine learning to recommend product improvements. Plus, they’ll act as the liaison between product and engineering to ensure a great customer experience.

If you want to work in this domain, then this might be the right dataset for you.

Get the data here.

  • Understanding the dataset

We are using the Telecommunication Customer Churn dataset. Each row in the dataset represents a customer, and each column contains a different attribute.

Let's have a look at the first two rows of the dataset.

customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.5 No

Total Features: 21
Target Variable: Churn
Total no. of observations: 7043

The dataset includes information about customers who left within the last month, services each customer signed up for, customer account information and demographic information about customers.

  • What questions to ask?

Data science can help you ask the right questions to get to the bottom of what factors are affecting the churn rate in your customers. Looking at the dataset, we want to know the relation of the following with the churn rate:

  1. What is the average tenure of customers who stayed and of those who left the company?
  2. Which services are preferred by customers who stayed compared to those who left the company?
  3. Does a larger number of senior citizens increase the churn rate? What proportion of the customers that left are senior citizens?
  4. Which contract types are customers who stayed and customers who left enrolled in?
  5. What are the total and monthly charges of customers that left?

The answers to these questions are crucial to our analysis and can help us identify problem areas to work on and improve customer experience in the company.

Similar questions can be asked to analyze customer churn data for any company.
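Here is a small sketch of questions 1, 3, 4 and 5 on toy data. The first two rows follow the sample above and the other four are made up. The sketch assumes `TotalCharges` may arrive as text with blank entries, so it converts the column with `pd.to_numeric` before averaging; check the dtype in your own copy of the file.

```python
import pandas as pd

# First two rows follow the sample in the post; the remaining rows are made up.
churn = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 60],
    "SeniorCitizen": [0, 0, 1, 0, 1, 0],
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "One year", "Month-to-month", "Two year"],
    "MonthlyCharges": [29.85, 56.95, 70.0, 42.0, 99.0, 80.0],
    "TotalCharges": ["29.85", "1889.5", "140.0", " ", "800.0", "4800.0"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No"],
})

# Blank strings become NaN instead of raising an error.
churn["TotalCharges"] = pd.to_numeric(churn["TotalCharges"], errors="coerce")

left = churn["Churn"] == "Yes"

# Question 1: average tenure of customers who stayed vs. left.
tenure_by_churn = churn.groupby("Churn")["tenure"].mean()

# Question 3: share of senior citizens among customers who left.
senior_share_among_left = churn.loc[left, "SeniorCitizen"].mean()

# Question 4: contract type counts split by churn status.
contract_by_churn = pd.crosstab(churn["Contract"], churn["Churn"])

# Question 5: average monthly and total charges of customers who left.
charges_of_left = churn.loc[left, ["MonthlyCharges", "TotalCharges"]].mean()

print(tenure_by_churn)
print(senior_share_among_left)
print(contract_by_churn)
print(charges_of_left)
```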

7) Brazilian E-commerce Public Dataset Analysis

Looking at shopping patterns on any platform allows us to make data-driven decisions. By analyzing this dataset with our favorite programming language, we can explore what kinds of spikes occur and when during the year. This is a must-have dataset for any data scientist analyzing e-commerce data.

Get the data here.

  • Understanding the dataset

This end-to-end dataset contains orders made at Olist from 2016 to 2018 across multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. We are using the olist_payments_dataset.csv file for our analysis.

The first two rows are given below:

order_id payment_sequential payment_type payment_installments payment_value
0 b81ef226f3fe1789b1e8b2acac839d17 1 credit_card 8 99.33
1 a9810da82917af2d9aefd1278f1dcfa0 1 credit_card 1 24.39

Total Features: 5
Target Variable: payment_value
Total no. of observations: 103886

The dataset contains the payment information about all the orders like the order ID, payment type, payment value and more.

  • What questions to ask?
  1. How many installments do customers usually choose to pay in?
  2. What is the most preferred way of payment for customers?
  3. What is the correlation between the installments and payment value?
  4. What are the average, minimum and maximum payments made by customers?

These questions are a good starting point for analyzing this dataset. They help us extract useful information from different columns and better understand customer purchasing patterns.
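All four questions map onto one-line pandas calls, as the sketch below suggests. The first two rows are the sample rows from the post (without the order ids), and the last two rows, including the `boleto` payment type value, are made up for illustration.

```python
import pandas as pd

# Rows 0-1 follow the sample in the post; rows 2-3 are made up.
payments = pd.DataFrame({
    "payment_type": ["credit_card", "credit_card", "boleto", "credit_card"],
    "payment_installments": [8, 1, 1, 1],
    "payment_value": [99.33, 24.39, 50.00, 10.00],
})

# Question 1: the most frequently chosen number of installments.
usual_installments = payments["payment_installments"].mode().iloc[0]

# Question 2: the most common payment type.
preferred_type = payments["payment_type"].value_counts().idxmax()

# Question 3: correlation between installments and payment value.
installment_value_corr = payments["payment_installments"].corr(
    payments["payment_value"]
)

# Question 4: average, minimum and maximum payment.
payment_summary = payments["payment_value"].agg(["mean", "min", "max"])

print(usual_installments, preferred_type)
print(round(installment_value_corr, 2))
print(payment_summary)
```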

You can explore the different files given in the dataset and come up with more innovative questions to extract the most information from the given attributes.

8) Fraud Detection Analysis

What is fraud detection analysis? Whether you are a fraud analyst, statistician, data scientist, risk manager, someone in the financial services industry, or simply curious, this is the dataset for you.

Fraud involving cell phones, insurance claims, tax return claims, credit card transactions, government procurement and so on represents a significant problem for governments and businesses, and specialized analysis techniques are required to discover it.

We will look at credit card fraud detection.

Get the data here.

  • Understanding the dataset

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

Let's have a look at the dataset.

Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 -0.551600 -0.617801 -0.991390 -0.311169 1.468177 -0.470401 0.207971 0.025791 0.403993 0.251412 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 1.612727 1.065235 0.489095 -0.143772 0.635558 0.463917 -0.114805 -0.183361 -0.145783 -0.069083 -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0

Total Features: 31
Target Variable: Class
Total no. of observations: 284807

Features V1, V2, … V28 are the principal components obtained with PCA. The only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount, which can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.

  • What questions to ask?

Every fraud detection system is different, but data scientist jobs in this area all focus on digging deep into a database to determine whether there is a fraudulent pattern of behavior. But before we can do any of this, it's important to ask the right questions.

  1. What is the proportion of fraud to normal transactions in the dataset?
  2. What's the relation between the time of transaction and the amount?
  3. What are the highest correlated columns?
  4. What is the average time of transaction?
  5. What is the average amount of transaction?

Using the above questions, we can find insightful answers that will help us analyze the dataset better and come up with useful conclusions.
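The sketch below first reproduces the class imbalance figure from the counts quoted earlier (492 frauds out of 284,807 transactions), then shows how questions 1, 3 and 5 might look in pandas. The toy frame keeps only Time, Amount and Class. Its first two rows follow the sample above and the other three are made up, including the single fraud row.

```python
import pandas as pd

# Imbalance computed from the counts stated in the post.
reported_fraud_pct = 492 / 284_807 * 100
print(f"Fraud share from the reported counts: {reported_fraud_pct:.4f}%")

# Rows 0-1 follow the sample in the post; rows 2-4 are made up.
tx = pd.DataFrame({
    "Time": [0.0, 0.0, 1.0, 2.0, 10.0],
    "Amount": [149.62, 2.69, 50.00, 20.00, 1.00],
    "Class": [0, 0, 0, 0, 1],
})

# Question 1: proportion of normal (0) and fraudulent (1) transactions.
class_share = tx["Class"].value_counts(normalize=True)

# Question 3: columns ranked by absolute correlation with the Class label.
corr_with_class = (
    tx.corr()["Class"].drop("Class").abs().sort_values(ascending=False)
)

# Questions 4 and 5: average time and amount per class.
mean_by_class = tx.groupby("Class")[["Time", "Amount"]].mean()

print(class_share)
print(corr_with_class)
print(mean_by_class)
```

With a positive class this rare, plain accuracy is a poor yardstick, so it is worth checking `class_share` before drawing conclusions from any summary statistic.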

9) FIFA 19 Dataset

It is the time of the year when everyone lays out their enthusiasm for football during the UEFA European Football Championship, and today you can join in on the fun! If you are a football enthusiast and want to analyze a dataset on the same, then you have come to the right place. Here, we are going to give you an insight into the Data Scientist roles in Football Analytics.

Get the data here.

  • Understanding the dataset

The dataset contains detailed attributes for every player registered in the latest edition of the FIFA 19 database.

Let's look at the dataset to understand it better.

Unnamed: 0 ID Name Age Photo Nationality Flag Overall Potential Club Club Logo Value Wage Special Preferred Foot International Reputation Weak Foot Skill Moves Work Rate Body Type Real Face Position Jersey Number Joined Loaned From Contract Valid Until Height Weight LS ST RS LW LF CF RF RW LAM CAM RAM LM LCM CM RCM RM LWB LDM CDM RDM RWB LB LCB CB RCB RB Crossing Finishing HeadingAccuracy ShortPassing Volleys Dribbling Curve FKAccuracy LongPassing BallControl Acceleration SprintSpeed Agility Reactions Balance ShotPower Jumping Stamina Strength LongShots Aggression Interceptions Positioning Vision Penalties Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Release Clause
0 0 158023 L. Messi 31 Argentina 94 94 FC Barcelona €110.5M €565K 2202 Left 5.0 4.0 4.0 Medium/ Medium Messi Yes RF 10.0 Jul 1, 2004 NaN 2021 5'7 159lbs 88+2 88+2 88+2 92+2 93+2 93+2 93+2 92+2 93+2 93+2 93+2 91+2 84+2 84+2 84+2 91+2 64+2 61+2 61+2 61+2 64+2 59+2 47+2 47+2 47+2 59+2 84.0 95.0 70.0 90.0 86.0 97.0 93.0 94.0 87.0 96.0 91.0 86.0 91.0 95.0 95.0 85.0 68.0 72.0 59.0 94.0 48.0 22.0 94.0 94.0 75.0 96.0 33.0 28.0 26.0 6.0 11.0 15.0 14.0 8.0 €226.5M
1 1 20801 Cristiano Ronaldo 33 Portugal 94 94 Juventus €77M €405K 2228 Right 5.0 4.0 5.0 High/ Low C. Ronaldo Yes ST 7.0 Jul 10, 2018 NaN 2022 6'2 183lbs 91+3 91+3 91+3 89+3 90+3 90+3 90+3 89+3 88+3 88+3 88+3 88+3 81+3 81+3 81+3 88+3 65+3 61+3 61+3 61+3 65+3 61+3 53+3 53+3 53+3 61+3 84.0 94.0 89.0 81.0 87.0 88.0 81.0 76.0 77.0 94.0 89.0 91.0 87.0 96.0 70.0 95.0 95.0 88.0 79.0 93.0 63.0 29.0 95.0 82.0 85.0 95.0 28.0 31.0 23.0 7.0 11.0 15.0 14.0 11.0 €127.1M

Total Features: 89
Total no. of observations: 18207

The dataset contains columns like name, age, nationality and more.

  • What questions to ask?

What is the answer to football's most pressing data questions? This is the story of how data can tell you which players will score more goals, who the best goalkeepers are, quickest strikers and toughest defenders and much much more.

However, with a dataset with so many attributes it's hard to know where to begin. Let's just highlight a few important columns and ask questions around them.

  1. What is the range of age when players perform the best?
  2. Which are the top 20 clubs by player performance?
  3. Who are the top 10 players according to potential and overall rating?
  4. How are a player's height and weight correlated with overall performance?
  5. Out of all the factors (shot power, stamina, strength etc.), which is the most effective in explaining a player's overall performance?

The above questions can help us bridge the gap between the raw data and the insights it contains.
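One practical hurdle for question 4 is that Height and Weight are stored as text (for example 5'7 and 159lbs), so they need converting before any correlation can be computed. The sketch below shows one possible conversion, along with questions 2 and 3. The first two rows follow the sample above. The third player is hypothetical, and `height_in` and `weight_lb` are helper column names of our own.

```python
import pandas as pd

# Rows 0-1 follow the sample in the post; the third player is made up.
players = pd.DataFrame({
    "Name": ["L. Messi", "Cristiano Ronaldo", "Hypothetical Player"],
    "Age": [31, 33, 22],
    "Overall": [94, 94, 70],
    "Potential": [94, 94, 85],
    "Club": ["FC Barcelona", "Juventus", "FC Barcelona"],
    "Height": ["5'7", "6'2", "5'11"],
    "Weight": ["159lbs", "183lbs", "165lbs"],
})

# Convert "feet'inches" strings to total inches.
feet_inches = players["Height"].str.split("'", expand=True).astype(int)
players["height_in"] = feet_inches[0] * 12 + feet_inches[1]

# Strip the "lbs" suffix and convert to a number.
players["weight_lb"] = (
    players["Weight"].str.replace("lbs", "", regex=False).astype(int)
)

# Question 2: clubs ranked by the mean Overall rating of their players.
top_clubs = players.groupby("Club")["Overall"].mean().nlargest(20)

# Question 3: players ranked by Overall, then Potential.
top_players = players.sort_values(
    ["Overall", "Potential"], ascending=False
).head(10)

# Question 4: correlation of height and weight with Overall.
body_corr = players[["height_in", "weight_lb", "Overall"]].corr()["Overall"]

print(top_clubs)
print(top_players[["Name", "Overall", "Potential"]])
print(body_corr)
```

The real file also has missing values in these columns, so on the full dataset a `dropna` on Height and Weight (or a nullable conversion) would be needed before `astype(int)`.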

10) Startup Investment Crunchbase Analysis

Want to analyze the various factors behind a startup? Then this is the dataset that will give you some exposure to that.

The startup investment dataset, provided by Crunchbase, lets readers explore various numbers and metrics for startups. A data scientist will not only help find key metrics in any startup's data but will also analyze big data using various algorithms, giving the startup an edge over other players by analyzing and predicting trends in the dataset.

Get the data here.

  • Understanding the dataset

The dataset has various features giving information like the name of the startup, homepage URL, total funding received, and a lot more. However, for this particular post, our focus will be on a few very specific areas only.

The dataset looks like this:

permalink name homepage_url category_list market funding_total_usd status country_code state_code region city funding_rounds founded_at founded_month founded_quarter founded_year first_funding_at last_funding_at seed venture equity_crowdfunding undisclosed convertible_note debt_financing angel grant private_equity post_ipo_equity post_ipo_debt secondary_market product_crowdfunding round_A round_B round_C round_D round_E round_F round_G round_H
0 /organization/waywire #waywire |Entertainment|Politics|Social Media|News| News 17,50,000 acquired USA NY New York City New York 1.0 2012-06-01 2012-06 2012-Q2 2012.0 2012-06-30 2012-06-30 1750000.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 /organization/tv-communications &TV Communications |Games| Games 40,00,000 operating USA CA Los Angeles Los Angeles 2.0 NaN NaN NaN NaN 2010-06-04 2010-09-23 0.0 4000000.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Total Features: 39
Total no. of observations: 54294

  • What questions to ask?

Everyone wants to know which factors affect startups the most. To identify those attributes, we frame questions that bring out the most useful results.

  1. What is the relation of different startups with different sectors?
  2. What is the relation of different startups with funding received?
  3. What is the relation of different startups with status (operating, closed, etc.)?
  4. What are the regions in the USA with the most startups?
  5. What are the regions in India with the most startups?
  6. What are the top countries with the most startups?
  7. What is the year of founding for startups?
  8. Identify the startups that:
    a. got funding in less than 1 year
    b. got funded after more than 20 years
    c. got funded before they were founded

Analyzing the above factors will help us understand how the startups that do better than others differ.
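Question 8 comes down to date arithmetic between `founded_at` and `first_funding_at`. The sketch below is one way to approach it, together with questions 3 and 6. The first two rows follow the sample above, while the two "Hypothetical" companies and their dates are made up to trigger cases b and c. Using 365.25 days per year is our approximation.

```python
import pandas as pd

# Rows 0-1 follow the sample in the post; rows 2-3 are made up.
startups = pd.DataFrame({
    "name": ["#waywire", "&TV Communications",
             "Hypothetical Old Co", "Hypothetical Early Co"],
    "country_code": ["USA", "USA", "IND", "USA"],
    "status": ["acquired", "operating", "operating", "closed"],
    "founded_at": ["2012-06-01", None, "1985-01-01", "2014-01-01"],
    "first_funding_at": ["2012-06-30", "2010-06-04",
                         "2010-01-01", "2013-06-01"],
})

# Parse the dates; missing or unparseable entries become NaT.
for col in ["founded_at", "first_funding_at"]:
    startups[col] = pd.to_datetime(startups[col], errors="coerce")

# Years between founding and first funding (NaN where a date is missing).
years_to_funding = (
    startups["first_funding_at"] - startups["founded_at"]
).dt.days / 365.25

# Question 8a-c: rows with a missing date fall into none of the groups.
funded_within_year = startups.loc[
    years_to_funding.between(0, 1), "name"
].tolist()
funded_after_20_years = startups.loc[years_to_funding > 20, "name"].tolist()
funded_before_founding = startups.loc[years_to_funding < 0, "name"].tolist()

# Questions 3 and 6: startups per status and per country.
status_counts = startups["status"].value_counts()
country_counts = startups["country_code"].value_counts()

print(funded_within_year, funded_after_20_years, funded_before_founding)
print(status_counts)
print(country_counts)
```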

You can also check out our tutorial on the same dataset here, and learn how to answer each one of the above questions.


We have looked at 10 different datasets and how they can help us hone our skills. We have seen that there are challenges in all domains and that we need to be prepared for them. Along the way, we also discussed what it feels like to be a data scientist in different domains.

We are sure that you can now find the dataset that suits your work best among the given domains.

Remember that you always benefit from a little work, so don't be afraid to jump into the unknown!

If you are looking for jobs in AI and DS check out Deep Learning Careers