# Top Data Science Interview Questions for Freshers

This post covers:

• Resources to clarify Data Science career-oriented questions.

• Top Data Science fresher interview questions and their answers.

• Various learning resources linked to the respective questions.

## Introduction

Data Science is one of the **fastest-growing fields** today, and with its growth, more and more individuals want to be a part of it. From small startups to multinational companies, every company is now **leveraging its data to make important decisions.**

Even though Data Science has become a technology buzzword, people still find it a little mystifying. Questions such as **"Who can become a Data Scientist?", "What skills does one require to become a Data Scientist?", "How does one build a Data Science portfolio?"** come to mind all the time. Check out our post that clarifies some of these questions:

Let's assume that you have the basic requirements to be a Data Scientist, you applied for various fresher jobs and internships, and you got a call back for an interview (**Congratulations!**). Now what?

Don't worry, we are here for you. This post will briefly guide you through **the Data Science questions you are most likely to face in a fresher interview.**

Let's begin!

### How does data cleaning play a vital role in any Data Science workflow?

• Cleaning data from multiple sources helps to transform it into a **format that data analysts or data scientists can work with.**

• Data Cleaning helps to **increase the accuracy of the model** in machine learning.

• Cleaning data can take up to **80% of the time** in an analysis task, making it a critical part of the workflow.
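The points above can be sketched with a tiny, made-up dataset in pandas (the column names and values here are purely illustrative):

```python
import pandas as pd

# A small, messy dataset (hypothetical): inconsistent strings, wrong dtypes, duplicates
df = pd.DataFrame({
    "city": [" Delhi", "delhi", "Mumbai ", "Mumbai "],
    "price": ["100", "100", "250", "250"],
})

df["city"] = df["city"].str.strip().str.title()   # standardize the text values
df["price"] = pd.to_numeric(df["price"])          # fix the data type
df = df.drop_duplicates().reset_index(drop=True)  # drop exact duplicates

print(df)  # 2 clean rows: Delhi/100 and Mumbai/250
```

After these three lines, all four raw rows collapse to two consistent, correctly typed records that an analyst can actually work with.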

### How can outliers be treated in any dataset?

• Outlier values can be identified using **univariate or any other graphical analysis method.**

• If there are only a few outliers, they can be assessed individually; for a large number of outliers, the values can be substituted with **either the 99th or the 1st percentile values.**

• Not all extreme values are outlier values.

• The most common ways to treat outlier values are first, to **change the value** and bring it within a range, and second, to **exclude the values.**
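A minimal sketch of the percentile substitution mentioned above, using NumPy on synthetic data (the injected extreme values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(50, 5, 1000)            # synthetic data, roughly in 35..65
data[:5] = [500, -400, 490, 510, -450]    # inject a few extreme outliers

# Substitute values beyond the 1st/99th percentiles with those percentile values
low, high = np.percentile(data, [1, 99])
capped = np.clip(data, low, high)
```

`np.clip` brings every extreme value back within the chosen range while leaving the bulk of the data untouched.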

### How do you treat missing values in any dataset?

• If there are **patterns in the missing values**, observing those patterns might lead to some **meaningful insights**. For example: in a survey about housing prices, some individuals in a certain locality have left a question blank. That can lead us to some conclusions about that locality.

• If there are no patterns identified, then the missing values can be substituted with **mean or median values (imputation),** or they can simply be ignored.

• Alternatively, a **default value such as the mean, minimum, or maximum** can be assigned to the missing entries, depending on the context.
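These options can be sketched in a few lines of pandas (the series values below are made up):

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, 12.0, np.nan, 11.0, np.nan, 13.0])

mean_filled   = s.fillna(s.mean())    # mean imputation (mean of observed values = 11.5)
median_filled = s.fillna(s.median())  # median imputation
dropped       = s.dropna()            # or simply ignore the missing rows
```

Which strategy is right depends on the data: imputation keeps the sample size, while dropping rows avoids inventing values.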

### What is the Normal Distribution?

Data is usually distributed in different ways: it can be **biased to the left or to the right**, or it can be all jumbled up. However, data that is distributed around a central value **without any bias to the left or right** follows a normal distribution, taking the form of a **bell-shaped curve.**

So, a set of random variables is said to follow a normal distribution when **most of the values cluster around the mean**, forming a bell-shaped distribution.
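This clustering around the mean can be checked empirically. In the NumPy sketch below (with arbitrarily chosen mean 100 and standard deviation 15), roughly 68% of the values fall within one standard deviation of the mean, as the 68–95–99.7 rule predicts:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=100_000)   # mean 100, std 15

# Fraction of values within one standard deviation of the mean
within_1sd = np.mean(np.abs(sample - 100) < 15)
print(within_1sd)   # close to 0.68
```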

#### What is data normalization and why do we need it?

Data normalization is a very important preprocessing step, used to **rescale values to fit in a specific range and assure better convergence during backpropagation.** In general, it boils down to **subtracting the mean of each feature and dividing by its standard deviation.** If we don't do this, some of the features (those with high magnitude) will be weighted more in the cost function: if a higher-magnitude feature changes by 1%, that change is pretty big, but for smaller features, it's quite insignificant. Normalization makes all features **weighted equally.**
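A minimal sketch of this standardization (z-scoring) with NumPy, using a made-up two-feature matrix:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])   # two features on very different scales

# Subtract each feature's mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~0 for each feature
print(X_std.std(axis=0))   # 1 for each feature
```

After standardization both columns have mean 0 and standard deviation 1, so neither dominates the cost function.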

### What is the goal of A/B testing?

• It is a statistical hypothesis testing for randomized experiments with two variables A and B.

• The goal of A/B Testing is to identify changes that **improve the outcome of interest.**

• An example of this could be improving the click-through rate for a banner ad.

#### What does the p-value signify about the statistical data?

The p-value is used to determine the **significance of results after a hypothesis test in statistics.** It helps the reader draw conclusions, and it always lies between 0 and 1.

• P-Value > 0.05 denotes weak evidence against the null hypothesis, which means that the null hypothesis cannot be rejected.

• P-value <= 0.05 denotes strong evidence against the null hypothesis, which means that the null hypothesis can be rejected.

• P-value = 0.05 is the marginal value, indicating that it is possible to go either way.
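Tying the two ideas together, the banner-ad example can be sketched as a chi-squared test on a 2x2 table of click counts using SciPy (all the counts below are hypothetical):

```python
from scipy.stats import chi2_contingency

# Hypothetical A/B results:  clicks, no-clicks
table = [[200, 9800],   # variant A: 2.0% click-through rate
         [260, 9740]]   # variant B: 2.6% click-through rate

chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value < 0.05)   # the difference in CTR is significant at the 0.05 level
```

Since the p-value here falls below 0.05, we would reject the null hypothesis that the two variants have the same click-through rate.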

### Name a few libraries in Python used for Data Analysis.

• **NumPy**: It is used for **scientific computing** and **performing basic and advanced array operations**. It offers many handy features for performing operations on n-dimensional arrays and matrices. It helps to process arrays that store values of the same data type and makes performing **math operations** on arrays (and their **vectorization**) easier.

• **SciPy**: It is a useful library that includes modules for **linear algebra, integration, optimization, and statistics**. Its main functionality is built upon NumPy, so its arrays make use of that library.

• **Pandas**: It is a library created to help developers work with **"labeled"** and **"relational"** data intuitively. It’s based on two main data structures: **"Series"** (one-dimensional, like a list of items) and **"DataFrames"** (two-dimensional, like a table with multiple columns).

Learn more about Pandas from here:

• **Scikit-learn**: The "scikits" are a group of packages in the **SciPy** stack created for **specific functionalities** – for example, scikit-image for **image processing**. Scikit-learn uses the **math operations of SciPy** to expose a concise interface to the most common **machine learning algorithms.**

• **Matplotlib**: This is a standard data science library that helps to **generate data visualizations** such as **two-dimensional diagrams and graphs** (histograms, scatterplots, non-Cartesian coordinate graphs).

• **Seaborn**: Seaborn is based on **Matplotlib** and serves as a useful Python machine learning tool for **visualizing statistical models** – heatmaps and other types of visualizations that summarize data and **depict the overall distributions.**

• **Plotly**: This web-based tool for **data visualizations** offers many useful out-of-box graphics. The library works very well in interactive web applications.

### What is the difference between Supervised Learning and Unsupervised Learning?

If an algorithm **learns something from the training data** so that the knowledge can be **applied to the test data**, then it is referred to as Supervised Learning. **Classification** is an example of Supervised Learning.

If the algorithm **does not learn anything beforehand** because there is no response variable or any training data, then it is referred to as unsupervised learning. **Clustering** is an example of unsupervised learning.
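The contrast can be sketched compactly with scikit-learn on synthetic data: the classifier is given the labels `y` during training, while the clustering algorithm sees only the features `X`:

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# Supervised: the labels y are used during training (classification)
clf = KNeighborsClassifier().fit(X, y)

# Unsupervised: only X is used; the algorithm finds structure on its own (clustering)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
```

Both models end up grouping the same points, but only the supervised one can name the groups it was taught.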

### What are the types of Supervised Machine learning? Explain and provide one example each?

Supervised learning problems can be divided into **regression** and **classification** problems.

• **Classification**: In a classification problem, the output variable is a category, such as “red” or “blue,” “disease” or “no disease,” “true” or “false,” etc.

• **Regression**: In a regression problem, the output variable is a real continuous value, such as “dollars” or “weight.”

### What is the functionality of Linear Regression and how does it work?

Linear regression allows us to quantify the relationship between a **particular variable and the outcome** we care about, while controlling for other factors. The main goal of regression analysis is to find the **"best fit" line for the linear relationship between the two variables.**

Regression analysis uses the method of **Ordinary Least Squares**, also known as OLS. As the name suggests, OLS **minimizes the sum of squared residuals**: it squares the residuals first and then adds them up, which gives greater weight to the values that lie far from the fitted line – the '**outliers**'.
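A quick sketch of OLS on synthetic data (the true slope 3 and intercept 5 are chosen arbitrarily); `np.polyfit` computes the least-squares line:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 5.0 + rng.normal(0, 1, 100)   # true slope 3, intercept 5, plus noise

# OLS picks the slope/intercept that minimize the sum of squared residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

print(slope, intercept)   # close to 3 and 5
```

Because the noise is small, the fitted coefficients land very close to the true values used to generate the data.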

### What are the assumptions of Linear Regression (if any)?

The main assumptions of Linear Regression are:

• The true underlying relationship between the **dependent and independent variables is linear** in nature.

• Residuals are **statistically independent.**

• The variance of the **residuals is constant** (homoscedasticity).

• Residuals are **normally distributed.**

If any of these assumptions are violated, then any forecasts or confidence intervals may be **misleading or biased**, and the linear regression is likely to perform poorly out of sample as a result.

### What do you understand by Logistic Regression? State a real-life application of the algorithm.

Logistic Regression is a **classification algorithm** used to predict a **binary outcome** for a given **set of independent variables.**

The output of logistic regression is a **probability between 0 and 1**, which is converted to a class label using a threshold, generally **0.5**: any value **above 0.5** is considered **1**, and any value **below 0.5** is considered **0**.

Logistic regression can be used to predict **customer churn.** Check our tutorial on analyzing and finding meaningful insights from a customer churn dataset:
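A toy sketch of churn prediction with scikit-learn; the features (monthly charges, tenure in months) and the six customers below are entirely made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: [monthly_charges, tenure_months] -> churned (1) or stayed (0)
X = np.array([[90, 2], [85, 3], [95, 1], [30, 40], [25, 50], [40, 36]])
y = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(X, y)

proba = model.predict_proba([[88, 2]])[0, 1]   # predicted probability of churn
label = int(proba > 0.5)                       # apply the 0.5 threshold
```

A high-charge, short-tenure customer like `[88, 2]` lands near the churned group, so the thresholded probability yields the label 1.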

### What is the difference between a test set and a validation set?

The validation set can be considered a **part of the training set**, as it is used for **parameter selection and to avoid overfitting** of the model being built. On the other hand, a test set is used for **testing or evaluating the performance** of a trained machine learning model.

In simple terms, the differences can be summarized as follows:

• **Training set**: used to fit the parameters, i.e., the weights.

• **Validation set**: used to tune the hyperparameters.

• **Test set**: used to assess the performance of the model, i.e., to evaluate its predictive power and generalization.
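A common way to obtain the three sets is two successive splits with scikit-learn's `train_test_split`; the 60/20/20 proportions below are just one popular choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First carve out the test set, then split the remainder into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20
```

Note that the second split uses `test_size=0.25` because 25% of the remaining 80 samples gives the desired 20 validation samples.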

### What do you understand by Sensitivity? How is it calculated?

• Sensitivity is commonly used to **validate the accuracy of a classifier** (Logistic, SVM, RF, etc.)

• Sensitivity is nothing but **"Predicted TRUE events / Total actual TRUE events"**. True events here are the events that were true and that the model also **predicted as true.**

• Calculation of sensitivity can be done as follows:

**Sensitivity = True Positives / (True Positives + False Negatives)**

Where True positives are positive events that are **correctly** classified.
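The calculation can be sketched with scikit-learn, where sensitivity is exposed as `recall_score` (the label vectors below are made up):

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]   # 5 actual positives
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]   # 4 of them predicted as positive

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)

print(sensitivity)                    # 4 / 5 = 0.8
print(recall_score(y_true, y_pred))   # same value: sensitivity == recall
```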

### What is the trade-off between Bias and Variance?

If our model is **too simple** and has **very few parameters**, then it may have **high bias and low variance.** On the other hand, if our model has a **large number of parameters**, then it's going to have **high variance and low bias.** So we need to find the right balance without **overfitting or underfitting** the data.

### What are some important features for effective Data Visualization?

The data visualization should be **light** and must **highlight essential aspects** of the data: the important variables, what is relatively important, and the **trends and changes.** Besides, data visualization must be visually appealing but **should not contain unnecessary information.**

One can answer this question in multiple ways, from technical points to key aspects, but be sure to mention these points:

• Data positioning

• Bars over circles and squares

• Use of color theory

• Reducing chart junk by avoiding 3D charts and eliminating the use of pie charts to show proportions

### What is a Scatter plot and what features might be visible in a Scatter plot?

A Scatter plot is a chart used to plot a **correlation **between two or more variables at the same time. It’s usually used for **numeric data.**

• **Correlation**: the two variables might have a relationship, for example, one might depend on another. But this is not the same as causation.

• **Associations**: the variables may be associated with one another.

• **Outliers**: there could be cases where the data in two dimensions does not follow the general pattern.

• **Clusters**: sometimes there could be groups of data that form a cluster on the plot.

• **Gaps**: some combinations of values might not exist in a particular case.

• **Barriers**: boundaries that the data does not cross.

• **Conditional relationships**: some relationships between the variables rely on a condition to be met.
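Several of these features can be seen in one Matplotlib sketch: a positive association with a few injected outliers (all data synthetic; the `Agg` backend is used only for off-screen rendering):

```python
import matplotlib
matplotlib.use("Agg")              # off-screen rendering; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.normal(0, 2, 200)  # a positive linear association
y[:3] = [60, -20, 55]              # a few points that break the general pattern

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.6)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Correlation with a few visible outliers")
fig.savefig("scatter.png")
```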

### When analyzing a histogram, what are some of the features to look for?

• Asymmetry

• Outliers

• Multimodality

• Gaps

• Heaping/Rounding: **Heaping example:** temperature data can consist of common values due to conversion from Fahrenheit to Celsius. **Rounding example:** weight data that are all multiples of 5.

• Impossibilities/Errors

### What libraries do Data Scientists use to plot data in Python?

**Matplotlib** is the main library used for **plotting data in Python.** However, the plots created with this library **need lots of fine-tuning to look polished and professional.** For that reason, many data scientists prefer **Seaborn and Plotly,** which allow you to create **appealing and meaningful plots** with only one line of code.

Check out our post to learn how to use Plotly to create amazing visualizations:

## Conclusion

We hope you found this post precise and informative, and that it serves as a **pitstop to revise all your basics before any fresher Data Science interview.**

For more such content, **consider subscribing.**

All the best!