Top Data Science Interview Questions for Freshers
This post covers:
• Resources to clarify Data Science career-oriented questions.
• Top Data Science fresher interview questions and their answers.
• Various learning resources linked to the respective questions.
Data Science is one of the fastest-growing fields today, and as it grows, more and more people want to be a part of it. From small startups to multinational companies, organizations are now leveraging their data to make important decisions.
Even though Data Science has recently become a technology buzzword, people still find it a little mystifying. Questions such as "Who can become a Data Scientist?", "What skills does one require to become a Data Scientist?", and "How does one build a Data Science portfolio?" come to mind all the time. Here, check out our post that clarifies some of these questions:
Let's assume that you meet the basic requirements to be a Data Scientist, you have applied for various fresher jobs and internships, and you got a call back for an interview (Congratulations!). Now what?
Don't worry, we are here for you. This post will briefly guide you through the Data Science questions you are most likely to face.
How does data cleaning play a vital role in any Data Science workflow?
• Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work with.
• Data Cleaning helps to increase the accuracy of the model in machine learning.
• Data cleaning can take up to 80% of the total time spent on an analysis task, which makes it a critical part of the workflow.
How can outliers be treated in any dataset?
• Outlier values can be identified using univariate or any other graphical analysis method.
• If there are only a few outliers, they can be assessed individually; if there are many, the values can be substituted with either the 99th or the 1st percentile values.
• Not all extreme values are outliers.
• The most common ways to treat outlier values are first, to change the value and bring it within a range, and second, to exclude the values.
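The percentile substitution mentioned above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the function name `cap_outliers` and the synthetic data are made up for the example, and the 1st/99th percentile bounds are just the ones named in the text.

```python
import numpy as np

def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Clip values to the range between two percentiles."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [lower_pct, upper_pct])
    # Anything below `low` becomes `low`; anything above `high` becomes `high`.
    return np.clip(values, low, high)

# Synthetic data (an assumption for this sketch):
# 1,000 ordinary readings plus two extreme values.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50.0, 5.0, 1000), [500.0, -400.0]])

capped = cap_outliers(data)
print("before:", data.min(), data.max())
print("after: ", capped.min(), capped.max())
```

Capping keeps every row in the dataset, which is why it is often preferred over exclusion when there are many outliers.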
How do you treat missing values in any dataset?
• If there are patterns in the missing values, observing those patterns might lead to some meaningful insights. For example: In a survey about housing prices, some individuals in a certain locality have left a question blank. That can lead us to some conclusions about that locality.
• If there are no patterns identified, then the missing values can be substituted with mean or median values (imputation), or they can simply be ignored.
• Another option is to assign a default value, such as the mean, minimum, or maximum of the column, to the missing entries.
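As a rough sketch of the imputation and "ignore" options above, assuming pandas and NumPy are installed. The tiny housing table and its column names are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with a few blanks (NaN).
df = pd.DataFrame({
    "price": [250.0, np.nan, 310.0, 275.0, np.nan],
    "rooms": [3, 4, np.nan, 2, 3],
})

# First, look at how many values are missing in each column.
missing_counts = df.isna().sum()
print(missing_counts)

# Option 1: impute each column with its own median.
imputed = df.fillna(df.median(numeric_only=True))

# Option 2: ignore incomplete rows by dropping them.
dropped = df.dropna()

print(imputed)
print(dropped)
```

Which option is appropriate depends on how much data is missing and whether the gaps follow a pattern, as discussed above.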
What is the Normal Distribution?
Data can be distributed in different ways: it can be skewed to the left, skewed to the right, or jumbled up with no clear pattern. Data that is distributed symmetrically around a central value, without any bias to the left or right, and that forms a bell-shaped curve is said to be normally distributed.
So, a random variable is said to follow a normal distribution when most of its values cluster around the mean and form a symmetric, bell-shaped distribution.
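To make "most of the values cluster around the mean" concrete, here is a small simulation, assuming NumPy. It draws samples from a standard normal distribution and checks what share falls within one and two standard deviations of the mean; for a normal distribution these shares are roughly 68% and 95%. The sample size and seed are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

mean = samples.mean()
std = samples.std()

# Share of samples within 1 and 2 standard deviations of the mean.
within_one_std = np.mean(np.abs(samples - mean) <= std)
within_two_std = np.mean(np.abs(samples - mean) <= 2 * std)

print(f"within 1 std: {within_one_std:.3f}")
print(f"within 2 std: {within_two_std:.3f}")
```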
What is data normalization and why do we need it?
Data normalization is an important preprocessing step that rescales feature values to a common scale, which helps gradient-based training (for example, backpropagation) converge better. A common form, often called standardization, subtracts the mean of each feature from every data point and divides by that feature's standard deviation. If we don't do this, features with high magnitude will be weighted more in the cost function (a 1% change in a high-magnitude feature is a large absolute change, while the same relative change in a small-magnitude feature is almost insignificant). Normalization puts all features on a comparable scale so that none of them dominates purely because of its units.
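The mean-and-standard-deviation rescaling described above can be sketched as follows, assuming NumPy. The feature matrix is invented for the example (think of the columns as area and number of rooms).

```python
import numpy as np

# Hypothetical feature matrix: rows are data points, columns are features
# with very different magnitudes.
X = np.array([
    [1500.0, 3.0],
    [2400.0, 4.0],
    [900.0, 2.0],
    [3000.0, 5.0],
])

# Per-feature (column-wise) statistics.
feature_mean = X.mean(axis=0)
feature_std = X.std(axis=0)

# Standardize: each column ends up with mean 0 and standard deviation 1.
X_scaled = (X - feature_mean) / feature_std

print(X_scaled)
```

In practice the mean and standard deviation are usually computed on the training data only and then reused for validation and test data, so that no information leaks from the held-out sets.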
What is the goal of A/B testing?
• A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B.
• The goal of A/B testing is to identify which change improves the outcome of interest.
• An example of this could be improving the click-through rate for a banner ad.
What does the p-value signify about the statistical data?
The p-value is used to determine the significance of results after a hypothesis test in statistics. It is always between 0 and 1, and it helps readers draw conclusions when compared against a significance level, conventionally 0.05.
• P-value > 0.05 denotes weak evidence against the null hypothesis, which means that the null hypothesis cannot be rejected.
• P-value < 0.05 denotes strong evidence against the null hypothesis, which means that the null hypothesis can be rejected.
• P-value = 0.05 is the marginal value, indicating that it is possible to go either way.
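To tie the p-value back to the banner-ad example from the A/B testing question, here is a minimal sketch of a two-proportion z-test using only the standard library. The test choice, the function name, and the click counts are assumptions made for illustration; real experiments may call for a different test.

```python
import math

def two_proportion_p_value(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference in click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: banner A vs banner B, 10,000 views each.
z, p_value = two_proportion_p_value(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis: the banners likely differ.")
else:
    print("Cannot reject the null hypothesis.")
```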
Name a few libraries in Python used for Data Analysis?
• NumPy: It is used for scientific computing and performing basic and advanced array operations. It offers many handy features for performing operations on n-dimensional arrays and matrices. It helps to process arrays that store values of the same data type and makes performing math operations on arrays (and their vectorization) easier.
• SciPy: It is a useful library that includes modules for linear algebra, integration, optimization, and statistics. It is built on top of NumPy, so it works directly with NumPy arrays.
• Pandas: It is a library created to help developers work with “labeled” and “relational” data intuitively. It’s based on two main data structures: “Series” (one-dimensional, like a list of items) and “Data Frames” (two-dimensional, like a table with multiple columns).
Learn more about Pandas from here :
• Scikit-learn: Scikits are add-on packages in the SciPy stack that were created for specific functionality (for example, image processing). Scikit-learn is the scikit for machine learning: it uses the math operations of SciPy to expose a concise interface to the most common machine learning algorithms.
• Matplotlib: This is a standard data science library that helps to generate data visualizations such as two-dimensional diagrams and graphs (histograms, scatterplots, non-Cartesian coordinates graphs).
• Seaborn: Seaborn is based on Matplotlib and serves as a useful Python machine learning tool for visualizing statistical models – heatmaps and other types of visualizations that summarize data and depict the overall distributions.
• Plotly: This web-based tool for data visualizations offers many useful out-of-box graphics. The library works very well in interactive web applications.
What is the difference between Supervised Learning and Unsupervised Learning?
If an algorithm learns from labeled training data, where each example comes with a known response variable, so that the learned relationship can be applied to unseen test data, then it is referred to as Supervised Learning. Classification is an example of Supervised Learning.
If the algorithm is given no response variable (no labels) and has to find structure in the data on its own, then it is referred to as Unsupervised Learning. Clustering is an example of Unsupervised Learning.
What are the types of Supervised Machine learning? Explain and provide one example each?
Supervised learning problems can be divided into regression and classification problems.
• Classification: In a classification problem, the output variable is a category, such as “red” or “blue,” “disease” or “no disease,” “true” or “false,” etc.
• Regression: In a regression problem, the output variable is a real continuous value, such as “dollars” or “weight.”
What is the functionality of Linear Regression and how does it work?
Linear regression allows us to quantify the relationship between a particular variable and the outcome we care about while controlling for other factors. The main goal of regression analysis is to find the “best fit” line for the linear relationship between the variables.
Regression analysis typically uses a method called Ordinary Least Squares (OLS). As the name suggests, OLS chooses the line that minimizes the sum of squared residuals, where a residual is the difference between an observed value and the value predicted by the line. Because the residuals are squared before they are added up, points that lie far from the line (the ‘outliers’) carry more weight.
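A minimal sketch of OLS for a single predictor, assuming NumPy. The five data points are made up for the example; `np.linalg.lstsq` finds the intercept and slope that minimize the sum of squared residuals.

```python
import numpy as np

# Hypothetical observations of one predictor x and an outcome y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

# Design matrix: a column of ones (intercept) next to x (slope).
A = np.column_stack([np.ones_like(x), x])

# Least-squares solution: minimizes sum((y - A @ coef) ** 2).
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coef

residuals = y - A @ coef
sum_squared_residuals = float(np.sum(residuals ** 2))

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"sum of squared residuals = {sum_squared_residuals:.4f}")
```

Any other line through the same points gives a larger sum of squared residuals, which is what "best fit" means here.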
What are the assumptions of Linear Regression (if any)?
The main assumptions of Linear Regression are:
• The true underlying relationship between the dependent and independent variables is linear in nature.
• Residuals are statistically independent.
• Variance of the residuals is constant.
• Residuals are normally distributed.
If any of these assumptions are violated, then any forecasts or confidence intervals may be misleading or biased. The linear regression is likely to perform poorly out of sample as a result.
What do you understand by Logistic Regression? State a real-life application of the algorithm?
Logistic Regression is a classification algorithm used to predict a binary outcome for a given set of independent variables.
Logistic regression outputs a probability between 0 and 1, which is converted into a class label using a threshold, generally 0.5. Any value above 0.5 is considered 1, and any value below 0.5 is considered 0.
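The thresholding step can be sketched like this, assuming NumPy. The weights, bias, and inputs are hand-picked, hypothetical values rather than a fitted model; a probability of exactly 0.5 is assigned to class 1 here, which is a choice made for this sketch.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, weights, bias, threshold=0.5):
    """Return predicted probabilities and the thresholded 0/1 labels."""
    probabilities = sigmoid(X @ weights + bias)
    labels = (probabilities >= threshold).astype(int)
    return probabilities, labels

# Hypothetical coefficients and two example data points.
weights = np.array([0.8, -1.2])
bias = -0.5
X = np.array([
    [3.0, 0.5],
    [0.5, 2.0],
])

probabilities, labels = predict(X, weights, bias)
print(probabilities)
print(labels)
```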
Logistic regression can be used to predict customer churn. Check our tutorial on analyzing and finding meaningful insights from consumer churn dataset:
What is the difference between a test set and a validation set?
The validation set can be considered a part of the training data that is held out during fitting; it is used for hyperparameter selection and to avoid overfitting the model being built. A test set, on the other hand, is used only to evaluate the performance of the trained machine learning model.
In simple terms, the differences can be summarized as follows:
• The training set is used to fit the parameters, i.e. the weights.
• The validation set is used to tune the hyperparameters.
• The test set assesses the performance of the model, i.e. its predictive power and generalization.
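One simple way to carve out the three sets is to shuffle the row indices and slice them. This is a sketch assuming NumPy; the function name and the 60/20/20 proportions are arbitrary choices for illustration.

```python
import numpy as np

def train_val_test_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Return shuffled, non-overlapping index arrays for the three sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The test indices should be set aside and touched only once, after all tuning on the validation set is finished.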
What do you understand by Sensitivity? How is it calculated?
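• Sensitivity is commonly used to evaluate the performance of a classifier (Logistic, SVM, RF, etc.).
• Sensitivity is the proportion of actual positive events that the model correctly predicts as positive: "Correctly predicted TRUE events / Total actual TRUE events". The correctly predicted events here are the events that were true and that the model also predicted as true.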
• Calculation of sensitivity can be done as follows:
Sensitivity = True Positives/Positives in Actual Dependent Variable
Where True positives are positive events that are correctly classified.
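The formula above translates directly into code. This is a plain-Python sketch; the function name and the example labels are invented for illustration, with 1 standing for a positive event.

```python
def sensitivity(y_true, y_pred):
    """True positives divided by all actual positives."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_positives = sum(1 for t in y_true if t == 1)
    if actual_positives == 0:
        raise ValueError("Sensitivity is undefined without actual positives.")
    return true_positives / actual_positives

# Hypothetical actual labels and model predictions.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

result = sensitivity(y_true, y_pred)
print(f"sensitivity = {result:.2f}")
```

Here 3 of the 5 actual positives are predicted as positive, so the sensitivity is 3/5.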
What is the trade-off between Bias and Variance?
If our model is too simple and has very few parameters, then it may have high bias and low variance (underfitting). On the other hand, if our model has a large number of parameters, then it is likely to have high variance and low bias (overfitting). So we need to find the right balance, avoiding both overfitting and underfitting the data.
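A rough way to see this trade-off is to fit polynomials of increasing degree to noisy data and compare the error on the training points with the error on fresh points. The sketch below assumes NumPy; the sine-shaped data, noise level, and degrees are arbitrary choices. Typically the training error keeps falling as the degree grows, while the error on new data stops improving or gets worse once the model starts fitting noise.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples from a sine curve (synthetic data for this sketch)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

# Degree 1 has few parameters (simple); degree 9 has many (flexible).
results = {}
for degree in (1, 3, 9):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    train_err, test_err = results[degree]
    print(f"degree {degree}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```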
What are some important features for effective Data Visualization?
A data visualization should be light and must highlight the essential aspects of the data: the important variables, their relative importance, and the trends and changes. It should be visually appealing but free of unnecessary information.
This question can be answered in multiple ways, from technical details to key design principles, but be sure to mention these points:
• Data positioning
• Bars over circles and squares
• Use of color theory
• Reducing chart junk by avoiding 3D charts and eliminating the use of pie charts to show proportions
What is a Scatter plot and what features might be visible in a Scatter plot?
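A scatter plot is a chart that plots two numeric variables against each other, one per axis, to show the relationship between them; further variables can be encoded with color or marker size. Features that might be visible in a scatter plot include: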
• Correlation: the two variables might have a relationship, for example, one might depend on another. But this is not the same as causation.
• Associations: the variables may be associated with one another.
• Outliers: there could be cases where the data in two dimensions does not follow the general pattern.
• Clusters: sometimes there could be groups of data that form a cluster on the plot.
• Gaps: some combinations of values might not exist in a particular case.
• Barriers: boundaries.
• Conditional relationships: some relationships between the variables rely on a condition to be met.
When analyzing a histogram, what are some of the features to look for?
• Heaping example: temperature data can consist of common values due to conversion from Fahrenheit to Celsius.
• Rounding example: weight data that are all multiples of 5.
What libraries do Data Scientists use to plot data in Python?
Matplotlib is the main library used for plotting data in Python. However, plots created with it often need a lot of fine-tuning to look polished and professional. For that reason, many data scientists prefer Seaborn and Plotly, which allow you to create appealing and meaningful plots with very little code, often a single line.
Check out our post to know how to use Plotly to create amazing visualizations:
We hope you found this post precise and informative, and that it serves as a pit stop to revise your basics before any fresher Data Science interview.
For more such content, consider subscribing.
All the best!