Data Science Interview Questions: What The Experts Say

Author: Guest Blog
Posting date: 8/22/2019 9:13 AM
Our friends at Data Science Dojo have compiled a list of 101 actual Data Science interview questions asked between 2016 and 2019 at some of the largest recruiters in the Data Science industry: Amazon, Microsoft, Facebook, Google, Netflix, Expedia, and more.

Data Science is an interdisciplinary field that sits at the intersection of computer science, statistics/mathematics, and domain knowledge. To perform well, one needs a good foundation in not one but multiple fields, and this is reflected in the interview. They've divided the questions into six categories:

  • Machine Learning
  • Data Analysis
  • Statistics, Probability, and Mathematics
  • Programming
  • SQL
  • Experiential/Behavioural Questions

Once you've gone through all the questions, you should have a good understanding of how well you're prepared for your next Data Science interview.

Machine Learning


As one would expect, Data Science interviews focus heavily on questions that help the company test your concepts, applications, and experience in machine learning. Each question included in this category has been recently asked in one or more actual Data Science interviews at companies such as Amazon, Google, and Microsoft. These questions will give you a good sense of which sub-topics appear more often than others. You should also pay close attention to the way these questions are phrased in an interview.

  • Explain Logistic Regression and its assumptions.
  • Explain Linear Regression and its assumptions.
  • How do you split your data between training and validation?
  • Describe Binary Classification.
  • Explain the working of decision trees.
  • What are different metrics to classify a dataset?
  • What's the role of a cost function?
  • What's the difference between a convex and a non-convex cost function?
  • Why is it important to know the bias-variance trade-off while modeling?
  • Why is regularisation used in machine learning models? What are the differences between L1 and L2 regularisation?
  • What's the problem of exploding gradients in machine learning?
  • Is it necessary to use activation functions in neural networks?
  • In what aspects is a box plot different from a histogram?
  • What is cross validation? Why is it used?
  • Can you explain the concept of false positive and false negative?
  • Explain how SVM works.
  • While working at Facebook, you're asked to implement some new features. What type of experiment would you run to implement these features?
  • What techniques can be used to evaluate a Machine Learning model?
  • Why is overfitting a problem in machine learning models? What steps can you take to avoid it?
  • Describe a way to detect anomalies in a given dataset.
  • What are the Naive Bayes fundamentals?
  • What is AUC - ROC Curve?
  • What is K-means?
  • How does the Gradient Boosting algorithm work?
  • Explain advantages and drawbacks of Support Vector Machines (SVM).
  • What is the difference between bagging and boosting?
  • Before building any model, why do we need the feature selection/engineering step?
  • How to deal with unbalanced binary classification?
  • What is the ROC curve and the meaning of sensitivity, specificity, confusion matrix?
  • Why is dimensionality reduction important?
  • What are hyperparameters, how to tune them, how to test and know if they worked for the particular problem?
  • How will you decide whether a customer will buy a product today or not given the income of the customer, location where the customer lives, profession, and gender? Define a machine learning algorithm for this.
  • How will you inspect missing data and when are they important for your analysis?
  • How will you design the heatmap for Uber drivers to provide recommendations on where to wait for passengers? How would you approach this?
  • What are time series forecasting techniques?
  • How does a logistic regression model know what the coefficients are?
  • Explain Principal Component Analysis (PCA) and its assumptions.
  • Formulate Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) techniques.
  • What are neural networks used for?
  • Why is gradient checking important?
  • Is random weight assignment better than assigning same weights to the units in the hidden layer?
  • How do you find the F1 score after a model is trained? (A sketch follows this list.)
  • How many topic modeling techniques do you know of? Explain them briefly.
  • How does a neural network with one layer and one input and output compare to a logistic regression?
  • Why is the Rectified Linear Unit (ReLU) a good activation function?
  • When using the Gaussian mixture model, how do you know it's applicable?
  • If a Product Manager says that they want to double the number of ads in Facebook's Newsfeed, how would you figure out if this is a good idea or not?
  • What do you know about LSTM?
  • Explain the difference between generative and discriminative algorithms.
  • Can you explain what MapReduce is and how it works?
  • If the model isn't perfect, how would you like to select the threshold so that the model outputs 1 or 0 for label?
  • Are boosting algorithms better than decision trees? If yes, why?
  • What do you think are the important factors in the algorithm Uber uses to assign rides to drivers?
  • How does speech synthesis work?
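
Several of the questions above (splitting data, cross-validation, computing an F1 score, ROC-AUC) are easier to discuss with a few lines of code in front of you. Here is a minimal sketch, assuming scikit-learn; the built-in breast-cancer dataset and the logistic regression model are arbitrary choices made purely for illustration, not part of any specific interview question.

```python
# A minimal sketch, assuming scikit-learn; dataset and model are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out a validation set ("How do you split your data between training and validation?").
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# 5-fold cross-validation on the training portion ("What is cross validation?").
cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
print("Mean cross-validated F1:", cv_f1.mean())

# F1 and ROC-AUC on the held-out set ("How do you find the F1 score after a model is trained?").
predictions = model.predict(X_val)
print("Validation F1:", f1_score(y_val, predictions))
print("Validation ROC-AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```

Stratifying the split keeps the class proportions similar in the training and validation sets, which is also one small, practical talking point for the unbalanced-classification question above.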

Data Analysis


Machine Learning concepts are not the only area in which you'll be tested in the interview. Data pre-processing and data exploration are other areas where you can always expect a few questions. We're grouping all such questions under this category. Data Analysis is the process of evaluating data using analytical and statistical tools to discover useful insights. Once again, all these questions have been recently asked in one or more actual Data Science interviews at the companies listed above.  

  • What are the core steps of the data analysis process?
  • How do you detect if a new observation is an outlier?
  • Facebook wants to analyse why the "likes per user and minutes spent on a platform are increasing, but total number of users are decreasing". How can they do that?
  • If you have a chance to add something to Facebook then how would you measure its success?
  • If you are working at Facebook and want to detect bogus/fake accounts, how will you go about that?
  • What are anomaly detection methods?
  • How do you solve for multicollinearity?
  • How to optimise marketing spend between various marketing channels?
  • What metrics would you use to track whether Uber's strategy of using paid advertising to acquire customers works?
  • What are the core steps for data preprocessing before applying machine learning algorithms?
  • How do you inspect missing data? (A sketch follows this list.)
  • How does caching work and how do you use it in Data Science?
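
As a concrete starting point for the missing-data and preprocessing questions above, here is a small sketch assuming pandas. The toy DataFrame, its column names, and the median/mode imputation choices are assumptions for illustration only; the right strategy always depends on why the data are missing.

```python
# A minimal sketch, assuming pandas; the toy data and imputation strategy are illustrative only.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 29, 41, np.nan],
    "income": [52000, 61000, np.nan, 87000, 45000],
    "city": ["Seattle", "Austin", None, "Boston", "Austin"],
})

# Inspect missing data: count and share of missing values per column.
print(df.isna().sum())
print(df.isna().mean())

# One common (not universal) choice: median for numeric columns, mode for categorical columns.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```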

Statistics, Probability and Mathematics


As we've already mentioned, Data Science builds on a foundation of statistics and probability. A strong grounding in these concepts is a requirement for the field, and these topics are always brought up in Data Science interviews. Here is a list of statistics and probability questions that have been asked in actual Data Science interviews.

  • How would you select a representative sample of search queries from 5 million queries?
  • Discuss how to randomly select a sample from a product user population.
  • What is the importance of Markov Chains in Data Science?
  • How do you prove that males are on average taller than females by knowing just gender or height?
  • What is the difference between Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP)?
  • What does P-Value mean?
  • Define the Central Limit Theorem (CLT) and its application.
  • There are six marbles in a bag; one is white. You reach in the bag 100 times. After drawing a marble, it is placed back in the bag. What is the probability of drawing the white marble at least once?
  • Explain Euclidean distance.
  • Define variance.
  • How will you cut a circular cake into eight equal pieces?
  • What is the law of large numbers?
  • How do you weigh nine marbles three times on a balance scale to select the heaviest one?
  • You call three random friends who live in Seattle and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of lying. All three say "yes". What's the probability it's actually raining?
  • Explain a probability distribution that is not normal and how to apply that?
  • You have two dice. What is the probability of getting at least one four? Also find the probability of getting at least one four if you have n dice. (A worked solution follows this list.)
  • Draw the curve log(x+10)
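
A quick worked example for the dice question above, using the complement rule (one route, not the only valid phrasing): the chance of seeing no four on a single fair die is 5/6, so with n independent dice P(at least one four) = 1 - (5/6)^n. For n = 2 this is 1 - 25/36 = 11/36, roughly 0.31. The same logic answers the marble question: drawing with replacement 100 times from a bag where 1 marble in 6 is white gives P(at least one white) = 1 - (5/6)^100, which is essentially 1.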

Programming


When you appear for a Data Science interview, your interviewers are not expecting you to come up with highly efficient code that uses the fewest hardware resources and runs the fastest. However, they do expect you to be able to use R, Python, or SQL so that you can access data sources and at least build prototype solutions.

You should expect a few programming/coding questions in your Data Science interviews. Your interviewer might want you to write a short piece of code on a whiteboard to assess how comfortable you are with coding, as well as to get a feel for how much code you typically write in a given week.

Here are some programming and coding questions that companies like Amazon, Google, and Microsoft have asked in their Data Science interviews. 

  • Write a function to check whether a particular word is a palindrome or not.
  • Write a program to generate Fibonacci sequence.
  • Explain string parsing in the R language.
  • Write a sorting algorithm for a numerical dataset in Python.
  • Coding test: moving average. Input: 10, 20, 30, 10, ... Output: 10, 15, 20, 17.5, ... (A sketch follows this list.)
  • Write Python code to return the count of words in a string.
  • How do you find a percentile? Write the code for it.
  • What is the difference between (i) a Stack and a Queue, and (ii) a Linked List and an Array?
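
Reading the sample input/output of the moving-average test as a cumulative running average (10, 15, 20, 17.5 are the means of the first one, two, three, and four values respectively), a minimal Python sketch under that assumption might look like this:

```python
def running_average(values):
    """Return the cumulative (running) mean after each new value is seen."""
    averages, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        averages.append(total / count)
    return averages

print(running_average([10, 20, 30, 10]))  # [10.0, 15.0, 20.0, 17.5]
```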

Structured Query Language (SQL)


Real-world data is stored in databases and it ‘travels’ via queries. If there's one language a Data Science professional must know, it's SQL, or “Structured Query Language”. SQL is widely used across all job roles in Data Science and is often a ‘deal-breaker’. SQL questions are usually placed early in the hiring process and used for screening. Here are some SQL questions that top companies have asked in their Data Science interviews.

  • How would you handle NULLs when querying a data set?
  • How will you explain JOIN function in SQL in the simplest possible way?
  • Select all customers who purchased at least two items on two separate days from Amazon. (A sketch follows this list.)
  • What is the difference between DDL, DML, and DCL?
  • Why is database normalisation important?
  • What is the difference between clustered and non-clustered index?
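
For the "two items on two separate days" question, one reasonable reading is customers whose purchases fall on at least two distinct dates. To keep all of the examples in Python, the sketch below runs the query against an in-memory SQLite table; the schema, column names, and sample rows are assumptions for illustration, not Amazon's actual data model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and sample rows, purely for illustration.
conn.executescript("""
CREATE TABLE purchases (customer_id INTEGER, item_id INTEGER, purchase_date TEXT);
INSERT INTO purchases VALUES
  (1, 101, '2019-01-01'), (1, 102, '2019-01-05'),
  (2, 103, '2019-01-01'), (2, 104, '2019-01-01'),
  (3, 105, '2019-02-01');
""")

# Customers whose purchases span at least two distinct days
# (which also implies at least two items).
query = """
    SELECT customer_id
    FROM purchases
    GROUP BY customer_id
    HAVING COUNT(DISTINCT purchase_date) >= 2;
"""
print(conn.execute(query).fetchall())  # [(1,)]
```

The HAVING COUNT(DISTINCT purchase_date) >= 2 clause is what enforces "two separate days"; counting rows alone would also match a customer who bought two items in a single order.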

Situational/Behavioural Questions


Capabilities don't necessarily guarantee performance. It's for this reason that employers ask situational or behavioural questions, to assess how you would perform in a given situation. In some cases, such a question will ask you to reflect on how you behaved and performed in a past situation. A situational question can help interviewers assess your role in a project you've included in your resume, and can reveal whether you're a team player and how you deal with pressure and failure. Situational questions are no less important than the technical ones, and it always helps to do some homework beforehand. Recall your experience and be prepared!

Here are some situational/behavioural questions that large tech companies typically ask:   

  • What was the most challenging project you have worked on so far? Can you explain your learning outcomes?
  • According to your judgement, does Data Science differ from Machine Learning?
  • If you're faced with Selection Bias, how will you avoid it?
  • How would you describe Data Science to a Business Executive?

If you're looking for a new Data Science role, you can find our latest opportunities here.

This article was written by Tooba Mukhtar and Rahim Rasool for Data Science Dojo. It has been republished with permission. You can view the original article, which includes answers to the above questions, here.

