Data Science Interview Questions: What The Experts Say

Author: Guest Blog
Posting date: 8/22/2019 9:13 AM
Our friends at Data Science Dojo have compiled a list of 101 actual Data Science interview questions that were asked between 2016 and 2019 at some of the largest recruiters in the Data Science industry – Amazon, Microsoft, Facebook, Google, Netflix, Expedia, etc.

Data Science is an interdisciplinary field that sits at the intersection of computer science, statistics/mathematics, and domain knowledge. To perform well, you need a good foundation in not one but several fields, and this is reflected in the interview. Data Science Dojo has divided the questions into six categories:

  • Machine Learning
  • Data Analysis
  • Statistics, Probability, and Mathematics
  • Programming
  • SQL
  • Experiential/Behavioural Questions

Once you've gone through all the questions, you should have a good understanding of how well you're prepared for your next Data Science interview.

Machine Learning

As one would expect, Data Science interviews focus heavily on questions that help the company test your grasp of machine learning concepts, their applications, and your experience with them. Each question included in this category has been recently asked in one or more actual Data Science interviews at companies such as Amazon, Google, Microsoft, etc. These questions will give you a good sense of which sub-topics appear more often than others. You should also pay close attention to the way these questions are phrased in an interview.

  • Explain Logistic Regression and its assumptions.
  • Explain Linear Regression and its assumptions.
  • How do you split your data between training and validation?
  • Describe Binary Classification.
  • Explain the working of decision trees.
  • What are different metrics to classify a dataset?
  • What's the role of a cost function?
  • What's the difference between convex and non-convex cost function?
  • Why is it important to know bias-variance trade off while modeling?
  • Why is regularisation used in machine learning models? What are the differences between L1 and L2 regularisation?
  • What's the problem of exploding gradients in machine learning?
  • Is it necessary to use activation functions in neural networks?
  • In what aspects is a box plot different from a histogram?
  • What is cross validation? Why is it used?
  • Can you explain the concept of false positive and false negative?
  • Explain how SVM works.
  • While working at Facebook, you're asked to implement some new features. What type of experiment would you run to implement these features?
  • What techniques can be used to evaluate a Machine Learning model?
  • Why is overfitting a problem in machine learning models? What steps can you take to avoid it?
  • Describe a way to detect anomalies in a given dataset.
  • What are the Naive Bayes fundamentals?
  • What is AUC - ROC Curve?
  • What is K-means?
  • How does the Gradient Boosting algorithm work?
  • Explain advantages and drawbacks of Support Vector Machines (SVM).
  • What is the difference between bagging and boosting?
  • Before building any model, why do we need the feature selection/engineering step?
  • How to deal with unbalanced binary classification?
  • What is the ROC curve and the meaning of sensitivity, specificity, confusion matrix?
  • Why is dimensionality reduction important?
  • What are hyperparameters, how to tune them, how to test and know if they worked for the particular problem?
  • How will you decide whether a customer will buy a product today or not given the income of the customer, location where the customer lives, profession, and gender? Define a machine learning algorithm for this.
  • How will you inspect missing data and when are they important for your analysis?
  • How will you design the heatmap for Uber drivers to provide recommendation on where to wait for passengers? How would you approach this?
  • What are time series forecasting techniques?
  • How does a logistic regression model know what the coefficients are?
  • Explain Principal Component Analysis (PCA) and its assumptions.
  • Formulate Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) techniques.
  • What are neural networks used for?
  • Why is gradient checking important?
  • Is random weight assignment better than assigning same weights to the units in the hidden layer?
  • How to find the F1 score after a model is trained?
  • How many topic modeling techniques do you know of? Explain them briefly.
  • How does a neural network with one layer and one input and output compare to a logistic regression?
  • Why is the Rectified Linear Unit (ReLU) a good activation function?
  • When using the Gaussian mixture model, how do you know it's applicable?
  • If a Product Manager says that they want to double the number of ads in Facebook's Newsfeed, how would you figure out if this is a good idea or not?
  • What do you know about LSTM?
  • Explain the difference between generative and discriminative algorithms.
  • Can you explain what MapReduce is and how it works?
  • If the model isn't perfect, how would you like to select the threshold so that the model outputs 1 or 0 for label?
  • Are boosting algorithms better than decision trees? If yes, why?
  • What do you think are the important factors in the algorithm Uber uses to assign rides to drivers?
  • How does speech synthesis work?
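
Several of the questions above (false positives and false negatives, the confusion matrix, the F1 score) come down to the same four counts. The following is a minimal sketch of how those counts lead to precision, recall, and F1, assuming binary labels where 1 is the positive class. The function names and the sample labels are illustrative only and do not come from the original article.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positive
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly ignored
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))
print(f1_score(y_true, y_pred))
```

In an interview it usually helps to say which error type is more costly for the problem at hand, since that is what drives the choice between optimising precision and optimising recall.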

Data Analysis

Machine Learning concepts are not the only area in which you'll be tested in the interview. Data pre-processing and data exploration are other areas where you can always expect a few questions. We're grouping all such questions under this category. Data Analysis is the process of evaluating data using analytical and statistical tools to discover useful insights. Once again, all these questions have been recently asked in one or more actual Data Science interviews at the companies listed above.  

  • What are the core steps of the data analysis process?
  • How do you detect if a new observation is an outlier?
  • Facebook wants to analyse why "likes per user and minutes spent on the platform are increasing, but the total number of users is decreasing". How can they do that?
  • If you have a chance to add something to Facebook then how would you measure its success?
  • If you are working at Facebook and want to detect bogus/fake accounts, how will you go about that?
  • What are anomaly detection methods?
  • How do you solve for multicollinearity?
  • How to optimise marketing spend between various marketing channels?
  • What metrics would you use to track whether Uber's strategy of using paid advertising to acquire customers works?
  • What are the core steps for data preprocessing before applying machine learning algorithms?
  • How do you inspect missing data?
  • How does caching work and how do you use it in Data Science?
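
For the outlier question above, one common rule of thumb is the interquartile range (IQR) fence. The sketch below is one possible answer, not the only one: the helper name `is_outlier`, the 1.5 multiplier, and the sample data are all assumptions made for illustration.

```python
import statistics


def is_outlier(history, new_value, k=1.5):
    """Flag new_value if it falls outside the IQR fences built from history.

    The fences are Q1 - k*IQR and Q3 + k*IQR; k = 1.5 is a convention,
    not a law, and should be tuned to the data.
    """
    q1, _, q3 = statistics.quantiles(history, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return new_value < lower or new_value > upper


# Made-up observations, purely for illustration.
history = [10, 12, 11, 13, 12, 11, 14, 12, 13, 11]
print(is_outlier(history, 40))
print(is_outlier(history, 12))
```

The IQR rule makes no assumption of normality, which is one reason to mention it alongside z-score based checks.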

Statistics, Probability and Mathematics

As we've already mentioned, Data Science builds on statistics and probability. A strong foundation in both is a requirement for the field, and these topics are always brought up in Data Science interviews. Here is a list of statistics and probability questions that have been asked in actual Data Science interviews.

  • How would you select a representative sample of search queries from 5 million queries?
  • Discuss how to randomly select a sample from a product user population.
  • What is the importance of Markov Chains in Data Science?
  • How do you prove that males are on average taller than females by knowing just gender or height?
  • What is the difference between Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP)?
  • What does P-Value mean?
  • Define the Central Limit Theorem (CLT) and its applications.
  • There are six marbles in a bag, one is white. You reach in the bag 100 times. After drawing a marble, it is placed back in the bag. What is the probability of drawing the white marble at least once?
  • Explain Euclidean distance.
  • Define variance.
  • How will you cut a circular cake into eight equal pieces?
  • What is the law of large numbers?
  • How do you weigh nine marbles three times on a balance scale to select the heaviest one?
  • You call three random friends who live in Seattle and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of lying. All three say "yes". What's the probability it's actually raining?
  • Explain a probability distribution that is not normal and how to apply it.
  • You have two dice. What is the probability of getting at least one four? Also find out the probability of getting at least one four if you have n dice.
  • Draw the curve log(x+10)
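
Three of the probability questions above (the marbles, the dice, and the Seattle rain) can be checked with a few lines of exact arithmetic. This is a sketch rather than a model answer. The function names are illustrative, and the rain question has no single answer without a prior probability of rain, so the prior is passed in as an explicit assumption.

```python
from fractions import Fraction


def at_least_once(p_single, n_trials):
    """P(event occurs at least once in n independent trials) = 1 - (1 - p)^n."""
    return 1 - (1 - p_single) ** n_trials


def rain_posterior(prior_rain, p_truth=Fraction(2, 3), n_friends=3):
    """Bayes' rule for 'all friends say yes', given an ASSUMED prior of rain."""
    says_yes_if_rain = p_truth ** n_friends        # everyone tells the truth
    says_yes_if_dry = (1 - p_truth) ** n_friends   # everyone lies
    numerator = says_yes_if_rain * prior_rain
    return numerator / (numerator + says_yes_if_dry * (1 - prior_rain))


# Marbles: one white marble in six, 100 draws with replacement.
p_white = at_least_once(Fraction(1, 6), 100)
# Dice: at least one four with two dice; use n_trials=n for n dice.
p_four = at_least_once(Fraction(1, 6), 2)

print(float(p_white))
print(p_four)
print(rain_posterior(Fraction(1, 2)))  # assumes a 50% prior chance of rain
```

Stating the prior out loud is the main point of the rain question: with a different prior, the same three "yes" answers give a different posterior.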


Programming

When you appear for a Data Science interview, your interviewers are not expecting you to come up with highly efficient code that uses minimal hardware resources and runs quickly. However, they do expect you to be able to use R, Python, or SQL so that you can access data sources and at least build prototypes for solutions.

You should expect a few programming/coding questions in your Data Science interviews. Your interviewer might want you to write a short piece of code on a whiteboard to assess how comfortable you are with coding, as well as get a feel for how many lines of code you typically write in a given week.

Here are some programming and coding questions that companies like Amazon, Google, and Microsoft have asked in their Data Science interviews. 

  • Write a function to check whether a particular word is a palindrome or not.
  • Write a program to generate Fibonacci sequence.
  • Explain string parsing in the R language.
  • Write a sorting algorithm for a numerical dataset in Python.
  • Coding test: moving average. Input: 10, 20, 30, 10, ... Output: 10, 15, 20, 17.5, ...
  • Write Python code to return the count of words in a string.
  • How do you find a percentile? Write the code for it.
  • What is the difference between - (i) Stack and Queue and (ii) Linked list and Array?
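
A few of the coding questions above are short enough to sketch in Python. These are possible whiteboard-style answers, not reference solutions, and the function names are illustrative. Note one assumption: the sample output in the moving average question (10, 15, 20, 17.5) matches a cumulative average over all values seen so far, so that is what the sketch computes rather than a fixed-size window.

```python
def is_palindrome(word):
    """True if the word reads the same forwards and backwards (case-insensitive)."""
    cleaned = word.lower()
    return cleaned == cleaned[::-1]


def fibonacci(n):
    """Return the first n Fibonacci numbers, starting from 0."""
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b
    return sequence


def running_average(values):
    """Average of all values seen so far, emitted after each new value."""
    averages = []
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        averages.append(total / count)
    return averages


def word_count(text):
    """Count whitespace-separated words in a string."""
    return len(text.split())


print(is_palindrome("Level"))
print(fibonacci(7))
print(running_average([10, 20, 30, 10]))
print(word_count("the quick  brown fox"))
```

If the interviewer wants a windowed moving average instead, it is worth asking for the window size before writing any code.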

Structured Query Language (SQL)

Real-world data is stored in databases and it ‘travels’ via queries. If there's one language a Data Science professional must know, it's SQL - or “Structured Query Language”. SQL is widely used across all job roles in Data Science and is often a ‘deal-breaker’. SQL questions are placed early on in the hiring process and used for screening. Here are some SQL questions that top companies have asked in their Data Science interviews. 

  • How would you handle NULLs when querying a data set?
  • How will you explain the JOIN function in SQL in the simplest possible way?
  • Select all customers who purchased at least two items on two separate days from Amazon.
  • What is the difference between DDL, DML, and DCL?
  • Why is Database Normalisation Important?
  • What is the difference between clustered and non-clustered index?
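
The Amazon purchase question above is a typical GROUP BY and HAVING exercise. The sketch below runs one possible query against an in-memory SQLite database from Python. Everything about the schema is an assumption: the `purchases` table, its columns, and the sample rows are hypothetical, and the question is read here as "customers who bought items on at least two distinct days".

```python
import sqlite3

# Hypothetical schema and rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer_id INTEGER, item_id INTEGER, purchase_date TEXT)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (1, 101, "2019-08-01"),
        (1, 102, "2019-08-03"),  # customer 1 buys on two different days
        (2, 103, "2019-08-01"),
        (2, 104, "2019-08-01"),  # customer 2 buys twice on the same day
        (3, 105, "2019-08-02"),  # customer 3 buys once
    ],
)

# Group rows per customer and keep those with two or more distinct purchase days.
query = """
SELECT customer_id
FROM purchases
GROUP BY customer_id
HAVING COUNT(DISTINCT purchase_date) >= 2
ORDER BY customer_id
"""
repeat_buyers = [row[0] for row in conn.execute(query)]
conn.close()

print(repeat_buyers)
```

Because the wording of the question is ambiguous, it is reasonable to confirm the intended reading with the interviewer and adjust the HAVING clause accordingly.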

Situational/Behavioural Questions

Capabilities don’t necessarily guarantee performance. It's for this reason employers ask you situational or behavioural questions in order to assess how you would perform in a given situation. In some cases, a situational or behavioural question would force you to reflect on how you behaved and performed in a past situation. A situational question can help interviewers in assessing your role in a project you might have included in your resume, can reveal whether or not you're a team player, or how you deal with pressure and failure. Situational questions are no less important than any of the technical questions, and it will always help to do some homework beforehand. Recall your experience and be prepared! 

Here are some situational/behavioural questions that large tech companies typically ask:   

  • What was the most challenging project you have worked on so far? Can you explain your learning outcomes?
  • According to your judgement, does Data Science differ from Machine Learning?
  • If you're faced with Selection Bias, how will you avoid it?
  • How would you describe Data Science to a Business Executive?

If you're looking for a new Data Science role, you can find our latest opportunities here.

This article was written by Tooba Mukhtar and Rahim Rasool for Data Science Dojo. It has been republished with permission. You can view the original article, which includes answers to the above questions, here.



