From Broken Data Pipelines to Broken Data Headlines

Author: Guest Blog
Posting date: 10/15/2020 9:24 AM
This week's guest post is written by Moray Barclay.

Two things have caused the UK’s Test & Trace application to lose 16,000 Covid-19 test results, both of which are close to my heart. The first is the application’s data pipeline, which is broken. The second is a lack of curiosity. The former does not necessarily mean that a data application will fail. But when compounded by the latter, failure is certain.

Data Pipelines


All data applications have several parts, including an interesting part (algorithms, recently in the news), a boring part (data wrangling, never in the news), a creative part (visualisation, often a backdrop to the news), and an enabling part (engineering, usually misunderstood by the news). 

Data engineering includes not only the design and implementation of the IT infrastructure common to all software applications, but also the design and implementation of the data pipeline. As its name suggests, a data pipeline is the mechanism by which data is entered at one end of a data application and flows through the application via various algorithms to emerge in a very different form at the other end.
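As a toy illustration (entirely hypothetical, and not how T&T is implemented), a data pipeline in Python can be thought of as a chain of steps applied in sequence, with no manual handling in between:

    # A toy sketch: data enters at one end, passes through a sequence of
    # steps (algorithms or transformations), and emerges in a very
    # different form at the other end.
    def run_pipeline(raw_records, steps):
        data = raw_records
        for step in steps:
            data = step(data)   # each step's output feeds the next
        return data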

A well architected data application has a single pipeline from start to finish. This does not mean that there should be no human interaction with the data as it travels down the pipeline, but that interaction should be limited to actions which can do no harm. Human actions which do no harm include: pressing buttons to start running algorithms or other blocks of code, reading and querying data, and exporting data to do manual exploratory or forensic analysis within a data governance framework.

The data pipeline for Test & Trace will look something like this:   

  • a patient manually fills out a web-form, which automatically updates a patient list
  • for each test, the laboratory adds the test result for that patient
  • the lab sends an Excel file to Public Health England with the IDs of positive patients
  • PHE manually transpose the data in the Excel file into the NHS Test & Trace system
  • the NHS T&T system pushes each positive patient’s contact details to NHS T&T agents
  • for each positive patient, an NHS T&T contact centre agent phones them.

This is not a single pipeline, because in the middle a human being needs to open up an editable file and transpose it into another file. The pipeline is therefore broken, splitting at the point at which the second Excel file is manually created. If you put yourself in the shoes of the person receiving one of these Excel files, you can probably identify several ways in which this manual manipulation of data could lead to harm.

And it is not just the data which needs to be moved manually from one side of the broken pipeline to the other: it is also the associated data types, and CSV files can easily lose data type information. This matters. You may have experienced importing or exporting data with an application which changes 06/10/20 to 10/06/20. Patient identifiers should be of data type text, even if they consist only of numbers, for future-proofing. Real numbers represented in exponential format should, obviously, be of a numeric data type. And so on.
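To make this concrete, here is a minimal sketch in Python's pandas library of importing a CSV file with the data types made explicit rather than guessed; the file and column names are hypothetical, for illustration only:

    import pandas as pd

    # Hypothetical file and column names. Left to its defaults, read_csv
    # guesses types: an all-numeric patient ID loses its leading zeros,
    # and an ambiguous date like 06/10/20 may be read month-first.
    results = pd.read_csv(
        "test_results.csv",
        dtype={"patient_id": str},   # keep identifiers as text, even if all-numeric
    )
    # Parse UK day-first dates explicitly rather than relying on defaults.
    results["test_date"] = pd.to_datetime(results["test_date"], dayfirst=True)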

One final point: the different versions of Excel (between the Pillar 2 laboratories and PHE) are a side-show, because dwelling on them implies that had the versions been the same, everything would have been fine. This is wrong. The BBC have today reported that “To handle the problem, PHE is now breaking down the test result data into smaller batches to create a larger number of Excel templates. That should ensure none hit their cap.” This solves the specific Excel incompatibility problem (assuming the process of creating small batches is error-free) but has no bearing on the more fundamental problem of the broken data pipeline, which will remain until the manual Excel manipulation is replaced by a normal and not particularly complex automated process.
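For a sense of what that automated process might look like, here is a minimal sketch, assuming the laboratories deliver CSV files to an agreed location and results are appended straight into a database; the paths, table and column names are all hypothetical:

    import sqlite3
    from pathlib import Path

    import pandas as pd

    def ingest_lab_results(incoming_dir: str, db_path: str) -> int:
        """Load every lab CSV into the results table; return rows loaded."""
        conn = sqlite3.connect(db_path)
        total = 0
        for csv_file in sorted(Path(incoming_dir).glob("*.csv")):
            df = pd.read_csv(csv_file, dtype={"patient_id": str})
            df.to_sql("test_results", conn, if_exists="append", index=False)
            total += len(df)   # every row is counted; none silently dropped
        conn.close()
        return total

Unlike an Excel template, nothing here has a row ceiling, and the returned count can be reconciled against what the laboratories say they sent.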

Curiosity


So where does curiosity fit in?

The first thing that any Data Analyst does when they receive data is to look at it. This is partly a technical activity, but it is also a question of judgement and it requires an element of curiosity. Does this data look right? What is the range between the earliest and the latest dates? If I graph one measurement over time (in this case positive tests over time), does the line look right? If I graph two variables (such as Day Of Week versus positive tests) what does the scatter chart look like? Better still, if I apply regression analysis to the scatter chart what is the relationship between the two variables and within what bounds of confidence? How does that relate to the forecast? Why?

This is not about skills. If I receive raw data in CSV format, I open it in a Python environment or an SQL database. But anyone given the freedom to use their curiosity can open a CSV file in Notepad and see there are actually one million rows of data and not 65,000. Anyone given the freedom to use their curiosity can graph data in Excel to see whether it has strange blips. Anyone given the freedom to use their curiosity can drill down into anomalies.
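None of this needs specialist tooling. As a sketch (with hypothetical file and column names), the checks described above are a few lines of Python:

    import pandas as pd

    # Quick "does this look right?" checks on a newly received file.
    df = pd.read_csv("received_results.csv", dtype={"patient_id": str})
    df["test_date"] = pd.to_datetime(df["test_date"], dayfirst=True)

    print(len(df))                                       # the row count we expected?
    print(df["test_date"].min(), df["test_date"].max())  # a plausible date range?
    print(df.groupby(df["test_date"].dt.date).size())    # strange blips day to day?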

Had those receiving the data from the Pillar 2 laboratories been allowed to focus some of their curiosity on what they were receiving, they would have spotted pretty quickly that the 16,000 patient results were missing.

As it was, I suspect they were not given that freedom: I suspect they were told to transpose as much data as they could as quickly as possible. After all, what could possibly go wrong?

Single Data Pipeline, Singular Curiosity: Pick At Least One


To reiterate, the current problems with T&T would never have arisen with a single data pipeline which excluded any manual manipulation in Excel. But knowing that the data pipeline was broken and that manual manipulation was by design part of the solution, the only way to minimise the risk was to encourage the people performing that manual process to engage their curiosity about the integrity of the data they were manipulating.

In their prototype phases (for that is the status of the T&T application) data projects will sometimes go wrong. But they are much more likely to go wrong if the people involved, at all levels, do not have enough time or freedom to think, to engage their curiosity, and to ask themselves “is this definitely right?”

You can view Moray's original article here


Moray Barclay is an experienced Data Analyst working in hands-on coding, Big Data analytics, cloud computing and consulting.
