From Broken Data Pipelines to Broken Data Headlines

Author: Guest Blog
Posting date: 10/15/2020 9:24 AM
This week's guest post is written by Moray Barclay.

Two things have caused the UK’s Test & Trace application to lose 16,000 Covid-19 test results, both of which are close to my heart. The first is the application’s data pipeline, which is broken. The second is a lack of curiosity. The former does not necessarily mean that a data application will fail. But when it is compounded by the latter, failure is certain.

Data Pipelines


All data applications have several parts, including an interesting part (algorithms, recently in the news), a boring part (data wrangling, never in the news), a creative part (visualisation, often a backdrop to the news), and an enabling part (engineering, usually misunderstood by the news). 

Data engineering, in addition to the design and implementation of the IT infrastructure common to all software applications, includes the design and implementation of the data pipeline. As its name suggests, a data pipeline is the mechanism by which data is entered at one end of a data application and flows through the application via various algorithms to emerge in a very different form at the other end.

A well-architected data application has a single pipeline from start to finish. This does not mean that there should be no human interaction with the data as it travels down the pipeline, but that interaction should be limited to actions which can do no harm. Human actions which do no harm include: pressing buttons to start running algorithms or other blocks of code, reading and querying data, and exporting data to do manual exploratory or forensic analysis within a data governance framework.

The data pipeline for Test & Trace will look something like this:   

  • a patient manually fills out a web-form, which automatically updates a patient list
  • for each test, the laboratory adds the test result for that patient
  • the lab sends an Excel file to Public Health England with the IDs of positive patients
  • PHE manually transpose the data in the Excel file to the NHS Test & Trace system
  • the NHS T&T system pushes each positive patient's contact details to NHS T&T agents
  • for each positive patient, an NHS T&T contact centre agent phones them.

This is not a single pipeline because in the middle a human being needs to open up an editable file and transpose it into another file. The pipeline is therefore broken, splitting at the point at which the second Excel file is manually created. If you put yourself in the shoes of the person receiving one of these Excel files, you can probably identify several ways in which this manual manipulation of data could lead to harm.
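As a rough illustration of what closing that gap might look like, here is a minimal sketch in which the manual transposition step is replaced by code. It assumes the laboratory results arrive as CSV text and that the sender also declares how many rows it sent. The column names, function names and the `declared_rows` parameter are hypothetical and are not taken from the real Test & Trace system.

```python
import csv
import io


def parse_lab_results(csv_text):
    """Read lab results from CSV text, keeping every field as a string."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def positive_patient_ids(rows):
    """Select the IDs of patients whose result is positive."""
    return [row["patient_id"] for row in rows if row["result"] == "positive"]


def run_pipeline(csv_text, declared_rows):
    """Run every stage in code and reconcile the row count before handing off.

    `declared_rows` is a hypothetical figure supplied by the sender, so that
    silently truncated input is rejected rather than passed downstream.
    """
    rows = parse_lab_results(csv_text)
    if len(rows) != declared_rows:
        raise ValueError(
            f"row count mismatch: received {len(rows)}, sender declared {declared_rows}"
        )
    return positive_patient_ids(rows)


# Made-up sample input for illustration only.
sample = "patient_id,result\n0001,positive\n0002,negative\n0003,positive\n"
print(run_pipeline(sample, declared_rows=3))  # ['0001', '0003']
```

The point of the sketch is not the few lines of filtering but the reconciliation step: no human has to open an editable file, and a shortfall in rows stops the pipeline instead of going unnoticed.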

It is not just the data which needs to be moved manually from one side of the broken pipeline to the other: the associated data types must move with it, and CSV files can easily lose data type information. This matters. You may have experienced importing or exporting data with an application which changes 06/10/20 to 10/06/20. Patient identifiers should be of data type text, even if they consist only of numbers, for future-proofing. Real numbers represented in exponential format should, obviously, be of a numeric data type. And so on.
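The following sketch shows how these type problems can appear, and one way of making the types explicit when a record is read. The record, the field names and the `parse_record` helper are invented for illustration, and the day-first default is an assumption about the source data.

```python
from datetime import datetime

# A made-up record as it might arrive from a CSV file: everything is text.
raw = {"patient_id": "000123", "test_date": "06/10/20", "viral_load": "1.6e4"}

# Treating an identifier as a number silently drops its leading zeros.
print(int(raw["patient_id"]))  # 123

# The same string is a different date depending on the assumed format.
day_first = datetime.strptime(raw["test_date"], "%d/%m/%y").date()
month_first = datetime.strptime(raw["test_date"], "%m/%d/%y").date()
print(day_first, month_first)  # 2020-10-06 2020-06-10


def parse_record(record, date_format="%d/%m/%y"):
    """Convert one raw record to explicit types (hypothetical schema)."""
    return {
        "patient_id": str(record["patient_id"]),  # identifiers stay as text
        "test_date": datetime.strptime(record["test_date"], date_format).date(),
        "viral_load": float(record["viral_load"]),  # exponential notation is numeric
    }


print(parse_record(raw))
```

Stating the date format and the type of each field in one place means the choice is made once, in code, rather than by whichever application happens to open the file.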

One final point: the different versions of Excel (between the Pillar 2 laboratories and PHE) are a side-show. Focusing on them implies that, had the versions been the same, everything would have been fine. This is wrong. The BBC have today reported that “To handle the problem, PHE is now breaking down the test result data into smaller batches to create a larger number of Excel templates. That should ensure none hit their cap.” This solves the specific Excel incompatibility problem (assuming the process of creating small batches is error-free) but has no bearing on the more fundamental problem of the broken data pipeline, which will stay until the manual Excel manipulation is replaced by a normal and not particularly complex automated process.

Curiosity


So where does curiosity fit in?

The first thing that any Data Analyst does when they receive data is to look at it. This is partly a technical activity, but it is also a question of judgement and it requires an element of curiosity. Does this data look right? What is the range between the earliest and the latest dates? If I graph one measurement over time (in this case positive tests over time), does the line look right? If I graph two variables (such as Day Of Week versus positive tests) what does the scatter chart look like? Better still, if I apply regression analysis to the scatter chart what is the relationship between the two variables and within what bounds of confidence? How does that relate to the forecast? Why?
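The first of these questions can be asked in a few lines of code. The sketch below covers only the date range and the count over time, not the regression. The dates and counts are made up for illustration and are not real Test & Trace figures, and the `summarise` and `flag_blips` helpers and the 50% threshold are arbitrary choices for the example.

```python
from collections import Counter
from datetime import date

# Made-up positive-test dates for illustration only.
test_dates = (
    [date(2020, 9, 28)] * 10
    + [date(2020, 9, 29)] * 11
    + [date(2020, 9, 30)] * 3  # a suspicious dip
    + [date(2020, 10, 1)] * 12
)


def summarise(dates):
    """Return the earliest date, the latest date and the count per day."""
    counts = Counter(dates)
    return min(counts), max(counts), dict(sorted(counts.items()))


def flag_blips(daily_counts, threshold=0.5):
    """Return days whose count fell below `threshold` times the previous day's."""
    days = sorted(daily_counts)
    return [
        day
        for previous, day in zip(days, days[1:])
        if daily_counts[day] < threshold * daily_counts[previous]
    ]


earliest, latest, per_day = summarise(test_dates)
print(earliest, latest)     # 2020-09-28 2020-10-01
print(flag_blips(per_day))  # [datetime.date(2020, 9, 30)]
```

A flagged day is not proof of an error. It is a prompt to ask "why?", which is the judgement part that no script replaces.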

This is not about skills. If I receive raw data in CSV format I would open it in a Python environment or an SQL database. But anyone given the freedom to use their curiosity can open a CSV file in Notepad and see there are actually one million rows of data and not 65,000. Anyone given the freedom to use their curiosity can graph data in Excel to see whether it has strange blips. Anyone given the freedom to use their curiosity can drill down into anomalies.
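For those who do have a Python environment to hand, the row-count check is about as simple as a check can be. The sketch below counts the data rows in a CSV file and compares the total with the worksheet limit of the legacy .xls format, which is 65,536 rows (the "65,000" above is a round figure). The file contents are synthetic and the function names are invented for the example.

```python
import csv
import io

XLS_MAX_ROWS = 65_536  # worksheet row limit of the legacy .xls format


def count_data_rows(csv_file):
    """Count the rows in an open CSV file, excluding the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header if there is one
    return sum(1 for _ in reader)


def fits_in_xls(n_data_rows, header_rows=1):
    """Report whether the data plus header would fit in one .xls worksheet."""
    return n_data_rows + header_rows <= XLS_MAX_ROWS


# Synthetic file of 100,000 rows, built in memory for illustration only.
synthetic = io.StringIO(
    "patient_id,result\n" + "".join(f"{i},negative\n" for i in range(100_000))
)
n = count_data_rows(synthetic)
print(n, fits_in_xls(n))  # 100000 False
```

With a real file the same function would be called on `open(path, newline="")`. Either way, the answer takes seconds to obtain, provided someone is allowed the time to ask the question.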

Had those receiving the data from the Pillar 2 laboratories been allowed to focus some of their curiosity on what they were receiving, they would have spotted pretty quickly that the 16,000 patient results were missing.

As it was, I suspect they were not given that freedom: I suspect they were told to transpose as much data as they could as quickly as possible, for what could possibly go wrong?

Single Data Pipeline, Singular Curiosity: Pick At Least One


To reiterate, the current problems with T&T would never have arisen with a single data pipeline which excluded any manual manipulation in Excel. But knowing that the data pipeline was broken and manual manipulation was by design part of the solution, the only way to minimise the risk was to encourage people engaged in that manual process to engage their curiosity about the efficacy of the data they were manipulating.

In their prototype phases – for that is the status of the T&T application – data projects will sometimes go wrong. But they are much more likely to go wrong if the people involved, at all levels, do not have enough time or freedom to think, to engage their curiosity, and to ask themselves “is this definitely right?”

You can view Moray's original article here


Moray Barclay is an Experienced Data Analyst working in hands-on coding, Big Data analytics, cloud computing and consulting.
