Big Data Demystified

Posting date: 11/17/2014 12:00 AM

The term “Big Data” entered mainstream IT vocabulary around 2011, although its origins are older and several people have claimed to have coined it. It has become a buzzword that is sometimes misunderstood and often abused. Here we will try to demystify it so we can understand what it is and how we can realize its real value.

What is Big Data?

Many alternative definitions of Big Data have been published. One of the most insightful of these was proposed by Gartner and has become the accepted standard. It defines Big Data as “High volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”

These three V’s of Big Data, volume, velocity and variety, have more recently been augmented by the addition of a fourth V, veracity.

Volume

It is now almost a cliché to say that 90% of the world's existing data has been generated in the last two years. Around 2.4 trillion gigabytes of data are generated globally each day, much of it arising from the internet and from information-generating digital devices such as smartphones, digital cameras and CCTV, and the volume is growing exponentially. It is estimated that by 2020 worldwide corporate data will exceed 35,000 exabytes, where an exabyte is one quintillion bytes.
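
To put these units in perspective, here is a rough back-of-the-envelope conversion in Python, using only the figures quoted above (illustrative arithmetic, nothing more):

    # Rough unit conversions for the volumes quoted above.
    GB = 10**9    # gigabyte in bytes
    EB = 10**18   # exabyte, i.e. one quintillion bytes
    ZB = 10**21   # zettabyte

    daily_bytes = 2.4e12 * GB          # ~2.4 trillion gigabytes per day
    print(f"Generated per day: {daily_bytes / EB:,.0f} exabytes")

    corporate_2020 = 35_000 * EB       # projected corporate data by 2020
    print(f"Projected corporate data by 2020: {corporate_2020 / ZB:.0f} zettabytes")
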
Despite its size, Big Data is not all about volume. In principle, any volume of data could be processed using conventional database software if volume were the only issue.

Velocity

Data velocity, the rate at which information flows, is increasing at a similar rate to volume. This increase tracks improvements in the underlying technologies, which develop broadly in line with Moore’s Law.

Variety

The pivotal V that makes data Big Data is variety. While standard database software handles structured data, Big Data is often unstructured and cannot be processed with the same tool sets. In fact, it is usually a combination of data categories: structured, semi-structured and unstructured. Typically it might consist of XML, database tables, audio and video files, text messages, tweets, and so forth.
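
As a rough illustration of what handling that variety means in practice, the sketch below pulls the same information out of an XML record, a JSON message and a free-text note so they can be analyzed together. It uses only the Python standard library, and the field names are invented for the example:

    # Illustrative sketch: normalising three differently structured records
    # (XML, JSON and free text) into one common Python representation.
    import json
    import re
    import xml.etree.ElementTree as ET

    xml_record = "<customer><name>Alice</name><spend>120.50</spend></customer>"
    json_record = '{"name": "Bob", "spend": 87.00}'
    text_record = "Customer Carol spent 45.25 in store today"

    records = []

    # Structured: XML with a known layout
    root = ET.fromstring(xml_record)
    records.append({"name": root.findtext("name"),
                    "spend": float(root.findtext("spend"))})

    # Semi-structured: JSON
    doc = json.loads(json_record)
    records.append({"name": doc["name"], "spend": float(doc["spend"])})

    # Unstructured: free text, parsed with a simple (and fragile) pattern
    match = re.search(r"Customer (\w+) spent ([\d.]+)", text_record)
    if match:
        records.append({"name": match.group(1), "spend": float(match.group(2))})

    print(records)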

Veracity

Veracity is an obvious addition. Unless the data is relevant, accurate and can be trusted, it is of little or no value. Ensuring the veracity of Big Data can be a challenge, as it is difficult to control its quality. Any organization using Big Data must have a means of deciding whether the data is of value and the extent to which it can be trusted.

Dealing with Big Data

One of the more popular ways of dealing with Big Data is Apache Hadoop. Named after a toy elephant, it is an open source project designed to enable the large-scale storage and processing of big data sets across clusters of servers.
It is hugely scalable and can readily be grown from a single server to many thousands of servers, either on premises or in the cloud. Originally developed at Yahoo, building on storage and processing ideas published by Google, it is used by Yahoo, Facebook, Twitter, LinkedIn, and many more.

In addition to its scalability, it provides an inexpensive approach to massively parallel computing. It is flexible and fault tolerant, and it can handle any kind of structured or unstructured data from virtually unlimited sources, which can then be joined and aggregated.
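
To give a flavour of how Hadoop spreads work across a cluster, the classic word-count job can be written as a tiny mapper and reducer and submitted through Hadoop Streaming. The sketch below is illustrative only; the file name, paths and job options are assumptions rather than a definitive setup:

    # wordcount.py -- minimal word-count sketch for Hadoop Streaming.
    # Hadoop runs the mapper on each block of input in parallel, sorts the
    # intermediate "word<TAB>1" pairs by key, then feeds them to the reducer.
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print(f"{word.lower()}\t1")

    def reducer():
        current, count = None, 0
        for line in sys.stdin:
            word, value = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = word, 0
            count += int(value)
        if current is not None:
            print(f"{current}\t{count}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

On a cluster this would be submitted with the Hadoop Streaming jar, along the lines of hadoop jar hadoop-streaming.jar -input /data/text -output /data/counts -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" (jar location and paths are placeholders). Hadoop takes care of splitting the input, shuffling the intermediate results and re-running tasks on failed nodes; the same script can be tested locally by piping a text file through the mapper, sort, and the reducer.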

Another popular approach is NoSQL, an umbrella term for a family of mostly open source databases that store and process large quantities of structured and unstructured data without the fixed table schemas of traditional relational systems.
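
As a concrete sketch, a document store such as MongoDB, one widely used NoSQL database, will happily hold records of different shapes in the same collection. The example below uses the pymongo driver and assumes a MongoDB server running locally on the default port; the database, collection and field names are made up for illustration:

    # Minimal NoSQL sketch using MongoDB via pymongo.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    events = client["demo_db"]["events"]

    # Documents in the same collection need not share a schema.
    events.insert_many([
        {"type": "tweet", "user": "alice", "text": "Loving the new release!"},
        {"type": "purchase", "user": "bob", "amount": 42.99, "items": ["kettle"]},
        {"type": "sensor", "device": "cctv-17", "frames_dropped": 3},
    ])

    # Query by field, even though not every document carries that field.
    for doc in events.find({"user": "bob"}):
        print(doc)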

Big Data in Practice

Big Data and analytics have proven to be a success for many organizations. Applications have included:

  • Understanding and targeting customers – one of the main applications of Big Data. Examples in practice include Amazon's recommendation engine and Tesco's personalized money-off coupons
  • Business processes – applications include optimizing supply-route logistics, stock control based on social media trends, and HR processes such as recruitment
  • Healthcare – for instance analyzing the side effects of drugs, correlations between lifestyle and health, the human genome, and the spread of infections
  • Big science – for instance the LHC at CERN generates around one petabyte of data a second. Although most of it is discarded, CERN scientists store and process 30 petabytes a year using 65,000 servers.
  • Security and law enforcement – in the US the NSA uses Big Data in its war on terrorism, as does GCHQ in the UK.

Is Big Data Over-hyped?

While the value of big data is clear, it isn’t a panacea. Certainly it has failed to live up to many of its early expectations and, according to some commentators, it has passed the peak of inflated expectations and has descended into the “trough of disillusionment”.

The backlash set in following the failure of Google Flu Trends, which claimed to identify flu outbreaks from search queries. It got it spectacularly wrong, missing the 2009 swine flu pandemic and then persistently overestimating flu prevalence between 2011 and 2013.

Big Data has several intrinsic weaknesses. These include:

  • While Big Data can detect subtle correlations, it cannot show causal relationships, and this can lead to bad and even dangerous conclusions. For instance, the rise in autism diagnoses has been highly correlated with organic food sales, yet neither causes the other.
  • Big Data throws up correlations that appear to be statistically significant but occur purely by chance, simply because of the volume of data: the harder you look, the more patterns you find, even though they aren't really there (the simulation after this list illustrates the effect).
  • Big Data advocates have claimed that building models is no longer necessary because the data alone can deliver the answers. This is a dangerous and potentially catastrophic position that, fortunately, is losing sway.
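
The second weakness is easy to demonstrate: generate enough unrelated random series and some pairs will correlate strongly purely by chance. A small simulation using only the Python standard library (the parameters are arbitrary and the exact figure varies from run to run):

    # Spurious-correlation demo: pit many unrelated random series against
    # each other and see how strong the best "pattern" looks by chance alone.
    import random
    import statistics

    n_series, n_points = 200, 30
    series = [[random.gauss(0, 1) for _ in range(n_points)]
              for _ in range(n_series)]

    def pearson(xs, ys):
        mx, my = statistics.mean(xs), statistics.mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
        return cov / (statistics.stdev(xs) * statistics.stdev(ys))

    strongest = max(abs(pearson(series[i], series[j]))
                    for i in range(n_series)
                    for j in range(i + 1, n_series))
    print(f"Strongest correlation between purely random series: {strongest:.2f}")

With 200 series there are nearly 20,000 pairs to compare, so a correlation above 0.6 or so will almost always turn up even though every series is pure noise; the more data you trawl, the more such phantom patterns appear.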

Finally

Big Data is a large amount of data that may be structured, unstructured, or both. It is characterized by its volume, velocity, and variety, and to be valuable it must have veracity too. However, its real value is realized only when analytics are used to extract useful information from it.

It has changed how we do business, interact with each other and our customers, and protect our citizens from terrorism. Its benefits are clear, but so too are its potential dangers. Regardless, it’s here to stay so we should ensure that we learn how to handle it.
