Author: Matt Todd

Best of breed. Combining data lake and data warehouse technology for greater value

Data lake vs data warehouse. Which is best? Ask Google and you’ll get a vast array of answers: from technology providers marketing enterprise systems designed to support one data structure over another, through to analysts and users who prefer coercing data into a schema up front over extracting it raw at source.

Data warehouses have been around for a long time and have served their purpose well: processing big data to meet pre-defined user needs. However, their effectiveness is increasingly called into question as the volume and variety of data being collected expand, and as we accept that the purpose of much of this data may still need to be identified.

On the flip side, data lakes have undoubtedly gained momentum, with many new technologies emerging to support the extraction and coercion of data as required for analysis. This raises the question of whether the data warehouse is becoming obsolete.

From my experience with customers, I see limited value in distilling data into a single version of a settled truth, and believe far better results can be achieved by releasing teams from having to work within pre-defined, restrictive data schemas.

Indeed, a data lake by its very nature enables a far greater amount of information to be stored for future use, regardless of its structure or quality. This data can also be interrogated with different applications, helping to identify use cases more quickly and easily.

However, this is not to suggest that data warehouses have become an outdated or obsolete concept or should be superseded by the data lake. There is room for both and the two can actually complement each other well.

Let’s take a closer look at how and why.

With a data lake, information is loaded in its ‘raw’ rather than curated format and stored directly from the source. This unchanged, ‘warts and all’ approach enables data to be captured regardless of type. It is for the consumers of the raw data to create schemas and coerce the data into them as part of their work, an approach often called ‘schema-on-read’. As a result, scale and cost are less of a concern.
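To make schema-on-read concrete, here is a minimal sketch in Python. The records and field names are illustrative, not from any real system: raw JSON lines of varying shape land in the lake untouched, and the consumer applies a schema only at read time, tolerating gaps in the source data.

```python
import json

# Raw, 'warts and all' records as they might land in a lake: mixed
# shapes, missing fields, inconsistent types (illustrative data only).
raw_records = [
    '{"customer": "A123", "spend": "42.50", "region": "EMEA"}',
    '{"customer": "B456", "spend": 17}',
    '{"customer": "C789", "region": "APAC", "channel": "web"}',
]

def coerce(record: dict) -> dict:
    """Apply a schema at read time: pick the fields this analysis
    needs and normalise their types, tolerating gaps in the raw data."""
    return {
        "customer": str(record.get("customer", "unknown")),
        "spend": float(record.get("spend", 0) or 0),
        "region": record.get("region", "unspecified"),
    }

curated = [coerce(json.loads(line)) for line in raw_records]
for row in curated:
    print(row)
```

The point is that the schema lives with the consumer, not the store: a different team could read the same raw records with an entirely different `coerce` function for their own purpose.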

In contrast, a data warehouse aims to be as efficient with storage space as possible. It houses schema-compliant data that has already been processed for a specific purpose; in short, it provides answers to questions that have already been asked. While data that conforms in this way will undoubtedly deliver a competitive advantage to the business, it also takes a lot of effort to deliver that data in a different format, making manipulation more expensive.

So, how can businesses move to a scenario where the potential of the data collected is not restricted, but where functionality, order and usability are easily accessible?

A best-of-breed approach is required: one where data is managed in a controlled and easily accessible manner for the end user, combined with the strengths and freedoms delivered by the data lake environment.

One common scenario is a business that already has a single monolithic warehouse in place, optimised to provide users with readily accessible answers to common questions, such as revenue, performance trends and market behaviour. The functionality of the warehouse appeals to the end user, but it is restrictive: the business may be missing out on valuable data that could be analysed.

In this scenario, raw data can be ingested into a data lake, using it as a preparatory environment to process large data sets before feeding them into the warehouse.
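The lake-as-staging-area flow can be sketched as follows. This is a toy illustration, not a production pipeline: the event shapes are invented, and an in-memory SQLite database stands in for the warehouse. Raw events are aggregated in the ‘lake’ into the shape the warehouse schema expects, then loaded.

```python
import sqlite3

# Hypothetical raw events landed in the lake zone (names are illustrative).
lake_events = [
    {"product": "widget", "amount": 10.0},
    {"product": "widget", "amount": 5.5},
    {"product": "gadget", "amount": 7.25},
]

# Prepare in the lake: aggregate raw events into the warehouse's schema.
totals: dict[str, float] = {}
for event in lake_events:
    totals[event["product"]] = totals.get(event["product"], 0.0) + event["amount"]

# Load into a warehouse table (sqlite3 stands in for the warehouse here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (product TEXT PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO revenue VALUES (?, ?)", totals.items())
rows = conn.execute("SELECT product, total FROM revenue ORDER BY product").fetchall()
print(rows)  # [('gadget', 7.25), ('widget', 15.5)]
```

Note the division of labour: the messy, high-volume work happens against the raw data, and only the small, schema-compliant result is loaded into the warehouse.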

Elsewhere, a lake model can be used to gather unstructured information, which is then analysed for its potential and used to evolve strategies and business analysis. Simple, quick data discovery is an important part of this extraction and analysis, and by labelling information with additional metadata, a catalogue system can be created. This helps analysts to zone data, perhaps by its maturity or ‘value’ factor, before it is fed into the warehouse for further interrogation.
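A minimal sketch of such a catalogue, assuming invented paths, metadata fields and an arbitrary ‘value’ score (none of these names come from a real product): each lake item is labelled with metadata, and a simple rule zones it by value before anything is promoted to the warehouse.

```python
# Hypothetical catalogue entries: lake items labelled with metadata.
catalogue = [
    {"path": "lake/clicks/2024-01.json", "source": "web", "value": 0.9},
    {"path": "lake/sensor/feed-07.csv", "source": "iot", "value": 0.4},
    {"path": "lake/scans/legacy.tiff", "source": "archive", "value": 0.1},
]

def zone(item: dict) -> str:
    """Assign a zone by maturity/'value' factor; thresholds are arbitrary."""
    if item["value"] >= 0.7:
        return "curated"   # ready to feed into the warehouse
    if item["value"] >= 0.3:
        return "refined"   # worth further preparation
    return "raw"           # keep for possible future use

zones = {item["path"]: zone(item) for item in catalogue}
print(zones)
```

Even this crude labelling makes discovery tractable: analysts can query the catalogue rather than trawl the lake itself.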

The future?

The business intelligence enabled by a traditional data warehouse environment is likely to be required for some time to come, offering data analysis in an easily digestible format for a range of diverse stakeholders.

In particular, vendors continue to develop warehouse automation tools, helping teams to work around the typical linear process, implement changes and enhancements later in development, and respond to change in real time.

Enabling a more agile warehouse environment requires issues around data discovery and quality to be surfaced as early as possible, creating a ‘data capture’ culture that can change and adapt quickly.

In the ongoing debate around data warehouses versus data lakes, it is clear that there are no absolutes: neither technology is, by design, superior to the other. The two can work well together, combining unrestricted, flexible storage with business-specific analytical capabilities.

New Whitepaper

If you enjoyed this blog, you may be interested in Matt’s new whitepaper, “Riding the Wave of the Data Revolution: A Value-First Approach to Data Lake Implementation”. Matt looks at the prevalence of data lakes and argues that a focus on understanding the value of your data is the best way to get the right result.

Who contributed to this article

  • Matt Todd
    Chief Architect

Matt Todd is BlackCat’s Chief Architect, helping our clients create and execute technology and data strategies that solve complex business problems. Matt has nearly 20 years’ experience as a consultant in the technology and data space, including 10 years as co-founder of an application development company. Latterly, he has been at the forefront of thinking on data lakes and their role in helping businesses manage and realise the value of Big Data. Matt gained a BSc in Artificial Intelligence and Computer Science at the University of Birmingham and has remained an active member of the Birmingham technology scene. He set up the Docker and Cloud Native MeetUps and is a regular speaker and host at their events.