
Data virtualisation: the key to unlocking your data lake’s potential

Alberto Pan, Chief Technical Officer at Denodo, discusses why data virtualisation makes sound business sense

In today’s digital economy, data is the new currency. Every organisation – regardless of location, size or sector – is reliant on it. It drives operations and enables businesses to function effectively by increasing productivity, improving decision-making and reducing costs.

A huge volume of data is being produced each day. Every action, every reaction and every interaction feeds into an ever-growing digital footprint. With IDC predicting that worldwide data levels will increase by 61% to 175 zettabytes by 2025, it’s not surprising that data lakes have become a principal data management architecture.

With virtually unlimited storage capacity, data lakes enable organisations to keep all data – whether structured or unstructured – in one central repository. This makes discovery easier and reduces the time spent by data scientists on selection and integration. After all, it’s easier to see what you’ve got if it’s all in one place.

Data lakes also provide massive computing power, enabling the data held within them to be transformed and combined to meet the needs of any process that requires it. What’s more, organisations can use machine learning within a data lake to analyse historical data and forecast likely outcomes. This information can then be used to increase overall productivity and improve business processes, which is arguably one of the main advantages of the data lake model. A recent analyst report illustrated the point: organisations employing a data lake outperformed their peers by 9% in organic revenue growth.


Drowning in data

But it’s not as simple as it sounds.

Many businesses – in fact, the majority – still struggle with certain aspects of data delivery and integration. Indeed, a recent study indicated that data scientists can spend up to 80% of their time finding, cleaning and reorganising data. This leaves as little as 20% of their time for analysing the data and using it to make meaningful decisions. Instead of being able to unlock the data’s potential, data scientists often find themselves drowning in information and unable to apply analytics as a means of gleaning insight and intelligence.

So, why are they struggling?

Quite simply, storing all your data in one physical place doesn’t necessarily make it any easier to discover particular data sets. It can be like trying to find a needle in a haystack. To add to this, many companies have hundreds of data repositories spread across different cloud providers and on-premises databases. Replicating data from these systems of origin can be slow and costly, meaning that only a small subset of relevant data will be stored in the lake.

Furthermore, storing data in its original form still requires it to be adapted later on for machine learning processes. Despite vast improvements over the last few years, integration tools are limited and cannot yet help data scientists with the more complex tasks that require a more advanced skillset.

With all this in mind, how can organisations overcome these challenges and unlock the benefits of a data lake?


Data virtualisation – a life jacket for businesses

Data virtualisation essentially provides a single access point to all data, regardless of its location and without the need to first replicate it in a single repository. It stitches together data abstracted from various underlying sources and delivers it to consuming applications in real time so that even data that has not been copied into the lake is available for data scientists to use and analyse.

This ultimately helps data scientists with the discovery process, enabling them to access all data in real time. And data virtualisation isn’t just flexible, it’s selective too. Best-of-breed tools will offer searchable catalogues of all available data sets, including extensive metadata on each data set.
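To make the idea of a searchable catalogue concrete, here is a minimal Python sketch. It is an illustration only, not how any particular product works: the `CatalogueEntry` type, the `search` function and the sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueEntry:
    """Hypothetical metadata record describing one available data set."""
    name: str
    source: str          # where the data physically lives
    description: str
    tags: tuple = ()


def search(catalogue, term):
    """Return the names of data sets whose name, description or tags mention the term."""
    term = term.lower()
    return [
        entry.name
        for entry in catalogue
        if term in entry.name.lower()
        or term in entry.description.lower()
        or any(term in tag.lower() for tag in entry.tags)
    ]


# Illustrative entries; the data itself stays in the underlying systems.
CATALOGUE = [
    CatalogueEntry("customers", "relational database", "Customer master records", ("crm", "sales")),
    CatalogueEntry("clickstream", "Hadoop cluster", "Raw web click events", ("web", "behaviour")),
    CatalogueEntry("invoices", "SaaS application", "Billing invoices per customer", ("finance", "sales")),
]

if __name__ == "__main__":
    print(search(CATALOGUE, "sales"))
```

The point of the sketch is that a data scientist searches the metadata, not the data: the catalogue records where each data set lives without copying it anywhere.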

Data virtualisation also brings clarity and simplicity to the data integration process. With this tool, all data is organised according to a consistent data representation and query model, regardless of whether it is stored in a relational database, a Hadoop cluster, a SaaS application or a NoSQL system, meaning that data scientists are able to view it as if it were stored in the same place. As a result, data scientists are able to create reusable logical data sets which can be adapted to meet the needs of each individual machine learning process.
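The following Python sketch shows, in miniature, what a consistent representation over different stores can look like. It is a toy under stated assumptions, not a real virtualisation engine: the class names, the `rows` method and the `customer_spend` view are hypothetical, an in-memory SQLite table stands in for the relational database, and a list of dictionaries stands in for a NoSQL system.

```python
import sqlite3


class RelationalSource:
    """Stand-in for a relational database (in-memory SQLite)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
        self.conn.executemany(
            "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
        )

    def rows(self):
        # Read at request time and map to the shared representation.
        cursor = self.conn.execute("SELECT id, name FROM customers ORDER BY id")
        return [{"customer_id": r[0], "name": r[1]} for r in cursor]


class DocumentSource:
    """Stand-in for a NoSQL store with its own field names."""

    def __init__(self):
        self.docs = [
            {"cust": 1, "total": 120.0},
            {"cust": 1, "total": 80.0},
            {"cust": 2, "total": 50.0},
        ]

    def rows(self):
        # Rename fields so both sources share one representation.
        return [{"customer_id": d["cust"], "amount": d["total"]} for d in self.docs]


def customer_spend(customers, orders):
    """A reusable logical data set: total spend per customer across both sources."""
    totals = {}
    for order in orders.rows():
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return [
        {"name": c["name"], "spend": totals.get(c["customer_id"], 0.0)}
        for c in customers.rows()
    ]


if __name__ == "__main__":
    print(customer_spend(RelationalSource(), DocumentSource()))
```

Neither source is copied into a central store here. The logical view is computed from both when it is requested, and the same `customer_spend` definition could be reused or adapted for different machine learning tasks.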


Looking forward

With data playing an ever-increasing role in operations and processes, it’s no wonder organisations are looking to modern analytics to drive meaningful insight and improve efficiency. Although currently in its relative infancy, the machine learning market is expected to grow by 44% over the next four years. As adoption continues to grow and data lakes become more prevalent, data virtualisation will become increasingly essential to improving the productivity of data scientists.

By simplifying data discovery and integration, data virtualisation is set to play an important role in reducing the burden of data management and enabling data scientists to focus on their core skills. And, ultimately, data virtualisation will be key in helping organisations make the most of their data.