Hey, hey! I am Ilya Kalchenko, a Data Engineer at NIX, a fan of big and small data processing, and of Python. In this article, I want to discuss the benefits of hybrid data lakes for efficient and secure data organization.
To begin with, let's figure out the concepts of Data Warehouse and Data Lake, delve into their use cases, and delimit their areas of responsibility.
Data Warehouse
Data warehouses (DWH) are corporate repositories of processed, ready-to-use data. A DWH shares several properties with classic relational databases, including support for SQL (sometimes there may be exceptions).
While DWHs have characteristics similar to databases, they are not the same thing. The main difference shows up when analytics must be prepared over a large sample of data: data warehouses store operational and transactional data in a form optimized for aggregating and retrieving large amounts of data.
Data warehouses often follow Online Analytical Processing (OLAP) principles. This approach has its own particularities in contrast to Online Transaction Processing (OLTP), namely:
analytical queries over large amounts of data execute faster;
complex queries use aggregations;
data is often denormalized;
data is organized in star or snowflake schemas.
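To make the star-schema idea concrete, here is a minimal sketch using an in-memory SQLite database as a stand-in for a warehouse. All table and column names are illustrative, not taken from any specific product:

```python
import sqlite3

# In-memory SQLite stands in for a real OLAP warehouse.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: a central fact table referencing denormalized dimension tables.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
cur.execute("CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT)")
cur.execute("CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount REAL)")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")])
cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(1, "2024-01-01", "2024-01"), (2, "2024-01-02", "2024-01")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 1200.0), (1, 2, 800.0), (2, 1, 300.0)])

# A typical OLAP query: aggregate the facts, grouped by a dimension attribute.
rows = cur.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(rows)  # [('Electronics', 2000.0), ('Furniture', 300.0)]
```

The point of the shape is that facts stay narrow and numeric while descriptive attributes live in the dimensions, which is what makes large aggregations cheap.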
When using data warehouses, the ETL (Extract, Transform, Load) approach is usually chosen to obtain the required information: for example, data is extracted from a production database, transformed into a convenient format, and loaded into the DWH. This is a typical scenario in Business Intelligence projects.
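That ETL flow can be sketched in a few lines of Python. Here two SQLite databases stand in for the production database and the DWH, and the order schema is invented purely for illustration:

```python
import sqlite3

# Stand-ins for a production OLTP database and a data warehouse.
prod = sqlite3.connect(":memory:")
dwh = sqlite3.connect(":memory:")

prod.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, status TEXT)")
prod.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 2000, "paid"), (2, 500, "cancelled"), (3, 2500, "paid")])

dwh.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL)")

# Extract: pull raw rows from the production database.
raw = prod.execute("SELECT id, amount_cents, status FROM orders").fetchall()

# Transform: keep only paid orders and convert cents to a decimal amount.
transformed = [(oid, cents / 100.0) for oid, cents, status in raw if status == "paid"]

# Load: write the prepared rows into the warehouse.
dwh.executemany("INSERT INTO fact_orders VALUES (?, ?)", transformed)

total = dwh.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]
print(total)  # 45.0
```

The key property of ETL is visible here: by the time data reaches the warehouse, it is already cleaned and shaped for analysis.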
The main sources for a DWH are operational databases and transactional systems. Popular data warehouse systems include Azure Synapse Analytics.
Data Lake
Data lakes are company-wide repositories of raw data. There are no strict rules for building such repositories, but let's highlight their main characteristic: they store both structured and unstructured data.
Data generally arrives in such storages in its original form, which means the transformation stage is not mandatory; this is the EL(T) approach. Data Scientists then take this data to build models, Data Engineers use it to build the DWH, and Data Analysts search it for insights and business-development forecasts.
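The EL(T) difference can be shown with the same stand-in tooling: raw JSON events are loaded into the lake untouched, and the transformation happens later, inside the store, only when someone needs it. The event fields below are made up for the example, and SQLite's `json_extract` plays the role of the lake's query engine:

```python
import json
import sqlite3

lake = sqlite3.connect(":memory:")  # stand-in for a data lake table

# Extract + Load: events land in their original raw form, with no upfront schema.
lake.execute("CREATE TABLE raw_events (payload TEXT)")
events = [
    {"user": "alice", "action": "click", "ms": 120},
    {"user": "bob", "action": "view", "ms": 80},
    {"user": "alice", "action": "view", "ms": 200},
]
lake.executemany("INSERT INTO raw_events VALUES (?)",
                 [(json.dumps(e),) for e in events])

# (T) happens later, in place: the raw payloads are parsed on demand at query time.
rows = lake.execute("""
    SELECT json_extract(payload, '$.user') AS user, COUNT(*)
    FROM raw_events GROUP BY user ORDER BY user
""").fetchall()
print(rows)  # [('alice', 2), ('bob', 1)]
```

Because nothing was discarded at load time, a Data Scientist can later extract `action` or `ms` from the same raw rows without re-ingesting anything.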
Typical solutions for Data Lake formation are:
Google Cloud Storage;
Apache Hadoop HDFS.
Hybrid Data Lake
This is the unification of a Data Lake and a Data Warehouse into a single system. Such a solution should cover all of the company's data-storage needs and give stakeholders quick access to the data. A typical Hybrid Data Lake workflow is shown in the diagram:
Data from multiple sources is loaded into the Data Lake and changed as needed. Data Scientists and Data Analysts pull data from this environment to solve analytical problems and machine-learning tasks. Another path is also possible: the data is loaded, transformed, and unloaded into the warehouse for building Data Marts. At this stage, the data comes in handy for BI Developers creating dashboards.
As an example, this scheme can be implemented with the following technologies:
as data sources — PostgreSQL, S3, Google Analytics;
for ELT and ETL processes — AWS Glue, Google Dataflow, Talend, Pentaho;
as Data Lake — S3, HDFS, GCS;
as DWH — Snowflake, Amazon Redshift, or Google BigQuery.
Let's see how a Hybrid Data Lake can be useful, using Amazon Web Services as an example.
AWS provides a wide variety of data processing and storage services. How do you organize a Data Lake on AWS? Vendor-independent storages serve as data sources: for example, a relational database and events in JSON format.
Next, this data needs to be transferred to our Data Lake, which in our case is S3. We forward the data using AWS DMS (Database Migration Service) for the database and AWS Kinesis for the stream of events. Once the data is in the Data Lake, experts can use it. Then, using AWS Glue, we load the necessary data, process it, and save it to the pre-configured Redshift. Now the information is available to the entire company in an organized format and is suitable for analytics, ML, and BI alike.
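As a small illustration of the last step, here is a sketch of the Redshift COPY statement that a Glue job (or any loader) might issue to pull prepared data from S3 into the warehouse. The `COPY ... FROM ... IAM_ROLE ... FORMAT AS JSON` syntax is standard Redshift; the bucket, table, and IAM role names are made up:

```python
def build_copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Build a Redshift COPY command that loads JSON data from S3.

    The identifiers passed in are placeholders for this example, not
    real resources.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS JSON 'auto';"
    )

stmt = build_copy_statement(
    table="analytics.events",
    s3_path="s3://example-data-lake/events/2024/01/",
    iam_role="arn:aws:iam::123456789012:role/redshift-load",  # hypothetical role
)
print(stmt)
```

In a real pipeline this statement would be executed against the Redshift cluster by the loading job after Glue has written the processed files to S3.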
This is a textbook example. The final set of services and tools will most likely be determined by the platform you choose.
As the volume of data continues to grow in every organization, it is important to optimize storage costs. On-premises infrastructure is often heavily loaded, so its overall cost increases. By storing your data in a cheap Data Lake that supports fast writes, and part of your analytics data in a powerful and reliable DWH, you get the best of both worlds and stay prepared for unexpected workloads.
The hybrid data lake is not entirely new. It is a successful combination of two concepts that have been well known to data engineers for a long time. This solution allows you to store all corporate historical data in a single system and, when necessary, retrieve it for analytical, ML, and any other data-related tasks.
I have personally witnessed this in practice: by combining these technologies, you eliminate their individual shortcomings and gain a significant advantage, such as supporting the entire system with one team. Whether you choose a Hybrid Data Lake for your project or use a Data Warehouse and a Data Lake separately depends on the requirements of your end-users and the data you collect. Still, the combination of these architectures can be an enjoyable experience and the best solution for organizing all of a large company's information in one system. Your team will be able to maintain and improve everything related to data. In fact, this is one of the smartest approaches to organizing data.