A brief look at one of the latest architecture paradigms in the data analytics space
As the data volume generated by new digital-era platforms grows continuously and exponentially, and the insights derived from that data become ever more valuable and critical, a wave of architectural innovations and reforms is under way. In this article, we will take a close look at one of the most disruptive architectures in the data analytics space.
Let’s have a look at a traditional data analytics platform
The core components in the above architecture are data ingestion, a data lake, ETL and a data warehouse. A data lake and a data warehouse are similar and different at the same time. The similarity is that both are storage platforms for keeping data. The key difference is in what they store and how: a data warehouse holds cleansed, structured data modelled up front for fast SQL analytics, while a data lake holds raw, semi-structured and unstructured data cheaply at scale, applying structure only when the data is read.
The million-dollar question — Is there a way to combine data lake and data warehouse?
Naturally, someone would ask: "Can we have a platform that can serve as both a data lake and a data warehouse?" The answer is the data lakehouse.
The data lakehouse is a recently emerged architecture that combines the best of both the data lake and the data warehouse. In a nutshell, a data lakehouse enables:
- ACID compliance and full transactional update capabilities on data lake
- High performance query execution
- Data science and data engineering workloads
- Unified real-time and batch data processing
- Schema enforcement and data governance
As you can see in the diagram, the lakehouse storage acts as both data lake and data warehouse. This is a very simplified version of the architecture. We will now look at a few of the innovations and software products that made this architecture possible; they are quite interesting.
Some popular early players (they are still very popular)
The products below are based on the MPP (Massively Parallel Processing) architecture.
Presto: An open-source, distributed SQL query engine that runs on a cluster of machines. Originally developed at Facebook, Presto is now one of the most popular query engines available. It enables analytics using ANSI SQL on large amounts of data stored in a variety of systems such as HDFS, cloud storage and NoSQL stores, and delivers data warehouse-like query performance for BI and reporting tools.
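As a rough illustration, the sketch below shows how a BI-style query might be submitted to Presto from Python using the presto-python-client package; the coordinator host, catalog, schema and table names are placeholders, not part of the original article.

```python
# Hypothetical example: submit an ANSI SQL query to a Presto cluster from Python.
# Requires: pip install presto-python-client
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # assumed coordinator address
    port=8080,
    user="analyst",
    catalog="hive",    # e.g. a Hive catalog over data in HDFS or S3
    schema="sales",
)

cur = conn.cursor()
cur.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
    ORDER BY total_sales DESC
    LIMIT 10
""")
for region, total_sales in cur.fetchall():
    print(region, total_sales)
```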
Apache Drill: Similar to Presto, Drill is another open-source SQL query engine for big data exploration. Drill is designed from the ground up to support high-performance analysis of the semi-structured and rapidly evolving data coming from modern big data applications, using ANSI SQL. Drill originated at MapR (since acquired by Hewlett Packard Enterprise) and is inspired by Google's Dremel.
Drill is an Apache top-level project, while Presto is developed under the Presto Foundation, part of the Linux Foundation.
Amazon Athena: An AWS-native, serverless service built on Presto. It provides the functionality of Presto with no infrastructure to install or manage; you pay only for the queries you run, based on the amount of data scanned. Although Athena is primarily meant to query data stored in an Amazon S3 data lake, federated queries along with source connectors can be used to query a wide variety of data sources.
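To make the pay-per-query model concrete, here is a hedged sketch of running an Athena query through the boto3 SDK; the database, table and result-bucket names are invented for illustration.

```python
# Hypothetical example: run an Athena query over an S3 data lake via boto3.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query; Athena charges are based on the data scanned by this query.
resp = athena.start_query_execution(
    QueryString="SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id",
    QueryExecutionContext={"Database": "sales_lake"},                    # assumed Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # assumed result bucket
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```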
Amazon Redshift Spectrum: Another AWS product, built on Amazon Redshift, Amazon's popular data warehouse offering. Redshift Spectrum leverages the parallel processing capabilities of an already provisioned Redshift cluster to connect to and query data stored in Amazon S3.
Google BigQuery: Based on Google's Dremel, BigQuery is the data warehouse offering from Google Cloud. Unlike a typical data warehouse solution, BigQuery also supports machine learning workloads through BigQuery ML, and it can query external big data storage systems using federated queries.
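The sketch below illustrates both sides of that claim with the google-cloud-bigquery client library: a standard analytical query followed by a BigQuery ML training statement. The project, dataset and table names are assumptions made for the example.

```python
# Illustrative only: analytical query plus a BigQuery ML training statement.
# Requires: pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # assumed project id

# Standard SQL analytics over a (hypothetical) dataset.
for row in client.query(
    "SELECT region, SUM(amount) AS total FROM `sales.orders` GROUP BY region"
).result():
    print(row.region, row.total)

# BigQuery ML: train a simple regression model with plain SQL.
client.query("""
    CREATE OR REPLACE MODEL `sales.order_value_model`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['amount']) AS
    SELECT region, items, amount FROM `sales.orders`
""").result()
```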
As you can see, there is a wide variety of products available, but none of them truly possesses all the capabilities of a data lakehouse.
Their ACID transaction capabilities range from limited to non-existent: most of them work on immutable storage, so updating records in place is impossible. It is also difficult to achieve a unified real-time and batch data layer. Moreover, we still need different storage solutions to meet all our requirements, which deviates from the concept of a single data store serving as both data lake and data warehouse.
Delta Lake, Apache Iceberg and Apache Hudi
These platforms implement the data lakehouse using a metadata layer based on open table formats, built on top of data lake storage such as HDFS, AWS S3, Azure Blob Storage and Google Cloud Storage.
Delta Lake: Delta Lake was created by Databricks and later open-sourced. It stores data in the popular Apache Parquet file format and uses a transaction log written as JSON files to support ACID transactions on the data lake. The Delta format is supported by many data processing frameworks, which can leverage it to enable data warehouse capabilities on existing data lake storage (a minimal usage sketch follows the feature list below). Other key features include:
- Scalable metadata handling on billions of partitions and files with ease
- Time travel to old data for audit or rollback
- Unified batch/streaming with transaction capabilities
- Schema evolution/Enforcement — Prevent bad data
- Audit history using transaction logs for a full audit trail
- DML operations — SQL, Scala/Java and Python APIs to merge, update and delete datasets
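As referenced above, here is a minimal PySpark sketch of Delta Lake's transactional updates and time travel. The storage path, column names and data are illustrative, and the session is assumed to be set up with the delta-spark package.

```python
# Minimal Delta Lake sketch (assumes the delta-spark package/jars are available).
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-demo")
    # Extension and catalog settings required for Delta SQL support.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/orders"  # hypothetical data lake location (could be s3a://...)

# The initial write creates the table plus its JSON transaction log.
spark.createDataFrame(
    [(1, "new"), (2, "new")], ["order_id", "status"]
).write.format("delta").mode("overwrite").save(path)

# ACID update in place, which a plain Parquet data lake cannot do.
table = DeltaTable.forPath(spark, path)
table.update(condition="order_id = 1", set={"status": "'shipped'"})

# Time travel: read the table as it was at an earlier version.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```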
Apache Iceberg: Another open-source project, initially developed at Netflix and now an Apache top-level project. Apache Iceberg is an open table format designed for huge, petabyte-scale tables. The job of a table format is to determine how you manage, organize and track all of the files that make up a table. Iceberg supports multiple file formats such as Apache Parquet, Apache Avro and Apache ORC, and uses a combination of metadata, manifest and data files to support ACID transactions on data lake storage (a minimal usage sketch follows the feature list below). Other capabilities include:
- Full schema evolution to track changes to a table over time
- Time travel to query historical data and verify changes between updates
- Partition layout evolution, enabling the partition scheme to be updated as queries and data volumes change, without rewriting the table or relying on a physical directory structure
- Rollback to prior versions to quickly correct issues and return tables to a known good state
- Advanced planning and filtering capabilities for high performance on large data volumes
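Below is a hedged PySpark sketch of creating, updating and inspecting an Iceberg table through a Hadoop catalog; the catalog name, warehouse path and table schema are assumptions for illustration, and the iceberg-spark-runtime package is assumed to be on the Spark classpath.

```python
# Minimal Apache Iceberg sketch (assumes iceberg-spark-runtime is available).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a local Hadoop catalog named "demo" backed by a warehouse directory.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create and populate an Iceberg table; metadata and manifest files track every data file.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.orders (order_id BIGINT, status STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.orders VALUES (1, 'new'), (2, 'new')")

# Row-level update backed by Iceberg's ACID guarantees.
spark.sql("UPDATE demo.db.orders SET status = 'shipped' WHERE order_id = 1")

# Time travel: list the snapshot history from the table's metadata tables.
# A specific snapshot can then be read back, e.g. via the 'snapshot-id' read option.
spark.sql("SELECT committed_at, snapshot_id, operation FROM demo.db.orders.snapshots").show()
```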
Apache Hudi: An open-source project initially developed at Uber. Hudi (Hadoop Upserts Deletes and Incrementals) started as a streaming data lake platform on top of Apache Hadoop and has since evolved to support incremental batch jobs on any cloud storage platform. Hudi stores data using Apache Parquet and Apache Avro and enables ACID transaction capabilities (a minimal usage sketch follows the feature list below). Other features include:
- Upserts, Deletes with fast, pluggable indexing
- Transactions, Rollbacks, Concurrency Control
- Automatic file sizing, data clustering, compactions, cleaning
- Built-in metadata tracking for scalable storage access
- Incremental queries, Record level change streams
- Backwards compatible schema evolution and enforcement
- SQL Read/Writes from data processing frameworks such as Spark, Presto, Trino, Hive & more
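The following PySpark sketch shows a basic Hudi upsert with made-up table, key and path names; it assumes the hudi-spark bundle is available to the Spark session.

```python
# Minimal Apache Hudi upsert sketch (assumes the hudi-spark bundle is on the classpath).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

path = "/tmp/hudi/orders"  # hypothetical lake location (could be s3a://..., gs://...)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",      # record key used for upserts
    "hoodie.datasource.write.partitionpath.field": "region",    # partitioning column
    "hoodie.datasource.write.precombine.field": "updated_at",   # latest value wins on conflict
    "hoodie.datasource.write.operation": "upsert",
}

# The first write creates the Hudi table.
df = spark.createDataFrame(
    [(1, "new", "us", "2024-01-01"), (2, "new", "eu", "2024-01-01")],
    ["order_id", "status", "region", "updated_at"],
)
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(path)

# Upsert: order 1 is updated in place, order 3 is inserted.
updates = spark.createDataFrame(
    [(1, "shipped", "us", "2024-01-02"), (3, "new", "eu", "2024-01-02")],
    ["order_id", "status", "region", "updated_at"],
)
updates.write.format("hudi").options(**hudi_options).mode("append").save(path)

spark.read.format("hudi").load(path).select("order_id", "status", "region").show()
```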
Products for Enterprise
Now let’s look at some enterprise offerings
Delta Lakehouse by Databricks: Based on Delta Lake and developed by Databricks, this platform runs on the popular cloud providers. The entire platform is powered by Apache Spark, the most popular data processing engine in the market today. By combining the capabilities of Delta Lake and Apache Spark, the Databricks lakehouse provides capabilities such as:
- Lightning-fast performance for data processing with auto-scaling and indexing
- Data science and machine learning workloads at scale using MLflow
- Databricks SQL, a serverless SQL query execution engine built exclusively for ultra-fast BI and dashboarding requirements
- Unity Catalog — a unified data catalog for all data in the lakehouse, which also provides data governance and security capabilities
- Built-in dashboarding platform to create reports and visualizations
Dremio: A product offering based on Apache Iceberg. Unlike Databricks, Dremio can run both in the cloud and on-premises. Also unlike the Databricks platform, Dremio does not provide built-in machine learning or data science capabilities; its focus is mainly on data engineering, data warehousing and fast query performance. With a high-performance SQL query engine and data transfer capabilities, Dremio solves the problem of combining a data lake and a data warehouse. The two services powering the Dremio platform are:
- Dremio Sonar — a lakehouse engine built for SQL
- Dremio Arctic — a metadata and data management service for Apache Iceberg that provides a unique Git-like experience for the lakehouse
Final thoughts
I have done my best to capture the relevant details of each of these platforms. The ecosystem keeps growing, powered by more and more contributions to the open-source community. The beauty of these open-source platforms is that you can dig into all the details of their internals and, even better, check out the source code and start contributing. There is no limit to learning!
Manu Mukundan, Data Architect, NeST Digital