What is a data lakehouse?
Simply put, a data lakehouse is an architecture that enables efficient and secure AI and BI directly on vast amounts of data stored in data lakes.
Today, many companies store their data in data lakes: low-cost storage systems that can manage all types of data (structured and unstructured) and provide an open interface to any tool that processes it. These data lakes are where most data transformations and advanced analytics workloads (such as AI) run, leveraging the full set of enterprise data. Separately, for business intelligence (BI) use cases, proprietary data warehousing systems are used on smaller subsets of structured data. Data warehouses primarily support BI, using SQL to answer historical questions (e.g., what was the revenue last quarter?), while data lakes store far larger volumes of data and support analytics through both SQL and non-SQL interfaces, including predictive analytics and AI (e.g., which of our customers are likely to churn, and what coupons should we offer them and when?). Historically, achieving both AI and BI has required keeping multiple copies of the data and moving it between the data lake and the data warehouse.
A data lakehouse lets you store all your data in a data lake and perform AI and BI directly on that data, providing the capabilities to do both efficiently on all of a company's data at massive scale. For example, it offers SQL and performance features (indexing, caching, MPP processing) to make BI fast on data lakes. It also offers direct file access and native support for Python, data science, and AI frameworks, so you are never forced through a SQL-based data warehouse. Open source technologies such as Delta Lake, Hudi, and Iceberg are the key building blocks for implementing a lakehouse. Vendors focused on data lakehouses include Databricks, AWS, Dremio, and Starburst; data warehousing vendors include Teradata, Snowflake, and Oracle.
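To make the "one copy of data, both SQL and Python" idea concrete, here is a minimal sketch using open-source Delta Lake with PySpark. The table path, column names, and session configuration are assumptions for illustration, not prescriptions.

```python
# Minimal sketch: SQL (BI-style) and Python (data science) access to the
# same Delta Lake table in a data lake. Paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    # Assumes the delta-spark package's jars are available to Spark.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# BI-style SQL directly on files in the data lake.
revenue = spark.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM delta.`/lake/sales`          -- hypothetical Delta table path
    WHERE quarter = '2023-Q4'
    GROUP BY region
""")
revenue.show()

# The same table feeds data science tooling, with no export step.
pdf = spark.read.format("delta").load("/lake/sales").toPandas()
```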
Bill Inmon, dubbed the father of data warehousing, recently published a blog post, Evolution to the Data Lakehouse, describing the unique ability of a lakehouse to manage data in an open environment while combining the data science focus of the data lake with the end-user analytics of the data warehouse.
What is the difference between a data warehouse and a data lakehouse?
A lakehouse is built on top of an existing data lake, which in many cases contains more than 90% of an enterprise's data. Many data warehouses support "external tables" to access that data, but with severe functional limitations (e.g., read-only access) and performance limitations. A lakehouse instead brings traditional data warehousing capabilities to the existing data lake, including ACID transactions, fine-grained data security, low-cost updates and deletes, first-class SQL support, optimized SQL query performance, and BI-style reporting. Because it sits directly on the data lake, the lakehouse stores and manages all of the lake's existing data: structured data in tables as well as every other type, such as text, audio, and video. And unlike a data warehouse, it provides direct access to the data through open APIs and natively supports data science and machine learning use cases via Python/R libraries such as PyTorch, TensorFlow, and XGBoost. In this way, the lakehouse provides a single system for managing all of an enterprise's data while supporting workloads ranging from BI to AI.
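As a hedged sketch of those two capabilities together, the example below runs warehouse-style ACID updates and deletes on a Delta Lake table and then trains an XGBoost model on the same table. It reuses the `spark` session from the earlier sketch; table and column names are again hypothetical.

```python
# Warehouse-style DML plus ML access on one lakehouse table (illustrative
# names; reuses the configured `spark` session from the earlier sketch).
import xgboost as xgb

# ACID updates and deletes directly on data lake files -- operations a
# plain Parquet data lake cannot perform transactionally.
spark.sql("UPDATE delta.`/lake/customers` SET tier = 'gold' WHERE spend > 10000")
spark.sql("DELETE FROM delta.`/lake/customers` WHERE opted_out = true")

# The same governed table feeds an ML library with no export step.
pdf = spark.read.format("delta").load("/lake/customers").toPandas()
model = xgb.XGBClassifier(n_estimators=50)
model.fit(pdf[["spend", "tenure_months"]], pdf["churned"])
```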
Data warehouses, on the other hand, are proprietary data systems specialized for SQL-based analysis of structured data and certain types of semi-structured data. They have limited support for machine learning and cannot natively serve popular open source tools without first exporting the data (via ODBC/JDBC or into a data lake). Currently, no data warehousing system natively supports the audio, image, and video data already stored in data lakes.
What is the difference between a data lake and a lakehouse?
The most common complaint about data lakes is that they turn into data swamps: everyone throws all kinds of data into the lake, and there is no structure or governance over it. Because the data is not organized with performance in mind, performance is poor and the analytics you can run on the lake are limited. Since data lakes use low-cost object storage, many companies treat the lake as a landing zone and later move the data to another system, such as a data warehouse, to extract value.
A lakehouse addresses the fundamental problems that turn a data lake into a data swamp. It adds ACID transactions to maintain consistency when multiple users read and write data concurrently. It supports data warehouse schema designs such as star/snowflake schemas, and it adds fine-grained security, governance, and auditing mechanisms directly to the data lake. For fast analytics, it applies performance optimizations such as compacting files to appropriate sizes, data skipping using file statistics, multi-dimensional clustering, and caching. By adding this data management and these performance optimizations to an open data lake, the lakehouse can natively support BI and ML applications.
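As one concrete illustration, Delta Lake (one of the open source formats named above) exposes several of these optimizations through SQL. The sketch below compacts files and clusters them on commonly filtered columns; table and column names are hypothetical, and the exact commands vary by format and engine.

```python
# Hedged example of lakehouse performance maintenance with Delta Lake SQL,
# reusing the `spark` session from the earlier sketches.

# OPTIMIZE compacts many small files into appropriately sized ones; the
# optional ZORDER BY clause clusters rows on commonly filtered columns so
# that per-file min/max statistics let queries skip irrelevant files.
spark.sql("OPTIMIZE delta.`/lake/events` ZORDER BY (event_date, customer_id)")

# Queries filtering on the clustered columns now read far fewer files.
daily = spark.sql("""
    SELECT event_date, COUNT(*) AS events
    FROM delta.`/lake/events`
    WHERE event_date >= '2023-01-01'
    GROUP BY event_date
""")
```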
How easy is it for a data analyst to use a data lakehouse?
The data lakehouse implements the same kind of SQL interface as traditional data warehouses, so analysts can connect to it from their existing BI and SQL tools without changing their workflows. For example, they can connect from popular BI tools such as Tableau, Power BI, Qlik, and Looker, use data engineering tools such as Fivetran and dbt against it, and export data to desktop tools such as Microsoft Excel. With the lakehouse's ANSI SQL support, fine-grained access control, and ACID transactions, administrators can manage it just like a data warehouse, but one that holds all of the enterprise's data in a single system.
One of the key advantages of the lakehouse in terms of simplicity is that, because it manages all of the enterprise's data, analysts can be granted access to the most recent raw data rather than only the subset loaded into a data warehouse. They can ask questions spanning many historical datasets, and stand up pipelines over new datasets, without asking a database administrator or data engineer to load the appropriate data. And with built-in AI support, analysts can easily run models built by machine learning teams on arbitrary data.
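To make the connection model concrete: BI tools attach over standard ODBC/JDBC, and the same SQL endpoints are reachable from scripts. Below is a hedged sketch using the databricks-sql-connector Python package against a Databricks lakehouse endpoint; the hostname, HTTP path, and token are placeholders, and other lakehouse vendors expose equivalent SQL endpoints.

```python
# Hedged sketch: querying a lakehouse SQL endpoint the same way a BI tool
# would, via `pip install databricks-sql-connector`. All connection values
# below are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="example.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/abc123",          # placeholder
    access_token="dapi-...",                         # placeholder
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
        for row in cursor.fetchall():
            print(row)
```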
How do a data warehouse and a lakehouse compare on performance and cost?
Data lakehouse systems are built on compute and storage that are separated and scale elastically, minimizing operational cost and maximizing performance. With the same engine optimizations (e.g., query compilation, storage layout optimization), modern systems deliver comparable or even better price/performance than traditional data warehouses on SQL workloads. In addition, lakehouse systems can often exploit cloud provider cost savings that traditional data warehouses are not designed to use, such as spot instance pricing (which requires tolerating the loss of worker nodes mid-query) and cheaper infrequent-access storage tiers.
What are some of the data governance features supported by the data lakehouse?
By adding a management interface on top of data lake storage, lakehouse systems provide a unified way to manage access control, data quality, and compliance across all enterprise data, using the same standard interfaces as a data warehouse. Modern lakehouse systems support fine-grained (row-, column-, and view-level) access control via SQL, query auditing, attribute-based access control, data versioning, and data quality constraints and monitoring. These features are typically exposed through standard interfaces that database administrators already know (e.g., SQL's GRANT statement), so existing staff can manage all data in the enterprise in a unified way. Centrally managing all data in a lakehouse system through a single management interface also reduces the administrative burden and the potential for error that come with operating many isolated systems.
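As a hedged illustration of that SQL-based governance, the sketch below creates a row- and column-filtering view and grants access to it. The statements follow the syntax of Databricks Unity Catalog; other lakehouse platforms use closely related SQL, and all names here are placeholders.

```python
# Hedged sketch of SQL-standard governance on lakehouse tables, reusing the
# `spark` session from earlier sketches. Syntax follows Unity Catalog-style
# SQL; exact statements vary by platform. Names are placeholders.

# Row- and column-level control via a view: analysts see only their
# region's rows, and sensitive columns are omitted.
spark.sql("""
    CREATE OR REPLACE VIEW sales_emea AS
    SELECT order_id, region, amount        -- sensitive columns omitted
    FROM sales
    WHERE region = 'EMEA'
""")

# Familiar warehouse-style grants, applied to data lake tables and views.
spark.sql("GRANT SELECT ON VIEW sales_emea TO `emea_analysts`")
```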
Should the lakehouse be centralized, or can it be distributed across a data mesh?
Companies do not necessarily need to centralize all their data in a single lakehouse. Many companies using lakehouses take a distributed approach to storing and processing data while adopting a centralized approach to security, governance, and discovery. Depending on the structure of the organization and its business requirements, there are several common approaches:
Each business unit builds its own lakehouse to capture a complete view of its business, from product development through customer acquisition to customer service.
Individual functional areas, such as product manufacturing, supply chain, sales, and marketing, build their own lakehouses to optimize operations in their domain.
Some organizations stand up new lakehouses for cross-functional strategic initiatives such as Customer 360, or to react quickly to unexpected crises such as the COVID-19 pandemic.
With the unified capabilities of the lakehouse architecture, data architects can support BI and ML in each of these setups without building siloed data stacks or orchestrating complex data movement between them.