Databricks, the Data and AI company, today announced the launch of Databricks LakeFlow, a new solution that unifies and simplifies all aspects of data engineering, from data ingestion to transformation and orchestration. With LakeFlow, data teams can now simply and efficiently ingest data at scale from databases such as MySQL, Postgres and Oracle, and enterprise applications such as Salesforce, Dynamics, SharePoint, Workday, NetSuite and Google Analytics. Databricks is also introducing Real Time Mode for Apache Spark™, which allows stream processing at ultra-low latency.
LakeFlow automates deploying, operating and monitoring pipelines at scale in production, with built-in support for CI/CD and advanced workflows that support triggering, branching and conditional execution. Data quality checks and health monitoring are built in and integrated with alerting systems such as PagerDuty. LakeFlow makes building and operating production-grade data pipelines simple and efficient while still addressing the most complex data engineering use cases, enabling even the busiest data teams to meet the growing demand for reliable data and AI.
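The data quality checks described above follow a common pattern: declare rules, count violations per rule, and alert when violations cross a threshold. The sketch below illustrates that pattern in plain Python; all function and rule names are hypothetical, not the LakeFlow API.

```python
# Hypothetical expectation-style quality check; illustrative only,
# not the LakeFlow API.

def check_quality(rows, rules):
    """Return rows that pass every rule, plus a per-rule failure count."""
    failures = {name: 0 for name in rules}
    passed = []
    for row in rows:
        ok = True
        for name, rule in rules.items():
            if not rule(row):
                failures[name] += 1
                ok = False
        if ok:
            passed.append(row)
    return passed, failures

rules = {
    "id_not_null": lambda r: r.get("id") is not None,
    "amount_positive": lambda r: r.get("amount", 0) > 0,
}

rows = [
    {"id": 1, "amount": 9.5},
    {"id": None, "amount": 3.0},
    {"id": 2, "amount": -1.0},
]

passed, failures = check_quality(rows, rules)
# passed contains only the first row; each rule recorded one failure.
# A production system would page on-call (e.g. via PagerDuty) when
# failure counts cross a threshold; here we simply inspect them.
```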
Addressing Challenges in Building and Operating Reliable Data Pipelines
Data engineering is essential for democratizing data and AI within businesses, yet it remains a challenging and complex field. Data teams must ingest data from siloed and often proprietary systems, including databases and enterprise applications, often requiring the creation of complex and fragile connectors. Additionally, data preparation involves maintaining intricate logic, and failures and latency spikes can lead to operational disruptions and unhappy customers. Deploying pipelines and monitoring data quality typically requires additional, disparate tools, further complicating the process. Existing solutions are fragmented and incomplete, leading to low data quality, reliability issues, high costs, and an increasing backlog of work.
LakeFlow addresses these challenges by simplifying all aspects of data engineering via a single, unified experience built on the Databricks Data Intelligence Platform, with deep integration with Unity Catalog for end-to-end governance and serverless compute for highly efficient and scalable execution.
Key Features of LakeFlow
LakeFlow Connect: Simple and scalable data ingestion from every data source. LakeFlow Connect provides a breadth of native, scalable connectors for databases such as MySQL, Postgres, SQL Server and Oracle, as well as enterprise applications like Salesforce, Dynamics, SharePoint, Workday and NetSuite. These connectors are fully integrated with Unity Catalog, providing robust data governance. LakeFlow Connect incorporates the low-latency, highly efficient capabilities of Arcion, which was acquired by Databricks in November 2023. LakeFlow Connect makes all data, regardless of size, format or location, available for batch and real-time analysis.
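Database connectors like those described above typically rely on change data capture: reading the source's change log and applying only the changes newer than the last ingested position. The sketch below shows that general pattern; the data structures and names are hypothetical, not LakeFlow's API.

```python
# Minimal sketch of log-based incremental ingestion (the general CDC
# pattern); all names are hypothetical, not the LakeFlow Connect API.

def apply_changes(target, change_log, since):
    """Apply inserts/updates/deletes newer than `since`; return new cursor."""
    cursor = since
    for change in change_log:
        if change["seq"] <= since:
            continue  # already ingested on a previous run
        key = change["key"]
        if change["op"] == "delete":
            target.pop(key, None)
        else:  # insert or update ("upsert")
            target[key] = change["value"]
        cursor = max(cursor, change["seq"])
    return cursor

target = {}
log = [
    {"seq": 1, "op": "insert", "key": "a", "value": 10},
    {"seq": 2, "op": "insert", "key": "b", "value": 20},
    {"seq": 3, "op": "update", "key": "a", "value": 11},
    {"seq": 4, "op": "delete", "key": "b", "value": None},
]

cursor = apply_changes(target, log, since=0)
# target now reflects all four changes; the next run resumes from `cursor`
# and reprocesses nothing, which is what keeps ingestion efficient at scale.
```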
LakeFlow Pipelines: Simplifying and automating real-time data pipelines. Built on Databricks’ highly scalable Delta Live Tables technology, LakeFlow Pipelines allows data teams to implement data transformation and ETL in SQL or Python. Customers can now enable Real Time Mode for low-latency streaming without any code changes. LakeFlow Pipelines eliminates the need for manual orchestration and unifies batch and stream processing. It offers incremental data processing for optimal price/performance. LakeFlow Pipelines makes even the most complex streaming and batch data transformations simple to build and easy to operate.
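The "no manual orchestration" claim rests on a declarative model, in the spirit of Delta Live Tables: each table is defined as a function of upstream tables, and the framework derives the execution order from those dependencies. Below is a self-contained toy sketch of that idea; it is not the actual DLT or LakeFlow Pipelines API, and every name in it is illustrative.

```python
# Toy sketch of a declarative pipeline: tables declare their upstream
# dependencies, and the framework (not the user) resolves run order.
# Hypothetical code, not the Delta Live Tables / LakeFlow API.

TABLES = {}

def table(*deps):
    """Register a table definition along with its upstream dependencies."""
    def register(fn):
        TABLES[fn.__name__] = (deps, fn)
        return fn
    return register

@table()
def raw_orders():
    return [{"order": 1, "amount": 50}, {"order": 2, "amount": -5}]

@table("raw_orders")
def clean_orders(raw_orders):
    # Transformation logic lives here; quality rules could attach here too.
    return [r for r in raw_orders if r["amount"] > 0]

@table("clean_orders")
def daily_revenue(clean_orders):
    return sum(r["amount"] for r in clean_orders)

def materialize(name, cache=None):
    """Resolve dependencies recursively, computing each table once."""
    cache = {} if cache is None else cache
    if name not in cache:
        deps, fn = TABLES[name]
        cache[name] = fn(*(materialize(d, cache) for d in deps))
    return cache[name]

revenue = materialize("daily_revenue")  # raw -> clean -> revenue, in order
```

Because the user only declares *what* each table is, the same definitions can in principle be executed as a one-off batch or re-run incrementally as new data arrives.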
LakeFlow Jobs: Orchestrating workflows across the Data Intelligence Platform. LakeFlow Jobs provides automated orchestration and data health and delivery monitoring, spanning everything from scheduling notebooks and SQL queries to ML training and automatic dashboard updates. It provides enhanced control flow capabilities and full observability to help detect, diagnose and mitigate data issues for increased pipeline reliability. LakeFlow Jobs automates deploying, orchestrating and monitoring data pipelines in a single place, making it easier for data teams to meet their data delivery promises.
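The control-flow capabilities mentioned above (triggering, branching, conditional execution) can be pictured as tasks whose run conditions inspect upstream results. The sketch below is a generic illustration of that pattern in plain Python; the task names and runner are hypothetical, not the LakeFlow Jobs API.

```python
# Illustrative sketch of workflow orchestration with conditional
# execution; hypothetical names, not the LakeFlow Jobs API.

def run_workflow(tasks):
    """Run tasks in declaration order, honoring `run_if` conditions.

    `tasks` maps name -> (run_if, fn); `run_if` inspects upstream
    results and decides whether the task should run or be skipped.
    """
    results = {}
    for name, (run_if, fn) in tasks.items():
        results[name] = fn(results) if run_if(results) else "skipped"
    return results

always = lambda results: True

tasks = {
    "ingest": (always, lambda r: "ok"),
    "transform": (lambda r: r["ingest"] == "ok", lambda r: "ok"),
    # Branch: refresh the dashboard only on success; otherwise alert.
    "refresh_dashboard": (lambda r: r["transform"] == "ok", lambda r: "ok"),
    "alert_on_failure": (lambda r: r["transform"] != "ok", lambda r: "paged"),
}

results = run_workflow(tasks)
# With every upstream task succeeding, the dashboard refresh runs and
# the failure-alert branch is skipped.
```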