Introduction to Databricks Delta

Note

Databricks Delta is in Preview.

Databricks Delta delivers a powerful transactional storage layer by harnessing Apache Spark and Databricks DBFS. The core abstraction of Databricks Delta is an optimized Spark table that

  • Stores data as Parquet files in DBFS.
  • Maintains a transaction log that efficiently tracks changes to the table.

You read and write data stored in the delta format using the same familiar Apache Spark SQL batch and streaming APIs that you use to work with Hive tables and DBFS directories. With the addition of the transaction log and other enhancements, Databricks Delta offers significant benefits:
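For example, here is a minimal sketch of writing and reading a Databricks Delta table with the batch DataFrame API; the path /delta/events is a hypothetical example, and spark refers to the SparkSession that is predefined in a Databricks notebook:

    # Write a DataFrame in the Delta format (the path /delta/events is hypothetical).
    df = spark.range(0, 1000)
    df.write.format("delta").save("/delta/events")

    # Read it back with the same batch API used for Parquet files or Hive tables.
    events = spark.read.format("delta").load("/delta/events")
    events.count()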

ACID transactions
  • Multiple writers can simultaneously modify a dataset and see consistent views.
  • Writers can modify a dataset without interfering with jobs reading the dataset.
Fast read access
  • Automatic file management organizes data into large files that can be read efficiently.
  • Statistics speed up reads by 10-100x, and data skipping avoids reading irrelevant information.

Requirements

Databricks Delta requires Databricks Runtime 4.1 or above. If you created a Databricks Delta table using a Databricks Runtime lower than 4.1, the table version must be upgraded. For details, see Table Versioning.

Frequently asked questions (FAQ)

How do Databricks Delta tables compare to Hive SerDe tables?

Databricks Delta tables are managed to a greater degree. In particular, there are several Hive SerDe parameters that Databricks Delta manages on your behalf that you should never specify manually:

  • ROW FORMAT
  • SERDE
  • OUTPUTFORMAT and INPUTFORMAT
  • COMPRESSION
  • STORED AS
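
As an illustration, a Databricks Delta table is created without any of those clauses; a minimal sketch, assuming data was already written to the hypothetical path /delta/events:

    # Create a Delta table without SerDe, format, or compression clauses;
    # Databricks Delta manages those details itself. The table name and
    # location are hypothetical examples.
    spark.sql("CREATE TABLE events USING DELTA LOCATION '/delta/events'")
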
Does Databricks Delta support multi-table transactions?
Databricks Delta does not support multi-table transactions or foreign keys. It supports transactions at the table level.
Does Databricks Delta support writes or reads using the Spark Streaming DStream API?
Databricks Delta does not support the DStream API. We recommend Structured Streaming.
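
For instance, a minimal Structured Streaming sketch that reads from and writes to Delta tables instead of using DStreams; the paths and checkpoint location are hypothetical examples:

    # Read a Delta table as a stream with Structured Streaming (not DStreams).
    stream = (spark.readStream
              .format("delta")
              .load("/delta/events"))

    # Write the stream out to another Delta table, with a checkpoint location.
    query = (stream.writeStream
             .format("delta")
             .option("checkpointLocation", "/delta/events_copy/_checkpoints")
             .start("/delta/events_copy"))
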
What DDL and DML features does Databricks Delta not support?
  • Unsupported DDL features:
    • ANALYZE TABLE PARTITION
    • ALTER TABLE [ADD|DROP] PARTITION
    • ALTER TABLE SET LOCATION
    • ALTER TABLE RECOVER PARTITIONS
    • ALTER TABLE SET SERDEPROPERTIES
    • CREATE TABLE LIKE
    • INSERT OVERWRITE DIRECTORY
    • LOAD DATA
  • Unsupported DML features:
    • INSERT INTO [OVERWRITE] with static partitions.
    • Subqueries in the WHERE conditions of UPDATE and DELETE.
    • Bucketing.
    • Specifying a schema when reading from a table. A command such as spark.read.format("delta").schema(df.schema).load(path) will fail; see the sketch after this list.
    • Specifying target partitions using PARTITION (part_spec) in TRUNCATE TABLE.
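
Because a Delta table's schema is tracked in its transaction log, the working pattern is simply to omit the schema when reading; a sketch, with the hypothetical path /delta/events:

    # Unsupported: passing an explicit schema to the reader fails for Delta tables.
    # spark.read.format("delta").schema(df.schema).load("/delta/events")

    # Supported: omit the schema; Databricks Delta supplies it from the transaction log.
    events = spark.read.format("delta").load("/delta/events")
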
What does it mean that Databricks Delta supports multi-cluster writes?
It means that Databricks Delta uses locking to make sure that queries writing to a table from multiple clusters at the same time won’t corrupt the table. However, it does not mean that if there is a write conflict (for example, an update and a delete of the same thing) both writes will succeed. Instead, one of the writes will fail atomically and the error will tell you to retry the operation.
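
A hypothetical retry pattern for such a conflict might look like the following sketch; the exception handling is deliberately generic, since the exact exception type raised by a conflicting write is not specified here:

    import time

    # Hypothetical sketch: retry a write that may conflict with a writer on
    # another cluster. One of the conflicting writes fails atomically, so a
    # simple retry loop re-attempts the failed operation.
    def write_with_retry(df, path, max_attempts=3):
        for attempt in range(1, max_attempts + 1):
            try:
                df.write.format("delta").mode("append").save(path)
                return
            except Exception:
                # A concurrent write on another cluster may have won the conflict;
                # back off briefly and retry.
                if attempt == max_attempts:
                    raise
                time.sleep(5 * attempt)
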
What are the limitations of multi-cluster writes?

Databricks Delta supports transactional writes from multiple clusters in the same workspace in Databricks Runtime 4.2 and above. All writers must be running Databricks Runtime 4.2 or above. The following features are not supported when running in this mode:

You can disable multi-cluster writes by setting spark.databricks.delta.multiClusterWrites.enabled to false. If multi-cluster writes are disabled, writes to a single table must originate from a single cluster.
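
For example, a sketch of disabling the setting for the current Spark session:

    # Disable multi-cluster writes for the current session.
    spark.conf.set("spark.databricks.delta.multiClusterWrites.enabled", "false")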

Warning

  • You cannot concurrently modify the same Databricks Delta table from different workspaces.
  • Writes to a single table using Databricks Runtime versions lower than 4.2 must originate from a single cluster. To perform transactional writes from multiple clusters in the same workspace you must upgrade to Databricks Runtime 4.2.
Why is Databricks Delta data I deleted still stored in S3?

If you are using Databricks Delta and have enabled S3 bucket versioning, you have two entities managing table files: Databricks Delta and AWS. To ensure that data is fully deleted, you must:

  • Clean up deleted files that are no longer in the Databricks Delta transaction log using VACUUM (see the sketch after this list)
  • Enable an S3 lifecycle policy for versioned objects that ensures that old versions of deleted files are purged
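
For example, a sketch of the VACUUM step using the default retention period; the table name is a hypothetical example, and the S3 lifecycle policy itself is configured in AWS and not shown here:

    # Remove files no longer referenced by the Delta transaction log.
    # The default retention threshold applies; 'events' is a hypothetical table.
    spark.sql("VACUUM events")
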
Can I access Databricks Delta tables outside of Databricks Runtime?

There are two cases to consider: external writes and external reads.

  • External writes: Databricks Delta maintains additional metadata in the form of a transaction log to enable ACID transactions and snapshot isolation for readers. In order to ensure the transaction log is updated correctly and the proper validations are performed, writes must go through Databricks Runtime.

  • External reads: Databricks Delta tables store data encoded in an open format (Parquet), allowing other tools that understand this format to read the data. However, since other tools do not support Databricks Delta’s transaction log, it is likely that they will incorrectly read stale deleted data, uncommitted data, or the partial results of failed transactions.

    In cases where the data is static (that is, there are no active jobs writing to the table), you can use VACUUM with a retention of ZERO HOURS to clean up any stale Parquet files that are not currently part of the table. This operation puts the Parquet files present in DBFS into a consistent state such that they can now be read by external tools.

    However, Databricks Delta relies on stale snapshots for the following functionality, which will break when using VACUUM with zero retention allowance:

    • Snapshot isolation for readers - Long-running jobs will continue to read a consistent snapshot from the moment the jobs started, even if the table is modified concurrently. Running VACUUM with a retention shorter than the length of these jobs can cause them to fail with a FileNotFoundException.
    • Streaming from Databricks Delta tables - Streams read from the original files written into a table in order to ensure exactly-once processing. When combined with OPTIMIZE, VACUUM with zero retention can remove these files before the stream has had time to process them, causing it to fail.
For these reasons, we recommend using the above technique only on static data sets that must be read by external tools.
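
As a sketch, the zero-retention cleanup described above could be run as follows once all writes to the table have stopped; the table name is a hypothetical example:

    # Only safe on a static table: remove every stale Parquet file so that
    # external tools see exactly the files belonging to the current snapshot.
    spark.sql("VACUUM events RETAIN 0 HOURS")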