Porting Existing Workloads to Databricks Delta

When you port existing workloads to Databricks Delta, you should be aware of the following simplifications and differences compared with the data sources provided by Apache Spark and Apache Hive.

Databricks Delta handles the following operations automatically, which you should never perform manually:

REFRESH TABLE
Databricks Delta tables always return the most up-to-date information, so there is no need to manually call REFRESH TABLE after changes.
Add and remove partitions
Databricks Delta automatically tracks the set of partitions present in a table and updates the list as data is added or removed. As a result, there is no need to run ALTER TABLE [ADD|DROP] PARTITION or MSCK.
Load a single partition
As an optimization, users sometimes directly load only the partition of data they are interested in, for example, spark.read.parquet("/data/date=2017-01-01"). This is unnecessary with Databricks Delta, since it can quickly read the list of files from the transaction log to find the relevant ones. If you are interested in a single partition, specify it with a WHERE clause instead, for example, spark.read.format("delta").load("/data").where("date = '2017-01-01'").
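
To make this concrete, here is a minimal PySpark sketch. It assumes a hypothetical Databricks Delta table stored at /data and partitioned by a date column; the path, schema, and sample rows are placeholders rather than part of the examples above.

    # Hypothetical rows for a new partition of a Delta table stored at /data.
    new_rows = spark.createDataFrame(
        [("2017-01-02", "click"), ("2017-01-02", "view")],
        ["date", "event"],
    )

    # Append the data; the transaction log records the new files atomically.
    # partitionBy must match the table's existing partitioning (or defines it
    # on the first write).
    new_rows.write.format("delta").mode("append").partitionBy("date").save("/data")

    # The new partition is visible immediately: no REFRESH TABLE, MSCK, or
    # ALTER TABLE ... ADD PARTITION is needed, and the WHERE clause prunes
    # files using the transaction log.
    spark.read.format("delta").load("/data").where("date = '2017-01-02'").count()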

When you port an existing application to Databricks Delta, you should avoid the following operations, which bypass the transaction log:

Manually modify data
Databricks Delta uses the transaction log to atomically commit changes to the table. Because the log is the source of truth, files that are written out but not added to the transaction log are not read by Spark. Similarly, even if you manually delete a file, a pointer to the file is still present in the transaction log. Instead of manually modifying files stored in a Databricks Delta table, always use the DML commands that are described in this guide.
External readers
The data stored in Databricks Delta is encoded as Parquet files. However, accessing these files with an external reader is not safe: you may see duplicates and uncommitted data, and reads can fail if someone runs VACUUM at the same time.

Note

Because the files are encoded in an open format, you always have the option to move the files outside Databricks Delta. At that point, you can run VACUUM with RETAIN 0 HOURS and delete the transaction log. This leaves the table's files in a consistent state that can be read by the external reader of your choice.
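
For illustration, a minimal sketch of that procedure from a Databricks notebook, using a hypothetical table stored at /delta/events. It assumes the transaction log lives in the table's _delta_log subdirectory and that dbutils is available; depending on your runtime, vacuuming with a zero-hour retention may also require disabling the retention-duration safety check first.

    # Remove all files that are not part of the current table version.
    # Some runtimes require this safety check to be disabled before a
    # zero-hour retention is accepted:
    # spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    spark.sql("VACUUM delta.`/delta/events` RETAIN 0 HOURS")

    # Delete the transaction log; only plain Parquet files remain.
    dbutils.fs.rm("/delta/events/_delta_log", True)

    # The directory can now be read as ordinary Parquet by any reader.
    events = spark.read.parquet("/delta/events")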

Example

You can create a Databricks Delta table from existing files in two ways: specify the location of the Databricks Delta files, or accept the default location.

Specify file location

Suppose you have Parquet data stored in the DBFS directory /data-pipeline. To create a Databricks Delta table from this data:

  1. Read the data into a DataFrame and save it to a new directory in delta format:

    data = spark.read.parquet("/data-pipeline")
    data.write.format("delta").save("/delta/data-pipeline/")
    
  2. Create a Databricks Delta table that refers to the files in the Databricks Delta directory:

    DROP TABLE IF EXISTS pipeline_delta;
    CREATE TABLE pipeline_delta
    USING DELTA
    LOCATION "/delta/data-pipeline/"
    
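As a quick sanity check (a sketch using the table and path created in the steps above), you can confirm that the table name and the underlying Databricks Delta directory return the same data:

    # Read through the metastore table name...
    by_table = spark.table("pipeline_delta")

    # ...and directly through the Delta directory the table points at.
    by_path = spark.read.format("delta").load("/delta/data-pipeline/")

    # Both reads use the same transaction log, so the counts should match.
    print(by_table.count(), by_path.count())
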

Accept default file location

Create a Databricks Delta table directly from the original Parquet files:

%sql
CREATE TABLE pipeline_delta
USING delta
AS SELECT *
FROM parquet.`/data-pipeline`

In this case, the data files and the Databricks Delta transaction log are written to /user/hive/warehouse/<table-name>; that is, /user/hive/warehouse/pipeline_delta, which you can verify by running DESCRIBE DETAIL pipeline_delta.
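
For example, a minimal sketch that reads the location back from DESCRIBE DETAIL in a Python cell; the location column name is taken from that command's output.

    # DESCRIBE DETAIL returns a one-row DataFrame of table metadata,
    # including a `location` column with the storage path.
    detail = spark.sql("DESCRIBE DETAIL pipeline_delta")
    print(detail.select("location").first()["location"])
    # Expected to point at the default warehouse path, e.g.
    # dbfs:/user/hive/warehouse/pipeline_delta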