Binary Files

Preview

This feature is in Public Preview.

Databricks Runtime 5.4 supports the binary file data source, which reads binary files and converts each file into a single record that contains the raw content and metadata of the file.

Note

The binary file data source will be available in the next major release of Apache Spark. Databricks backported the feature from the Apache Spark master branch as a technical preview.

The binary file data source produces a DataFrame with the following columns and possibly partition columns:

  • path (StringType): The path of the file.
  • modificationTime (TimestampType): The modification time of the file. In some Hadoop FileSystem implementations, the modification time might be unavailable; in that case, the value is set to a default.
  • length (LongType): The length of the file in bytes.
  • content (BinaryType): The content of the file.
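
For reference, here is a minimal sketch of how this schema looks once a directory is loaded (the read API itself is described below; /path/to/dir is a placeholder):

df = spark.read.format("binaryFile").load("/path/to/dir")
df.printSchema()
# Expected output (partition columns, if any, appear as extra fields):
# root
#  |-- path: string (nullable = false)
#  |-- modificationTime: timestamp (nullable = false)
#  |-- length: long (nullable = false)
#  |-- content: binary (nullable = true)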

To read binary files, specify the data source format as binaryFile. To load files with paths matching a given glob pattern while keeping the behavior of partition discovery, you can use the pathGlobFilter option. For example, the following code reads all JPG files from the input directory with partition discovery.

df = spark.read.format("binaryFile") \
  .option("pathGlobFilter", "*.jpg") \
  .load("/path/to/dir")

Similar APIs exist for Scala, Java, and R.
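
To see how pathGlobFilter interacts with partition discovery, consider a hypothetical Hive-style layout; the label directories and file names below are made up for illustration:

# Hypothetical input layout:
#   /path/to/dir/label=cat/0001.jpg
#   /path/to/dir/label=dog/0002.jpg
df = spark.read.format("binaryFile") \
  .option("pathGlobFilter", "*.jpg") \
  .load("/path/to/dir")

# label is discovered as a partition column; files that do not match
# *.jpg are skipped, but partition discovery still applies.
df.select("path", "length", "label").show(truncate=False)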

To improve read performance when you load the data back, Databricks recommends turning off compression when you save data loaded from binary files:

# Delta tables store data in Parquet files, so the Parquet codec setting applies.
spark.conf.set("spark.sql.parquet.compression.codec", "uncompressed")
df.write.format("delta").save("/path/to/table")
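
When you load the table back, selecting only the metadata columns lets Parquet column pruning skip the potentially large content column; a minimal sketch, reusing the placeholder path:

df = spark.read.format("delta").load("/path/to/table")

# Column pruning: this projection does not read the content column from storage.
df.select("path", "length").show(truncate=False)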