Databricks File System (DBFS)

Databricks File System (DBFS) is a distributed file system installed on Databricks Runtime clusters.

DBFS is a layer over S3 that allows you to mount S3 buckets and work with them from both Python and Scala. Files in DBFS persist to S3, so you won’t lose data even after you terminate a cluster.

In the past, DBFS used an S3 bucket created in the Databricks account to store data that is not stored on a DBFS mount point. At your request, Databricks can switch this over to an S3 bucket in your own account.

Mounting S3 buckets in DBFS gives you access to specific data without requiring S3 keys.

In a Spark cluster you access DBFS using Databricks Utilities. On your local computer you access DBFS using the CLI.

Access DBFS with the CLI

The DBFS command-line interface (CLI) uses the DBFS API to expose an easy-to-use command-line interface to DBFS. Using this client, interacting with DBFS is as easy as running:

# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana

For more information about the DBFS command-line interface, see Databricks CLI.

Access DBFS with dbutils

This section has several examples of how to write files to and read files from DBFS using dbutils.

Tip

To access the help menu for DBFS, use the dbutils.fs.help() command.

  • Write files to and read files from DBFS as if it were a local filesystem.

    dbutils.fs.mkdirs("/foobar/")
    
    dbutils.fs.put("/foobar/baz.txt", "Hello, World!")
    
    dbutils.fs.head("/foobar/baz.txt")
    
    dbutils.fs.rm("/foobar/baz.txt")
    
  • Use dbfs:/ to access a DBFS path.

    display(dbutils.fs.ls("dbfs:/foobar"))
    
  • Use file:/ to access the local disk.

    dbutils.fs.ls("file:/foobar")
    
  • Filesystem cells provide a shorthand for accessing the dbutils filesystem module. Most dbutils.fs commands are available using the %fs magic command as well.

    %fs rm -r foobar
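
  • Copy a file from the driver's local disk into DBFS. A minimal sketch combining the file:/ and dbfs:/ schemes above; the paths are illustrative:

    # python
    dbutils.fs.put("file:/tmp/local_example.txt", "local file contents", True)
    dbutils.fs.cp("file:/tmp/local_example.txt", "dbfs:/foobar/local_example.txt")
    display(dbutils.fs.ls("dbfs:/foobar"))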
    

For more information about Databricks Utilities, see Databricks Utilities.

Access DBFS using the Spark API

# python
# Write a range of numbers to a DBFS path using the RDD API.
sc.parallelize(range(0, 100)).saveAsTextFile("/tmp/foo.txt")

// scala
sc.parallelize(0 until 100).saveAsTextFile("/tmp/bar.txt")
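
The DataFrame API works with DBFS paths in the same way. A minimal sketch in Python (the /tmp/range_parquet path is just an example):

# python
# Write a DataFrame to a DBFS path as Parquet, then read it back.
df = spark.range(0, 100)
df.write.mode("overwrite").parquet("/tmp/range_parquet")
spark.read.parquet("/tmp/range_parquet").count()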

Access DBFS using local file APIs

You can use local file APIs to read and write to DBFS paths. Databricks configures each node with a fuse mount that allows processes to read and write to the underlying distributed storage layer.

# python
# Write a file to DBFS using Python file I/O APIs.
with open("/dbfs/tmp/test_dbfs.txt", "w") as f:
  f.write("Apache Spark is awesome!\n")
  f.write("End of example!")

# Read the file back.
with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
  for line in f_read:
    print(line)

// scala
import scala.io.Source

val filename = "/dbfs/tmp/test_dbfs.txt"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}

Warning

  • Local file I/O APIs only support files smaller than 2 GB. You might see corrupted files if you use local file I/O APIs to read or write files larger than 2 GB. To access files larger than 2 GB, use the DBFS CLI, dbutils.fs, or the Hadoop Filesystem APIs instead.

  • If you write a file using the local file I/O APIs and then immediately try to access it using the DBFS CLI, dbutils.fs, or the Hadoop Filesystem APIs, you might encounter a FileNotFoundException, a file of size 0, or stale file contents. That is expected because the OS caches writes by default. To force those writes to be flushed to persistent storage (in our case DBFS), use the standard Unix system call sync, as in the Scala example below; a Python sketch of the same pattern follows it.

    // scala
    import scala.sys.process._
    
    // Write a file using the local file I/O API (over the fuse mount).
    dbutils.fs.put("file:/dbfs/tmp/test", "test-contents")
    
    // Unless you call this, the code below might not see the file or its latest contents.
    "sync /dbfs/tmp/test" !
    
    // Read the file using "dbfs:/" instead of the fuse mount.
    dbutils.fs.head("dbfs:/tmp/test")
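
    A sketch of the same pattern in Python, assuming a Python 3 runtime where os.sync is available:

    # python
    import os

    # Write a file over the fuse mount using local file I/O.
    with open("/dbfs/tmp/test", "w") as f:
      f.write("test-contents")

    # Flush cached writes to persistent storage before reading through dbfs:/.
    os.sync()

    dbutils.fs.head("dbfs:/tmp/test")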
    

Mount an S3 bucket

Mounting an S3 bucket directly to DBFS allows you to access files in S3 as if they were on the local file system.

Tip

A common issue is choosing bucket names that are not valid URIs. For more information, see S3 bucket name limitations.

We recommend Secure Access to S3 Buckets Using IAM Roles for mounting your buckets; an IAM role lets you mount a bucket as a DBFS path without distributing AWS keys. You can also mount a bucket using AWS keys, although we do not recommend doing so.
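
With an IAM role (instance profile) attached to the cluster, a key-free mount might look like the following sketch; the bucket and mount names are placeholders:

# python
# Mount a bucket using the cluster's IAM role; no AWS keys required.
dbutils.fs.mount("s3a://<aws-bucket-name>", "/mnt/<mount-name>")
display(dbutils.fs.ls("/mnt/<mount-name>"))

To mount with AWS keys instead: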

  1. Replace the values in the following cell with your S3 credentials.

    # python
    ACCESS_KEY = "<aws-access-key>"
    SECRET_KEY = "<aws-secret-key>"
    ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
    AWS_BUCKET_NAME = "<aws-bucket-name>"
    MOUNT_NAME = "<mount-name>"
    
    dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
    display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
    
    // scala
    // Replace with your values
    val AccessKey = "<aws-access-key>"
    val SecretKey = "<aws-secret-key>"
    // Encode the secret key, because it can contain "/"
    val EncodedSecretKey = SecretKey.replace("/", "%2F")
    val AwsBucketName = "<aws-bucket-name>"
    val MountName = "<mount-name>"
    
    dbutils.fs.mount(s"s3a://$AccessKey:$EncodedSecretKey@$AwsBucketName", s"/mnt/$MountName")
    display(dbutils.fs.ls(s"/mnt/$MountName"))
    
  2. Access files in your S3 bucket as if they were local files, for example:

    # python
    df = spark.read.text("/mnt/%s/...." % MOUNT_NAME)
    df = spark.read.text("dbfs:/mnt/%s/...." % MOUNT_NAME)

    // scala
    val df = spark.read.text(s"/mnt/$MountName/....")
    // Equivalent, with the dbfs:/ scheme spelled out:
    val df2 = spark.read.text(s"dbfs:/mnt/$MountName/....")
    

Note

You can use the fuse mounts to access mounted S3 buckets by referring to /dbfs/mnt/myMount/.
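
For example, a minimal sketch using local file APIs over the fuse path (myMount is a placeholder mount name):

# python
import os

# List the contents of a mounted bucket through the fuse path.
os.listdir("/dbfs/mnt/myMount")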

Unmount an S3 bucket

To unmount a mount point, use the following command:

// scala
dbutils.fs.unmount(s"/mnt/$MountName")
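
A Python equivalent, assuming the MOUNT_NAME value from the mount example above:

# python
dbutils.fs.unmount("/mnt/%s" % MOUNT_NAME)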