Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics. It combines the capabilities of two existing storage services: Azure Data Lake Storage Gen1 features, such as file system semantics, file-level security, and scale, with the low-cost tiered storage, high availability and disaster recovery capabilities, and large SDK and tooling ecosystem of Azure Blob Storage.
Azure Data Lake Storage Gen2 support is in Public Preview.
This topic explains how to access Azure Data Lake Storage Gen2 using the ABFS driver built into Databricks Runtime.
To set up the credentials for an Azure Storage account with Azure Data Lake Storage Gen2, we recommend that you set the credentials in the session configuration of your notebook:

```
spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
  "<your-storage-account-access-key>")
```
Hadoop configuration options set using spark.conf.set(...) are not accessible via SparkContext. This means that while they are visible to the DataFrame and Dataset API, they are not visible to the RDD API.
If you are using the RDD API to read from Azure Data Lake Storage Gen2, you must set the credentials using one of the following methods:
Specify the Hadoop configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations used for your RDD jobs:

```
# Using an account access key
spark.hadoop.fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net <your-storage-account-access-key>
```
Scala users can set the credentials in spark.sparkContext.hadoopConfiguration:

```scala
// Using an account access key
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
  "<your-storage-account-access-key>"
)
```
The credentials set in the Hadoop configuration are available to all users who access the cluster.
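With the credentials propagated to the Hadoop configuration, the RDD API can read the account directly. A minimal sketch, reusing the placeholder names above:

```scala
// The RDD API picks up the account key from the Hadoop configuration,
// so this works only after one of the setups above.
val rdd = spark.sparkContext.textFile(
  "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")
rdd.take(10).foreach(println)
```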
Once an account access key is set up, you can use standard Spark and Databricks APIs to read from the storage account. For example:

```scala
val df = spark.read.parquet("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")

dbutils.fs.ls("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")
```
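Writing back to the account works the same way. A minimal sketch (the output directory name is a placeholder):

```scala
// Write the DataFrame back to the storage account as Parquet.
df.write
  .mode("overwrite")
  .parquet("abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-output-directory>")
```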
Azure Data Lake Storage Gen2 has a hierarchical namespace, which provides improved performance and a familiar file system experience. To take advantage of the hierarchical namespace, you must enable it when creating the Azure Storage account for Azure Data Lake Storage Gen2.
- When the hierarchical namespace is enabled for an Azure Data Lake Storage Gen2 account, you do not need to create any Blob container through the Azure Portal.
- If you enable the hierarchical namespace, during the public preview there is no interoperability of data or operations between the Blob Storage and Data Lake Storage Gen2 REST APIs.
Once the hierarchical namespace is enabled for a storage account, set fs.azure.createRemoteFileSystemDuringInitialization to true. In a notebook, you can set this configuration in the session configuration, as with the credentials above.
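A minimal sketch of that command:

```scala
// Have the ABFS driver create the file system (container) during initialization.
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
```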
You can also set fs.azure.createRemoteFileSystemDuringInitialization to true in the Spark configuration properties field on the cluster creation page.
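A sketch of the corresponding cluster configuration line, assuming the spark.hadoop. prefix convention described earlier also applies to this key:

```
spark.hadoop.fs.azure.createRemoteFileSystemDuringInitialization true
```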
- Can I create a mount point for Azure Data Lake Storage Gen2?
- Mount points for Azure Data Lake Storage Gen2 are not supported.
- Does ABFS support Shared Access Signature (SAS) token authentication?
- SAS token authentication is not supported.
- Can I use the abfs scheme to access Azure Data Lake Storage Gen2?
- Yes. However, we recommend that you use the abfss scheme, which uses SSL encrypted access, wherever possible.
- While accessing an Azure Data Lake Storage Gen2 account with the hierarchical namespace enabled, I experienced a java.io.FileNotFoundException, and the error message mentions FilesystemNotFound.
If the error message includes the following information, it is because your command is trying to access a Blob Storage container created through the Azure Portal.
```
StatusCode=404
StatusDescription=The specified filesystem does not exist.
ErrorCode=FilesystemNotFound
ErrorMessage=The specified filesystem does not exist.
```
When a hierarchical namespace is enabled, you do not need to create containers through the Azure Portal. If you see this issue, delete the Blob container through the Azure Portal. After a few minutes, you will be able to access the container. Alternatively, you can change your abfss URI to use a different container, as long as this container is not created through the Azure Portal.