Amazon S3 Select

Amazon S3 Select enables retrieving only the data you need from an object. The Databricks S3 Select connector provides a Spark data source that leverages S3 Select. When you use an S3 Select data source, filters and column selections on a DataFrame are pushed down to S3, saving S3 data bandwidth.

Requirements

The Databricks S3 Select connector requires Databricks Runtime 4.1 or above.

Limitations

Amazon S3 Select supports only the following file formats, encodings, and compression types:

  • CSV and JSON files
  • UTF-8 encoding
  • GZIP or no compression

The Databricks S3 Select connector has the following limitations:

  • Complex types (arrays and objects) cannot be used in JSON
  • Schema inference is not supported
  • File splitting is not supported; however, multiline records are supported
  • DBFS mount points are not supported

Usage

  • Scala

    spark.read.format("s3select").schema(...).options(...).load("s3://bucket/filename")
    
  • SQL

    CREATE TABLE name (...) USING S3SELECT LOCATION 's3://bucket/filename' [ OPTIONS (...) ]
    

If the filename extension is .csv or .json, the format is automatically detected; otherwise you must provide the FileFormat option.
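Because the connector does not support schema inference (see Limitations), you must always supply a schema explicitly. The following is a minimal Scala sketch; the bucket, object, and column names are hypothetical, and the object name has no extension, so FileFormat is set by hand:

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types._

    // Hypothetical bucket, object, and column names.
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)
    ))

    val df = spark.read
      .format("s3select")
      .schema(schema)                  // required: schema inference is not supported
      .option("FileFormat", "csv")     // required here: the object name has no .csv extension
      .load("s3://my-bucket/exports/part-00000")

    // Filters and column selections are pushed down to S3 Select.
    df.filter(col("id") > 100).select("name").show()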

Options

This section describes options for all file types and options specific to CSV and JSON.

Generic options

Option name       Default value   Description
FileFormat        'auto'          Input file type ('auto', 'csv', or 'json')
CompressionType   'none'          Compression codec used by the input file ('none' or 'gzip')
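For example, a sketch (hypothetical bucket, path, and schema) reading a gzip-compressed CSV object:

    import org.apache.spark.sql.types._

    // Hypothetical bucket, path, and schema.
    val schema = StructType(Seq(StructField("ts", StringType), StructField("value", DoubleType)))

    val gz = spark.read
      .format("s3select")
      .schema(schema)
      .option("FileFormat", "csv")
      .option("CompressionType", "gzip")   // the input object is gzip-compressed
      .load("s3://my-bucket/data/metrics.csv.gz")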

CSV specific options

Option name       Default value   Description
NullValue         ''              String that represents null values in the input
Header            false           Whether to skip the first line of the input (the potential header contents are ignored)
Comment           '#'             Lines starting with the value of this parameter are ignored
RecordDelimiter   '\n'            Character separating records in the file
Delimiter         ','             Character separating fields within a record
Quote             '"'             Character used to quote values containing reserved characters
Escape            '"'             Character used to escape the quote character inside quoted values
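A sketch combining several of these options; the bucket, path, and schema are hypothetical:

    import org.apache.spark.sql.types._

    // Hypothetical pipe-delimited file with a header row and the string NULL for missing values.
    val schema = StructType(Seq(StructField("country", StringType), StructField("population", LongType)))

    val csv = spark.read
      .format("s3select")
      .schema(schema)
      .option("FileFormat", "csv")
      .option("Header", "true")        // skip the first line of the input
      .option("Delimiter", "|")        // fields are separated by |
      .option("NullValue", "NULL")     // treat the string NULL as null
      .load("s3://my-bucket/data/countries.psv")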

JSON specific options

Option name   Default value   Description
Type          'document'      Type of input ('document' or 'lines')
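A sketch reading newline-delimited JSON; the bucket, path, and schema are hypothetical:

    import org.apache.spark.sql.types._

    // Hypothetical JSON Lines object: one flat JSON record per line
    // (complex types are not supported; see Limitations).
    val schema = StructType(Seq(StructField("event", StringType), StructField("count", IntegerType)))

    val json = spark.read
      .format("s3select")
      .schema(schema)
      .option("FileFormat", "json")
      .option("Type", "lines")         // 'document' (default) or 'lines'
      .load("s3://my-bucket/data/events.json")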

S3 authentication

You can use the S3 authentication methods (keys and IAM roles) available in Databricks; we recommend that you use IAM roles. There are three ways of providing the credentials:

  1. Default Credential Provider Chain (recommended option): AWS credentials are automatically retrieved through the DefaultAWSCredentialsProviderChain. If you use IAM roles to authenticate to S3, use this method. Other methods of providing credentials (methods 2 and 3) take precedence over this default.

  2. Set keys in Hadoop conf: Specify AWS keys in Hadoop configuration properties.

    To reference the s3a:// filesystem, set the fs.s3a.access.key and fs.s3a.secret.key properties in a Hadoop XML configuration file or call sc.hadoopConfiguration.set() to set Spark’s global Hadoop configuration.

    • Scala

      sc.hadoopConfiguration.set("fs.s3a.access.key", "$AccessKey")
      sc.hadoopConfiguration.set("fs.s3a.secret.key", "$SecretKey")
      
    • Python

      sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
      sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY)
      
  3. Encode keys in URI: For example, the URI s3a://$AccessKey:$SecretKey@bucket/path/to/dir encodes the key pair (AccessKey, SecretKey).
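    A sketch of this form (the placeholders stand in for real keys; the bucket and path are hypothetical):

      import org.apache.spark.sql.types._

      // Replace $AccessKey and $SecretKey with the actual key pair.
      // Keys embedded in a URI can leak into logs, so prefer IAM roles (method 1).
      val schema = StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

      val df = spark.read
        .format("s3select")
        .schema(schema)
        .load("s3a://$AccessKey:$SecretKey@my-bucket/path/to/dir")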