Skip to main content

Load data using a Unity Catalog external location

Preview

This feature is in Public Preview.

This article describes how to use the add data UI to create a managed table from data in Amazon S3 using a Unity Catalog external location. An external location is an object that combines a cloud storage path with a storage credential that authorizes access to the cloud storage path.

Databricks recommends using Unity Catalog external locations to access data in cloud object storage. The legacy S3 table import page only supports creating tables in the legacy Hive metastore and requires that you select a compute resource that uses an instance profile.

Before you begin

Before you begin, you must have the following:

File types

The following file types are supported:

  • CSV
  • TSV
  • JSON
  • XML
  • AVRO
  • Parquet

Step 1: Confirm access to the external location

To confirm access to the external location, do the following:

  1. In the sidebar of your Databricks workspace, click Catalog.
  2. In Catalog Explorer, click External Data > External Locations.

Step 2: Create the managed table

To create the managed table, do the following:

  1. In the sidebar of your workspace, click + New > Add data.

  2. In the add data UI, click Amazon S3.

  3. Select an external location from the drop-down list.

  4. Select the folders and the files that you want to load into Databricks, and then click Preview table.

  5. Select a catalog and a schema from the drop-down lists.

  6. (Optional) Edit the table name.

  7. (Optional) To set advanced format options by file type, click Advanced attributes, turn off Automatically detect file type, and then select a file type.

    For a list of format options, see the following section.

  8. (Optional) To edit the column name, click the input box at the top of the column.

    Column names don’t support commas, backslashes, or unicode characters (such as emojis).

  9. (Optional) To edit column types, click the icon with the type.

  10. Click Create table.

File type format options

The following format options are available, depending on the file type:

Format optionDescriptionSupported file types
Column delimiterThe separator character between columns. Only a single character is allowed, and backslash is not supported.

The default is a comma.
CSV
Escape characterThe escape character to use when parsing the data.

The default is a quotation mark.
CSV
First row contains the headerThis option specifies whether the file contains a header.

Enabled by default.
CSV
Automatically detect file typeAutomatically detect file type. Default is true.XML
Automatically detect column typesAutomatically detect column types from file content. You can edit types in the preview table. If this is set to false, all column types are inferred as STRING.

Enabled by default.
- CSV

- JSON
- XML
Rows span multiple linesWhether a column’s value can span multiple lines in the file.

Disabled by default.
- CSV

- JSON
Merge the schema across multiple filesWhether to infer the schema across multiple files and to merge the schema of each file.

Enabled by default.
CSV
Allow commentsWhether comments are allowed in the file.

Enabled by default.
JSON
Allow single quotesWhether single quotes are allowed in the file.

Enabled by default.
JSON
Infer timestampWhether to try to infer timestamp strings as TimestampType.

Enabled by default.
JSON
Rescued data columnWhether to save columns that don’t match the schema. For more information, see What is the rescued data column?.

Enabled by default.
- CSV

- JSON
- Avro
- Parquet
Exclude attributeWhether to exclude attributes in elements. Default is false.XML
Attribute prefixThe prefix for attributes to differentiate attributes and elements. Default is _.XML

Column data types

The following column data types are supported. For more information about individual data types see SQL data types.

Data TypeDescription
BIGINT8-byte signed integer numbers.
BOOLEANBoolean (true, false) values.
DATEand day, without a time-zone.
DECIMAL (P,S)Numbers with maximum precision P and fixed scale S.
DOUBLE8-byte double-precision floating point numbers.
STRINGCharacter string values.
TIMESTAMPValues comprising values of fields year, month, day, hour, minute, and second, with the session local timezone.

Known issues

  • You might experience issues with special characters in complex data types, such as a JSON object with a key containing a backtick or a colon.
  • Some JSON files might require that you manually select JSON for the file type. To manually select a file type after you select files, click Advanced attributes, turn off Automatically detect file type, and then select JSON.
  • Nested timestamps and decimals inside complex types might encounter issues.