Accessing Data

This topic describes how to import data, load data using the Spark API, and edit and delete data using Databricks File System commands.

Import data

If you have small files on your local machine that you want to analyze with Databricks, you can easily upload them to Databricks File System. For simple exploration scenarios you can:

  • Drop files into, or browse to files in, the Import & Explore Data box on the landing page.
  • Upload the files in the Create table UI.

For production environments, however, we recommend that you access Databricks File System using the CLI or one of the APIs. You can also use a wide variety of data sources to import data directly in your notebooks.
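For example, with the Databricks CLI installed and a profile already configured, a local file can be copied into DBFS from the command line. This is a sketch; the local filename and DBFS destination path are assumptions for illustration:

```shell
# Copy a local file into DBFS (requires a configured Databricks CLI profile).
databricks fs cp ./state_income.csv dbfs:/FileStore/tables/state_income.csv

# List the destination directory to confirm the upload.
databricks fs ls dbfs:/FileStore/tables/
```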

Load data

You can read your raw data into Spark directly. For example, if you uploaded a CSV, you can read your data using one of these examples.

Tip

For easier access, we recommend that you create a table. See Databases and Tables for more information.

Scala

    val sparkDF = spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/FileStore/tables/state_income-9f7c5.csv")

Python

    sparkDF = spark.read.format('csv').options(header='true', inferSchema='true').load('/FileStore/tables/state_income-9f7c5.csv')

R

    sparkDF <- read.df(source = "csv", path = "/FileStore/tables/state_income-9f7c5.csv", header = "true", inferSchema = "true")

Scala RDD

    val rdd = sc.textFile("/FileStore/tables/state_income-9f7c5.csv")

Python RDD

    rdd = sc.textFile("/FileStore/tables/state_income-9f7c5.csv")

If the data volume is small enough, you can also load this data directly onto the driver node. For example:

Python

    import pandas as pd
    pandas_df = pd.read_csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header='infer')

R

    df = read.csv("/dbfs/FileStore/tables/state_income-9f7c5.csv", header = TRUE)
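As a minimal, self-contained illustration of the pandas call above (using a temporary file and made-up data in place of the uploaded DBFS file), note that header='infer', the default, treats the first row as column names:

```python
import os
import tempfile

import pandas as pd

# Write a small CSV standing in for the uploaded file (hypothetical data).
csv_text = "state,income\nCA,80440\nTX,64034\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(csv_text)
    path = f.name

# header='infer' (the default) uses the first row as column names.
pandas_df = pd.read_csv(path, header="infer")
print(list(pandas_df.columns))  # ['state', 'income']
print(len(pandas_df))           # 2

os.remove(path)
```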

Download to driver

You can use %sh wget <url>/<filename> to download data to the Spark driver node.

Note

The cell output prints Saving to: '<filename>', but the file is actually saved to file:/databricks/driver/<filename>.
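The same download can also be scripted from a notebook cell with Python's standard library. In this sketch a file:// URL and a temporary directory stand in for the real <url>/<filename> and the driver's working directory (both are assumptions for illustration):

```python
import os
import tempfile
import urllib.request

# Create a local source file standing in for the remote <url>/<filename>.
src = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False)
src.write("state,income\nCA,80440\n")
src.close()

# On Databricks the download would land under /databricks/driver;
# here we download into a temporary directory instead.
dest = os.path.join(tempfile.mkdtemp(), "state_income.csv")
urllib.request.urlretrieve("file://" + src.name, dest)

with open(dest) as f:
    contents = f.read()
print(contents.splitlines()[0])  # state,income
```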

Edit data

You cannot edit data directly within Databricks, but you can overwrite a data file using Databricks File System commands.
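One way to overwrite a file is with ordinary Python file I/O against the local /dbfs mount on the driver. The sketch below uses a temporary directory in place of /dbfs/FileStore/tables (an assumption for illustration):

```python
import os
import tempfile

# Stand-in for /dbfs/FileStore/tables (hypothetical path).
tables_dir = tempfile.mkdtemp()
path = os.path.join(tables_dir, "state_income.csv")

# Original upload.
with open(path, "w") as f:
    f.write("state,income\nCA,80440\n")

# "Editing" means replacing the whole file: write a corrected copy over it.
with open(path, "w") as f:
    f.write("state,income\nCA,80440\nTX,64034\n")

with open(path) as f:
    line_count = len(f.read().splitlines())
print(line_count)  # 3
```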

Delete data

To delete data, use the following Databricks Utilities command:

dbutils.fs.rm("dbfs:/FileStore/tables/state_income-9f7c5.csv", true)
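The second argument (true) enables recursive deletion, which is required when the path is a directory. The equivalent operation on the driver's /dbfs mount is a recursive remove, sketched here against a temporary directory rather than a real DBFS path:

```python
import os
import shutil
import tempfile

# Stand-in for a DBFS directory such as dbfs:/FileStore/tables (hypothetical).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "nested"))
with open(os.path.join(root, "nested", "state_income.csv"), "w") as f:
    f.write("state,income\n")

# Like dbutils.fs.rm(path, true): remove the directory and everything in it.
shutil.rmtree(root)
print(os.path.exists(root))  # False
```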

Warning

Deleted data cannot be recovered.