Spark Submit (legacy)
The Spark Submit task type is a legacy pattern for configuring JARs as tasks. Databricks recommends using the JAR task. See JAR task for jobs.
Requirements
- You can run spark-submit tasks only on new clusters.
- You must upload your JAR file to a location or Maven repository compatible with your compute configuration. See Java and Scala library support.
- You cannot access JAR files stored in volumes.
- Spark-submit does not support cluster autoscaling. To learn more about autoscaling, see Cluster autoscaling.
- Spark-submit does not support Databricks Utilities (`dbutils`) references. To use Databricks Utilities, use JAR tasks instead.
- If you use a Unity Catalog-enabled cluster, spark-submit is supported only if the cluster uses dedicated access mode. Standard access mode is not supported. See Access modes.
- Structured Streaming jobs should never have maximum concurrent runs set to greater than 1. Streaming jobs should be set to run using the cron expression `"* * * * * ?"` (every minute). Because a streaming task runs continuously, it should always be the final task in a job.
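
If you define the job programmatically rather than in the UI, these streaming settings correspond to the job's schedule and concurrency fields. The following is a minimal sketch, assuming the Jobs API 2.1 JSON format; the job name, cluster settings, class name, and JAR path are illustrative placeholders rather than values from this page.

```json
{
  "name": "streaming-spark-submit-job",
  "max_concurrent_runs": 1,
  "schedule": {
    "quartz_cron_expression": "* * * * * ?",
    "timezone_id": "UTC"
  },
  "tasks": [
    {
      "task_key": "streaming_task",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      },
      "spark_submit_task": {
        "parameters": [
          "--class",
          "org.example.StreamingMain",
          "dbfs:/FileStore/libraries/streaming_app.jar"
        ]
      }
    }
  ]
}
```

Setting max_concurrent_runs to 1 and scheduling with the quartz expression above follows the streaming guidance in the requirements list.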
Configure a Spark Submit task
Add a Spark Submit task from the Tasks tab in the Jobs UI by doing the following:
- In the Type drop-down menu, select Spark Submit.
- Use Compute to configure a cluster that supports the logic in your task.
- Use the Parameters text box to provide all arguments and configurations necessary to run your task as a JSON array of strings. A combined example appears after these steps.
  - The first three arguments are used to identify the main class to run in a JAR at a specified path, as in the following example:

    ```json
    ["--class", "org.apache.spark.mainClassName", "dbfs:/Filestore/libraries/jar_path.jar"]
    ```
  - You cannot override the `master`, `deploy-mode`, and `executor-cores` settings configured by Databricks.
  - Use `--jars` and `--py-files` to add dependent Java, Scala, and Python libraries.
  - Use `--conf` to set Spark configurations.
  - The `--jars`, `--py-files`, and `--files` arguments support DBFS and S3 paths.
  - By default, the Spark submit job uses all available memory, excluding memory reserved for Databricks services. You can set `--driver-memory` and `--executor-memory` to smaller values to leave some room for off-heap usage.
- Click Save task.
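
Putting the Parameters options together, a complete value might look like the following sketch. The class name, DBFS paths, Spark configuration, and memory sizes are hypothetical placeholders; as with standard spark-submit, option flags come before the application JAR path, and anything after the JAR path is passed as arguments to your main class.

```json
[
  "--class", "org.example.MainClass",
  "--jars", "dbfs:/FileStore/libraries/dependency.jar",
  "--conf", "spark.sql.shuffle.partitions=200",
  "--driver-memory", "10g",
  "--executor-memory", "25g",
  "dbfs:/FileStore/libraries/app.jar",
  "input_path", "output_path"
]
```

Keeping `--driver-memory` and `--executor-memory` below the cluster's available memory leaves room for off-heap usage, per the note above.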