Python Clusters

Spark jobs, Python notebook cells, and library installation all support both Python 2 and 3.

Python 3 is supported on all Databricks Runtime versions starting with Spark 2.0.2-db3.

The default Python version for clusters created using the UI is Python 3. The default version for clusters created using the REST API is Python 2.

Create a Python cluster

To specify the Python version when you create a cluster, select it from the Python Version drop-down.

[Image: the Python Version drop-down in the cluster creation UI]

You can create a cluster running a specific version of Python using the API by setting the environment variable PYSPARK_PYTHON to /databricks/python/bin/python or /databricks/python3/bin/python3. For an example, see the REST API example Create a Python 3 cluster.
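As a minimal sketch of such a request from Python using the requests library (the workspace URL, access token, runtime version, and node type below are all placeholders you must substitute):

import requests

# Placeholders: substitute your workspace URL and a personal access token
DOMAIN = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

response = requests.post(
    DOMAIN + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + TOKEN},
    json={
        "cluster_name": "python-3-cluster",    # hypothetical name
        "spark_version": "<runtime-version>",  # any runtime from Spark 2.0.2-db3 onward
        "node_type_id": "<node-type>",         # depends on your cloud provider
        "num_workers": 1,
        # The environment variable that selects the cluster's Python version
        "spark_env_vars": {"PYSPARK_PYTHON": "/databricks/python3/bin/python3"},
    },
)
response.raise_for_status()
print(response.json()["cluster_id"])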

To validate that the PYSPARK_PYTHON configuration took effect, run the following in a Python notebook (or %python cell):

import sys
print(sys.version)

If you specified /databricks/python3/bin/python3, it should print something like:

3.5.2 (default, Sep 10 2016, 08:21:44)
[GCC 5.4.0 20160609]

Important

When you run %sh python --version in a notebook, python refers to the Ubuntu system Python version, which is Python 2. Use /databricks/python/bin/python to refer to the version of Python used by Databricks notebooks and Spark: this path is automatically configured to point to the correct Python executable.
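To see exactly which interpreter a notebook cell is running under, a plain standard-library check works:

import sys

# Absolute path of the interpreter executing this cell
print(sys.executable)

On a Python 3 cluster configured as above, the printed path would be expected to fall under /databricks/python3/.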

Frequently asked questions (FAQ)

Can I use both Python 2 and Python 3 notebooks on the same cluster?
No. The Python version is a cluster-wide setting and is not configurable on a per-notebook basis.
What libraries are pre-installed on Python clusters?
Python 2 and 3 clusters share the same set of pre-installed libraries and library versions, with one exception: simples3 is not available for Python 3, so it is installed only on Python 2 clusters. For details on the specific libraries that are pre-installed, see the Databricks Runtime release notes. You can also list the installed packages directly from a notebook, as shown below.
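One option is the standard pkg_resources module, which is available on both Python versions:

import pkg_resources

# Print every package visible to the notebook's Python environment
for dist in sorted(pkg_resources.working_set, key=lambda d: d.project_name.lower()):
    print("{} {}".format(dist.project_name, dist.version))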
Will my existing PyPI libraries work with Python 3?
Yes. Databricks installs the correct version if the library supports both Python 2 and 3. If the library does not support Python 3, then library attachment fails with an error.
Will my existing .egg libraries work with Python 3?
It depends on whether your existing egg library is cross-compatible with both Python 2 and 3. If the library does not support Python 3, either library attachment fails or runtime errors occur.

For a comprehensive guide on porting code to Python 3 and writing code compatible with both Python 2 and 3, see http://python3porting.com/.

Can I still install Python libraries using init scripts?
Yes. A common use case for Cluster Node Initialization Scripts is installing packages. Use /databricks/python/bin/pip to ensure that Python packages are installed into the Databricks Python virtual environment rather than into the system Python environment. A sketch of such a script follows.
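As a minimal sketch, assuming a cluster named my-cluster and simplejson as an example package (both placeholders), you could write a cluster-named init script from a notebook with dbutils.fs.put:

dbutils.fs.put(
    "dbfs:/databricks/init/my-cluster/install-packages.sh",
    """#!/bin/bash
# Install into the Databricks Python environment, not the system Python
/databricks/python/bin/pip install simplejson
""",
    True,  # overwrite if the script already exists
)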