VPC Peering

VPC peering allows your Databricks clusters to connect to your other AWS infrastructure (RDS, Redshift, Kafka, Cassandra, and so on) using private IP addresses within the internal AWS network. In order to establish a peering connection, both the Databricks VPC and the VPC hosting your other infrastructure must exist in the same AWS region.

The VPC hosting the other infrastructure must have a CIDR range distinct from the Databricks VPC and from every other CIDR range included as a destination in the Databricks VPC main route table. To view these destinations, search for the Databricks VPC in your AWS Console, click the main route table associated with it, and examine the Routes tab. If you have a conflict, contact Databricks support to inquire about moving your Databricks VPC to a new CIDR range of your choice. Here is an example of a main route table for a Databricks deployment that is already peered with several other VPCs:

Databricks VPC Route Table
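
The conflict check described above can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, not part of the Databricks tooling: the function name `find_cidr_conflicts` is hypothetical, and the second destination in the sample list is an invented value standing in for an existing peered VPC.

```python
import ipaddress


def find_cidr_conflicts(candidate_cidr, existing_cidrs):
    """Return the entries of existing_cidrs that overlap candidate_cidr."""
    candidate = ipaddress.ip_network(candidate_cidr)
    return [
        cidr
        for cidr in existing_cidrs
        if candidate.overlaps(ipaddress.ip_network(cidr))
    ]


# Destinations as they might appear in the Databricks main route table.
# "10.126.0.0/16" is the Databricks VPC from this guide; "10.50.0.0/16" is
# a hypothetical, already-peered VPC.
databricks_destinations = ["10.126.0.0/16", "10.50.0.0/16"]

# The Aurora VPC range used in this guide.
print(find_cidr_conflicts("172.78.0.0/16", databricks_destinations))

# A range carved out of the Databricks VPC, which would conflict.
print(find_cidr_conflicts("10.126.4.0/24", databricks_destinations))
```

An empty list means the candidate range is safe to peer; any returned entry is a range you would need to resolve first.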

For information on VPC peering, see the AWS VPC Peering guide.

This guide walks you through an example of peering an AWS Aurora RDS to your Databricks VPC using the AWS Console. If you prefer a programmatic solution, go to Programmatic VPC peering for a notebook that performs all of the steps for you. Finally, there is a troubleshooting section for common problems and resolutions.

Important

Consult your AWS/devops team before trying to set up VPC peering. Some familiarity with AWS, as well as sufficient permissions, will help this process go smoothly. The notebook can help you make this transition; however, depending on your environment, you may need to modify it so that there is no impact on your existing infrastructure.

AWS Console example

The following diagram illustrates all of the different components that are involved in peering your Databricks deployment to your other AWS infrastructure. In the example, Databricks is deployed in one AWS account and the Aurora RDS is deployed into another. A peering connection is established to link the two VPCs across both AWS accounts.

VPC Peering Connection Across AWS Accounts

As you move through this process within your own AWS Console, it helps to keep a table of information to refer back to. Record the following:

  1. ID and CIDR Range of your Databricks VPC.
  2. ID and CIDR Range of your other infrastructure (Aurora RDS).
  3. ID of the main route table of your Databricks VPC.

AWS Service    Name                          ID             CIDR Range
VPC            Databricks VPC                vpc-dbcb3fbc   10.126.0.0/16
VPC            Aurora RDS VPC                vpc-7b52471c   172.78.0.0/16
Route Table    Databricks Main Route Table   rtb-3775c750

Step 1: Create a peering connection

  1. Navigate to the VPC Dashboard.

  2. Select Peering Connections.

  3. Click Create Peering Connection.

  4. Set the VPC Requester to the Databricks VPC ID.

  5. Set the VPC Acceptor to the Aurora VPC ID.

  6. Click Create Peering Connection.

    Create Peering Connection
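
For readers who prefer to script this step, the following is a minimal sketch, assuming `ec2` is a boto3 EC2 client created with credentials for the account that owns the Databricks VPC. The helper name `create_peering_connection` and the account ID placeholder are hypothetical; this is not the code of the notebook mentioned elsewhere in this guide.

```python
def create_peering_connection(ec2, requester_vpc_id, accepter_vpc_id,
                              accepter_account_id=None):
    """Request a peering connection and return its ID (pcx-...).

    `ec2` is assumed to be a boto3 EC2 client for the requester's account.
    """
    params = {"VpcId": requester_vpc_id, "PeerVpcId": accepter_vpc_id}
    if accepter_account_id is not None:
        # Only needed when the accepter VPC belongs to another AWS account.
        params["PeerOwnerId"] = accepter_account_id
    response = ec2.create_vpc_peering_connection(**params)
    return response["VpcPeeringConnection"]["VpcPeeringConnectionId"]


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   ec2 = boto3.client("ec2")
#   pcx_id = create_peering_connection(
#       ec2, "vpc-dbcb3fbc", "vpc-7b52471c", "<aurora-account-id>")
```

The returned ID is the value you would record in your table in the next step.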

Step 2: Record the ID of the peering connection

Add the ID of the new peering connection to your table:

AWS Service          Name                            ID             CIDR Range
VPC                  Databricks VPC                  vpc-dbcb3fbc   10.126.0.0/16
VPC                  Aurora RDS VPC                  vpc-7b52471c   172.78.0.0/16
Route Table          Databricks Main Route Table     rtb-3775c750
Peering Connection   Databricks VPC <> Aurora VPC    pcx-4d148024

Step 3: Accept the peering connection request

The owner of the VPC that hosts the Aurora RDS must approve the request. The status on Peering Connections shows Pending Acceptance until this is done.

Peering Connection Pending Acceptance

Select Actions > Accept Request.

Peering Connection Accept Request
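
The same check and acceptance can be sketched in code. This assumes `ec2` is a boto3 EC2 client created with credentials for the account that owns the Aurora VPC; the helper names are hypothetical.

```python
def get_peering_status(ec2, peering_connection_id):
    """Return the status code of a peering connection, e.g. 'pending-acceptance'."""
    response = ec2.describe_vpc_peering_connections(
        VpcPeeringConnectionIds=[peering_connection_id]
    )
    return response["VpcPeeringConnections"][0]["Status"]["Code"]


def accept_peering_connection(ec2, peering_connection_id):
    """Accept a pending request and return the status code AWS reports back."""
    response = ec2.accept_vpc_peering_connection(
        VpcPeeringConnectionId=peering_connection_id
    )
    return response["VpcPeeringConnection"]["Status"]["Code"]


# Usage (assumes boto3 and credentials for the accepter's account):
#   import boto3
#   ec2 = boto3.client("ec2")
#   if get_peering_status(ec2, "pcx-4d148024") == "pending-acceptance":
#       accept_peering_connection(ec2, "pcx-4d148024")
```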

Step 4: Add DNS resolution to peering connection

  1. Log into the AWS Account that hosts the Databricks VPC.
  2. Navigate to the VPC Dashboard.
  3. Select Peering Connections.
  4. From the Actions menu, select Edit DNS Settings.
  5. Click to enable DNS resolution.
  6. Log into the AWS Account that hosts the Aurora VPC and repeat steps 2 through 5.

Enable DNS Resolution
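
As a sketch of how the DNS setting could be scripted, the helper below calls the EC2 `modify_vpc_peering_connection_options` operation through a boto3 client. It assumes that, as in the console steps, each account enables the option for its own side of the connection; the helper name and the `side` argument are hypothetical.

```python
def enable_dns_resolution(ec2, peering_connection_id, side):
    """Enable DNS resolution from the remote VPC for one side of the peering.

    `side` is "requester" (the Databricks VPC in this guide) or "accepter"
    (the Aurora VPC). `ec2` is assumed to be a boto3 EC2 client for the
    account that owns that side.
    """
    if side not in ("requester", "accepter"):
        raise ValueError("side must be 'requester' or 'accepter'")
    option_key = (
        "RequesterPeeringConnectionOptions"
        if side == "requester"
        else "AccepterPeeringConnectionOptions"
    )
    return ec2.modify_vpc_peering_connection_options(
        VpcPeeringConnectionId=peering_connection_id,
        **{option_key: {"AllowDnsResolutionFromRemoteVpc": True}},
    )


# Usage (one call per account, each with its own boto3 client):
#   enable_dns_resolution(databricks_ec2, "pcx-4d148024", "requester")
#   enable_dns_resolution(aurora_ec2, "pcx-4d148024", "accepter")
```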

Step 5: Add destination to Databricks VPC main route table

  1. Select Route Tables in the VPC Dashboard.

  2. Search for the Databricks VPC ID.

  3. Click the Edit button under the Routes tab.

  4. Click Add another route.

  5. Enter the CIDR range of the Aurora VPC for the Destination.

  6. Enter the ID of the peering connection for the Target.

    Databricks VPC Route Destinations

Step 6: Add destination to Aurora VPC main route table

  1. Select Route Tables in the VPC Dashboard.

  2. Search for the Aurora VPC ID.

  3. Click the Edit button under the Routes tab.

  4. Click Add another route.

  5. Enter the CIDR range of the Databricks VPC for the Destination.

  6. Enter the ID of the peering connection for the Target.

    Aurora VPC Route Destinations
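
Steps 5 and 6 add mirror-image routes, so a single helper can cover both. This is a minimal sketch, assuming `ec2` is a boto3 EC2 client for the account that owns the route table being edited; the helper name and the Aurora route table placeholder are hypothetical.

```python
def add_peering_route(ec2, route_table_id, destination_cidr, peering_connection_id):
    """Add a route that sends traffic for destination_cidr over the peering.

    Returns the boolean that the EC2 CreateRoute call reports.
    """
    response = ec2.create_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=destination_cidr,
        VpcPeeringConnectionId=peering_connection_id,
    )
    return response["Return"]


# Usage with the example values recorded in this guide:
#   # Step 5: Databricks main route table -> Aurora VPC CIDR
#   add_peering_route(databricks_ec2, "rtb-3775c750", "172.78.0.0/16", "pcx-4d148024")
#   # Step 6: Aurora main route table -> Databricks VPC CIDR
#   add_peering_route(aurora_ec2, "<aurora-main-route-table-id>",
#                     "10.126.0.0/16", "pcx-4d148024")
```

Note that each side's destination is the other VPC's CIDR range, while the target is the same peering connection ID in both calls.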

Step 7: Find the Databricks unmanaged security group

  1. Select Security Groups in the VPC Dashboard.
  2. Search for the ID of the Databricks VPC.
  3. Find and record the ID of the security group with Unmanaged in the name. Do not select the Managed security group.

AWS Service          Name                            ID             CIDR Range
VPC                  Databricks VPC                  vpc-dbcb3fbc   10.126.0.0/16
VPC                  Aurora RDS VPC                  vpc-7b52471c   172.78.0.0/16
Route Table          Databricks Main Route Table     rtb-3775c750
Peering Connection   Databricks VPC <> Aurora VPC    pcx-4d148024
Security Group       Databricks Unmanaged Group      sg-96016bef
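
The lookup in this step can be sketched as follows, assuming `ec2` is a boto3 EC2 client for the Databricks account. The helper name is hypothetical, and the sketch assumes the VPC has few enough security groups that the first page of results is complete.

```python
def find_unmanaged_security_groups(ec2, vpc_id):
    """Return (group_id, group_name) pairs whose name contains 'Unmanaged'."""
    response = ec2.describe_security_groups(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    return [
        (group["GroupId"], group["GroupName"])
        for group in response["SecurityGroups"]
        # Match "unmanaged" as a whole so that the Managed group, whose name
        # does not contain that substring, is excluded.
        if "unmanaged" in group["GroupName"].lower()
    ]


# Usage:
#   import boto3
#   ec2 = boto3.client("ec2")
#   print(find_unmanaged_security_groups(ec2, "vpc-dbcb3fbc"))
```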

Step 8: Add a rule for the unmanaged security group

  1. Select Security Groups in the VPC Dashboard.

  2. Search for the ID of the Aurora VPC.

  3. Add an Inbound Rule by clicking Edit and then Add Another Rule.

  4. Select Custom TCP Rule or the service that relates to your RDS.

  5. Set the Port Range to correspond to your RDS service. The default for Aurora [MySQL] is 3306.

  6. Set the Source to be the security group ID of the Unmanaged Databricks security group.

    Aurora Security Group Rule
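
A scripted version of this inbound rule might look like the sketch below. It assumes `ec2` is a boto3 EC2 client for the account that owns the Aurora security group; the helper name and the Aurora security group placeholder are hypothetical, and `source_account_id` is only an assumption about what a cross-account reference needs.

```python
def allow_inbound_from_group(ec2, target_group_id, source_group_id,
                             port=3306, source_account_id=None):
    """Allow TCP traffic on `port` into target_group_id from source_group_id.

    3306 is the Aurora MySQL default mentioned in this guide.
    """
    source = {"GroupId": source_group_id}
    if source_account_id is not None:
        # Assumption: when the source group lives in another account,
        # the owning account ID is passed alongside the group ID.
        source["UserId"] = source_account_id
    return ec2.authorize_security_group_ingress(
        GroupId=target_group_id,
        IpPermissions=[
            {
                "IpProtocol": "tcp",
                "FromPort": port,
                "ToPort": port,
                "UserIdGroupPairs": [source],
            }
        ],
    )


# Usage:
#   allow_inbound_from_group(aurora_ec2, "<aurora-security-group-id>",
#                            "sg-96016bef", port=3306,
#                            source_account_id="<databricks-account-id>")
```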

Step 9: Test connectivity

  1. Create a Databricks cluster.

  2. Check to see if you can connect to the database with the following netcat command:

    %sh nc -zv <hostname> <port>
    
    Validate Connectivity
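
If netcat is not available on the cluster, roughly the same check can be sketched in a Python notebook cell with the standard `socket` module. The function name `can_connect` and the 5-second timeout are illustrative choices, not part of the original procedure.

```python
import socket


def can_connect(hostname, port, timeout=5.0):
    """Return True if a TCP connection to hostname:port can be opened."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Usage, with the same placeholders as the netcat command:
#   can_connect("<hostname>", 3306)
```

Like `nc -z`, this only opens and closes a TCP connection; it does not authenticate against the database.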

Programmatic VPC peering

This notebook supports two scenarios:

  • Establishing VPC peering between Databricks VPC and another VPC in the same AWS account
  • Establishing VPC peering between Databricks VPC and another VPC in a different AWS account

Troubleshooting

Can’t establish connectivity with netcat

If you can’t establish connectivity with netcat, check that the hostname is resolving via DNS by using the host Linux command. If the hostname does not resolve, verify that you have enabled DNS resolution in your peering connection.

%sh host -t a <hostname>

Validate DNS Resolution
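
The DNS check can also be sketched in Python. The first helper resolves IPv4 addresses much as `host -t a` does; the second reports which of them fall inside a given CIDR range, which may help confirm that the hostname resolves to an address inside the Aurora VPC rather than a public one. Both function names are hypothetical.

```python
import ipaddress
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses for hostname, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def addresses_in_cidr(addresses, cidr):
    """Return the addresses that fall inside the given CIDR range."""
    network = ipaddress.ip_network(cidr)
    return [a for a in addresses if ipaddress.ip_address(a) in network]


# Usage, with the Aurora VPC range from this guide:
#   ips = resolve_ipv4("<hostname>")
#   print(ips, addresses_in_cidr(ips, "172.78.0.0/16"))
```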

Can’t establish connectivity with the hostname or the IP address

If you aren’t able to establish connectivity with either the hostname or the IP address, verify that the VPC of your Aurora RDS has 3 subnets associated with its main route table.

  1. Select Subnets from the VPC Dashboard and search for the ID of the Aurora VPC. There should be a subnet for each availability zone.

    Aurora VPC Subnets

  2. Make sure that each of those subnets is associated with the main route table.

    1. Select Route Tables from the VPC Dashboard and search for the main route table associated with the Aurora RDS.

    2. Click the Subnet Associations tab and then Edit. You should see all 3 subnets in the list, but none of them should have Associate selected.

      Aurora Subnet Associations
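
The subnet check above can be sketched in code. The helper below, whose name is hypothetical, assumes `ec2` is a boto3 EC2 client for the account that owns the Aurora VPC. It relies on the AWS behavior that a subnet with no explicit route table association uses the VPC's main route table, which corresponds to the console showing the subnet without Associate selected.

```python
def route_tables_by_subnet(ec2, vpc_id):
    """Map each subnet ID to (availability_zone, route_table_id, uses_main)."""
    vpc_filter = [{"Name": "vpc-id", "Values": [vpc_id]}]
    subnets = ec2.describe_subnets(Filters=vpc_filter)["Subnets"]
    route_tables = ec2.describe_route_tables(Filters=vpc_filter)["RouteTables"]

    main_table_id = None
    explicit = {}  # subnet ID -> explicitly associated route table ID
    for table in route_tables:
        for association in table.get("Associations", []):
            if association.get("Main"):
                main_table_id = table["RouteTableId"]
            if "SubnetId" in association:
                explicit[association["SubnetId"]] = table["RouteTableId"]

    report = {}
    for subnet in subnets:
        # Subnets without an explicit association fall back to the main table.
        table_id = explicit.get(subnet["SubnetId"], main_table_id)
        uses_main = table_id is not None and table_id == main_table_id
        report[subnet["SubnetId"]] = (subnet["AvailabilityZone"], table_id, uses_main)
    return report


# Usage:
#   import boto3
#   ec2 = boto3.client("ec2")
#   for subnet_id, info in route_tables_by_subnet(ec2, "vpc-7b52471c").items():
#       print(subnet_id, info)
```

Any subnet reported with `uses_main` set to False is routed through a different table and will not pick up the peering route added to the main route table.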