Secure Access to S3 Buckets Across Accounts Using IAM Roles with an AssumeRole Policy

In AWS you can set up cross-account access, so that compute resources in one account can access a bucket in another account. One way to grant access, described in Secure Access to S3 Buckets Using IAM Roles, is to grant an account direct access to a bucket in another account. Another way to grant access to a bucket is to allow an account to assume a role in the other account.

Consider AWS Account A <Databricks-deployment-s3> and AWS Account B <bucket-owner-acct-id>. Account A is the account used to sign up with Databricks, so EC2 services and the DBFS root bucket are managed by this account. Account B owns the bucket <s3-bucket-name>.

[Diagram: assume-role.png]

This topic describes how to configure Account A to use the AWS AssumeRole action so that clusters in Account A can access S3 files in Account B by assuming a role defined in Account B. To enable this access, you perform configuration in Account A and Account B, in the Databricks Admin Console, when you configure a Databricks cluster, and when you run a notebook that accesses the bucket.

Requirements

  • AWS administrator access to IAM roles and policies in the AWS account of the Databricks deployment and the AWS account of the S3 bucket.
  • A target S3 bucket, <s3-bucket-name>, in Account B.
  • If you intend to enable encryption for the S3 bucket, you must add the IAM role as a Key User for the KMS key provided in the configuration. See Configure KMS encryption.

Step 1: In Account A, create role MyRoleA and attach policies

  1. Create a role named MyRoleA in Account A. The Instance Profile ARN is arn:aws:iam::<Databricks-deployment-s3>:instance-profile/MyRoleA. (An example trust relationship for this role appears at the end of this step.)

  2. Create a policy that allows MyRoleA in Account A to assume MyRoleB in Account B, and attach it to MyRoleA. Click Inline policy and paste in the following policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1487884001000",
          "Effect": "Allow",
          "Action": [
            "sts:AssumeRole"
          ],
          "Resource": [
            "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB"
          ]
        }
      ]
    }
    
  3. Update the policy for the Account A role used to create clusters, adding a statement that allows the iam:PassRole action on MyRoleA:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1403287045000",
          "Effect": "Allow",
          "Action": [
            "ec2:AssociateDhcpOptions",
            "ec2:AssociateIamInstanceProfile",
            "ec2:AssociateRouteTable",
            "ec2:AttachInternetGateway",
            "ec2:AttachVolume",
            "ec2:AuthorizeSecurityGroupEgress",
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:CancelSpotInstanceRequests",
            "ec2:CreateDhcpOptions",
            "ec2:CreateInternetGateway",
            "ec2:CreateKeyPair",
            "ec2:CreatePlacementGroup",
            "ec2:CreateRoute",
            "ec2:CreateSecurityGroup",
            "ec2:CreateSubnet",
            "ec2:CreateTags",
            "ec2:CreateVolume",
            "ec2:CreateVpc",
            "ec2:CreateVpcPeeringConnection",
            "ec2:DeleteInternetGateway",
            "ec2:DeleteKeyPair",
            "ec2:DeletePlacementGroup",
            "ec2:DeleteRoute",
            "ec2:DeleteRouteTable",
            "ec2:DeleteSecurityGroup",
            "ec2:DeleteSubnet",
            "ec2:DeleteTags",
            "ec2:DeleteVolume",
            "ec2:DeleteVpc",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeIamInstanceProfileAssociations",
            "ec2:DescribeInstanceStatus",
            "ec2:DescribeInstances",
            "ec2:DescribePlacementGroups",
            "ec2:DescribePrefixLists",
            "ec2:DescribeReservedInstancesOfferings",
            "ec2:DescribeRouteTables",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSpotInstanceRequests",
            "ec2:DescribeSpotPriceHistory",
            "ec2:DescribeSubnets",
            "ec2:DescribeVolumes",
            "ec2:DescribeVpcs",
            "ec2:DetachInternetGateway",
            "ec2:DisassociateIamInstanceProfile",
            "ec2:ModifyVpcAttribute",
            "ec2:ReplaceIamInstanceProfileAssociation",
            "ec2:RequestSpotInstances",
            "ec2:RevokeSecurityGroupEgress",
            "ec2:RevokeSecurityGroupIngress",
            "ec2:RunInstances",
            "ec2:TerminateInstances"
          ],
          "Resource": [
              "*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": [
            "arn:aws:iam::<Databricks-deployment-s3>:role/MyRoleA"
          ]
        }
      ]
    }
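
Because MyRoleA is used as an EC2 instance profile, its trust relationship must allow EC2 to assume the role. If you create the role in the IAM console and choose EC2 as the trusted service, this trust relationship is created for you; it looks similar to the following (shown for reference only):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "ec2.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }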
    

Step 2: In Account B, create role MyRoleB and attach policies

  1. Create a role named MyRoleB. The Role ARN is arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB.

  2. Edit the trust relationship of role MyRoleB to allow role MyRoleA in Account A to assume MyRoleB. Select IAM > Roles > MyRoleB > Trust relationships > Edit trust relationship and enter:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::<Databricks-deployment-s3>:role/MyRoleA"
            ]
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    
  3. Create a bucket policy for the bucket <s3-bucket-name>. Select S3 > <s3-bucket-name> > Permissions > Bucket Policy and enter the following statements, which grant bucket-level and object-level permissions (the Principal is added in the next step):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetBucketLocation",
            "s3:ListBucket"
          ],
          "Resource": [
              "arn:aws:s3:::<s3-bucket-name>"
          ]
        },
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:PutObjectAcl",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": [
              "arn:aws:s3:::<s3-bucket-name>/*"
          ]
        }
      ]
    }
    
  4. Add the role MyRoleB as the Principal in each statement of the bucket policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
                "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB",
            ]
          },
          "Action": [
            "s3:GetBucketLocation",
            "s3:ListBucket"
          ],
          "Resource": "arn:aws:s3:::<s3-bucket-name>"
        },
        {
          "Effect": "Allow",
          "Principal": {
              "AWS": [
                  "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB",
              ]
          },
          "Action": [
            "s3:PutObject",
            "s3:PutObjectAcl",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": "arn:aws:s3:::<s3-bucket-name>/*"
        }
      ]
    }
    

Tip

If you are prompted with a Principal error, make sure that you modified only the Trust relationship policy.

Step 3: Add MyRoleA to a Databricks workspace using Account A instances

In the Databricks Admin Console, add the IAM role MyRoleA to Databricks using the MyRoleA instance profile ARN, arn:aws:iam::<Databricks-deployment-s3>:instance-profile/MyRoleA, from Step 1.

Step 4: Configure cluster with MyRoleA

  1. In the Databricks UI of the same workspace, create a cluster.

  2. On the Instances tab, select the IAM role MyRoleA.

  3. On the Spark tab of the cluster detail page, optionally set the AssumeRole credentials type and the ARN of the role to assume, MyRoleB:

    spark.hadoop.fs.s3a.credentialsType AssumeRole
    spark.hadoop.fs.s3a.stsAssumeRole.arn arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB
    spark.hadoop.fs.s3a.canned.acl BucketOwnerFullControl
    spark.hadoop.fs.s3a.acl.default BucketOwnerFullControl
    
  4. Start the cluster.
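
After the cluster starts, you can optionally verify that the role chain works before configuring any S3 access. The following is a minimal sketch for a Scala notebook attached to this cluster; it assumes the AWS SDK for Java classes are available on the cluster classpath (typically the case on Databricks Runtime) and uses an arbitrary session name. The call succeeds only if MyRoleA is allowed to assume MyRoleB (Step 1) and MyRoleB trusts MyRoleA (Step 2).

    import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder
    import com.amazonaws.services.securitytoken.model.AssumeRoleRequest

    // The default credentials provider chain picks up the instance profile credentials of MyRoleA.
    val sts = AWSSecurityTokenServiceClientBuilder.defaultClient()

    // "databricks-assume-role-check" is an arbitrary session name used only for this test.
    val result = sts.assumeRole(
      new AssumeRoleRequest()
        .withRoleArn("arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB")
        .withRoleSessionName("databricks-assume-role-check")
    )

    println(s"Assumed role: ${result.getAssumedRoleUser.getArn}")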

Step 5: Access bucket in Account B from a notebook running on cluster with MyRoleA

  1. Attach a notebook to the cluster you created and configured in Step 4.

  2. If the notebook is attached to a cluster running Databricks Runtime 4.0 or above, optionally mount the S3 bucket on DBFS, passing the AssumeRole settings as extra configurations:

    dbutils.fs.mount("s3a://<s3-bucket-name>", "/mnt/<s3-bucket-name>",
      extraConfigs = Map(
        "fs.s3a.credentialsType" -> "AssumeRole",
        "fs.s3a.stsAssumeRole.arn" -> "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB",
        "spark.hadoop.fs.s3a.canned.acl" -> "BucketOwnerFullControl",
        "spark.hadoop.fs.s3a.acl.default" -> "BucketOwnerFullControl"
      )
    )
    

    You can use such a mount in any cluster running Databricks Runtime 3.4 and above.

    Note

    This is the recommended option.

  3. If you did not set the AssumeRole credentials type and role ARN in the cluster's Spark configuration, and did not mount the S3 bucket, you can set them in the first command of a notebook:

    sc.hadoopConfiguration.set("fs.s3a.credentialsType", "AssumeRole")
    sc.hadoopConfiguration.set("fs.s3a.stsAssumeRole.arn", "arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB")
    sc.hadoopConfiguration.set("fs.s3a.canned.acl", "BucketOwnerFullControl")
    sc.hadoopConfiguration.set("fs.s3a.acl.default", "BucketOwnerFullControl")
    
  4. Verify that you can access the S3 bucket using the following command:

    dbutils.fs.ls("s3a://<s3-bucket-name>/")
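
If listing succeeds, you can go a step further with a quick write/read smoke test. This minimal sketch assumes the mount point created in step 2 and uses a hypothetical object key, tmp/assume-role-test.txt; the write requires the s3:PutObject permission granted in Step 2. If you set the AssumeRole credentials on the cluster or in the notebook instead of mounting, use the s3a://<s3-bucket-name>/ path in place of the mount path.

    // Write a small test object through the mount point from step 2
    // (tmp/assume-role-test.txt is a hypothetical key used only for this check).
    dbutils.fs.put("/mnt/<s3-bucket-name>/tmp/assume-role-test.txt", "written from Account A", true)

    // Read it back through the mount.
    spark.read.text("/mnt/<s3-bucket-name>/tmp/assume-role-test.txt").show(false)

    // Clean up the test object.
    dbutils.fs.rm("/mnt/<s3-bucket-name>/tmp/assume-role-test.txt")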