Overlap Join

Overlap joins, which report the shared intervals between two overlapping features, are a common operation in genomic analysis. A canonical example is identifying the aligned genomic reads which overlap with a set of target sites. In a single node setting, this is often accomplished using the bedtools intersect command line tool.

Databricks Runtime HLS automatically optimizes for overlap joins in Spark SQL.

Usage

To use the overlap join optimization, import the overlaps function and pass in four DataFrame columns corresponding to the start and end of the pair of features. We define coordinates according to half-open intervals, in which the start is included and the end is excluded.

from hls.expressions import overlaps
reads.join(targets, overlaps(reads.reads_start, reads.reads_end, targets.targets_start, targets.targets_end))
import org.apache.spark.sql.hls.dsl.expressions.overlaps
reads.join(targets, overlaps(reads("reads_start"), reads("reads_end"), targets("targets_start"), targets("targets_end")))

Configuration

The optimization for overlap join is based on creating bins in the feature range. The default bin size is set to 5000 base pairs by default. If you are working on an application with significantly longer targets, we suggest increasing your bin size as follows.

spark.databricks.hls.rangeJoin.binSize 5000

Example

Check out this demonstration notebook to observe the speedup resulting from our overlap join optimization.