Filter Input Partitions

Sometimes, you may have tasks that only need a subset of the input layer partitions to produce their output layers. In such cases, it is usually desirable to exclude unnecessary partition keys as early as possible in the execution process to avoid wasting CPU, memory, and network bandwidth.

While it is possible to manually filter the input keys in the compiler's front-end implementation for most compilation patterns, for some patterns, such as the RefTreeCompiler, it can be difficult to know if a key should be filtered, or not, since we typically do not know if a subject or a referenced partition is being processed.

For these cases, the library provides a way to configure a partition filter to select partition keys and metadata from the list of partitions to process. This configuration works as follows:

  • For the RefTreeCompiler, only the subject partitions are filtered. This is something to remember, particularly if an input layer is itself the product of an upstream compiler in a multi-compiler driver task with filtering applied. If a compiler down the chain references neighbor partition names from a layer produced by an upstream compiler, make sure that the filtering is permissive enough to output all partitions to be referenced by any following compilation in the driver task. You can use the byId executor config to configure a different filter for a compiler.
  • The here.platform.data-processing.executors.partitionKeyFilters configuration is part of fingerprints, which means that changing this configuration triggers a non-incremental run for the compilation to remain deterministic.
  • Partition key filters can also be used with the DeltaSet interface. Partition key filters defined in here.platform.data-processing.executors.partitionKeyFilters are used in all query and readBack DeltaSet transformations to filter the input partitions.

Configure Filters

Partition key filtering can be configured in the application.conf. For example, this filter can configure a task to only process partitions contained within a given latitude/longitude bounding box:

here.platform.data-processing.executors.partitionKeyFilters = [
  {
    className = "BoundingBoxFilter"
    param.boundingBox { north = 24.8, south = 24.68, east = 121.8, west = 121.7 }
  }
]

The root of the property is a list, multiple filters specified at this level are combined as their union (OR logic). If there is a need to combine them with AND logic, they can be put under a single AndFilter at the root.

This is the list of built-in filters:

  • BoundingBoxFilter
  • AllowListFilter
  • AndFilter
  • OrFilter
  • NotFilter

A custom filter can also be applied from applications by extending the PartitionKeyFilter. These filters return either true or false from shouldProcess. Boolean logic is applied to combine them up to the root filter, such as:

  • Or of two bounding boxes is a union
  • And of two bounding boxes is an intersection
  • Notof two bounding boxes means that only partitions outside of the underlying bounding box are considered

Filters can only be applied on partition key parameters, for example, catalog and layer IDs, and partition name.

NOTE: For performance reasons, filters do not filter based on partition metadata or payloads.

Override Filters for Specific Compilers

If one of the compilers needs a different partition filter applied, you can use the byId mechanism to configure a different set of filter for the executor that wraps it. Overriding these filters means replacing the whole default set of filters, as they are not combined.

Example:

// Application-wide default filter.
here.platform.data-processing.executors.partitionKeyFilters = [
  {
    className = "BoundingBoxFilter"
    param.boundingBox { north = 24.8, south = 24.68, east = 121.8, west = 121.7 }
  }
]

// Filters can be overridden for specific executors/compilers using their Executor.Id.
here.platform.data-processing.executors.byId {
  // Apply a larger bounding box filter to the input layers of this compiler to
  // include neighbor partitions.
  intermediate-compiler.partitionKeyFilters = [
    {
      className = "BoundingBoxFilter"
      param.boundingBox { north = 24.9, south = 24.58, east = 121.9, west = 121.6 }
    }
  ]
  // This compiler isn't using filtering.
  another-compiler.partitionKeyFilters = []
}

Apply a Filter to Specific Layers Only

The AllowListFilter can be used to filter based on a fixed list of partition names. When combined with Boolean operation filters, AllowListFilter can also be used to apply some filters to specific layers only.

For example, to apply a bounding box only to inLayer of inCatalog, you can configure your application as follows:

// Match (process) a partition key if:
here.platform.data-processing.executors.partitionKeyFilters = [
  // NOT layer in inCatalog:inLayer
  {
    className = "NotFilter"
    param.operand = {
      className = "AllowListFilter"
      param.catalogsAndLayers = {"inCatalog": ["inLayer"]}
    }
  },
  // OR (layer in inCatalog:inLayer AND partition intersects boundingBox)
  {
    className = "AndFilter"
    param.operands = [
      {
        className = "AllowListFilter"
        param.catalogsAndLayers = {"inCatalog": ["inLayer"]}
      }, {
        className = "BoundingBoxFilter"
        param.boundingBox { north = 2.8, south = 2.68, east = 121.8, west = 121.7 }
      }
    ]
  }
]

For additional information about configuring partition key filters see Configure The Library.

results matching ""

    No results matching ""