Stateful Processing

The Data Processing Library lets you process input data snapshots by querying the current base version of the output catalog. Consequently, you can perform stateful data processing, where the current input data is processed by taking into consideration the output produced from the previous run.

A DriverTask that runs a stateful compiler
A DriverTask that runs a stateful compiler

Data from the output catalog at its current version, can be referred to by using the Default.FeedbackCatalogId identifier, in the same way as the Default.OutCatalogId and input catalog identifiers are used to refer to the other output and input layers.

NOTE The feedbackRetriever of the DriverContext interface provides the functionality to load the required data to perform stateful processing, or you can access the Retriever for the FeedbackCatalogId through the inRetriever method of the DriverContext.

NOTE All compilation patterns in the processing library still apply when you perform stateful processing. Moreover, this feature does not impact any concept, functionality, or require any special configuration in the environment where the application runs, typically the Pipeline API.

Creating the initial state is delegated to the developers using the processing library. This initial state should be committed to the output catalog before the application runs. In most cases, it is sufficient to represent the initial state as an empty output catalog and implement logic in the compiler to define the initial processing values when the feedback layers contain no data, such as setting the counter to zero, initializing an object containing some information to represent the initial state of the processing, and so on.

The feedback catalog is the output catalog version which is produced in the compiler's last run. You can access earlier versions of the output catalog using the concepts described in multiple catalog versions.

The section below describes an example of stateful processing, covering the major design decisions you should consider for this type of data processing.

Example: POI Data Changes Counter

A Point of Interest (POI) is an object that models a place in the real world. In this context, places can be shops, tourist attractions, or similar. A POI may contain one of the following properties:

  • id
  • type
  • geocoordinates (lat, long)
  • name
  • phone.

Among these properties, we assume that id, type, and the geocoordinates do not change over time; other information may change across different data releases.

Suppose we need to count the number of times POIs information have changed over time. The output catalog contains information related to all POIs ever stored in the input layer, because we want to keep track of the overall history for the POIs layer.

This is an example of stateful processing where the result of a previous elaboration (the current values of the counters) needs to be included as additional input for the compiler.

In this context, we assume that POIs are stored in a single layer and we want the output layer to have the same tiling as the input layer.

Design of Output Layers and Feedback Layers

For this use case, we have one layer for both the output layer and the feedback layer. This layer should contain the following properties for each POI:

  • id
  • name
  • phone
  • counter that holds the number of times a given POI content changed

The name and phone members are the information that is monitored.

Initial State

In this case, the initial state can consist of an empty output layer. The compiler's logic should consider any tile or POI that has not yet been processed as new input, and then set new counter members to zero and fill the information being monitored.

Processing Pattern

Since there is a fixed correspondence between couples of input and feedback keys versus a single output layer key, the most suitable processing pattern is the Direct M:N pattern described in Direct 1:N and M:N Compilers.

Processing Logic

The processing logic works on a tile basis. The output tile content is initialized to contain the current feedback tile content, when available. The empty output catalog is considered a valid initial state.

For each POI in the input layer:

  • if the POI is not in the feedback layer:
    • add a new entry to the output layer by setting its counter to zero and the id, the name, and the phone to the value of the POI
  • else if the name and/or the phone changed:
    • update the changed information in the output layer and increase the counter by 1

Note that running the above processing a second time, on a given input produces no changes in the output layer. This is the desirable property for stateful processing where the output converges for a given configuration in one step.

results matching ""

    No results matching ""