Compilation of Workflow Definition¶
Compilation is a process that takes a document written in a programming language, checks its correctness, and transforms it into a format that the execution environment can understand.
A similar process happens in the Workflows ecosystem whenever you want to run a Workflow Definition. The Workflows Compiler performs several steps to transform a JSON document into a computation graph, which is then executed by the Workflows Execution Engine. While this process can be complex, understanding it can be helpful for developers contributing to the ecosystem. In this document, we outline key details of the compilation process to assist in building Workflow blocks and encourage contributions to the core Execution Engine.
Note
This document covers the design of Execution Engine v1 (the current stable version). Please review the information about versioning to understand the Execution Engine development cycle.
Stages of compilation¶
Workflow compilation involves several stages, including:
- Loading available blocks: Gathering all the blocks that can be used in the Workflow, based on the configuration of the execution environment
- Compiling dynamic blocks: Turning dynamic block definitions into standard Workflow Blocks
- Parsing the Workflow Definition: Reading and interpreting the JSON document that defines the Workflow and detecting syntax errors
- Building the Workflow Execution Graph: Creating a graph that defines how data will flow through the Workflow during execution and verifying Workflow integrity
- Initializing Workflow steps from blocks: Setting up the individual Workflow steps based on the available blocks, step definitions and the configuration of the execution environment
Let's take a closer look at each of the workflow compilation steps.
Workflows blocks loading¶
As described in the blocks bundling guide, a group of Workflow blocks can be packaged into a workflow plugin. A plugin is essentially a standard Python library that, in its main module, exposes specific functions allowing Workflow Blocks to be dynamically loaded.
The Workflows Compiler and Execution Engine are designed to be independent of specific Workflow Blocks, and the Compiler has the ability to discover and load blocks from plugins.
Roboflow provides the roboflow_core plugin, which includes a set of basic Workflow Blocks that are always loaded by the Compiler, as both the Compiler and these blocks are bundled in the inference package.
For custom plugins, once they are installed in the Python environment, they need to be referenced using an environment variable called WORKFLOWS_PLUGINS. This variable should contain the names of the Python packages that contain the plugins, separated by commas.
For example, if you have two custom plugins, numpy_plugin and pandas_plugin, you can enable them in your Workflows environment by setting:
export WORKFLOWS_PLUGINS="numpy_plugin,pandas_plugin"
Both numpy_plugin and pandas_plugin are not paths to library repositories, but rather the names of the main modules of the libraries shipping the plugins (import numpy_plugin must work in your Python environment for the plugin to be loadable).
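To make this concrete, below is a rough sketch of what such a plugin's main module might contain. The load_blocks hook and the WorkflowBlock base class follow the blocks bundling guide; numpy_plugin and MyCustomBlock are hypothetical names used only for illustration.
# numpy_plugin/__init__.py - a rough sketch of a plugin's main module
# (the exact contract is described in the blocks bundling guide;
# MyCustomBlock is a hypothetical block class living elsewhere in the package)
from typing import List, Type

from inference.core.workflows.prototypes.block import WorkflowBlock

from numpy_plugin.blocks import MyCustomBlock  # hypothetical block


def load_blocks() -> List[Type[WorkflowBlock]]:
    # the Compiler imports the package named in WORKFLOWS_PLUGINS and
    # calls this function to discover the plugin's blocks
    return [MyCustomBlock]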
Once the Compiler loads all plugins, it is ready for the next stage of compilation.
Compilation of dynamic blocks¶
Note
The topic of dynamic Python blocks is covered on a separate docs page. To understand the content of this section, you only need to know that there is a way to define Workflow Blocks in-place in a Workflow Definition, specifying both the block manifest and the Python code in the JSON document. This functionality only works if you run the Workflows Execution Engine on your own hardware and is disabled on the Roboflow hosted platform.
The Workflows Compiler can transform Dynamic Python Blocks, defined directly in a Workflow Definition, into full-fledged Workflow Blocks at runtime. The Compiler generates these block classes dynamically based on the block's definition, eliminating the need for developers to manually create them as they would in a plugin.
Once this process is complete, the dynamic blocks are added to the pool of available Workflow Blocks. These blocks can then be used in the steps section of your Workflow Definition, just like any other standard block.
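Conceptually (and only conceptually - the snippet below is not the Compiler's actual implementation), "compiling" a dynamic block boils down to materializing the declared Python code and wrapping it into a class at runtime:
# conceptual sketch only - not the Compiler's actual implementation
def compile_dynamic_block(block_name: str, run_function_code: str) -> type:
    namespace = {}
    exec(run_function_code, namespace)  # materialize the declared run(...) function
    return type(block_name, (), {"run": staticmethod(namespace["run"])})


MyDynamicBlock = compile_dynamic_block(
    block_name="MyDynamicBlock",
    run_function_code="def run(image):\n    return {'result': image}",
)
print(MyDynamicBlock.run("some image"))  # {'result': 'some image'}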
Parsing Workflow Definition¶
Once all Workflow Blocks are loaded, the Compiler retrieves the manifest classes for each block.
These manifests are pydantic data classes that define the structure of step entries in the definition.
At the parsing stage, errors in the Workflow Definition are reported, for example:
- usage of non-existing blocks
- invalid configuration of steps
- lack of required parameters for steps
Thanks to pydantic, the Workflows Compiler doesn't need its own parser. Additionally, block creators use a standard Python library to define block manifests.
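For intuition, here is a minimal sketch of what a block manifest looks like in spirit - a pydantic model describing a step entry. All names below are hypothetical; real manifests extend the base manifest class shipped with inference and use dedicated selector field types, as covered in the blocks development guide.
# hypothetical manifest sketch - illustrates the idea, not the real API
from typing import Literal

from pydantic import BaseModel, Field


class MyBlockManifest(BaseModel):
    # discriminator matching a step entry in the "steps" section to this block
    type: Literal["my_plugin/my_block@v1"]
    # unique name of the step within the Workflow
    name: str
    # selector pointing at the data to consume,
    # e.g. "$inputs.image" or "$steps.detection.predictions"
    image: str = Field(description="Selector to image data")
    # plain configuration value with a default
    confidence: float = 0.5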
Building Workflow Execution Graph¶
Building the Workflow Execution graph is the most critical stage of Workflow compilation. Here's how it works:
Adding Vertices¶
First, each input, step and output is added as a vertex in the graph, with each vertex given a special label for future identification. These vertices also include metadata, like marking input vertices with seeds for data lineage tracking (more on this later).
Adding Edges¶
After placing the vertices, the next step is to create edges between them based on the selectors defined in the Workflow. The Compiler examines the block manifests to determine which properties can accept selectors and the expected "kind" of those selectors. This enables the Compiler to detect errors in the Workflow definition, such as:
- Providing an output kind from one step that doesn't match the expected input kind of the next step.
- Referring to non-existent steps or inputs.
Each edge also contains metadata indicating which input property is fed by the output data, which is helpful at later stages of compilation and during execution.
Note
Normally, step inputs "request" data from step outputs, forming an edge from Step A's output to Step B's input during Step B's processing. However, control-flow blocks are an exception, as they both accept data and declare other steps in the manifest, creating a special flow-control edge in the graph.
Structural Validation¶
Once the graph is constructed, the Compiler checks for structural issues like cycles to ensure the graph can be executed properly.
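As a conceptual illustration only (not the Compiler's actual code), a cycle check over a directed graph of steps could look like this, using networkx:
# conceptual illustration of structural validation - not the Compiler's
# actual implementation
import networkx as nx

graph = nx.DiGraph()
graph.add_edge("$inputs.image", "$steps.detection")
graph.add_edge("$steps.detection", "$steps.crop")
graph.add_edge("$steps.crop", "$outputs.crops")

if not nx.is_directed_acyclic_graph(graph):
    raise ValueError("Workflow graph contains a cycle and cannot be executed")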
Data Lineage verification¶
Finally, data lineage properties are populated from input nodes and carried through the graph. So, what is data lineage? Lineage is a list of identifiers that track the creation and nesting of batches through the steps, determining:
- the source path of data
- the dimensionality level of data
- compatibility of different pieces of data that may be referred to by a step - ensuring that a step will only take corresponding batch elements from multiple sources (such that a batch element index, for example (1, 2), refers to the exact same piece of data when two batch-oriented inputs are connected to the step, and not to some randomly provided batches with different lineage that would not make sense to process together)
Each time a new nested batch is created by a step, a unique identifier is added to the lineage of the output. This allows the Compiler to track and verify if the inputs across steps are compatible.
Note
The fundamental assumption of data lineage is that all batch-oriented inputs are granted the same lineage identifier - so implicitly it enforces that all input batches are fed with data that has corresponding data points at corresponding positions. For instance, if your Workflow compares image_1 to image_2 (and you declare those two inputs in the Workflow Definition), the Compiler assumes image_1[3] to correspond with image_2[3].
Thanks to lineage tracking, the Compiler can detect potential mistakes. For example, if you attempt to connect two dynamic crop outputs to a single step's inputs, the Compiler will notice that the number of crops in each output may not match. This would result in nested batch elements with mismatched indices, which could lead to unpredictable results during execution if the situation is not prevented.
Example of lineage mismatch
Imagine the following scenario:
- you declare a single image input in your Workflow
- at first, you perform object detection using two different models
- you use two dynamic crop steps - to crop based on the first and the second model's predictions respectively
- now you want to use a block that compares the features of two images (using classical Computer Vision methods)
What would you expect to happen when you plug inputs from those two crop steps into comparison block?
- Without tracing the lineage, you would "flatten" and "zip" those two batches and pass pairs of images to the comparison block - the problem is that in this case you cannot determine whether the comparisons between those elements actually make sense - most likely they do not!
- With lineage tracing, the Compiler knows that you attempt to feed two batches whose lineages do not match at the last nesting level and raises a compilation error.
One may ask: "ok, but maybe I would like to apply a secondary classifier on both crops and merge the results at the end to get all results in a single output - is that possible?". The answer is yes - as mentioned above, the nested batches differ only at the last lineage level, so when we use blocks from the "dimensionality collapse" category, we align the results of the secondary classifiers into batches at dimensionality level 1 with matching lineage.
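The bookkeeping behind this can be sketched as follows (identifiers and structure are illustrative only, not the Compiler's internal representation):
# illustrative sketch of lineage bookkeeping
input_lineage = ["<workflow_input>"]              # shared by all batch-oriented inputs
crop_1_lineage = input_lineage + ["crop_step_1"]  # dynamic crop adds a nesting level
crop_2_lineage = input_lineage + ["crop_step_2"]  # a different identifier is appended

# feeding both crops into one step: lineages differ at the last level -> error
assert crop_1_lineage != crop_2_lineage

# a "dimensionality collapse" block drops the last identifier,
# so the collapsed results are compatible again at dimensionality level 1
assert crop_1_lineage[:-1] == crop_2_lineage[:-1]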
As outlined in the section dedicated to blocks development, each block can define the expected dimensionality of its inputs and outputs. This refers to how the data should be structured. For example, if a block needs an image input that's one level above a batch of predictions, the Compiler will check that this requirement is met when verifying the Workflow step. If the connections between steps don't match the expected dimensionality, an error will occur. Additionally, each input is also verified to ensure it is compatible based on data lineage. Once the step passes validation, the output dimensionality is determined and will be used to check compatibility with subsequent steps.
It’s important to note that blocks define dimensionality requirements in relative terms, not absolute. This means a block specifies the difference (or offset) in dimensionality between its inputs and outputs. This approach allows blocks to work flexibly at any dimensionality level.
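A toy illustration of that bookkeeping (numbers and names are made up and do not reflect the actual blocks API):
# toy illustration of relative dimensionality - not the actual blocks API
image_dimensionality = 1      # e.g. a batch of input images (lineage length 1)
predictions_offset = 1        # block declares: predictions sit one level deeper than the image
output_offset = 0             # block declares: output sits at the level of the image

predictions_dimensionality = image_dimensionality + predictions_offset  # -> 2
output_dimensionality = image_dimensionality + output_offset            # -> 1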
Note
In version 1, the Workflows Compiler only supports blocks that work across two different dimensionality levels. This was done to keep the design straightforward. If there's a need for blocks that handle more dimensionality levels in the future, we will consider expanding this support.
Denoting flow-control¶
The Workflows Compiler helps the Execution Engine manage flow-control structures in workflows. It marks specific attributes that allow the system to understand how flow-control impacts building inputs for certain steps and the execution of the workflow graph (for more details, see the Execution Engine docs).
To ensure the workflow structure is correct, the Compiler checks data lineage for flow-control steps in a similar way as described in the section on data-lineage verification.
The Compiler assumes flow-control steps can affect other steps if:
- The flow-control step operates on non-batch-oriented inputs - in this case, the flow-control step can either allow or prevent the connected step (and related steps) from running entirely, even if the input is a batch of data - all batch elements are affected.
- The flow-control step operates on batch-oriented inputs with compatible lineage - here, the flow-control step can decide separately for each element in the batch which ones will proceed and which ones will be stopped.
Batch-orientation compatibility¶
As outlined earlier, Workflows define batch-oriented data and scalars. From the description of the nature of data in Workflows, you can also conclude that operations executed against batch-oriented data have two almost equivalent ways of running (see the sketch after this list):
- all-at-once: taking whole batches of data and processing them
- one-by-one: looping over batch elements and getting results sequentially
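The sketch below illustrates, conceptually, the difference between the two modes (the functions are made up and are not the actual block interface):
# conceptual illustration of the two modes - not the actual block interface
images = ["image_0", "image_1", "image_2"]  # a batch of 3 elements


def run_on_single_element(image):
    return {"label": f"processed {image}"}


def run_on_whole_batch(batch):
    return [{"label": f"processed {image}"} for image in batch]


# one-by-one: the Execution Engine loops over batch elements for the block
results_one_by_one = [run_on_single_element(image) for image in images]

# all-at-once: a batch-oriented block receives the whole batch and emits
# one result per element itself
results_all_at_once = run_on_whole_batch(images)

assert results_one_by_one == results_all_at_once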
Since the default way for Workflow blocks to deal with batches is to consume them element by element, there is no real difference between batch-oriented data and scalars in such a case. The Execution Engine simply unpacks scalars from batches and passes them to each step.
The process becomes more complicated when a block accepts batch input. You will learn the details in the blocks development guide, but a block is required to denote each input that must be provided batch-wise, as well as all inputs which can be fed with both batch-oriented data and scalars at the same time (which is a much less common case). In such cases, lineage is used to deduce whether the actual data fed into each step input is a batch or a scalar. When a violation is detected (for instance, a scalar is provided for an input that requires batches, or vice versa), an error is raised.
Potential future improvements
At this moment, we are not sure whether the behaviour described above limits the potential of the Workflows ecosystem. If your Workflows cannot run due to errors resulting from the described mechanism, please let us know in a GitHub issue.
Initializing Workflow steps from blocks¶
The documentation often refers to a Workflow Step as an instance of a Workflow Block, which serves as its prototype. To put it simply, a Workflow Block is a class that implements specific behavior, which can be customized by configuration—whether it's set by the environment running the Execution Engine, the Workflow definition, or inputs at runtime.
In programming, we create an instance of a class using a constructor, usually requiring initialization parameters. On the same note, Workflow Blocks are initialized by the Workflows Compiler whenever a step in the Workflow references that block. Some blocks may need specific initialization parameters, while others won't.
When a block requires initialization parameters:
- The block must declare the parameters it needs, as explained in detail in the blocks development guide.
- The values for these parameters must be provided from the environment where the Workflow is being executed.
This second part might seem tricky, so let’s look at an example. In the user guide, under the section showing how to integrate with Workflows using the inference Python package, you might come across code like this:
[...]
# example init parameters for blocks - dependent on the set of blocks
# used in your workflow
workflow_init_parameters = {
"workflows_core.model_manager": model_manager,
"workflows_core.api_key": "<YOUR-API-KEY>,
"workflows_core.step_execution_mode": StepExecutionMode.LOCAL,
}
# instance of Execution Engine - init(...) method invocation triggers
# the compilation process
execution_engine = ExecutionEngine.init(
...,
init_parameters=workflow_init_parameters,
...,
)
[...]
In this example, workflow_init_parameters contains values that the Compiler uses when initializing Workflow steps based on block requests.
Initialization parameters (often called "init parameters") can be passed to the Compiler in two ways:
- Explicitly: You provide specific values (numbers, strings, objects, etc.).
- Implicitly: Default values are defined within the Workflows plugin, which can either be specific values or functions (taking no parameters) that generate values dynamically, such as from environment variables (see the sketch below).
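As a rough sketch of the implicit path, a plugin's main module can register defaults like the ones below. The REGISTERED_INITIALIZERS hook name is taken from the blocks bundling guide - treat it as an assumption and check that guide for the exact contract; the parameter names and the environment variable are hypothetical.
# numpy_plugin/__init__.py - sketch of implicit (default) init parameters
import os


def _api_key_from_env() -> str:
    # parameter-free function, evaluated when the Compiler needs the value
    return os.environ.get("MY_PLUGIN_API_KEY", "")


REGISTERED_INITIALIZERS = {
    "api_key": _api_key_from_env,  # resolved lazily via the function
    "max_batch_size": 8,           # plain default value
}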
The dictionary workflow_init_parameters shows explicitly passed init parameters. The structure of the keys is important: {plugin_name}.{init_parameter_name}. You can also specify just {init_parameter_name}, but this changes how parameters are resolved.
How Are Parameters Resolved?¶
When the Compiler looks for a block’s required init parameter, it follows this process:
- Exact Match: It first checks the explicitly provided parameters for an exact match to {plugin_name}.{init_parameter_name}.
- Default Parameters: If no match is found, it checks the plugin's default parameters.
- General Match: Finally, it looks for a general match with just {init_parameter_name} in the explicitly provided parameters.
This mechanism allows flexibility, as some block parameters can have default values while others must be provided explicitly. Additionally, it lets certain parameters be shared across different plugins.
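A conceptual sketch of that lookup order (not the Compiler's actual code):
# conceptual sketch of init-parameter resolution - not the actual implementation
def resolve_init_parameter(plugin_name, parameter_name, explicit, defaults):
    # 1. exact match: "{plugin_name}.{init_parameter_name}" provided explicitly
    full_name = f"{plugin_name}.{parameter_name}"
    if full_name in explicit:
        return explicit[full_name]
    # 2. default registered by the plugin (plain value or parameter-free function)
    if parameter_name in defaults:
        default = defaults[parameter_name]
        return default() if callable(default) else default
    # 3. general match: bare "{init_parameter_name}" provided explicitly
    if parameter_name in explicit:
        return explicit[parameter_name]
    raise KeyError(f"Cannot resolve init parameter: {full_name}")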