Creating a Python data model

Every plugin needs a schema to represent its expected inputs and outputs in a machine-readable format. Strong typing in the schema is a core design element of Arcaflow, enabling us to build portable workflows that compartmentalize failure conditions and avoid data errors.

When creating a data model for Arcaflow plugins in Python, everything starts with dataclasses. They allow Arcaflow to get information about the data types of individual fields in your class:

plugin.py

import dataclasses


@dataclasses.dataclass
class MyDataModel:
    some_field: str
    other_field: int
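To see what type information a dataclass exposes, here is a minimal sketch that uses only the standard library (dataclasses.fields and typing.get_type_hints). It illustrates the kind of introspection dataclasses make possible and is not the SDK's actual schema-building code.

```python
import dataclasses
import typing


@dataclasses.dataclass
class MyDataModel:
    some_field: str
    other_field: int


# Map every field name to its declared type.
field_types = typing.get_type_hints(MyDataModel)

# dataclasses.fields() returns the fields in declaration order.
for field in dataclasses.fields(MyDataModel):
    print(field.name, field_types[field.name].__name__)
```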

However, Arcaflow doesn’t support all Python data types. You can pick from the following list:

str
int
float
bool
Enums
re.Pattern
typing.List[othertype]
typing.Dict[keytype, valuetype]
typing.Union[onedataclass, anotherdataclass]
Dataclasses
typing.Any

You can read more about the individual types in the data types section.

Optional parameters

You can also declare any parameter as optional, like this:

plugin.py

@dataclasses.dataclass
class MyClass:
    param: typing.Optional[int] = None

Note that adding typing.Optional is not enough on its own; you must also specify the default value.
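The following standard-library sketch shows why the default matters: without = None, the dataclass constructor still demands the argument even though its type permits None. The class names WithoutDefault and WithDefault are illustrative only.

```python
import dataclasses
import typing


@dataclasses.dataclass
class WithoutDefault:
    # Optional type, but no default: the argument is still mandatory.
    param: typing.Optional[int]


@dataclasses.dataclass
class WithDefault:
    # Optional type and a default: the argument may be omitted.
    param: typing.Optional[int] = None


try:
    WithoutDefault()
    omitted_ok = True
except TypeError:
    # Raised because the required positional argument is missing.
    omitted_ok = False

print(omitted_ok)           # False
print(WithDefault().param)  # None
```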

Annotations for validation and metadata

You can attach validations and metadata to each field by wrapping its type in typing.Annotated, like this:

plugin.py

@dataclasses.dataclass
class MyClass:
    param: typing.Annotated[int, schema.name("Param")]

Tip

Annotated objects are preferred as a best practice for a documented schema, and are expected for any officially supported community plugins.

You can use the following annotations to add metadata to your fields:

schema.id adds a serialized field name for the current field (e.g. one containing dashes, which are not valid in Python identifiers).
schema.name adds a human-readable name to the parameter. This can be used to present a form field.
schema.description adds a long-form description to the field.
schema.example adds an example value to the field. You can repeat this annotation multiple times. The example must be provided as primitive types (no dataclasses).

You can also add validations to the fields. The following annotations are valid for all data types:

schema.required_if specifies a field that causes the current field to be required. If the other field is empty, the current field is not required. You can repeat this annotation multiple times. (Make sure to use the optional annotation above.)
schema.required_if_not specifies a field that, if not filled, causes the current field to be required. You can repeat this annotation multiple times. (Make sure to use the optional annotation above.)
schema.conflicts specifies a field that cannot be used together with the current field. You can repeat this annotation multiple times. (Make sure to use the optional annotation above.)

Additionally, some data types have their own validations and metadata, such as schema.min, schema.max, schema.pattern, or schema.units.

Note

When combining typing.Annotated with typing.Optional, the default value is assigned to the Annotated object, not to the Optional object.

plugin.py

@dataclasses.dataclass
class MyClass:
    param: typing.Annotated[
        typing.Optional[int],
        schema.name("Param")
    ] = None
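As a rough illustration of how the pieces of such a declaration fit together, the sketch below uses only the standard library. The Name class is a hypothetical stand-in for a metadata annotation such as schema.name, so the example runs without the SDK; it does not show how the SDK itself reads annotations.

```python
import dataclasses
import typing


@dataclasses.dataclass(frozen=True)
class Name:
    """Hypothetical stand-in for a metadata annotation such as schema.name."""
    value: str


@dataclasses.dataclass
class MyClass:
    # The default is assigned to the whole Annotated[...] expression.
    param: typing.Annotated[typing.Optional[int], Name("Param")] = None


# The first argument of Annotated is the underlying type;
# everything after it is metadata.
annotation = dataclasses.fields(MyClass)[0].type
inner_type, *metadata = typing.get_args(annotation)

print(inner_type)
print(metadata)
print(MyClass().param)  # None
```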

Data types

Strings

Strings are, as the name suggests, strings of human-readable characters. You can specify them in your dataclass like this:

some_field: str

Additionally, you can apply the following validations:

schema.min() specifies the minimum length of the string if the field is set.
schema.max() specifies the maximum length of the string if the field is set.
schema.pattern() specifies the regular expression the string must match if the field is set.

Integers

Integers are 64-bit signed whole numbers. You can specify them in your dataclass like this:

some_field: int

Additionally, you can apply the following validations and metadata:

schema.min() specifies the minimum number if the field is set.
schema.max() specifies the maximum number if the field is set.
schema.units() specifies the units for this field (e.g. bytes). See Units.

Floating point numbers

Floating point numbers are 64-bit signed fractions. You can specify them in your dataclass like this:

some_field: float

Warning

Floating point numbers are inaccurate! Make sure to transmit numbers requiring accuracy as integers!
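A short standard-library demonstration of the problem, with integer cents as one hypothetical way to carry an exact amount:

```python
# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so their sum is not exactly 0.3.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)  # False

# The same amounts expressed as integer cents add up exactly.
cents_sum = 10 + 20
print(cents_sum == 30)   # True
```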

Additionally, you can apply the following validations and metadata:

schema.min() specifies the minimum number if the field is set.
schema.max() specifies the maximum number if the field is set.
schema.units() specifies the units for this field (e.g. bytes). See Units.

Booleans

Booleans are True or False values. You can specify them in your dataclass like this:

some_field: bool

Booleans have no additional validations or metadata.

Enums

Enums, short for enumerations, are used to define a set of named values as unique constants. They provide a way to represent a fixed number of possible values for a variable, parameter, or property. In Python, an enum is declared as a class, but doesn’t behave as a normal class. Instead, the “attributes” of the class act as independent “member” or “enumeration member” objects, each of which has a name and a constant value.

By using enums, you can give meaningful names to distinct values, making the code more self-explanatory and providing a convenient way to work with sets of related constants.

In an Arcaflow schema, an Enum type provides a list of valid values for a field. The Enum must define a set of members with unique values, all of which are either strings or integers.

You can specify an enum with string values like this:

import enum


class MyEnum(enum.Enum):
    Value1 = "value 1"
    Value2 = "value 2"

my_field: MyEnum

The MyEnum class above defines two members, Value1 and Value2. Each member is associated with a constant value, in this case, the strings “value 1” and “value 2” respectively. An input value of “value 1” will result in the plugin seeing a value for my_field of MyEnum.Value1.

You can specify an Enum class with integer values like this:

import enum

class MyEnum(enum.Enum):
    Value1 = 1
    Value2 = 2

my_field: MyEnum

The my_field variable is of type MyEnum. It can store one of the defined enumeration members (Value1 or Value2). An input value of 1 in this case will result in the plugin receiving a value for my_field of MyEnum.Value1.

value = MyEnum.Value1

In the above example, the Value1 member of MyEnum is accessed and assigned to the variable value.
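The mapping from a raw input value to an enumeration member can be tried out with plain Python. This sketch shows standard enum behavior only; the SDK's own deserialization may differ in its details and error types.

```python
import enum


class MyEnum(enum.Enum):
    Value1 = "value 1"
    Value2 = "value 2"


# Looking a member up by value returns the singleton member object.
member = MyEnum("value 1")
print(member is MyEnum.Value1)  # True
print(member.name, member.value)

# A value that is not defined in the enum is rejected.
try:
    MyEnum("value 3")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```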

Note

Enumeration members are “singleton” objects which have a single instance. In Python, you should compare enumeration members using is rather than == (for example, variable is MyEnum.Value1). The values of an Enum used in an Arcaflow schema must be strings or integers.

Tip

Enums aren’t dataclasses, but can be used as the type of dataclass attributes.

Warning

Do not mix integers and strings in the same enum! The values for each Enum type must all be strings, or all integers.

Patterns

When you need to hold regular expressions, you can use a pattern field. This is tied to the Python regular expressions library. You can specify a pattern field like this:

import re

my_field: re.Pattern

Pattern fields have no additional validations or metadata.

Note

If you are looking for a way to do pattern/regex matching on a string, use the schema.pattern() validation, which specifies the regular expression that the string must match.

The example below declares that the first_name field may contain only uppercase and lowercase letters.

plugin.py

import dataclasses
import re
import typing

from arcaflow_plugin_sdk import schema


@dataclasses.dataclass
class MyClass:
    first_name: typing.Annotated[
        str,
        schema.min(2),
        schema.pattern(re.compile("^[a-zA-Z]+$")),
        schema.example("Arca"),
        schema.name("First name")
    ]
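To check which inputs the regular expression from the example accepts, you can exercise it directly with the re module. This sketch tests only the expression itself; the minimum-length rule from schema.min(2) is a separate validation and is not applied here.

```python
import re

# Same expression as in the first_name example.
name_pattern = re.compile("^[a-zA-Z]+$")

# Letters only are accepted; spaces, digits and empty input are not.
for candidate in ["Arca", "Arca flow", "Arca1", ""]:
    print(repr(candidate), bool(name_pattern.match(candidate)))
```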

Lists

When you want to make a list in Arcaflow, you always need to specify its contents. You can do that like this:

my_field: typing.List[str]

Lists can have the following validations:

schema.min() specifies the minimum number of items in the list.
schema.max() specifies the maximum number of items in the list.

Tip

Items in lists can also be annotated with validations.

Dicts

Dicts (maps in Arcaflow) are key-value pairs. You need to specify both the key and the value type. You can do that as follows:

my_field: typing.Dict[str, str]

Dicts can have the following validations:

schema.min() specifies the minimum number of items in the dict.
schema.max() specifies the maximum number of items in the dict.

Tip

Items in dicts can also be annotated with validations.

Union types

Union types (one-of in Arcaflow) allow you to specify two or more possible objects (dataclasses) that can be in a specific place. The only requirement is that there must be a common field (discriminator) and each dataclass must have a unique value for this field. If you do not add this field to your dataclasses, it will be added automatically for you.

For example:

import dataclasses
import typing

from arcaflow_plugin_sdk import schema


@dataclasses.dataclass
class FullName:
    first_name: str
    last_name: str


@dataclasses.dataclass
class Nickname:
    nickname: str


name: typing.Annotated[
    typing.Union[
        typing.Annotated[FullName, schema.discriminator_value("full")],
        typing.Annotated[Nickname, schema.discriminator_value("nick")]
    ], schema.discriminator("name_type")]

Tip

The schema.discriminator and schema.discriminator_value annotations are optional. If you do not specify them, a discriminator will be generated for you.
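Conceptually, the discriminator tells the reader of the data which dataclass to build. The hand-written sketch below mimics that idea with the standard library only; NAME_TYPES and load_name are hypothetical helpers for illustration and are not part of the SDK, which performs this selection for you.

```python
import dataclasses
import typing


@dataclasses.dataclass
class FullName:
    first_name: str
    last_name: str


@dataclasses.dataclass
class Nickname:
    nickname: str


# Hypothetical lookup table: discriminator value -> dataclass.
NAME_TYPES: typing.Dict[str, type] = {"full": FullName, "nick": Nickname}


def load_name(raw: typing.Dict[str, typing.Any]) -> typing.Union[FullName, Nickname]:
    """Build the dataclass selected by the "name_type" discriminator field."""
    fields = dict(raw)  # copy so the caller's dict is not modified
    kind = fields.pop("name_type")
    return NAME_TYPES[kind](**fields)


print(load_name({"name_type": "nick", "nickname": "Arca"}))
print(load_name({"name_type": "full", "first_name": "Arca", "last_name": "Flow"}))
```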

Any types

Any types allow you to pass through any primitive data (no dataclasses). However, this comes with severe limitations as far as validation and use in workflows is concerned, so this type should only be used in limited cases. For example, if you would like to create a plugin that inserts data into an Elasticsearch database, the “any” type would be appropriate.

You can define an “any” type like this:

my_data: typing.Any

Units

Integers and floats can have unit metadata associated with them. For example, a field may contain a unit description like this:

time: typing.Annotated[int, schema.units(schema.UNIT_TIME)]

In this case, a string like 5m30s will automatically be parsed into nanoseconds. Integers will pass through without conversion. You can also define your own unit types. At minimum, you need to specify the base type (nanoseconds in this case), and you can specify multipliers:

my_units = schema.Units(
  schema.Unit(
    # Short, singular
    "ns",
    # Short, plural
    "ns",
    # Long, singular
    "nanosecond",
    # Long, plural
    "nanoseconds"
  ),
  {
    # 1000 nanoseconds make one microsecond.
    1000: schema.Unit(
      "μs",
      "μs",
      "microsecond",
      "microseconds"
    ),
    # ...
  }
)

You can then use this description in your schema.units annotations. Additionally, you can also use it to convert an integer or float into its string representation with the my_units.format_short or my_units.format_long functions. If you need to parse a string yourself, you can use my_units.parse.
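To make the base-unit-plus-multipliers idea concrete, here is a hand-rolled sketch of how a string such as 5m30s reduces to nanoseconds. It uses only the standard library; the NANOSECONDS table and the parse_duration function are hypothetical and are not the SDK's parse implementation, which also handles more units and formats.

```python
import re

# Hypothetical multiplier table: unit suffix -> nanoseconds.
NANOSECONDS = {
    "ns": 1,
    "s": 1_000_000_000,
    "m": 60 * 1_000_000_000,
    "h": 3600 * 1_000_000_000,
}

# A whole number followed by one of the suffixes above.
TOKEN = re.compile(r"(\d+)(ns|s|m|h)")


def parse_duration(text: str) -> int:
    """Convert a string such as "5m30s" into a number of nanoseconds."""
    total = 0
    position = 0
    while position < len(text):
        match = TOKEN.match(text, position)
        if match is None:
            raise ValueError(f"cannot parse {text!r} at offset {position}")
        total += int(match.group(1)) * NANOSECONDS[match.group(2)]
        position = match.end()
    return total


print(parse_duration("5m30s"))  # 330000000000
```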

Built-In Units

A number of unit types are built into the Python SDK for convenience:

UNIT_BYTE - Bytes and 2^10 multiples (kilo-, mega-, giga-, tera-, peta-)
UNIT_TIME - Nanoseconds and human-friendly multiples (microseconds, seconds, minutes, hours, days)
UNIT_CHARACTER - Character notations (char, chars, character, characters)
UNIT_PERCENT - Percentage notations (%, percent)