Creating a Python data model
Every plugin needs a schema to represent its expected inputs and outputs in a machine-readable format. The schema's strong typing is a core design element of Arcaflow, enabling us to build portable workflows that compartmentalize failure conditions and avoid data errors.
When creating a data model for Arcaflow plugins in Python, everything starts with dataclasses. They allow Arcaflow to get information about the data types of individual fields in your class:
import dataclasses

@dataclasses.dataclass
class MyDataModel:
    some_field: str
    other_field: int
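As a rough illustration of what "information about the data types" means, the standard library can already list every field of a dataclass together with its declared type. This is a minimal sketch using only `dataclasses.fields`; how Arcaflow itself inspects the class is not shown here.

```python
import dataclasses


@dataclasses.dataclass
class MyDataModel:
    some_field: str
    other_field: int


# Map each field name to the type declared in its annotation.
field_types = {f.name: f.type for f in dataclasses.fields(MyDataModel)}
print(field_types)
```

Because the types are available at runtime, a schema can be derived from the class without any separate description file.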
However, Arcaflow doesn’t support all Python data types. You can pick from the following list:
- str
- int
- float
- bool
- Enums
- re.Pattern
- typing.List[othertype]
- typing.Dict[keytype, valuetype]
- typing.Union[onedataclass, anotherdataclass]
- Dataclasses
- typing.Any
You can read more about the individual types in the Data types section below.
Optional parameters
You can also declare any parameter as optional, like this:
@dataclasses.dataclass
class MyClass:
    param: typing.Optional[int] = None
Note that adding typing.Optional is not enough; you must also specify the default value.
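The same distinction exists in plain Python, which may help explain the rule. In the sketch below (class names are hypothetical), `typing.Optional` only says that `None` is an allowed value; the default is what lets the caller leave the parameter out.

```python
import dataclasses
import typing


@dataclasses.dataclass
class WithoutDefault:
    # None is allowed, but the argument must still be passed explicitly.
    param: typing.Optional[int]


@dataclasses.dataclass
class WithDefault:
    # None is allowed and the argument may be omitted entirely.
    param: typing.Optional[int] = None


try:
    WithoutDefault()
    omitted_ok = True
except TypeError:
    # The dataclass constructor rejects the missing argument.
    omitted_ok = False

default_value = WithDefault().param
print(omitted_ok, default_value)
```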
Annotations for validation and metadata
You can attach validations and metadata to each field with typing.Annotated, like this:
@dataclasses.dataclass
class MyClass:
    param: typing.Annotated[int, schema.name("Param")]
Tip
Annotated objects are preferred as a best practice for a documented schema, and are expected for any officially-supported community plugins.
You can use the following annotations to add metadata to your fields:
- schema.id adds a serialized field name for the current field (e.g. one containing dashes, which is not valid in Python).
- schema.name adds a human-readable name to the parameter. This can be used to present a form field.
- schema.description adds a long-form description to the field.
- schema.example adds an example value to the field. You can repeat this annotation multiple times. The example must be provided as primitive types (no dataclasses).
You can also add validations to the fields. The following annotations are valid for all data types:
- schema.required_if specifies a field that causes the current field to be required. If the other field is empty, the current field is not required. You can repeat this annotation multiple times. (Make sure to use the optional annotation above.)
- schema.required_if_not specifies a field that, if not filled, causes the current field to be required. You can repeat this annotation multiple times. (Make sure to use the optional annotation above.)
- schema.conflicts specifies a field that cannot be used together with the current field. You can repeat this annotation multiple times. (Make sure to use the optional annotation above.)
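The three rules above can be sketched in plain Python to make their meaning concrete. This is not the SDK's implementation: `check_field` is a hypothetical helper, "empty" is assumed to mean `None` or missing, and repeated annotations are assumed to be checked one by one.

```python
import typing


def check_field(
    data: typing.Dict[str, typing.Any],
    field: str,
    required_if: typing.Sequence[str] = (),
    required_if_not: typing.Sequence[str] = (),
    conflicts: typing.Sequence[str] = (),
) -> typing.List[str]:
    """Return a list of human-readable problems for one field."""
    is_set = data.get(field) is not None
    problems = []
    for other in required_if:
        # The other field is set, so this one must be set too.
        if data.get(other) is not None and not is_set:
            problems.append(f"{field} is required because {other} is set")
    for other in required_if_not:
        # The other field is missing, so this one must be set.
        if data.get(other) is None and not is_set:
            problems.append(f"{field} is required because {other} is not set")
    for other in conflicts:
        # Both fields are set at the same time.
        if is_set and data.get(other) is not None:
            problems.append(f"{field} cannot be used together with {other}")
    return problems


no_problems = check_field(
    {"user": "a", "password": "b"}, "password", required_if=["user"]
)
missing = check_field({"user": "a"}, "password", required_if=["user"])
clash = check_field(
    {"token": "t", "password": "b"}, "password", conflicts=["token"]
)
```

In all three cases the field itself has to be optional, because there are valid inputs in which it is absent.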
Additionally, some data types have their own validations and metadata, such as schema.min, schema.max, schema.pattern, or schema.units.
Note
When combining typing.Annotated with typing.Optional, the default value is assigned to the Annotated object, not to the Optional object.
@dataclasses.dataclass
class MyClass:
    param: typing.Annotated[
        typing.Optional[int],
        schema.name("Param")
    ] = None
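To see why the default belongs to the whole annotated field, it can help to inspect the annotation with the standard library. In this sketch `DisplayName` is a hypothetical stand-in for a metadata annotation such as schema.name; only `typing` and `dataclasses` are used.

```python
import dataclasses
import typing


@dataclasses.dataclass(frozen=True)
class DisplayName:
    """Hypothetical stand-in for a metadata annotation such as schema.name."""

    text: str


@dataclasses.dataclass
class MyClass:
    param: typing.Annotated[
        typing.Optional[int],
        DisplayName("Param"),
    ] = None


# include_extras=True keeps the Annotated wrapper instead of stripping it.
hints = typing.get_type_hints(MyClass, include_extras=True)
annotated = hints["param"]

inner_type = typing.get_args(annotated)[0]  # the wrapped Optional[int]
metadata = annotated.__metadata__  # the extra annotations, in order
print(inner_type, metadata, MyClass().param)
```

The Optional type sits inside the Annotated wrapper as its first argument, while the default value is attached to the dataclass field as a whole.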
Data types
Strings
Strings are, as the name suggests, strings of human-readable characters. You can specify them in your dataclass like this:
some_field: str
Additionally, you can apply the following validations:
- schema.min() specifies the minimum length of the string if the field is set.
- schema.max() specifies the maximum length of the string if the field is set.
- schema.pattern() specifies the regular expression the string must match if the field is set.
Integers
Integers are 64-bit signed whole numbers. You can specify them in your dataclass like this:
some_field: int
Additionally, you can apply the following validations and metadata:
- schema.min() specifies the minimum number if the field is set.
- schema.max() specifies the maximum number if the field is set.
- schema.units() specifies the units for this field (e.g. bytes). See Units.
Floating point numbers
Floating point numbers are 64-bit signed fractions. You can specify them in your dataclass like this:
some_field: float
Warning
Floating point numbers are inaccurate! Make sure to transmit numbers requiring accuracy as integers!
Additionally, you can apply the following validations and metadata:
- schema.min() specifies the minimum number if the field is set.
- schema.max() specifies the maximum number if the field is set.
- schema.units() specifies the units for this field (e.g. bytes). See Units.
Booleans
Booleans are True or False values. You can specify them in your dataclass like this:
some_field: bool
Booleans have no additional validations or metadata.
Enums
Enums, short for enumerations, are used to define a set of named values as unique constants. They provide a way to represent a fixed number of possible values for a variable, parameter, or property. In Python, an enum is declared as a class, but doesn’t behave as a normal class. Instead, the “attributes” of the class act as independent “member” or “enumeration member” objects, each of which has a name and a constant value.
By using enums, you can give meaningful names to distinct values, making the code more self-explanatory and providing a convenient way to work with sets of related constants.
In an Arcaflow schema, an Enum type provides a list of valid values for a field. The Enum must define a set of members with unique values, all of which are either strings or integers.
You can specify an enum with string values like this:
import enum

class MyEnum(enum.Enum):
    Value1 = "value 1"
    Value2 = "value 2"

my_field: MyEnum

An input value of "value 1" in this case will result in the plugin receiving a value for my_field of MyEnum.Value1.
You can specify an Enum class with integer values like this:
import enum

class MyEnum(enum.Enum):
    Value1 = 1
    Value2 = 2

my_field: MyEnum

The my_field variable is of type MyEnum. It can store one of the defined enumeration members (Value1 or Value2). An input value of 1 in this case will result in the plugin receiving a value for my_field of MyEnum.Value1.
value = MyEnum.Value1
Note
Enumeration members are “singleton” objects which have a single instance. In Python, you should compare enumeration members using is rather than == (for example, variable is MyEnum.Value1). The members of an Enum used in an Arcaflow schema must have values of string or integer data type.
Tip
Enums aren’t dataclasses, but can be used as the type of dataclass attributes.
Warning
Do not mix integers and strings in the same enum! The values for each Enum type must all be strings, or all integers.
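The mapping from an input value to an enumeration member can be tried out with the standard library alone. The sketch below shows only Python's own lookup by value; how the SDK performs the conversion internally is not shown here.

```python
import enum


class MyEnum(enum.Enum):
    Value1 = 1
    Value2 = 2


# Looking a member up by value returns the existing singleton member,
# so an identity comparison with "is" succeeds.
member = MyEnum(1)
print(member is MyEnum.Value1)

# A value that no member defines is rejected with a ValueError.
try:
    MyEnum(3)
    accepted = True
except ValueError:
    accepted = False
print(accepted)
```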
Patterns
When you need to hold regular expressions, you can use a pattern field. This is tied to the Python regular expressions library. You can specify a pattern field like this:
import re
my_field: re.Pattern
Pattern fields have no additional validations or metadata.
Note
If you are looking for a way to do pattern/regex matching on a string, use the schema.pattern() validation instead. It specifies the regular expression that the string must match.
The example below declares that first_name must contain only uppercase and lowercase letters.
@dataclasses.dataclass
class MyClass:
    first_name: typing.Annotated[
        str,
        schema.min(2),
        schema.pattern(re.compile("^[a-zA-Z]+$")),
        schema.example("Arca"),
        schema.name("First name")
    ]
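To get a feel for which inputs such a declaration accepts, the two checks can be reproduced by hand with the `re` module. This is a sketch only: `first_name_problems` is a hypothetical helper, and the use of `match` is an assumption about how a pattern is applied, not the SDK's validation code.

```python
import re
import typing

FIRST_NAME_PATTERN = re.compile("^[a-zA-Z]+$")


def first_name_problems(value: str) -> typing.List[str]:
    """Apply a minimum length of 2 and the letters-only pattern by hand."""
    problems = []
    if len(value) < 2:
        problems.append("must be at least 2 characters long")
    if FIRST_NAME_PATTERN.match(value) is None:
        problems.append("must contain only the letters a-z and A-Z")
    return problems


print(first_name_problems("Arca"))
print(first_name_problems("A"))
print(first_name_problems("R2D2"))
```

Of the three inputs, only "Arca" passes both checks: "A" is too short, and "R2D2" contains digits.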
Lists
When you want to make a list in Arcaflow, you always need to specify its contents. You can do that like this:
my_field: typing.List[str]
Lists can have the following validations:
- schema.min() specifies the minimum number of items in the list.
- schema.max() specifies the maximum number of items in the list.
Tip
Items in lists can also be annotated with validations.
Dicts
Dicts (maps in Arcaflow) are key-value pairs. You need to specify both the key and the value type. You can do that as follows:
my_field: typing.Dict[str, str]
Dicts can have the following validations:
- schema.min() specifies the minimum number of items in the dict.
- schema.max() specifies the maximum number of items in the dict.
Tip
Items in dicts can also be annotated with validations.
Union types
Union types (one-of in Arcaflow) allow you to specify two or more possible objects (dataclasses) that can be in a specific place. The only requirement is that there must be a common field (discriminator) and each dataclass must have a unique value for this field. If you do not add this field to your dataclasses, it will be added automatically for you.
For example:
import typing
import dataclasses

@dataclasses.dataclass
class FullName:
    first_name: str
    last_name: str

@dataclasses.dataclass
class Nickname:
    nickname: str

name: typing.Annotated[
    typing.Union[
        typing.Annotated[FullName, schema.discriminator_value("full")],
        typing.Annotated[Nickname, schema.discriminator_value("nick")]
    ], schema.discriminator("name_type")]
Tip
The schema.discriminator and schema.discriminator_value annotations are optional. If you do not specify them, a discriminator will be generated for you.
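The role of the discriminator can be sketched in plain Python: the value of the common field decides which dataclass the remaining fields are loaded into. `NAME_TYPES` and `load_name` are hypothetical names for this illustration; the SDK's own unserialization logic is not shown here.

```python
import dataclasses
import typing


@dataclasses.dataclass
class FullName:
    first_name: str
    last_name: str


@dataclasses.dataclass
class Nickname:
    nickname: str


# Discriminator value -> dataclass, mirroring the annotations above.
NAME_TYPES: typing.Dict[str, type] = {"full": FullName, "nick": Nickname}


def load_name(
    data: typing.Dict[str, typing.Any],
) -> typing.Union[FullName, Nickname]:
    """Pick the dataclass named by the 'name_type' key and build it."""
    fields = dict(data)  # copy so the caller's dict is not modified
    name_type = fields.pop("name_type")
    if name_type not in NAME_TYPES:
        raise ValueError(f"unknown name_type: {name_type!r}")
    return NAME_TYPES[name_type](**fields)


name = load_name({"name_type": "nick", "nickname": "Arca"})
print(name)
```

Without a unique discriminator value per dataclass, two dataclasses with overlapping fields could not be told apart from the input alone.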
Any types
Any types allow you to pass through any primitive data (no dataclasses). However, this comes with severe limitations as far as validation and use in workflows are concerned, so this type should only be used in limited cases. For example, if you would like to create a plugin that inserts data into an Elasticsearch database, the “any” type would be appropriate.
You can define an “any” type like this:
my_data: typing.Any
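As a rough guide to what "primitive data" could look like, the sketch below accepts strings, numbers, booleans, and lists and dicts built from them, and rejects dataclasses. The exact set of accepted types, including whether `None` is allowed, is an assumption of this sketch and not taken from the SDK; `is_primitive_data` is a hypothetical helper.

```python
import dataclasses
import typing


def is_primitive_data(value: typing.Any) -> bool:
    """Return True if value contains only primitives, lists, and dicts."""
    if dataclasses.is_dataclass(value):
        return False
    if value is None or isinstance(value, (str, int, float, bool)):
        return True
    if isinstance(value, list):
        return all(is_primitive_data(item) for item in value)
    if isinstance(value, dict):
        return all(
            is_primitive_data(key) and is_primitive_data(item)
            for key, item in value.items()
        )
    return False


@dataclasses.dataclass
class Point:
    x: int


print(is_primitive_data({"tags": ["a", "b"], "count": 3, "ratio": 0.5}))
print(is_primitive_data({"point": Point(1)}))
```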
Units
Integers and floats can have unit metadata associated with them. For example, a field may contain a unit description like this:
time: typing.Annotated[int, schema.units(schema.UNIT_TIME)]
In this case, a string like 5m30s will automatically be parsed into nanoseconds. Integers will pass through without conversion. You can also define your own unit types. At minimum, you need to specify the base unit (nanoseconds in this case), and you can specify multipliers:
my_units = schema.Units(
    schema.Unit(
        # Short, singular
        "ns",
        # Short, plural
        "ns",
        # Long, singular
        "nanosecond",
        # Long, plural
        "nanoseconds"
    ),
    {
        # 1000 nanoseconds = 1 microsecond
        1000: schema.Unit(
            "us",
            "us",
            "microsecond",
            "microseconds"
        ),
        # ...
    }
)
You can then use this description in your schema.units annotations. Additionally, you can use it to convert an integer or float into its string representation with the my_units.format_short or my_units.format_long functions. If you need to parse a string yourself, you can use my_units.parse.
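To make the conversion concrete, here is a minimal standalone sketch of turning a string such as 5m30s into nanoseconds. It is not the SDK's parser: `parse_duration`, the `NANOSECONDS` table, and the set of accepted suffixes are assumptions made for this illustration.

```python
import re

# Nanoseconds per unit. The suffixes are an assumption of this sketch.
NANOSECONDS = {
    "ns": 1,
    "s": 1_000_000_000,
    "m": 60 * 1_000_000_000,
    "h": 3600 * 1_000_000_000,
}

# One token is a whole number followed by a unit suffix, e.g. "30s".
_TOKEN = re.compile(r"(\d+)(ns|s|m|h)")


def parse_duration(text: str) -> int:
    """Convert a string such as '5m30s' into a whole number of nanoseconds."""
    if not text:
        raise ValueError("empty duration")
    position = 0
    total = 0
    while position < len(text):
        match = _TOKEN.match(text, position)
        if match is None:
            raise ValueError(f"cannot parse {text!r} at offset {position}")
        total += int(match.group(1)) * NANOSECONDS[match.group(2)]
        position = match.end()
    return total


print(parse_duration("5m30s"))  # 5 * 60 s + 30 s = 330 s, in nanoseconds
```

Storing the result as an integer in the base unit keeps the value exact, which is why the base unit is the one required part of a unit definition.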
Built-In Units
A number of unit types are built into the Python SDK for convenience:
- UNIT_BYTE: Bytes and 2^10 multiples (kilo-, mega-, giga-, tera-, peta-)
- UNIT_TIME: Nanoseconds and human-friendly multiples (microseconds, seconds, minutes, hours, days)
- UNIT_CHARACTER: Character notations (char, chars, character, characters)
- UNIT_PERCENT: Percentage notations (%, percent)