Providers

Though we aim to offer a large suite of built-in transforms, it is inevitable that people will need to author their own. This is made possible through the notion of Providers, which leverage expansion services and vend catalogues of schema transforms.

Java

For example, you could build a jar that vends a cross-language transform or schema transform and then use it in a pipeline as follows:

pipeline:
  type: chain
  source:
    type: ReadFromCsv
    config:
      path: /path/to/input*.csv

  transforms:
    - type: MyCustomTransform
      config:
        arg: whatever

  sink:
    type: WriteToJson
    config:
      path: /path/to/output.json

providers:
  - type: javaJar
    config:
      jar: /path/or/url/to/myExpansionService.jar
    transforms:
      MyCustomTransform: "urn:registered:in:expansion:service"

A full example of how to build a Java provider can be found here.

Python

Arbitrary Python transforms can be provided as well, using the syntax:

providers:
  - type: pythonPackage
    config:
      packages:
        - my_pypi_package>=version
        - /path/to/local/package.zip
    transforms:
      MyCustomTransform: "pkg.module.PTransformClassOrCallable"
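As a rough sketch of how this fits into a complete pipeline, the block below mirrors the Java example above but swaps in a pythonPackage provider. The paths, the package name, and the arg parameter are placeholders carried over from the snippets in this section, and the sketch assumes the referenced callable accepts arg as a keyword argument.

```yaml
pipeline:
  type: chain
  source:
    type: ReadFromCsv
    config:
      path: /path/to/input*.csv

  transforms:
    # Resolved through the pythonPackage provider declared below.
    - type: MyCustomTransform
      config:
        arg: whatever  # placeholder; assumed to match a parameter of the callable

  sink:
    type: WriteToJson
    config:
      path: /path/to/output.json

providers:
  - type: pythonPackage
    config:
      packages:
        - my_pypi_package>=version  # placeholder requirement specifier
    transforms:
      # Fully qualified name of a PTransform class or callable (placeholder).
      MyCustomTransform: "pkg.module.PTransformClassOrCallable"
```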

We offer a Python provider starter project that serves as a complete example of how to do this.

YAML Provider listing files

One can reference external listings of providers in the YAML pipeline file via the syntax:

providers:
  - include: "file:///path/to/local/providers.yaml"
  - include: "gs://path/to/remote/providers.yaml"
  - include: "https://example.com/hosted/providers.yaml"
  ...

where providers.yaml is simply a YAML file containing a list of providers in the same format as those inlined in this providers block. See, for example, the provider listing here.
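To make the format concrete, a listing file might look like the sketch below. It simply reuses the two placeholder providers from earlier in this section; the file contents, paths, and the second transform name are illustrative assumptions rather than a real listing.

```yaml
# providers.yaml (hypothetical contents): a top-level list of providers,
# written exactly as they would appear under an inline providers block.
- type: javaJar
  config:
    jar: /path/or/url/to/myExpansionService.jar
  transforms:
    MyCustomTransform: "urn:registered:in:expansion:service"

- type: pythonPackage
  config:
    packages:
      - my_pypi_package>=version
  transforms:
    # MyOtherTransform is a made-up name used only for illustration.
    MyOtherTransform: "pkg.module.PTransformClassOrCallable"
```

A pipeline that includes this file could then refer to MyCustomTransform and MyOtherTransform by name without declaring either provider inline.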

In fact, this is how many of the built-in transforms are declared; see, for example, the built-in IO listing file.

Hosting these listing files (together with their required artifacts) allows one to easily share catalogues of transforms that can be directly used by others in their YAML pipelines.