Blockifiers#

Blockifiers convert data into Steamship’s native Block format.

  • A Blockifier’s input is raw bytes. Examples include a PDF, an image, audio, HTML, CSV, or JSON-formatted API output.

  • A Blockifier’s output is an object in Steamship Block format.

All data imported into Steamship must be blockified before it can be used.
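As a rough mental model of this bytes-in, blocks-out flow, the sketch below uses plain Python dataclasses. SketchTag, SketchBlock, and blockify_plain_text are hypothetical names invented for illustration. They are not Steamship SDK classes, and the real Block format may carry more information than is shown here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchTag:
    # Hypothetical stand-in for a tag: a kind plus a name.
    kind: str
    name: str


@dataclass
class SketchBlock:
    # Hypothetical stand-in for a block: some text plus tags describing it.
    text: str
    tags: List[SketchTag] = field(default_factory=list)


def blockify_plain_text(raw: bytes) -> List[SketchBlock]:
    """Turn raw UTF-8 bytes into one block per non-empty line."""
    lines = raw.decode("utf-8").splitlines()
    return [
        SketchBlock(text=line, tags=[SketchTag(kind="sketch", name="line")])
        for line in lines
        if line.strip()
    ]


blocks = blockify_plain_text(b"first line\n\nsecond line\n")
print([block.text for block in blocks])
```

The point is only the shape of the transformation: opaque bytes go in, and structured text with descriptive tags comes out.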

You can use blockifiers when developing Steamship Packages, in your own Python app code, or as one-off functions that convert data in the cloud.

Using Blockifiers#

To use a blockifier, create an instance with your Steamship client and apply it to a file.

# Load a Steamship Workspace
from steamship import Steamship, File
client = Steamship(workspace="my-workspace-handle")

# Upload a file
file = File.create(path="path/to/some_file").data

# Create the blockifier instance
blockifier = client.use_plugin('blockifier-handle', 'instance-handle')

# Apply the blockifier to the file
task = file.blockify(blockifier.handle)

# Wait until the blockify task completes remotely
task.wait()

# Query across the persisted blocks and tags returned from blockification.
file.query('blocktag AND name "paragraph"')

In the above code, the two key lines are:

blockifier = client.use_plugin('blockifier-handle', 'instance-handle')
task = file.blockify(blockifier.handle)

In these lines, blockifier-handle identifies which blockifier you would like to use, and instance-handle identifies your particular instance of that blockifier in a workspace. If you load the blockifier with the same instance handle again, the existing instance is reused rather than a new one being created.
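The reuse behavior described above can be pictured as a get-or-create lookup keyed by the two handles. The sketch below is an in-memory illustration only: SketchWorkspace is a hypothetical class, and nothing here reflects how Steamship actually stores plugin instances.

```python
from typing import Dict, Optional, Tuple


class SketchWorkspace:
    """Hypothetical in-memory stand-in for a workspace's plugin instances."""

    def __init__(self) -> None:
        self._instances: Dict[Tuple[str, str], dict] = {}

    def use_plugin(self, plugin_handle: str, instance_handle: str,
                   config: Optional[dict] = None) -> dict:
        key = (plugin_handle, instance_handle)
        if key not in self._instances:
            # First request for this pair: create and remember the instance.
            self._instances[key] = {
                "plugin": plugin_handle,
                "handle": instance_handle,
                "config": dict(config or {}),
            }
        # Later requests with the same pair get the stored instance back.
        return self._instances[key]


workspace = SketchWorkspace()
first = workspace.use_plugin("blockifier-handle", "instance-handle")
second = workspace.use_plugin("blockifier-handle", "instance-handle")
other = workspace.use_plugin("blockifier-handle", "another-handle")
print(first is second, first is other)
```

In this sketch, asking twice with the same pair of handles returns the same object, while a different instance handle yields a separate instance.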

Common Blockifiers#

Steamship maintains a growing collection of official blockifiers for common scenarios. Our goal is to always map our defaults to best-of-breed models so that you can get work done quickly without worrying about the details of model selection and tuning.

The currently supported blockifiers are:

Using a Blockifier from within a Steamship Package#

Steamship Packages are Python classes that run in the cloud. The Package constructor receives a pre-configured Steamship client anchored in the correct user Workspace for the lifespan of that instance. You should use this client to load any Blockifiers that the package uses.

Do that with the client.use_plugin method, like this:

from typing import Any, Dict, Optional

from steamship import Steamship, App

class MyPackage(App):
  def __init__(self, client: Steamship, config: Optional[Dict[str, Any]] = None):
    super().__init__(client, config)
    # Load the blockifier once and keep it as a member variable.
    self.blockifier = client.use_plugin(
      plugin_handle="blockifier-handle",
      instance_handle="unique-id",
      config={"key": "value"}
    )

    # Or, as a shortcut:
    self.blockifier = client.use_plugin("blockifier-handle", "unique-id", config={})

We recommend:

  1. Doing this in the constructor, and saving the result as a member variable.

  2. Using a pre-set instance handle. This ensures you get the same plugin instance every time your package is used, instead of generating a new one on each use.

Using a Blockifier from within a Steamship Workspace#

Each instance of a Steamship client is anchored to a Workspace. This Workspace provides a scope in which data and infrastructure can live.

Create a plugin instance within a Workspace by simply using the Steamship client, like this:

from steamship import Steamship

client = Steamship()

blockifier = client.use_plugin(
  plugin_handle="blockifier-handle",
  instance_handle="unique-id",
  config={"key": "value"}
)

# Or, as a shortcut:

blockifier = client.use_plugin("blockifier-handle", "unique-id", config={})

Using a Blockifier as a one-off operation#

If you wish to use a Blockifier in-line without a known workspace, you can create one by calling use_plugin directly on the Steamship class rather than on a client instance.

from steamship import Steamship

blockifier = Steamship.use_plugin(
  plugin_handle="blockifier-handle",
  config={"key": "value"}
)

# Or, as shorthand:

blockifier = Steamship.use_plugin("blockifier-handle", config={})

Developing Blockifiers#

To develop a blockifier, first follow the instructions in Developing Plugins to create a new plugin project. This will result in a full, working plugin scaffold that you could deploy and use immediately.

Then read the details below on how to modify that scaffold for your own needs.

The Blockifier Contract#

Blockifiers are responsible for transforming raw data into Steamship Block Format. Using our SDK, that means implementing the following method:

class MyBlockifier(Blockifier):
    def run(
       self, request: PluginRequest[RawDataPluginInput]
    ) -> Union[
       Response,
       Response[BlockAndTagPluginOutput]
    ]:
        pass

How to Structure Blocks and Tags#

The biggest design question you will face when implementing a blockifier is how to structure your blocks and tags.

At the platform level, we leave this open-ended on purpose, but we do encourage a few common conventions.

See the Workspace Data Model section for a discussion of how to think effectively about blocks and tags.

Synchronous Example: A Pseudo-Markdown Blockifier#

A trivial implementation of this contract would be a pseudo-Markdown blockifier.

Let’s say this blockifier assumes that the input data is UTF-8 and that blank lines represent paragraph breaks. You could implement such a blockifier with the following code:

class PretendMarkdownBlockifier(Blockifier):
    def run(
        self, request: PluginRequest[RawDataPluginInput]
    ) -> Union[PluginRequest[BlockAndTagPluginOutput], BlockAndTagPluginOutput]:
        # Grab the raw data from the request.
        text = request.data.data

        # Decode it as UTF-8 if it arrived as bytes.
        if isinstance(text, bytes):
            text = text.decode("utf-8")

        # Split it into paragraphs based on a double newline.
        paragraphs = text.split("\n\n")

        # Create a block for each paragraph and add a tag marking it as a paragraph.
        blocks = [
            Block.CreateRequest(
                text=paragraph,
                tags=[Tag.CreateRequest(kind="my-plugin", name="paragraph")],
            )
            for paragraph in paragraphs
        ]

        # Return a BlockAndTagPluginOutput object.
        return BlockAndTagPluginOutput(file=File.CreateRequest(blocks=blocks))

From the standpoint of the Steamship Engine, this PretendMarkdownBlockifier now provides a way to transform any bytes claiming to be of this pseudo-markdown type into Steamship Block Format.
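Because the paragraph-splitting step is ordinary string handling, it can be tried out without the Steamship SDK. The sketch below isolates that step in a hypothetical helper, split_paragraphs. It goes slightly beyond the example above by normalizing Windows line endings and dropping empty pieces. Both are assumptions about what a blockifier might want, not documented behavior.

```python
from typing import List


def split_paragraphs(raw: bytes) -> List[str]:
    """Decode UTF-8 bytes and split on blank lines, dropping empty pieces."""
    text = raw.decode("utf-8")
    # Normalize Windows line endings so "\r\n\r\n" also counts as a break.
    text = text.replace("\r\n", "\n")
    pieces = text.split("\n\n")
    # Three or more newlines in a row leave stray whitespace; strip it away.
    return [piece.strip() for piece in pieces if piece.strip()]


sample = b"# Title\n\nFirst paragraph.\n\n\nSecond paragraph.\n"
for paragraph in split_paragraphs(sample):
    print(repr(paragraph))
```

Each string returned here would become the text of one block carrying a paragraph tag.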

Asynchronous Blockifiers#

Some blockifiers will need to call third-party APIs that are asynchronous. Image-to-text (OCR) and speech-to-text (S2T) are two common examples. When this occurs, you should make your blockifier asynchronous as well.

See the Developing Asynchronous Plugins section for details.
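To make the general idea concrete, here is a minimal sketch of the submit-then-poll pattern that asynchronous third-party APIs typically require. Everything in it is hypothetical: FakeTranscriptionApi simulates a remote service, and SketchStatus, start, and check are illustrative names only. The mechanism Steamship itself uses for asynchronous plugins is the one described in the section referenced above and may look quite different.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SketchStatus:
    # Hypothetical result type: either still running or finished with text.
    state: str  # "running" or "succeeded"
    job_id: Optional[str] = None
    text: Optional[str] = None


class FakeTranscriptionApi:
    """Hypothetical third-party API that needs a few polls before finishing."""

    def __init__(self, polls_needed: int = 2) -> None:
        self._polls_left: Dict[str, int] = {}
        self._polls_needed = polls_needed

    def submit(self, audio: bytes) -> str:
        job_id = f"job-{len(self._polls_left) + 1}"
        self._polls_left[job_id] = self._polls_needed
        return job_id

    def poll(self, job_id: str) -> Optional[str]:
        self._polls_left[job_id] -= 1
        if self._polls_left[job_id] > 0:
            return None  # Still processing.
        return "transcribed text"


def start(api: FakeTranscriptionApi, audio: bytes) -> SketchStatus:
    # Kick off the remote job and return immediately instead of blocking.
    return SketchStatus(state="running", job_id=api.submit(audio))


def check(api: FakeTranscriptionApi, status: SketchStatus) -> SketchStatus:
    # Ask the remote API once; report "running" again if it is not done yet.
    result = api.poll(status.job_id)
    if result is None:
        return status
    return SketchStatus(state="succeeded", text=result)


api = FakeTranscriptionApi(polls_needed=2)
status = start(api, b"fake audio bytes")
while status.state == "running":
    status = check(api, status)
print(status.text)
```

The design choice illustrated is that the first call returns a job identifier rather than a result, so the caller can check back later instead of holding a connection open while the remote service works.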