
Learn from PDFs on the Web

Give your agent the ability to read and remember PDF content

Use the IndexerPipelineMixin to give your agent the ability to read and remember PDF files.

A full working example is available here. You can copy and paste this agent into your api.py file or use it as a reference.

Mixins

A Mixin is just a way to add a bundle of functionality to your AgentService. Mixins can include new API endpoints, async processing pipelines, and webhooks.
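
To make that concrete, here is a minimal sketch of what a mixin can look like. The GreetingMixin class and /greet endpoint are purely illustrative, and the PackageMixin and post import paths are assumptions about the Steamship SDK -- consult the SDK source for the exact interfaces.

```python
# A minimal sketch, not SDK-verified: a tiny mixin that bundles one
# extra API endpoint. PackageMixin and post are assumed to come from
# the Steamship SDK; check the import paths against your SDK version.
from steamship.invocable import post
from steamship.invocable.package_mixin import PackageMixin


class GreetingMixin(PackageMixin):
    """Adds a /greet endpoint to any AgentService that registers it."""

    @post("greet")
    def greet(self, name: str = "world") -> str:
        return f"Hello, {name}!"
```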

Adding the IndexerPipelineMixin

  1. Add IndexerPipelineMixin to the static USED_MIXIN_CLASSES list in your AgentService.
  2. Register IndexerPipelineMixin in your AgentService's __init__ with
    self.add_mixin(IndexerPipelineMixin(self.client, self))
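
Putting both steps together, a minimal AgentService might look like the sketch below. MyAssistant is a placeholder name, and the import paths are assumptions based on the Steamship SDK; the full working example linked above shows the exact, tested form.

```python
# A minimal sketch; MyAssistant is a placeholder and the import paths
# should be verified against the full working example linked above.
from steamship.agents.service.agent_service import AgentService
from steamship.invocable.mixins.indexer_pipeline_mixin import IndexerPipelineMixin


class MyAssistant(AgentService):
    # Step 1: declare the mixin so its endpoints are registered.
    USED_MIXIN_CLASSES = [IndexerPipelineMixin]

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Step 2: register the mixin instance with this service.
        self.add_mixin(IndexerPipelineMixin(self.client, self))
```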

Adding this mixin registers an entire asynchronous document-processing pipeline with your agent, along with API endpoints (/index_url and /index_text) to access it.

This pipeline will:

  • Import provided URLs
  • Convert them to text (currently only PDF and YouTube are supported -- see the Customizing section below!)
  • Chunk the data for use by question-answering Tools
  • Embed each chunk into your agent's default Vector Database (created on demand)

Learning new content with IndexerPipelineMixin

NOTE: IndexerPipelineMixin currently only functions in deployed agents. You will have to run ship deploy to learn new information with this Mixin.

Via API

Any AgentService with the IndexerPipelineMixin gains two authenticated endpoints, /index_url and /index_text, that let you load information into your agent.
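
For example, you can call these endpoints from Python with the Steamship client. This is a minimal sketch: the package and instance handles are placeholders, and it assumes the client's use and invoke helpers; the generated API documentation for your instance (see below) is the authoritative reference for each endpoint's parameters.

```python
# A minimal sketch, assuming the Steamship Python client; the handles
# and sample text below are placeholders.
from steamship import Steamship

# Get a handle to your deployed agent instance.
instance = Steamship.use("my-agent-package", "my-agent-instance")

# Teach the agent a snippet of raw text via the /index_text endpoint.
instance.invoke("index_text", text="Steamship agents can learn from PDFs.")
```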

To view the documentation for your agent's learning endpoints:

  1. Deploy your agent if you haven't done so already
  2. Create an instance so that you can use it
  3. View your agent instance's web page
  4. Click on the Manage tab of your agent instance's web page
  5. Click on the API tab in your agent's management console
  6. Click on either the /index_url or /index_text endpoint to view its customized API documentation

In the case of PDF files, simply provide the URL of the PDF file in the url argument of /index_url, as directed by the API documentation.
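
As a concrete sketch, calling /index_url from Python might look like the following. The handles and the PDF URL are placeholders, and the call assumes the same Steamship client usage as above.

```python
# A minimal sketch; the handles and PDF URL are placeholders.
from steamship import Steamship

instance = Steamship.use("my-agent-package", "my-agent-instance")

# Kick off the asynchronous pipeline that imports, converts, chunks,
# and embeds the PDF into the agent's default Vector Database.
instance.invoke("index_url", url="https://example.com/whitepaper.pdf")
```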

Via the Web Interface

Proceed just as with the API instructions above, but use the auto-generated web form for the endpoint instead of calling it programmatically.

Note that you only need to provide the url argument.

PDF Specific Notes

Our PDF processing plugin currently only supports text PDFs.

If your PDF is scanned and each page is an image, it will fail to parse.

Adding file importer plugins to Steamship to leverage cloud OCR services is relatively easy. If you create one, please consider sending us a pull request on GitHub!

Customizing the Behavior of IndexerPipelineMixin

Steamship mixins are just open-source Python. The source for this mixin is available here. You can customize it by copying the mixin into your own project, changing the code, and then using your new version.

This particular mixin works by:

  1. Loading three other mixins:
    1. The FileImporterMixin, which adds API endpoints to scrape data from the web and YouTube.
    2. The BlockifierMixin, which adds API endpoints to convert PDF and video data to text.
    3. The IndexerMixin, which adds API endpoints to chunk and embed documents into a vector database, for use with question-answering tools.
  2. Adding an API endpoint which orchestrates an asynchronous task pipeline that scrapes, converts, chunks, embeds, and stores documents in a vector database.

This pipeline is open-source Python and very customizable: you can incorporate tools such as LlamaIndex or LangChain if they contain specific importers or splitters you need.
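
For instance, here is a hypothetical sketch of delegating chunking to LangChain in your copied version of the code. RecursiveCharacterTextSplitter is a real LangChain class, but chunk_text and the integration point are illustrative assumptions, not part of the Steamship SDK.

```python
# A hypothetical customization: using LangChain's splitter for the
# chunking step. chunk_text is an illustrative stand-in for wherever
# your copied mixin performs chunking; it is not a Steamship API.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)


def chunk_text(text: str) -> list[str]:
    # Swap the default chunking strategy for LangChain's recursive splitter.
    return splitter.split_text(text)
```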

Here are some ideas for how you might extend it:

  • Extend the FileImporterMixin to support auto-detection of new URL types, such as Wikipedia or the SEC's EDGAR database
  • Extend the BlockifierMixin to support new filetypes, such as images (via OCR or other image-to-text models)
  • Extend the IndexerMixin to utilize different text chunking/splitting strategies
  • Add a /learn_website endpoint which scrapes an entire website and schedules it for learning

If you create an interesting customization of this mixin, please consider sending us a pull request on GitHub!