Learn from PDFs on the Web
Give your agent the ability to read and remember PDF content
Use the IndexerPipelineMixin to give your agent the ability to read and remember PDF files.
A full working example is available here. You can copy and paste this agent into your
api.py file or use it as a reference.
A Mixin is just a way to add a bundle of functionality to your
AgentService. Mixins can include new API endpoints, async processing pipelines, and webhooks.
- Add IndexerPipelineMixin to the static USED_MIXIN_CLASSES list in your AgentService
- Register IndexerPipelineMixin in your AgentService
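The registration described above can be pictured with a small stand-alone model. This is a hedged sketch, not Steamship's code: ToyMixin, ToyIndexerPipelineMixin, ToyAgentService, and the add_mixin method written here are hypothetical stand-ins defined inside the block so that it runs without the steamship package, and the rule that undeclared mixins are rejected is an assumption made for illustration. In a real project you would import AgentService and IndexerPipelineMixin from Steamship and follow the full working example for the exact calls.

```python
from typing import Callable, Dict, List, Type


class ToyMixin:
    """Stand-in base class: a mixin contributes named endpoints to a service."""

    def routes(self) -> Dict[str, Callable[..., str]]:
        return {}


class ToyIndexerPipelineMixin(ToyMixin):
    """Stand-in for IndexerPipelineMixin; contributes a fake /index_url route."""

    def index_url(self, url: str) -> str:
        # The real pipeline is asynchronous; this toy version only acknowledges.
        return f"scheduled indexing of {url}"

    def routes(self) -> Dict[str, Callable[..., str]]:
        return {"/index_url": self.index_url}


class ToyAgentService:
    """Stand-in for AgentService (hypothetical and greatly simplified)."""

    # Step 1: subclasses declare the mixin classes they use.
    USED_MIXIN_CLASSES: List[Type[ToyMixin]] = []

    def __init__(self) -> None:
        self.endpoints: Dict[str, Callable[..., str]] = {}

    def add_mixin(self, mixin: ToyMixin) -> None:
        # Toy rule (an assumption): only declared mixin classes may be added.
        if type(mixin) not in self.USED_MIXIN_CLASSES:
            raise ValueError(f"{type(mixin).__name__} is not declared")
        self.endpoints.update(mixin.routes())


class MyAgent(ToyAgentService):
    USED_MIXIN_CLASSES = [ToyIndexerPipelineMixin]

    def __init__(self) -> None:
        super().__init__()
        # Step 2: register an instance of the mixin with the service.
        self.add_mixin(ToyIndexerPipelineMixin())
```

Creating MyAgent() in this model leaves an /index_url entry in its endpoints dictionary, which mirrors the idea that a mixin brings its own endpoints along with it.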
Adding this mixin registers an entire asynchronous document processing pipeline to your agent, along with an API endpoint (/index_url) to access it.
This pipeline will:
- Import provided URLs
- Convert them to text format (currently only PDF and YouTube are supported -- see the Customizing section below!)
- Chunk the data for use with question-answering Tools
- Embed each chunk into your agent's default Vector Database (created on-demand for it)
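The four stages listed above can be sketched end to end in plain Python. Everything here is an illustrative assumption rather than the mixin's implementation: import_url returns canned bytes instead of downloading, convert_to_text stands in for PDF parsing, chunk_text uses fixed-size overlapping character windows, embed is a hash-based toy vector with no semantic meaning, and a dictionary plays the role of the vector database.

```python
import hashlib
from typing import Dict, List, Tuple

Row = Tuple[str, List[float]]


def import_url(url: str) -> bytes:
    # Stand-in for the import step: a real pipeline would download the file.
    return f"Fake PDF fetched from {url}. It has two sentences.".encode("utf-8")


def convert_to_text(raw: bytes) -> str:
    # Stand-in for PDF-to-text conversion.
    return raw.decode("utf-8")


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> List[str]:
    # Fixed-size character windows that overlap so context spans boundaries.
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


def embed(chunk: str, dims: int = 8) -> List[float]:
    # Toy embedding: hash bytes scaled into [0, 1]. Not semantically meaningful.
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dims]]


def index_url(url: str, store: Dict[str, List[Row]]) -> int:
    # Run the four stages in order and store (chunk, vector) pairs by URL.
    text = convert_to_text(import_url(url))
    rows = [(chunk, embed(chunk)) for chunk in chunk_text(text)]
    store[url] = rows
    return len(rows)
```

Calling index_url("https://example.com/paper.pdf", store) with an empty dictionary fills store with one list of (chunk, vector) pairs for that URL and returns the number of chunks.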
IndexerPipelineMixin currently only functions in deployed agents. You will have to run
ship deploy to learn new information with this Mixin.
An AgentService with the IndexerPipelineMixin gains two authenticated endpoints,
/index_url and /index_text, that let you load information into your agent.
To view the documentation for your agent's learning endpoints:
- Deploy your agent if you haven't done so already
- Create an instance so that you can use it
- View your agent instance's web page
- Click on the Manage tab of your agent instance's web page
- Click on the API tab in your agent's management console
- Click on either the /index_url or /index_text endpoints for customized API documentation
In the case of PDF files, simply provide the URL of the PDF file to the
/index_url endpoint as directed by the API documentation.
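As a rough illustration of such a call, the sketch below builds (but does not send) an HTTP request using only the Python standard library. Several details are assumptions, not taken from the API documentation: the placeholder instance URL, the JSON field name "url", the POST method, and the bearer-token Authorization header. Your agent instance's customized API documentation is the authority on the real contract.

```python
import json
import urllib.request


def build_index_url_request(
    base_url: str, api_key: str, pdf_url: str
) -> urllib.request.Request:
    # Assumption: the endpoint accepts a JSON POST with a "url" field and a
    # bearer token. Check your instance's API tab for the actual contract.
    endpoint = base_url.rstrip("/") + "/index_url"
    body = json.dumps({"url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


request = build_index_url_request(
    "https://example.invalid/my-agent",  # placeholder instance URL
    "YOUR_API_KEY",  # placeholder credential
    "https://example.com/paper.pdf",
)
# Sending is left out on purpose; with real values it would be:
#     urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before anything reaches a deployed agent.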
Proceed just as with the instructions for API learning, but use the auto-generated web endpoint for the API method.
Note that you only need to provide the
Our PDF processing plugin currently only supports text PDFs.
If your PDF is scanned and each page is an image, it will fail to parse.
Adding file importer plugins to Steamship to leverage cloud OCR services is relatively easy. If you create one, please consider sending us a pull request on GitHub!
Steamship mixins are just open-source Python. The source for this mixin is available here. You can customize it by copy-pasting the mixin into your own project, changing the code, and then using your new version.
This particular mixin works by:
- Loading three other mixins:
  - FileImporterMixin, which adds API endpoints to scrape data from the web and YouTube.
  - BlockifierMixin, which adds API endpoints to convert PDF and video data to text.
  - IndexerMixin, which adds API endpoints to chunk and embed documents into a vector database, for use with question-answering tools.
- Adding an API endpoint that orchestrates an asynchronous task pipeline that scrapes, converts, chunks, embeds, and stores documents in a vector database.
This pipeline is open source Python and very customizable: you can incorporate tools such as LlamaIndex or LangChain if they contain specific importers or splitters you need.
Here are some ideas for how you might extend it:
- Extend the FileImporterMixin to support auto-detection of new URL types, such as Wikipedia or the SEC's EDGAR database
- Extend the BlockifierMixin to support new filetypes, such as images (via OCR or other image-to-text models)
- Extend the IndexerMixin to utilize different text chunking/splitting strategies
- Add a /learn_website endpoint which scrapes an entire website and schedules the learning of it
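To make the first idea concrete, here is a minimal sketch of URL-type auto-detection using the standard library. The function name detect_url_type, the returned labels, and the matching rules (hostname suffixes, an "edgar" path segment on sec.gov, a .pdf file extension) are all hypothetical choices for illustration. How a result like this would plug into FileImporterMixin is not shown and depends on the mixin's source.

```python
from urllib.parse import urlparse


def detect_url_type(url: str) -> str:
    # Classify a URL so a custom importer could pick a matching scraper.
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()

    def host_is(domain: str) -> bool:
        # Match the domain itself or any subdomain, but not look-alikes.
        return host == domain or host.endswith("." + domain)

    if host_is("youtube.com") or host_is("youtu.be"):
        return "youtube"
    if host_is("wikipedia.org"):
        return "wikipedia"
    if host_is("sec.gov") and "edgar" in path:
        return "edgar"
    if path.endswith(".pdf"):
        return "pdf"
    return "unknown"
```

Matching on the parsed hostname rather than on the raw string avoids misclassifying URLs that merely mention a known site in their path or query string.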
If you create an interesting customization of this tool, please consider sending us a pull request on GitHub!