Document transformation framework for vector-based retrieval
🐛 Doctran
Document transformation library for AI knowledge
Vector databases are useful for retrieving context for LLMs, but they struggle to find relevant information when the source documents are indexed haphazardly and information is sparse. Doctran is an open-source library that uses LLMs and open-source NLP libraries to transform raw text into clean, structured, information-dense documents that are optimized for vector space retrieval.
Doctran is maintained by Psychic, the data integration layer for LLMs.
Getting Started
pip install doctran
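import os
from doctran import Doctran
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # assumes the key is set in your environment
doctran = Doctran(openai_api_key=OPENAI_API_KEY)
document = doctran.parse(content="your_content_as_string")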
Clone or download examples.ipynb for interactive examples.
Chaining transformations
Doctran is designed to make chaining document transformations easy. For example, you may want to first redact all PII from a document before sending it over to OpenAI to be summarized.
Ordering matters when chaining transformations: transformations that are invoked first are executed first, and each result is passed to the next transformation.
document = await document.redact(entities=["EMAIL_ADDRESS", "PHONE_NUMBER"]).extract(properties).summarize().execute()
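The ordering rule can be illustrated with a small stand-alone sketch. It does not use Doctran itself: `ToyBuilder` and its two transformations are hypothetical stand-ins, chosen only to show how a builder might queue calls and run them in invocation order once `execute()` is awaited.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List

# Hypothetical stand-ins for illustration only; these are not Doctran's classes.
Step = Callable[[str], Awaitable[str]]


@dataclass
class ToyBuilder:
    content: str
    steps: List[Step] = field(default_factory=list)

    def redact(self, word: str) -> "ToyBuilder":
        async def step(text: str) -> str:
            return text.replace(word, "<REDACTED>")
        self.steps.append(step)
        return self  # returning self is what makes chaining possible

    def shout(self) -> "ToyBuilder":
        async def step(text: str) -> str:
            return text.upper()
        self.steps.append(step)
        return self

    async def execute(self) -> str:
        text = self.content
        # Steps run in the order they were queued; each receives the previous result.
        for step in self.steps:
            text = await step(text)
        return text


async def main() -> None:
    redact_first = await ToyBuilder("call alice now").redact("alice").shout().execute()
    shout_first = await ToyBuilder("call alice now").shout().redact("alice").execute()
    print(redact_first)  # CALL <REDACTED> NOW
    print(shout_first)   # CALL ALICE NOW (the lowercase match no longer exists)


asyncio.run(main())
```

Swapping the two calls changes the result, which is why a redaction step belongs before any step that sends text to a third-party API.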
Doctransformers
Extract
Given any valid JSON schema, uses OpenAI function calling to extract structured data from a document.
from doctran import ExtractProperty
properties = ExtractProperty(
name="millenial_or_boomer",
description="A prediction of whether this document was written by a millenial or boomer",
type="string",
enum=["millenial", "boomer"],
required=True
)
document = await document.extract(properties=properties).execute()
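For readers unfamiliar with function calling, the sketch below shows one way property definitions like the one above could be assembled into a JSON schema object. `ToyProperty` and `build_parameters_schema` are hypothetical names used for illustration; Doctran's internal representation may differ.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ToyProperty:
    # Hypothetical mirror of the fields shown in the example above.
    name: str
    description: str
    type: str
    enum: Optional[List[str]] = None
    required: bool = True


def build_parameters_schema(properties: List[ToyProperty]) -> Dict[str, Any]:
    """Assemble a JSON schema object describing the requested properties."""
    schema: Dict[str, Any] = {"type": "object", "properties": {}, "required": []}
    for prop in properties:
        entry: Dict[str, Any] = {"type": prop.type, "description": prop.description}
        if prop.enum is not None:
            entry["enum"] = prop.enum
        schema["properties"][prop.name] = entry
        if prop.required:
            schema["required"].append(prop.name)
    return schema


schema = build_parameters_schema([
    ToyProperty(
        name="millenial_or_boomer",
        description="A prediction of whether this document was written by a millenial or boomer",
        type="string",
        enum=["millenial", "boomer"],
    )
])
print(json.dumps(schema, indent=2))
```

A schema of this shape constrains the model's answer to the listed `enum` values and marks the field as required.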
Redact
Uses a spaCy model to remove names, emails, phone numbers and other sensitive information from a document. Runs locally to avoid sending sensitive data to third party APIs.
document = await document.redact(entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"]).execute()
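To make the idea of local redaction concrete, here is a deliberately simplified sketch. It swaps in regular expressions for the spaCy model that Doctran uses, so it only catches emails and US-style phone numbers and is not a substitute for the library's redactor. The patterns and the `redact_locally` helper are assumptions made for illustration.

```python
import re
from typing import List

# Simplified patterns for illustration; real-world PII detection needs far more care.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_locally(text: str, entities: List[str]) -> str:
    """Replace each match of the requested entity types with a placeholder tag."""
    for entity in entities:
        text = PATTERNS[entity].sub(f"<{entity}>", text)
    return text


sample = "Reach Jane at jane.doe@example.com or 555-123-4567."
print(redact_locally(sample, ["EMAIL_ADDRESS", "PHONE_NUMBER"]))
```

The name "Jane" survives in this sketch because names cannot be matched by a fixed pattern; detecting them requires a named-entity recognition model. Nothing in the sketch leaves the local machine.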
Summarize
Summarize the information in a document. token_limit may be passed to configure the size of the summary; however, it may not be respected by OpenAI.
document = await document.summarize().execute()
Refine
Remove all information from a document unless it's related to a specific set of topics.
document = await document.refine(topics=['marketing', 'meetings']).execute()
Translate
Translates text into another language.
document = await document.translate(language="spanish").execute()
Interrogate
Convert information in a document into question-and-answer format. End-user queries often take the form of a question, so converting information into questions and creating indexes from these questions often yields better results when using a vector database for context retrieval.
document = await document.interrogate().execute()
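The intuition behind indexing questions can be shown with a toy comparison. The sketch below uses bag-of-words cosine similarity as a crude stand-in for real embeddings, and the sample texts are invented for illustration, so it demonstrates only lexical overlap and makes no claim about how any particular embedding model behaves.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented sample texts: the same fact stored as a statement and as a question.
statement = "The refund window is 30 days from delivery."
question = "How long is the refund window?"
query = "How long is the refund window for an order?"

query_vec = bag_of_words(query)
print("vs statement:", round(cosine(query_vec, bag_of_words(statement)), 3))
print("vs question: ", round(cosine(query_vec, bag_of_words(question)), 3))
```

In this toy setup the user's query sits closer to the stored question than to the original statement, because both are phrased as questions.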
Contributing
Doctran is open to contributions! The best way to get started is to contribute a new document transformer. Transformers that don't rely on API calls (e.g. OpenAI) are especially valuable since they can run fast and don't require any external dependencies.
Adding a new doctransformer
Contributing new transformers is straightforward.
- Add a new class that extends DocumentTransformer or OpenAIDocumentTransformer
- Implement the __init__ and transform methods
- Add corresponding methods to the DocumentTransformationBuilder and Document classes to enable chaining
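The steps above can be sketched with a transformer that needs no API calls. The base class and document type below are hypothetical stand-ins defined locally so the example runs on its own; the real DocumentTransformer signature, including whether `transform` is asynchronous, should be checked against the Doctran source before contributing.

```python
import asyncio
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ToyDocument:
    # Hypothetical stand-in for Doctran's Document class.
    content: str


class ToyDocumentTransformer(ABC):
    # Hypothetical stand-in for the DocumentTransformer base class.
    @abstractmethod
    async def transform(self, document: ToyDocument) -> ToyDocument:
        ...


class WhitespaceNormalizer(ToyDocumentTransformer):
    """Example transformer that runs locally: collapses runs of whitespace."""

    def __init__(self, strip: bool = True) -> None:
        self.strip = strip

    async def transform(self, document: ToyDocument) -> ToyDocument:
        text = re.sub(r"\s+", " ", document.content)
        if self.strip:
            text = text.strip()
        return ToyDocument(content=text)


result = asyncio.run(WhitespaceNormalizer().transform(ToyDocument("  hello \n\n  world  ")))
print(result.content)  # hello world
```

To make such a transformer chainable, the builder and document classes would each need a matching method that queues it, as described in the last step.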