
Document transformation framework for vector based retrieval


🐛 Doctran

Document transformation framework - use LLMs to process complex strings with natural language instructions.


Some applications require documents to be parsed where human-level judgement matters more than speed, e.g. labelling transactions or extracting semantic information from text. In these cases, regular expressions can be too inflexible, but LLMs are ideal. One way to think of Doctran is as an LLM-powered black box where messy strings go in and nice, clean, labelled strings come out. Another way to think about it is as a modular, declarative wrapper over OpenAI's function calling feature that significantly improves the developer experience.

Doctran is (lightly) maintained by jasonwcfan.

Examples

Clone or download examples.ipynb for interactive demos.

Doctran converts messy, unstructured text

<doc type="Confidential Document - For Internal Use Only">
<metadata>
<date> &#x004A; &#x0075; &#x006C; &#x0079; &#x0020; &#x0031; , &#x0032; &#x0030; &#x0032; &#x0033; </date>
<subject> Updates and Discussions on Various Topics; </subject>
</metadata>
<body>
<p>Dear Team,</p>
<p>I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.</p>
<section>
<header>Security and Privacy Measures</header>
<p>As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe&#64;example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security&#64;example.com.</p>
</section>
<section>
<header>HR Updates and Employee Benefits</header>
<p>Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: &#x0030; &#x0034; &#x0039; - &#x0034; &#x0035; - &#x0035; &#x0039; &#x0032; &#x0038;) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: &#x0034; &#x0031; 
...

Into semi-structured documents that are optimized for vector search.

{
  "topics": ["Security and Privacy", "HR Updates", "Marketing", "R&D"],
  "summary": "The document discusses updates on security measures, HR, marketing initiatives, and R&D projects. It commends John Doe for enhancing network security, welcomes new team members, and recognizes Jane Smith for her customer service. It also mentions the open enrollment period for employee benefits, thanks Sarah Thompson for her social media efforts, and announces a product launch event on July 15th. David Rodriguez is acknowledged for his contributions to R&D. The document emphasizes the importance of confidentiality.",
  "contact_info": [
    {
      "name": "John Doe",
      "contact_info": {
        "phone": "",
        "email": "john.doe@example.com"
      }
    },
    {
      "name": "Michael Johnson",
      "contact_info": {
        "phone": "418-492-3850",
        "email": "michael.johnson@example.com"
      }
    },
    {
      "name": "Sarah Thompson",
      "contact_info": {
        "phone": "415-555-1234",
        "email": ""
      }
    }
  ],
  "questions_and_answers": [
    {
      "question": "What is the purpose of this document?",
      "answer": "The purpose of this document is to provide important updates and discuss various topics that require the team's attention."
    },
    {
      "question": "Who is commended for enhancing the company's network security?",
      "answer": "John Doe from the IT department is commended for enhancing the company's network security."
    }
  ]
}

Getting Started

pip install doctran

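import os
from doctran import Doctran

# Assumes the key is stored in the OPENAI_API_KEY environment variable.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
doctran = Doctran(openai_api_key=OPENAI_API_KEY)
document = doctran.parse(content="your_content_as_string")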

Chaining transformations

Doctran is designed to make chaining document transformations easy. For example, you may want to first redact all PII from a document before sending it over to OpenAI to be summarized.

Ordering is important when chaining transformations - transformations that are invoked first are executed first, and each result is passed to the next transformation.

document = await document.redact(entities=["EMAIL_ADDRESS", "PHONE_NUMBER"]).extract(properties=properties).summarize().execute()
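
The ordering rule can be illustrated with a small stand-in builder. This is a minimal sketch, not Doctran's implementation: SketchBuilder, drop_digits, and append_length are hypothetical names, and the steps are plain string functions rather than LLM calls. It also shows one way an awaited .execute() chain might be driven from a regular script, using asyncio.run.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# Hypothetical stand-in for a chaining builder; not Doctran's actual class.
Step = Callable[[str], Awaitable[str]]


class SketchBuilder:
    def __init__(self, content: str) -> None:
        self.content = content
        self.steps: List[Step] = []

    def then(self, step: Step) -> "SketchBuilder":
        # Queue the step and return self so calls can be chained.
        self.steps.append(step)
        return self

    async def execute(self) -> str:
        # Run queued steps in the order they were added,
        # feeding each result into the next step.
        result = self.content
        for step in self.steps:
            result = await step(result)
        return result


async def drop_digits(text: str) -> str:
    return "".join(ch for ch in text if not ch.isdigit())


async def append_length(text: str) -> str:
    return f"{text}:{len(text)}"


async def main() -> Tuple[str, str]:
    first = await SketchBuilder("abc123").then(drop_digits).then(append_length).execute()
    second = await SketchBuilder("abc123").then(append_length).then(drop_digits).execute()
    return first, second


first, second = asyncio.run(main())
print(first)   # abc:3
print(second)  # abc:
```

The two chains use the same steps on the same input and differ only in order, which is why redacting before summarizing is not interchangeable with summarizing before redacting.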

Doctransformers

Extract

Given any valid JSON schema, uses OpenAI function calling to extract structured data from a document.

from doctran import ExtractProperty

properties = ExtractProperty(
    name="millennial_or_boomer",
    description="A prediction of whether this document was written by a millennial or boomer",
    type="string",
    enum=["millennial", "boomer"],
    required=True
)
document = await document.extract(properties=properties).execute()
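
To make the link between a property definition and function calling more concrete, here is a minimal sketch of how a property spec might be turned into a JSON Schema object describing function arguments. PropertySpec, to_function_schema, the "extract_information" function name, and the "sentiment" property are all hypothetical and chosen for illustration; Doctran's internal conversion may differ.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class PropertySpec:
    # Hypothetical mirror of the fields used when declaring a property.
    name: str
    description: str
    type: str
    enum: Optional[List[str]] = None
    required: bool = True


def to_function_schema(props: List[PropertySpec]) -> Dict[str, Any]:
    # Build a JSON Schema "object" describing the arguments of a function.
    properties: Dict[str, Any] = {}
    for prop in props:
        entry: Dict[str, Any] = {"type": prop.type, "description": prop.description}
        if prop.enum:
            entry["enum"] = prop.enum
        properties[prop.name] = entry
    return {
        "name": "extract_information",  # hypothetical function name
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": [p.name for p in props if p.required],
        },
    }


schema = to_function_schema([
    PropertySpec(
        name="sentiment",
        description="Overall sentiment of the document",
        type="string",
        enum=["positive", "negative"],
    )
])
print(json.dumps(schema, indent=2))
```

Constraining the model's output to a schema like this is what makes the extracted values predictable enough to store as metadata.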

Redact

Uses a spaCy model to remove names, emails, phone numbers and other sensitive information from a document. Runs locally to avoid sending sensitive data to third-party APIs.

document = await document.redact(entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"]).execute()
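
The general shape of local redaction (find spans, replace them with an entity placeholder, never call an external service) can be sketched with regular expressions. This is deliberately not the spaCy-based approach Doctran uses: the patterns below are simplified and hypothetical, cannot detect names, and the placeholder format is an assumption.

```python
import re
from typing import List

# Simplified, hypothetical patterns; real PII detection needs far more care.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_text(text: str, entities: List[str]) -> str:
    # Replace every match of each requested entity type with a placeholder.
    for entity in entities:
        text = PATTERNS[entity].sub(f"<{entity}>", text)
    return text


sample = "Contact john.doe@example.com or call 415-555-1234."
redacted = redact_text(sample, ["EMAIL_ADDRESS", "PHONE_NUMBER"])
print(redacted)  # Contact <EMAIL_ADDRESS> or call <PHONE_NUMBER>.
```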

Summarize

Summarize the information in a document. token_limit may be passed to configure the size of the summary, although OpenAI may not respect it.

document = await document.summarize().execute()

Refine

Remove all information from a document unless it's related to a specific set of topics.

document = await document.refine(topics=['marketing', 'meetings']).execute()

Translate

Translates text into another language.

document = await document.translate(language="spanish").execute()

Interrogate

Convert information in a document into question and answer format. End user queries often take the form of a question, so converting information into questions and creating indexes from these questions often yields better results when using a vector database for context retrieval.

document = await document.interrogate().execute()
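
As a rough sketch of the indexing idea, question and answer pairs shaped like the questions_and_answers field in the example output can be flattened into records where the question is the text to embed and the answer is the stored payload. The to_index_records helper and the record field names are hypothetical, and the embedding and vector database steps are left out.

```python
from typing import Dict, List


def to_index_records(qa_pairs: List[Dict[str, str]], doc_id: str) -> List[Dict[str, str]]:
    # One record per question: embed the question, return the answer on a hit.
    records = []
    for i, pair in enumerate(qa_pairs):
        records.append({
            "id": f"{doc_id}-q{i}",
            "text_to_embed": pair["question"],
            "payload": pair["answer"],
        })
    return records


qa_pairs = [
    {
        "question": "Who is commended for enhancing the company's network security?",
        "answer": "John Doe from the IT department is commended for enhancing the company's network security.",
    },
]
records = to_index_records(qa_pairs, doc_id="memo")
print(records[0]["id"])  # memo-q0
```

Because a user query and an indexed question are phrased similarly, their embeddings tend to sit closer together than a query and a raw paragraph would.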

Contributing

Doctran is open to contributions! The best way to get started is to contribute a new document transformer. Transformers that don't rely on API calls (e.g. OpenAI) are especially valuable since they can run fast and don't require any external dependencies.

Adding a new doctransformer

Contributing new transformers is straightforward.

  1. Add a new class that extends DocumentTransformer or OpenAIDocumentTransformer
  2. Implement the __init__ and transform methods
  3. Add corresponding methods to the DocumentTransformationBuilder and Document classes to enable chaining
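
The first two steps can be sketched with stand-in classes. The real DocumentTransformer base class and its transform signature are not shown here, so SketchDocument, SketchTransformer, and WhitespaceNormalizer are hypothetical; check the actual base class before adapting this.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SketchDocument:
    # Hypothetical stand-in for a document object holding transformed text.
    content: str


class SketchTransformer(ABC):
    # Hypothetical stand-in for a transformer base class.
    @abstractmethod
    def transform(self, document: SketchDocument) -> SketchDocument:
        ...


class WhitespaceNormalizer(SketchTransformer):
    # A transformer that needs no API calls: it collapses runs of whitespace.
    def __init__(self, replacement: str = " ") -> None:
        self.replacement = replacement

    def transform(self, document: SketchDocument) -> SketchDocument:
        cleaned = re.sub(r"\s+", self.replacement, document.content).strip()
        return SketchDocument(content=cleaned)


result = WhitespaceNormalizer().transform(SketchDocument("  messy \n\n text  "))
print(result.content)  # messy text
```

The third step, exposing the transformer through the builder and document classes, would then add a method that queues an instance of the new class so it can take part in a chain.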
