PDF to markdown using Azure OpenAI batch processing

Parallex

What it does

Converts the PDF into images
Sends the images to Azure OpenAI for conversion to markdown using the Batch API
- Azure OpenAI Batch
- OpenAI Batch
Polls for batch completion, then converts the AI responses into structured output keyed to the page of the corresponding PDF
Runs optional post-batch processing so you can do what you wish with the resulting markdown
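
The polling step above can be sketched as a small loop. This is only an illustration, not Parallex's internal code: wait_for_batch, make_fake_status, and the fetch_status callable are hypothetical names, and the terminal status names are an assumption modelled on the OpenAI Batch API.

```python
import asyncio
from typing import Awaitable, Callable

# Assumption: terminal states follow the status names of the OpenAI Batch API.
TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}


async def wait_for_batch(
    fetch_status: Callable[[], Awaitable[str]],
    poll_interval: float = 5.0,
    max_polls: int = 100,
) -> str:
    """Call fetch_status until it reports a terminal state, then return it."""
    for _ in range(max_polls):
        status = await fetch_status()
        if status in TERMINAL_STATES:
            return status
        # Not finished yet: wait before asking again.
        await asyncio.sleep(poll_interval)
    raise TimeoutError(f"batch still running after {max_polls} polls")


def make_fake_status(states: list[str]) -> Callable[[], Awaitable[str]]:
    """Hypothetical stand-in for a real status lookup; replays canned states."""
    remaining = iter(states)

    async def fetch_status() -> str:
        return next(remaining)

    return fetch_status
```

With a real client, fetch_status would wrap whatever call returns the batch status. Capping the number of polls keeps a stuck batch from blocking the caller forever.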

Requirements

Parallex uses graphicsmagick for the conversion of PDF to images.

brew install graphicsmagick
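
To confirm the requirement is met before running a conversion, a quick check along these lines may help. It is a minimal sketch, assuming the gm executable that GraphicsMagick installs is what needs to be on PATH; graphicsmagick_available is a hypothetical helper, not part of Parallex.

```python
import shutil


def graphicsmagick_available(executable: str = "gm") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if not graphicsmagick_available():
        print("GraphicsMagick not found; install it before using Parallex.")
```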

Installation

pip install parallex

Example usage

import os
from parallex.models.parallex_callable_output import ParallexCallableOutput
from parallex.parallex import parallex

# Azure OpenAI connection settings (placeholder values)
os.environ["AZURE_API_KEY"] = "key"
os.environ["AZURE_API_BASE"] = "your-endpoint.com"
os.environ["AZURE_API_VERSION"] = "deployment_version"
os.environ["AZURE_API_DEPLOYMENT"] = "deployment_name"

model = "gpt-4o"

def example_post_process(output: ParallexCallableOutput) -> None:
    # Post-processing hook: receives the structured output for the file
    file_name = output.file_name
    for page in output.pages:
        markdown_for_page = page.output_content
        pdf_page_number = page.page_number

async def some_operation(file_url: str) -> None:
    response_data: ParallexCallableOutput = await parallex(
        model=model,
        pdf_source_url=file_url,
        post_process_callable=example_post_process,  # Optional
        concurrency=2,  # Optional
        prompt_text="Turn images into markdown",  # Optional
        log_level="ERROR",  # Optional
    )
    pages = response_data.pages
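
Since the example relies on four environment variables, it can be useful to fail early when one is missing. The following is a minimal sketch; missing_env_vars is a hypothetical helper, and only the variable names come from the example above.

```python
import os

# Variable names taken from the usage example.
REQUIRED_ENV_VARS = (
    "AZURE_API_KEY",
    "AZURE_API_BASE",
    "AZURE_API_VERSION",
    "AZURE_API_DEPLOYMENT",
)


def missing_env_vars(env=None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing settings: " + ", ".join(missing))
```

Because some_operation is a coroutine, it has to be awaited or started from synchronous code with asyncio.run.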

Responses have the following structure:

class ParallexCallableOutput(BaseModel):
    file_name: str = Field(description="Name of file that is processed")
    pdf_source_url: str = Field(description="Given URL of the source of output")
    trace_id: UUID = Field(description="Unique trace for each file")
    pages: list[PageResponse] = Field(description="List of PageResponse objects")

class PageResponse(BaseModel):
    output_content: str = Field(description="Markdown generated for the page")
    page_number: int = Field(description="Page number of the associated PDF")
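
One common thing to do with this structure is to stitch the pages back into a single markdown document. The sketch below uses a stand-in dataclass (PageStub, hypothetical) with the same two fields as PageResponse so that it runs without Parallex installed; pages_to_markdown is likewise a hypothetical helper. It skips empty pages, since the default prompt asks the model to return an empty string when a page cannot be parsed.

```python
from dataclasses import dataclass


@dataclass
class PageStub:
    """Stand-in for PageResponse with the same two fields."""

    output_content: str
    page_number: int


def pages_to_markdown(pages) -> str:
    """Join per-page markdown in page order, skipping empty pages."""
    ordered = sorted(pages, key=lambda page: page.page_number)
    parts = [page.output_content.strip() for page in ordered]
    return "\n\n".join(part for part in parts if part)
```

Sorting by page_number is a precaution: the sketch does not assume that pages arrive in document order.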

Default prompt is

"""
    Convert the following PDF page to markdown.
    Return only the markdown with no explanation text.
    Leave out any page numbers and redundant headers or footers.
    Do not include any code blocks (e.g. "```markdown" or "```") in the response.
    If unable to parse, return an empty string.
"""

Batch processing for list of prompts

If you do not need to process images and only want to run prompts through the Batch API, call parallex_simple_prompts:

response_data: ParallexPromptsCallableOutput = await parallex_simple_prompts(
    model=model,
    prompts=["Some prompt", "Some other prompt"],
    post_process_callable=example_post_process
)
responses = response_data.responses

This creates a single batch containing every prompt in the prompts list. Each response can be tied back to its prompt by index.

Responses have the following structure:

class ParallexPromptsCallableOutput(BaseModel):
    original_prompts: list[str] = Field(description="List of given prompts")
    trace_id: UUID = Field(description="Unique trace for each file")
    responses: list[PromptResponse] = Field(description="List of PromptResponse objects")

class PromptResponse(BaseModel):
    output_content: str = Field(description="Response from the model")
    prompt_index: int = Field(description="Index corresponding to the given prompts")
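
Tying responses back to prompts by index could look like the following. This is a sketch under two assumptions: prompt_index is zero-based and refers to a position in original_prompts, and a prompt may come back without a response. PromptResponseStub and pair_prompts_with_responses are hypothetical names; the stub mirrors the fields of PromptResponse so the code runs without Parallex installed.

```python
from dataclasses import dataclass


@dataclass
class PromptResponseStub:
    """Stand-in for PromptResponse with the same two fields."""

    output_content: str
    prompt_index: int


def pair_prompts_with_responses(original_prompts, responses):
    """Return (prompt, response text) pairs; None marks a missing response."""
    # Assumption: prompt_index is a zero-based position in original_prompts.
    by_index = {r.prompt_index: r.output_content for r in responses}
    return [(prompt, by_index.get(i)) for i, prompt in enumerate(original_prompts)]
```

Looking responses up by prompt_index rather than by list position keeps the pairing correct even if responses come back out of order.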

parallex 0.3.2
