Annotation tools for PDF

PlasmaPDF Quick Start Guide

PlasmaPDF is a Python library for converting text spans to x-y positioned tokens in the PAWLs format. It is a utility library used in OpenContracts.

Installation

To install PlasmaPDF, use pip:

pip install plasmapdf

Basic Usage

1. Importing the Library

Start by importing the necessary components:

from plasmapdf.models.PdfDataLayer import build_translation_layer
from plasmapdf.models.types import TextSpan, SpanAnnotation, PawlsPagePythonType

2. Creating a PdfDataLayer

The core of PlasmaPDF is the PdfDataLayer class. You create an instance of this class by passing a list of PAWLs pages to the build_translation_layer function imported above:

pawls_tokens: list[PawlsPagePythonType] = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [
            {"x": 72, "y": 72, "width": 50, "height": 12, "text": "Hello"},
            {"x": 130, "y": 72, "width": 50, "height": 12, "text": "World"}
        ]
    }
]

pdf_data_layer = build_translation_layer(pawls_tokens)
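
To make the idea of a translation layer concrete, the sketch below shows one way to turn PAWLs pages into a flat text string while recording the character range each token occupies. It does not use PlasmaPDF: the helper name flatten_pawls_pages and the rule of joining tokens with a single space are assumptions made for illustration, and the library's own text layout may differ.

```python
def flatten_pawls_pages(pages):
    """Join token texts with single spaces and record where each token lands.

    Returns the joined text and a list of (page_index, token_index, start, end)
    tuples. start and end are character offsets into the joined text, and end
    is exclusive.
    """
    parts = []
    offsets = []
    cursor = 0
    for page in pages:
        page_index = page["page"]["index"]
        for token_index, token in enumerate(page["tokens"]):
            if parts:
                # Separate this token from the previous one (assumed rule).
                parts.append(" ")
                cursor += 1
            start = cursor
            parts.append(token["text"])
            cursor += len(token["text"])
            offsets.append((page_index, token_index, start, cursor))
    return "".join(parts), offsets


pages = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [
            {"x": 72, "y": 72, "width": 50, "height": 12, "text": "Hello"},
            {"x": 130, "y": 72, "width": 50, "height": 12, "text": "World"},
        ],
    }
]

text, offsets = flatten_pawls_pages(pages)
print(text)     # Hello World
print(offsets)  # [(0, 0, 0, 5), (0, 1, 6, 11)]
```

With offsets like these, a character span can be mapped back to the tokens, and therefore the page coordinates, that it covers.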

3. Working with Text Spans

You can extract raw text from a span in the document:

span = TextSpan(id="1", start=0, end=11, text="Hello World")
raw_text = pdf_data_layer.get_raw_text_from_span(span)
print(raw_text)  # Output: "Hello World"
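
The reverse direction, going from a character span to a position on the page, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not PlasmaPDF's implementation: each token dictionary is assumed to carry start and end character offsets (end exclusive) in addition to the PAWLs fields, and tokens_in_span and bounding_box are hypothetical helpers.

```python
def tokens_in_span(tokens, start, end):
    """Return the tokens whose character range overlaps [start, end)."""
    return [t for t in tokens if t["start"] < end and t["end"] > start]


def bounding_box(tokens):
    """Smallest (left, top, right, bottom) rectangle covering all tokens."""
    left = min(t["x"] for t in tokens)
    top = min(t["y"] for t in tokens)
    right = max(t["x"] + t["width"] for t in tokens)
    bottom = max(t["y"] + t["height"] for t in tokens)
    return left, top, right, bottom


# PAWLs-style tokens with assumed character offsets attached.
tokens = [
    {"x": 72, "y": 72, "width": 50, "height": 12, "text": "Hello", "start": 0, "end": 5},
    {"x": 130, "y": 72, "width": 50, "height": 12, "text": "World", "start": 6, "end": 11},
]

hits = tokens_in_span(tokens, 0, 11)
print([t["text"] for t in hits])  # ['Hello', 'World']
print(bounding_box(hits))         # (72, 72, 180, 84)
```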

4. Creating Annotations

To create an annotation:

span_annotation = SpanAnnotation(span=span, annotation_label="GREETING")
oc_annotation = pdf_data_layer.create_opencontract_annotation_from_span(span_annotation)

5. Accessing Document Information

You can access various pieces of information about the document:

print(pdf_data_layer.doc_text)  # Full document text
print(pdf_data_layer.human_friendly_full_text)  # Human-readable version of the text
print(pdf_data_layer.page_dataframe)  # DataFrame with page information
print(pdf_data_layer.tokens_dataframe)  # DataFrame with token information

Advanced Usage

Working with Multi-Page Documents

PlasmaPDF can handle multi-page documents. When you create the PdfDataLayer, make sure to include tokens for all pages:

multi_page_pawls_tokens = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [...]
    },
    {
        "page": {"width": 612, "height": 792, "index": 1},
        "tokens": [...]
    }
]

pdf_data_layer = build_translation_layer(multi_page_pawls_tokens)
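
One way to confirm that no page is missing before building the layer is to compare the page indices against a contiguous 0-based range. The sketch below is standalone: missing_page_indices is a hypothetical helper, not part of PlasmaPDF, and it assumes pages are indexed from 0 as in the example above.

```python
def missing_page_indices(pages):
    """Return the page indices absent from the range 0..max(index)."""
    seen = {page["page"]["index"] for page in pages}
    if not seen:
        return []
    return [i for i in range(max(seen) + 1) if i not in seen]


# Page 1 was dropped somewhere upstream.
pages = [
    {"page": {"width": 612, "height": 792, "index": 0}, "tokens": []},
    {"page": {"width": 612, "height": 792, "index": 2}, "tokens": []},
]

print(missing_page_indices(pages))  # [1]
```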

Splitting Spans Across Pages

If you have a span that potentially crosses page boundaries, you can split it:

long_span = TextSpan(id="2", start=0, end=1000, text="...")
page_aware_spans = pdf_data_layer.split_span_on_pages(long_span)
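
Conceptually, splitting a span on pages means intersecting its character range with the range each page occupies in the document text. The following sketch shows that idea in isolation. The function split_span_by_pages and the page_ranges structure are assumptions for illustration and are not the return format of split_span_on_pages.

```python
def split_span_by_pages(start, end, page_ranges):
    """Split [start, end) into one piece per page it overlaps.

    page_ranges is a list of (page_index, page_start, page_end) tuples giving
    each page's character range in the document text, with exclusive ends.
    """
    pieces = []
    for page_index, page_start, page_end in page_ranges:
        piece_start = max(start, page_start)
        piece_end = min(end, page_end)
        if piece_start < piece_end:
            # The span and this page share at least one character.
            pieces.append({"page": page_index, "start": piece_start, "end": piece_end})
    return pieces


# Hypothetical page boundaries for a three-page document.
page_ranges = [(0, 0, 400), (1, 400, 900), (2, 900, 1500)]

for piece in split_span_by_pages(0, 1000, page_ranges):
    print(piece)
# {'page': 0, 'start': 0, 'end': 400}
# {'page': 1, 'start': 400, 'end': 900}
# {'page': 2, 'start': 900, 'end': 1000}
```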

Creating OpenContracts Annotations

To create an annotation in the OpenContracts format:

span = TextSpan(id="3", start=0, end=21, text="Important clause here")
span_annotation = SpanAnnotation(span=span, annotation_label="IMPORTANT_CLAUSE")
oc_annotation = pdf_data_layer.create_opencontract_annotation_from_span(span_annotation)

Utility Functions

PlasmaPDF includes utility functions for working with job results:

from plasmapdf.utils.utils import package_job_results_to_oc_generated_corpus_type

# Assume you have job_results, possible_span_labels, possible_doc_labels, 
# possible_relationship_labels, and suggested_label_set

corpus = package_job_results_to_oc_generated_corpus_type(
    job_results,
    possible_span_labels,
    possible_doc_labels,
    possible_relationship_labels,
    suggested_label_set
)

This function packages job results into the OpenContracts corpus format.

Testing

PlasmaPDF comes with a suite of unit tests. You can run these tests to ensure everything is working correctly:

hatch test

This will run all the tests in the tests directory.

Conclusion

This quick start guide covers the basics of using PlasmaPDF. For more detail, refer to the full documentation or explore the source code. If you encounter issues or have questions, use the project's issue tracker.
