Organize your document transformation pipeline. Define your components, and this tool ensures traceability and organization. Reuse your work easily in other projects.

Document Transformer

Document Transformer allows users to define and apply transformations to documents in a flexible and robust manner, ensuring traceability of each change made to the documents.

Features

  • Flexible document transformation
  • Comprehensive traceability for each transformation
  • Custom support for multiple document formats (e.g., JSON, XML, CSV)
  • Easy integration with other tools and workflows
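
To make the traceability idea concrete, here is a minimal, self-contained sketch of parent tracking. It does not use or reproduce the library's internals: `TracedDoc`, `split_words`, and `lineage` are hypothetical names invented for illustration, and the only idea borrowed from this page is that each document carries an `id` and a list of `parents`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List
from uuid import uuid4


@dataclass
class TracedDoc:
    """Hypothetical stand-in for a document; not the library's Document class."""
    data: Any
    metadata: Dict[str, Any] = field(default_factory=dict)
    parents: List[str] = field(default_factory=list)
    id: str = field(default_factory=lambda: uuid4().hex)


def split_words(doc: TracedDoc) -> List[TracedDoc]:
    """Create one child per word and record the source document as its parent."""
    return [
        TracedDoc(data=word, metadata={"position": i + 1}, parents=[doc.id])
        for i, word in enumerate(doc.data.split())
    ]


def lineage(doc: TracedDoc, index: Dict[str, TracedDoc]) -> List[str]:
    """Follow parent links back to the root; returns ids, nearest ancestor first."""
    chain = []
    stack = list(doc.parents)
    while stack:
        parent_id = stack.pop()
        chain.append(parent_id)
        stack.extend(index[parent_id].parents)
    return chain


# Two chained transformations: split into words, then uppercase each word.
source = TracedDoc(data="alpha beta gamma")
words = split_words(source)
upper = [TracedDoc(data=w.data.upper(), parents=[w.id]) for w in words]

# An id -> document index is enough to reconstruct where any output came from.
index = {d.id: d for d in [source, *words, *upper]}
print(lineage(upper[0], index) == [words[0].id, source.id])
```

Because every output only stores the ids of its direct parents, the full history of a document can be rebuilt after the fact by walking those links.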

Installation

To install Document Transformer, follow these steps:

# Install using pip
pip install document-transformer

Usage

Define custom Document class

from document_transformer import Document

class PDFDocument(Document):
    """Custom class for PDF documents"""

class ImageDocument(Document):
    """Custom class for image documents"""
    def saver(self, path):
        self.data.save(path)
        return self

Define the transformer. Specify input and output Document types

from document_transformer import DocumentTransformer
import pdf2image  # install: pip install pdf2image (also requires poppler on the system)
from typing import List
from pathlib import Path

class PDF2Images(DocumentTransformer):
    input: PDFDocument = None
    output: List[ImageDocument] = []

    def transformer(self) -> List[ImageDocument]:
        """Split the PDF document into pages"""
        images = pdf2image.convert_from_path(self.input.path)
        return [
            ImageDocument(
                metadata={'pdf_path': Path(self.input.path).name, 'page': i+1, 'size': image.size},
                data=image,
            )
            for i, image in enumerate(images)
        ]

Run your implementation

pdf_doc = PDFDocument(path="document.pdf")
images = PDF2Images(input=pdf_doc).run()

for image in images:
    image.save(path=f'images/pag_{image.metadata["page"]}.jpg')
    print(f"Image: {image.id}")
    print(f"Parents: {image.parents}")
    print(f"Metadata: {image.metadata}")

Or run it as a pipeline and visualize the transformation graph

from document_transformer import Pipeline
from document_transformer.utils import plot_graph

# Define Pipeline, add more transformers as you need
pipeline = Pipeline(transformers=[
    PDF2Images(to="images/pag_{metadata[page]}.jpg"),
    # Images2Markdown(to="images/pag_{metadata[page]}.md"),
    # ...
])

# Define input and get output
pdf_doc = PDFDocument(path="document.pdf")
images = pipeline.run(input=pdf_doc)

# Plot the transformation graph
plot_graph(pipeline.get_traces())

[Figure: plot_graph.png]
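
The `to` argument in the pipeline example looks like a Python format string. As a minimal sketch, assuming (this page does not confirm it) that the template is expanded with `str.format` and a `metadata` keyword, the output path for each page would resolve as shown. `resolve` is a hypothetical helper, not part of the library.

```python
# Assumption: the library expands `to` with str.format(metadata=...).
# `resolve` is a hypothetical helper written only for this illustration.
template = "images/pag_{metadata[page]}.jpg"


def resolve(template: str, metadata: dict) -> str:
    """Fill the template; `{metadata[page]}` looks up the string key 'page'."""
    return template.format(metadata=metadata)


paths = [resolve(template, {"page": n}) for n in (1, 2, 3)]
print(paths)  # ['images/pag_1.jpg', 'images/pag_2.jpg', 'images/pag_3.jpg']
```

Note that inside a format field the key is written without quotes: `{metadata[page]}`, not `{metadata['page']}`.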

Contributing

We welcome contributions! Please read our Contributing Guide to learn how you can help.

License

Document Transformer is licensed under the MIT License.

Contact

If you have any questions or feedback, please feel free to reach out to us at johngonzalezv@gmail.com.
