Document Transformer
Document Transformer allows users to define and apply transformations to documents in a flexible and robust manner, ensuring traceability of each change made to the documents.
Features
- Flexible document transformation
- Comprehensive traceability for each transformation
- Add custom support for multiple document formats (e.g., JSON, XML, CSV)
- Easy integration with other tools and workflows
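To make the traceability idea concrete, here is a minimal, self-contained sketch of how a derived document could record the ids of the documents it came from. `TracedDoc` and `derive` are hypothetical stand-ins invented for this illustration; they are not part of Document Transformer's API, and the library's own implementation may differ.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TracedDoc:
    """Hypothetical stand-in for a document that remembers its origin."""
    data: Any = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    parents: List[str] = field(default_factory=list)
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def derive(parent: TracedDoc, data: Any, **metadata: Any) -> TracedDoc:
    """Create a child document whose `parents` list points back to `parent`."""
    return TracedDoc(data=data, metadata=metadata, parents=[parent.id])


# Split a two-line "document" into pages, keeping a link to the source.
source = TracedDoc(data="page one\npage two")
pages = [
    derive(source, text, page=i + 1)
    for i, text in enumerate(source.data.split("\n"))
]
```

Each entry in `pages` carries the id of `source` in its `parents` list, which is the kind of information the `parents` attribute in the usage example below appears to expose.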
Installation
To install Document Transformer, follow these steps:
# Install using pip
pip install document-transformer
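To confirm that the installation succeeded, one option is to query the installed distribution's version with the standard library. This is a small sketch; `installed_version` is a helper name chosen here, not something the package provides.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("document-transformer"))
```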
Usage
Define custom Document classes
from document_transformer import Document

class PDFDocument(Document):
    """Custom class for PDF documents"""

class ImageDocument(Document):
    """Custom class for image documents"""

    def saver(self, path):
        # self.data holds the page image, which provides its own save() method
        self.data.save(path)
        return self
Define the transformer. Specify input and output Document types
from document_transformer import DocumentTransformer
import pdf2image  # install: pip install pdf2image
from typing import List
from pathlib import Path

class PDF2Images(DocumentTransformer):
    input: PDFDocument = None
    output: List[ImageDocument] = []

    def transformer(self) -> List[ImageDocument]:
        """Split the PDF document into pages"""
        images = pdf2image.convert_from_path(self.input.path)
        return [
            ImageDocument(
                metadata={'pdf_path': Path(self.input.path).name, 'page': i+1, 'size': image.size},
                data=image,
            )
            for i, image in enumerate(images)
        ]
Run your implementation
pdf_doc = PDFDocument(path="document.pdf")
images = PDF2Images(input=pdf_doc).run()

for image in images:
    image.save(path=f'images/pag_{image.metadata["page"]}.jpg')
    print(f"Image: {image.id}")
    print(f"Parents: {image.parents}")
    print(f"Metadata: {image.metadata}")
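The loop above writes into an `images/` directory. Pillow's `Image.save` does not create missing directories, and this README does not say whether Document Transformer does, so it is safest to create the directory first. A minimal sketch, where `ensure_parent_dir` is a hypothetical helper and a temporary directory stands in for the project folder:

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path: str) -> Path:
    """Create the parent directory of `path` if needed and return the full path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Demonstrate inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(f"{tmp}/images/pag_1.jpg")
    created = target.parent.is_dir()
```

In the loop above, the same call could wrap the path passed to `image.save`.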
Or run it as a pipeline and visualize the transformation graph
from document_transformer import Pipeline
from document_transformer.utils import plot_graph
# Define Pipeline, add more transformers as you need
pipeline = Pipeline(transformers=[
    PDF2Images(to="images/pag_{metadata[page]}.jpg"),
    # Images2Markdown(to="images/pag_{metadata[page]}.md"),
    # ...
])
# Define input and get output
pdf_doc = PDFDocument(path="document.pdf")
images = pipeline.run(input=pdf_doc)
# Plot the transformation graph
plot_graph(pipeline.get_traces())
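The `to` argument above looks like a Python format string that indexes into each document's metadata. Assuming the library expands it with `str.format`-style rules (an assumption; the README does not confirm this), the resulting path can be previewed as follows. `expand_target` is a hypothetical helper, not a library function.

```python
def expand_target(template: str, metadata: dict) -> str:
    """Fill a path template such as 'images/pag_{metadata[page]}.jpg'."""
    # In format strings, {metadata[page]} looks up the string key "page".
    return template.format(metadata=metadata)


path = expand_target(
    "images/pag_{metadata[page]}.jpg",
    {"page": 3, "pdf_path": "document.pdf"},
)
print(path)  # images/pag_3.jpg
```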
Contributing
We welcome contributions! Please read our Contributing Guide to learn how you can help.
License
Document Transformer is licensed under the MIT License.
Contact
If you have any questions or feedback, please feel free to reach out to us at johngonzalezv@gmail.com.
Hashes for document_transformer-0.1.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | e65335d9d7e2028de0062ab253d1074ff383e8e44099824764544aea7d588ef6
MD5 | c80c1808d681d571903d5b8a84da32c1
BLAKE2b-256 | 6d408030cea725c34540cc1bf93879566cfdc127eeae4fb563b70cbd7823abdb
Hashes for document_transformer-0.1.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 1192f5b0a3735d54728a8e4bdf82d7051b69b59a8c6070a857d84802fda40d93
MD5 | 071ac905b414c287fe9612f75870f228
BLAKE2b-256 | def5440cb4e05e48b5f3935e6d871c54e6222113d1f1f082e53c44c8dd07271d