A Python package to structure files using visual and style information
FileStruct
FileStruct is a high-level Python library that aims to extract the overall structure of documents, particularly PDFs, based on visual information such as size, color and font.
How does it work?
As human readers, we detect titles, subtitles, and paragraphs from the visual appearance of a document. Large red text almost certainly represents a title (or subtitle). Using these heuristics, we can structure a document: this paragraph belongs to that section. This package uses the same method to provide an automated yet realistic way to structure a document. The method is described below:
- Text and style extraction: We rely on lower-level libraries (like PyMuPDF) to extract the text and style information and to order the blocks of text.
- Tree creation: A tree is created in which each block of text is a node. A child of a node corresponds to a subsection (or paragraph) of the section that node represents.
- Data exportation: The data can be exported in JSON format.
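The tree creation and export steps can be illustrated with a minimal sketch. It is not FileStruct's actual implementation: the `Block`, `Node` and `build_tree` names are hypothetical, and font size alone stands in for the style information (the package also considers color and font). The assumption is that blocks arrive in reading order and that a block with larger text is the parent of the smaller text that follows it.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Block:
    """A block of text with the single style attribute this sketch uses (hypothetical type)."""
    text: str
    font_size: float


@dataclass
class Node:
    """A tree node; its children are the subsections or paragraphs under it (hypothetical type)."""
    text: str
    font_size: float
    children: list["Node"] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Recursively convert the node and its children to plain dicts.
        return {"text": self.text, "children": [c.to_dict() for c in self.children]}


def build_tree(blocks: list[Block]) -> Node:
    """Nest blocks, given in reading order, under the closest preceding block with larger text."""
    root = Node(text="ROOT", font_size=float("inf"))
    stack = [root]  # path from the root to the most recently added node
    for block in blocks:
        node = Node(block.text, block.font_size)
        # Climb up until the node on top of the stack has strictly larger text.
        while stack[-1].font_size <= node.font_size:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


blocks = [
    Block("Introduction", 18),
    Block("Background", 14),
    Block("Some text.", 10),
    Block("More text.", 10),
    Block("Methods", 18),
]
tree = build_tree(blocks)
print(json.dumps(tree.to_dict(), indent=2))  # export the tree as JSON
```

Here "Background" becomes a child of "Introduction", the two paragraphs become siblings under "Background", and "Methods" starts a new top-level section. A real document needs more robust heuristics than a single font-size comparison.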
For now, FileStruct can only read formats that are supported by PyMuPDF: PDF, EPUB, XPS, MOBI, FB2, CBZ and SVG. I plan to add more file formats in the future.
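A quick pre-check of the file extension can catch unsupported inputs before they are handed to the library. The sketch below is an illustration only: `is_supported` and `SUPPORTED_EXTENSIONS` are hypothetical names, not part of FileStruct, and the check looks at the extension rather than the file contents.

```python
from pathlib import Path

# Extensions of the formats listed above as readable through PyMuPDF.
SUPPORTED_EXTENSIONS = {".pdf", ".epub", ".xps", ".mobi", ".fb2", ".cbz", ".svg"}


def is_supported(path: str) -> bool:
    """Return True if the file extension matches one of the listed formats (hypothetical helper)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ["report.PDF", "book.epub", "notes.docx"]:
    print(name, is_supported(name))
```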
Installation
Install FileStruct using pip:
pip install filestruct
Getting Started
Below is a basic usage example for a PDF document:
from filestruct.document import PDFDocument
doc = PDFDocument("PATH_TO_YOUR_FILE.pdf")  # Use the class that was imported above
data = doc.to_json()  # Export the tree to JSON format
print(data)
print(doc)
Source Distribution
File details
Details for the file filestruct-0.2.tar.gz.
File metadata
- Download URL: filestruct-0.2.tar.gz
- Upload date:
- Size: 17.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes
Algorithm | Hash digest
---|---
SHA256 | bded6c726d9950261020c3d3b8c7cad65bea808101d304f5bd607eeca9481699
MD5 | 1f9e1b1d4b37c1eef349641023d97878
BLAKE2b-256 | 940763392fa76d330921f591fd135183e354b75528fc5aab1e0cd0a7dea1018a