A Python package to structure files using visual and style information

FileStruct

FileStruct is a high-level Python library that aims to extract the overall structure of documents, particularly PDFs, based on visual information such as size, color and font.

How does it work?

As humans, we detect titles, subtitles, and paragraphs from the visual appearance of a document. Large red text most likely represents a title (or subtitle). Using such heuristics, we can structure a document: this paragraph belongs to this section. This package applies the same method to structure a document automatically yet realistically. The method is described below:

  1. Text and style extraction: We rely on lower-level libraries (like PyMuPDF) to extract the text and style information and to order the blocks of text.
  2. Tree creation: A tree is created in which each block of text is a node. A child of a node in the tree is a subsection of a section in the document.
  3. Data export: The data can be exported in JSON format.
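The tree-creation and export steps above can be illustrated with a small sketch. This is not FileStruct's actual implementation: the `Node` class, the `build_tree` function, and the rule "a block is nested under the closest preceding block with a strictly larger font size" are assumptions chosen for illustration, and font size is the only style cue used here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: one block of text plus its subsections."""
    text: str
    size: float                      # font size in points
    children: list = field(default_factory=list)

    def to_dict(self):
        return {"text": self.text,
                "children": [c.to_dict() for c in self.children]}


def build_tree(blocks):
    """Nest (text, size) blocks in reading order.

    Assumed heuristic: a block becomes a child of the closest preceding
    block whose font size is strictly larger.
    """
    root = Node("ROOT", float("inf"))
    stack = [root]                   # path from the root to the current section
    for text, size in blocks:
        # Close every open section whose text is not larger than this block.
        while stack[-1].size <= size:
            stack.pop()
        node = Node(text, size)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Made-up blocks, already in reading order.
blocks = [
    ("Introduction", 18.0),
    ("Some opening text.", 11.0),
    ("Background", 14.0),
    ("More detail.", 11.0),
    ("Methods", 18.0),
]
tree = build_tree(blocks)
print(json.dumps(tree.to_dict(), indent=2))
```

In this sketch, "Background" ends up under "Introduction" because its font is smaller, while "Methods" returns to the top level. A real heuristic would likely also weigh color and font family, as the description above suggests.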

For now, FileStruct can only read formats that are supported by PyMuPDF. This includes PDF, EPUB, XPS, MOBI, FB2, CBZ and SVG. I plan to add more file formats in the future.
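For context, PyMuPDF exposes per-span style information through `page.get_text("dict")`, which returns nested blocks, lines, and spans. The sketch below shows one way to flatten that structure into styled text spans. The helper name `iter_styled_spans` is hypothetical and is not part of FileStruct, and the sample dictionary is hand-written to mimic PyMuPDF's output so the example runs without a file.

```python
def iter_styled_spans(page_dict):
    """Yield (text, font, size, color) for every non-empty text span."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:       # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span["text"].strip()
                if text:
                    yield text, span["font"], span["size"], span["color"]


# Hand-written sample shaped like the output of page.get_text("dict").
# With PyMuPDF installed, the real input would come from something like:
#   import fitz
#   page_dict = fitz.open("PATH_TO_YOUR_FILE.pdf")[0].get_text("dict")
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "Title", "font": "Helvetica-Bold",
             "size": 18.0, "color": 0xFF0000},
        ]}]},
        {"type": 1},                      # image block, skipped
        {"type": 0, "lines": [{"spans": [
            {"text": "   ", "font": "Helvetica", "size": 11.0, "color": 0},
            {"text": "Body text.", "font": "Helvetica",
             "size": 11.0, "color": 0},
        ]}]},
    ]
}

spans = list(iter_styled_spans(sample))
for text, font, size, color in spans:
    print(f"{size:>5.1f}pt  {font:<16} #{color:06x}  {text}")
```

The `color` value is an sRGB integer, so the red title in the sample prints as `#ff0000`.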

Installation

Install FileStruct using pip:

pip install filestruct

Getting Started

Below is a basic usage example for a PDF document:

from filestruct.document import PDFDocument

# Use the imported PDFDocument class (the original called an undefined name, Document)
doc = PDFDocument("PATH_TO_YOUR_FILE.pdf")
data = doc.to_json()   # Export the tree to JSON format
print(data)
print(doc)
