A tool that extracts text, tables, charts, and formulas from documents and converts them into Markdown, useful for improving LLM accuracy and for general document processing.

These details have not been verified by PyPI

Project description

PyPI License: CC BY-NC 4.0

MDify: Convert any document to Markdown

MDify is a powerful Python library for converting documents into clean, structured Markdown.

Unlike other tools, MDify can accurately extract tables, charts, and images, even offering the option to save them separately for further use.
This is particularly useful when working with documents like financial statements, spreadsheets, and data-rich reports, which usually have lots of tables and images.
MDify categorizes images into general pictures and charts and extracts tables of any kind, even complex ones with merged cells and sparse data.

Whether you're working with research papers, reports, or general documents, MDify ensures the data is extracted in a structured, clean, and machine-readable format, making it ideal for tasks like fine-tuning, question answering, and document analysis in the context of Large Language Models (LLMs).
By converting complex PDFs into well-structured Markdown, this tool helps streamline the input process for LLM applications, reducing the time spent on manual cleaning and formatting. With features like table extraction, image preservation, and high-quality OCR, MDify is a perfect fit for preparing large volumes of data for AI models.

IMPORTANT: Currently this tool only supports PDFs and images (such as text extracts, document scans, etc.) written in English.

🚀 Installation

First, install MDify via PyPI:

pip install mdify

⚡ Quickstart

Convert a document to Markdown with just a few lines of code:

from mdify import DocumentParser

parser = DocumentParser()
parser.parse('PATH_TO_YOUR_DOCUMENT')

Or parse multiple documents from one folder at once simply by changing the last line to:

parser.parse_directory('PATH_TO_YOUR_FOLDER')
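If you need per-file control, for example to skip unsupported files or to keep going after one document fails, you could loop over the folder yourself. The following is a minimal sketch that relies only on the parse() call shown above. The helper names and the extension set are hypothetical and not part of MDify; adjust the extensions to whatever the library actually accepts.

```python
from pathlib import Path

# Assumption: these extensions stand in for "PDFs and images"; not taken from MDify.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}


def find_documents(folder):
    """Return the supported files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def parse_each(parser, folder):
    """Call parser.parse() on every supported file and collect failures.

    `parser` is expected to be a DocumentParser instance (or anything with a
    compatible parse(path) method). Returns a dict {file name: exception}.
    """
    failures = {}
    for path in find_documents(folder):
        try:
            parser.parse(str(path))
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failures[path.name] = exc
    return failures
```

Calling parse_each(parser, 'PATH_TO_YOUR_FOLDER') returns an empty dict when every file was parsed without raising.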

Alternatively, you can pass the document to the parse() method as bytes, but in this case you must also provide the document name and type manually:

with open('PATH_TO_YOUR_DOCUMENT', 'rb') as f:
  document_bytes = f.read()
parser.parse(document_bytes, document_name='YOUR_DOCUMENT_NAME', document_type='pdf')
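When the bytes come from a file on disk, the name and type can be derived from the path instead of being typed by hand. A minimal sketch, assuming the document_name and document_type keywords shown above; the helper parse_from_bytes is hypothetical, and whether MDify expects the name with or without its extension is an assumption here.

```python
from pathlib import Path


def parse_from_bytes(parser, path):
    """Read `path` into memory and hand the bytes to parser.parse()."""
    p = Path(path)
    document_bytes = p.read_bytes()
    # Assumption: the name is passed without its extension, and the type is the
    # lowercase extension without the dot (e.g. 'pdf'), as in the example above.
    return parser.parse(
        document_bytes,
        document_name=p.stem,
        document_type=p.suffix.lstrip(".").lower(),
    )
```

The same pattern applies when the bytes come from somewhere else, such as an upload or a database, as long as you can supply a name and a type alongside them.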

You can choose which outputs to save with DocumentParser(save_artifacts=...), and you can set the write mode (embedded, placeholder, or described) by passing the write_mode parameter to the parse() method.

NB: To make the best use of this library and extract meaning from images, use the following code:

from mdify import WriteMode

parser.parse('PATH_TO_YOUR_DOCUMENT', write_mode=WriteMode.DESCRIBED)

🔹 Key Features

✔️ Handles complex layouts - Extracts text, tables, and visual elements with precision
🖼️ Preserves images & charts - Gives the option to save and reuse extracted visuals for Computer Vision tasks
🎯 Optimized for accuracy - Combines layout detection and OCR to extract text from documents
🤖 Preprocessing for LLM applications - Converts documents to Markdown, which is popular for LLM training and fine-tuning tasks
🛠️ Debug mode - Save intermediate document elements as images for analysis

NB:

The first run will take ~2 minutes to download the necessary models.
Diagrams are not supported yet, so they may be analyzed incorrectly if you use the DESCRIBED write mode.

📄 Documentation

For more information, please refer to the official documentation.

🤝 Contributing

MDify is an independent, open-source project developed and maintained by passionate developers. Your support is highly valued, and any contributions — whether through issues, bug reports, feature requests, or pull requests — are more than welcome!

If you are interested in improving this library or adding new features, please don't hesitate to get involved!

💖 Support

Being an independent developer, I would much appreciate it if you could

Thank you!

⚖️ License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

You can find the full text of the license here: CC BY-NC 4.0

❞ Citation

If you use this project, please download the citation from Zenodo: scroll all the way down, then choose the format you prefer (e.g. BibTeX) from the Export dropdown.

🔗 Acknowledgments

This project leverages several open-source repositories for different components:

pypdfium2 – PDF loading
Ultralytics – YOLO-based layout detection
Supervision – Rendering layout elements
Surya – Text recognition (primary OCR, though it's currently a bottleneck)
EasyOCR – Recognizing text headers and titles (solves Surya's issue with them)
PaddleOCR – Table recognition
Optimum – Formula extraction model integration
YOLOv10-Document-Layout-Analysis – YOLO model parameters
ChartDet – Chart detection model parameters
DePlot – Model for chart deconstruction
BLIP Image Captioning – Image captioning model
Pix2Text-MFR – Formula recognition model

A huge thanks to the developers and maintainers of these projects!

Project details

Release history

0.3.2 (this version) – Feb 24, 2025
0.3.0 – Feb 24, 2025
0.2.1 – Feb 23, 2025
0.2.0 – Feb 23, 2025
0.1.7 – Feb 3, 2025
0.1.6 – Feb 3, 2025

Download files

Source Distribution

mdify-0.3.2.tar.gz (23.9 kB) – uploaded Feb 24, 2025, Source

Built Distribution

mdify-0.3.2-py3-none-any.whl (24.6 kB) – uploaded Feb 24, 2025, Python 3

File details

Details for the file mdify-0.3.2.tar.gz.

File metadata

Download URL: mdify-0.3.2.tar.gz
Upload date: Feb 24, 2025
Size: 23.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for mdify-0.3.2.tar.gz
Algorithm	Hash digest
SHA256	`27770a6e359e19a5c8d6c219a534d09205688c16f76a468c12d524e234f7accf`
MD5	`a8972799b84287a907b76ade8c48718a`
BLAKE2b-256	`a99d2be5bc9f3b9988e260fa6fe2b1e6c693a123d53427b7b25dc68f12195205`

See more details on using hashes here.
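One way to check a downloaded file against the SHA256 digest listed above is to hash it locally with Python's standard hashlib module. This is a minimal sketch; the function names and the local file path are hypothetical, and only the digest value comes from the table.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """True if the file's SHA256 equals the published digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 published for mdify-0.3.2.tar.gz in the table above.
EXPECTED_SDIST_SHA256 = "27770a6e359e19a5c8d6c219a534d09205688c16f76a468c12d524e234f7accf"

# Usage (assumes the archive was downloaded to the current directory):
# matches_published_hash("mdify-0.3.2.tar.gz", EXPECTED_SDIST_SHA256)
```

If the function returns False, the file differs from the one published and should not be installed.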

Provenance

The following attestation bundles were made for mdify-0.3.2.tar.gz:

Publisher: publish.yml on stefanodangelo/mdify

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mdify-0.3.2.tar.gz
- Subject digest: 27770a6e359e19a5c8d6c219a534d09205688c16f76a468c12d524e234f7accf
- Sigstore transparency entry: 173817445
- Sigstore integration time: Feb 24, 2025
Source repository:
- Permalink: stefanodangelo/mdify@6e92c84215a5e9ab4e673305f4c89d383b10d112
- Branch / Tag: refs/tags/v0.3.2
- Owner: https://github.com/stefanodangelo
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6e92c84215a5e9ab4e673305f4c89d383b10d112
- Trigger Event: release

File details

Details for the file mdify-0.3.2-py3-none-any.whl.

File metadata

Download URL: mdify-0.3.2-py3-none-any.whl
Upload date: Feb 24, 2025
Size: 24.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for mdify-0.3.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e48d693ebdc85979dd79c00361e7b6fcee665132e2eb26a38ec7dd0294cca7f5`
MD5	`8748021bc221c6e2d238c642a7425139`
BLAKE2b-256	`74bc5847943fc70bd570db24cffd62b6415067f01c40813d9016cb840faf3a90`

See more details on using hashes here.

Provenance

The following attestation bundles were made for mdify-0.3.2-py3-none-any.whl:

Publisher: publish.yml on stefanodangelo/mdify

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mdify-0.3.2-py3-none-any.whl
- Subject digest: e48d693ebdc85979dd79c00361e7b6fcee665132e2eb26a38ec7dd0294cca7f5
- Sigstore transparency entry: 173817446
- Sigstore integration time: Feb 24, 2025
Source repository:
- Permalink: stefanodangelo/mdify@6e92c84215a5e9ab4e673305f4c89d383b10d112
- Branch / Tag: refs/tags/v0.3.2
- Owner: https://github.com/stefanodangelo
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6e92c84215a5e9ab4e673305f4c89d383b10d112
- Trigger Event: release
