
Convert documents to Markdown

Doctomarkdown

Doctomarkdown is a Python library to convert documents (like PDF) into clean, readable Markdown format. It supports extracting text, images, and tables, and is easily extensible for more document types.


Features

  • 📄 Convert PDF to Markdown
  • 🖼️ Extract images from documents (optional)
  • 📊 Extract tables from documents (optional)
  • 🤖 LLM support for advanced extraction (optional)
  • 🗂️ Extensible: Add support for DOCX, PPTX, CSV, and more
  • 🏷️ Custom output directory

Installation

Clone the repository and install in editable mode:

# Clone the repository
$ git clone https://github.com/DocParseAI/doctomarkdown.git
$ cd doctomarkdown

# Install dependencies
$ pip install -r requirements.txt

# Install the package in editable mode
$ pip install -e .

Note: Requires Python 3.10+


Usage Example

1. Convert PDF to Markdown (No LLM)

from doctomarkdown import DocToMarkdown

app = DocToMarkdown()

result = app.convert_pdf_to_markdown(
    filepath="sample_docs/sample.pdf",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output"
)

for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")
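The loop above prints each page separately. If you want a single Markdown string instead, the pages can be joined in order. This is a minimal sketch that assumes only the `page_number` and `page_content` attributes shown above; the `Page` dataclass is a stand-in for illustration and `join_pages` is a hypothetical helper, not part of doctomarkdown.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Page:
    """Stand-in for the objects in result.pages (hypothetical, for illustration)."""
    page_number: int
    page_content: str


def join_pages(pages: Iterable[Page]) -> str:
    """Join page contents in page order, separated by a Markdown horizontal rule."""
    ordered = sorted(pages, key=lambda p: p.page_number)
    return "\n\n---\n\n".join(p.page_content.strip() for p in ordered)


if __name__ == "__main__":
    demo = [Page(2, "Second page text"), Page(1, "# Title\n\nFirst page text")]
    print(join_pages(demo))
```

With a real conversion you would pass `result.pages` in place of the demo list, provided the page objects expose the same two attributes.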

2. Convert PDF to Markdown using Groq LLM Client

from groq import Groq
from doctomarkdown import DocToMarkdown
from dotenv import load_dotenv
import os
load_dotenv()

client_groq = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

app = DocToMarkdown(
    llm_client=client_groq,
    llm_model='meta-llama/llama-4-scout-17b-16e-instruct'
)
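The example above reads the API key with `os.environ.get`, which returns `None` when the variable is missing, so a misconfigured environment may only surface later as an authentication error. A small guard can fail earlier with a clearer message. This is a sketch only: `require_env` is a hypothetical helper and not part of doctomarkdown or the Groq SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


if __name__ == "__main__":
    try:
        api_key = require_env("GROQ_API_KEY")
        print("GROQ_API_KEY found")
    except RuntimeError as exc:
        print(exc)
```

The returned value can then be passed as `api_key` when constructing the client.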

3. Convert PDF to Markdown using Gemini LLM Client

import os

import google.generativeai as genai
from dotenv import load_dotenv

from doctomarkdown import DocToMarkdown

load_dotenv()  # Load GOOGLE_API_KEY from a .env file

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
vision_model = genai.GenerativeModel("gemini-1.5-flash")  # Choose your Gemini Vision model

app = DocToMarkdown(
    llm_client=vision_model
)

4. Convert PDF to Markdown using AzureOpenAI Client

import os

from dotenv import load_dotenv
from openai import AzureOpenAI

from doctomarkdown import DocToMarkdown

load_dotenv()  # Load the Azure OpenAI settings from a .env file

client = AzureOpenAI(
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
)

app = DocToMarkdown(
    llm_client=client,
    llm_model='gpt-4o'
)

5. Convert PDF to Markdown using Ollama Client

from openai import OpenAI

from doctomarkdown import DocToMarkdown

# Ollama exposes an OpenAI-compatible endpoint on localhost
ollama_client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',
)

app = DocToMarkdown(llm_client=ollama_client, 
                    llm_model='gemma3:4b')
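The Ollama example above expects a server listening on `localhost:11434`. Before starting a conversion, it can help to check that something is accepting connections on that port. The sketch below uses only the standard library; `is_port_open` is a hypothetical helper, and a successful TCP connection does not prove that the listener is Ollama or that the model is available.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 11434 is the port used in the base_url of the example above.
    print("Ollama reachable:", is_port_open("localhost", 11434))
```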

Command Line Example

You can also run the example script:

python examples/pdf_example.py

Supported File Types

  • PDF (more coming soon: DOCX, PPTX, CSV)
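Since only PDF is listed as supported for now, a caller may want to check the file extension before attempting a conversion. This is a minimal sketch: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names for illustration, not part of doctomarkdown, and the check looks only at the file name, not the file contents.

```python
from pathlib import Path

# Mirrors the list above; extend it as more formats become available.
SUPPORTED_EXTENSIONS = {".pdf"}


def is_supported(filepath: str) -> bool:
    """Return True if the file extension (case-insensitive) is in the supported set."""
    return Path(filepath).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("sample_docs/sample.pdf", "REPORT.PDF", "notes.docx"):
        print(name, is_supported(name))
```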

File Structure

doctomarkdown/
├── base.py
├── factory.py
├── __init__.py
├── converters/
│   ├── pdf_to_markdown.py
│   ├── docx_to_markdown.py
│   ├── pptx_to_markdown.py
│   ├── csv_to_markdown.py
│   └── __init__.py
├── utils/
│   ├── markdown_helpers.py
│   └── __init__.py
examples/
├── pdf_example.py
├── sample_docs/
│   └── sample.pdf
markdown_output/
├── sample.md
setup.py
requirements.txt
README.md
LICENSE

Contributing

Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.


License

This project is licensed under the MIT License.

