Convert documents to Markdown

Doctomarkdown

Doctomarkdown is a Python library that converts PDF, DOCX, PPTX, and CSV documents into clean, readable Markdown. It can extract text, images, and tables, and can be extended to support additional document types. For PDFs, an LLM (large language model) client can optionally be supplied for advanced extraction.


Features

  • 📄 Convert PDF, DOCX, PPTX, and CSV to Markdown
  • 🖼️ Extract images from documents (optional)
  • 📊 Extract tables from documents (optional)
  • 🤖 LLM support for advanced extraction (PDF)
  • 🗂️ Extensible: Add support for more document types
  • 🏷️ Custom output directory

Installation

$ pip install doctomarkdown

Note: Requires Python 3.10+
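
The version requirement above can be checked from a script before importing the library. This is a minimal sketch, not part of Doctomarkdown: the helper name `python_is_supported` and the constant `MIN_PYTHON` are hypothetical and only mirror the "3.10+" note.

```python
import sys

# Minimum interpreter version, taken from the installation note (assumed exact).
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the given interpreter version meets the stated minimum."""
    # Compare only (major, minor); patch level does not matter here.
    return tuple(version_info[:2]) >= MIN_PYTHON
```

A script could call `python_is_supported()` at startup and exit with a clear message when it returns False.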


Usage Examples

1. Convert PDF to Markdown (No LLM)

from doctomarkdown import DocToMarkdown

app = DocToMarkdown()

result = app.convert_pdf_to_markdown(
    filepath="sample_docs/Non-text-searchable.pdf",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output"
)

for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")
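
The loop above prints each page separately. If a single Markdown string is more convenient, the pages can be joined. The sketch below assumes only what the example shows, namely that each page has `page_number` and `page_content` attributes. The `Page` dataclass is a hypothetical stand-in so the snippet runs without the library, and `join_pages` is an illustrative helper, not a Doctomarkdown function.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Page:
    """Stand-in for a result page; only the two attributes used in the example."""
    page_number: int
    page_content: str


def join_pages(pages: Iterable[Page]) -> str:
    """Concatenate page contents in page order, separated by a horizontal rule."""
    ordered = sorted(pages, key=lambda p: p.page_number)
    return "\n\n---\n\n".join(p.page_content.strip() for p in ordered)
```

With a real conversion result, `join_pages(result.pages)` should work as long as the page objects expose the same two attributes.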

2. Convert PDF to Markdown using Groq LLM Client

from groq import Groq
from doctomarkdown import DocToMarkdown
from dotenv import load_dotenv
import os
load_dotenv()

client_groq = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

app = DocToMarkdown(
    llm_client=client_groq,
    llm_model='meta-llama/llama-4-scout-17b-16e-instruct'
)

result = app.convert_pdf_to_markdown(
    filepath="sample_docs/Non-text-searchable.pdf",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output"
)

for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")
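
The LLM examples read API keys with `os.environ.get`, which returns `None` when the variable is missing, so a misconfigured environment may only surface later as an authentication error. A small guard can fail earlier with a clearer message. This is a sketch under that assumption; `require_env` is a hypothetical helper, not part of Doctomarkdown or any client library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

For example, `Groq(api_key=require_env("GROQ_API_KEY"))` would raise immediately if the key is absent from the environment or the `.env` file.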

3. Convert PDF to Markdown using Gemini LLM Client

import os

import google.generativeai as genai
from dotenv import load_dotenv

from doctomarkdown import DocToMarkdown

load_dotenv()  # Load GOOGLE_API_KEY from a local .env file

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
vision_model = genai.GenerativeModel("gemini-1.5-flash")  # Choose your Gemini Vision model

app = DocToMarkdown(
    llm_client=vision_model
)

result = app.convert_pdf_to_markdown(
    filepath="sample_docs/Non-text-searchable.pdf",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output"
)

for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

4. Convert PDF to Markdown using Azure OpenAI Client

from doctomarkdown import DocToMarkdown
from openai import AzureOpenAI
from dotenv import load_dotenv
import os
load_dotenv()

client = AzureOpenAI(
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
)

app = DocToMarkdown(
    llm_client=client,
    llm_model='gpt-4o'
)

result = app.convert_pdf_to_markdown(
    filepath="sample_docs/Non-text-searchable.pdf",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output"
)

for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

5. Convert PDF to Markdown using Ollama API Client

from doctomarkdown import DocToMarkdown
from openai import OpenAI

ollama_client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',
)

app = DocToMarkdown(llm_client=ollama_client, llm_model='gemma3:4b')
result = app.convert_pdf_to_markdown(
    filepath="sample_docs/Non-text-searchable.pdf",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output"
)

for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

6. Convert DOCX to Markdown

from doctomarkdown import DocToMarkdown
from dotenv import load_dotenv
load_dotenv()

app = DocToMarkdown()

result = app.convert_docx_to_markdown(
    filepath="sample_docs/Sampledoc-1.docx",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output"
)

for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

7. Convert PPTX to Markdown

from doctomarkdown import DocToMarkdown
from dotenv import load_dotenv
load_dotenv()

app = DocToMarkdown()

result = app.convert_pptx_to_markdown(
    filepath="sample_docs/sample-ppt-1.pptx",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output"
)

for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

8. Convert CSV to Markdown

from doctomarkdown import DocToMarkdown

app = DocToMarkdown()

result = app.convert_csv_to_markdown(
    filepath="sample_docs/sample.csv",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output"
)
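
The examples above use one method per document type. When input files are of mixed types, the method can be chosen from the file extension. The sketch below uses only the four method names shown in this README. `converter_method_for` and `convert_any` are hypothetical helpers, and the assumption that every method accepts `filepath` plus the same keyword arguments comes from the examples rather than from the library's reference documentation.

```python
from pathlib import Path

# Extension -> method name, taken from the usage examples in this README.
CONVERTER_METHODS = {
    ".pdf": "convert_pdf_to_markdown",
    ".docx": "convert_docx_to_markdown",
    ".pptx": "convert_pptx_to_markdown",
    ".csv": "convert_csv_to_markdown",
}


def converter_method_for(filepath: str) -> str:
    """Return the converter method name for a file, based on its extension."""
    suffix = Path(filepath).suffix.lower()
    try:
        return CONVERTER_METHODS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


def convert_any(app, filepath: str, **kwargs):
    """Call the matching converter on `app` (assumed to be a DocToMarkdown instance)."""
    method = getattr(app, converter_method_for(filepath))
    return method(filepath=filepath, **kwargs)
```

Usage would look like `convert_any(app, "sample_docs/sample.csv", output_path="markdown_output")`, with `app` created as in the examples above.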

License

This project is licensed under the MIT License.
