Convert PDF, DOCX, PPTX, and CSV documents, images, and URLs (such as Medium or Wikipedia articles) to text or Markdown. Extracts text, images, and tables. Supports LLM-based extraction.

🚀 Doctomarkdown

Convert PDFs, DOCX, PPTX, CSV, images, and URLs to clean, readable Markdown in seconds!
Now with LLM-powered extraction, image & table support, and blazing-fast performance.


✨ What's New

  • v0.2.0 (2025-06):
    • 🖼️ Improved image extraction for PDFs and DOCX
    • 🤖 Enhanced LLM support: Gemini, Groq, Ollama
    • 🏷️ Custom output directory and file type
    • 🐍 Python 3.10+ compatibility
    • ⚡ Performance and stability improvements

Doctomarkdown

Doctomarkdown is a Python library that converts PDF, DOCX, PPTX, and CSV files, images, and URLs into clean, readable Markdown. It can extract text, images, and tables, and it can be extended to handle more document types. Advanced extraction is available through LLM (Large Language Model) clients.


Features

  • 📄 Convert PDF, DOCX, PPTX, CSV, URL, and Images to Markdown
  • 🖼️ Extract images from documents (optional)
  • 📊 Extract tables from documents (optional)
  • 🤖 LLM support: Azure OpenAI, OpenAI, Groq, Gemini, Ollama
  • 🗂️ Extensible: Add support for more document types
  • 🏷️ Custom output directory

Supported File Types

File Type | Function Name | Example File Extension
PDF | convert_pdf_to_markdown | .pdf
DOCX | convert_docx_to_markdown | .docx
PPTX | convert_pptx_to_markdown | .pptx
CSV | convert_csv_to_markdown | .csv
Image | convert_image_to_markdown | .png, .jpg, .jpeg
URL | convert_url_to_markdown | (web link)

Supported LLM Clients

Doctomarkdown works with OpenAI, Azure OpenAI, Groq, Gemini, and Ollama clients. The initialization for each one is shown in the sections that follow.

OpenAI

from openai import OpenAI
client = OpenAI(api_key="your-api-key")

Azure OpenAI

from openai import AzureOpenAI
client = AzureOpenAI(
    api_key="your-api-key",
    azure_endpoint="https://your-resource-name.openai.azure.com/",
    api_version="2023-05-15"
)

Groq

from groq import Groq
client = Groq(api_key="your-api-key")

Gemini

import google.generativeai as genai
genai.configure(api_key="your-api-key")
client = genai.GenerativeModel("gemini-pro")

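Ollama

from openai import OpenAI
# Ollama serves an OpenAI-compatible API locally, so the OpenAI client is reused.
# The client requires an api_key value, but Ollama does not check it.
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')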

Installation

pip install doctomarkdown

Note: Requires Python 3.10+
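
If you want to check the interpreter before installing, a minimal sketch might look like the following. The helper name `python_is_supported` is hypothetical and is not part of doctomarkdown. The only fact taken from this page is the 3.10 minimum.

```python
import sys

# Minimum version taken from the installation note; adjust if the project changes it.
MIN_PYTHON = (3, 10)


def python_is_supported(version=None) -> bool:
    """Return True if `version` (default: the running interpreter) is at least MIN_PYTHON.

    Hypothetical helper for illustration; not provided by doctomarkdown.
    """
    current = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])
    return current >= MIN_PYTHON


if __name__ == "__main__":
    if not python_is_supported():
        raise SystemExit("doctomarkdown requires Python 3.10 or newer")
```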


Basic Usage

Below is the main usage pattern with the specific function to call for each file type:

from doctomarkdown import DocToMarkdown
# If you want LLM-based extraction, import and initialize `client` first
# (see "Supported LLM Clients" above); otherwise omit both llm_* arguments.

app = DocToMarkdown(
    llm_client=client,        # Optional: pass your LLM client here
    llm_model='your-model'    # Optional: pass your LLM model name
)

# Choose the appropriate function for your file type:
# PDF:   app.convert_pdf_to_markdown()
# DOCX:  app.convert_docx_to_markdown()
# PPTX:  app.convert_pptx_to_markdown()
# CSV:   app.convert_csv_to_markdown()
# Image: app.convert_image_to_markdown()
# URL:   app.convert_url_to_markdown()

result = app.convert_pdf_to_markdown(  # Change function based on file type
    filepath="path/to/your/file.pdf",  # Change extension based on file type
    extract_images=True,      # Optional
    extract_tables=True,      # Optional
    output_path="markdown_output",  # Optional
    output_type="markdown"   # or 'text' for .txt output
)

for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")
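
When you process mixed inputs, you may prefer to choose the conversion function from the file extension rather than by hand. The sketch below assumes only the function names and extensions listed under Supported File Types. `CONVERTERS` and `converter_name` are hypothetical names used for illustration and are not part of the library.

```python
from pathlib import Path

# Extension -> DocToMarkdown method name, copied from the Supported File Types table.
CONVERTERS = {
    ".pdf": "convert_pdf_to_markdown",
    ".docx": "convert_docx_to_markdown",
    ".pptx": "convert_pptx_to_markdown",
    ".csv": "convert_csv_to_markdown",
    ".png": "convert_image_to_markdown",
    ".jpg": "convert_image_to_markdown",
    ".jpeg": "convert_image_to_markdown",
}


def converter_name(source: str) -> str:
    """Return the name of the method that should handle `source`.

    Hypothetical helper: web links map to the URL converter, everything
    else is looked up by its lowercased file extension.
    """
    if source.startswith(("http://", "https://")):
        return "convert_url_to_markdown"
    suffix = Path(source).suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"Unsupported file type: {suffix or source!r}")
    return CONVERTERS[suffix]
```

With an initialized `app`, the result can be passed to `getattr(app, converter_name(path))` to obtain the method to call. Note that the examples on this page pass `filepath=` to the file converters and `urlpath=` to the URL converter, so the keyword argument has to be chosen accordingly.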

Examples: Using OpenAI Client for All File Types

1. PDF to Markdown

from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')

result = app.convert_pdf_to_markdown(
    filepath="sample_docs/sample-1.pdf",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

2. DOCX to Markdown

from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')

result = app.convert_docx_to_markdown(
    filepath="sample_docs/sample_document.docx",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

3. PPTX to Markdown

from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')

result = app.convert_pptx_to_markdown(
    filepath="sample_docs/sample_ppt_2.pptx",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

4. CSV to Markdown

from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')

result = app.convert_csv_to_markdown(
    filepath="sample_docs/sample.csv",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

5. Image to Markdown

from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')

result = app.convert_image_to_markdown(
    filepath="sample_docs/sample_image.png",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="text"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

6. URL to Markdown

from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')

result = app.convert_url_to_markdown(
    urlpath="https://medium.com/the-ai-forum/build-a-local-reliable-rag-agent-using-crewai-and-groq-013e5d557bcd",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number}")
    print(f"Content Preview: {page.page_content[:500]}...")
    print(f"Total Length: {len(page.page_content)} characters")
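
The examples above print each page separately. If you would rather have a single string, the pages can be joined in memory. This is a minimal sketch that relies only on the `page_number` and `page_content` attributes used in the examples. The `Page` dataclass is a stand-in for the library's own page objects, and `join_pages` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Stand-in for the library's page objects (only the two attributes used above)."""
    page_number: int
    page_content: str


def join_pages(pages) -> str:
    """Concatenate page contents in page-number order, separated by a horizontal rule."""
    ordered = sorted(pages, key=lambda p: p.page_number)
    return "\n\n---\n\n".join(p.page_content.strip() for p in ordered)


# Demo with stand-in pages; in real use, pass result.pages instead.
demo = [Page(2, "Second page\n"), Page(1, "# Title\n")]
combined = join_pages(demo)
print(combined)
```

Sorting by `page_number` is a defensive choice. If the library already returns pages in order, the sort leaves them unchanged.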

License

This project is licensed under the MIT License.
