Convert PDF, DOCX, PPTX, CSV, images, and URLs (such as Medium and Wikipedia pages) to text or Markdown. Extracts text, images, and tables, with optional LLM-based extraction.

Maintainers

Sayantan99

🚀 Doctomarkdown

Convert PDFs, DOCX, PPTX, CSV, images, and URLs to clean, readable Markdown in seconds!
Now with LLM-powered extraction, image & table support, and blazing-fast performance.

✨ What's New

v0.2.0 (2025-06):
- 🖼️ Improved image extraction for PDFs and DOCX
- 🤖 Enhanced LLM support: Gemini, Groq, Ollama
- 🏷️ Custom output directory and file type
- 🐍 Python 3.10+ compatibility
- ⚡ Performance and stability improvements

Doctomarkdown

Doctomarkdown is a Python library that converts documents (PDF, DOCX, PPTX, CSV, images, and URLs) into clean, readable Markdown. It can extract text, images, and tables, and it is designed to be extended with more document types. Advanced extraction is available through LLM (Large Language Model) clients.

Features

📄 Convert PDF, DOCX, PPTX, CSV, URL, and Images to Markdown
🖼️ Extract images from documents (optional)
📊 Extract tables from documents (optional)
🤖 LLM support: Azure OpenAI, OpenAI, Groq, Gemini, Ollama
🗂️ Extensible: Add support for more document types
🏷️ Custom output directory and output type (Markdown or plain text)

Supported File Types

File Type	Function Name	Example File Extension
PDF	`convert_pdf_to_markdown`	`.pdf`
DOCX	`convert_docx_to_markdown`	`.docx`
PPTX	`convert_pptx_to_markdown`	`.pptx`
CSV	`convert_csv_to_markdown`	`.csv`
Image	`convert_image_to_markdown`	`.png`, `.jpg`, `.jpeg`
URL	`convert_url_to_markdown`	(web link)
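
The table above can be turned into a small lookup so that calling code picks the right function from a path or link. This is a minimal sketch, not part of doctomarkdown: `CONVERTER_BY_EXTENSION` and `converter_name_for` are hypothetical names, and only the function names and extensions come from the table.

```python
from pathlib import Path

# Extension -> DocToMarkdown method name, copied from the table above.
CONVERTER_BY_EXTENSION = {
    ".pdf": "convert_pdf_to_markdown",
    ".docx": "convert_docx_to_markdown",
    ".pptx": "convert_pptx_to_markdown",
    ".csv": "convert_csv_to_markdown",
    ".png": "convert_image_to_markdown",
    ".jpg": "convert_image_to_markdown",
    ".jpeg": "convert_image_to_markdown",
}


def converter_name_for(source: str) -> str:
    """Return the name of the conversion method suited to `source`.

    Hypothetical helper for illustration; it is not provided by doctomarkdown.
    """
    # Web links are handled first, whatever their path ends with.
    if source.lower().startswith(("http://", "https://")):
        return "convert_url_to_markdown"
    suffix = Path(source).suffix.lower()
    try:
        return CONVERTER_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or source!r}") from None
```

With an initialized `DocToMarkdown` instance `app`, `getattr(app, converter_name_for(path))` would return the matching method. Note that the examples in this README pass `filepath=` to the file converters and `urlpath=` to the URL converter.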

Supported LLM Clients

LLM Client	How to Initialize
OpenAI	See below
Azure OpenAI	See below
Groq	See below
Gemini	See below
Ollama	See below

OpenAI

from openai import OpenAI
client = OpenAI(api_key="your-api-key")

Azure OpenAI

from openai import AzureOpenAI
client = AzureOpenAI(
    api_key="your-api-key",
    azure_endpoint="https://your-resource-name.openai.azure.com/",
    api_version="2023-05-15"
)

Groq

from groq import Groq
client = Groq(api_key="your-api-key")

Gemini

import google.generativeai as genai
genai.configure(api_key="your-api-key")
client = genai.GenerativeModel("gemini-pro")

Ollama

from openai import OpenAI
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

Installation

pip install doctomarkdown

Note: Requires Python 3.10+

Basic Usage

Below is the main usage pattern with the specific function to call for each file type:

from doctomarkdown import DocToMarkdown
# For LLM-based extraction, import and initialize your LLM client first (see Supported LLM Clients above)

app = DocToMarkdown(
    llm_client=client,        # Optional: pass your LLM client here
    llm_model='your-model'    # Optional: pass your LLM model name
)

# Choose the appropriate function for your file type:
# PDF:   app.convert_pdf_to_markdown()
# DOCX:  app.convert_docx_to_markdown()
# PPTX:  app.convert_pptx_to_markdown()
# CSV:   app.convert_csv_to_markdown()
# Image: app.convert_image_to_markdown()
# URL:   app.convert_url_to_markdown()

result = app.convert_pdf_to_markdown(  # Change function based on file type
    filepath="path/to/your/file.pdf",  # Change extension based on file type
    extract_images=True,      # Optional
    extract_tables=True,      # Optional
    output_path="markdown_output",  # Optional
    output_type="markdown"   # or 'text' for .txt output
)

for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")
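
If you want a single Markdown string rather than per-page output, the pages can be joined in page order. The sketch below assumes only what the loop above shows, namely that each page exposes `page_number` and `page_content`. The `Page` dataclass is a stand-in so the snippet runs on its own, and `join_pages` is a hypothetical helper, not a doctomarkdown function.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Page:
    """Stand-in for the objects in `result.pages` (assumed shape)."""
    page_number: int
    page_content: str


def join_pages(pages: Iterable) -> str:
    """Concatenate page contents in page order, separated by blank lines."""
    ordered = sorted(pages, key=lambda page: page.page_number)
    return "\n\n".join(page.page_content.strip() for page in ordered)


# Demo with stand-in pages given out of order.
pages = [Page(2, "Second page.\n"), Page(1, "# Title\n\nFirst page.")]
print(join_pages(pages))
```

In real use you would pass `result.pages` instead of the stand-in list.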

Examples: Using OpenAI Client for All File Types

1. PDF to Markdown

from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')

result = app.convert_pdf_to_markdown(
    filepath="sample_docs/sample-1.pdf",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")
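
The examples in this section read the API key with `os.environ.get`, which returns `None` when the variable is unset, so a missing key is reported by the client library rather than by your own code. A small guard can fail earlier with a clearer message. `require_env` is a hypothetical helper shown as a sketch; it is not part of doctomarkdown or the OpenAI client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It would be used as `client = OpenAI(api_key=require_env("OPENAI_API_KEY"))`, after `load_dotenv()` has run.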

2. DOCX to Markdown

from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')

result = app.convert_docx_to_markdown(
    filepath="sample_docs/sample_document.docx",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

3. PPTX to Markdown

from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')

result = app.convert_pptx_to_markdown(
    filepath="sample_docs/sample_ppt_2.pptx",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

4. CSV to Markdown

from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')

result = app.convert_csv_to_markdown(
    filepath="sample_docs/sample.csv",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

5. Image to Markdown

from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')

result = app.convert_image_to_markdown(
    filepath="sample_docs/sample_image.png",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="text"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")

6. URL to Markdown

from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')

result = app.convert_url_to_markdown(
    urlpath="https://medium.com/the-ai-forum/build-a-local-reliable-rag-agent-using-crewai-and-groq-013e5d557bcd",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number}")
    print(f"Content Preview: {page.page_content[:500]}...")
    print(f"Total Length: {len(page.page_content)} characters")
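
Before passing a link as `urlpath`, it can help to check that it is an absolute `http` or `https` URL. The sketch below uses only the standard library. `is_http_url` is a hypothetical helper, and whether doctomarkdown performs a similar check internally is not stated in this README.

```python
from urllib.parse import urlparse


def is_http_url(candidate: str) -> bool:
    """Return True if `candidate` is an absolute http(s) URL with a host."""
    parsed = urlparse(candidate.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A caller could skip or report any input for which `is_http_url` returns `False` instead of starting a conversion.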

License

This project is licensed under the MIT License.

Release history

- 0.2.0 (this version): Jun 6, 2025
- 0.1.9: Jun 5, 2025
- 0.1.8: Jun 4, 2025
- 0.1.7: May 30, 2025
- 0.1.5: May 30, 2025
- 0.1.3: May 29, 2025
- 0.1.2: May 28, 2025
- 0.1.1: May 27, 2025
- 0.1.0: May 27, 2025

Download files

Source Distribution

doctomarkdown-0.2.0.tar.gz (418.3 kB), uploaded Jun 6, 2025

Built Distribution

doctomarkdown-0.2.0-py3-none-any.whl (22.7 kB), uploaded Jun 6, 2025, Python 3

File details

Details for the file doctomarkdown-0.2.0.tar.gz.

File metadata

Download URL: doctomarkdown-0.2.0.tar.gz
Upload date: Jun 6, 2025
Size: 418.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for doctomarkdown-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`05673999d77750940570828512cac6acf704c3c89120038d23038b61b11b8968`
MD5	`f682d169ede2e8ec175465ea157266d0`
BLAKE2b-256	`e308d67e052c248d107d1f40651905a5de3adaa7cb723cea1d72a55cf7f0e01a`
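
A downloaded archive can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a minimal sketch: `sha256_of` and `matches_published_hash` are hypothetical helper names, and the path you pass is wherever you saved the download.

```python
import hashlib

# SHA256 digest published above for doctomarkdown-0.2.0.tar.gz.
EXPECTED_SHA256 = "05673999d77750940570828512cac6acf704c3c89120038d23038b61b11b8968"


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 digest equals `expected`."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_hash("doctomarkdown-0.2.0.tar.gz")` should return `True` for an intact copy of the source distribution. For the wheel, pass its own SHA256 digest as `expected`.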

Provenance

The following attestation bundles were made for doctomarkdown-0.2.0.tar.gz:

Publisher: publish-to-pypi.yml on DocParseAI/doctomarkdown

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: doctomarkdown-0.2.0.tar.gz
- Subject digest: 05673999d77750940570828512cac6acf704c3c89120038d23038b61b11b8968
- Sigstore transparency entry: 230692867
- Sigstore integration time: Jun 6, 2025
Source repository:
- Permalink: DocParseAI/doctomarkdown@90dec0c4de8b2d8377d3cd0bd33de601662ce1e1
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/DocParseAI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@90dec0c4de8b2d8377d3cd0bd33de601662ce1e1
- Trigger Event: release

File details

Details for the file doctomarkdown-0.2.0-py3-none-any.whl.

File metadata

Download URL: doctomarkdown-0.2.0-py3-none-any.whl
Upload date: Jun 6, 2025
Size: 22.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for doctomarkdown-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d0e12b1b55a1b0e366133e03012c9426016fb1194622fed045a1462444b741b4`
MD5	`cc0a06f1f390222aa4de45b6165749a1`
BLAKE2b-256	`31b6be7950233278d37dfcc1ce70f76c2cb6f7333196f4ba8587c291dbb05adb`

Provenance

The following attestation bundles were made for doctomarkdown-0.2.0-py3-none-any.whl:

Publisher: publish-to-pypi.yml on DocParseAI/doctomarkdown

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: doctomarkdown-0.2.0-py3-none-any.whl
- Subject digest: d0e12b1b55a1b0e366133e03012c9426016fb1194622fed045a1462444b741b4
- Sigstore transparency entry: 230692872
- Sigstore integration time: Jun 6, 2025
Source repository:
- Permalink: DocParseAI/doctomarkdown@90dec0c4de8b2d8377d3cd0bd33de601662ce1e1
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/DocParseAI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@90dec0c4de8b2d8377d3cd0bd33de601662ce1e1
- Trigger Event: release
