Convert PDF, DOCX, PPTX, CSV, images, and URLs (such as Medium or Wikipedia pages) to text or Markdown. Extracts text, images, and tables. Supports LLM-based extraction.
🚀 Doctomarkdown
Convert PDFs, DOCX, PPTX, CSV, images, and URLs to clean, readable Markdown in seconds!
Now with LLM-powered extraction, image & table support, and blazing-fast performance.
✨ What's New
- v0.2.0 (2025-06):
  - 🖼️ Improved image extraction for PDFs and DOCX
  - 🤖 Enhanced LLM support: Gemini, Groq, Ollama
  - 🏷️ Custom output directory and file type
  - 🐍 Python 3.10+ compatibility
  - ⚡ Performance and stability improvements
Doctomarkdown
Doctomarkdown is a robust Python library for converting documents—including PDF, DOCX, PPTX, CSV, images, and URLs—into clean, readable Markdown. It supports extracting text, images, and tables, and is easily extensible for more document types. Advanced extraction is available via LLM (Large Language Model) clients.
Features
- 📄 Convert PDF, DOCX, PPTX, CSV, URL, and Images to Markdown
- 🖼️ Extract images from documents (optional)
- 📊 Extract tables from documents (optional)
- 🤖 LLM support: Azure OpenAI, OpenAI, Groq, Gemini, Ollama
- 🗂️ Extensible: Add support for more document types
- 🏷️ Custom output directory
Supported File Types
| File Type | Function Name | Example File Extension |
|---|---|---|
| PDF | convert_pdf_to_markdown | .pdf |
| DOCX | convert_docx_to_markdown | .docx |
| PPTX | convert_pptx_to_markdown | .pptx |
| CSV | convert_csv_to_markdown | .csv |
| Image | convert_image_to_markdown | .png, .jpg, .jpeg |
| URL | convert_url_to_markdown | (web link) |
Supported LLM Clients
| LLM Client | How to Initialize |
|---|---|
| OpenAI | See below |
| Azure OpenAI | See below |
| Groq | See below |
| Gemini | See below |
| Ollama | See below |
OpenAI
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
Azure OpenAI
from openai import AzureOpenAI
client = AzureOpenAI(
    api_key="your-api-key",
    azure_endpoint="https://your-resource-name.openai.azure.com/",
    api_version="2023-05-15"
)
Groq
from groq import Groq
client = Groq(api_key="your-api-key")
Gemini
import google.generativeai as genai
genai.configure(api_key="your-api-key")
client = genai.GenerativeModel("gemini-pro")
Ollama
from openai import OpenAI
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
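The snippets above hard-code the API key for brevity. A minimal sketch of reading it from the environment instead, so that a missing key fails early with a clear message rather than being passed to the client as `None`. The helper name `require_api_key` is hypothetical and not part of doctomarkdown or any client library.

```python
import os


def require_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in env_var, or raise a clear error.

    Hypothetical helper for illustration; it only wraps os.environ.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the client."
        )
    return key
```

The returned string can then be passed as `api_key=` to whichever client constructor you use.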
Installation
pip install doctomarkdown
Note: Requires Python 3.10+
Basic Usage
Below is the main usage pattern with the specific function to call for each file type:
from doctomarkdown import DocToMarkdown
# Import and initialize your LLM client if needed (see the client snippets above)
app = DocToMarkdown(
    llm_client=client,      # Optional: pass your LLM client here
    llm_model='your-model'  # Optional: pass your LLM model name
)
# Choose the appropriate function for your file type:
# PDF: app.convert_pdf_to_markdown()
# DOCX: app.convert_docx_to_markdown()
# PPTX: app.convert_pptx_to_markdown()
# CSV: app.convert_csv_to_markdown()
# Image: app.convert_image_to_markdown()
# URL: app.convert_url_to_markdown()
result = app.convert_pdf_to_markdown( # Change function based on file type
    filepath="path/to/your/file.pdf",  # Change extension based on file type
    extract_images=True,               # Optional
    extract_tables=True,               # Optional
    output_path="markdown_output",     # Optional
    output_type="markdown"             # or 'text' for .txt output
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")
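Since each file type has its own function, a small lookup can pick the right one from the input. This is only a sketch: the mapping mirrors the "Supported File Types" table, while `CONVERTERS` and `converter_name` are hypothetical names that do not come from the library.

```python
from pathlib import Path
from urllib.parse import urlparse

# Extension -> method name, copied from the "Supported File Types" table.
CONVERTERS = {
    ".pdf": "convert_pdf_to_markdown",
    ".docx": "convert_docx_to_markdown",
    ".pptx": "convert_pptx_to_markdown",
    ".csv": "convert_csv_to_markdown",
    ".png": "convert_image_to_markdown",
    ".jpg": "convert_image_to_markdown",
    ".jpeg": "convert_image_to_markdown",
}


def converter_name(source: str) -> str:
    """Return the name of the conversion method for a path or web link."""
    # Treat anything with an http(s) scheme as a URL.
    if urlparse(source).scheme in ("http", "https"):
        return "convert_url_to_markdown"
    suffix = Path(source).suffix.lower()
    try:
        return CONVERTERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported input: {source!r}") from None
```

With an initialized `app`, a file could then be converted with `getattr(app, converter_name(path))(filepath=path)`. Note that the URL example in this README passes `urlpath=` rather than `filepath=`, so URLs need their own call.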
Examples: Using OpenAI Client for All File Types
1. PDF to Markdown
from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')
result = app.convert_pdf_to_markdown(
    filepath="sample_docs/sample-1.pdf",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")
2. DOCX to Markdown
from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')
result = app.convert_docx_to_markdown(
    filepath="sample_docs/sample_document.docx",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")
3. PPTX to Markdown
from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')
result = app.convert_pptx_to_markdown(
    filepath="sample_docs/sample_ppt_2.pptx",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")
4. CSV to Markdown
from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')
result = app.convert_csv_to_markdown(
    filepath="sample_docs/sample.csv",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")
5. Image to Markdown
from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')
result = app.convert_image_to_markdown(
    filepath="sample_docs/sample_image.png",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="text"
)
for page in result.pages:
    print(f"Page Number: {page.page_number} | Page Content: {page.page_content}")
6. URL to Markdown
from openai import OpenAI
from doctomarkdown import DocToMarkdown
import os
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
app = DocToMarkdown(llm_client=client, llm_model='gpt-4o')
result = app.convert_url_to_markdown(
    urlpath="https://medium.com/the-ai-forum/build-a-local-reliable-rag-agent-using-crewai-and-groq-013e5d557bcd",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output",
    output_type="markdown"
)
for page in result.pages:
    print(f"Page Number: {page.page_number}")
    print(f"Content Preview: {page.page_content[:500]}...")
    print(f"Total Length: {len(page.page_content)} characters")
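Every example above iterates over `result.pages` and reads `page_number` and `page_content`. If you want one Markdown string instead of per-page output, the pages can be joined as sketched below. The helper `pages_to_markdown` is hypothetical, and the sketch assumes `page_content` is a string and `page_number` is sortable. The `SimpleNamespace` objects only stand in for real result pages so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def pages_to_markdown(pages) -> str:
    """Join page contents in page order, separated by blank lines."""
    ordered = sorted(pages, key=lambda page: page.page_number)
    return "\n\n".join(page.page_content.strip() for page in ordered)


# Stand-in objects with the two attributes the examples above rely on.
fake_pages = [
    SimpleNamespace(page_number=2, page_content="Second page"),
    SimpleNamespace(page_number=1, page_content="# First page\n"),
]
combined = pages_to_markdown(fake_pages)
print(combined)
```

With a real conversion, the call would be `pages_to_markdown(result.pages)`.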
License
This project is licensed under the MIT License.