markdown2json
A Python package for converting markdown content to structured JSON with advanced parsing capabilities and optional LLM-powered analysis.
Features
- Document Structure Analysis
  - Convert markdown documents to structured JSON format
  - Preserve document hierarchy and section relationships
  - Extract metadata and document components
- Table Processing
  - Extract and analyze tables with context awareness
  - Determine table relationships and semantic roles
  - Generate table summaries and statistics
- Advanced Analysis
  - LLM integration with multiple providers (Claude, OpenAI, Ollama)
  - Semantic role detection for document elements
  - Context-aware content processing
- Utility Features
  - Batch processing for multiple documents
  - Customizable output formats
  - Rich metadata extraction
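To illustrate the kind of output this enables, the sketch below shows one plausible way a document's hierarchy could be represented as nested JSON. The node names and fields here are hypothetical, not the package's actual schema:

```python
import json

# Hypothetical structured-JSON representation of a markdown document:
# a section node owns its heading, its content, and any nested subsections,
# which preserves the hierarchy and section relationships described above.
# The real markdown2json schema may differ; this only sketches the idea.
document = {
    "type": "document",
    "metadata": {"title": "Example Report"},
    "sections": [
        {
            "type": "section",
            "heading": {"level": 1, "text": "Results"},
            "content": [
                {"type": "paragraph", "text": "Summary of findings."},
                {
                    "type": "section",
                    "heading": {"level": 2, "text": "Tables"},
                    "content": [{"type": "table", "rows": 3, "columns": 2}],
                },
            ],
        }
    ],
}

print(json.dumps(document, indent=2))
```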
Installation
```shell
# Basic installation
pip install markdown2json
```
Requirements
- Python 3.8+
- Core Dependencies:
  - markdown>=3.3.0
  - beautifulsoup4>=4.9.0
  - markdown-it-py==3.0.0
  - python-dotenv==1.0.1
- Optional Dependencies:
  - anthropic>=0.3.0 (for Claude integration)
  - openai>=1.0.0 (for OpenAI integration)
  - ollama>=0.4.7 (for Ollama integration)
Quick Start
```python
from markdown2json import MarkdownToJSON

# Initialize parser with markdown content
with open(PATH_TO_MARKDOWN_FILE, 'r', encoding='utf-8') as f:
    content = f.read()

parser = MarkdownToJSON(content)

# Extract all content including tables and structure
all_content = parser.extract_all_content()

# Extract only tables with context
tables = parser.extract_tables_by_page()

# Get table statistics
summary = parser.get_table_summary()
```
Advanced Usage
Table Extraction and Analysis
```python
from markdown2json import MarkdownToJSON
from pathlib import Path
import json

# Initialize parser
with open(PATH_TO_MARKDOWN_FILE, 'r', encoding='utf-8') as f:
    content = f.read()

parser = MarkdownToJSON(content)

# Extract tables with context
tables_by_page = parser.extract_tables_by_page()

# Get table summary
summary = parser.get_table_summary()
print(f"Total tables: {summary['total_tables']}")
print(f"Largest table: {summary['largest_table']}")

# Save output
output_dir = Path("output/table_analysis")
output_dir.mkdir(parents=True, exist_ok=True)

# Save table data as JSON
with open(output_dir / "tables_by_page.json", 'w', encoding='utf-8') as f:
    json.dump(tables_by_page, f, indent=2)
```
LLM-Powered Analysis
```python
import asyncio
from markdown2json import MarkdownToJSON
from markdown2json.models.enums import LLMProvider
from markdown2json.utils import llm_processors

with open(PATH_TO_MARKDOWN_FILE, "r", encoding="utf-8") as f:
    content = f.read()

async def process_with_llm(markdown_content: str):
    # Initialize parser
    parser = MarkdownToJSON(markdown_content)

    # Get default prompt for content analysis
    prompt = llm_processors.get_default_prompt({
        "markdown_content": markdown_content
    })

    # Process with an LLM provider
    openai_result = await parser.process_with_llm(
        provider=LLMProvider.OPENAI,
        model="gpt-4o",  # or gpt-4o-mini, gpt-4-turbo
        custom_prompt=prompt
    )
    return openai_result

# Run async processing
result = asyncio.run(process_with_llm(content))
print(result)
```
Batch Processing
```python
import asyncio
from pathlib import Path
from markdown2json import MarkdownToJSON
import json

async def process_files(input_dir: Path, output_dir: Path):
    # Validate input directory
    if not input_dir.exists():
        raise FileNotFoundError(f"Input directory {input_dir} does not exist")

    # Check for markdown files
    md_files = list(input_dir.glob("*.md"))
    if not md_files:
        raise ValueError(f"No markdown files found in {input_dir}")

    # Create output directory
    output_dir.mkdir(parents=True, exist_ok=True)
    print(f"Processing files in {input_dir}")

    for md_file in md_files:
        try:
            print(f"Processing file: {md_file}")

            # Read markdown content
            with open(md_file, "r", encoding="utf-8") as f:
                content = f.read()

            # Initialize parser
            parser = MarkdownToJSON(content)

            # Extract content
            tables = parser.extract_tables_by_page()
            summary = parser.get_table_summary()

            # Save outputs
            tables_output = output_dir / f"{md_file.stem}_tables.json"
            summary_output = output_dir / f"{md_file.stem}_summary.json"
            with open(tables_output, "w", encoding="utf-8") as f:
                json.dump(tables, f, indent=2)
            with open(summary_output, "w", encoding="utf-8") as f:
                json.dump(summary, f, indent=2)
        except Exception as e:
            print(f"Error processing {md_file}: {e}")

# Run batch processing
asyncio.run(process_files(Path("inputs"), Path("output")))
```
Markdown to AST (Abstract Syntax Tree)
```python
# Import the MarkdownToJSON parser class
from markdown2json.parser import MarkdownToJSON
import json

def read_markdown_file(filename: str) -> str:
    """Read markdown from a file."""
    with open(filename, "r", encoding="utf-8") as f:
        return f.read()

# Read the markdown content from the input file
markdown_content = read_markdown_file(PATH_TO_MARKDOWN_FILE)

# Initialize parser with markdown content
m2j = MarkdownToJSON(markdown_content)

# Convert markdown to Abstract Syntax Tree (AST) JSON format
# This creates a structured JSON representation of the markdown
json_content = m2j.markdown_to_ast()

# Print the JSON AST structure
print(json.dumps(json_content, indent=2))

# Convert the JSON AST back to markdown format
# This demonstrates roundtrip conversion: markdown -> JSON -> markdown
markdown_content = m2j.ast_to_markdown(json_content)

# Print the regenerated markdown
print(markdown_content)
```
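The roundtrip idea above can be sketched in plain Python. The node shapes and the `render` helper below are illustrative only; the actual AST node names and fields produced by `markdown_to_ast()` may differ:

```python
# Illustrative-only sketch of an AST-style JSON representation and a minimal
# renderer back to markdown, to show why the JSON must follow the AST shape
# for the reverse conversion to work. Not the package's real AST schema.
ast = [
    {"type": "heading", "level": 2, "text": "Usage"},
    {"type": "paragraph", "text": "Install the package, then import it."},
]

def render(nodes):
    """Render a list of simple AST nodes back to markdown text."""
    out = []
    for node in nodes:
        if node["type"] == "heading":
            out.append("#" * node["level"] + " " + node["text"])
        elif node["type"] == "paragraph":
            out.append(node["text"])
    return "\n\n".join(out)

print(render(ast))  # -> "## Usage\n\nInstall the package, then import it."
```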
API Reference
MarkdownToJSON Class
Core Methods
- extract_document_components(): Extract key document components including headers, footers, tables, and metadata
- extract_tables_by_page(): Extract tables organized by page with context
- get_table_summary(): Get statistical analysis of tables
- extract_all_content(): Get comprehensive document analysis
- json_to_markdown(): Convert a JSON structure back to markdown
- markdown_to_ast(): Convert markdown to JSON in an AST (Abstract Syntax Tree) format
- ast_to_markdown(): Convert AST JSON back to markdown (note: the JSON must follow the AST format for the roundtrip to work)
LLM Integration
- process_with_llm(): Process content with LLM providers
  - Supports Claude, OpenAI, and Ollama
  - Customizable prompts
  - Async processing
Project Structure
```
markdown2json/
├── parser.py                    # Main parser implementation
├── models/                      # Data models and enums
├── utils/
│   ├── text_processors.py       # Text processing utilities
│   ├── extractors.py            # Content extraction tools
│   └── llm_processors.py        # LLM integration
└── helpers/
    └── document_analyzer.py     # Document analysis tools
```
Configuration
The package behavior can be customized through environment variables:
- MARKDOWN2JSON_LLM_PROVIDER: Default LLM provider
- MARKDOWN2JSON_MAX_TOKENS: Maximum tokens for LLM requests
- MARKDOWN2JSON_TIMEOUT: Request timeout in seconds
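One way to consume these variables in your own setup code is with `os.environ` lookups and fallbacks. The default values below are assumptions for illustration, not documented package defaults:

```python
import os

# Read the configuration environment variables with fallback values.
# The fallbacks here ("openai", 4096, 30) are illustrative assumptions,
# not the package's documented defaults.
provider = os.environ.get("MARKDOWN2JSON_LLM_PROVIDER", "openai")
max_tokens = int(os.environ.get("MARKDOWN2JSON_MAX_TOKENS", "4096"))
timeout = float(os.environ.get("MARKDOWN2JSON_TIMEOUT", "30"))

print(provider, max_tokens, timeout)
```

Since the package depends on python-dotenv, these can also live in a `.env` file loaded at startup.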
Troubleshooting
Common issues and solutions:
- Table Extraction Issues
  - Ensure tables are properly formatted in markdown
  - Check for missing headers or malformed cells
- LLM Integration
  - Verify API keys are properly set
  - Check network connectivity
  - Ensure proper provider configuration
- Performance Issues
  - Consider batch processing for large files
  - Use appropriate chunk sizes for LLM requests
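For the last point, a simple pre-processing step is to split a large document into heading-delimited chunks before sending each piece to the LLM. The strategy below is a minimal sketch, not the package's own chunking logic:

```python
# Minimal sketch of splitting a markdown document into heading-delimited
# chunks and capping each chunk's size, so individual LLM requests stay
# within a budget. Illustrative only; not markdown2json's implementation.
def chunk_by_heading(markdown_text, max_chars=2000):
    """Split markdown at top-level headings, then cap each chunk's size."""
    sections, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("# ") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    # Further split any section that exceeds the character budget.
    chunks = []
    for section in sections:
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks

doc = "# A\ntext one\n# B\ntext two"
print(chunk_by_heading(doc))  # -> ['# A\ntext one', '# B\ntext two']
```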
Contributing
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -am 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
License
MIT License
Support
For support:
- Open an issue on GitHub
- Check the documentation
- Contact maintainers at support@markdown2json.com
Changelog
Version 0.1.0
- Initial release
- Basic markdown to JSON conversion
- Table extraction and analysis
- LLM integration
- Document component extraction
File details
Details for the file markdown2json-0.1.0.tar.gz.

File metadata
- Download URL: markdown2json-0.1.0.tar.gz
- Size: 27.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | e473f20fbe5e27d29467dbd7154fa540a7e40a30c4f27655cf3aed79106244a7 |
| MD5 | 383f6dce8d3d43cf0feaf87c48ae87d8 |
| BLAKE2b-256 | a28731996cb63452bd61048cf8283e303a4a54b6b1c40141ccecb7f82338d7c1 |
File details
Details for the file markdown2json-0.1.0-py3-none-any.whl.

File metadata
- Download URL: markdown2json-0.1.0-py3-none-any.whl
- Size: 21.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | dd1907fd5279cc63f2da13aeba1038db0b85bf760b357c54d04ea90657728d17 |
| MD5 | 948547e371a0cfeb0c097db4ded18650 |
| BLAKE2b-256 | b33b50bc9ded998c71185b5ef5e308643cc60de6d6f14c52bb791fd33408f824 |