contaix
Tools to make contexts for AI
To install: pip install contaix
Markdown Conversion
This module provides tools for converting various file formats to Markdown. It supports common formats such as PDF, Word, Excel, PowerPoint, HTML, and Jupyter notebooks.
Key Features
- Convert files to Markdown from bytes with format auto-detection
- Batch convert multiple files
- Customize output filenames and processing
- Extensible converter system
Basic Usage
Converting a Single File
The primary function bytes_to_markdown converts a file's bytes to Markdown text:
from pathlib import Path
from contaix import bytes_to_markdown
# Convert with explicit format
pdf_bytes = Path('document.pdf').read_bytes()
markdown_text = bytes_to_markdown(pdf_bytes, "pdf")
# Or let the function detect the format from the filename
file_bytes = Path('document.docx').read_bytes()
markdown_text = bytes_to_markdown(file_bytes, input_format=None, key="document.docx")
# Or analyze the content to detect the format (when no other information is available)
unknown_bytes = Path('unknown_file').read_bytes()  # placeholder path for a file of unknown type
markdown_text = bytes_to_markdown(
    unknown_bytes,
    input_format=None,
    key=None,
    try_bytes_detection=True
)
Converting Multiple Files
Use bytes_store_to_markdown_store to process multiple files at once:
from contaix import bytes_store_to_markdown_store
from dol import Files
# Convert all files in a directory
src_files = Files('/path/to/documents')
target_store = {}
bytes_store_to_markdown_store(src_files, target_store)
# Now target_store contains {"file1.docx.md": "converted markdown...", ...}
Advanced Usage
Selective Conversion
If you only want to convert specific file types:
# Filter to only include certain file types
filtered_files = {k: v for k, v in src_files.items()
                  if k.endswith(('.docx', '.pdf'))}
bytes_store_to_markdown_store(filtered_files, target_store)
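The endswith check above is case-sensitive and gets unwieldy as the list of extensions grows. A minimal sketch of a reusable filter, assuming the source store behaves like a mapping of filename to bytes; filter_by_extension is a hypothetical helper, not part of contaix:

```python
import os

def filter_by_extension(store, extensions):
    """Return a dict with only the items whose key ends in one of `extensions`.

    `extensions` may be given with or without the leading dot; matching is
    case-insensitive, so "NOTES.DOCX" is kept when "docx" is requested.
    """
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    return {
        key: value
        for key, value in store.items()
        if os.path.splitext(key)[1].lower().lstrip(".") in wanted
    }

# Stand-in for a real store: filename -> bytes
example_store = {
    "report.pdf": b"%PDF-1.7",
    "NOTES.DOCX": b"PK\x03\x04",
    "image.png": b"\x89PNG",
}
filtered = filter_by_extension(example_store, {"pdf", "docx"})
print(sorted(filtered))  # ['NOTES.DOCX', 'report.pdf']
```

The resulting dict can be passed as the source store in the same way as filtered_files above.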
Custom Output Naming
You can control how output filenames are generated:
import os
def custom_key_transform(key):
    # Remove the extension and add "-markdown.md"
    base_name = os.path.splitext(key)[0]
    return f"{base_name}-markdown.md"
bytes_store_to_markdown_store(
    src_files,
    target_store,
    old_to_new_key=custom_key_transform
)
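One thing to watch with a transform that drops the extension: report.docx and report.pdf both map to report-markdown.md, so in a dict-like target one result would overwrite the other. A sketch of a variant that folds the extension into the output name instead (the function name is illustrative, not part of contaix):

```python
import os

def key_with_extension_suffix(key):
    """Map 'report.docx' -> 'report-docx.md', keeping the source format visible."""
    base_name, ext = os.path.splitext(key)
    suffix = ext.lstrip(".").lower()
    # Files without an extension simply get ".md" appended.
    return f"{base_name}-{suffix}.md" if suffix else f"{base_name}.md"

keys = ["report.docx", "report.pdf", "README"]
print([key_with_extension_suffix(k) for k in keys])
# ['report-docx.md', 'report-pdf.md', 'README.md']
```

It would be passed the same way as above, via old_to_new_key=key_with_extension_suffix.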
Content Aggregation
After conversion, you might want to combine all the markdown into a single document:
def aggregate_content(store):
    """Combine all markdown content into a single document with headers."""
    result = "# Combined Markdown Document\n\n"
    for filename, content in store.items():
        result += f"## {filename}\n\n{content}\n\n---\n\n"
    return result
combined_markdown = bytes_store_to_markdown_store(
    src_files,
    {},
    target_store_egress=aggregate_content
)
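The section order in the combined document follows the iteration order of the store, which for some stores may not be alphabetical. A sketch of a variant that sorts by filename and avoids a trailing separator; aggregate_sorted is a hypothetical name, and the example runs on a plain dict standing in for a converted store:

```python
def aggregate_sorted(store):
    """Combine markdown values into one document, ordered by filename."""
    sections = [f"## {name}\n\n{store[name]}" for name in sorted(store)]
    # Separators go only between sections, not after the last one.
    return "# Combined Markdown Document\n\n" + "\n\n---\n\n".join(sections)

example = {"b.pdf.md": "Second file.", "a.docx.md": "First file."}
combined = aggregate_sorted(example)
print(combined)
```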
Custom Converters
You can extend the system with your own converters:
def custom_txt_converter(b):
    """A custom converter for text files."""
    text = b.decode('utf-8', errors='ignore')
    lines = text.split('\n')
    result = "# Custom Converted Text File\n\n"
    for line in lines:
        if line.strip():
            result += f"> {line}\n"
    return result
custom_converters = {"txt": custom_txt_converter}
bytes_store_to_markdown_store(
    src_files,
    target_store,
    converters=custom_converters
)
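As a further illustration, a converter is just a function from bytes to a Markdown string, so a new format can be handled with the standard library alone. The sketch below turns CSV bytes into a Markdown table. CSV is used purely as an example here; csv_to_markdown_table is a hypothetical name, and registering it under the key "csv" assumes converters are keyed by extension as in the txt example above:

```python
import csv
import io

def csv_to_markdown_table(b):
    """Convert CSV bytes to a Markdown table (first row is used as the header)."""
    text = b.decode("utf-8", errors="replace")
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def fmt(row):
        # Pad or truncate so every row has as many cells as the header,
        # and escape pipes, which would otherwise split a cell.
        cells = (list(row) + [""] * width)[:width]
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "|" + "---|" * width]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines) + "\n"

print(csv_to_markdown_table(b"name,qty\napple,3\npear,5\n"))
```

Under that assumption it would be registered as {"csv": csv_to_markdown_table} and passed through the converters argument.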
Format Detection Logic
The bytes_to_markdown function employs a prioritized strategy to find the right converter:
- Explicit Format: If you provide input_format, it uses that directly
- Filename-Based: If input_format is None but key (filename) is provided, it extracts the format from the extension
- Content-Based: If try_bytes_detection is True, it analyzes the bytes to determine the format
- Fallback: If no converter is found through the above methods, it uses the fallback converter
This flexible approach means you can control how formats are detected based on your specific needs.
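The priority order can be sketched in a few lines. This is not contaix's implementation, only an illustration of the explicit, filename, content, fallback sequence described above; resolve_format, sniff_format and the small signature table are assumptions made for the example:

```python
import os

# A few well-known file signatures ("magic numbers").
_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),        # .docx/.xlsx/.pptx are all ZIP containers
    (b"\xd0\xcf\x11\xe0", "ole"),  # legacy .doc/.xls/.ppt container
]

def sniff_format(data):
    """Guess a format from the leading bytes; return None if nothing matches."""
    for signature, fmt in _SIGNATURES:
        if data.startswith(signature):
            return fmt
    return None

def resolve_format(data, input_format=None, key=None, try_bytes_detection=True):
    """Return (format, how), trying explicit > filename > content > fallback."""
    if input_format:
        return input_format.lower().lstrip("."), "explicit"
    if key:
        ext = os.path.splitext(key)[1].lstrip(".").lower()
        if ext:
            return ext, "filename"
    if try_bytes_detection:
        sniffed = sniff_format(data)
        if sniffed:
            return sniffed, "content"
    return None, "fallback"

print(resolve_format(b"%PDF-1.7", input_format="pdf"))            # ('pdf', 'explicit')
print(resolve_format(b"PK\x03\x04", key="report.docx"))           # ('docx', 'filename')
print(resolve_format(b"%PDF-1.7"))                                # ('pdf', 'content')
print(resolve_format(b"plain text", try_bytes_detection=False))   # (None, 'fallback')
```

Note that a signature check alone cannot tell a .docx from an .xlsx, since both are ZIP containers, which is one reason a filename or explicit format is the more reliable input when it is available.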
Performance Considerations
- If you know the file format in advance, specifying input_format will be faster
- To disable content-based detection (for better performance), set try_bytes_detection=False
- When processing large batches, filtering to only include supported formats can improve efficiency
Supported Formats
The module currently supports the following formats:
- PDF (.pdf)
- Microsoft Word (.docx, .doc)
- Microsoft Excel (.xlsx, .xls)
- Microsoft PowerPoint (.pptx, .ppt)
- HTML (.html)
- Jupyter Notebooks (.ipynb)
- Plain text (.txt, .md)
Additional formats can be supported by adding custom converters.
File details
Details for the file contaix-0.0.11.tar.gz.
File metadata
- Download URL: contaix-0.0.11.tar.gz
- Upload date:
- Size: 15.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 683e1eabcf82cac66b0f9a367ac919c7d1afc4f63ac2bc9be0137c31c6c60139 |
| MD5 | 9128e2dc50da7d7ccc7dfb90adf767ba |
| BLAKE2b-256 | c8ebf09f49c5215c20e3fe16d9dc7a50ebec6ceb5cd4507af451324b439d6dd3 |
File details
Details for the file contaix-0.0.11-py3-none-any.whl.
File metadata
- Download URL: contaix-0.0.11-py3-none-any.whl
- Upload date:
- Size: 15.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 87cb69f7ce667267a001a05690340e116a0377aa3ab7567c29f8478eda84c225 |
| MD5 | e21a836f93522cf839782792182bf89b |
| BLAKE2b-256 | 220e6d09881be4e1bc4133251a2cdc499c1107fd3c9202fd51c188db2a8a3353 |