A complete toolkit for extracting and processing documentation from multiple file formats (PDF, TXT, JSON, CSV, DOCX), with Python bindings
Project description
🐍 Doc Loader - Python Package
A comprehensive Python toolkit for extracting and processing documentation from multiple file formats (PDF, TXT, JSON, CSV, DOCX) into a universal JSON structure, optimized for vector stores and RAG (Retrieval-Augmented Generation) systems.
🚀 Features
- ✅ Universal JSON Output: Consistent format across all document types
- ✅ Multiple Format Support: PDF, TXT, JSON, CSV, DOCX
- ✅ Native Performance: Rust backend with PyO3 bindings
- ✅ Intelligent Text Processing: Smart chunking, cleaning, and metadata extraction
- ✅ Vector Store Ready: Optimized output for embedding and indexing
- ✅ Rich Metadata: Comprehensive document and chunk-level metadata
- ✅ Language Detection: Automatic language detection capabilities
📦 Installation
pip install extracteur-docs-rs
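To confirm the install, import the module and print the formats it reports. This quick check uses only supported_extensions(), which is documented in the API reference below:
import extracteur_docs_rs as doc_loader

print(doc_loader.supported_extensions())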
🔧 Usage
Quick Start
import extracteur_docs_rs as doc_loader
# Process any supported file format
result = doc_loader.process_file("document.pdf", chunk_size=500)
print(f"Chunks: {result.chunk_count()}")
print(f"Words: {result.total_word_count()}")
print(f"Supported formats: {doc_loader.supported_extensions()}")
Advanced Usage
import extracteur_docs_rs as doc_loader
# Create a processor with custom parameters
processor = doc_loader.PyUniversalProcessor()
params = doc_loader.PyProcessingParams(
    chunk_size=400,
    overlap=60,
    clean_text=True,
    extract_metadata=True
)
# Process a file
result = processor.process_file("document.txt", params)
# Process text content directly
text_result = processor.process_text_content("Your text here...", params)
# Export to JSON
json_output = result.to_json()
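Since to_json() returns the structure as a JSON string (see Result Methods below), it can be written straight to disk, for example for later indexing; the file name here is illustrative:
# Persist the serialized result for downstream tooling
with open("document_chunks.json", "w", encoding="utf-8") as f:
    f.write(json_output)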
📋 Output Format
All processors generate a standardized JSON structure:
{
  "document_metadata": {
    "filename": "document.pdf",
    "document_type": "PDF",
    "file_size": 1024000,
    "title": "Document Title",
    "author": "Author Name"
  },
  "chunks": [
    {
      "id": "pdf_chunk_0",
      "content": "Extracted text content...",
      "chunk_index": 0,
      "metadata": {
        "size": 1000,
        "language": "en",
        "confidence": 0.95
      }
    }
  ],
  "processing_info": {
    "processor": "PdfProcessor",
    "processing_time_ms": 150,
    "total_chunks": 5,
    "total_content_size": 5000
  }
}
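Because to_json() serializes exactly this structure, the output of any of the examples above can be loaded with the standard json module and traversed as a plain dictionary (result here is a processing result from the Usage section):
import json

data = json.loads(result.to_json())
print(data["document_metadata"]["filename"])
for chunk in data["chunks"]:
    # Each chunk carries its index, content, and per-chunk metadata
    print(chunk["chunk_index"], len(chunk["content"]))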
📊 Supported Formats
- PDF: Text extraction with metadata, page tracking
- TXT: Encoding detection, language detection, smart chunking
- JSON: Hierarchical analysis, key extraction, schema inference
- CSV: Header detection, column analysis, data completeness
- DOCX: Document structure, style preservation, metadata
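Before dispatching a file, you can check its extension against what the library reports. A minimal sketch; it is an assumption on our part whether supported_extensions() returns extension strings with or without a leading dot, so the helper normalizes both sides before comparing:
from pathlib import Path

import extracteur_docs_rs as doc_loader

def is_supported(path):
    # Normalize the file suffix and the reported extensions to the same form
    suffix = Path(path).suffix.lstrip(".").lower()
    return any(ext.lstrip(".").lower() == suffix
               for ext in doc_loader.supported_extensions())

print(is_supported("report.docx"))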
🔧 API Reference
Main Functions
# Process any file format
result = doc_loader.process_file(file_path, **options)
# Get supported file extensions
extensions = doc_loader.supported_extensions()
PyUniversalProcessor Class
processor = doc_loader.PyUniversalProcessor()
# Process file with custom parameters
result = processor.process_file(file_path, params)
# Process text content directly
result = processor.process_text_content(text, params)
PyProcessingParams Class
params = doc_loader.PyProcessingParams(
    chunk_size=1000,         # Maximum chunk size
    overlap=100,             # Overlap between chunks
    clean_text=True,         # Enable text cleaning
    extract_metadata=True,   # Extract rich metadata
    detect_language=True     # Enable language detection
)
Result Methods
# Get chunk count
count = result.chunk_count()
# Get total word count
words = result.total_word_count()
# Export to JSON string
json_str = result.to_json()
# Get processing info
info = result.get_processing_info()
🔗 Integration Examples
RAG/Embedding Pipeline
import json

import extracteur_docs_rs as doc_loader
from sentence_transformers import SentenceTransformer
# Process document
result = doc_loader.process_file("document.pdf")
# Extract chunks for embedding
chunks = [chunk["content"] for chunk in json.loads(result.to_json())["chunks"]]
# Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)
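The resulting chunk/embedding pairs are enough for small-scale retrieval without a dedicated vector store. A minimal cosine-similarity lookup with numpy, reusing model, chunks, and embeddings from above (the query string is illustrative):
import numpy as np

def top_k(query, k=3):
    # Embed the query with the same model, then rank chunks by cosine similarity
    q = model.encode([query])[0]
    sims = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

print(top_k("What does the document say about pricing?"))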
Batch Processing
import extracteur_docs_rs as doc_loader
import os
def process_directory(directory_path):
    results = []
    for filename in os.listdir(directory_path):
        if any(filename.endswith(ext) for ext in doc_loader.supported_extensions()):
            file_path = os.path.join(directory_path, filename)
            result = doc_loader.process_file(file_path)
            results.append(result)
    return results
# Process all documents in a directory
results = process_directory("./documents/")
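The per-file results can then be aggregated with the result methods documented above:
total_chunks = sum(r.chunk_count() for r in results)
total_words = sum(r.total_word_count() for r in results)
print(f"{len(results)} files, {total_chunks} chunks, {total_words} words")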
REST API Integration
import json
import os

from flask import Flask, request, jsonify

import extracteur_docs_rs as doc_loader

app = Flask(__name__)

@app.route('/process', methods=['POST'])
def process_document():
    if 'file' not in request.files:
        return jsonify({'error': 'No file provided'}), 400
    file = request.files['file']
    if file.filename == '':
        return jsonify({'error': 'No file selected'}), 400
    # Save a temporary copy (sanitize file.filename before doing this in production)
    temp_path = f"/tmp/{file.filename}"
    file.save(temp_path)
    try:
        # Process the document and return the universal JSON structure
        result = doc_loader.process_file(temp_path)
        return jsonify(json.loads(result.to_json()))
    finally:
        os.remove(temp_path)

if __name__ == '__main__':
    app.run()
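With the server running (Flask defaults to port 5000), a client can post a file with any HTTP library, for example requests; the document path is illustrative:
import requests

with open("document.pdf", "rb") as f:
    response = requests.post("http://localhost:5000/process", files={"file": f})
print(response.json()["processing_info"])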
📈 Performance
- Fast Processing: Rust backend for optimal performance
- Memory Efficient: Streaming processing for large files
- Concurrent Support: Thread-safe processors (see the sketch after this list)
- Scalable: Suitable for production workloads
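Since the processors are documented as thread-safe, batch jobs can fan out across a thread pool. A minimal sketch using the standard library; the file paths are illustrative, and whether this yields true parallelism depends on the Rust backend releasing the GIL, which this example does not verify:
from concurrent.futures import ThreadPoolExecutor

import extracteur_docs_rs as doc_loader

files = ["a.pdf", "b.docx", "c.txt"]  # illustrative paths

# Process each file in a worker thread and collect the results in order
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(doc_loader.process_file, files))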
🔗 Links
- Documentation: https://willisback.github.io/doc_loader/
- Source Code: https://github.com/WillIsback/doc_loader
- Issue Tracker: https://github.com/WillIsback/doc_loader/issues
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
Project details
Download files
Built Distribution
File details
Details for the file extracteur_docs_rs-0.3.2-cp313-cp313-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: extracteur_docs_rs-0.3.2-cp313-cp313-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 1.5 MB
- Tags: CPython 3.13, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dfb811a37385533b8578f3393215bab0e7570f786a5306fc8bac2e11fac88481 |
| MD5 | 9ed58d26be7f390d388607f1033045f5 |
| BLAKE2b-256 | 2d9adf54abbe7d46389f9c2c68de9664b7635c73f7654ccbfadd8be1f30d2295 |