🐍 Doc Loader - Python Package


A comprehensive Python toolkit for extracting and processing documentation from multiple file formats (PDF, TXT, JSON, CSV, DOCX) into a universal JSON structure, optimized for vector stores and RAG (Retrieval-Augmented Generation) systems.

🚀 Features

  • ✅ Universal JSON Output: Consistent format across all document types
  • ✅ Multiple Format Support: PDF, TXT, JSON, CSV, DOCX
  • ✅ Native Performance: Rust backend with PyO3 bindings
  • ✅ Intelligent Text Processing: Smart chunking, cleaning, and metadata extraction
  • ✅ Vector Store Ready: Optimized output for embedding and indexing
  • ✅ Rich Metadata: Comprehensive document and chunk-level metadata
  • ✅ Language Detection: Automatic language detection capabilities

📦 Installation

pip install extracteur-docs-rs
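
A quick smoke test after installing (supported_extensions() is part of the public API, see the API Reference below):

import extracteur_docs_rs as doc_loader
print(doc_loader.supported_extensions())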

🔧 Usage

Quick Start

import extracteur_docs_rs as doc_loader

# Process any supported file format
result = doc_loader.process_file("document.pdf", chunk_size=500)

print(f"Chunks: {result.chunk_count()}")
print(f"Words: {result.total_word_count()}")
print(f"Supported formats: {doc_loader.supported_extensions()}")

Advanced Usage

import extracteur_docs_rs as doc_loader

# Create a processor with custom parameters
processor = doc_loader.PyUniversalProcessor()
params = doc_loader.PyProcessingParams(
    chunk_size=400,
    overlap=60,
    clean_text=True,
    extract_metadata=True
)

# Process a file
result = processor.process_file("document.txt", params)

# Process text content directly
text_result = processor.process_text_content("Your text here...", params)

# Export to JSON
json_output = result.to_json()

📋 Output Format

All processors generate a standardized JSON structure:

{
  "document_metadata": {
    "filename": "document.pdf",
    "document_type": "PDF",
    "file_size": 1024000,
    "title": "Document Title",
    "author": "Author Name"
  },
  "chunks": [
    {
      "id": "pdf_chunk_0",
      "content": "Extracted text content...",
      "chunk_index": 0,
      "metadata": {
        "size": 1000,
        "language": "en",
        "confidence": 0.95
      }
    }
  ],
  "processing_info": {
    "processor": "PdfProcessor",
    "processing_time_ms": 150,
    "total_chunks": 5,
    "total_content_size": 5000
  }
}
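
Since to_json() returns this structure as a string (see Result Methods below), consumers typically parse it with the standard json module before accessing fields:

import json
import extracteur_docs_rs as doc_loader

result = doc_loader.process_file("document.pdf")
data = json.loads(result.to_json())

# Top-level keys mirror the structure above
print(data["document_metadata"]["filename"])
print(data["processing_info"]["total_chunks"])
for chunk in data["chunks"]:
    print(chunk["id"], chunk["metadata"].get("language"))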

📊 Supported Formats

  • PDF: Text extraction with metadata, page tracking
  • TXT: Encoding detection, language detection, smart chunking
  • JSON: Hierarchical analysis, key extraction, schema inference
  • CSV: Header detection, column analysis, data completeness
  • DOCX: Document structure, style preservation, metadata

🔧 API Reference

Main Functions

# Process any file format
result = doc_loader.process_file(file_path, **options)

# Get supported file extensions
extensions = doc_loader.supported_extensions()

PyUniversalProcessor Class

processor = doc_loader.PyUniversalProcessor()

# Process file with custom parameters
result = processor.process_file(file_path, params)

# Process text content directly
result = processor.process_text_content(text, params)

PyProcessingParams Class

params = doc_loader.PyProcessingParams(
    chunk_size=1000,          # Maximum chunk size
    overlap=100,              # Overlap between chunks
    clean_text=True,          # Enable text cleaning
    extract_metadata=True,    # Extract rich metadata
    detect_language=True      # Enable language detection
)
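
To illustrate how chunk_size and overlap interact: consecutive chunks share overlap characters, so each chunk starts chunk_size - overlap characters after the previous one. The sliding-window sketch below is purely illustrative and is not the library's actual chunking algorithm:

# Hypothetical sliding-window chunking, for illustration only
def sliding_window(text, chunk_size=1000, overlap=100):
    step = chunk_size - overlap  # each chunk advances by 900 characters here
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

print([len(c) for c in sliding_window("x" * 2500)])  # [1000, 1000, 700]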

Result Methods

# Get chunk count
count = result.chunk_count()

# Get total word count
words = result.total_word_count()

# Export to JSON string
json_str = result.to_json()

# Get processing info
info = result.get_processing_info()

🔗 Integration Examples

RAG/Embedding Pipeline

import json
import extracteur_docs_rs as doc_loader
from sentence_transformers import SentenceTransformer

# Process document
result = doc_loader.process_file("document.pdf")

# Extract chunks for embedding (to_json() returns a JSON string)
chunks = [chunk["content"] for chunk in json.loads(result.to_json())["chunks"]]

# Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)
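
From here the embeddings can go into any vector store. A minimal sketch with FAISS (assuming faiss-cpu and numpy are installed; chunks and model come from the block above):

import numpy as np
import faiss

# Build a flat L2 index over the chunk embeddings
vectors = np.asarray(embeddings, dtype="float32")
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# Embed a query and retrieve the 3 closest chunks
query = np.asarray(model.encode(["example question"]), dtype="float32")
_, ids = index.search(query, 3)
print([chunks[i] for i in ids[0]])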

Batch Processing

import extracteur_docs_rs as doc_loader
import os

def process_directory(directory_path):
    results = []
    for filename in os.listdir(directory_path):
        if any(filename.endswith(ext) for ext in doc_loader.supported_extensions()):
            file_path = os.path.join(directory_path, filename)
            result = doc_loader.process_file(file_path)
            results.append(result)
    return results

# Process all documents in a directory
results = process_directory("./documents/")
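
To hand the results to a downstream indexer, one convenient format is JSON Lines, one chunk per line. This sketch assumes the parsed-JSON layout shown under Output Format:

import json

def export_jsonl(results, out_path="chunks.jsonl"):
    with open(out_path, "w", encoding="utf-8") as f:
        for result in results:
            data = json.loads(result.to_json())
            for chunk in data["chunks"]:
                # Carry the source filename along with each chunk
                chunk["source"] = data["document_metadata"]["filename"]
                f.write(json.dumps(chunk, ensure_ascii=False) + "\n")

export_jsonl(results)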

REST API Integration

import json
import os

from flask import Flask, request, jsonify
from werkzeug.utils import secure_filename
import extracteur_docs_rs as doc_loader

app = Flask(__name__)

@app.route('/process', methods=['POST'])
def process_document():
    if 'file' not in request.files:
        return jsonify({'error': 'No file provided'}), 400
    
    file = request.files['file']
    if file.filename == '':
        return jsonify({'error': 'No file selected'}), 400
    
    # Save to a temporary location, sanitizing the user-supplied filename
    temp_path = os.path.join("/tmp", secure_filename(file.filename))
    file.save(temp_path)
    
    try:
        # Process document; to_json() returns a string, so parse before jsonify
        result = doc_loader.process_file(temp_path)
        return jsonify(json.loads(result.to_json()))
    finally:
        os.remove(temp_path)

if __name__ == '__main__':
    app.run()
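
With the server running (Flask defaults to port 5000), the endpoint can be exercised with a multipart upload, for example using requests:

import requests

with open("document.pdf", "rb") as f:
    response = requests.post("http://localhost:5000/process", files={"file": f})
print(response.json()["processing_info"])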

📈 Performance

  • Fast Processing: Rust backend for optimal performance
  • Memory Efficient: Streaming processing for large files
  • Concurrent Support: Thread-safe processors (see the sketch after this list)
  • Scalable: Suitable for production workloads
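
A minimal sketch of concurrent batch processing with a thread pool, leaning on the thread-safety claim above (worker count and error handling are illustrative):

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import extracteur_docs_rs as doc_loader

def process_many(paths, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(doc_loader.process_file, paths))

paths = [str(p) for p in Path("./documents").glob("*.pdf")]
print(f"Processed {len(process_many(paths))} documents")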

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
