
MapReduce-inspired framework for extending context windows in large language models


LLM_MapReduce

llm_mapreduce is an open-source Python package that enables Large Language Models (LLMs) to process long documents efficiently by implementing a MapReduce-inspired framework. This package lets you extend the capabilities of any LLM to handle long texts without retraining the model. It works by dividing documents into manageable chunks, processing them independently, and then aggregating the results to produce a coherent answer.

Overview

Many LLMs are limited by a fixed context window, making it difficult to process extended texts in a single pass. llm_mapreduce overcomes this limitation using a three-stage framework inspired by MapReduce:

  1. Map Stage: The document is split into chunks, each processed by the model to extract relevant information.
  2. Collapse Stage: The mapped results are grouped and summarized, keeping them within the model’s context window.
  3. Reduce Stage: The results from the collapse stage are aggregated to provide a final answer, resolving inter-chunk dependencies and conflicts.
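The data flow through the three stages can be sketched with a self-contained toy example (the `model` here is a stand-in function, not the package's API, which is shown under Usage below):

```python
def split_into_chunks(text, chunk_size):
    """Split a document into fixed-size chunks (a real splitter would count tokens)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def map_stage(chunks, model, query):
    """Map: run the model over each chunk independently."""
    return [model(f"{query}\n\n{chunk}") for chunk in chunks]

def collapse_stage(mapped, model, group_size):
    """Collapse: group mapped results and summarize each group so it fits the context window."""
    groups = [mapped[i:i + group_size] for i in range(0, len(mapped), group_size)]
    return [model("Summarize:\n" + "\n".join(g)) for g in groups]

def reduce_stage(collapsed, model, query):
    """Reduce: aggregate the collapsed results into one final answer."""
    return model(f"{query}\n\n" + "\n".join(collapsed))

# Toy "model" that just reports its prompt length, to make the data flow visible
toy_model = lambda prompt: f"summary({len(prompt)})"

chunks = split_into_chunks("x" * 1000, chunk_size=250)
mapped = map_stage(chunks, toy_model, "Q?")
collapsed = collapse_stage(mapped, toy_model, group_size=2)
result = reduce_stage(collapsed, toy_model, "Q?")
```

Each stage shrinks the amount of text the model must see at once: four 250-character chunks become two group summaries, which become one final answer.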

Features

  • Model-Agnostic: Works with any LLM, including OpenAI's GPT, Hugging Face models, and others.
  • Training-Free: No need to fine-tune or retrain the model.
  • Extends Context Window: Supports long-document processing by dividing, summarizing, and aggregating content.
  • Structured Information Protocol: Organizes intermediate outputs into a structured format, ensuring coherence across chunks.
  • In-Context Confidence Calibration: Assigns confidence scores to intermediate results for accurate conflict resolution.

Installation

pip install llm_mapreduce

Usage

Quick Start with OpenAI GPT

To use llm_mapreduce with OpenAI's GPT models, you need an API key. Set up an OpenAI model wrapper and initialize MapReduceLLM to process a large document.

1. Set up OpenAI API Key

export OPENAI_API_KEY='your-openai-api-key'

2. Code Example

from openai import OpenAI

from llm_mapreduce.mapreduce import MapReduceLLM

# The OpenAI client reads OPENAI_API_KEY from the environment by default
client = OpenAI()

class OpenAIModelWrapper:
    """Wrapper to make the OpenAI API compatible with MapReduceLLM."""
    def __init__(self, model_name="gpt-4"):
        self.model_name = model_name

    def generate(self, query):
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": query}],
            max_tokens=500,
        )
        output_text = response.choices[0].message.content
        return {
            "text": output_text,
            "rationale": output_text,
            "answer": output_text.split("\n")[0],  # Simple answer parsing
        }

# Initialize the wrapper and MapReduceLLM
model = OpenAIModelWrapper(model_name="gpt-4")
mapreduce_llm = MapReduceLLM(model=model, context_window=4096)

# Define the document and query
document = """Your large document text goes here..."""
query = "Summarize the key points."

# Process the document
result = mapreduce_llm.process_long_text(document, query)
print("Final Result:", result)

Configuring MapReduceLLM

  • context_window: Define the maximum chunk size based on the model’s token limit.
  • collapse_threshold: Controls when chunks should be grouped and summarized in the Collapse stage.
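As a rough sketch of what these two knobs control (the arithmetic below is illustrative, not the package's internal logic): `context_window` bounds how large each map-stage chunk can be, and collapsing kicks in once the combined mapped results would no longer fit in a single context window.

```python
def plan_chunks(num_tokens, context_window):
    """Number of map-stage chunks needed for a document of num_tokens tokens."""
    return -(-num_tokens // context_window)  # ceiling division

def needs_collapse(mapped_token_counts, context_window):
    """Collapse when the combined mapped results exceed one context window."""
    return sum(mapped_token_counts) > context_window

print(plan_chunks(10_000, 4096))                  # a 10k-token document -> 3 chunks
print(needs_collapse([1500, 1800, 1400], 4096))   # 4700 tokens of results -> collapse
```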

Components

MapReduceLLM Class

This is the main class that implements the MapReduce process for long text handling.

Methods:

  • map_stage(): Processes each chunk with the model.
  • collapse_stage(): Summarizes mapped results when they exceed the context window.
  • reduce_stage(): Aggregates collapsed results to generate the final output.

StructuredInfoProtocol

Organizes each chunk's intermediate output into a structured record with:

  • Extracted Information: Key data relevant to the query.
  • Rationale: Explanation of the answer based on the chunk.
  • Answer: Intermediate answer based on extracted information.
  • Confidence Score: Reliability of the answer to manage conflicts between chunks.
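A structured intermediate record for one chunk might look like the following (field names mirror the list above; the values and exact key spellings are illustrative):

```python
# Illustrative intermediate record produced for one chunk
chunk_record = {
    "extracted_info": "Revenue grew 12% year-over-year in Q3.",
    "rationale": "The chunk's financial table reports Q3 revenue of $1.12B vs $1.00B a year earlier.",
    "answer": "Revenue grew 12% in Q3.",
    "confidence": 0.9,  # used later to resolve conflicts between chunks
}
```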

ConfidenceCalibrator

Assigns a confidence score to intermediate results based on the rationale, helping resolve conflicts in the reduce stage.
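Conflict resolution in the reduce stage can then be as simple as preferring the higher-confidence intermediate answer; a minimal sketch of the idea (not the package's actual algorithm):

```python
def resolve(records):
    """Pick the intermediate answer with the highest confidence score."""
    return max(records, key=lambda r: r["confidence"])["answer"]

# Two chunks disagree; the calibrated confidence decides which answer wins
records = [
    {"answer": "Founded in 1998", "confidence": 0.4},
    {"answer": "Founded in 2001", "confidence": 0.9},
]
print(resolve(records))  # Founded in 2001
```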

Example Applications

  • Legal and Financial Analysis: Analyze long legal documents or financial reports to extract critical insights.
  • Scientific Research: Summarize and query large research papers or datasets.
  • Customer Support: Summarize and analyze long histories of customer interactions.

Development and Contribution

Contributions are welcome! To set up a development environment:

  1. Clone the repository:

    git clone https://github.com/your-username/llm_mapreduce.git
    cd llm_mapreduce
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Run tests:

    pytest
    

How to Contribute

  • Fork the repository and create a new branch for your feature.
  • Submit a pull request with a clear description of your changes.

References

  • Zhou, Z., Li, C., Chen, X., Wang, S., Chao, Y., et al. (2024). LLM×MapReduce: Simplified Long-Sequence Processing Using Large Language Models. arXiv preprint arXiv:2410.09342.

License

This project is licensed under the MIT License - see the LICENSE file for details.
