MapReduce-inspired framework for extending context windows in large language models

These details have not been verified by PyPI

Project links

Homepage

Project description

LLM_MapReduce

llm_mapreduce is an open-source Python package that lets Large Language Models (LLMs) process long documents through a MapReduce-inspired framework. It lets you apply any LLM to long texts without retraining the model: the document is divided into manageable chunks, the chunks are processed independently, and the results are aggregated into a coherent answer.

Overview

Many LLMs are limited by a fixed context window, making it difficult to process extended texts in a single pass. llm_mapreduce overcomes this limitation using a three-stage framework inspired by MapReduce:

Map Stage: The document is split into chunks, each processed by the model to extract relevant information.
Collapse Stage: The mapped results are grouped and summarized, keeping them within the model’s context window.
Reduce Stage: The results from the collapse stage are aggregated to provide a final answer, resolving inter-chunk dependencies and conflicts.
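
The three stages above can be sketched in a few lines of plain Python. This is a minimal illustration, not the package's implementation: the function names, the character-based chunking, and the `group_size` parameter are assumptions made for the example, and `llm` stands for any callable that maps a prompt string to a response string.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def map_collapse_reduce(
    document: str,
    query: str,
    llm: Callable[[str], str],
    chunk_size: int = 2000,
    group_size: int = 4,
) -> str:
    """Hypothetical end-to-end pipeline: map, collapse until small enough, reduce."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so that collapsing makes progress")

    # Map: query the model about each chunk independently.
    mapped = [llm(f"{query}\n\n{chunk}") for chunk in split_into_chunks(document, chunk_size)]

    # Collapse: while the combined results are still too long, summarize them in groups.
    while len(mapped) > 1 and sum(len(m) for m in mapped) > chunk_size:
        groups = [mapped[i:i + group_size] for i in range(0, len(mapped), group_size)]
        mapped = [llm(f"{query}\n\n" + "\n".join(group)) for group in groups]

    # Reduce: one final call over whatever remains.
    return llm(f"{query}\n\n" + "\n".join(mapped))
```

Each collapse pass shrinks the list of intermediate results by roughly a factor of `group_size`, so the loop ends after a logarithmic number of passes even when the model's outputs are long.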

Features

Model-Agnostic: Works with any LLM, including OpenAI's GPT, Hugging Face models, and others.
Training-Free: No need to fine-tune or retrain the model.
Extends Context Window: Supports long-document processing by dividing, summarizing, and aggregating content.
Structured Information Protocol: Organizes intermediate outputs into a structured format, ensuring coherence across chunks.
In-Context Confidence Calibration: Assigns confidence scores to intermediate results for accurate conflict resolution.

Installation

pip install llm_mapreduce

Usage

Quick Start with OpenAI GPT

To use llm_mapreduce with OpenAI's GPT models, you need an API key. Set up an OpenAI model wrapper and initialize MapReduceLLM to process a large document.

1. Set up OpenAI API Key

export OPENAI_API_KEY='your-openai-api-key'

2. Code Example

import os

from openai import OpenAI
from llm_mapreduce.mapreduce import MapReduceLLM

# Read the key exported in step 1 instead of hard-coding it (requires openai>=1.0)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

class OpenAIModelWrapper:
    """Wrapper to make OpenAI API compatible with MapReduceLLM."""
    def __init__(self, model_name="gpt-4"):
        self.model_name = model_name

    def generate(self, query):
        # Send the prompt as a single user message
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": query}],
            max_tokens=500,
        )
        output_text = response.choices[0].message.content or ""
        return {
            "text": output_text,
            "rationale": output_text,
            "answer": output_text.split("\n")[0]  # Simple answer parsing: first line only
        }

# Initialize the wrapper and MapReduceLLM
model = OpenAIModelWrapper(model_name="gpt-4")
mapreduce_llm = MapReduceLLM(model=model, context_window=4096)

# Define the document and query
document = """Your large document text goes here..."""
query = "Summarize the key points."

# Process the document
result = mapreduce_llm.process_long_text(document, query)
print("Final Result:", result)
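
For trying the pipeline without an API key, a stand-in model can return the same dictionary shape as OpenAIModelWrapper. The sketch below is hypothetical: `EchoModel` is not part of the package, and it assumes that MapReduceLLM only needs a `generate()` method returning the `text`, `rationale`, and `answer` keys, which the README example suggests but does not guarantee.

```python
class EchoModel:
    """Offline stand-in (hypothetical) that mimics the generate() shape used above."""

    def __init__(self, max_chars: int = 200):
        self.max_chars = max_chars
        self.calls = 0  # number of generate() calls, handy when inspecting a run

    def generate(self, query: str) -> dict:
        self.calls += 1
        # "Answer" by echoing a truncated copy of the prompt.
        text = query.strip()[: self.max_chars]
        return {
            "text": text,
            "rationale": text,
            "answer": text.split("\n")[0],  # first line only, like the wrapper above
        }
```

Under that assumption, passing `EchoModel()` in place of the OpenAI wrapper lets you check how many model calls a document of a given size triggers before spending API credits.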

Configuring MapReduceLLM

context_window: Define the maximum chunk size based on the model’s token limit.
collapse_threshold: Controls when chunks should be grouped and summarized in the Collapse stage.
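
To make the relationship between `context_window` and chunk size concrete, here is a rough sketch of budget-based chunking. It is not the package's chunker: `chunk_by_budget` and the `reserved` parameter are hypothetical, and whitespace-separated words stand in for tokens, so counts from a real tokenizer will differ.

```python
from typing import List


def chunk_by_budget(text: str, context_window: int, reserved: int = 512) -> List[str]:
    """Split text into pieces of at most (context_window - reserved) words.

    `reserved` leaves room for the query and the model's reply.
    Words approximate tokens here; this is an assumption for illustration.
    """
    budget = context_window - reserved
    if budget <= 0:
        raise ValueError("reserved must be smaller than context_window")
    words = text.split()
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]
```

With `context_window=4096` and the default reserve, each chunk holds at most 3584 words, so a 10,000-word document yields three chunks.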

Components

`MapReduceLLM` Class

This is the main class that implements the MapReduce process for long text handling.

Methods:

map_stage(): Processes each chunk with the model.
collapse_stage(): Summarizes mapped results when they exceed the context window.
reduce_stage(): Aggregates collapsed results to generate the final output.
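
One way to picture the grouping step inside a collapse stage is greedy packing: consecutive mapped results are placed in the same group until adding another would exceed a size budget. The helper below is a sketch under that assumption; `group_by_budget` is a hypothetical name, lengths are measured in characters, and the package's `collapse_stage()` may group results differently.

```python
from typing import List


def group_by_budget(results: List[str], budget: int) -> List[List[str]]:
    """Greedily pack consecutive results into groups whose total length fits the budget.

    A single result longer than the budget still gets a group of its own.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for result in results:
        # Start a new group when this result would overflow the current one.
        if current and used + len(result) > budget:
            groups.append(current)
            current, used = [], 0
        current.append(result)
        used += len(result)
    if current:
        groups.append(current)
    return groups
```

Each group can then be summarized with one model call, and the process repeated until the summaries fit in a single prompt.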

`StructuredInfoProtocol`

Formats intermediate outputs for each chunk into a structured format with:

Extracted Information: Key data relevant to the query.
Rationale: Explanation of the answer based on the chunk.
Answer: Intermediate answer based on extracted information.
Confidence Score: Reliability of the answer to manage conflicts between chunks.
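
The four fields above map naturally onto a small record type. The following is an illustrative sketch, not the package's `StructuredInfoProtocol` class: the name `ChunkResult`, the field names, and the numeric type of the score are assumptions derived from the list.

```python
from dataclasses import dataclass


@dataclass
class ChunkResult:
    """Hypothetical container for one chunk's structured output."""

    extracted_information: str
    rationale: str
    answer: str
    confidence_score: float

    def to_prompt(self) -> str:
        """Render the fields as labelled lines that a later stage can read back."""
        return (
            f"Extracted Information: {self.extracted_information}\n"
            f"Rationale: {self.rationale}\n"
            f"Answer: {self.answer}\n"
            f"Confidence Score: {self.confidence_score:.1f}"
        )
```

Keeping every chunk's output in the same labelled layout means the collapse and reduce prompts can concatenate results without losing track of which rationale supports which answer.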

`ConfidenceCalibrator`

Assigns a confidence score to intermediate results based on the rationale, helping resolve conflicts in the reduce stage.
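
To show how confidence scores can settle a disagreement between chunks, here is a deliberately simple stand-in: a confidence-weighted vote over candidate answers. This is a swapped-in technique for illustration only. It is not how `ConfidenceCalibrator` works, the function name is hypothetical, and a reduce stage may instead hand the scores to the model and let it decide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def resolve_by_confidence(candidates: List[Tuple[str, float]]) -> str:
    """Return the (normalized) answer with the highest summed confidence."""
    if not candidates:
        raise ValueError("no candidate answers to resolve")
    totals: Dict[str, float] = defaultdict(float)
    for answer, confidence in candidates:
        # Normalize case and whitespace so "Paris" and "paris " count together.
        totals[answer.strip().lower()] += confidence
    return max(totals, key=totals.get)
```

For example, given `[("Paris", 0.9), ("Lyon", 0.6), ("paris", 0.3), ("Lyon", 0.5)]`, the two Paris votes sum to 1.2 against 1.1 for Lyon, so the function returns `"paris"`.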

Example Applications

Legal and Financial Analysis: Analyze long legal documents or financial reports to extract critical insights.
Scientific Research: Summarize and query large research papers or datasets.
Customer Support: Summarize and analyze long histories of customer interactions.

Development and Contribution

Contributions are welcome! To set up a development environment:

Clone the repository:

git clone https://github.com/your-username/llm_mapreduce.git
cd llm_mapreduce

Install dependencies:
```
pip install -r requirements.txt
```
Run tests:
```
pytest
```

How to Contribute

Fork the repository and create a new branch for your feature.
Submit a pull request with a clear description of your changes.

References

Zhou, Z., Li, C., Chen, X., Wang, S., Chao, Y., et al. (2024). LLM×MapReduce: Simplified Long-Sequence Processing Using Large Language Models. arXiv preprint arXiv:2410.09342.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details

This version

0.1.0

Nov 11, 2024

Download files

Source Distribution

llm_mapreduce-0.1.0.tar.gz (10.5 kB)

Uploaded Nov 11, 2024 Source

Built Distribution

llm_mapreduce-0.1.0-py3-none-any.whl (12.7 kB)

Uploaded Nov 11, 2024 Python 3

File details

Details for the file llm_mapreduce-0.1.0.tar.gz.

File metadata

Download URL: llm_mapreduce-0.1.0.tar.gz
Upload date: Nov 11, 2024
Size: 10.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for llm_mapreduce-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`14d999d4cc71a374a35923847a4b446f0954e79fdbf9be481a033ae2e78b4493`
MD5	`fd3da7764e584134a65fde914533807d`
BLAKE2b-256	`dbccd9ab4c564864a49759a30f6c7a85d41e3fbf38c8121ee4055f63d5da5272`

File details

Details for the file llm_mapreduce-0.1.0-py3-none-any.whl.

File metadata

Download URL: llm_mapreduce-0.1.0-py3-none-any.whl
Upload date: Nov 11, 2024
Size: 12.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for llm_mapreduce-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8f0ccf0d7e66528a366e28b69b84349ee20974903e76ec754e50343bf87cb931`
MD5	`bdec9679f27163741224da2baf1637c9`
BLAKE2b-256	`7674e0f9ddf177d3f1b6110cf7e04e92bfad3bd0ae1d86f7d9200d180773d3ea`