MapReduce-inspired framework for extending context windows in large language models
Project description
LLM_MapReduce
llm_mapreduce
is an open-source Python package that enables Large Language Models (LLMs) to process long documents efficiently by implementing a MapReduce-inspired framework. This package lets you extend the capabilities of any LLM to handle long texts without retraining the model. It works by dividing documents into manageable chunks, processing them independently, and then aggregating the results to produce a coherent answer.
Overview
Many LLMs are limited by a fixed context window, making it difficult to process extended texts in a single pass. llm_mapreduce
overcomes this limitation using a three-stage framework inspired by MapReduce:
- Map Stage: The document is split into chunks, each processed by the model to extract relevant information.
- Collapse Stage: The mapped results are grouped and summarized, keeping them within the model’s context window.
- Reduce Stage: The results from the collapse stage are aggregated to provide a final answer, resolving inter-chunk dependencies and conflicts.
Features
- Model-Agnostic: Works with any LLM, including OpenAI's GPT, Hugging Face models, and others.
- Training-Free: No need to fine-tune or retrain the model.
- Extends Context Window: Supports long-document processing by dividing, summarizing, and aggregating content.
- Structured Information Protocol: Organizes intermediate outputs into a structured format, ensuring coherence across chunks.
- In-Context Confidence Calibration: Assigns confidence scores to intermediate results for accurate conflict resolution.
Installation
pip install llm_mapreduce
Usage
Quick Start with OpenAI GPT
To use llm_mapreduce
with OpenAI's GPT models, you need an API key. Set up an OpenAI model wrapper and initialize MapReduceLLM
to process a large document.
1. Set up OpenAI API Key
export OPENAI_API_KEY='your-openai-api-key'
2. Code Example
import openai
from llm_mapreduce.mapreduce import MapReduceLLM
# Initialize OpenAI API
openai.api_key = "your-openai-api-key"
class OpenAIModelWrapper:
"""Wrapper to make OpenAI API compatible with MapReduceLLM."""
def __init__(self, model_name="gpt-4"):
self.model_name = model_name
def generate(self, query):
response = openai.ChatCompletion.create(
model=self.model_name,
messages=[{"role": "user", "content": query}],
max_tokens=500,
)
output_text = response.choices[0].message['content']
return {
"text": output_text,
"rationale": output_text,
"answer": output_text.split("\n")[0] # Simple answer parsing
}
# Initialize the wrapper and MapReduceLLM
model = OpenAIModelWrapper(model_name="gpt-4")
mapreduce_llm = MapReduceLLM(model=model, context_window=4096)
# Define the document and query
document = """Your large document text goes here..."""
query = "Summarize the key points."
# Process the document
result = mapreduce_llm.process_long_text(document, query)
print("Final Result:", result)
Configuring MapReduceLLM
context_window
: Define the maximum chunk size based on the model’s token limit.collapse_threshold
: Controls when chunks should be grouped and summarized in the Collapse stage.
Components
MapReduceLLM
Class
This is the main class that implements the MapReduce process for long text handling.
Methods:
map_stage()
: Processes each chunk with the model.collapse_stage()
: Summarizes mapped results when they exceed the context window.reduce_stage()
: Aggregates collapsed results to generate the final output.
StructuredInfoProtocol
Formats intermediate outputs for each chunk into a structured format with:
- Extracted Information: Key data relevant to the query.
- Rationale: Explanation of the answer based on the chunk.
- Answer: Intermediate answer based on extracted information.
- Confidence Score: Reliability of the answer to manage conflicts between chunks.
ConfidenceCalibrator
Assigns a confidence score to intermediate results based on the rationale, helping resolve conflicts in the reduce stage.
Example Applications
- Legal and Financial Analysis: Analyze long legal documents or financial reports to extract critical insights.
- Scientific Research: Summarize and query large research papers or datasets.
- Customer Support: Summarize and analyze long histories of customer interactions.
Development and Contribution
Contributions are welcome! To set up a development environment:
-
Clone the repository:
git clone https://github.com/your-username/llm_mapreduce.git cd llm_mapreduce
-
Install dependencies:
pip install -r requirements.txt
-
Run tests:
pytest
How to Contribute
- Fork the repository and create a new branch for your feature.
- Submit a pull request with a clear description of your changes.
References
- Zhou, Z., Li, C., Chen, X., Wang, S., Chao, Y., et al. (2024). LLM×MapReduce: Simplified Long-Sequence Processing Using Large Language Models. arXiv preprint arXiv:2410.09342.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file llm_mapreduce-0.1.0.tar.gz
.
File metadata
- Download URL: llm_mapreduce-0.1.0.tar.gz
- Upload date:
- Size: 10.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 14d999d4cc71a374a35923847a4b446f0954e79fdbf9be481a033ae2e78b4493 |
|
MD5 | fd3da7764e584134a65fde914533807d |
|
BLAKE2b-256 | dbccd9ab4c564864a49759a30f6c7a85d41e3fbf38c8121ee4055f63d5da5272 |
File details
Details for the file llm_mapreduce-0.1.0-py3-none-any.whl
.
File metadata
- Download URL: llm_mapreduce-0.1.0-py3-none-any.whl
- Upload date:
- Size: 12.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8f0ccf0d7e66528a366e28b69b84349ee20974903e76ec754e50343bf87cb931 |
|
MD5 | bdec9679f27163741224da2baf1637c9 |
|
BLAKE2b-256 | 7674e0f9ddf177d3f1b6110cf7e04e92bfad3bd0ae1d86f7d9200d180773d3ea |