A package for Nuggetizer

These details have not been verified by PyPI

Project links

Homepage

Project description

nuggetizer

A powerful tool for extracting, scoring, and assigning information nuggets from text using LLMs.

📟 Installation

Create Conda Environment

conda create -n nuggetizer python=3.10
conda activate nuggetizer

Pip Installation

pip install nuggetizer

Development Installation

For development or the latest features, install from source:

git clone https://github.com/castorini/nuggetizer.git
cd nuggetizer
pip install -e .

Environment Setup

Create a .env file with your OpenAI credentials. For Azure OpenAI (default for GPT models):

AZURE_OPENAI_API_BASE=your_azure_endpoint
AZURE_OPENAI_API_VERSION=your_api_version
AZURE_OPENAI_API_KEY=your_api_key

Or for OpenAI API:

OPEN_AI_API_KEY=your_openai_api_key

🚀 Quick Start

Here's a simple example of how to use nuggetizer:

from nuggetizer.core.types import Query, Document, Request
from nuggetizer.models.nuggetizer import Nuggetizer
from nuggetizer.models.scorer import NuggetScorer
from nuggetizer.models.assigner import NuggetAssigner

# Create a sample request
query = Query(qid="1", text="What are the main features of Python?")
documents = [
    Document(
        docid="1",
        segment="""Python is a high-level programming language known for its 
        simplicity and readability. It supports multiple programming paradigms 
        including procedural, object-oriented, and functional programming."""
    ),
    Document(
        docid="2",
        segment="""Python was created by Guido van Rossum in 1991."""
    ),
    Document(
        docid="3",
        segment="""Python is widely used in web development, data analysis, 
        artificial intelligence, and scientific computing."""
    ),
]
request = Request(query=query, documents=documents)

# Initialize components (API keys and configuration are loaded automatically)
nuggetizer = Nuggetizer(model="gpt-4o")  # Uses Azure OpenAI by default for GPT models
scorer = NuggetScorer(model="gpt-4o")
assigner = NuggetAssigner(model="gpt-4o")

# Process the request
nuggets, _ = nuggetizer.process(request)
scored_nuggets = scorer.score(nuggets)
assigned_nuggets = assigner.assign(documents[0].segment, scored_nuggets)

# Print results
for nugget in assigned_nuggets:
    print(f"Nugget: {nugget.text}")
    print(f"Importance: {nugget.importance}")
    print(f"Assignment: {nugget.assignment}\n")

🛠️ Components

Nuggetizer: Extracts atomic information nuggets from text
NuggetScorer: Scores nuggets based on their importance
NuggetAssigner: Assigns nuggets to specific text passages

Processing Pool and RAG Answer Files

To process a pool file (pool.jsonl) and a RAG answer file (ragnarok.jsonl), follow these steps:

First, generate nuggets and score them:

# Extract nuggets
python3 scripts/create_nuggets.py --input_file pool.jsonl --output_file nuggets.jsonl --log_level 1

# Score the nuggets
python3 scripts/score_nuggets.py --input_file nuggets.jsonl --output_file scored_nuggets.jsonl

For RAG answers, we assume they take on the format laid out by the wonderful TREC 2024 RAG Track:

{
    "run_id": "ragnarok",
    "topic_id": "2027497", # query_id
    "topic": "how often should you take your toddler to the potty when potty training", # query
    "references": [ # top-k segments returned used from the retrieval step. We have k equals to 20 segments for this example.
        "msmarco_v2.1_doc_51_766815931#2_1606878413", "msmarco_v2.1_doc_51_766815931#1_1606876582", "msmarco_v2.1_doc_51_766815931#5_1606882767", 
        "msmarco_v2.1_doc_51_766815931#6_1606884302", "msmarco_v2.1_doc_51_766815931#3_1606879951", "msmarco_v2.1_doc_51_766815931#4_1606881348", 
        "msmarco_v2.1_doc_37_463237391#10_984448281", "msmarco_v2.1_doc_51_766815931#0_1606874600", "msmarco_v2.1_doc_37_463237391#9_984446615", 
        "msmarco_v2.1_doc_28_472446307#22_1012988885", "msmarco_v2.1_doc_51_766815931#7_1606885873", "msmarco_v2.1_doc_28_472446307#21_1012986800", 
        "msmarco_v2.1_doc_29_562342450#23_1356565296", "msmarco_v2.1_doc_29_562342450#17_1356555947", "msmarco_v2.1_doc_49_418787959#7_861728734", 
        "msmarco_v2.1_doc_49_418787959#6_861726964", "msmarco_v2.1_doc_26_680625866#7_1289507527", "msmarco_v2.1_doc_10_1346272776#19_2165266355", 
        "msmarco_v2.1_doc_56_1491300640#3_3012150696", "msmarco_v2.1_doc_10_672519892#5_1260010758"], 
    "response_length": 192, 
    "answer": [ # RAG answer broken into sentences with each answer sentence grounding information from the index mentioned in references.
            {"text": "The frequency with which you should take your toddler to the potty depends on their readiness for potty training.", "citations": [0, 1, 12, 13, 19]}, 
            {"text": "Some sources suggest that toddlers should be taken to the potty about three times a day: first thing in the morning, after mealtimes, and again before bedtime.", "citations": [0, 4, 6, 8]}, 
            {"text": "It is recommended that you watch for facial expressions or poses that may signal that they need to \"go\".", "citations": [6, 8]}, 
            {"text": "If they are reluctant to use the potty, don't force them.", "citations": [6, 8]}, 
            {"text": "Other sources suggest that toddlers should be taken to the potty every two hours, whether they have to go or not.", "citations": [14, 15]}, 
            {"text": "This includes first thing in the morning, before leaving the house, and before naps and bedtime.", "citations": [14, 15]}, 
            {"text": "Some sources recommend taking toddlers to the potty every 30 minutes to an hour.", "citations": [9, 11, 17]}, 
            {"text": "This is to increase the chances of them peeing in the potty instead of on the floor.", "citations": [9, 11]}, 
            {"text": "It is important to keep in mind that every toddler is different, and their potty training journey will be unique to them.", "citations": [0, 4]}, 
            {"text": "It is recommended that you let your toddler lead the way and be gentle throughout the process, as their self-esteem can be fragile during this time.", "citations": [0, 1]}
        ]
}

To easily generate answers in this format, consider using Ragnarök. Let's now assign the nuggets to the RAG answers:

# Assign nuggets to RAG answers
python3 scripts/assign_nuggets.py \
    --nugget_file scored_nuggets.jsonl \
    --answer_file ragnarok.jsonl \
    --output_file final_assignments.jsonl

# Calculate metrics
python3 scripts/calculate_metrics.py \
    --input_file final_assignments.jsonl \
    --output_file metrics.jsonl

The final output file (final_assignments.jsonl) will contain:

query: The original query
qid: Query ID
answer_text: Full answer text
response_length: Response length
run_id: Run ID (derived from the RAG answer filename)
nuggets: Nuggets with their importance labels and assignments

The final metrics file (metrics.jsonl) will contain:

Per-response metrics:
- strict_vital_score: Score counting only full support for vital nuggets
- strict_all_score: Score counting only full support for all nuggets
- vital_score: Score counting full (1.0) and partial (0.5) support for vital nuggets
- all_score: Score counting full (1.0) and partial (0.5) support for all nuggets
Global mean metrics across all responses

Advanced Configuration

While the default configuration works out of the box, you can customize the behavior:

# Custom window size and mode
nuggetizer = Nuggetizer(
    model="gpt-4o",
    window_size=10,  # Process more documents at once
    stride=10,        # Overlap between windows
    max_nuggets=30   # Maximum number of nuggets to extract
)

# Custom scoring mode
scorer = NuggetScorer(
    model="gpt-4o",
    mode=NuggetScoreMode.VITAL_OKAY,
    window_size=20
)

# Custom assignment mode
assigner = NuggetAssigner(
    model="gpt-4o",
    mode=NuggetAssignMode.SUPPORT_GRADE_3,  # 3-level support grading
    window_size=15
)

# Using non-Azure OpenAI models
nuggetizer = Nuggetizer(model="claude-3")  # Will use regular OpenAI API

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📝 License

This project is licensed under the Apache License - see the LICENSE file for details.

🙏 Acknowledgments

This project is built with Azure OpenAI and other powerful LLM technologies.

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

0.0.5

Feb 4, 2025

0.0.4

Jan 31, 2025

0.0.3.1

Jan 31, 2025

0.0.3

Jan 30, 2025

0.0.2

Jan 29, 2025

This version

0.0.1

Jan 27, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nuggetizer-0.0.1.tar.gz (23.0 kB view details)

Uploaded Jan 27, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

nuggetizer-0.0.1-py3-none-any.whl (22.1 kB view details)

Uploaded Jan 27, 2025 Python 3

File details

Details for the file nuggetizer-0.0.1.tar.gz.

File metadata

Download URL: nuggetizer-0.0.1.tar.gz
Upload date: Jan 27, 2025
Size: 23.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.10.6

File hashes

Hashes for nuggetizer-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`b98c06a7aa5d1118acec12ade1b4076c3321f30a81e1b76cef4c92e054d92883`
MD5	`8cd2a4711d5e00db48bce6f8354ce029`
BLAKE2b-256	`929f2add61969761094b7914009710dce10e63d35b7b868d3a8d6dea7a9851da`

See more details on using hashes here.

File details

Details for the file nuggetizer-0.0.1-py3-none-any.whl.

File metadata

Download URL: nuggetizer-0.0.1-py3-none-any.whl
Upload date: Jan 27, 2025
Size: 22.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.10.6

File hashes

Hashes for nuggetizer-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5b9f000288ca5c09a2d2fddb0c1ec6e0f95284bd9df4acbcedc1858c6d8090b3`
MD5	`9ff31dccec9b4735dae2148bcbd53dbf`
BLAKE2b-256	`2a86869c844f3714927741799a112d1542c29b3659d3db27fc0798a57547b55b`

See more details on using hashes here.

nuggetizer 0.0.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

nuggetizer

📟 Installation

Create Conda Environment

Pip Installation

Development Installation

Environment Setup

🚀 Quick Start

🛠️ Components

Processing Pool and RAG Answer Files

Advanced Configuration

🤝 Contributing

📝 License

🙏 Acknowledgments

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes