
Laurium

A Python package for extracting structured data from text and generating synthetic data using language models.

Organisations collect vast amounts of free text data containing untapped information that could provide decision makers with valuable insights. Laurium addresses this by providing tools for converting unstructured text into structured data using Large Language Models. Through prompt engineering, the package can be adapted to different use cases and data extraction requirements, unlocking the value hidden in text data.

For example, customer feedback stating "The login system crashed and I lost all my work!" contains information about the sentiment of the review, how urgently it needs to be addressed, which department is responsible for handling the complaint, and whether action is required. Laurium provides the tools to extract and structure this information, enabling quantitative analysis and data-driven decision making:

                                            text sentiment  urgency department action_required
The login system crashed and I lost all my work!  negative        5         IT             yes

This scales to datasets that would be impossible to review and label manually.

This package started from work done by the BOLD Families project on estimating the number of children who have a parent in prison.

Install Laurium

You can install Laurium either from PyPI or directly from GitHub. If installing from PyPI, you will also need to install a spaCy model alongside the package.

The decoder and encoder parts of the package are split into optional dependencies for a lighter installation. To install only the decoder parts of the package, for example:

From GitHub

# using uv
uv add "laurium[decoder] @ git+https://github.com/moj-analytical-services/laurium.git"

# using pip
pip install "laurium[decoder] @ git+https://github.com/moj-analytical-services/laurium.git"

From PyPI

# using uv
uv add "laurium[decoder]"
uv add https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl

# using pip
pip install "laurium[decoder]"
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl

To install only the encoder parts of the package, replace laurium[decoder] with laurium[encoder]; to install all optional dependencies, use laurium[all].
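Once installed, you can confirm the package is available and check which release you have (a minimal check using only the Python standard library):

from importlib.metadata import version

# Prints the installed release of Laurium, e.g. "0.4.1"
print(version("laurium"))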

LLM Provider Setup

Laurium works with both local and cloud-based language models:

Local Models with Ollama

For running models locally without API costs:

  1. Install Ollama from ollama.ai
  2. Start a local Ollama server by running ollama serve in your terminal (a reachability check is sketched below)
  3. Pull a model, for example: ollama pull qwen2.5:7b
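Before running a pipeline, you can check that the server is reachable by hitting its HTTP endpoint (a minimal sketch assuming Ollama's default address, http://localhost:11434):

import requests

# Ollama listens on port 11434 by default; the root endpoint
# replies with a short status string.
response = requests.get("http://localhost:11434")
print(response.text)  # expected: "Ollama is running"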

Benefits:

  • No API costs or rate limits

  • Data stays local for privacy

  • Works offline

Requirements:

  • Sufficient disk space for model storage

  • GPU recommended for faster processing

AWS Bedrock Models

For cloud-based models like Claude:

  1. Create an AWS account with the Bedrock service enabled

  2. Configure AWS credentials via the AWS CLI, environment variables, or IAM roles

  3. Grant Bedrock permissions to your AWS user or role (a quick connectivity check is sketched below)
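Once credentials are configured, you can verify access by listing the foundation models available to your account (a minimal sketch using boto3; the region is illustrative):

import boto3

# Requires AWS credentials to be configured (step 2 above).
client = boto3.client("bedrock", region_name="eu-west-1")

# List the foundation models this account can access.
for model in client.list_foundation_models()["modelSummaries"]:
    print(model["modelId"])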

Basic Usage

Text Classification Pipeline

Laurium specializes in structured data extraction from text. Here's how to build a classification pipeline:

Using Ollama (Local)

from laurium.decoder_models import llm, prompts, pydantic_models, extract
from langchain_core.output_parsers import PydanticOutputParser
import pandas as pd
from typing import Literal

# 1. Create LLM instance
sentiment_llm = llm.create_llm(
    llm_platform="ollama", model_name="qwen2.5:7b", temperature=0.0
)

# 2. Define output schema
schema = {"ai_label": Literal[0, 1]}  # 1 for positive, 0 for negative
descriptions = {
    "ai_label": "Sentiment classification (1=positive, 0=negative)"
}

# 3. Build prompt with automatic schema integration
system_message = prompts.create_system_message(
    base_message="You are a sentiment analysis assistant. Use 1 for positive "
    "sentiment, 0 for negative sentiment.",
    keywords=["positive", "negative"],
)

extraction_prompt = prompts.create_prompt(
    system_message=system_message,
    examples=None,
    example_human_template=None,
    example_assistant_template=None,
    final_query="Analyze this text: {text}",
    schema=schema,  # Tell the LLM the output format we expect
    descriptions=descriptions,  # Provides field context to LLM
)

# 4. Create Pydantic model using same schema
OutputModel = pydantic_models.make_dynamic_example_model(
    schema=schema, descriptions=descriptions, model_name="SentimentOutput"
)

# 5. Create extractor and process data
parser = PydanticOutputParser(pydantic_object=OutputModel)
extractor = extract.BatchExtractor(
    llm=sentiment_llm, prompt=extraction_prompt, parser=parser
)

# Process your data
data = pd.DataFrame(
    {
        "text": [
            "I absolutely love this product!",
            "This is terrible, worst purchase ever.",
            "Great value for money, highly recommend!",
        ]
    }
)

results = extractor.process_chunk(data, text_column="text")
print(results.to_string(index=False))

Using AWS Bedrock

# Same code as above, but create LLM with Bedrock:
sentiment_llm = llm.create_llm(
    llm_platform="bedrock",
    model_name="anthropic.claude-3-haiku-20240307-v1:0",
    temperature=0.0,
    aws_region_name="eu-west-1",
)
# ... rest of the code remains the same

This will output something like:

                                    text  ai_label
         I absolutely love this product!         1
  This is terrible, worst purchase ever.         0
Great value for money, highly recommend!         1

Multi-Field Extraction

Define Complex Output Schemas

Extract multiple pieces of structured data simultaneously:

from typing import Literal

import pandas as pd
from langchain_core.output_parsers import PydanticOutputParser

from laurium.decoder_models import llm, prompts, pydantic_models, extract

# Create LLM instance
feedback_llm = llm.create_llm(
    llm_platform="ollama", model_name="qwen2.5:7b", temperature=0.0
)

# Schema for analyzing customer feedback, using Literal types for constraints
schema = {
    "sentiment": Literal["positive", "negative", "neutral"],
    "urgency": Literal[1, 2, 3, 4, 5],  # 1-5 scale
    "department": Literal["IT", "Support", "Product", "Sales", "Other"],
    "action_required": Literal["yes", "no"],
}

descriptions = {
    "sentiment": "Customer's emotional tone",
    "urgency": "How quickly this needs attention (1=low, 5=urgent)",
    "department": "Which department should handle this",
    "action_required": "Whether immediate action is needed",
}

# Build prompt with automatic schema integration
system_message = prompts.create_system_message(
    base_message="Analyze customer feedback and extract structured information.",
    keywords=["urgent", "complaint", "praise", "bug", "feature"],
)

# Schema automatically added to prompt - no manual JSON formatting needed!
extraction_prompt = prompts.create_prompt(
    system_message=system_message,
    examples=None,  # We'll add examples in the next section
    example_human_template=None,
    example_assistant_template=None,
    final_query="Feedback: {text}",
    schema=schema,  # Automatically shows allowed values and types
    descriptions=descriptions,  # Provides field context to LLM
)

FeedbackModel = pydantic_models.make_dynamic_example_model(
    schema=schema,
    descriptions=descriptions,
    model_name="CustomerFeedbackAnalysis",
)

Improve Accuracy with Examples

Add few-shot examples to guide the model:

# Training examples for better extraction - JSON format must match schema
few_shot_examples = [
    {
        "text": "System is down, can't access anything!",
        "sentiment": "negative",
        "urgency": 5,
        "department": "IT",
        "action_required": "yes",
    },
    {
        "text": "Love the new interface design",
        "sentiment": "positive",
        "urgency": 1,
        "department": "Product",
        "action_required": "yes",
    },
]

extraction_prompt = prompts.create_prompt(
    system_message=system_message,
    examples=few_shot_examples,
    example_human_template="Feedback: {text}",
    example_assistant_template="""{{
        "sentiment": "{sentiment}",
        "urgency": {urgency},
        "department": "{department}",
        "action_required": "{action_required}"
    }}""",
    final_query="Feedback: {text}",
    schema=schema,  # Schema formatting still included with examples
    descriptions=descriptions,
)

# Create extractor and process sample data
parser = PydanticOutputParser(pydantic_object=FeedbackModel)
extractor = extract.BatchExtractor(
    llm=feedback_llm,  # your LLM instance
    prompt=extraction_prompt,
    parser=parser,
)

# Sample customer feedback data
feedback_data = pd.DataFrame(
    {
        "text": [
            "The login system crashed and I lost all my work!",
            "Really appreciate the new dark mode feature",
            "Can we get a mobile app version soon?",
            "Billing charged me twice this month, need help",
        ]
    }
)

results = extractor.process_chunk(feedback_data, text_column="text")
print(results.to_string(index=False))

This will output something like:

                                            text sentiment  urgency department action_required
The login system crashed and I lost all my work!  negative        5         IT             yes
     Really appreciate the new dark mode feature  positive        2    Product              no
           Can we get a mobile app version soon?   neutral        3    Product             yes
  Billing charged me twice this month, need help  negative        3    Support             yes

Notebooks

The notebooks/ directory contains a combination of Jupyter and marimo notebooks for exploring Laurium. We recommend starting with the Jupyter notebooks, especially if you are unfamiliar with marimo.

To run one of the marimo notebooks:

  1. Clone the Laurium repo

  2. Sync dependencies with uv (uv sync)

  3. Run the notebook of your choosing with the command

    uv run marimo run notebooks/[name of notebook].py
    
  4. (For more advanced users) To get a deeper look at the code, you can open the notebook in "edit" mode, which allows you to view the code being run alongside the notebook itself.

    uv run marimo edit notebooks/[name of notebook].py
    

For more information about using marimo, check out their documentation.

Prompt engineering notebook

The prompt engineering notebook provides a walkthrough of using Laurium's decoder-only methods to extract custom information from the Rotten Tomatoes dataset of movie reviews. Starting from configuring the LLM, the notebook steps through writing a prompt, defining output fields and evaluating the results on this labelled dataset.

Fine-tuning notebook

The fine-tuning notebook illustrates a couple of different ways of fine-tuning transformer models using Laurium's encoder-only methods. This notebook is best run in marimo's edit mode, allowing the user to view both the code and the output at the same time.

Supported Models

Ollama (Local)

Use any model available in Ollama.
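For example, any model tag you have pulled locally can be passed as model_name (a sketch reusing create_llm from the examples above; llama3.1:8b is purely illustrative):

from laurium.decoder_models import llm

# Any locally pulled Ollama model tag works here.
local_llm = llm.create_llm(
    llm_platform="ollama", model_name="llama3.1:8b", temperature=0.0
)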

AWS Bedrock (Cloud)

Supported Bedrock models:

  • claude-3-sonnet - Best for complex extraction tasks
  • claude-3-haiku - Faster, cost-effective option

Modules Reference

Module          Sub-module       Description
decoder_models  llm              Create and manage LLM instances from Ollama and AWS Bedrock
                prompts          Create and manage prompt templates with optional few-shot examples
                extract          Efficient batch processing of text using LLMs
                pydantic_models  Dynamic Pydantic models for structured LLM output
components      extract_context  Extract keyword mentions with configurable context windows
                evaluate         Compute evaluation metrics for model predictions
                load             Load and chunk data from various sources (CSV, SQL, etc.)
encoder_models  nli              Natural Language Inference models for text analysis
                fine_tune        Fine-tune transformer models for custom tasks

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Why 'Laurium'?

Laurium was an ancient Greek mine, famed for its rich silver veins that fueled the rise of Athens as a Mediterranean powerhouse.

Just as Laurium’s silver generated immense wealth for ancient Athens, so modern text mining (based on LLMs) holds the potential to unlock huge untapped value from unstructured information.

Contact Us

Please reach out to the AI for Linked Data team at AI_for_linked_data@justice.gov.uk or bold@justice.gov.uk.
