Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search

These details have not been verified by PyPI

Project description

Corp Extractor

Extract structured subject-predicate-object statements from unstructured text using the T5-Gemma 2 model.

Features

Structured Extraction: Converts unstructured text into subject-predicate-object triples
Entity Type Recognition: Identifies 12 entity types (ORG, PERSON, GPE, LOC, PRODUCT, EVENT, etc.), with UNKNOWN as a fallback
High-Quality Output: Uses Diverse Beam Search to generate multiple candidate outputs and keeps the best one
Smart Retry Logic: Automatically retries extraction when the number of extracted statements falls below a configurable threshold
Multiple Output Formats: Get results as Pydantic models, JSON, XML, or dictionaries

Installation

pip install corp-extractor

Note: Requires PyTorch. For GPU support, install PyTorch with CUDA first:

pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install corp-extractor
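
To confirm that a CUDA-enabled PyTorch build was picked up before loading the model, a quick check along these lines may help. `pick_device` is a hypothetical helper, not part of corp-extractor; it relies only on the standard `torch.cuda.is_available()` call and falls back to "cpu" when PyTorch is not installed.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch is importable, else "cpu"."""
    # find_spec avoids an ImportError when PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

The returned string can be passed wherever a device name such as "cuda" or "cpu" is expected.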

Quick Start

from statement_extractor import extract_statements

# Extract statements from text
result = extract_statements("""
    Apple Inc. announced the iPhone 15 at their September event.
    Tim Cook presented the new features to customers worldwide.
""")

# Iterate over extracted statements
for stmt in result:
    print(f"{stmt.subject.text} ({stmt.subject.type}) "
          f"--[{stmt.predicate}]--> "
          f"{stmt.object.text} ({stmt.object.type})")

Output:

Apple Inc. (ORG) --[announced]--> iPhone 15 (PRODUCT)
Tim Cook (PERSON) --[presented]--> new features (UNKNOWN)

Output Formats

from statement_extractor import (
    extract_statements,
    extract_statements_as_json,
    extract_statements_as_xml,
    extract_statements_as_dict,
)

text = "Microsoft acquired GitHub in 2018."

# Pydantic models (default)
result = extract_statements(text)
for stmt in result.statements:
    print(stmt.subject, stmt.predicate, stmt.object)

# JSON string
json_output = extract_statements_as_json(text)
print(json_output)

# Raw XML (model's native format)
xml_output = extract_statements_as_xml(text)
print(xml_output)

# Python dictionary
dict_output = extract_statements_as_dict(text)
print(dict_output)

Advanced Usage

Custom Extraction Options

from statement_extractor import extract_statements, ExtractionOptions

options = ExtractionOptions(
    num_beams=8,              # More beams = more diverse candidates
    diversity_penalty=1.5,    # Higher = more diversity between beams
    max_new_tokens=4096,      # Max tokens to generate
    min_statement_ratio=0.5,  # Min statements per sentence
    max_attempts=5,           # Retry attempts for under-extraction
    deduplicate=True,         # Remove duplicate statements
)

result = extract_statements("Your text here...", options=options)
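
To give some intuition for `diversity_penalty`, here is a toy sketch of the Hamming diversity term from Diverse Beam Search: at each decoding step, a token already chosen by an earlier beam group has its score reduced by the penalty. The function name and the scores are made up for illustration, and this is not the library's or the model's actual decoding code.

```python
from collections import Counter


def apply_hamming_penalty(
    log_probs: dict[str, float],
    tokens_chosen_by_earlier_groups: list[str],
    diversity_penalty: float,
) -> dict[str, float]:
    """Lower the score of tokens that earlier beam groups already picked at this step."""
    counts = Counter(tokens_chosen_by_earlier_groups)
    return {
        token: score - diversity_penalty * counts[token]
        for token, score in log_probs.items()
    }


# Toy next-token scores for one decoding step (made-up numbers).
log_probs = {"announced": -0.2, "unveiled": -0.9, "launched": -1.4}

# Group 1 sees no earlier picks, so it chooses the top token unchanged.
group1 = max(log_probs, key=log_probs.get)

# Group 2 is penalized for repeating group 1's choice.
adjusted = apply_hamming_penalty(log_probs, [group1], diversity_penalty=1.5)
group2 = max(adjusted, key=adjusted.get)

print(group1, group2)  # announced unveiled
```

With a penalty of 0 both groups would pick "announced"; a larger penalty pushes later groups toward different wording.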

Using the Extractor Class

For better performance when processing multiple texts, create the extractor once and reuse it:

from statement_extractor import StatementExtractor

# Create extractor once
extractor = StatementExtractor(
    model_id="Corp-o-Rate-Community/statement-extractor",
    device="cuda",  # or "cpu"
)

# Process multiple texts
texts = ["Text 1...", "Text 2...", "Text 3..."]
for text in texts:
    result = extractor.extract(text)
    print(f"Found {len(result)} statements")
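
The benefit of reusing one instance can be illustrated with a stand-in class. `FakeExtractor` and `get_extractor` below are hypothetical and do not load any model; the sketch only assumes that construction is the expensive step and shows how caching keeps it to a single call.

```python
from functools import lru_cache


class FakeExtractor:
    """Stand-in for a model-backed extractor; construction is the slow part."""

    load_count = 0

    def __init__(self, model_id: str) -> None:
        FakeExtractor.load_count += 1  # count how often "loading" happens
        self.model_id = model_id

    def extract(self, text: str) -> list[str]:
        return text.split()


@lru_cache(maxsize=None)
def get_extractor(model_id: str) -> FakeExtractor:
    """Build the extractor on first use and return the cached instance afterwards."""
    return FakeExtractor(model_id)


for text in ["Text 1", "Text 2", "Text 3"]:
    get_extractor("demo-model").extract(text)

print(FakeExtractor.load_count)  # 1
```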

Pydantic Models

The library provides fully-typed Pydantic models:

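from statement_extractor import extract_statements, Statement, Entity, EntityType, ExtractionResult

# Run an extraction so there is a result to inspect
result = extract_statements("Apple Inc. announced the iPhone 15 at their September event.")

# Access statement properties
stmt: Statement = result.statements[0]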
print(stmt.subject.text)      # "Apple Inc."
print(stmt.subject.type)      # EntityType.ORG
print(stmt.predicate)         # "announced"
print(stmt.object.text)       # "iPhone 15"
print(stmt.source_text)       # Original sentence (if available)

# Convert to simple tuples
triples = result.to_triples()
# [("Apple Inc.", "announced", "iPhone 15"), ...]
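
Plain tuples are easy to post-process. As a small sketch, the following groups triples by subject using only the standard library; the third row is an illustrative addition and not actual model output.

```python
from collections import defaultdict

# Triples in the (subject, predicate, object) shape shown above.
triples = [
    ("Apple Inc.", "announced", "iPhone 15"),
    ("Tim Cook", "presented", "new features"),
    ("Apple Inc.", "held", "September event"),  # illustrative row
]

# Map each subject to the list of (predicate, object) pairs stated about it.
by_subject: dict[str, list[tuple[str, str]]] = defaultdict(list)
for subject, predicate, obj in triples:
    by_subject[subject].append((predicate, obj))

for subject, facts in by_subject.items():
    print(subject, facts)
```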

Entity Types

Type	Description	Example
`ORG`	Organizations	Apple Inc., United Nations
`PERSON`	People	Tim Cook, Elon Musk
`GPE`	Geopolitical entities	USA, California, Paris
`LOC`	Non-GPE locations	Mount Everest, Pacific Ocean
`PRODUCT`	Products	iPhone, Model S
`EVENT`	Events	World Cup, CES 2024
`WORK_OF_ART`	Creative works	Mona Lisa, Game of Thrones
`LAW`	Legal documents	GDPR, Clean Air Act
`DATE`	Dates	2024, January 15
`MONEY`	Monetary values	$50 million, €100
`PERCENT`	Percentages	25%, 0.5%
`QUANTITY`	Quantities	500 employees, 1.5 tons
`UNKNOWN`	Unrecognized	(fallback)
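
When handling these labels in your own code, an enum with a fallback mirrors the table. `EntityLabel` below is a hypothetical stand-in written for illustration; the library's own `EntityType` may be defined differently.

```python
from enum import Enum


class EntityLabel(str, Enum):
    """Hypothetical mirror of the table above; not the library's own EntityType."""

    ORG = "ORG"
    PERSON = "PERSON"
    GPE = "GPE"
    LOC = "LOC"
    PRODUCT = "PRODUCT"
    EVENT = "EVENT"
    WORK_OF_ART = "WORK_OF_ART"
    LAW = "LAW"
    DATE = "DATE"
    MONEY = "MONEY"
    PERCENT = "PERCENT"
    QUANTITY = "QUANTITY"
    UNKNOWN = "UNKNOWN"

    @classmethod
    def _missing_(cls, value: object) -> "EntityLabel":
        # Any label outside the table falls back to UNKNOWN.
        return cls.UNKNOWN


print(EntityLabel("ORG").value, EntityLabel("SPACESHIP").value)  # ORG UNKNOWN
```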

How It Works

This library uses the T5-Gemma 2 statement extraction model with Diverse Beam Search (Vijayakumar et al., 2016) to generate high-quality extractions:

Diverse Beam Search: Generates 4+ candidate outputs using beam groups with a diversity penalty
Quality-Based Retry: If the extraction count is below the threshold, automatically retries
Deduplication: Removes duplicate statements based on subject-predicate-object triples
Best Selection: Selects the longest valid output (typically the most complete)
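
The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: `generate` is a stand-in for the model call, "longest" is approximated by the number of triples, and the threshold is taken to be `min_statement_ratio` times the sentence count.

```python
from typing import Callable

Triple = tuple[str, str, str]


def dedupe(triples: list[Triple]) -> list[Triple]:
    """Drop repeated triples while keeping first-seen order."""
    return list(dict.fromkeys(triples))


def extract_with_retry(
    generate: Callable[[int], list[list[Triple]]],
    num_sentences: int,
    min_statement_ratio: float = 0.5,
    max_attempts: int = 5,
) -> list[Triple]:
    """Retry generation until enough statements are found, then return the best."""
    needed = min_statement_ratio * num_sentences
    best: list[Triple] = []
    for attempt in range(max_attempts):
        # Each attempt yields several candidate outputs (one per beam).
        candidates = [dedupe(c) for c in generate(attempt)]
        longest = max(candidates, key=len, default=[])
        if len(longest) > len(best):
            best = longest
        if len(best) >= needed:
            break
    return best


def fake_generate(attempt: int) -> list[list[Triple]]:
    """Stand-in for the model: the first attempt under-extracts."""
    a = ("Apple Inc.", "announced", "iPhone 15")
    b = ("Tim Cook", "presented", "new features")
    if attempt == 0:
        return [[], []]
    return [[a, a], [a, b, a]]


result = extract_with_retry(fake_generate, num_sentences=2, min_statement_ratio=1.0)
print(result)
```

Here the first attempt returns nothing, so the loop retries; the second attempt's candidates are deduplicated and the one with the most triples is kept.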

Requirements

Python 3.10+
PyTorch 2.0+
Transformers 4.35+
Pydantic 2.0+
~2GB VRAM (GPU) or ~4GB RAM (CPU)
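
A quick way to compare an environment against the version minimums above, using only the standard library. The helper names are hypothetical, only major.minor is compared, and the distribution names `torch`, `transformers`, and `pydantic` are assumed to match how the packages were installed.

```python
import re
import sys
from importlib import metadata
from typing import Optional

MINIMUMS = {"torch": (2, 0), "transformers": (4, 35), "pydantic": (2, 0)}


def major_minor(version: str) -> Optional[tuple[int, int]]:
    """Parse the leading "major.minor" of a version string such as "2.1.0+cu121"."""
    match = re.match(r"(\d+)\.(\d+)", version)
    return (int(match[1]), int(match[2])) if match else None


def check_requirements() -> dict[str, str]:
    """Return a status per requirement: "ok", "too old", "missing", or "unknown"."""
    status = {"python": "ok" if sys.version_info >= (3, 10) else "too old"}
    for package, minimum in MINIMUMS.items():
        try:
            installed = major_minor(metadata.version(package))
        except metadata.PackageNotFoundError:
            status[package] = "missing"
            continue
        if installed is None:
            status[package] = "unknown"
        else:
            status[package] = "ok" if installed >= minimum else "too old"
    return status


print(check_requirements())
```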

License

MIT License - see LICENSE file for details.

Project details

Release history

Version	Date
0.9.4	Feb 2, 2026
0.9.3	Jan 26, 2026
0.9.0	Jan 22, 2026
0.8.0	Jan 21, 2026
0.6.0	Jan 20, 2026
0.5.0	Jan 20, 2026
0.4.1	Jan 16, 2026
0.4.0	Jan 16, 2026
0.3.0	Jan 15, 2026
0.2.11	Jan 15, 2026
0.2.10	Jan 15, 2026
0.2.9	Jan 15, 2026
0.2.8	Jan 15, 2026
0.2.7	Jan 15, 2026
0.2.6	Jan 15, 2026
0.2.5	Jan 15, 2026
0.2.4	Jan 14, 2026
0.2.3	Jan 14, 2026
0.2.2	Jan 14, 2026
0.2.1	Jan 14, 2026
0.2.0	Jan 14, 2026
0.1.0 (this version)	Jan 14, 2026

Download files

Source Distribution

corp_extractor-0.1.0.tar.gz (8.6 kB)

Uploaded Jan 14, 2026 Source

Built Distribution

corp_extractor-0.1.0-py3-none-any.whl (9.6 kB)

Uploaded Jan 14, 2026 Python 3

File details

Details for the file corp_extractor-0.1.0.tar.gz.

File metadata

Download URL: corp_extractor-0.1.0.tar.gz
Upload date: Jan 14, 2026
Size: 8.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.14

File hashes

Hashes for corp_extractor-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`288a29b3b0ebe2bfb05d37794d12ebd74956f3b8ffe2205bc18dc0fe964dd68a`
MD5	`93daf8b1486bd7584454e939bbdc3c97`
BLAKE2b-256	`244f6020935a150a6c3fb969d7d05dbb7a2fbfb330fa93e5fcced080c49a38f0`

File details

Details for the file corp_extractor-0.1.0-py3-none-any.whl.

File metadata

Download URL: corp_extractor-0.1.0-py3-none-any.whl
Upload date: Jan 14, 2026
Size: 9.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.14

File hashes

Hashes for corp_extractor-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a2bd5e80e6da02cfab6b136904ea30ade65f56b0fbaf633b8e96c85c8f2af683`
MD5	`49ea25135c7e3d0ef1adc66d7a2e38a4`
BLAKE2b-256	`345ddf47587ddf9364f9b307bc20571ea688f29018d0a817fae308e0cf76945e`
