Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search

Corp Extractor

Extract structured subject-predicate-object statements from unstructured text using the T5-Gemma 2 model.

Features

  • Structured Extraction: Converts unstructured text into subject-predicate-object triples
  • Entity Type Recognition: Identifies 12 entity types (ORG, PERSON, GPE, LOC, PRODUCT, EVENT, etc.)
  • High-Quality Output: Uses Diverse Beam Search to generate multiple candidates
  • Smart Retry Logic: Automatically retries extraction if output quality is below threshold
  • Multiple Output Formats: Get results as Pydantic models, JSON, XML, or dictionaries

Installation

pip install corp-extractor

Note: Requires PyTorch. For GPU support, install PyTorch with CUDA first:

pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install corp-extractor
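
After installing, it can help to confirm whether a CUDA-enabled PyTorch is actually usable before asking for a GPU device. The helper below is a minimal sketch, not part of corp-extractor: `pick_device` is a hypothetical name, and it relies only on `torch.cuda.is_available()`.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch is usable, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so no GPU path is possible
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

The returned string can then be passed wherever a device name is expected, for example the `device` argument shown in the extractor class section.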

Quick Start

from statement_extractor import extract_statements

# Extract statements from text
result = extract_statements("""
    Apple Inc. announced the iPhone 15 at their September event.
    Tim Cook presented the new features to customers worldwide.
""")

# Iterate over extracted statements
for stmt in result:
    print(f"{stmt.subject.text} ({stmt.subject.type}) "
          f"--[{stmt.predicate}]--> "
          f"{stmt.object.text} ({stmt.object.type})")

Output:

Apple Inc. (ORG) --[announced]--> iPhone 15 (PRODUCT)
Tim Cook (PERSON) --[presented]--> new features (UNKNOWN)

Output Formats

from statement_extractor import (
    extract_statements,
    extract_statements_as_json,
    extract_statements_as_xml,
    extract_statements_as_dict,
)

text = "Microsoft acquired GitHub in 2018."

# Pydantic models (default)
result = extract_statements(text)
for stmt in result.statements:
    print(stmt.subject, stmt.predicate, stmt.object)

# JSON string
json_output = extract_statements_as_json(text)
print(json_output)

# Raw XML (model's native format)
xml_output = extract_statements_as_xml(text)
print(xml_output)

# Python dictionary
dict_output = extract_statements_as_dict(text)
print(dict_output)

Advanced Usage

Custom Extraction Options

from statement_extractor import extract_statements, ExtractionOptions

options = ExtractionOptions(
    num_beams=8,              # More beams = more diverse candidates
    diversity_penalty=1.5,    # Higher = more diversity between beams
    max_new_tokens=4096,      # Max tokens to generate
    min_statement_ratio=0.5,  # Min statements per sentence
    max_attempts=5,           # Retry attempts for under-extraction
    deduplicate=True,         # Remove duplicate statements
)

result = extract_statements("Your text here...", options=options)
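
To give some intuition for `num_beams` and `diversity_penalty`, here is a toy sketch of the Hamming diversity idea behind Diverse Beam Search, reduced to a single decoding step with one beam per group. It is an illustration under those simplifying assumptions, not the library's or Transformers' implementation; `pick_tokens_with_diversity` and the scores are made up for the example.

```python
from collections import Counter
from typing import Dict, List


def pick_tokens_with_diversity(
    log_probs: Dict[str, float],
    num_groups: int,
    diversity_penalty: float,
) -> List[str]:
    """Pick one token per beam group for a single decoding step.

    Every group sees the same model scores, minus diversity_penalty times
    the number of earlier groups that already chose that token.
    """
    chosen: List[str] = []
    for _ in range(num_groups):
        used = Counter(chosen)  # how often each token was picked so far
        adjusted = {
            token: score - diversity_penalty * used[token]
            for token, score in log_probs.items()
        }
        chosen.append(max(adjusted, key=adjusted.get))
    return chosen


# Illustrative log-probabilities for the next token (not real model output)
step_scores = {"announced": -0.2, "unveiled": -0.9, "launched": -1.4}
print(pick_tokens_with_diversity(step_scores, num_groups=3, diversity_penalty=0.0))
print(pick_tokens_with_diversity(step_scores, num_groups=3, diversity_penalty=1.5))
```

With no penalty every group picks the same top token; with a penalty of 1.5 each later group is pushed toward a different token, which is why a higher value yields more varied candidates.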

Using the Extractor Class

For better performance when processing multiple texts, create the extractor once and reuse it:

from statement_extractor import StatementExtractor

# Create extractor once
extractor = StatementExtractor(
    model_id="Corp-o-Rate-Community/statement-extractor",
    device="cuda",  # or "cpu"
)

# Process multiple texts
texts = ["Text 1...", "Text 2...", "Text 3..."]
for text in texts:
    result = extractor.extract(text)
    print(f"Found {len(result)} statements")
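
In a larger application the "create once, reuse" pattern can be wrapped in a cached factory so that every caller shares one instance. The sketch below uses a hypothetical `DummyExtractor` stand-in so it runs without downloading a model; `get_extractor` is likewise an illustrative name, not a library function.

```python
from functools import lru_cache


class DummyExtractor:  # hypothetical stand-in for StatementExtractor
    loads = 0  # counts how many times the "model" was loaded

    def __init__(self, model_id: str, device: str) -> None:
        DummyExtractor.loads += 1  # pretend this is the expensive model load
        self.model_id = model_id
        self.device = device

    def extract(self, text: str) -> list:
        return []  # a real extractor would return statements here


@lru_cache(maxsize=None)
def get_extractor(model_id: str, device: str = "cpu") -> DummyExtractor:
    """Build the extractor on first use and reuse it for later calls."""
    return DummyExtractor(model_id, device)


for text in ["Text 1...", "Text 2...", "Text 3..."]:
    get_extractor("Corp-o-Rate-Community/statement-extractor").extract(text)

print(DummyExtractor.loads)
```

Because `lru_cache` keys on the arguments, asking for a different `model_id` or `device` would build a separate instance.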

Pydantic Models

The library provides fully-typed Pydantic models:

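from statement_extractor import (
    Statement, Entity, EntityType, ExtractionResult, extract_statements,
)

# Run an extraction so there is a result to inspect
result = extract_statements("Apple Inc. announced the iPhone 15 at their September event.")

# Access statement properties
stmt: Statement = result.statements[0]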
print(stmt.subject.text)      # "Apple Inc."
print(stmt.subject.type)      # EntityType.ORG
print(stmt.predicate)         # "announced"
print(stmt.object.text)       # "iPhone 15"
print(stmt.source_text)       # Original sentence (if available)

# Convert to simple tuples
triples = result.to_triples()
# [("Apple Inc.", "announced", "iPhone 15"), ...]
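
Plain triples are easy to post-process with the standard library. As a small sketch, the hypothetical helper `index_by_subject` below groups triples by subject; the input list is hand-written in the shape shown above, not real model output.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def index_by_subject(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each subject to its (predicate, object) pairs, preserving order."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return dict(index)


triples = [
    ("Apple Inc.", "announced", "iPhone 15"),
    ("Tim Cook", "presented", "new features"),
    ("Apple Inc.", "held", "September event"),  # illustrative, not model output
]
for subject, facts in index_by_subject(triples).items():
    print(subject, facts)
```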

Entity Types

| Type | Description | Example |
|------|-------------|---------|
| ORG | Organizations | Apple Inc., United Nations |
| PERSON | People | Tim Cook, Elon Musk |
| GPE | Geopolitical entities | USA, California, Paris |
| LOC | Non-GPE locations | Mount Everest, Pacific Ocean |
| PRODUCT | Products | iPhone, Model S |
| EVENT | Events | World Cup, CES 2024 |
| WORK_OF_ART | Creative works | Mona Lisa, Game of Thrones |
| LAW | Legal documents | GDPR, Clean Air Act |
| DATE | Dates | 2024, January 15 |
| MONEY | Monetary values | $50 million, €100 |
| PERCENT | Percentages | 25%, 0.5% |
| QUANTITY | Quantities | 500 employees, 1.5 tons |
| UNKNOWN | Unrecognized | (fallback) |
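
The fallback behaviour of UNKNOWN can be pictured as a lookup that never fails. The enum below is a stand-in written for this example (`EntityKind` is not the library's `EntityType`, whose exact definition is not shown here), and `parse_entity_kind` is a hypothetical helper.

```python
from enum import Enum


class EntityKind(str, Enum):  # stand-in, not the library's EntityType
    ORG = "ORG"
    PERSON = "PERSON"
    GPE = "GPE"
    LOC = "LOC"
    PRODUCT = "PRODUCT"
    EVENT = "EVENT"
    WORK_OF_ART = "WORK_OF_ART"
    LAW = "LAW"
    DATE = "DATE"
    MONEY = "MONEY"
    PERCENT = "PERCENT"
    QUANTITY = "QUANTITY"
    UNKNOWN = "UNKNOWN"


def parse_entity_kind(label: str) -> EntityKind:
    """Normalize a raw label; anything unrecognized becomes UNKNOWN."""
    try:
        return EntityKind(label.strip().upper())
    except ValueError:
        return EntityKind.UNKNOWN


print(parse_entity_kind(" org "))
print(parse_entity_kind("ANIMAL"))
```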

How It Works

This library uses the T5-Gemma 2 statement extraction model with Diverse Beam Search (Vijayakumar et al., 2016) to generate high-quality extractions:

  1. Diverse Beam Search: Generates 4+ candidate outputs using beam groups with diversity penalty
  2. Quality-Based Retry: If extraction count is below threshold, automatically retries
  3. Deduplication: Removes duplicate statements based on subject-predicate-object triples
  4. Best Selection: Selects the longest valid output (typically the most complete)
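
Steps 2 to 4 can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: "longest" is measured here as the number of statements, duplicates are compared case-insensitively, sentences are counted with a naive regex, and `generate_candidates` is a hypothetical callable standing in for one round of beam search.

```python
import re
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]
Candidates = List[List[Triple]]


def count_sentences(text: str) -> int:
    """Naive count: split after '.', '!' or '?' followed by whitespace."""
    parts = [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]
    return max(1, len(parts))


def deduplicate(statements: List[Triple]) -> List[Triple]:
    """Drop repeated triples (case-insensitive), keeping first occurrences."""
    seen = set()
    unique: List[Triple] = []
    for triple in statements:
        key = tuple(part.strip().casefold() for part in triple)
        if key not in seen:
            seen.add(key)
            unique.append(triple)
    return unique


def select_best(candidates: Candidates) -> List[Triple]:
    """Deduplicate each candidate, then keep the one with the most statements."""
    cleaned = [deduplicate(c) for c in candidates]
    return max(cleaned, key=len, default=[])


def extract_pipeline(
    text: str,
    generate_candidates: Callable[[str], Candidates],  # hypothetical model call
    min_statement_ratio: float = 0.5,
    max_attempts: int = 5,
) -> List[Triple]:
    """Retry until statements per sentence reaches the ratio; keep the best attempt."""
    needed = min_statement_ratio * count_sentences(text)
    best: List[Triple] = []
    for _ in range(max_attempts):
        attempt = select_best(generate_candidates(text))
        if len(attempt) > len(best):
            best = attempt
        if len(best) >= needed:
            break  # enough statements, no further retries
    return best
```

Keeping the best attempt seen so far means a weak final retry cannot make the result worse than an earlier one.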

Requirements

  • Python 3.10+
  • PyTorch 2.0+
  • Transformers 4.35+
  • Pydantic 2.0+
  • ~2GB VRAM (GPU) or ~4GB RAM (CPU)
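
The version requirements above can be checked programmatically. The following is a minimal sketch using only the standard library; `check_requirements` and `parse_major_minor` are hypothetical helpers, and the distribution names `torch`, `transformers`, and `pydantic` are assumed to match what is installed.

```python
import re
import sys
from importlib import metadata
from typing import Dict, Tuple

MINIMUMS = {"torch": (2, 0), "transformers": (4, 35), "pydantic": (2, 0)}


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Turn a version such as '2.1.0+cu121' into (2, 1)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_requirements() -> Dict[str, str]:
    """Report 'ok', 'missing', or 'too old (...)' for each requirement."""
    report = {"python": "ok" if sys.version_info >= (3, 10) else "too old"}
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = parse_major_minor(installed) >= minimum
        report[name] = "ok" if ok else f"too old ({installed})"
    return report


print(check_requirements())
```

This only checks versions; it does not measure available VRAM or RAM.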

License

MIT License - see LICENSE file for details.
