Information extractor using spaCy + SpanBERT

information_extractor

Overview

License: MIT · Python 3.8+ · Code style: black

information_extractor is a Python package that combines spaCy, coreferee, and SpanBERT to extract structured relationships between entities in natural language text. It is built for anyone who wants named entity recognition (NER), coreference resolution, and relation extraction in a single pipeline.

Features

✅ Entity Linking & Coreference Resolution

  • Uses spaCy with coreferee to resolve pronouns and link entity mentions.
  • Flexible support for multiple entity types: PERSON, ORG, LOC, DATE, etc.
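
Why resolving pronouns matters for relation extraction can be shown with a toy sketch. This is not coreferee's API: the `antecedents` mapping below is written by hand and the helper name `resolve_mentions` is hypothetical. It only illustrates the effect of substituting a pronoun with the mention it refers to before relations are extracted.

```python
from typing import Dict, List


def resolve_mentions(tokens: List[str], antecedents: Dict[int, str]) -> List[str]:
    """Replace each token whose index appears in `antecedents` with its antecedent text."""
    return [antecedents.get(i, tok) for i, tok in enumerate(tokens)]


tokens = "Sundar Pichai is the CEO of Google . He lives in California .".split()

# Hand-written mapping (an assumption for illustration): token 8 ("He") refers to "Sundar Pichai".
antecedents = {8: "Sundar Pichai"}

resolved = " ".join(resolve_mentions(tokens, antecedents))
print(resolved)
```

After substitution, the second sentence names its subject explicitly, so a relation classifier can pair "Sundar Pichai" with "California" instead of the pronoun.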

✅ Relation Extraction with SpanBERT

  • Uses a SpanBERT model fine-tuned on TACRED.
  • Handles subject/object marking and context-aware classification.
  • Confidence scoring and de-duplication of extracted relations.
  • GPU acceleration supported out of the box.
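
Subject/object marking can be sketched as follows. This is a minimal illustration, not the package's implementation: the marker strings, the `Span` type, and the function `mark_entities` are all hypothetical, and the model shipped with this package may encode entity positions differently.

```python
from typing import List, Tuple

# Hypothetical marker tokens (an assumption); the real model may use other special tokens.
SUBJ_START, SUBJ_END = "[SUBJ]", "[/SUBJ]"
OBJ_START, OBJ_END = "[OBJ]", "[/OBJ]"

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def mark_entities(tokens: List[str], subj: Span, obj: Span) -> List[str]:
    """Return a copy of `tokens` with marker tokens wrapped around both spans."""
    if not (subj[1] <= obj[0] or obj[1] <= subj[0]):
        raise ValueError("subject and object spans must not overlap")
    marked = list(tokens)
    spans = [(subj, SUBJ_START, SUBJ_END), (obj, OBJ_START, OBJ_END)]
    # Process the right-most span first so that earlier indices stay valid.
    for (start, end), open_tok, close_tok in sorted(
        spans, key=lambda s: s[0][0], reverse=True
    ):
        marked.insert(end, close_tok)
        marked.insert(start, open_tok)
    return marked


tokens = "Barack Obama was born in Hawaii .".split()
marked = mark_entities(tokens, subj=(0, 2), obj=(5, 6))
print(" ".join(marked))
# [SUBJ] Barack Obama [/SUBJ] was born in [OBJ] Hawaii [/OBJ] .
```

Marking the two spans lets a classifier see which entity pair is being judged, so the same sentence can be scored once per candidate pair.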

✅ CLI Interface

ie --text "Barack Obama was born in Hawaii." [--deps]
  • --deps: Downloads and installs required pretrained models if not present.

Installation

pip install information_extractor

Optional: Download model dependencies

Run the following once to download the SpanBERT weights, the spaCy model, and the coreferee model:

ie --deps

Alternatively, you can import and run the dependency script directly:

from information_extractor.dependency import setup_dependencies
setup_dependencies()

Example Usage

from information_extractor.pipeline import RelationExtractor

text = "Sundar Pichai is the CEO of Google. He lives in California."

extractor = RelationExtractor()
results = extractor.extract(text)

for relation in results:
    print(relation)

Sample Output

[
  {
    "subject": "Sundar Pichai",
    "object": "Google",
    "relation": "per:employee_of",
    "confidence": 0.92
  },
  ...
]
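
The package reports confidence scoring and de-duplication on its own. If a stricter filter is wanted on the caller's side, something like the sketch below could be applied to the results, assuming each result is a dict shaped like the sample output. The function `postprocess`, its `min_confidence` parameter, and the sample records are hypothetical and not part of the package.

```python
from typing import Dict, List, Tuple


def postprocess(relations: List[Dict], min_confidence: float = 0.5) -> List[Dict]:
    """Drop low-confidence relations and keep the best-scoring copy of each triple."""
    best: Dict[Tuple[str, str, str], Dict] = {}
    for rel in relations:
        if rel["confidence"] < min_confidence:
            continue
        key = (rel["subject"], rel["relation"], rel["object"])
        if key not in best or rel["confidence"] > best[key]["confidence"]:
            best[key] = rel
    # Highest-confidence relations first.
    return sorted(best.values(), key=lambda r: r["confidence"], reverse=True)


# Made-up records in the shape of the sample output (not real model output).
sample = [
    {"subject": "Sundar Pichai", "object": "Google", "relation": "per:employee_of", "confidence": 0.92},
    {"subject": "Sundar Pichai", "object": "Google", "relation": "per:employee_of", "confidence": 0.81},
    {"subject": "Sundar Pichai", "object": "California", "relation": "per:employee_of", "confidence": 0.12},
]

filtered = postprocess(sample, min_confidence=0.5)
for relation in filtered:
    print(relation)
```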

Project Structure

information_extractor/
├── assets/
│   └── pretrained_spanbert/
├── dependency.py         # Downloads all model dependencies
├── pipeline.py           # Core logic for NLP + SpanBERT
└── main.py               # CLI entrypoint

Pretrained Assets

Models are downloaded from hosted GitHub release assets:

  • SpanBERT weights & config
  • en_core_web_md spaCy model
  • coreferee_model_en for coreference resolution
  • torch wheel for reproducibility

Citation

This project builds on the work of Facebook Research. If you use SpanBERT, please cite:

@article{joshi2019spanbert,
  title={{SpanBERT}: Improving Pre-training by Representing and Predicting Spans},
  author={Mandar Joshi and Danqi Chen and Yinhan Liu and Daniel S. Weld and Luke Zettlemoyer and Omer Levy},
  journal={arXiv preprint arXiv:1907.10529},
  year={2019}
}

License

MIT. See LICENSE for full terms.
Note: This project redistributes pretrained model weights for convenience under fair use for research.
