A powerful Named Entity Recognition and Resolution library with semantic matching

Advanced Text Processing

A powerful Named Entity Recognition (NER) and Entity Resolution library designed for complex text processing tasks. It combines state-of-the-art NLP models (spaCy, Transformers) with robust knowledge bases (Wikidata, WordNet) to provide accurate entity extraction, canonicalization, and semantic matching.

🚀 Features

  • Advanced Entity Resolution:
    • Mode A (Sequential): Fast, early-stopping pipeline for high-confidence matches.
    • Mode B (Parallel): Aggregates multiple signals (fuzzy, semantic, contextual) for difficult cases.
  • Semantic Matching: Maps inputs to canonical schemas using sentence embeddings (SentenceTransformers).
  • Alias Retrieval: Automatically fetches aliases from Wikidata and synonyms from WordNet.
  • Canonicalization:
    • Entities (e.g., "Apple" -> "Apple Inc.")
    • Relationships (e.g., "relies on" -> "depends_on")
    • Properties (e.g., "birth date" -> "date_of_birth")
  • Flexible Candidate Generation: Supports exact lookup, fulltext blocking, and ANN search (FAISS/hnswlib).
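
The difference between the two resolution modes can be illustrated with a small standard-library sketch. This is not the library's implementation: the function names, the 0.85 threshold, and the equal weights are hypothetical, and `difflib` plus a token-overlap score stand in for the fuzzy, semantic, and contextual signals listed above.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def fuzzy_score(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def resolve_sequential(mention: str, candidates: List[str],
                       threshold: float = 0.85) -> Optional[str]:
    """Sequential style: try cheap stages in order, stop at the first hit."""
    # Stage 1: exact (case-insensitive) match.
    for candidate in candidates:
        if candidate.lower() == mention.lower():
            return candidate
    # Stage 2: first candidate whose fuzzy score clears the threshold.
    for candidate in candidates:
        if fuzzy_score(mention, candidate) >= threshold:
            return candidate
    return None


def resolve_parallel(mention: str, candidates: List[str],
                     fuzzy_weight: float = 0.5,
                     overlap_weight: float = 0.5) -> Optional[str]:
    """Parallel style: score every candidate on all signals, keep the best."""
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda c: fuzzy_weight * fuzzy_score(mention, c)
        + overlap_weight * token_overlap(mention, c),
    )
```

The sequential variant returns as soon as one stage is confident, which keeps the common case cheap; the parallel variant always scores every candidate, which costs more but can still rank candidates when no single signal is decisive.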

📦 Installation

pip install advanced-text-processing

After installation, download the required models:

# Download spaCy model
python -m spacy download en_core_web_lg

# Download NLTK data
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"

See Installation Guide for detailed instructions.
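
To confirm that the downloads above succeeded before running the examples, a check along these lines may help. It is a sketch that relies only on the fact that spaCy models install as importable Python packages; the helper name `missing_packages` is hypothetical and not part of this library.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # en_core_web_lg is the spaCy model installed in the step above.
    missing = missing_packages(["spacy", "nltk", "en_core_web_lg"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This only verifies that the packages are importable; it does not check that the NLTK corpora were downloaded.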

⚡ Quick Start

Named Entity Recognition

from ner_lib import recognize_entities

text = "Apple Inc. was founded by Steve Jobs in Cupertino."
result = recognize_entities(text)

for entity in result['entities']:
    print(f"{entity['text']} ({entity['type']})")
# Output:
# Apple Inc. (ORG)
# Steve Jobs (PERSON)
# Cupertino (GPE)
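
A common next step is to group the recognized entities by type. The sketch below assumes only the result shape shown above (a list of dicts with `text` and `type` keys); `group_by_type` is a hypothetical helper, and the sample data is copied from the example output rather than produced by the library.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_type(entities: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Collect entity texts under their entity type, preserving order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        grouped[entity["type"]].append(entity["text"])
    return dict(grouped)


# Sample mirroring the output shown above (not a live library call).
sample_entities = [
    {"text": "Apple Inc.", "type": "ORG"},
    {"text": "Steve Jobs", "type": "PERSON"},
    {"text": "Cupertino", "type": "GPE"},
]
by_type = group_by_type(sample_entities)
```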

Entity Canonicalization

from ner_lib import canonicalize_entity

# Canonicalize an entity mention
result = canonicalize_entity("apple inc", mode="progressive")
print(f"Canonical: {result['canonical_name']}")
print(f"Aliases: {result['aliases']}")
# Output:
# Canonical: Apple Inc.
# Aliases: ['Apple', 'AAPL', 'Apple Computer', ...]
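
Conceptually, alias-based canonicalization can be pictured as a lookup from normalized surface forms to a canonical name. The following is a minimal sketch under that assumption, not the library's internals; `normalize`, `build_alias_index`, and the sample record are hypothetical, with the aliases taken from the example output above.

```python
import re
from typing import Dict, List


def normalize(mention: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", mention.lower()).split())


def build_alias_index(records: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every normalized alias (and the canonical name itself) to the canonical name."""
    index: Dict[str, str] = {}
    for canonical, aliases in records.items():
        for form in [canonical, *aliases]:
            index[normalize(form)] = canonical
    return index


# Hypothetical record built from the aliases in the example output.
alias_index = build_alias_index(
    {"Apple Inc.": ["Apple", "AAPL", "Apple Computer"]}
)
canonical = alias_index.get(normalize("apple inc"))
```

An exact lookup like this is cheap, so it is a natural first stage before falling back to fuzzy or semantic matching.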

Relationship Canonicalization

from ner_lib import Config, canonicalize_relationship

# Configure semantic matching
config = Config()
config.semantic_matching.enabled = True
config.semantic_matching.canonical_relationships = ["depends_on", "created_by"]

# Canonicalize a relationship phrase
result = canonicalize_relationship("relies heavily on", config=config)
print(f"Canonical: {result['canonical_name']}")
# Output: Canonical: depends_on
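
The idea behind semantic matching is to embed the input phrase and each canonical name, then pick the closest name by cosine similarity. The sketch below shows only that selection step. It is not the library's implementation: `match_to_schema`, the 0.5 threshold, and the hand-picked two-dimensional `TOY_VECTORS` are hypothetical stand-ins for a real sentence-embedding model.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_to_schema(phrase: str, canonical_names: Sequence[str],
                    embed: Callable[[str], Vector],
                    threshold: float = 0.5) -> Optional[str]:
    """Return the canonical name most similar to the phrase, or None if none clears the threshold."""
    query = embed(phrase)
    best_name, best_score = None, threshold
    for name in canonical_names:
        score = cosine(query, embed(name))
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Made-up vectors for illustration only; a real setup would call an embedding model.
TOY_VECTORS = {
    "relies heavily on": (0.9, 0.1),
    "depends_on": (1.0, 0.0),
    "created_by": (0.0, 1.0),
}


def toy_embed(text: str) -> Vector:
    return TOY_VECTORS[text]


match = match_to_schema("relies heavily on", ["depends_on", "created_by"], toy_embed)
```

The threshold lets the matcher return no match rather than force a poor one, which matters when an input phrase falls outside the configured schema.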

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on how to get started.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgements

This library stands on the shoulders of giants. We gratefully acknowledge the open-source projects it builds on; see ACKNOWLEDGEMENTS.md for the full list of dependencies and credits.
