A powerful Named Entity Recognition and Resolution library with semantic matching

Advanced Text Processing

A powerful Named Entity Recognition (NER) and Entity Resolution library designed for complex text processing tasks. It combines state-of-the-art NLP models (spaCy, Transformers) with robust knowledge bases (Wikidata, WordNet) to provide accurate entity extraction, canonicalization, and semantic matching.

🚀 Features

  • Advanced Entity Resolution:
    • Mode A (Sequential): Fast, early-stopping pipeline for high-confidence matches.
    • Mode B (Parallel): Aggregates multiple signals (fuzzy, semantic, contextual) for difficult cases.
  • Semantic Matching: Maps inputs to canonical schemas using sentence embeddings (SentenceTransformers).
  • Alias Retrieval: Automatically fetches aliases from Wikidata and synonyms from WordNet.
  • Canonicalization:
    • Entities (e.g., "Apple" -> "Apple Inc.")
    • Relationships (e.g., "relies on" -> "depends_on")
    • Properties (e.g., "birth date" -> "date_of_birth")
  • Flexible Candidate Generation: Supports exact lookup, fulltext blocking, and ANN search (FAISS/hnswlib).
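
The difference between the two resolution modes can be illustrated with a small, self-contained sketch. This is not the library's implementation: the `KB` dictionary, the two scoring functions, the weights, and the thresholds are all hypothetical, and only the standard library's `difflib` is used in place of the semantic and contextual signals.

```python
from difflib import SequenceMatcher

# Hypothetical knowledge base: canonical name -> known lowercase aliases.
KB = {
    "Apple Inc.": ["apple", "apple inc", "apple computer"],
    "Alphabet Inc.": ["google", "alphabet"],
}


def exact_score(mention, aliases):
    """Return 1.0 if the lowercased mention is a known alias, else 0.0."""
    return 1.0 if mention.lower() in aliases else 0.0


def fuzzy_score(mention, aliases):
    """Return the best character-level similarity against any alias."""
    return max(SequenceMatcher(None, mention.lower(), a).ratio() for a in aliases)


def resolve_sequential(mention, threshold=0.85):
    """Mode A style: try cheap signals first and stop at the first confident hit."""
    for scorer in (exact_score, fuzzy_score):
        best = max(KB, key=lambda name: scorer(mention, KB[name]))
        if scorer(mention, KB[best]) >= threshold:
            return best
    return None


def resolve_parallel(mention, weights=(0.5, 0.5), threshold=0.4):
    """Mode B style: score every candidate with all signals, then combine."""
    def combined(name):
        return (weights[0] * exact_score(mention, KB[name])
                + weights[1] * fuzzy_score(mention, KB[name]))

    best = max(KB, key=combined)
    return best if combined(best) >= threshold else None


print(resolve_sequential("Apple Inc"))  # exact alias hit, stops early
print(resolve_parallel("Aple Inc"))     # misspelling, resolved by the combined score
```

The sequential variant never computes the fuzzy score when the exact lookup already succeeds, which is where the speed advantage comes from. The parallel variant pays for every signal but can still resolve a mention that no single signal would accept on its own.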

📦 Installation

pip install advanced-text-processing

After installation, download the required models:

# Download spaCy model
python -m spacy download en_core_web_lg

# Download NLTK data
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"
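
To confirm that the downloads above succeeded, a short check script may help. The sketch below is not part of the library; `check_setup` is a hypothetical helper that relies only on `importlib` and on `nltk.data.find`, and it looks for both the unpacked and the zipped form of each NLTK corpus.

```python
import importlib.util


def _nltk_resource_present(name):
    """Return True if an NLTK corpus is found, unpacked or as a zip archive."""
    import nltk

    for candidate in (f"corpora/{name}", f"corpora/{name}.zip"):
        try:
            nltk.data.find(candidate)
            return True
        except LookupError:
            continue
    return False


def check_setup():
    """Report which pieces of the setup are present in this environment."""
    status = {
        "spacy": importlib.util.find_spec("spacy") is not None,
        # Models installed via `spacy download` are importable packages.
        "en_core_web_lg": importlib.util.find_spec("en_core_web_lg") is not None,
        "nltk": importlib.util.find_spec("nltk") is not None,
        "wordnet": False,
        "omw-1.4": False,
    }
    if status["nltk"]:
        for resource in ("wordnet", "omw-1.4"):
            status[resource] = _nltk_resource_present(resource)
    return status


if __name__ == "__main__":
    for name, ok in check_setup().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```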

See Installation Guide for detailed instructions.

⚡ Quick Start

Named Entity Recognition

from ner_lib import recognize_entities

text = "Apple Inc. was founded by Steve Jobs in Cupertino."
result = recognize_entities(text)

for entity in result['entities']:
    print(f"{entity['text']} ({entity['type']})")
# Output:
# Apple Inc. (ORG)
# Steve Jobs (PERSON)
# Cupertino (GPE)
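
A common next step is to group the recognized entities by type. The sketch below assumes only the result shape shown above (a dict with an `entities` list of `text`/`type` dicts) and uses hand-written sample data rather than a live call; `group_by_type` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

# Sample data shaped like the Quick Start output (not produced by a live call).
result = {
    "entities": [
        {"text": "Apple Inc.", "type": "ORG"},
        {"text": "Steve Jobs", "type": "PERSON"},
        {"text": "Cupertino", "type": "GPE"},
    ]
}


def group_by_type(result):
    """Collect entity texts under their entity type, preserving order."""
    grouped = defaultdict(list)
    for entity in result["entities"]:
        grouped[entity["type"]].append(entity["text"])
    return dict(grouped)


print(group_by_type(result))
# {'ORG': ['Apple Inc.'], 'PERSON': ['Steve Jobs'], 'GPE': ['Cupertino']}
```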

Entity Canonicalization

from ner_lib import canonicalize_entity

# Canonicalize an entity mention
result = canonicalize_entity("apple inc", mode="progressive")
print(f"Canonical: {result['canonical_name']}")
print(f"Aliases: {result['aliases']}")
# Output: 
# Canonical: Apple Inc.
# Aliases: ['Apple', 'AAPL', 'Apple Computer', ...]
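
Canonicalization results of this shape can be cached in an alias index so that repeated mentions resolve with a plain dictionary lookup. This is a sketch under the assumption that each result carries `canonical_name` and `aliases` as shown above; `build_alias_index` is a hypothetical helper and the sample result is hand-written.

```python
def build_alias_index(results):
    """Map each lowercased canonical name and alias to its canonical name."""
    index = {}
    for r in results:
        canonical = r["canonical_name"]
        for name in [canonical, *r["aliases"]]:
            # setdefault keeps the first mapping if two entities share an alias.
            index.setdefault(name.lower(), canonical)
    return index


# Hand-written sample shaped like the output above.
results = [
    {"canonical_name": "Apple Inc.", "aliases": ["Apple", "AAPL", "Apple Computer"]},
]
index = build_alias_index(results)

print(index.get("aapl"))       # Apple Inc.
print(index.get("microsoft"))  # None
```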

Relationship Canonicalization

from ner_lib import Config, canonicalize_relationship

# Configure semantic matching
config = Config()
config.semantic_matching.enabled = True
config.semantic_matching.canonical_relationships = ["depends_on", "created_by"]

# Canonicalize a relationship phrase
result = canonicalize_relationship("relies heavily on", config=config)
print(f"Canonical: {result['canonical_name']}")
# Output: Canonical: depends_on
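
Conceptually, embedding-based matching compares the vector of the input phrase against the vector of each canonical label and keeps the closest one if it clears a threshold. The sketch below illustrates that idea only: the three-dimensional vectors are made-up stand-ins for real sentence embeddings, and `cosine`, `best_match`, and the `0.7` threshold are hypothetical rather than taken from the library.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(query_vec, canonical_vecs, threshold=0.7):
    """Return (label, score) for the closest label, or (None, score) if too far."""
    label, score = max(
        ((name, cosine(query_vec, vec)) for name, vec in canonical_vecs.items()),
        key=lambda pair: pair[1],
    )
    return (label, score) if score >= threshold else (None, score)


# Toy vectors; a real setup would obtain these from a sentence-embedding model.
canonical_vecs = {
    "depends_on": [0.9, 0.1, 0.0],
    "created_by": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stand-in for the embedding of "relies heavily on"

label, score = best_match(query_vec, canonical_vecs)
print(label)  # depends_on
```

The threshold matters as much as the ranking: without it, every phrase would be forced onto some canonical label, including phrases that match none of them.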

📚 Documentation

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on how to get started.

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

🙏 Acknowledgements

This library stands on the shoulders of giants, and we gratefully acknowledge the open-source projects it builds on. See ACKNOWLEDGEMENTS.md for the full list of dependencies and credits.
