A powerful Named Entity Recognition and Resolution library with semantic matching
Advanced Text Processing
A powerful Named Entity Recognition (NER) and Entity Resolution library designed for complex text processing tasks. It combines state-of-the-art NLP models (spaCy, Transformers) with robust knowledge bases (Wikidata, WordNet) to provide accurate entity extraction, canonicalization, and semantic matching.
🚀 Features
- Advanced Entity Resolution:
  - Mode A (Sequential): Fast, early-stopping pipeline for high-confidence matches.
  - Mode B (Parallel): Aggregates multiple signals (fuzzy, semantic, contextual) for difficult cases.
- Semantic Matching: Maps inputs to canonical schemas using sentence embeddings (SentenceTransformers).
- Alias Retrieval: Automatically fetches aliases from Wikidata and synonyms from WordNet.
- Canonicalization:
  - Entities (e.g., "Apple" -> "Apple Inc.")
  - Relationships (e.g., "relies on" -> "depends_on")
  - Properties (e.g., "birth date" -> "date_of_birth")
- Flexible Candidate Generation: Supports exact lookup, fulltext blocking, and ANN search (FAISS/hnswlib).
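The difference between the two resolution modes can be illustrated with a small standalone sketch. This is not the library's implementation: the toy knowledge base `KB`, the scorer functions, the 0.9 threshold, and the equal weights are all assumptions made for illustration, and only the standard-library `difflib` is used in place of the real fuzzy, semantic, and contextual signals.

```python
from difflib import SequenceMatcher

# Hypothetical toy knowledge base: canonical name -> lowercase aliases.
KB = {
    "Apple Inc.": ["apple", "apple inc", "aapl"],
    "Alphabet Inc.": ["google", "alphabet", "googl"],
}


def exact_score(mention, aliases):
    """Return 1.0 if the lowercased mention is a known alias, else 0.0."""
    return 1.0 if mention.lower() in aliases else 0.0


def fuzzy_score(mention, aliases):
    """Return the best character-level similarity against any alias."""
    m = mention.lower()
    return max(SequenceMatcher(None, m, alias).ratio() for alias in aliases)


def resolve_sequential(mention, threshold=0.9):
    """Mode A style: run cheap matchers first and stop at the first confident hit."""
    for scorer in (exact_score, fuzzy_score):
        name, score = max(
            ((name, scorer(mention, aliases)) for name, aliases in KB.items()),
            key=lambda pair: pair[1],
        )
        if score >= threshold:
            return name, scorer.__name__
    return None, None


def resolve_parallel(mention, weights=(0.5, 0.5)):
    """Mode B style: run every matcher and combine the scores with weights."""
    combined = {
        name: weights[0] * exact_score(mention, aliases)
        + weights[1] * fuzzy_score(mention, aliases)
        for name, aliases in KB.items()
    }
    best = max(combined, key=combined.get)
    return best, combined[best]


print(resolve_sequential("Apple"))     # exact matcher answers, fuzzy never runs
print(resolve_sequential("aple inc"))  # falls through to the fuzzy matcher
print(resolve_parallel("aple inc"))    # every signal contributes to one score
```

The sequential variant pays only for the matchers it needs, while the parallel variant always pays for all of them in exchange for a combined score that can rank ambiguous candidates.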
📦 Installation
pip install advanced-text-processing
After installation, download the required models:
# Download spaCy model
python -m spacy download en_core_web_lg
# Download NLTK data
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"
See Installation Guide for detailed instructions.
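A quick way to confirm that the pieces above are importable is a small check script. The sketch below is an assumption about what is worth checking, not part of the library: it relies only on the standard-library `importlib`, and on the fact that spaCy pipelines such as `en_core_web_lg` are installed as regular Python packages. It does not verify that the NLTK corpora were downloaded.

```python
from importlib.util import find_spec

# Top-level module names this README's install steps should make importable.
REQUIRED_MODULES = ["spacy", "nltk", "en_core_web_lg"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All required modules are importable.")
```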
⚡ Quick Start
Named Entity Recognition
from ner_lib import recognize_entities
text = "Apple Inc. was founded by Steve Jobs in Cupertino."
result = recognize_entities(text)
for entity in result['entities']:
    print(f"{entity['text']} ({entity['type']})")
# Output:
# Apple Inc. (ORG)
# Steve Jobs (PERSON)
# Cupertino (GPE)
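Downstream code often wants entities grouped by type rather than as a flat list. The following sketch assumes only the result shape shown in the example above (an `entities` list of dicts with `text` and `type` keys); the sample dict is hard-coded for illustration and `group_by_type` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

# Sample shaped like the Quick Start output; values are illustrative only.
result = {
    "entities": [
        {"text": "Apple Inc.", "type": "ORG"},
        {"text": "Steve Jobs", "type": "PERSON"},
        {"text": "Cupertino", "type": "GPE"},
    ]
}


def group_by_type(result):
    """Map each entity type to the list of entity texts carrying that type."""
    grouped = defaultdict(list)
    for entity in result.get("entities", []):
        grouped[entity["type"]].append(entity["text"])
    return dict(grouped)


print(group_by_type(result))
```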
Entity Canonicalization
from ner_lib import canonicalize_entity
# Canonicalize an entity mention
result = canonicalize_entity("apple inc", mode="progressive")
print(f"Canonical: {result['canonical_name']}")
print(f"Aliases: {result['aliases']}")
# Output:
# Canonical: Apple Inc.
# Aliases: ['Apple', 'AAPL', 'Apple Computer', ...]
Relationship Canonicalization
from ner_lib import Config, canonicalize_relationship
# Configure semantic matching
config = Config()
config.semantic_matching.enabled = True
config.semantic_matching.canonical_relationships = ["depends_on", "created_by"]
# Canonicalize a relationship phrase
result = canonicalize_relationship("relies heavily on", config=config)
print(f"Canonical: {result['canonical_name']}")
# Output: Canonical: depends_on
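To make the idea of semantic matching against a fixed schema concrete, here is a minimal standalone sketch. It is not how the library works internally: a bag-of-words count vector stands in for a real sentence embedding, and the `EXAMPLES` phrases, the `match_relationship` helper, and the 0.5 cutoff are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical example phrases per canonical relationship (not from the library).
EXAMPLES = {
    "depends_on": ["depends on", "relies on", "requires"],
    "created_by": ["created by", "founded by", "made by"],
}


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_relationship(phrase, min_score=0.5):
    """Return (label, score) for the closest canonical label, or (None, score)."""
    query = embed(phrase)
    best_label, best_score = None, 0.0
    for label, phrases in EXAMPLES.items():
        score = max(cosine(query, embed(p)) for p in phrases)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_score:
        return None, best_score
    return best_label, best_score


print(match_relationship("relies heavily on"))
print(match_relationship("was painted in"))
```

A real embedding model would also match paraphrases that share no words with the examples, which this word-overlap stand-in cannot do; the cutoff plays the same role in both cases, rejecting inputs that are not close to any canonical label.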
📚 Documentation
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details on how to get started.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgements
This library stands on the shoulders of giants. We gratefully acknowledge the following open-source projects:
- spaCy: For industrial-strength NLP.
- Sentence Transformers: For state-of-the-art text embeddings.
- Wikidata: For the comprehensive knowledge base.
- NLTK & WordNet: For lexical database support.
See ACKNOWLEDGEMENTS.md for the full list of dependencies and credits.
File details
Details for the file advanced_text_processing-0.2.0.tar.gz.
File metadata
- Download URL: advanced_text_processing-0.2.0.tar.gz
- Upload date:
- Size: 63.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3bf3c655b48e167d4331bd6b7e73ee63409cf056ec86fdcfe7342f9491830325 |
| MD5 | ce892dac337fc3bc6de1f029c53d78e5 |
| BLAKE2b-256 | baa43ff80c778892f02813f9da2353c31dcb0084f2aa35c93d6f16f172c3709e |
File details
Details for the file advanced_text_processing-0.2.0-py3-none-any.whl.
File metadata
- Download URL: advanced_text_processing-0.2.0-py3-none-any.whl
- Upload date:
- Size: 56.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a4eca98ea7bebb29f94c07283895803895927a244bb3530dabde17818d5cf855 |
| MD5 | d780cd84ca3318f4362ec14cc4b0be6b |
| BLAKE2b-256 | 20867e199ab1b9b266f1a2ca9536d6cb17eef16e2f491b099491e80cec4ad1cf |