# english-pos

Context-aware English part-of-speech tagger with clause and connective detection.

english-pos is a lightweight Python library for tagging English text with Penn Treebank part-of-speech tags. It combines a state-of-the-art neural backend (spaCy) with a hand-crafted context-rule layer that corrects common mis-tags in auxiliary chains, adjective/verb ambiguities, passives, and more. It also detects clause boundaries and logical connectives.
## Features

- **Dual backend** – uses spaCy (`en_core_web_sm`) when installed for highest accuracy; falls back to NLTK's averaged perceptron tagger automatically.
- **Context-rule layer** – 19+ heuristic patterns correct auxiliary chains, passive voice, predicate adjectives, comparative/superlative forms, and more.
- **Clause detection** – `find_clauses()` segments a sentence into main, subordinate, and relative clauses with semantic subtypes (causal, concessive, temporal, conditional, nominal).
- **Connective detection** – `find_connectives()` identifies coordinating, subordinating, and relative connectives with their positions.
- **Custom overrides** – register per-word tag overrides (fixed string or callable) via `register_word_tag()`.
- **Batch processing** – `analyze_batch()` processes multiple sentences efficiently using `nlp.pipe()` when spaCy is available.
- **Zero configuration** – NLTK data is downloaded automatically on first use.
## Installation

### Minimal (NLTK backend only)

```shell
pip install english-pos
```

### With spaCy backend (recommended – faster and more accurate)

```shell
pip install "english-pos[spacy]"
```

> **Note:** The spaCy extra installs `en_core_web_sm` automatically via the direct wheel URL. If you manage spaCy models separately, run `pip install english-pos` and then `python -m spacy download en_core_web_sm`.
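Because of the dual-backend design, an install without spaCy still works. A minimal sketch of how such a fallback check can be implemented (illustrative only; the library's actual internals may differ):

```python
import importlib.util

def pick_backend() -> str:
    """Return 'spacy' when the package is importable, else fall back to 'nltk'."""
    if importlib.util.find_spec("spacy") is not None:
        return "spacy"
    return "nltk"

print(pick_backend())
```

`importlib.util.find_spec` checks importability without actually importing, which keeps startup cheap when only the backend choice is needed.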
## Quick Start

```python
import english

# Tag a single sentence
tagged = english.analyze_sentence("The fast robot jumped over 2 walls!")
print(tagged)
# [('The', 'DT'), ('fast', 'JJ'), ('robot', 'NN'), ('jumped', 'VBD'),
#  ('over', 'IN'), ('2', 'CD'), ('walls', 'NNS'), ('!', '.')]

# Tag multiple sentences efficiently
results = english.analyze_batch([
    "She has been running every morning.",
    "The letter was written by the president.",
])
for tokens in results:
    print(tokens)

# Detect clauses
tagged = english.analyze_sentence("Although she was tired, she kept working.")
for clause in english.find_clauses(tagged):
    words = " ".join(w for w, _ in clause["tokens"])
    print(f"[{clause['type']}/{clause['subtype'] or '-'}] {words}")
# [subordinate/concessive] Although she was tired
# [main/-] she kept working

# Detect connectives
for conn in english.find_connectives(tagged):
    print(conn["type"], conn["subtype"], repr(conn["word"]))
# subordinating concessive 'Although'
```
## API Reference

### `analyze_sentence(text: str) → list[tuple[str, str]]`

Tag a single English sentence.

| Parameter | Type | Description |
|---|---|---|
| `text` | `str` | Raw English text (any punctuation) |

Returns a list of `(word, tag)` tuples using Penn Treebank tags.

```python
english.analyze_sentence("Dogs run faster than cats.")
# [('Dogs', 'NNS'), ('run', 'VBP'), ('faster', 'RBR'), ('than', 'IN'), ('cats', 'NNS'), ('.', '.')]
```
### `analyze_batch(texts: list[str]) → list[list[tuple[str, str]]]`

Tag multiple sentences in one call. When spaCy is available this uses `nlp.pipe()` for significantly better throughput.

```python
results = english.analyze_batch(["The cat sat.", "Dogs run fast."])
```
### `find_clauses(tagged: list[tuple[str, str]]) → list[dict]`

Segment a POS-tagged sentence into logical clauses.

Each returned dict contains:

| Key | Type | Values / Notes |
|---|---|---|
| `type` | `str` | `'main'`, `'subordinate'`, `'relative'` |
| `subtype` | `str` | `'causal'`, `'concessive'`, `'temporal'`, `'conditional'`, `'nominal'`, or `''` |
| `connective` | `str` | Opening conjunction / relative pronoun, or `''` for the root main clause |
| `tokens` | `list` | `list[tuple[str, str]]` – tagged tokens in this clause |

```python
tagged = english.analyze_sentence(
    "He stayed home because it was raining."
)
for c in english.find_clauses(tagged):
    print(c["type"], c["subtype"], c["connective"])
# main '' ''
# subordinate causal because
```
### `find_connectives(tagged: list[tuple[str, str]]) → list[dict]`

Identify logical connectives in a tagged sentence.

Each returned dict contains:

| Key | Type | Values / Notes |
|---|---|---|
| `word` | `str` | The connective as it appears in the text |
| `tag` | `str` | Penn Treebank POS tag |
| `type` | `str` | `'subordinating'`, `'coordinating'`, `'relative'` |
| `subtype` | `str` | Semantic subtype for subordinating connectives; `''` for others |
| `position` | `int` | Index of the connective in `tagged` |

```python
tagged = english.analyze_sentence(
    "She left early because she was tired, but he stayed."
)
for c in english.find_connectives(tagged):
    print(c["type"], c["subtype"], c["word"])
# subordinating causal because
# coordinating '' but
```
### `register_word_tag(word: str, tag_or_fn) → None`

Register a custom POS tag (or tag-computing callable) for a specific word. Overrides are applied after the tagger and all context rules.

```python
# Fixed override
english.register_word_tag("Python", "NNP")

# Callable override
def fix_data(word, current_tag, context):
    return "NNS" if current_tag == "NN" else current_tag

english.register_word_tag("data", fix_data)
```

The callable signature is `(word: str, current_tag: str, context: list[tuple[str, str]]) -> str`.
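To make the callable contract concrete, here is a standalone sketch of applying such an override over a tagged sentence. `apply_override` is illustrative only, not a library function:

```python
def apply_override(tagged, word, fn):
    """Apply a per-word callable override, mirroring the signature above:
    fn(word, current_tag, context) -> new_tag."""
    return [
        (w, fn(w, t, tagged) if w.lower() == word.lower() else t)
        for (w, t) in tagged
    ]

tagged = [("The", "DT"), ("data", "NN"), ("arrived", "VBD")]
fixed = apply_override(tagged, "data",
                       lambda w, t, ctx: "NNS" if t == "NN" else t)
print(fixed[1])  # ('data', 'NNS')
```

The `context` argument receives the full tagged sentence, so a callable can condition its decision on surrounding tokens, not just the word itself.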
### `unregister_word_tag(word: str) → None`

Remove the override for a specific word (case-insensitive; no-op if absent).

### `clear_word_tag_overrides() → None`

Remove all registered overrides.

### `get_word_tag_overrides() → dict`

Return a shallow copy of the current override registry.
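The documented semantics (case-insensitive keys, no-op unregister, shallow-copy getter) can be modelled by a small standalone registry. This sketch illustrates the contract only; it is not the library's actual implementation:

```python
_overrides: dict = {}

def register_word_tag(word, tag_or_fn):
    _overrides[word.lower()] = tag_or_fn  # keys stored case-insensitively

def unregister_word_tag(word):
    _overrides.pop(word.lower(), None)    # no-op when absent

def clear_word_tag_overrides():
    _overrides.clear()

def get_word_tag_overrides():
    return dict(_overrides)               # shallow copy, safe to mutate

register_word_tag("Python", "NNP")
snapshot = get_word_tag_overrides()
unregister_word_tag("PYTHON")             # case-insensitive removal
# snapshot keeps {'python': 'NNP'}; the live registry is now empty
```

Because the getter returns a copy, mutating the returned dict never affects tagging behaviour; only the `register`/`unregister`/`clear` calls do.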
## Penn Treebank POS Tags (Reference)

| Tag | Description | Tag | Description |
|---|---|---|---|
| `CC` | Coordinating conjunction | `PRP` | Personal pronoun |
| `CD` | Cardinal number | `PRP$` | Possessive pronoun |
| `DT` | Determiner | `RB` | Adverb |
| `EX` | Existential *there* | `RBR` | Adverb, comparative |
| `FW` | Foreign word | `RBS` | Adverb, superlative |
| `IN` | Preposition / subordinating conj. | `RP` | Particle |
| `JJ` | Adjective | `TO` | *to* |
| `JJR` | Adjective, comparative | `UH` | Interjection |
| `JJS` | Adjective, superlative | `VB` | Verb, base form |
| `MD` | Modal | `VBD` | Verb, past tense |
| `NN` | Noun, singular | `VBG` | Verb, gerund / present participle |
| `NNS` | Noun, plural | `VBN` | Verb, past participle |
| `NNP` | Proper noun, singular | `VBP` | Verb, non-3rd-person singular |
| `NNPS` | Proper noun, plural | `VBZ` | Verb, 3rd-person singular |
| `PDT` | Predeterminer | `WDT` | Wh-determiner |
| `POS` | Possessive ending | `WP` | Wh-pronoun |
| | | `WRB` | Wh-adverb |
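Since the tag set groups related tags by prefix (`VB*` verbs, `NN*` nouns, `JJ*` adjectives, `RB*` adverbs), a common idiom when fine distinctions aren't needed is to coarse-grain by prefix. A small helper along these lines (not part of english-pos):

```python
def coarse(tag: str) -> str:
    """Map a fine-grained Penn Treebank tag to a coarse word class."""
    for prefix, cls in (("VB", "VERB"), ("NN", "NOUN"),
                        ("JJ", "ADJ"), ("RB", "ADV")):
        if tag.startswith(prefix):
            return cls
    return tag  # punctuation and closed-class tags pass through unchanged

print(coarse("VBD"))   # VERB
print(coarse("NNPS"))  # NOUN
print(coarse("DT"))    # DT
```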
## Requirements

- Python ≥ 3.10
- `nltk` ≥ 3.8
- (optional) `spacy` ≥ 3.5 + `en_core_web_sm`
## License

**Non-Commercial Use Only.** This software is licensed under the english-pos Non-Commercial License v1.0.

- ✅ Free for personal projects, academic research, and open-source work.
- 💼 Commercial use (production systems, SaaS, paid products/services) requires a separate paid license.

To obtain a commercial license, open an issue or contact the author at https://github.com/GSL20051013.
## Changelog

See CHANGELOG.md.