# english-pos

Context-aware English part-of-speech tagger with clause and connective detection.

english-pos is a lightweight Python library for tagging English text with Penn Treebank part-of-speech tags. It combines a state-of-the-art neural backend (spaCy) with a hand-crafted context-rule layer that corrects common mis-tags in auxiliary chains, adjective/verb ambiguities, passives, and more. It also detects clause boundaries and logical connectives.
## Features

- **Dual backend** – uses spaCy (`en_core_web_sm`) when installed for highest accuracy; falls back to NLTK's averaged perceptron tagger automatically.
- **Context-rule layer** – 19+ heuristic patterns correct auxiliary chains, passive voice, predicate adjectives, comparative/superlative forms, and more.
- **Clause detection** – `find_clauses()` segments a sentence into main, subordinate, and relative clauses with semantic subtypes (causal, concessive, temporal, conditional, nominal).
- **Connective detection** – `find_connectives()` identifies coordinating, subordinating, and relative connectives with their positions.
- **Custom overrides** – register per-word tag overrides (fixed string or callable) via `register_word_tag()`.
- **Batch processing** – `analyze_batch()` processes multiple sentences efficiently using `nlp.pipe()` when spaCy is available.
- **Zero configuration** – NLTK data is downloaded automatically on first use.
## Installation

### Minimal (NLTK backend only)

```shell
pip install english-pos
```

### With spaCy backend (recommended – faster and more accurate)

```shell
pip install "english-pos[spacy]"
```

> **Note:** The spaCy extra installs `en_core_web_sm` automatically via the direct wheel URL. If you manage spaCy models separately, run `pip install english-pos` and then `python -m spacy download en_core_web_sm`.
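Because of the dual-backend design, an install without spaCy still works. A minimal sketch of how such a fallback check can be implemented (illustrative only; the library's actual internals may differ):

```python
import importlib.util

def pick_backend() -> str:
    """Return 'spacy' when the package is importable, else fall back to 'nltk'."""
    if importlib.util.find_spec("spacy") is not None:
        return "spacy"
    return "nltk"

print(pick_backend())
```

`importlib.util.find_spec` checks importability without actually importing, which keeps startup cheap when only the backend choice is needed.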
## Quick Start

```python
import english

# Tag a single sentence
tagged = english.analyze_sentence("The fast robot jumped over 2 walls!")
print(tagged)
# [('The', 'DT'), ('fast', 'JJ'), ('robot', 'NN'), ('jumped', 'VBD'),
#  ('over', 'IN'), ('2', 'CD'), ('walls', 'NNS'), ('!', '.')]

# Tag multiple sentences efficiently
results = english.analyze_batch([
    "She has been running every morning.",
    "The letter was written by the president.",
])
for tokens in results:
    print(tokens)

# Detect clauses
tagged = english.analyze_sentence("Although she was tired, she kept working.")
for clause in english.find_clauses(tagged):
    words = " ".join(w for w, _ in clause["tokens"])
    print(f"[{clause['type']}/{clause['subtype'] or '-'}] {words}")
# [subordinate/concessive] Although she was tired
# [main/-] she kept working

# Detect connectives
for conn in english.find_connectives(tagged):
    print(conn["type"], conn["subtype"], repr(conn["word"]))
# subordinating concessive 'Although'
```
## API Reference

### `analyze_sentence(text: str) → list[tuple[str, str]]`

Tag a single English sentence.

| Parameter | Type | Description |
|---|---|---|
| `text` | `str` | Raw English text (any punctuation) |

Returns a list of `(word, tag)` tuples using Penn Treebank tags.

```python
english.analyze_sentence("Dogs run faster than cats.")
# [('Dogs', 'NNS'), ('run', 'VBP'), ('faster', 'RBR'), ('than', 'IN'), ('cats', 'NNS'), ('.', '.')]
```
### `analyze_batch(texts: list[str]) → list[list[tuple[str, str]]]`

Tag multiple sentences in one call. When spaCy is available this uses `nlp.pipe()` for significantly better throughput.

```python
results = english.analyze_batch(["The cat sat.", "Dogs run fast."])
```
### `find_clauses(tagged: list[tuple[str, str]]) → list[dict]`

Segment a POS-tagged sentence into logical clauses.

Each returned dict contains:

| Key | Type | Values / Notes |
|---|---|---|
| `type` | `str` | `'main'`, `'subordinate'`, `'relative'` |
| `subtype` | `str` | `'causal'`, `'concessive'`, `'temporal'`, `'conditional'`, `'nominal'`, or `''` |
| `connective` | `str` | Opening conjunction / relative pronoun, or `''` for the root main clause |
| `tokens` | `list` | `list[tuple[str, str]]` – tagged tokens in this clause |

```python
tagged = english.analyze_sentence(
    "He stayed home because it was raining."
)
for c in english.find_clauses(tagged):
    print(c["type"], c["subtype"], c["connective"])
# main '' ''
# subordinate causal because
```
### `find_connectives(tagged: list[tuple[str, str]]) → list[dict]`

Identify logical connectives in a tagged sentence.

Each returned dict contains:

| Key | Type | Values / Notes |
|---|---|---|
| `word` | `str` | The connective as it appears in the text |
| `tag` | `str` | Penn Treebank POS tag |
| `type` | `str` | `'subordinating'`, `'coordinating'`, `'relative'` |
| `subtype` | `str` | Semantic subtype for subordinating connectives; `''` for others |
| `position` | `int` | Index of the connective in `tagged` |

```python
tagged = english.analyze_sentence(
    "She left early because she was tired, but he stayed."
)
for c in english.find_connectives(tagged):
    print(c["type"], c["subtype"], c["word"])
# subordinating causal because
# coordinating '' but
```
### `register_word_tag(word: str, tag_or_fn) → None`

Register a custom POS tag (or tag-computing callable) for a specific word. Overrides are applied after the tagger and all context rules.

```python
# Fixed override
english.register_word_tag("Python", "NNP")

# Callable override
def fix_data(word, current_tag, context):
    return "NNS" if current_tag == "NN" else current_tag

english.register_word_tag("data", fix_data)
```

The callable signature is `(word: str, current_tag: str, context: list[tuple[str, str]]) -> str`.
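To make the callable contract concrete, here is a standalone sketch of applying such an override over a tagged sentence. `apply_override` is illustrative only, not a library function:

```python
def apply_override(tagged, word, fn):
    """Apply a per-word callable override, mirroring the signature above:
    fn(word, current_tag, context) -> new_tag."""
    return [
        (w, fn(w, t, tagged) if w.lower() == word.lower() else t)
        for (w, t) in tagged
    ]

tagged = [("The", "DT"), ("data", "NN"), ("arrived", "VBD")]
fixed = apply_override(tagged, "data",
                       lambda w, t, ctx: "NNS" if t == "NN" else t)
print(fixed[1])  # ('data', 'NNS')
```

The `context` argument receives the full tagged sentence, so a callable can condition its decision on surrounding tokens, not just the word itself.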
### `unregister_word_tag(word: str) → None`

Remove the override for a specific word (case-insensitive; no-op if absent).

### `clear_word_tag_overrides() → None`

Remove all registered overrides.

### `get_word_tag_overrides() → dict`

Return a shallow copy of the current override registry.
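The documented semantics (case-insensitive keys, no-op unregister, shallow-copy getter) can be modelled by a small standalone registry. This sketch illustrates the contract only; it is not the library's actual implementation:

```python
_overrides: dict = {}

def register_word_tag(word, tag_or_fn):
    _overrides[word.lower()] = tag_or_fn  # keys stored case-insensitively

def unregister_word_tag(word):
    _overrides.pop(word.lower(), None)    # no-op when absent

def clear_word_tag_overrides():
    _overrides.clear()

def get_word_tag_overrides():
    return dict(_overrides)               # shallow copy, safe to mutate

register_word_tag("Python", "NNP")
snapshot = get_word_tag_overrides()
unregister_word_tag("PYTHON")             # case-insensitive removal
# snapshot keeps {'python': 'NNP'}; the live registry is now empty
```

Because the getter returns a copy, mutating the returned dict never affects tagging behaviour; only the `register`/`unregister`/`clear` calls do.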
## Penn Treebank POS Tags (Reference)

| Tag | Description | Tag | Description |
|---|---|---|---|
| `CC` | Coordinating conjunction | `PRP` | Personal pronoun |
| `CD` | Cardinal number | `PRP$` | Possessive pronoun |
| `DT` | Determiner | `RB` | Adverb |
| `EX` | Existential *there* | `RBR` | Adverb, comparative |
| `FW` | Foreign word | `RBS` | Adverb, superlative |
| `IN` | Preposition / subordinating conj. | `RP` | Particle |
| `JJ` | Adjective | `TO` | *to* |
| `JJR` | Adjective, comparative | `UH` | Interjection |
| `JJS` | Adjective, superlative | `VB` | Verb, base form |
| `MD` | Modal | `VBD` | Verb, past tense |
| `NN` | Noun, singular | `VBG` | Verb, gerund / present participle |
| `NNS` | Noun, plural | `VBN` | Verb, past participle |
| `NNP` | Proper noun, singular | `VBP` | Verb, non-3rd-person singular |
| `NNPS` | Proper noun, plural | `VBZ` | Verb, 3rd-person singular |
| `PDT` | Predeterminer | `WDT` | Wh-determiner |
| `POS` | Possessive ending | `WP` | Wh-pronoun |
| | | `WRB` | Wh-adverb |
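Since the tag set groups related tags by prefix (`VB*` verbs, `NN*` nouns, `JJ*` adjectives, `RB*` adverbs), a common idiom when fine distinctions aren't needed is to coarse-grain by prefix. A small helper along these lines (not part of english-pos):

```python
def coarse(tag: str) -> str:
    """Map a fine-grained Penn Treebank tag to a coarse word class."""
    for prefix, cls in (("VB", "VERB"), ("NN", "NOUN"),
                        ("JJ", "ADJ"), ("RB", "ADV")):
        if tag.startswith(prefix):
            return cls
    return tag  # punctuation and closed-class tags pass through unchanged

print(coarse("VBD"))   # VERB
print(coarse("NNPS"))  # NOUN
print(coarse("DT"))    # DT
```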
## Requirements

- Python ≥ 3.10
- `nltk` ≥ 3.8
- (optional) `spacy` ≥ 3.5 + `en_core_web_sm`
## License

**Non-Commercial Use Only.** This software is licensed under the english-pos Non-Commercial License v1.0.

- ✅ Free for personal projects, academic research, and open-source work.
- 💼 Commercial use (production systems, SaaS, paid products/services) requires a separate paid license.

To obtain a commercial license, open an issue or contact the author at https://github.com/GSL20051013.
## Changelog

See CHANGELOG.md.