Project description

english-pos

Context-aware English part-of-speech tagger with clause and connective detection.


english-pos is a lightweight Python library for tagging English text with Penn Treebank part-of-speech tags. It combines a neural backend (spaCy) with a hand-crafted context-rule layer that corrects common mis-tags in auxiliary chains, adjective/verb ambiguities, passives, and more. It also detects clause boundaries and logical connectives.


Features

  • Dual backend – uses spaCy (en_core_web_sm) when installed for highest accuracy; falls back to NLTK's averaged perceptron tagger automatically.
  • Context-rule layer – 19+ heuristic patterns correct auxiliary chains, passive voice, predicate adjectives, comparative/superlative forms, and more.
  • Clause detection – find_clauses() segments a sentence into main, subordinate, and relative clauses with semantic subtypes (causal, concessive, temporal, conditional, nominal).
  • Connective detection – find_connectives() identifies coordinating, subordinating, and relative connectives with their positions.
  • Custom overrides – register per-word tag overrides (fixed string or callable) via register_word_tag().
  • Batch processing – analyze_batch() processes multiple sentences efficiently using nlp.pipe() when spaCy is available.
  • Zero configuration – NLTK data is downloaded automatically on first use.
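
To make the context-rule idea concrete, here is a minimal sketch in the spirit of (but not taken from) the library's rule layer: a single rule that retags a past-tense verb as a past participle inside a "have" auxiliary chain.

```python
# Illustrative sketch only -- not the library's actual rule set. One context
# rule: a verb tagged VBD (past tense) that directly follows a form of
# "have" is really a past participle (VBN), as in "she has jumped".
HAVE_FORMS = {"have", "has", "had", "having"}

def fix_have_chain(tagged):
    """Retag VBD -> VBN after a form of 'have'. tagged: list[(word, tag)]."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == "VBD" and out[i - 1][0].lower() in HAVE_FORMS:
            out[i] = (word, "VBN")
    return out

print(fix_have_chain([("She", "PRP"), ("has", "VBZ"), ("jumped", "VBD")]))
# [('She', 'PRP'), ('has', 'VBZ'), ('jumped', 'VBN')]
```

The real rule layer reportedly covers 19+ such patterns; each one is a cheap pass over the tagged tokens, so they compose without re-running the tagger.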

Installation

Minimal (NLTK backend only)

pip install english-pos

With spaCy backend (recommended – faster and more accurate)

pip install "english-pos[spacy]"

Note: The spaCy extra installs en_core_web_sm automatically via the direct wheel URL.
If you manage spaCy models separately, run pip install english-pos and then
python -m spacy download en_core_web_sm.


Quick Start

import english

# Tag a single sentence
tagged = english.analyze_sentence("The fast robot jumped over 2 walls!")
print(tagged)
# [('The', 'DT'), ('fast', 'JJ'), ('robot', 'NN'), ('jumped', 'VBD'),
#  ('over', 'IN'), ('2', 'CD'), ('walls', 'NNS'), ('!', '.')]

# Tag multiple sentences efficiently
results = english.analyze_batch([
    "She has been running every morning.",
    "The letter was written by the president.",
])
for tokens in results:
    print(tokens)

# Detect clauses
tagged = english.analyze_sentence("Although she was tired, she kept working.")
for clause in english.find_clauses(tagged):
    words = " ".join(w for w, _ in clause["tokens"])
    print(f"[{clause['type']}/{clause['subtype'] or '-'}]  {words}")
# [subordinate/concessive]  Although she was tired
# [main/-]  she kept working

# Detect connectives
for conn in english.find_connectives(tagged):
    print(conn["type"], conn["subtype"], repr(conn["word"]))
# subordinating concessive 'Although'

API Reference

analyze_sentence(text: str) → list[tuple[str, str]]

Tag a single English sentence.

Parameter  Type  Description
text       str   Raw English text (any punctuation)

Returns a list of (word, tag) tuples using Penn Treebank tags.

english.analyze_sentence("Dogs run faster than cats.")
# [('Dogs', 'NNS'), ('run', 'VBP'), ('faster', 'RBR'), ('than', 'IN'), ('cats', 'NNS'), ('.', '.')]

analyze_batch(texts: list[str]) → list[list[tuple[str, str]]]

Tag multiple sentences in one call. When spaCy is available this uses nlp.pipe() for significantly better throughput.

results = english.analyze_batch(["The cat sat.", "Dogs run fast."])
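
The batching dispatch can be pictured with a small sketch (an assumption about the implementation, not the library's source): when a spaCy pipeline is loaded, sentences stream through nlp.pipe(); otherwise each sentence is tagged on its own.

```python
# Sketch of the assumed dual-backend dispatch; `nlp` and `tag_one` are
# stand-ins for a loaded spaCy pipeline and the NLTK fallback tagger.
def batch_tag(texts, nlp=None, tag_one=None):
    if nlp is not None:
        # spaCy path: stream all texts through the pipeline at once
        return [[(tok.text, tok.tag_) for tok in doc] for doc in nlp.pipe(texts)]
    # Fallback path: tag each sentence independently
    return [tag_one(text) for text in texts]

# With a toy fallback tagger that calls everything a noun:
print(batch_tag(["The cat sat ."], tag_one=lambda s: [(w, "NN") for w in s.split()]))
# [[('The', 'NN'), ('cat', 'NN'), ('sat', 'NN'), ('.', 'NN')]]
```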

find_clauses(tagged: list[tuple[str, str]]) → list[dict]

Segment a POS-tagged sentence into logical clauses.

Each returned dict contains:

Key         Type  Values / Notes
type        str   'main', 'subordinate', 'relative'
subtype     str   'causal', 'concessive', 'temporal', 'conditional', 'nominal', or ''
connective  str   Opening conjunction / relative pronoun, or '' for the root main clause
tokens      list  list[tuple[str, str]] – tagged tokens in this clause

tagged = english.analyze_sentence(
    "He stayed home because it was raining."
)
for c in english.find_clauses(tagged):
    print(c["type"], c["subtype"], c["connective"])
# main      ''      ''
# subordinate causal  because
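
As a mental model (a simplified sketch, not the library's segmentation logic), clause splitting can be thought of as cutting the token stream at subordinating conjunctions and clause-final commas:

```python
# Hypothetical subordinator table; the library's own list is larger.
SUBORDINATORS = {"because": "causal", "although": "concessive",
                 "when": "temporal", "if": "conditional"}

def split_clauses(tagged):
    """Cut tagged tokens at subordinators and commas, then label segments."""
    segments, current = [], []
    for word, tag in tagged:
        if word.lower() in SUBORDINATORS and current:
            segments.append(current)       # a new clause starts here
            current = []
        current.append((word, tag))
        if word == ",":
            segments.append(current)       # comma closes the current clause
            current = []
    if current:
        segments.append(current)
    clauses = []
    for seg in segments:
        head = seg[0][0].lower()
        if head in SUBORDINATORS:
            clauses.append({"type": "subordinate", "subtype": SUBORDINATORS[head],
                            "connective": seg[0][0], "tokens": seg})
        else:
            clauses.append({"type": "main", "subtype": "", "connective": "",
                            "tokens": seg})
    return clauses
```

The real find_clauses() also handles relative clauses and nominal complements, which need the POS tags (WDT/WP/that-IN), not just a word list.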

find_connectives(tagged: list[tuple[str, str]]) → list[dict]

Identify logical connectives in a tagged sentence.

Each returned dict contains:

Key       Type  Values / Notes
word      str   The connective as it appears in the text
tag       str   Penn Treebank POS tag
type      str   'subordinating', 'coordinating', 'relative'
subtype   str   Semantic subtype for subordinating connectives; '' for others
position  int   Index of the connective in tagged

tagged = english.analyze_sentence(
    "She left early because she was tired, but he stayed."
)
for c in english.find_connectives(tagged):
    print(c["type"], c["subtype"], c["word"])
# subordinating causal   because
# coordinating  ''       but
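
A simplified sketch of the detection idea (an assumption, not the library's code): scan the tagged tokens once, treating CC tokens as coordinating and looking subordinators up in a table:

```python
# Hypothetical subordinator table; the library's own list is larger.
SUBORDINATORS = {"because": "causal", "although": "concessive",
                 "when": "temporal", "if": "conditional"}

def scan_connectives(tagged):
    """Return connective records in the same shape as find_connectives()."""
    found = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "CC":
            found.append({"word": word, "tag": tag, "type": "coordinating",
                          "subtype": "", "position": i})
        elif word.lower() in SUBORDINATORS:
            found.append({"word": word, "tag": tag, "type": "subordinating",
                          "subtype": SUBORDINATORS[word.lower()], "position": i})
    return found
```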

register_word_tag(word: str, tag_or_fn) → None

Register a custom POS tag (or tag-computing callable) for a specific word. Overrides are applied after the tagger and all context rules.

# Fixed override
english.register_word_tag("Python", "NNP")

# Callable override
def fix_data(word, current_tag, context):
    return "NNS" if current_tag == "NN" else current_tag

english.register_word_tag("data", fix_data)

The callable signature is (word: str, current_tag: str, context: list[tuple[str, str]]) -> str.
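
Since overrides run last, the override pass can be pictured as a final sweep over the tagged output; the sketch below is an assumption about how it might work, not the library's implementation (lowercase lookup follows the documented case-insensitivity):

```python
def apply_overrides(tagged, overrides):
    """Apply per-word overrides after tagging: fixed tags replace directly,
    callables receive (word, current_tag, full_tagged_sentence)."""
    result = []
    for word, tag in tagged:
        override = overrides.get(word.lower())
        if override is None:
            result.append((word, tag))
        elif callable(override):
            result.append((word, override(word, tag, tagged)))
        else:
            result.append((word, override))
    return result

print(apply_overrides([("Python", "NN"), ("rocks", "VBZ")], {"python": "NNP"}))
# [('Python', 'NNP'), ('rocks', 'VBZ')]
```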


unregister_word_tag(word: str) → None

Remove the override for a specific word (case-insensitive; no-op if absent).


clear_word_tag_overrides() → None

Remove all registered overrides.


get_word_tag_overrides() → dict

Return a shallow copy of the current override registry.


Penn Treebank POS Tags (Reference)

Tag   Description                         Tag   Description
CC    Coordinating conjunction            PRP   Personal pronoun
CD    Cardinal number                     PRP$  Possessive pronoun
DT    Determiner                          RB    Adverb
EX    Existential "there"                 RBR   Adverb, comparative
FW    Foreign word                        RBS   Adverb, superlative
IN    Preposition / subordinating conj.   RP    Particle
JJ    Adjective                           TO    "to"
JJR   Adjective, comparative              UH    Interjection
JJS   Adjective, superlative              VB    Verb, base form
MD    Modal                               VBD   Verb, past tense
NN    Noun, singular or mass              VBG   Verb, gerund / present participle
NNS   Noun, plural                        VBN   Verb, past participle
NNP   Proper noun, singular               VBP   Verb, non-3rd person singular present
NNPS  Proper noun, plural                 VBZ   Verb, 3rd person singular present
PDT   Predeterminer                       WDT   Wh-determiner
POS   Possessive ending                   WP    Wh-pronoun
                                          WRB   Wh-adverb

Requirements

  • Python ≥ 3.10
  • nltk ≥ 3.8
  • (optional) spacy ≥ 3.5 + en_core_web_sm

License

Non-Commercial Use Only.

This software is licensed under the english-pos Non-Commercial License v1.0.

  • Free for personal projects, academic research, and open-source work.
  • Commercial use (production systems, SaaS, paid products/services) requires a separate paid license.

To obtain a commercial license, open an issue or contact the author at https://github.com/GSL20051013.


Changelog

See CHANGELOG.md.
