Linguistic Pattern Lab using spaCy
LingPatLab
Linguistic Pattern Laboratory: Advanced NLP pipeline for text analysis, entity extraction, and pattern recognition.
Features
- Tokenization: Custom Graffl tokenizer with intelligent handling of contractions, abbreviations, and punctuation
- Parsing: Deep linguistic analysis with POS tagging, dependency parsing, and WordNet integration
- Entity Extraction: Pattern-based extraction of people and topics with anaphora resolution
- Segmentation: Paragraph and sentence boundary detection
- Rich Annotations: Sentiment, lemmatization, stemming, and morphological features
Installation
pip install lingpatlab
Quick Start
from lingpatlab import LingPatLab
api = LingPatLab()
# Parse text into structured tokens
sentence = api.parse_input_text("Admiral Nimitz commanded the Pacific Fleet.")
print(sentence.to_string())
# Extract people with anaphora resolution
text = "Admiral William Halsey led the fleet. Halsey was known for his aggressive tactics."
sentence = api.parse_input_text(text)
people = api.extract_people(sentence)
# Returns: {'Halsey': ['Admiral William Halsey', 'Halsey']}
# Extract topics and named entities
topics = api.extract_topics(sentence)
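To make the shape of the `extract_people` result more concrete, here is a minimal sketch that groups mentions of a person under a shared surname key. The helper `group_mentions_by_surname` is hypothetical and uses only the standard library. It mimics the dictionary shown in the `# Returns:` comment above and is not how LingPatLab actually resolves anaphora.

```python
from collections import defaultdict


def group_mentions_by_surname(mentions):
    """Group person mentions under the last word of each mention.

    Hypothetical helper: it only mimics the shape of the result shown
    in the Quick Start and is not LingPatLab's resolution logic.
    """
    groups = defaultdict(list)
    for mention in mentions:
        words = mention.split()
        if not words:
            continue  # skip empty or whitespace-only strings
        surname = words[-1]
        if mention not in groups[surname]:
            groups[surname].append(mention)  # keep first-seen order, no duplicates
    return dict(groups)


mentions = ["Admiral William Halsey", "Halsey"]
print(group_mentions_by_surname(mentions))
# {'Halsey': ['Admiral William Halsey', 'Halsey']}
```

Keying on the last word is a deliberate simplification. It would merge two different people who share a surname, which is the kind of case a real resolver has to handle.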
Usage Examples
Parse Multiple Lines
lines = [
    "The Battle of Midway was a turning point.",
    "Admiral Nimitz made crucial decisions."
]
sentences = api.parse_input_lines(lines)
for sentence in sentences:
    print(sentence.to_string())
Segmentation
from lingpatlab import segment_input_text
text = "First sentence. Second sentence. Third sentence."
segments = segment_input_text(text)
# Returns: ['First sentence.', 'Second sentence.', 'Third sentence.']
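For comparison, a naive regular-expression splitter produces the same result on this simple input. The function `naive_segment` below is an illustrative assumption built on the standard library and is not the implementation behind `segment_input_text`.

```python
import re


def naive_segment(text):
    """Split text after '.', '!' or '?' when whitespace follows.

    Illustrative baseline only, not LingPatLab's segmenter.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = "First sentence. Second sentence. Third sentence."
print(naive_segment(text))
# ['First sentence.', 'Second sentence.', 'Third sentence.']
```

A splitter this simple breaks on abbreviations such as "Dr." followed by a name, which is one reason to prefer a dedicated segmenter over a one-line regex.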
Access Token Details
sentence = api.parse_input_text("The quick brown fox jumps.")
for token in sentence:
    print(f"Text: {token.text}")
    print(f"POS: {token.pos}")
    print(f"Lemma: {token.normal}")
    print(f"Is WordNet: {token.is_wordnet}")
    print(f"Dependency: {token.dep}")
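When tokens need to go into a table or a JSON file rather than onto the console, the same attributes can be collected into plain dictionaries. The sketch below assumes only the five attribute names used in the loop above. The helper `tokens_to_rows` is hypothetical, and the `SimpleNamespace` objects are stand-ins so that the example runs without LingPatLab installed.

```python
from types import SimpleNamespace

# Attribute names taken from the token loop in this README
FIELDS = ("text", "pos", "normal", "is_wordnet", "dep")


def tokens_to_rows(tokens):
    """Convert token-like objects into a list of plain dictionaries.

    Hypothetical helper: attributes a token does not have become None.
    """
    return [
        {name: getattr(token, name, None) for name in FIELDS}
        for token in tokens
    ]


# Stand-in tokens with made-up values, not real parser output
fake_tokens = [
    SimpleNamespace(text="fox", pos="NOUN", normal="fox", is_wordnet=True, dep="nsubj"),
    SimpleNamespace(text="jumps", pos="VERB", normal="jump", is_wordnet=True, dep="ROOT"),
]

rows = tokens_to_rows(fake_tokens)
print(rows[1]["normal"])
# jump
```

In real use, the parsed sentence from `api.parse_input_text(...)` would take the place of `fake_tokens`, provided its tokens expose the attributes listed in `FIELDS`.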
Data Classes
- Sentence: Single sentence with token list
- Sentences: Collection of sentences
- SpacyResult: Individual token with full linguistic annotation
- OtherInfo: Additional morphological and dependency metadata
Architecture
LingPatLab
├── tokenizer/ # Custom tokenization with Graffl
├── parser/ # spaCy integration + enhancements
├── analyzer/ # Entity extraction with pattern matching
├── segmenter/ # Sentence and paragraph segmentation
└── utils/ # WordNet, Porter stemmer, utilities
Requirements
- Python 3.10+
- spaCy 3.8.2
- spaCy model: en_core_web_sm
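A quick way to confirm these requirements before first use is a small environment check. The function `check_requirements` below is a hypothetical helper, not part of LingPatLab. It relies on the standard library and on the fact that spaCy models are installed as importable Python packages.

```python
import sys
from importlib import metadata, util


def check_requirements(model="en_core_web_sm", min_python=(3, 10)):
    """Return a list of human-readable problems (empty if none found).

    Hypothetical helper, not part of LingPatLab.
    """
    problems = []

    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")

    try:
        metadata.version("spacy")  # raises if spaCy is not installed
    except metadata.PackageNotFoundError:
        problems.append("spaCy is not installed")

    # spaCy models are regular packages, so find_spec can locate them
    if util.find_spec(model) is None:
        problems.append(
            f"spaCy model {model} not found; "
            f"try: python -m spacy download {model}"
        )

    return problems


for problem in check_requirements():
    print(problem)
```

The check does not compare the installed spaCy version against 3.8.2. That is left to the package's own dependency metadata.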
Development
# Install with dev dependencies
pip install -e ".[linting,testing]"
# Run tests
pytest
# Run regression suite
python regression/regression_runner.py
License
MIT License - see LICENSE for details.
Author
Craig Trim - craigtrim@gmail.com
More NLP articles and demos at craigtrim.com