SpaCy pipeline component for adding document or sentence-level ngrams.
spacy-ngram
spacy-ngram is a flexible SpaCy pipeline component for adding document- or sentence-level ngrams to your NLP pipeline.
It extracts ngrams from lemmas (default) or tokens and automatically filters out stop words, punctuation, and digits.
About the Project
This component provides an easy way to enrich your Doc or Span objects with n-character or n-word sequences (ngrams). It's designed to be lightweight and highly configurable.
Getting Started
Prerequisites
- Python 3.10+
- SpaCy 3.5.0+
Installation
- Install from PyPI:
pip install spacy-ngram
- Download a SpaCy model:
python -m spacy download en_core_web_sm
Usage
Quick Start
By default, the component adds unigrams and bigrams at the document level, filtering out stop words, punctuation, and digits.
import spacy
# the component is registered automatically on import
import spacy_ngram
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacy-ngram')
text = 'Quark soup is an interacting localized assembly of quarks and gluons.'
doc = nlp(text)
print(doc._.ngram_1)
# ['quark', 'soup', 'interact', 'localize', 'assembly', 'quark', 'gluon']
print(doc._.ngram_2)
# ['quark_soup', 'soup_interact', 'interact_localize', 'localize_assembly', 'assembly_quark', 'quark_gluon']
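To make the output format above concrete, here is a minimal pure-Python sketch of sliding-window ngram construction over a lemma list that has already been filtered. It is not the package's implementation: the helper name `join_ngrams` and its `sep` argument are assumptions inferred from the printed output, and the input list is copied from the `doc._.ngram_1` result shown above.

```python
def join_ngrams(tokens, n, sep="_"):
    """Return every contiguous n-token window, joined with `sep`.

    Hypothetical helper for illustration only; not part of spacy-ngram.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Zipping n staggered views of the list yields each sliding window.
    windows = zip(*(tokens[i:] for i in range(n)))
    return [sep.join(window) for window in windows]


# Lemmas remaining after stop words and punctuation are dropped
# (taken from the Quick Start output above).
lemmas = ["quark", "soup", "interact", "localize", "assembly", "quark", "gluon"]

unigrams = join_ngrams(lemmas, 1)
bigrams = join_ngrams(lemmas, 2)
print(bigrams)
```

Under these assumptions, the bigram list has one entry fewer than the unigram list, which matches the six bigrams and seven unigrams in the Quick Start output.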
Custom Configuration
You can customize the extension name, the ngram sizes, the processing level (sentence and/or document), and whether to include Beginning-of-Sentence (<BOS>) and End-of-Sentence (<EOS>) tags.
nlp.add_pipe('spacy-ngram', config={
    'extension_name': 'my_ngrams',
    'ngrams': (2, 3),          # Extract bi- and trigrams
    'sentence_level': True,    # Process each sentence individually
    'doc_level': True,         # Also process the entire document
    'include_bos': True,       # Include <BOS> tags
    'include_eos': True,       # Include <EOS> tags
})
doc = nlp("This is a test. This is only a test.")
# access document-level trigrams
print(doc._.my_ngrams_3)
# access sentence-level bigrams
for sent in doc.sents:
    print(sent._.my_ngrams_2)
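The effect of the `include_bos` and `include_eos` options can be pictured with a small pure-Python sketch. This is an assumption about the general technique (padding a token sequence before taking sliding windows), not the component's actual code: the helpers `pad_tokens` and `join_ngrams` are hypothetical, and the real component may pad or join the tags differently.

```python
BOS, EOS = "<BOS>", "<EOS>"


def pad_tokens(tokens, include_bos=False, include_eos=False):
    """Return a copy of `tokens` with optional boundary tags (hypothetical helper)."""
    padded = list(tokens)
    if include_bos:
        padded.insert(0, BOS)  # assumed: a single tag at the start
    if include_eos:
        padded.append(EOS)     # assumed: a single tag at the end
    return padded


def join_ngrams(tokens, n, sep="_"):
    """Return every contiguous n-token window, joined with `sep`."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = ["quark", "soup", "interact"]
padded = pad_tokens(sentence, include_bos=True, include_eos=True)
padded_bigrams = join_ngrams(padded, 2)
print(padded_bigrams)
# ['<BOS>_quark', 'quark_soup', 'soup_interact', 'interact_<EOS>']
```

In this sketch, padding adds one extra bigram at each boundary, which lets downstream features distinguish words that open or close a sentence.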
Pipeline Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| extension_name | str | 'ngram' | Base name for the Doc/Span extensions |
| ngrams | tuple[int] | (1, 2) | Ngram sizes to extract |
| include_bos | bool | False | Include <BOS> tags at the start of each sentence/document |
| include_eos | bool | False | Include <EOS> tags at the end of each sentence/document |
| sentence_level | bool | False | Perform ngram extraction at the sentence level |
| doc_level | bool | True | Perform ngram extraction at the document level |
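Judging from the usage examples (`doc._.ngram_1`, `doc._.my_ngrams_3`), each ngram size appears to get its own extension attribute named after `extension_name` plus the size. The sketch below spells out that naming pattern; the helper `extension_attrs` is hypothetical and the pattern is inferred from the examples rather than taken from the package source.

```python
def extension_attrs(extension_name="ngram", ngrams=(1, 2)):
    """List the attribute names implied by a config (hypothetical helper).

    Assumes the pattern '<extension_name>_<n>' seen in the usage examples.
    """
    return [f"{extension_name}_{n}" for n in ngrams]


default_attrs = extension_attrs()
custom_attrs = extension_attrs("my_ngrams", (2, 3))

print(default_attrs)  # ['ngram_1', 'ngram_2']
print(custom_attrs)   # ['my_ngrams_2', 'my_ngrams_3']
```

With spaCy installed, `Doc.has_extension(name)` and `Span.has_extension(name)` can be used to check whether a given attribute has actually been registered after the component is added.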
Roadmap
See the open issues for a list of proposed features (and known issues).
Contributing
Contributions are welcome!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Please use the issue tracker.
Download files
File details
Details for the file spacy_ngram-1.0.1.tar.gz.
File metadata
- Download URL: spacy_ngram-1.0.1.tar.gz
- Upload date:
- Size: 7.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6e9f0f8cd41e01c8843cff847eb3ac40812c4b14bb92fbf4cd575d0a3cd38d74 |
| MD5 | 1a0e502e71f33ca1d5f885bff697fcd1 |
| BLAKE2b-256 | f5f936322ad78df510d94e88e04998b058920af275047ee809007a0b97e7f454 |
File details
Details for the file spacy_ngram-1.0.1-py3-none-any.whl.
File metadata
- Download URL: spacy_ngram-1.0.1-py3-none-any.whl
- Upload date:
- Size: 6.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 031277f7f9afd10727d4d907cc6c004b27b6e55cc91820fbabf208a91a097d2e |
| MD5 | f19f91889e91922dbe51e25b0954d264 |
| BLAKE2b-256 | 2fba806515aec0929dea3ef807f5693a8bad85c7f573a2a82f9197531ac93610 |