SpaCy pipeline component for adding document or sentence-level ngrams.

spacy-ngram

spacy-ngram is a flexible SpaCy pipeline component for adding document- or sentence-level ngrams to your NLP pipeline. It extracts ngrams from lemmas (default) or tokens, handling stop words, punctuation, and digits automatically.

About the Project

This component provides an easy way to enrich your Doc or Span objects with n-character or n-word sequences (ngrams). It's designed to be lightweight and highly configurable.

Getting Started

Prerequisites

  • Python 3.10+
  • SpaCy 3.5.0+

Installation

  1. Install from PyPI:
pip install spacy-ngram
  2. Download a SpaCy model:
python -m spacy download en_core_web_sm

Usage

Quick Start

By default, the component adds unigrams and bigrams at the document level, filtering out stop words, punctuation, and digits.

import spacy
# the component is registered automatically on import
import spacy_ngram

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacy-ngram')

text = 'Quark soup is an interacting localized assembly of quarks and gluons.'
doc = nlp(text)

print(doc._.ngram_1)
# ['quark', 'soup', 'interact', 'localize', 'assembly', 'quark', 'gluon']

print(doc._.ngram_2)
# ['quark_soup', 'soup_interact', 'interact_localize', 'localize_assembly', 'assembly_quark', 'quark_gluon']
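
To make the shape of these lists concrete, here is a minimal pure-Python sketch that rebuilds the output above from the already-filtered lemmas. The helper `join_ngrams` and the underscore separator argument are illustrative assumptions inferred from the printed output, not part of the spacy-ngram API, and the component's internals may differ.

```python
def join_ngrams(terms, n, sep="_"):
    """Slide a window of size n over terms and join each window with sep.

    Hypothetical helper for illustration only; not spacy-ngram's implementation.
    """
    return [sep.join(terms[i:i + n]) for i in range(len(terms) - n + 1)]


# Lemmas left after stop words, punctuation, and digits are filtered out
# (copied from the doc._.ngram_1 output above).
lemmas = ['quark', 'soup', 'interact', 'localize', 'assembly', 'quark', 'gluon']

print(join_ngrams(lemmas, 1))
# ['quark', 'soup', 'interact', 'localize', 'assembly', 'quark', 'gluon']

print(join_ngrams(lemmas, 2))
# ['quark_soup', 'soup_interact', 'interact_localize', 'localize_assembly', 'assembly_quark', 'quark_gluon']
```

A list with fewer than n terms yields an empty list, so short inputs need no special casing.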

Custom Configuration

You can customize the extension name, the ngram sizes, and whether to include Beginning-of-Sentence (<BOS>) and End-of-Sentence (<EOS>) tags.

nlp.add_pipe('spacy-ngram', config={
    'extension_name': 'my_ngrams',
    'ngrams': (2, 3),  # Extract bi- and trigrams
    'sentence_level': True,  # Process each sentence individually
    'doc_level': True,  # Also process the entire document
    'include_bos': True,  # Include <BOS> tags
    'include_eos': True,  # Include <EOS> tags
})

doc = nlp("This is a test. This is only a test.")

# access document-level trigrams
print(doc._.my_ngrams_3)

# access sentence-level bigrams
for sent in doc.sents:
    print(sent._.my_ngrams_2)
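
As a rough illustration of what boundary tags do to the resulting ngrams, the sketch below pads a term list before windowing. It assumes a single <BOS> is prepended and a single <EOS> is appended; the README does not state how spacy-ngram places the tags, so treat `pad_terms` and `join_ngrams` as hypothetical helpers rather than the library's behavior.

```python
def pad_terms(terms, include_bos=False, include_eos=False):
    """Return a copy of terms with optional boundary tags.

    Assumption: one tag per boundary. The real component may pad differently.
    """
    padded = list(terms)
    if include_bos:
        padded.insert(0, "<BOS>")
    if include_eos:
        padded.append("<EOS>")
    return padded


def join_ngrams(terms, n, sep="_"):
    """Join each window of n consecutive terms with sep (illustrative helper)."""
    return [sep.join(terms[i:i + n]) for i in range(len(terms) - n + 1)]


terms = ["quark", "soup"]
print(join_ngrams(pad_terms(terms, include_bos=True, include_eos=True), 2))
# ['<BOS>_quark', 'quark_soup', 'soup_<EOS>']
```

Under this assumption, the tags let a bigram record that a term opens or closes a sentence, which plain windowing cannot express.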

Pipeline Parameters

Parameter       Type        Default  Description
extension_name  str         'ngram'  Base name for the Doc/Span extensions
ngrams          tuple[int]  (1, 2)   Tuple of ngram sizes to extract
include_bos     bool        False    Include <BOS> tags at the start of each sentence/document
include_eos     bool        False    Include <EOS> tags at the end of each sentence/document
sentence_level  bool        False    Perform ngram extraction at sentence level
doc_level       bool        True     Perform ngram extraction at document level

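The usage examples suggest that each ngram size gets its own extension attribute, named by appending the size to `extension_name` (for example `ngram_1` or `my_ngrams_3`). The sketch below spells out that pattern; `extension_names` is a hypothetical helper inferred from those examples, not a function provided by the package.

```python
def extension_names(extension_name="ngram", ngrams=(1, 2)):
    """List the attribute names implied by the '<extension_name>_<n>' pattern.

    Inferred from the usage examples in this README; illustrative only.
    """
    return [f"{extension_name}_{n}" for n in ngrams]


print(extension_names())
# ['ngram_1', 'ngram_2']

print(extension_names("my_ngrams", (2, 3)))
# ['my_ngrams_2', 'my_ngrams_3']
```
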
Versions

This project uses semantic versioning (SemVer). See Releases.

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are welcome!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Please use the issue tracker.
