SpaCy pipeline component for adding document or sentence-level ngrams.

Development Status
- 4 - Beta
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3 :: Only
Topic
- Text Processing :: Linguistic

spacy-ngram

SpaCy pipeline component for adding document or sentence-level ngrams.

About the Project
Getting Started
- Prerequisites
- Installation
Usage
Roadmap
Contributing
License
Contact
Acknowledgements

About the Project

SpaCy pipeline component for adding document or sentence-level ngrams.

Getting Started

Prerequisites

Python 3.10+

Installation

Install from PyPI:

pip install spacy-ngram

Installing the package also installs spaCy, but spaCy still needs a language model to run:
- E.g., download one with: python -m spacy download en_core_web_sm
- Or download a model manually and install it with pip install ...
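
Since spacy.load fails when the requested model is absent, it can help to check for the model before building the pipeline. The sketch below uses only the standard library and assumes the model was installed as a regular Python package named en_core_web_sm (which is how python -m spacy download installs it). The helper name model_installed is hypothetical and not part of spacy-ngram or spaCy.

```python
from importlib.util import find_spec


def model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return find_spec(package_name) is not None


if __name__ == "__main__":
    name = "en_core_web_sm"  # assumed model name, taken from the install step
    if model_installed(name):
        print(f"{name} is installed")
    else:
        print(f"{name} is missing; run: python -m spacy download {name}")
```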

Usage

Quick Start

spacy-ngram creates ngrams of any size and attaches them at the document level, the sentence level, or both.

import spacy
from spacy_ngram import NgramComponent

nlp = spacy.load('en_core_web_sm')  # or whatever model you downloaded
nlp.add_pipe('spacy-ngram')  # defaults to document-level ngrams, removing stopwords

text = 'Quark soup is an interacting localized assembly of quarks and gluons.'
doc = nlp(text)

print(doc._.ngram_1)
# ['quark', 'soup', 'interact', 'localize', 'assembly', 'quark', 'gluon']

print(doc._.ngram_2)
# ['quark_soup', 'soup_interact', 'interact_localize', 'localize_assembly', 'assembly_quark', 'quark_gluon']

Quick Reference

spacy-ngram registers custom extension attributes on the Doc and/or Span classes, depending on the parameters (Doc only by default). Each extension name is the prefix ngram_ followed by the ngram size (e.g., ngram_1).

unigram (1 included in ngrams argument): Doc._.ngram_1
bigram (2 included in ngrams argument): Doc._.ngram_2
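
To make the shape of these values concrete, the standalone sketch below rebuilds the Quick Start output from a list of already-lemmatized, stopword-free tokens. It is an illustration only, not the library's implementation: build_ngrams is a hypothetical helper, and the underscore separator is inferred from the example output shown in the Quick Start.

```python
def build_ngrams(tokens: list[str], n: int) -> list[str]:
    """Join each run of n consecutive tokens with an underscore."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence shorter than n yields an empty list.
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Lemmas copied from the Quick Start example (stopwords already removed).
lemmas = ["quark", "soup", "interact", "localize", "assembly", "quark", "gluon"]

unigrams = build_ngrams(lemmas, 1)
bigrams = build_ngrams(lemmas, 2)
print(bigrams)
```

A sequence of k tokens yields k - n + 1 ngrams of size n, which is why the seven lemmas above produce six bigrams.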

Pipeline Parameters

The component can be configured when it is added to the pipeline. E.g., to process at the sentence level only:

nlp.add_pipe('spacy-ngram', config={
    'sentence_level': True,  # initialize sentence-level ngrams
    'doc_level': False,  # skip processing at document-level
    'ngrams': (2, 3),  # bi- and trigram only
})
doc = nlp(text)
sentence = list(doc.sents)[0]  # doc.sents yields Span objects; take the first sentence

print(sentence._.ngram_2)  # list of bigrams
print(sentence._.ngram_3)  # list of trigrams

sentence._.ngram_1  # raises AttributeError: 1 is not in the ngrams setting

Parameter	Type	Default	Description
`ngrams`	`tuple[int]`	`(1, 2)`	1 for unigram, 2 for bigram, etc.
`include_bos`	`bool`	`False`	include `BOS` tags at beginning of sentence/document
`include_eos`	`bool`	`False`	include `EOS` tags at end of sentence/document
`sentence_level`	`bool`	`False`	perform ngram-extraction at sentence-level
`doc_level`	`bool`	`True`	perform ngram-extraction at document-level
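
The effect of include_bos and include_eos can be pictured with a small standalone sketch. The marker strings 'BOS' and 'EOS', the helper pad_tokens, and the use of a single marker per side are all assumptions made for illustration; the component's actual output may differ.

```python
def pad_tokens(tokens, include_bos=False, include_eos=False):
    """Return a copy of tokens with optional boundary markers added."""
    padded = list(tokens)
    if include_bos:
        padded.insert(0, "BOS")  # assumed marker string, placed first
    if include_eos:
        padded.append("EOS")  # assumed marker string, placed last
    return padded


tokens = ["quark", "soup"]
padded = pad_tokens(tokens, include_bos=True, include_eos=True)

# Bigrams over the padded sequence now record which tokens sit at a boundary.
boundary_bigrams = ["_".join(padded[i:i + 2]) for i in range(len(padded) - 1)]
print(boundary_bigrams)
```

Boundary markers matter mostly for bigrams and larger, where they let downstream models distinguish a token that opens or closes a sentence from the same token mid-sentence.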

Versions

Uses semantic versioning (SemVer).

See https://github.com/kpwhri/spacy-ngram/releases.

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Any contributions you make are greatly appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

Distributed under the MIT License.

See LICENSE or https://kpwhri.mit-license.org for more information.

Contact

Please use the issue tracker.

Acknowledgements

Release history

- 0.0.3 (Jul 25, 2023; this version)
- 0.0.2 (Mar 21, 2023)
- 0.0.1 (Mar 21, 2023)

Download files

Source Distribution
- spacy_ngram-0.0.3.tar.gz (8.5 kB), uploaded Jul 25, 2023

Built Distribution
- spacy_ngram-0.0.3-py3-none-any.whl (5.6 kB), uploaded Jul 25, 2023, Python 3

Hashes for spacy_ngram-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`b84cd14221745828928afc15a888592a95eb79c2408f5160848478ed7cd783cc`
MD5	`337912e0118059e25582740aa1eeb481`
BLAKE2b-256	`5542d27a7e2dea7a42e8ff911a4458926891b657a1f6e46db82ad3162a30c098`

Hashes for spacy_ngram-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5cad2ec422d5b2638cf0d46c2a9711f77b01f38e52fdaafe079a306a56a80a11`
MD5	`d801cb119cdf92799cda260641e8b33e`
BLAKE2b-256	`bfa5b2dfe976b7e66323c88208c6eacca45d1bc4beb2a3cabb4a9c8fed4e87d7`