
A spaCy pipeline component for counting tokens a pipeline has seen.

Project description

Corpus Statistics: A Very Basic (for now) spaCy Pipeline Component

If you want to know what tokens your spaCy pipeline has seen, this is the component for you.

pip install corpus-statistics
# or, for the latest version from GitHub
pip install git+https://github.com/pmbaumgartner/corpus_statistics

⚡️ Example

from spacy.lang.en import English

# Use some example data
from datasets import load_dataset
dataset = load_dataset("imdb")
texts = dataset["train"]["text"]

nlp = English()  # or spacy.load('a_model')
nlp.add_pipe("simple_corpus_stats")  

# ✨ start the magic 
for doc in nlp.pipe(texts):
    # ➡️ do your pipeline stuff! ➡️
    pass

corpus_stats = nlp.get_pipe("simple_corpus_stats")

# check if a token has been processed through this pipeline
token = "apple"
if token in corpus_stats:
    token_count = corpus_stats[token]
    print(f"'{token}' mentioned {token_count} times")

# 'apple' mentioned 24 times
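
The containment check also gives you a quick way to spot tokens the pipeline has never seen. A minimal sketch (the example sentence is made up; nlp and corpus_stats are the objects from above):

# nlp.make_doc only runs the tokenizer, so this check doesn't add
# the new tokens to the component's counts.
new_doc = nlp.make_doc("An unwatchably overacted film, honestly.")
unseen = [token.text for token in new_doc if token.text not in corpus_stats]
print(f"{len(unseen)} of {len(new_doc)} tokens were never seen: {unseen}")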

It's got all your favorite legomena, like hapax (tokens that occur only once) and dis (tokens that occur exactly twice).

only_seen_once = len(corpus_stats.hapax_legomena)
percent_of_vocab = only_seen_once / corpus_stats.vocab_size
print(f"{percent_of_vocab*100:.1f}% tokens only occurred once.")
# 47.6% tokens only occurred once.

only_seen_twice = len(corpus_stats.dis_legomena)
percent_of_vocab_2x = only_seen_twice / corpus_stats.vocab_size
print(f"{percent_of_vocab_2x*100:.1f}% tokens occurred twice.")
# 12.3% tokens occurred twice.

We counted some things too:

# corpus_stats.vocabulary is a collections.Counter 🔢
print(*corpus_stats.vocabulary.most_common(5), sep="\n")
# ('the', 289838)
# (',', 275296)
# ('.', 236702)
# ('and', 156484)
# ('a', 156282)

mean_doc_length = sum(corpus_stats.doc_lengths) / corpus_stats.corpus_length
print(f"Mean doc length: {mean_doc_length:.1f}")
# Mean doc length: 272.5
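
The same attributes give you a quick type-token ratio (unique tokens over total tokens), a rough measure of lexical diversity. A small sketch using only the attributes shown above:

# Type-token ratio: vocabulary size divided by total tokens processed.
total_tokens = sum(corpus_stats.doc_lengths)
type_token_ratio = corpus_stats.vocab_size / total_tokens
print(f"Type-token ratio: {type_token_ratio:.4f}")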

Use in Model Training and Config Files

This can be quite helpful if you want to know which tokens were seen in your training data. You can include the component in your training config as follows:

...
[nlp]
lang = "en"
pipeline = ["simple_corpus_stats", ...]
...

[components]

[components.simple_corpus_stats]
factory = "simple_corpus_stats"
n_train = 1000  # This is important! See below

⚠️ 🔁 If you use this component in a training config, your pipeline will see the same docs multiple times (once per training epoch, plus evaluation steps), so the vocabulary counter would over-count. To correct for this, you need to specify the number of examples in your training dataset as the n_train config parameter.
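
If you assemble the pipeline in Python rather than from a config file, the same setting can be passed through add_pipe's config argument (a sketch; 1000 is just a placeholder for your training-set size):

from spacy.lang.en import English

nlp = English()
# Pass the size of your training set so repeated passes over the same
# docs (epochs and evaluation steps) can be corrected for.
nlp.add_pipe("simple_corpus_stats", config={"n_train": 1000})

After training, you can load the saved pipeline and verify the counts: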

import spacy

nlp = spacy.load("your_trained_model")
corpus_stats = nlp.get_pipe("simple_corpus_stats")

# after the n_train correction, counts reflect a single pass over the training data
assert min(corpus_stats.vocabulary.values()) == 1

# value from config
assert len(corpus_stats.doc_lengths) == 1000

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

corpus-statistics-0.1.1.tar.gz (7.2 kB)

Uploaded Source

Built Distribution

corpus_statistics-0.1.1-py3-none-any.whl (8.3 kB)

Uploaded Python 3

File details

Details for the file corpus-statistics-0.1.1.tar.gz.

File metadata

  • Download URL: corpus-statistics-0.1.1.tar.gz
  • Upload date:
  • Size: 7.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.13 CPython/3.8.11 Darwin/21.3.0

File hashes

Hashes for corpus-statistics-0.1.1.tar.gz
Algorithm Hash digest
SHA256 8bc4a9237219a084f68f67e6fa130c1e656d122a0b62152492929dfcba42709a
MD5 5aa853019880f10037f3135f4f3ef624
BLAKE2b-256 8e4adf942725028f2fe6a79cfc8be81a05f7fd4b18f781ca92d9ae1d71f155a4

See more details on using hashes here.
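
If you want to check a downloaded file yourself, the listed SHA256 can be verified with Python's hashlib (a sketch; adjust the path to wherever you saved the file):

import hashlib

expected = "8bc4a9237219a084f68f67e6fa130c1e656d122a0b62152492929dfcba42709a"
with open("corpus-statistics-0.1.1.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, "Hash mismatch: file may be corrupted or tampered with."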

File details

Details for the file corpus_statistics-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for corpus_statistics-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 af29c378b691b608e45a90b79025e606a97d349746768f927400d99e63578a98
MD5 2c8e64954bb2cd53c7c2e46b40ceae22
BLAKE2b-256 108c1a2933c231e568c54ed342fcb6dd91184a46b817e90946824d7c4f32b6a5

See more details on using hashes here.
