Project description

Corpus Statistics: A Very Basic (for now) spaCy Pipeline Component

If you want to know what tokens your pipeline has seen, this is the component for you.

pip install git+https://github.com/pmbaumgartner/corpus_statistics
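
The package is also published on PyPI as corpus-statistics (see the files listed below), so installing the released version should work too:

pip install corpus-statistics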

⚡️ Example

from spacy.lang.en import English

# Use some example data
from datasets import load_dataset
dataset = load_dataset("imdb")
texts = dataset["train"]["text"]

# ✨ start the magic 
nlp = English()  # or spacy.load('a_model')
nlp.add_pipe("simple_corpus_stats")

for doc in nlp.pipe(texts):
    # ➡️ do your pipeline stuff! ➡️
    pass

corpus_stats = nlp.get_pipe("simple_corpus_stats")

# check if a token has been processed through this pipeline
token = "apple"
if token in corpus_stats:
    token_count = corpus_stats[token]
    print(f"'{token}' mentioned {token_count} times")

# 'apple' mentioned 24 times
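
Membership and counts work on plain strings, so you can also flag tokens in a new document that never appeared in the processed corpus. A minimal sketch, reusing the nlp pipeline and corpus_stats from above:

# which tokens in a new doc were never seen while processing the corpus?
new_doc = nlp("A thoroughly unputdownable movie")
unseen = [token.text for token in new_doc if token.text not in corpus_stats]
print(f"Never seen in the corpus: {unseen}")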

It's got all your favorite legomena like hapax and dis.

only_seen_once = len(corpus_stats.hapax_legomena)
percent_of_vocab = only_seen_once / corpus_stats.vocab_size
print(f"{percent_of_vocab*100:.1f}% of the vocabulary occurred only once.")
# 47.6% of the vocabulary occurred only once.

only_seen_twice = len(corpus_stats.dis_legomena)
percent_of_vocab_2x = only_seen_twice / corpus_stats.vocab_size
print(f"{percent_of_vocab_2x*100:.1f}% of the vocabulary occurred exactly twice.")
# 12.3% of the vocabulary occurred exactly twice.

We counted some things too:

# corpus_stats.vocabulary is a collections.Counter 🔢
print(*corpus_stats.vocabulary.most_common(5), sep="\n")
# ('the', 289838)
# (',', 275296)
# ('.', 236702)
# ('and', 156484)
# ('a', 156282)

# corpus_stats.doc_lengths is a sequence of per-doc token counts;
# corpus_stats.corpus_length is the number of docs processed
mean_doc_length = sum(corpus_stats.doc_lengths) / corpus_stats.corpus_length
print(f"Mean doc length: {mean_doc_length:.1f}")
# Mean doc length: 272.5
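
The attributes above are enough for other corpus-level summaries too. For example, a rough type-token ratio, a sketch using only the vocab_size and doc_lengths attributes shown above:

# type-token ratio: unique tokens (types) / total tokens processed
total_tokens = sum(corpus_stats.doc_lengths)
type_token_ratio = corpus_stats.vocab_size / total_tokens
print(f"Type-token ratio: {type_token_ratio:.4f}")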

Use in Model Training and Config Files

This can be quite helpful if you want to know which tokens were seen in your training data. You can include this component in your training config as follows.

...
[nlp]
lang = "en"
pipeline = ["simple_corpus_stats", ...]
...

[components]

[components.simple_corpus_stats]
factory = "simple_corpus_stats"
n_train = 1000  # This is important! See below

⚠️ 🔁 If you use this component in a training config, the pipeline will see the same docs multiple times (once per training epoch, plus evaluation steps), so the vocabulary counter will over-count. To correct for this, specify the number of examples in your training dataset as the n_train config parameter.

import spacy

nlp = spacy.load("your_trained_model")
corpus_stats = nlp.get_pipe("simple_corpus_stats")

# with n_train set, counts aren't inflated by repeated epochs:
# if docs had been double-counted, the minimum count would exceed 1
assert min(corpus_stats.vocabulary.values()) == 1

# doc_lengths is trimmed to n_train, the value from the config above
assert len(corpus_stats.doc_lengths) == 1000



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

corpus-statistics-0.1.0.tar.gz (7.2 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

corpus_statistics-0.1.0-py3-none-any.whl (8.2 kB)

Uploaded Python 3

File details

Details for the file corpus-statistics-0.1.0.tar.gz.

File metadata

  • Download URL: corpus-statistics-0.1.0.tar.gz
  • Upload date:
  • Size: 7.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.13 CPython/3.8.11 Darwin/21.3.0

File hashes

Hashes for corpus-statistics-0.1.0.tar.gz

  • SHA256: 593636abee0373d18832e6743b753f5a28c5388a4d773e245f4aa61837c00dec
  • MD5: b3ffc5cac825e80e99cb1c5d3e18cc03
  • BLAKE2b-256: c28707d3e4a380688167ffc96bceaa7509ef46bd450cef5860e8d394e6308cb1

See more details on using hashes here.

File details

Details for the file corpus_statistics-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: corpus_statistics-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 8.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.13 CPython/3.8.11 Darwin/21.3.0

File hashes

Hashes for corpus_statistics-0.1.0-py3-none-any.whl

  • SHA256: 3a315837bc6a47734f75eaf3f6e2eade8a3f26d1868035ad80fc8036be292208
  • MD5: b3a957273a8dcb144305f0349aa1a7b5
  • BLAKE2b-256: e410fe95e79f2f2c74ebf091d1493cf5a0695433e1327fd3fc92c671c6ac06bb

See more details on using hashes here.
