# Corpus Statistics: A Very Basic (for now) spaCy Pipeline Component

A spaCy pipeline component for counting the tokens a pipeline has seen.
If you want to know what tokens your spaCy pipeline has seen, this is the component for you.
```bash
pip install corpus-statistics

# OR for the latest from source
pip install git+https://github.com/pmbaumgartner/corpus_statistics
```
## ⚡️ Example
```python
from spacy.lang.en import English

# Use some example data
from datasets import load_dataset

dataset = load_dataset("imdb")
texts = dataset["train"]["text"]

nlp = English()  # or spacy.load('a_model')
nlp.add_pipe("simple_corpus_stats")

# ✨ start the magic
for doc in nlp.pipe(texts):
    # ➡️ do your pipeline stuff! ➡️
    pass

corpus_stats = nlp.get_pipe("simple_corpus_stats")

# check if a token has been processed through this pipeline
token = "apple"
if token in corpus_stats:
    token_count = corpus_stats[token]
    print(f"'{token}' mentioned {token_count} times")
# 'apple' mentioned 24 times
```
It's got all your favorite legomena: hapax legomena (tokens that occur exactly once) and dis legomena (tokens that occur exactly twice).
```python
only_seen_once = len(corpus_stats.hapax_legomena)
percent_of_vocab = only_seen_once / corpus_stats.vocab_size
print(f"{percent_of_vocab*100:.1f}% of tokens occurred only once.")
# 47.6% of tokens occurred only once.

only_seen_twice = len(corpus_stats.dis_legomena)
percent_of_vocab_2x = only_seen_twice / corpus_stats.vocab_size
print(f"{percent_of_vocab_2x*100:.1f}% of tokens occurred twice.")
# 12.3% of tokens occurred twice.
```
We counted some things too:
```python
# corpus_stats.vocabulary is a collections.Counter 🔢
print(*corpus_stats.vocabulary.most_common(5), sep="\n")
# ('the', 289838)
# (',', 275296)
# ('.', 236702)
# ('and', 156484)
# ('a', 156282)

mean_doc_length = sum(corpus_stats.doc_lengths) / corpus_stats.corpus_length
print(f"Mean doc length: {mean_doc_length:.1f}")
# Mean doc length: 272.5
```
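For intuition about how a component like this plugs into spaCy: a stateful pipeline component is just a callable that receives each `Doc`, updates its own counters, and returns the `Doc` unchanged. Below is a minimal sketch of that pattern; the factory name `my_corpus_stats` and class `MyCorpusStats` are hypothetical, chosen so they don't clash with the real `simple_corpus_stats` factory, and this is not the package's actual implementation.

```python
from collections import Counter

from spacy.language import Language
from spacy.tokens import Doc


@Language.factory("my_corpus_stats")  # hypothetical name, for illustration
def create_my_corpus_stats(nlp: Language, name: str):
    return MyCorpusStats()


class MyCorpusStats:
    """Stateful component: counts every token text it sees."""

    def __init__(self) -> None:
        self.vocabulary: Counter = Counter()
        self.doc_lengths: list = []

    def __call__(self, doc: Doc) -> Doc:
        # update running statistics, then pass the Doc through unchanged
        self.vocabulary.update(token.text for token in doc)
        self.doc_lengths.append(len(doc))
        return doc

    def __contains__(self, token_text: str) -> bool:
        return token_text in self.vocabulary

    def __getitem__(self, token_text: str) -> int:
        return self.vocabulary[token_text]
```

Because the component only reads the `Doc` and returns it untouched, it can sit anywhere in the pipeline without affecting downstream components.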
## Use in Model Training and Config Files
This can be quite helpful if you want to know which tokens were seen in your training data. You can include the component in your training config as follows:
```ini
...
[nlp]
lang = "en"
pipeline = ["simple_corpus_stats", ...]
...

[components]

[components.simple_corpus_stats]
factory = "simple_corpus_stats"
n_train = 1000  # This is important! See below
```
⚠️ 🔁 If you use this component in a training config, your pipeline will see the same docs multiple times (once per training epoch, plus evaluation steps), so the vocabulary counts would be inflated. To correct for this, you need to specify the number of examples in your training dataset as the `n_train` config parameter.
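To see why, here is the arithmetic with hypothetical numbers (the epoch count and variable names below are illustrative, not part of the package's API):

```python
# hypothetical illustration of the over-counting problem
n_train = 1000      # unique docs in the training corpus (the value passed as n_train)
epochs = 10         # each epoch pipes every training doc through the pipeline again
docs_processed = n_train * epochs
print(docs_processed)  # 10000 component calls for only 1000 unique docs
```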
```python
import spacy

nlp = spacy.load("your_trained_model")
corpus_stats = nlp.get_pipe("simple_corpus_stats")

# counts weren't inflated by repeated epochs: hapax legomena still have a count of 1
assert min(corpus_stats.vocabulary.values()) == 1
# one doc length per training example; 1000 is the n_train value from the config
assert len(corpus_stats.doc_lengths) == 1000
```