Fetch and analyze Google Ngram data for specified word forms.
Project description
This package has functions for processing Google’s Ngram repositories without having to download them locally. These repositories vary in size, but the larger ones (like the one for the letter s or common bigrams) contain multiple gigabytes of data.
The main function uses scan_csv from the polars package to reduce memory load. Still, depending on the specific word forms being searched, loading and processing the data tables can sometimes take a few minutes if they are large.
vnc
To analyze the returned data, the package also contains functions based on the work of Gries and Hilpert (2012) for Variability-Based Neighbor Clustering.
The idea is to use hierarchical clustering to aid “bottom up” periodization of language change. The Python functions are built on their original R code.
Distances, therefore, are calculated in sums of standard deviations and coefficients of variation, according to their stated method.
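To make the clustering constraint concrete, here is a minimal, simplified sketch of the general VNC idea using only the standard library. It is not the package's implementation: the function name vnc_merge_order and its arguments are hypothetical, and the choices to pool values and to accumulate merge heights are assumptions made for illustration. Only temporally adjacent clusters are ever compared, and the pair whose pooled values have the smallest standard deviation (or coefficient of variation) is merged first.

```python
from statistics import mean, stdev


def vnc_merge_order(periods, values, distance="sd"):
    """Greedy sketch: repeatedly merge the most similar adjacent clusters.

    Returns a list of (left_periods, right_periods, height) tuples, where
    height is the running sum of the distances at which merges occurred.
    """
    # Each cluster holds the labels and pooled values of contiguous periods.
    clusters = [([p], [v]) for p, v in zip(periods, values)]
    merges = []
    height = 0.0
    while len(clusters) > 1:
        # Score every adjacent pair by the spread of its pooled values.
        scores = []
        for (_, left), (_, right) in zip(clusters, clusters[1:]):
            pooled = left + right
            spread = stdev(pooled)
            if distance == "cv":
                # Coefficient of variation; assumes a nonzero mean.
                spread = spread / mean(pooled)
            scores.append(spread)
        i = scores.index(min(scores))
        height += scores[i]
        (lp, lv), (rp, rv) = clusters[i], clusters[i + 1]
        merges.append((lp, rp, height))
        # Replace the two neighbors with one cluster, preserving time order.
        clusters[i:i + 2] = [(lp + rp, lv + rv)]
    return merges


merges = vnc_merge_order([1900, 1910, 1920, 1930], [1.0, 1.1, 5.0, 5.2])
for left, right, h in merges:
    print(left, "+", right, "at height", round(h, 3))
```

In this toy series the two early decades merge first, then the two later ones, and the two resulting periods join last, which is the kind of contiguous grouping the dendrogram is meant to reveal.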
Dendrograms are plotted using matplotlib, with custom implementations for hierarchical clustering that maintain the plotting order of the leaves according to the requirements of the method.
The package also has a custom implementation of dendrogram truncation that consolidates leaves under a specified number of time periods (or clusters) while also maintaining the leaf order to facilitate the reading and interpretation of large dendrograms.
Lightweight Implementation
Starting with version 0.2.0, google_ngrams uses lightweight, custom implementations for statistical computations instead of heavy dependencies like scipy and statsmodels. This design choice reduces installation overhead while maintaining full functionality for the core VNC methodology and smoothing operations.
Installation
You can install the released version of google_ngrams from PyPI:
pip install google-ngrams
Usage
To use the google_ngrams package, import google_ngram to fetch data and TimeSeries for analysis.
from google_ngrams import google_ngram, TimeSeries
Fetching n-gram data
The google_ngram function supports different varieties of English (e.g., British, American) and allows aggregation by year or decade. Word forms (even a single word form) must be passed as a list.
The following would return counts for the word x-ray in US English by year:
xray_year = google_ngram(word_forms = ["x-ray"], variety = "us", by = "year")
Alternatively, the following would return counts of the combined forms x-ray and x-rays in British English by decade:
xray_decade = google_ngram(word_forms = ["x-ray", "x-rays"], variety = "gb", by = "decade")
The function returns a polars DataFrame with a time interval column (either Year or Decade) and columns for Token, AF (absolute frequency) and RF (relative frequency).
The returned DataFrame, then, can be manipulated using the polars API:
import polars as pl
xray_filtered = xray_decade.filter(pl.col("Decade") >= 1900)
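For intuition about how the AF and RF columns relate under decade aggregation, the following standard-library sketch sums hypothetical yearly counts into decades. The helper aggregate_by_decade, the input tuples, and the per-million scaling of RF are all assumptions made for illustration; the package's own normalization may differ.

```python
from collections import defaultdict


def aggregate_by_decade(yearly, per=1_000_000):
    """Sum yearly counts into decades.

    yearly: iterable of (year, match_count, total_tokens) tuples.
    RF is scaled per `per` tokens (an assumed normalization).
    """
    af = defaultdict(int)
    totals = defaultdict(int)
    for year, count, total in yearly:
        decade = year - year % 10  # e.g. 1905 -> 1900
        af[decade] += count
        totals[decade] += total
    return [
        {"Decade": d, "AF": af[d], "RF": af[d] * per / totals[d]}
        for d in sorted(af)
    ]


# Invented counts, used only to show the arithmetic.
toy = [
    (1898, 2, 1_000_000),
    (1899, 3, 1_000_000),
    (1900, 10, 2_000_000),
    (1905, 30, 2_000_000),
]
table = aggregate_by_decade(toy)
for row in table:
    print(row)
```

Summing counts and totals before dividing weights each year by its corpus size, which differs from averaging the yearly relative frequencies.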
Analyzing time series data
To analyze the data, use TimeSeries, specifying a column of time intervals and a column of relative frequencies:
xray_ts = TimeSeries(xray_filtered, time_col="Decade", values_col="RF")
VNC dendrograms can then be plotted with a variety of options:
xray_ts.timeviz_vnc()
For additional information, consult the documentation.
License
Code licensed under MIT License. See LICENSE file.