Streaming access to the Google ngram data.

The Google Books Ngram Viewer dataset is freely available under a Creative Commons Attribution 3.0 Unported License and provides ngram counts over the books scanned by Google.

The dataset is so large that storing it locally is often impractical. However, sometimes you only need aggregate data over the dataset, for example to build a co-occurrence matrix.

This package provides an iterator over the dataset stored at Google. It decompresses the data on the fly and gives you access to the underlying records.
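To illustrate what decompressing on the fly means, here is a minimal sketch that turns an iterable of gzip-compressed byte chunks into text lines without holding the whole file in memory. It is not the package's actual implementation; the function name `iter_lines` and the chunked input are assumptions made for illustration.

```python
import zlib


def iter_lines(chunks):
    """Yield decoded lines from an iterable of gzip-compressed byte chunks.

    Hypothetical helper, not part of google-ngram-downloader.
    Assumes a single-member gzip stream encoded as UTF-8.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""
    for chunk in chunks:
        pending += decompressor.decompress(chunk)
        # Everything before the last newline is complete; keep the rest.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line.decode("utf-8")
    # Emit whatever is left once the stream ends.
    pending += decompressor.flush()
    for line in pending.splitlines():
        yield line.decode("utf-8")
```

In practice the chunks would come from an HTTP response read in fixed-size pieces, so only the current chunk and one partial line are kept in memory at any time.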

Example use

>>> from google_ngram_downloader import readline_google_store
>>>
>>> fname, url, records = next(readline_google_store(ngram_len=5))
>>> fname
'googlebooks-eng-all-5gram-20120701-0.gz'
>>> url
'http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-5gram-20120701-0.gz'
>>> next(records)
Record(ngram=u'0 " A most useful', year=1860, match_count=1, volume_count=1)
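Because the records arrive as a stream, aggregates can be accumulated without storing the data. The sketch below shows one possible way to count co-occurrences between the middle word of each ngram and its neighbours, weighted by `match_count`. The function `cooccurrence_counts` and the stand-in `Record` type are assumptions for illustration (only the field names come from the example above), and this is not necessarily how the package's own cooccurrence command defines its counts.

```python
from collections import Counter, namedtuple
from itertools import islice

# Stand-in with the same fields as the Record shown above, so the
# sketch runs without network access.
Record = namedtuple("Record", "ngram year match_count volume_count")


def cooccurrence_counts(records, limit=None):
    """Count (middle word, context word) pairs over a stream of records.

    Hypothetical helper: each pair is weighted by the record's match_count.
    `limit` caps the number of records consumed (None means no cap).
    """
    counts = Counter()
    for record in islice(records, limit):
        tokens = record.ngram.split()
        middle = len(tokens) // 2
        target = tokens[middle]
        for position, context in enumerate(tokens):
            if position != middle:
                counts[target, context] += record.match_count
    return counts
```

With the real store, the `records` iterator returned by `readline_google_store` could be passed in directly, ideally with a small `limit` while experimenting, since each file contains a very large number of records.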

Installation

pip install google-ngram-downloader

The command line tool

The package also provides a simple command line tool, google-ngram-downloader, to download the ngrams.

Changes

Version 3.1

  • The cooccurrence command does not perform any ngram modification.

Version 3.0

  • download, readline and cooccurrence subcommands.

  • readline_google_store transforms lines into Record objects in several processes.
