Library to create and interrogate a local cache for Project Gutenberg

Project description

GutenbergPy

This package makes it easier to filter and retrieve information from Project Gutenberg in Python.

Its target audience is machine learning practitioners who need data for their projects, but anybody may use it freely.

The package:

  • Generates a local cache (of all Gutenberg metadata) that you can interrogate to get book ids. The local cache may be SQLite (default) or MongoDB (for which you need to have the pymongo package installed)
  • Downloads and cleans raw text from Gutenberg books

The package has been tested with Python 3.6 on both Windows and Linux. It is a faster, smaller alternative to https://github.com/c-w/Gutenberg with fewer third-party dependencies.

About development: http://www.raduangelescu.com/gutenbergpy.html

Installation

Install the gutenbergpy package with pip, or just install it from source (it's all just Python code).

Usage

Downloading a text
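
The code sample for this step is missing from this page, so here is a minimal sketch rather than the author's exact example. It assumes that the gutenbergpy.textget module exposes get_text_by_id (fetches the raw text of a book) and strip_headers (removes the Project Gutenberg header and footer). The load_book wrapper and its fetch/clean parameters are hypothetical, added only so the flow can be tried without the package or a network connection.

```python
def load_book(book_id, fetch=None, clean=None):
    """Return (raw_text, clean_text) for a Gutenberg book id.

    fetch and clean are hypothetical injection points. When they are
    omitted, the gutenbergpy functions assumed in the text are used.
    """
    if fetch is None or clean is None:
        # Imported lazily so the sketch can be exercised without the package.
        import gutenbergpy.textget as textget
        fetch = fetch or textget.get_text_by_id  # assumed: returns the raw book text
        clean = clean or textget.strip_headers   # assumed: strips header and footer
    raw_book = fetch(book_id)
    clean_book = clean(raw_book)
    return raw_book, clean_book
```

With the package installed, something like raw, clean = load_book(2701) would download the book with id 2701 and return it both with and without the Gutenberg boilerplate, if the assumptions above hold.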

Query the cache

To do this you first need to create the cache (this is a one-time step per OS, until you decide to rebuild it).

For debugging or better control, create accepts these boolean options:

  • refresh deletes the old cache
  • download downloads the RDF file from Project Gutenberg
  • unpack unpacks it
  • parse parses it in memory
  • cache writes the cache
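
The options above can be sketched as keyword arguments. This is an illustration under assumptions, not the author's original sample: it assumes the cache class is GutenbergCache in gutenbergpy.gutenbergcache and that create takes the five options as keywords. The helper names create_options and build_cache, and the reuse_download flag, are hypothetical.

```python
def create_options(reuse_download=False):
    """Build keyword arguments for the create call.

    reuse_download is a hypothetical convenience flag: when True, the RDF
    file is assumed to be on disk already, so the download step is skipped.
    """
    return {
        "refresh": True,                 # delete the old cache first
        "download": not reuse_download,  # fetch the RDF file from Project Gutenberg
        "unpack": True,                  # unpack the downloaded archive
        "parse": True,                   # parse the RDF data in memory
        "cache": True,                   # write the cache
    }


def build_cache(**options):
    # Lazy import: gutenbergpy only needs to be installed when this runs.
    # Assumed API: GutenbergCache.create(**options).
    from gutenbergpy.gutenbergcache import GutenbergCache
    GutenbergCache.create(**options)
```

For example, build_cache(**create_options(reuse_download=True)) would rebuild the cache from an archive that was downloaded earlier, provided create behaves as assumed.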

For even finer control you may set the following GutenbergCacheSettings:

  • CacheFilename
  • CacheUnpackDir
  • CacheArchiveName
  • ProgressBarMaxLength
  • CacheRDFDownloadLink
  • TextFilesCacheFolder
  • MongoDBCacheServer
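
Since the setting names are easy to mistype, a small guard can help. The sketch below checks names against the list above before handing them on. It assumes a GutenbergCacheSettings.set(**kwargs) call in gutenbergpy.gutenbergcachesettings; check_settings and apply_settings are hypothetical helpers, and the example value books.db is made up.

```python
KNOWN_SETTINGS = {
    "CacheFilename", "CacheUnpackDir", "CacheArchiveName",
    "ProgressBarMaxLength", "CacheRDFDownloadLink",
    "TextFilesCacheFolder", "MongoDBCacheServer",
}


def check_settings(**settings):
    """Return the settings unchanged, or raise KeyError on an unknown name."""
    unknown = sorted(set(settings) - KNOWN_SETTINGS)
    if unknown:
        raise KeyError("unknown cache settings: " + ", ".join(unknown))
    return settings


def apply_settings(**settings):
    # Lazy import. Assumed API: GutenbergCacheSettings.set(**kwargs).
    from gutenbergpy.gutenbergcachesettings import GutenbergCacheSettings
    GutenbergCacheSettings.set(**check_settings(**settings))
```

A call such as apply_settings(CacheFilename="books.db") would then presumably need to run before the cache is created, so that the new file name is the one used.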

Creating the cache takes a while: about 5 minutes, depending on your internet speed and computer power (on an i7 with a gigabit connection and an SSD it finishes in about 1 minute).

Get the cache

Now you can do queries

Get the unique Gutenberg book ids by using the query function

Standard query fields:

  • languages
  • authors
  • types
  • titles
  • subjects
  • publishers
  • bookshelves
  • downloadtype
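
A query over these fields might look like the sketch below. It assumes that GutenbergCache.get_cache() returns the cache object, that its query method takes the standard fields as keyword arguments, and that each field accepts a list of values. build_query and find_book_ids are hypothetical helpers, and the value "en" is only an illustrative language code.

```python
QUERY_FIELDS = (
    "languages", "authors", "types", "titles",
    "subjects", "publishers", "bookshelves", "downloadtype",
)


def build_query(**filters):
    """Check field names and turn every value into a list."""
    query = {}
    for field, value in filters.items():
        if field not in QUERY_FIELDS:
            raise KeyError(f"unknown query field: {field}")
        # A lone string becomes a one-element list; other iterables are copied.
        query[field] = [value] if isinstance(value, str) else list(value)
    return query


def find_book_ids(**filters):
    # Lazy import. Assumed API: GutenbergCache.get_cache().query(**fields).
    from gutenbergpy.gutenbergcache import GutenbergCache
    cache = GutenbergCache.get_cache()
    return cache.query(**build_query(**filters))
```

For instance, find_book_ids(languages="en", downloadtype=["text/plain"]) would ask the cache for the ids of English books that have a plain-text download, if the cache stores values in that form.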

Or do a native query on the SQLite database

For custom SQLite queries, take a look at the SQLite database schema:

[Image: SQLite database schema diagram]
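
Because the schema diagram is not reproduced here, the following sketch uses a stand-in in-memory database to show the general shape of a native SQL query. The table demo_books and its columns are hypothetical and do not reflect the real cache schema; only the standard-library sqlite3 calls are real.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical demo_books table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demo_books (book_id INTEGER, downloads INTEGER)")
    conn.executemany(
        "INSERT INTO demo_books VALUES (?, ?)",
        [(11, 500), (84, 900), (1342, 100)],  # made-up rows for illustration
    )
    return conn


def popular_book_ids(conn, min_downloads):
    """Return ids of books with at least min_downloads, most downloaded first."""
    sql = (
        "SELECT book_id FROM demo_books "
        "WHERE downloads >= ? ORDER BY downloads DESC"
    )
    # A parameterised query keeps the value out of the SQL string itself.
    return [row[0] for row in conn.execute(sql, (min_downloads,))]
```

Against the real cache, a comparable SQL statement would be written with the table and column names from the schema and passed to the cache's native query function.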

For MongoDB queries, all the books are available in the books collection. Each book has the following fields:

  • book(publisher, rights, language, book_shelf, gutenberg_book_id, date_issued, num_downloads, titles, subjects, authors, files, type)

