Gutenberg

Project Gutenberg corpus interface

Project description

Overview

This package contains a variety of scripts to make working with the Project Gutenberg body of public domain texts easier.

The functionality provided by this package includes:

Downloading texts using the Project Gutenberg API.
Cleaning up the texts: removing headers and footers.
Making meta-data about the texts easily accessible through a database.
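
To make the clean-up step concrete, here is a minimal sketch of header and footer removal. It is not this package's implementation: the function name strip_boilerplate and both regular expressions are assumptions for illustration, based on the "*** START OF ..." and "*** END OF ..." marker lines that appear in many Project Gutenberg plain-text files. Older files use different wording and would need extra patterns.

```python
import re

# Marker lines that conventionally delimit the body of a Project Gutenberg
# plain-text file. The wording varies between files, so these patterns are
# an assumption and deliberately loose, not the package's own rule set.
_START = re.compile(
    r"^\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
_END = re.compile(
    r"^\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_boilerplate(raw: str) -> str:
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = _START.search(raw)
    body_from = start.end() if start else 0
    end = _END.search(raw, body_from)
    body_to = end.start() if end else len(raw)
    return raw[body_from:body_to].strip()


sample = (
    "Licence preamble\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "\n"
    "Call me Ishmael.\n"
    "\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Licence text\n"
)
body = strip_boilerplate(sample)
```

Falling back to the full text when a marker is missing keeps the function safe to run on files that have already been cleaned.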

Installation

This project is on PyPI, so I’d recommend installing it from there with your favourite Python package manager. Either of the following commands works:

pip install gutenberg
easy_install gutenberg

If you want to install from source, you’ll need to clone this repository:

git clone https://github.com/c-w/Gutenberg.git && cd Gutenberg

Now, you should probably install the dependencies for the package and verify your install.

The recommended way of doing this is using the project’s makefile. The command make virtualenv will install all the required dependencies for the package in a local directory called virtualenv.
You might want to run the tests to see if everything installed correctly: make test.
Now run source virtualenv/bin/activate and you’re good to go.

Another setup task you might want to run is make docs to automatically generate some API documentation for the project. After running the command, you can enjoy your documentation by pointing your browser at docs/_build/html/index.html.

Usage

There are a number of programs demonstrating how to use this library in the scripts directory.

How to help

(Good first fix) Currently this library only makes use of the author and title meta-data exposed by Project Gutenberg and does not leverage information such as genre, publication date, etc. Making this information usable by the library is a pretty straightforward three-step process. First, the TextSource.textinfo_converter method needs to be extended to parse the new meta-data attributes. Second, the new attributes need to be wired through to the TextInfo class. Lastly, a new method leveraging the new meta-data source should be added to the Corpus class (such as texts_for_genre or texts_for_year). See #2.
It would be great if there were an option to make the text retrieval functions on the Corpus class (like texts_for_author) perform fuzzy matching so that small spelling mistakes can be corrected automatically. See #3.
The TextSource object should probably track its state so that it only yields every text once (unless explicitly requested to re-yield all texts from the start). See #4.
The library is in dire need of more tests and robustness fixes. See #5.
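
As a starting point for the fuzzy-matching idea above, the following sketch uses difflib from the Python standard library to map a misspelled author name onto the closest known one. The helper closest_author, the cutoff value and the sample author list are all hypothetical; nothing here is part of the Corpus class, and how the result would be wired into texts_for_author is left open.

```python
import difflib


def closest_author(query, known_authors, cutoff=0.8):
    """Return the known author name most similar to `query`, or None.

    `cutoff` is the minimum similarity ratio (0 to 1) that counts as a match.
    """
    # Compare case-insensitively, but return the original spelling.
    by_lower = {name.lower(): name for name in known_authors}
    matches = difflib.get_close_matches(
        query.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None


# Hypothetical author index; a real one would come from the meta-data database.
authors = ["Melville, Herman", "Austen, Jane", "Dickens, Charles"]

corrected = closest_author("Melvile, Herman", authors)  # one letter missing
unknown = closest_author("Tolstoy, Leo", authors)       # not in the index
```

Returning None instead of the nearest name when nothing clears the cutoff avoids silently substituting a different author.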

Limitations

This project deliberately does not include any natural language processing functionality. Consuming and processing the text is the responsibility of the client; this library merely focuses on offering a simple and easy-to-use interface to the works in the Project Gutenberg corpus. Any linguistic processing can easily be done client-side, e.g. using the TextBlob library.
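
To illustrate what client-side processing can look like, here is a small word-frequency sketch that uses only the Python standard library as a stand-in for a full NLP toolkit such as TextBlob. The function most_common_words and its tokenisation rule are assumptions for illustration; the input is any plain string, however it was obtained.

```python
import re
from collections import Counter


def most_common_words(text, n=3):
    """Return the `n` most frequent lower-cased words as (word, count) pairs."""
    # Very rough tokenisation: runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


text = "The whale. The sea, the ship and the whale."
top_two = most_common_words(text, 2)
```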

Project details

Release history

0.8.2: Dec 26, 2021
0.8.1: Apr 30, 2020
0.8.0: Aug 24, 2019
0.7.0: May 18, 2018
0.6.1: Jan 11, 2018
0.5.0: Apr 28, 2017
0.4.5: Feb 19, 2017
0.4.4: Feb 19, 2017
0.4.2: Jan 9, 2016
0.4.1: Dec 1, 2015
0.4.0: Mar 11, 2015
0.3.3: Feb 28, 2015
0.3.2: Feb 28, 2015
0.3.1: Feb 28, 2015
0.3: Feb 28, 2015
0.2.2: Jan 3, 2015
0.2.1: Nov 18, 2014 (this version)
0.2.0: Sep 29, 2014
0.1.1: Aug 3, 2014
0.1.0: Aug 3, 2014
0.0.0: Aug 3, 2014

Download files

Source Distribution

Gutenberg-0.2.1.tar.gz (15.7 kB view details)

Uploaded Nov 18, 2014 Source

File details

Details for the file Gutenberg-0.2.1.tar.gz.

File metadata

Download URL: Gutenberg-0.2.1.tar.gz
Upload date: Nov 18, 2014
Size: 15.7 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for Gutenberg-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`2e5dd76fba71f9f13b0c60a5c2cf4e162c62724285a0f1374abb2467f2c1c224`
MD5	`7e22ca98cb14b7a6b3f58aaece617046`
BLAKE2b-256	`ef0a79659f52ec0dd05818a3754bef7c9b4ddc8e8258189c43e8c54a74f76668`
