Open Book Genome Project

Welcome

Welcome to the Open Book Genome Project (OBGP) Sequencer™, an open-source Book Processing Pipeline of responsibly vetted community "modules" which classify, sequence, and fingerprint book fulltext to reveal public insights.

How it Works

Each month, the OBGP Sequencer™ gets run against the fulltext of more than 1M books, generating valuable public insights for book lovers and researchers around the globe. OBGP Sequencer™ consists of carefully vetted community-contributed modules which aim to responsibly help increase the discoverability and usefulness of books, e.g.:

  • Identifying URLs, ISBNs, and citations within the text
  • Generating word frequency mappings
  • Guessing grade reading levels
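
To make the first kind of module above concrete, here is a minimal standalone sketch of finding URLs and checksum-valid ISBNs in a block of text. It is not the Sequencer's module API: the function names and regular expressions are illustrative assumptions, and real book text (OCR noise, line breaks inside numbers) would need more care.

```python
import re

# Illustrative patterns only (assumptions, not taken from the OBGP codebase).
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")
# A digit, then 8-15 digits/hyphens/spaces, then a final digit or X.
ISBN_CANDIDATE_RE = re.compile(r"\b\d[\d\- ]{8,15}[\dXx]\b")


def isbn10_valid(chars):
    """Check the ISBN-10 checksum: weighted sum (10..1) must be divisible by 11."""
    if len(chars) != 10:
        return False
    total = 0
    for i, ch in enumerate(chars):
        if ch in "Xx":
            if i != 9:  # X is only allowed as the check character
                return False
            value = 10
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += (10 - i) * value
    return total % 11 == 0


def isbn13_valid(chars):
    """Check the ISBN-13 checksum: alternating weights 1 and 3, sum divisible by 10."""
    if len(chars) != 13 or not chars.isdigit():
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(chars))
    return total % 10 == 0


def find_isbns(text):
    """Return normalized (separator-free) ISBNs whose checksum is valid."""
    found = []
    for match in ISBN_CANDIDATE_RE.finditer(text):
        chars = re.sub(r"[\- ]", "", match.group())
        if isbn10_valid(chars) or isbn13_valid(chars):
            found.append(chars)
    return found


def find_urls(text):
    """Return URLs found in the text, with trailing punctuation trimmed."""
    return [url.rstrip(".,;") for url in URL_RE.findall(text)]


sample = "See ISBN 0-306-40615-2 or 978-0-306-40615-7 at https://archive.org/details/hpmor."
print(find_isbns(sample))
print(find_urls(sample))
```

Validating the checksum after a loose pattern match keeps the regular expression simple while filtering out most digit runs that merely look like ISBNs.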

Contributing a Module

  1. Please read the whitepaper and look through our community list of proposed or requested modules
  2. Propose a "module" by creating a GitHub issue
  3. Get the code: Fork this git repository and clone it to your workspace. Create a new branch for your module (named after its corresponding GitHub issue number and title, e.g. git checkout -b 12/module/find-isbns). Then install the project as described under Installation
  4. Add your new module to the modules/ directory
  5. Test your module locally using Internet Archive's unrestricted collection of ~800k books
  6. Open a Pull Request so your contribution may be reviewed

Questions?

Please open an issue and request a Slack invite.

Installation

Production

If you want to run the OBGP Sequencer™ pipeline, run:

pip install obgp

Development

git clone https://github.com/Open-Book-Genome-Project/sequencer.git  # get the code
virtualenv venv && source venv/bin/activate  # setup a virtual environment
cd sequencer  # change into project directory
pip install -e .  # install the library (and re-run in project root as you make changes)
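
After either install, a quick way to confirm that the package is visible to your interpreter is to ask Python whether the import name can be found. The sketch below assumes the import name is bgp, as in the usage examples in this README; the helper name is_importable is just for illustration.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# "bgp" is the import name used in this README's usage examples.
print("bgp importable:", is_importable("bgp"))
```

If this prints False inside your virtual environment, re-run the install step from the project root.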

Usage

Once you've installed either the production package or your development build, you may start Python and import the Sequencer from bgp.runner along with whatever modules you'd like.

Let's say you want to process the book https://archive.org/details/hpmor which has identifier hpmor on Archive.org. First, you would define your Sequencer as follows:

>>> from bgp.runner import Sequencer, NGramProcessor, WordFreqModule, STOP_WORDS
>>> s = Sequencer({
...     'words': NGramProcessor(modules={
...         'term_freq': WordFreqModule()
...     }, n=1, stop_words=STOP_WORDS)
... })

Then, you would pass the book identifier to the Sequencer's sequence method to get back a genome Sequence object:

>>> genome = s.sequence('hpmor')
>>> genome.results
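
For intuition about the configuration used above (n=1, a stop-word list, and a word frequency module), here is a standalone sketch of counting n-gram frequencies after removing stop words. It does not reproduce the library's implementation or the exact shape of genome.results; the tokenizer, the tiny stop-word set, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stop-word set; the library's STOP_WORDS is not reproduced here.
EXAMPLE_STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}


def ngrams(tokens, n):
    """Join each run of n consecutive tokens into one space-separated n-gram."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequencies(text, n=1, stop_words=frozenset()):
    """Lowercase and tokenize the text, drop stop words, then count n-grams."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in stop_words]
    return Counter(ngrams(tokens, n))


freqs = term_frequencies(
    "The cat and the hat and the cat", n=1, stop_words=EXAMPLE_STOP_WORDS
)
print(freqs.most_common())
```

With n=1 each n-gram is a single word, so the result is a plain word frequency mapping.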

If your internetarchive tool is configured against an account with sufficient permissions, you can then upload your genome results back to an Archive.org item (we'll arbitrarily pick the identifier bgp) by running:

>>> genome.write_results_to_item('bgp')

This will upload genome.results as JSON to <book_identifier>_results.json (e.g. hpmor_results.json) unless otherwise specified by overriding params.

You will then be able to see your file hpmor_results.json within the bgp item's file downloads: https://archive.org/download/bgp
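
If you would like to inspect the output before uploading anything, the naming convention described above is easy to mirror locally. The sketch below only writes a JSON file to a temporary directory; it does not call write_results_to_item or the internetarchive tool, and the results dictionary is made-up placeholder data rather than real Sequencer output.

```python
import json
import tempfile
from pathlib import Path


def results_filename(book_identifier):
    """Build the default file name described above: <book_identifier>_results.json."""
    return f"{book_identifier}_results.json"


def write_results_locally(results, book_identifier, directory):
    """Serialize results as JSON into the given directory and return the file path."""
    path = Path(directory) / results_filename(book_identifier)
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return path


# Placeholder data for illustration only (not real genome.results output).
example_results = {"words": {"term_freq": {"magic": 3}}}

with tempfile.TemporaryDirectory() as tmp:
    saved = write_results_locally(example_results, "hpmor", tmp)
    print(saved.name)
    print(json.loads(saved.read_text(encoding="utf-8")) == example_results)
```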

If you want to run a default test to make sure everything works, try:

>>> from bgp import test_sequence_item
>>> genome = test_sequence_item('hpmor')
>>> genome.results

Who we are

OBGP is an independent, community-run, not-for-profit committee of open-source and book enthusiasts who want to responsibly further the effort of making books as useful and accessible as possible.

Public Testing Datasets

Here's a corpus of ~800k Archive.org item identifiers of public domain books (of varying quality/appearance/language) which may be used for testing your module:

https://archive.org/download/869k-public-domain-book-urls-dataset/2017-12-26_public-domain-books-dataset_800k-identifiers.csv (~19 MB)
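
When testing a module, you will usually want a small random sample of identifiers rather than the whole corpus. The sketch below reads identifiers from an already-downloaded copy of the CSV and draws a reproducible sample. The file layout is an assumption (one identifier per row in the first column, no header row), so check the file and adjust the column argument if needed; the function names are illustrative.

```python
import csv
import io
import random


def read_identifiers(csv_file, column=0):
    """Collect non-empty values from one column of an open CSV file.

    Assumes one identifier per row in the given column and no header row.
    """
    identifiers = []
    for row in csv.reader(csv_file):
        if len(row) > column and row[column].strip():
            identifiers.append(row[column].strip())
    return identifiers


def sample_identifiers(identifiers, k, seed=0):
    """Draw up to k identifiers at random; a fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    return rng.sample(identifiers, min(k, len(identifiers)))


# Stand-in for an open file handle on the downloaded CSV.
fake_csv = io.StringIO("hpmor\nexamplebook01\n\nexamplebook02\n")
ids = read_identifiers(fake_csv)
print(len(ids), sample_identifiers(ids, 2))
```

With a real download, replace the io.StringIO stand-in with open(path, newline="", encoding="utf-8").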

