Tools for fetching OCRed text of Library of Congress items.

README

locr fetches the full text of Library of Congress (LOC) items from their OCR files. It returns the text when it is found, and None otherwise.

Usage

It can take as input either a result item from a JSON API response or the URL of an item:

import requests

from locr import Fetcher

# From item or resource URL
Fetcher.full_text_from_url('https://www.loc.gov/resource/mss85943.001811/')

# From search result
# See https://libraryofcongress.github.io/data-exploration/requests.html
url = 'https://www.loc.gov/search/?fo=json&fa=subject:cats'
response = requests.get(url)
# Decode the JSON body first; a requests.Response is not subscriptable.
results = response.json()['results']
Fetcher(results[0]).full_text()

Note that the above example is not guaranteed to work. In particular, not all objects have online text available, so the first search result may have no full text to fetch.

Fetcher may raise the following exceptions:

  • ObjectNotOnline: when the object does not have any online formats.
  • AmbiguousText: when multiple fulltext options are found.
  • UnknownFormat: when locr is not sure how to handle the fulltext link's filetype.

If you encounter these exceptions, kindly file an issue or open a PR about the newly discovered edge case. Thanks.
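
A minimal sketch of handling these outcomes when fetching many items. The README does not say where the exception classes are imported from, so the helper below takes the fetch function and the exception classes as arguments. The name `try_full_text` and its parameters are hypothetical and not part of locr.

```python
from typing import Callable, Optional, Tuple, Type


def try_full_text(
    fetch: Callable[[str], Optional[str]],
    url: str,
    expected_errors: Tuple[Type[Exception], ...],
) -> Tuple[Optional[str], Optional[str]]:
    """Call fetch(url) and return (text, problem).

    Exactly one of the two values is None: text is set on success,
    problem holds a short reason when no text could be retrieved.
    """
    try:
        text = fetch(url)
    except expected_errors as exc:
        # One of the anticipated edge cases; report the exception's class name.
        return None, type(exc).__name__
    if text is None:
        # The fetcher is documented to return None when no text is found.
        return None, "no text found"
    return text, None
```

With locr installed, one would presumably pass `Fetcher.full_text_from_url` as `fetch` and a tuple of the three exception classes listed above as `expected_errors`. Any other exception still propagates, which keeps newly discovered edge cases visible.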

Why LOCR?

The Library of Congress has put OCRed full text online for many of its items. However:

  • the API does not in general return the URLs to these items
  • OCRed text exists on different servers, with different URL formats; there is not one single way to construct the relevant URL for an item

While full text is easy to retrieve via the web site for a single item, perhaps you, like me, would like to fetch it programmatically.

Development

This package has a humiliating lack of tests, and I have done nothing to verify appropriate versions for dependencies. It really can use your help. PRs welcome.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

locr-0.4.4.tar.gz (4.7 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

locr-0.4.4-py2.py3-none-any.whl (5.3 kB)

Uploaded Python 2, Python 3

File details

Details for the file locr-0.4.4.tar.gz.

File metadata

  • Download URL: locr-0.4.4.tar.gz
  • Upload date:
  • Size: 4.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.9.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for locr-0.4.4.tar.gz
Algorithm Hash digest
SHA256 84ed6fc24a4ff4cf93878201b3fa962fd3b1650d396c7803507a2b9797257312
MD5 5efbadf13d5b9b19e81a81395deab67a
BLAKE2b-256 2bcf9a3f9b29e532d2d26ef576a6165b3fbcb6dd10fbcf5a054a942d0750a752
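
The SHA256 digest listed above can be checked against a downloaded copy of the source distribution. The following is a minimal sketch using only the standard library. The function names `sha256_of` and `matches_expected` are illustrative, and the file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest listed above for locr-0.4.4.tar.gz.
EXPECTED_SHA256 = "84ed6fc24a4ff4cf93878201b3fa962fd3b1650d396c7803507a2b9797257312"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's SHA256 digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("locr-0.4.4.tar.gz")` would check an archive saved in the current directory, assuming it was downloaded under that name.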


File details

Details for the file locr-0.4.4-py2.py3-none-any.whl.

File metadata

  • Download URL: locr-0.4.4-py2.py3-none-any.whl
  • Upload date:
  • Size: 5.3 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.9.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for locr-0.4.4-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 aebcf14b7f9902ecc0ce36eaeb295bb1965d29a3354d3c905c43208f82290dc7
MD5 3ed41eaf12380422f50d5eac358e8804
BLAKE2b-256 1b44f929b34ecbf396a189e2667d71ed31094646fcd849cf5cd5e5296d99b154

