
A package for curating doc file collections, with the ability to sync with YouTube and archive.org doc items.



doc curation

A package for curating doc file collections. Prominent features:

  • Scrape texts from various sites, such as Wikisource. See example here. (PS: Consider contributing to the raw_etexts repo.)

  • OCR a PDF with Google Drive. The PDF is automatically split into 25-page segments, and each segment is OCRed individually. See usage example here, function here.
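The 25-page splitting mentioned above can be illustrated with a small sketch. This is not the package's own implementation; the function name `page_ranges` and its parameters are hypothetical, and the code only shows how page ranges for such segments might be computed.

```python
def page_ranges(total_pages, chunk_size=25):
    """Return 1-based, inclusive (start, end) page ranges.

    Hypothetical helper for illustration only; doc_curation may split
    PDFs differently.
    """
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size must be > 0")
    return [
        # The last segment may be shorter than chunk_size.
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


if __name__ == "__main__":
    # A 60-page PDF yields two full segments and one 10-page remainder.
    print(page_ranges(60))
```

Each range could then be passed to whichever PDF library does the actual splitting, with every segment uploaded and OCRed separately.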

For users

Installation or upgrade:

  • For the stable version: pip install doc_curation -U

  • For the latest code: pip install git+https://github.com/sanskrit-coders/doc_curation/@master -U


Usage:

  • Enable the Google Drive API and download a service account key file that has Google Drive API access.

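from doc_curation import pdf

# Path to the PDF to be OCRed.
pdf_file = '/home/file.pdf'
# Service account key file with Google Drive API access.
key_file = '/home/key.json'
# Split the PDF into 25-page segments and OCR each one on Google Drive.
pdf.split_and_ocr_on_drive(pdf_file, key_file)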
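Before starting a long OCR run, it may help to confirm that the key file really is a service account key. The sketch below is an assumption-laden illustration, not part of doc_curation: the helper name `looks_like_service_account_key` is hypothetical, and it only checks for fields that Google service account key files normally contain (`type`, `client_email`, `private_key`).

```python
import json
from pathlib import Path

# Fields normally present in a Google service account key file.
REQUIRED_FIELDS = ("type", "client_email", "private_key")


def looks_like_service_account_key(key_file):
    """Return True if key_file is readable JSON shaped like a service account key.

    Hypothetical helper; it does not contact Google or verify API access.
    """
    path = Path(key_file)
    if not path.is_file():
        return False
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Unreadable file or invalid JSON.
        return False
    if not isinstance(data, dict):
        return False
    return data.get("type") == "service_account" and all(
        data.get(field) for field in REQUIRED_FIELDS
    )
```

A caller could run this check on key_file and invoke pdf.split_and_ocr_on_drive only when it returns True. A passing check does not guarantee that the Drive API is enabled for the account.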

For contributors

Contact

Have a problem or question? Please head to GitHub.

Packaging

  • ~/.pypirc should contain your PyPI login credentials.

python setup.py bdist_wheel
twine upload dist/* --skip-existing

Build documentation

  • Sphinx HTML docs can be generated with: cd docs; make html

Testing

Run pytest in the root directory.

Auxiliary tools


Built Distribution

doc_curation-0.0.8-py3-none-any.whl (38.9 kB)
