A package for curating doc file collections, with the ability to sync with YouTube and archive.org doc items.
doc curation
A package for curating doc file collections. Prominent features:
Scrape texts off various sites, such as Wikisource. See example here. (PS: Consider contributing to the raw_etexts repo.)
OCR a PDF with Google Drive. The PDF is automatically split into 25-page pieces, and each piece is OCRed individually. See usage example here, function here.
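The 25-page splitting mentioned above can be pictured as computing consecutive page ranges. The following is only a minimal sketch of that idea, not the package's actual implementation; `page_chunks` and its parameters are hypothetical names introduced for illustration.

```python
from typing import List, Tuple


def page_chunks(total_pages: int, chunk_size: int = 25) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges covering a document.

    Hypothetical helper: it only illustrates how a PDF could be cut into
    fixed-size pieces before each piece is sent to OCR.
    """
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size must be > 0")
    ranges = []
    for first in range(1, total_pages + 1, chunk_size):
        # The final piece may be shorter than chunk_size.
        last = min(first + chunk_size - 1, total_pages)
        ranges.append((first, last))
    return ranges


if __name__ == "__main__":
    # A 60-page PDF yields two full pieces and one 10-page remainder.
    print(page_chunks(60))
```

Keeping the pieces small means a failed OCR request only needs to be retried for one range rather than for the whole file.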
For users
Manually and periodically generated docs here
For detailed examples and help, please see individual module files in this package.
Installation or upgrade:
For the stable version: pip install doc_curation -U
For the latest code: pip install git+https://github.com/sanskrit-coders/doc_curation/@master -U
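After installing or upgrading, it can be handy to confirm which version is active. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+); `installed_version` is a hypothetical helper name, and the distribution name is assumed to be `doc_curation` as in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "doc_curation") -> Optional[str]:
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "doc_curation is not installed")
```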
Web.
Usage:
Enable the Google Drive API and download a service account key file that has Google Drive API access.
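# Import the submodule directly so that the name used below is actually bound.
from doc_curation.pdf import drive_ocr
pdf_file = '/home/file.pdf'
key_file = '/home/key.json'
# Splits the PDF, uploads the pieces to Google Drive and OCRs each one.
drive_ocr.split_and_ocr_on_drive(pdf_file, key_file)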
Usage of google_vision_pdf.py to OCR a PDF into txt files:
Follow the instructions here: https://cloud.google.com/vision/docs/before-you-begin.
Make sure to set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the JSON file containing your service account key.
Example:
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"
Invoke the script, passing in the input file. E.g.:
python3 google_vision_pdf.py --input-file <input.pdf>
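Before invoking the script, a quick pre-flight check can catch a missing or mistyped credentials path. This is a minimal sketch and not part of google_vision_pdf.py; `credentials_problem` is a hypothetical helper, and it only checks that the variable is set and points at an existing file, not that the key itself is valid.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def credentials_problem(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a description of what is wrong with the credentials setup, or None."""
    env = os.environ if env is None else env
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    path = Path(value).expanduser()
    if not path.is_file():
        return f"key file not found: {path}"
    return None


if __name__ == "__main__":
    problem = credentials_problem()
    print(problem if problem else "credentials path looks usable")
```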
For contributors
Contact
Have a problem or question? Please head to GitHub.
Packaging
~/.pypirc should have your PyPI login credentials.
python setup.py bdist_wheel
twine upload dist/* --skip-existing
Build documentation
Sphinx HTML docs can be generated with: cd docs; make html
Testing
Run pytest in the root directory.
Auxiliary tools
Built Distribution
Hashes for doc_curation-0.1.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | e74b544bc095156feaf4f1236f1935b9220b9a86fac4f672e8a44496d3be4d93
MD5 | 98d18e79fca16c165b3bbcd0584bf783
BLAKE2b-256 | d0f7f5a05db4fefe44a729100439a603b30739b277fd14c8ff4417a30d0ca872
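A downloaded wheel can be compared against the SHA256 digest listed above. The sketch below uses only the standard library `hashlib`; the helper names and the local file path are assumptions made for illustration.

```python
import hashlib


def sha256_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in blocks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case and padding."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path of the downloaded wheel.
    wheel = "doc_curation-0.1.3-py3-none-any.whl"
    expected = "e74b544bc095156feaf4f1236f1935b9220b9a86fac4f672e8a44496d3be4d93"
    print(matches_sha256(wheel, expected))
```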