Tools to extract information from digitized historical documents

Quipucamayoc: tools for digitizing historical data

quipucamayoc is a Python package that simplifies the extraction of historical data from scanned images and PDFs. It is designed to be modular, so it can be combined with existing tools and easily extended by users.

For an overview of how to use quipucamayoc to digitize historical data, see this research article, which, among other things, details the steps involved and the methods used, and provides practical examples. For a user guide, documentation, and installation instructions, see http://scorreia.com/software/quipucamayoc/ (TODO).

If you want to contribute by improving the code or extending its functionality (very welcome!), head here.

Installation

Pip

To manage quipucamayoc using pip, open the command line and run:

  • pip install quipucamayoc to install
    • pip install "quipucamayoc[dev]" to include extra dependencies used when developing the code
  • pip install -U quipucamayoc to upgrade
  • pip uninstall quipucamayoc to remove

Note that quipucamayoc has been tested against Python 3.10 and newer versions, but should also work with Python 3.9.
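
To confirm that the install worked and that the interpreter meets the version note above, a minimal sketch using only the standard library might look like this. The helper name check_environment is hypothetical and not part of quipucamayoc.

```python
import sys
from importlib import metadata


def check_environment(package="quipucamayoc", min_python=(3, 9)):
    """Return (python_ok, installed_version_or_None).

    Hypothetical helper for illustration; it is not provided by quipucamayoc.
    """
    # Compare the running interpreter against the minimum supported version.
    python_ok = sys.version_info >= min_python
    try:
        # Look up the version recorded by pip for the installed distribution.
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    return python_ok, version


# Example usage:
# python_ok, version = check_environment()
# print("Python OK:", python_ok, "| quipucamayoc version:", version)
```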

Git Install

After cloning the repo to your computer and navigating to the quipucamayoc folder, run:

  • pip install . to install the package locally
  • pip install -e . to install locally with a symlink so changes are automatically updated (recommended for developers)

After installation

AWS

AWS configuration is quite cumbersome, so it has been automated. To set it up, follow these steps:

  1. Download and install the aws command line interface (CLI). Update: quipucamayoc installs the awscli package so this step might not be necessary anymore.
  2. Configure your credentials with aws configure. This requires an Amazon/AWS account.
  3. From the command line, run the quipucamayoc command quipu aws install

Notes:

  • You can avoid step 1 by directly writing your credentials to the credentials file (see https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html).
  • Step 3, and a test of the resulting setup, are also available from within Python through the setup_textract() and test_textract() functions.
  • If you want to remove all quipucamayoc artifacts from your AWS account, you can run quipu aws uninstall from the command line.
  • The default AWS region is us-east-1. To use other regions, use the --region <name> option.
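
As an illustration of the first note, the credentials file is a plain INI file with one section per profile, so it can be written with the standard library alone. This is a sketch under assumptions: the helper name write_aws_credentials is hypothetical, and the key values shown in the usage comment are placeholders, not real credentials.

```python
import configparser
from pathlib import Path


def write_aws_credentials(access_key_id, secret_access_key, path, profile="default"):
    """Write (or update) one profile in an AWS-style credentials file.

    Hypothetical helper for illustration; it is not provided by quipucamayoc.
    """
    path = Path(path)
    # Disable interpolation so that a '%' in a secret is stored verbatim.
    config = configparser.ConfigParser(interpolation=None)
    if path.exists():
        # Preserve any profiles that are already in the file.
        config.read(path)
    config[profile] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        config.write(handle)
    return path


# Example usage (placeholder values; the AWS CLI reads ~/.aws/credentials):
# write_aws_credentials("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY",
#                       Path.home() / ".aws" / "credentials")
```

Since this file holds secrets, it is worth restricting its permissions to your own user account.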

Usage

  • From the command line, you can extract tables using AWS via quipu extract-tables --filename <myfile.pdf>
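
To run the same command over files from a Python script, one option is to call the command line tool through subprocess. The sketch below only uses the extract-tables command and --filename option shown above; the function names are hypothetical, and it assumes quipu is on your PATH and AWS has been configured.

```python
import shutil
import subprocess


def build_extract_command(pdf_path):
    """Build the argument list for the documented extract-tables command."""
    return ["quipu", "extract-tables", "--filename", str(pdf_path)]


def extract_tables(pdf_path):
    """Run quipu on one file and return the completed process.

    Hypothetical wrapper for illustration; it is not provided by quipucamayoc.
    """
    if shutil.which("quipu") is None:
        raise FileNotFoundError("the quipu command was not found on PATH")
    # check=True raises CalledProcessError if quipu exits with an error.
    return subprocess.run(
        build_extract_command(pdf_path),
        check=True,
        capture_output=True,
        text=True,
    )


# Example usage:
# result = extract_tables("myfile.pdf")
# print(result.stdout)
```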

TODO

  • Automatically set up Textract pipeline
  • Expose key functions as command line tools
  • Allow parallel (async?) tasks. Useful for OpenCV (CPU-intensive) and Textract calls (IO-intensive). Consider also uvloop
  • Include Poppler by default on Windows
  • Add mypy/(flake8|black)
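
Until the parallel-tasks item is implemented, IO-bound work such as waiting on remote calls can be spread over threads from user code. The following is a generic standard-library sketch, not how quipucamayoc does or will do it; map_in_threads is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def map_in_threads(func, items, max_workers=4):
    """Apply func to every item using a thread pool, keeping input order.

    Threads suit IO-bound calls; CPU-bound work would usually need
    processes (for example ProcessPoolExecutor) instead.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(func, items))


# Example usage, with any function that processes a single file:
# results = map_in_threads(process_one_file, ["a.pdf", "b.pdf", "c.pdf"])
```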

Contributing

Feel free to submit pull requests. For consistency, code should comply with PEP 8 (as long as it's reasonable), and with the style guides by @kennethreitz and Google. Read more here.

Citation

(Download BibTeX file here)

As text

  • Sergio Correia, Stephan Luck: “Digitizing Historical Balance Sheet Data: A Practitioner's Guide”, 2022; arXiv:2204.00052.

As BibTeX

@misc{quipucamayoc,
  author = {Correia, Sergio and Luck, Stephan},
  title = {Digitizing Historical Balance Sheet Data: A Practitioner's Guide},
  year = {2022},
  eprint = {arXiv:2204.00052},
  journal = {arXiv preprint arXiv:2204.00052}
}

Acknowledgments

Quipucamayoc builds on, and was heavily inspired by, the work and improvements of many users and developers, such as:

It also relies for most of its work on the following open source projects:

License

Quipucamayoc is developed under the GNU Affero GPL v3 license.

Why "quipucamayoc"?

The quipucamayocs were the Inca empire officials in charge of deciphering (amongst other things) accounting information stored in quipus. Our goal for this package is to act as a sort of quipucamayoc, helping researchers decipher and extract historical information, particularly balance sheets and numerical records.
