Tools to extract information from digitized historical documents
Quipucamayoc: tools for digitizing historical data
quipucamayoc is a Python package that simplifies the extraction of historical data from scanned images and PDFs. It is designed to be modular, so it can be used together with other existing tools and can be easily extended by users.
For an overview of how to use quipucamayoc to digitize historical data, see this research article, which, amongst other things, details the steps involved and the methods used, and provides practical examples.
For a user guide, documentation, and installation instructions, see http://scorreia.com/software/quipucamayoc/ (TODO).
If you want to contribute by improving the code or extending its functionality (very welcome!), head here.
Installation
Pip
To manage quipucamayoc using pip, open the command line and run:
- `pip install quipucamayoc` to install
- `pip install "quipucamayoc[dev]"` to include extra dependencies used when developing the code
- `pip install -U quipucamayoc` to upgrade
- `pip uninstall quipucamayoc` to remove
Note that quipucamayoc has been tested against Python 3.10 and newer versions, but should also work with Python 3.9.
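As a quick check before installing, the version note above can be turned into a small script. This is only a sketch: the function name `support_level` and the three labels are made up for illustration and are not part of quipucamayoc.

```python
import sys

# Thresholds taken from the note above: tested on 3.10+, expected to work on 3.9.
TESTED_MINIMUM = (3, 10)
EXPECTED_MINIMUM = (3, 9)


def support_level(version=None):
    """Classify a (major, minor) version against the thresholds above.

    Hypothetical helper, not part of the quipucamayoc API.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor >= TESTED_MINIMUM:
        return "tested"
    if major_minor >= EXPECTED_MINIMUM:
        return "expected to work"
    # The README makes no statement about versions older than 3.9.
    return "unknown"


if __name__ == "__main__":
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {current}: {support_level()}")
```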
Git Install
After cloning the repo to your computer and navigating to the quipucamayoc folder, run:
- `pip install .` to install the package locally
- `pip install -e .` to install locally with a symlink so changes are automatically updated (recommended for developers)
After installation
AWS
AWS configuration is quite cumbersome, so it has been automated. To set it up, follow these four steps:
- Download and install the `aws` command line interface (CLI). Update: quipucamayoc installs the `awscli` package, so this step might not be necessary anymore.
- Configure your credentials with `aws configure`. This requires an Amazon/AWS account.
- From the command line, run the quipucamayoc command `quipu aws install`
Notes:
- You can avoid step 1 by directly [writing your credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) to the `credentials` file.
- Steps 3-4 are also available from within Python in the `setup_textract()` and `test_textract()` functions.
- If you want to remove all quipucamayoc artifacts from your AWS account, you can run `quipu aws uninstall` from the command line.
- The default AWS region is `us-east-1`. To use other regions, use the `--region <name>` option.
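For the note about writing credentials directly to the `credentials` file, a minimal sketch using only the standard library could look as follows. The helper name `write_credentials` is hypothetical and not part of quipucamayoc. The default path `~/.aws/credentials` and the two key names follow the standard AWS shared credentials file layout described in the linked AWS documentation. The demo at the bottom writes to a temporary folder, so it does not touch any real credentials.

```python
import configparser
import tempfile
from pathlib import Path


def write_credentials(access_key_id, secret_access_key, path=None, profile="default"):
    """Write (or update) one profile in an AWS shared credentials file.

    Hypothetical helper for illustration; not part of quipucamayoc.
    """
    path = Path(path) if path is not None else Path.home() / ".aws" / "credentials"
    # Interpolation is disabled so that key values are stored verbatim.
    config = configparser.ConfigParser(interpolation=None)
    config.read(path)  # a missing file is silently skipped
    config[profile] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        config.write(handle)
    path.chmod(0o600)  # restrict the file to the current user where supported
    return path


if __name__ == "__main__":
    # Demo with placeholder values in a temporary folder.
    with tempfile.TemporaryDirectory() as folder:
        target = write_credentials("EXAMPLE_KEY_ID", "EXAMPLE_SECRET", Path(folder) / "credentials")
        print(target.read_text())
```

Other profiles already present in the file are preserved, since the existing file is read before the chosen profile is replaced.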
Usage
- From the command line, you can extract tables using AWS via `quipu extract-tables --filename <myfile.pdf>`
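To process several files, the command above can be called once per PDF from a short script. The sketch below relies only on the `quipu extract-tables --filename` invocation shown above. The helper names `build_command` and `extract_folder` are hypothetical, and the script assumes the AWS setup has already been completed.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path):
    """Return the argument list for one `quipu extract-tables` call."""
    return ["quipu", "extract-tables", "--filename", str(pdf_path)]


def extract_folder(folder, dry_run=False):
    """Run the CLI once per PDF in `folder`, in sorted order.

    Returns the list of commands. With dry_run=True nothing is executed.
    """
    commands = [build_command(pdf) for pdf in sorted(Path(folder).glob("*.pdf"))]
    if not dry_run:
        if shutil.which("quipu") is None:
            raise RuntimeError("quipu not found on PATH; is quipucamayoc installed?")
        for command in commands:
            # check=True stops at the first file that fails.
            subprocess.run(command, check=True)
    return commands
```

Calling `extract_folder("scans", dry_run=True)` first is a cheap way to review which files would be sent to AWS before any calls are made; `scans` is a placeholder folder name.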
TODO
- Automatically set up Textract pipeline
- Expose key functions as command line tools
- Allow parallel (async?) tasks. Useful for OpenCV (CPU-intensive) and Textract calls (IO-intensive). Consider also uvloop
- Include Poppler by default on Windows
- Add mypy/(flake8|black)
Contributing
Feel free to submit pull requests. For consistency, code should comply with PEP 8 (as long as it's reasonable), and with the style guides by @kennethreitz and Google. Read more here.
Citation
As text
- Sergio Correia, Stephan Luck: “Digitizing Historical Balance Sheet Data: A Practitioner's Guide”, 2022; arXiv:2204.00052.
As BibTeX
@misc{quipucamayoc,
Author = {Correia, Sergio and Luck, Stephan},
Title = {Digitizing Historical Balance Sheet Data: A Practitioner's Guide},
Year = {2022},
eprint = {arXiv:2204.00052},
journal={arXiv preprint arXiv:2204.00052}
}
Acknowledgments
Quipucamayoc builds on, and was heavily inspired by, the work and improvements of many users and developers, such as:
It also relies for most of its work on the following open source projects:
License
Quipucamayoc is developed under the GNU Affero GPL v3 license.
Why "quipucamayoc"?
The quipucamayocs were the Inca empire officials in charge of deciphering (amongst other things) accounting information stored in quipus. Our goal for this package is to act as a sort of quipucamayoc, helping researchers decipher and extract historical information, particularly balance sheets and numerical records.