Website scrapers to retrieve HTR training datasets.
Arkindex Scrapers
Website scrapers to retrieve HTR datasets and publish them to Arkindex.
Installation
To install arkindex-scrapers from PyPI:
- Use a virtualenv (e.g. with virtualenvwrapper: mkvirtualenv -a . scrapers)
- Install scrapers as a package (e.g. pip install arkindex-scrapers)
Usage
When arkindex-scrapers is installed in your environment, the scrapers command becomes available. This command has 3 subcommands. Learn more about them using:
scrapers -h
Do It Yourself History
The diy subcommand retrieves images and transcriptions from collections available on the DIY History website.
Provide the ID of a collection as a positional argument and an output directory. This command will generate 1 JSON file per item. Each of these can be uploaded to Arkindex using the publish subcommand.
Europeana Transcribathon
The eu-trans subcommand retrieves images and transcriptions from stories available on the Europeana Transcribathon website.
By default, this command will look for stories on the whole website. You can restrict the search to a specific story using the --story_id argument. This command will generate 1 JSON file per story. Each of these can be uploaded to Arkindex using the publish subcommand.
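Before publishing, it can help to check what a scraping run produced. The sketch below is not part of arkindex-scrapers; it assumes the generated files sit directly in the output directory with a .json extension and follow the format documented in the "Publish to Arkindex" section. The function name summarize is hypothetical.

```python
import json
from pathlib import Path


def summarize(output_dir):
    """Yield (file name, item count, transcription count) for each JSON file.

    Assumption: files live directly in output_dir and use the "items" /
    "transcriptions" keys of the documented publish format.
    """
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            doc = json.load(handle)
        items = doc.get("items", [])
        # Count every transcription string attached to every child item.
        n_transcriptions = sum(len(item.get("transcriptions", [])) for item in items)
        yield path.name, len(items), n_transcriptions
```

Iterating over summarize("output") (with "output" standing for whichever directory was passed to the scraper) gives one tuple per file, which makes empty stories or items without transcriptions easy to spot.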
Publish to Arkindex
The publish subcommand publishes local JSON files scraped by other subcommands to an Arkindex instance.
Any JSON file is supported, provided that it respects the following format:
{
    "name": "", // Name of the element on Arkindex
    "metadata": [ // List of metadata to publish on the element
        {
            "type": "", // Arkindex type of the metadata
            "name": "", // Name of the metadata
            "value": "" // Value of the metadata
        },
        ...
    ],
    "items": [ // Elements published as children
        {
            "name": "", // Name of the element on Arkindex
            "metadata": [], // List of metadata to publish on the element
            "transcriptions": [ // List of transcriptions to publish on the element
                "", // Text of a transcription
                ...
            ],
            "iiif_url": "", // IIIF URL of the image (optional)
            "image_path": "" // Relative path towards the image file (optional)
        },
        ...
    ]
}
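A hand-written file can be checked against this format before it is published. The following is a minimal sketch using only the standard library, not the validation that the publish subcommand itself performs. The function names are hypothetical, and treating "name" as required while "metadata", "items" and "transcriptions" default to empty lists is an assumption. Note that the // comments shown above are explanatory only: a real file must be plain JSON for json.load to accept it.

```python
METADATA_KEYS = {"type", "name", "value"}


def check_metadata(metadata, where):
    """Return problems found in one 'metadata' list."""
    if not isinstance(metadata, list):
        return [f"{where}: 'metadata' should be a list"]
    errors = []
    for i, entry in enumerate(metadata):
        if not isinstance(entry, dict):
            errors.append(f"{where}: metadata[{i}] should be an object")
            continue
        missing = METADATA_KEYS - set(entry)
        if missing:
            errors.append(f"{where}: metadata[{i}] lacks {sorted(missing)}")
    return errors


def check_document(doc):
    """Return a list of human-readable problems; empty if none were found."""
    if not isinstance(doc, dict):
        return ["top level: should be an object"]
    errors = []
    if not isinstance(doc.get("name"), str):
        errors.append("top level: 'name' should be a string")
    errors += check_metadata(doc.get("metadata", []), "top level")
    items = doc.get("items", [])
    if not isinstance(items, list):
        return errors + ["top level: 'items' should be a list"]
    for i, item in enumerate(items):
        where = f"items[{i}]"
        if not isinstance(item, dict):
            errors.append(f"{where}: should be an object")
            continue
        if not isinstance(item.get("name"), str):
            errors.append(f"{where}: 'name' should be a string")
        errors += check_metadata(item.get("metadata", []), where)
        transcriptions = item.get("transcriptions", [])
        if not (isinstance(transcriptions, list)
                and all(isinstance(t, str) for t in transcriptions)):
            errors.append(f"{where}: 'transcriptions' should be a list of strings")
        # Both image keys are documented as optional; only check their type.
        for key in ("iiif_url", "image_path"):
            if key in item and not isinstance(item[key], str):
                errors.append(f"{where}: '{key}' should be a string")
    return errors
```

Passing the result of json.load to check_document returns an empty list when nothing suspicious was found.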
Learn more about all arguments of this subcommand using:
scrapers publish -h
Learn more about Arkindex's metadata, transcription and image system in its documentation.
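Since any JSON file in this format is accepted, data that did not come from one of the scrapers can be prepared by hand. The sketch below builds such a file from a folder of page images. Everything about the folder layout is hypothetical: each image is assumed to have an optional sibling .txt file with the same stem holding its transcription, and image_path is assumed to be relative to the folder where the JSON file is saved. Check scrapers publish -h for how paths are actually resolved.

```python
import json
from pathlib import Path


def build_document(name, pages_dir, image_suffix=".jpg"):
    """Build a dict in the documented format from a folder of page images."""
    items = []
    for image in sorted(Path(pages_dir).glob(f"*{image_suffix}")):
        # Hypothetical convention: page_01.jpg is transcribed in page_01.txt.
        text_file = image.with_suffix(".txt")
        transcriptions = []
        if text_file.exists():
            transcriptions.append(text_file.read_text(encoding="utf-8").strip())
        items.append({
            "name": image.stem,
            "metadata": [],
            "transcriptions": transcriptions,
            # Assumption: path relative to the directory holding the JSON file.
            "image_path": image.name,
        })
    return {"name": name, "metadata": [], "items": items}


def write_document(doc, destination):
    """Serialize the document as UTF-8 JSON."""
    text = json.dumps(doc, indent=2, ensure_ascii=False)
    Path(destination).write_text(text, encoding="utf-8")
```

Writing the result next to the images keeps the relative image_path values short; whether that matches what the publish subcommand expects should be confirmed on a small sample first.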
Contributing
Development
For development and testing purposes, it may be useful to install the project as an editable package with pip.
- Use a virtualenv (e.g. with virtualenvwrapper: mkvirtualenv -a . scrapers)
- Install scrapers as an editable package (e.g. pip install -e .)
Linter
Code syntax is analyzed before submitting the code.
To run the linter tools suite you may use pre-commit.
pip install pre-commit
pre-commit run -a
Run tests
Tests are executed with tox using pytest.
To install tox and run the test suite:
pip install tox
tox
To recreate the test virtual environment, use tox -r
Run a single test module: tox -- <test_path>
Run a single test: tox -- <test_path>::<test_function>