
Arkindex Scrapers

Website scrapers to retrieve HTR datasets and publish them to Arkindex.

Installation

arkindex-scrapers can be installed from PyPI:

  • Use a virtualenv (e.g. with virtualenvwrapper mkvirtualenv -a . scrapers)
  • Install scrapers as a package (e.g. pip install arkindex-scrapers)

Usage

Once arkindex-scrapers is installed in your environment, the scrapers command becomes available. It provides three subcommands: diy, eu-trans and publish, described below. Learn more about them using:

scrapers -h

Do It Yourself History

The diy subcommand retrieves images and transcriptions from collections available on the DIY History website.

Provide the ID of a collection as a positional argument, along with an output directory. The command generates one JSON file per item; each file can then be uploaded to Arkindex using the publish subcommand.
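
A minimal invocation might look like the following, where <collection_id> and <output_dir> are placeholders to fill in; check scrapers diy -h for the exact argument order:

scrapers diy <collection_id> <output_dir>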

Europeana Transcribathon

The eu-trans subcommand retrieves images and transcriptions from stories available on the Europeana Transcribathon website.

By default, this command looks for stories across the whole website. You can restrict the search to a specific story using the --story_id argument. The command generates one JSON file per story; each file can then be uploaded to Arkindex using the publish subcommand.
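
For instance, to scrape a single story, assuming eu-trans takes an output directory as a positional argument like diy does (<story_id> and <output_dir> are placeholders; see scrapers eu-trans -h):

scrapers eu-trans --story_id <story_id> <output_dir>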

Publish to Arkindex

The publish subcommand publishes local JSON files scraped by the other subcommands to an Arkindex instance. Any JSON file is supported, provided it follows this format:

{
    "name": "", // Name of the element on Arkindex
    "metadata": [ // List of metadata to publish on the element
        {
            "type": "", // Arkindex type of the metadata
            "name": "", // Name of the metadata
            "value": "" // Value of the metadata
        },
        ...
    ],
    "items": [ // Elements published as children
        {
            "name": "", // Name of the element on Arkindex
            "metadata": [], // List of metadata to publish on the element
            "transcriptions": [ // List of transcriptions to publish on the element
                "", // Text of a transcription
                ...
            ],
            "iiif_url": "", // IIIF URL of the image (optional)
            "image_path": "" // Relative path towards the image file (optional)
        },
        ...
    ]
}
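
As an illustration, a hypothetical file following this schema for a two-page document could look like the example below; every name and value is invented:

{
    "name": "Letter, 1918",
    "metadata": [
        {
            "type": "date",
            "name": "Date",
            "value": "1918-11-11"
        }
    ],
    "items": [
        {
            "name": "Page 1",
            "metadata": [],
            "transcriptions": [
                "Dear Mary, I hope this letter finds you well."
            ],
            "iiif_url": "https://iiif.example.org/letter/page1"
        },
        {
            "name": "Page 2",
            "metadata": [],
            "transcriptions": [
                "Yours faithfully, John."
            ],
            "image_path": "images/page2.jpg"
        }
    ]
}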

Learn more about all arguments of this subcommand using:

scrapers publish -h

Learn more about Arkindex's metadata, transcription and image system in its documentation.

Contributing

Development

For development and testing purposes, it may be useful to install the project as an editable package with pip.

  • Use a virtualenv (e.g. with virtualenvwrapper mkvirtualenv -a . scrapers)
  • Install scrapers as a package (e.g. pip install -e .)

Linter

Code syntax is analyzed before the code is submitted. To run the linter suite, use pre-commit:

pip install pre-commit
pre-commit run -a

Run tests

Tests are executed with pytest, via tox. To install and run tox:

pip install tox
tox

To recreate the test virtual environment, use tox -r.

Run a single test module: tox -- <test_path>
Run a single test: tox -- <test_path>::<test_function>
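
For example, with a hypothetical test module tests/test_diy.py containing a test_scrape function:

tox -- tests/test_diy.py
tox -- tests/test_diy.py::test_scrape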

