KB lab client

KB data lab

Note: This repository is under active development, is not fully (or at all) functional, and is provided as-is.

About

This repository provides and demonstrates tools for researchers preparing to gain on-premise access to digital archives at the National Library, and for anyone else who wants to access collections without active copyright. There are two main ways to access digital objects: using the HTTP API directly, or using the provided client written in Python. You can also create a Docker image based on the one below, which has the client installed, or install the client using pip or conda in your own container. The data available outside the National Library, currently at https://betalab.kb.se, does not have active copyright.

Installation

TLDR; - pip or conda

To install the client module using pip simply run

pip install kblab-client

To install using conda instead use the following

conda install kblab-client

Or add it to your dependencies in environment.yml

dependencies:
    - pip:
        - kblab-client

Then, see examples below.

TLDR; - Docker version

Start the environment using Docker. With the -v option below, the local directory ./data is mounted on /data in the container. Any change from within the container will be reflected in the local directory and vice versa.

docker container run -it -v "$(pwd)/data:/data" repository.kb.se/lab/client /bin/bash
d8fg7sjf4i # python

Then, see examples below.

From source

First check out the source code

git clone https://github.com/kungbib/kblab
cd kblab

Then either build and run the Docker image

docker build .
docker run -it <image id> /bin/bash

Or install the required packages and the Python client, optionally creating a virtual environment so as not to mess up your existing one.

python -m venv venv
source venv/bin/activate
pip install -r requirement.txt
(cd client && ./setup.py install)

Then, see examples below.

API

The API is a simple REST-based API that delivers JSON(-LD) describing packages and/or files with the addition of a search endpoint.

URIs

Examples

Finding packages

Packages may contain files of type Structure, Content or Meta, which contain structure information, content and metadata respectively (see below for examples). The meta and content files are indexed and can be searched either through the web interface or through the API. Content is indexed under content and metadata under meta.*.

Example: to get all packages tagged with SOU and created in 1927, use { "tags": "SOU", "meta.created": "1927" } in the API, or just tags:SOU AND meta.created:1927 in the web interface.
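The relationship between the two query forms can be sketched in Python. This is only an illustration: to_query_string is a hypothetical helper, not part of the kblab client, and it assumes that a query with several fields corresponds to field:value terms joined with AND, as the SOU example suggests.

```python
def to_query_string(query):
    """Render a dict query as a web-interface style query string.

    Hypothetical helper, not part of kblab-client. Assumes each
    field/value pair becomes `field:value` and pairs are joined with AND.
    """
    return " AND ".join(f"{field}:{value}" for field, value in query.items())


# The SOU example, written as a dict query
query = {"tags": "SOU", "meta.created": "1927"}
print(to_query_string(query))
```

Values containing spaces or special characters would likely need quoting or escaping, which this sketch does not attempt.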

Also: see examples below.

Data model

The National Library uses a package structure modeled on OAIS. A simplified representation in JSON-LD is provided as part of the response, in addition to information about the logical structure of the material (e.g. pages, covers), some metadata, links to the physical object, etc.

Indexing is experimental at this point so verify your results.

Structure documents

{
    "@id": "#1",
    "@type": "Part",
    "derived_from": "https://.../1927_1(librisid_13483334).pdf",
    "has_part": [
        {
            "@id": "#1-1",
            "@type": "Page",
            "has_part": [
                {
                    "@id": "#1-1-1",
                    "@type": "Area",
                    "has_part": [
                        {
                            "@id": "#1-1-1-1",
                            "@type": "Text"
                        }
                    ]
                }
            ]
        }
    ]
}

Content documents

[
    {
        "@id": "#1-1-1-1", 
        "content": "..."
    }
]
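As a rough sketch of how the two document types might be combined, the snippet below walks a structure document and looks up the text for each Text node in a content document. It assumes that content entries share their @id with the Text nodes of the structure document, as the samples suggest; the walk helper is hypothetical and the data is an abbreviated copy of the samples.

```python
import json

# Abbreviated copy of the sample structure document
structure = json.loads("""
{
    "@id": "#1",
    "@type": "Part",
    "has_part": [
        {
            "@id": "#1-1",
            "@type": "Page",
            "has_part": [
                {
                    "@id": "#1-1-1",
                    "@type": "Area",
                    "has_part": [{"@id": "#1-1-1-1", "@type": "Text"}]
                }
            ]
        }
    ]
}
""")
# Sample content document
content = [{"@id": "#1-1-1-1", "content": "..."}]


def walk(node, path=()):
    """Yield (node, types of its ancestors) for a node and all its parts, depth first."""
    yield node, path
    for child in node.get("has_part", []):
        yield from walk(child, path + (node["@type"],))


# Index content entries by @id for quick lookup
text_by_id = {entry["@id"]: entry["content"] for entry in content}

# Collect (id, ancestor types, text) for every Text node
texts = []
for node, path in walk(structure):
    if node["@type"] == "Text":
        texts.append((node["@id"], path, text_by_id.get(node["@id"], "")))

print(texts)
```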

Meta documents

{
    "created": "1923",
    "title": "An example"
}

Python 3.7 client

Initializing archive

from kblab import Archive

# connect to betalab. Use parameter auth=(username, password) for authentication
a = Archive('https://betalab.kb.se')

Caveat: if you get an error about "certificate verify failed", you may need to update the root certificates on your platform. You can also add the following lines to your code. Please note that this is NOT ADVISED; it is better to add the correct root certificates.

import kblab
kblab.VERIFY_CA=False

Searching content and iterating over packages

for package_id in a.search({ 'content': 'test' }):
    package = a.get(package_id)

    # do something with package
    ...

Listing and getting package content

# iterate over the names of the files in the package
for fname in package:
    content = package.get_raw(fname).read()

Docker images

Examples

Word count from 25 (unordered) issues of Aftonbladet

from collections import Counter
from kblab import Archive
from json import load

a = Archive('https://betalab.kb.se/')
c = Counter()

# loop over 25 issues of Aftonbladet
for package_id in a.search({ 'label': 'AFTONBLADET' }, max=25):
    print(package_id)
    p = a.get(package_id)

    # only packages that contain a content file contribute to the count
    if 'content.json' in p:
        for part in load(p.get_raw('content.json')):
            c.update(part.get('content', '').upper().split())

# print each word with its count, tab-separated
for word, count in c.items():
    print(word, count, sep='\t')
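The counting step can also be tried without network access. The sketch below applies the same logic to a made-up list shaped like a content document; count_words is a hypothetical helper, and the sample text is invented for illustration.

```python
from collections import Counter


def count_words(parts):
    """Count upper-cased, whitespace-separated words in a list of content parts."""
    counts = Counter()
    for part in parts:
        counts.update(part.get("content", "").upper().split())
    return counts


# Made-up sample in the shape of a content document
sample = [
    {"@id": "#1-1-1-1", "content": "the quick brown fox"},
    {"@id": "#1-1-1-2", "content": "The lazy dog"},
    {"@id": "#1-1-1-3"},  # a part without content contributes nothing
]

counts = count_words(sample)

# most_common sorts by descending count, which is often more useful than raw order
for word, count in counts.most_common(3):
    print(word, count, sep="\t")
```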

Parallelization

When processing large result sets, parallelization can be crucial. This can be achieved either by using the multiprocessing module or by using the map method on the search result with the parameter multi=True. A parallelized version of the example above could look like:

from collections import Counter
from kblab import Archive
from json import load
import kblab

a = Archive('https://betalab.kb.se/')
c = Counter()

def count(package_id):
    print(package_id)
    c = Counter()
    p = a.get(package_id)

    if 'content.json' in p:
        for part in load(p.get_raw('content.json')):
            c.update(part.get('content', '').upper().split())

    return c

# loop over 25 issues of Aftonbladet
for words in a.search({ 'label': 'AFTONBLADET' }, max=25).map(count, multi=True):
    c.update(words)

for word,count in c.items():
    print(word, count, sep='\t')

The number of processes is specified by the processes parameter, which defaults to the number of cores on the machine running the program. For optimal performance, if the order of the results is not important, add the parameter ordered=False to map(...).

Parallelization using multiprocessing.Pool would look something like this:

...
from multiprocessing import Pool

def f(package_id):
    # same as above
    ...

with Pool() as pool:
    for words in pool.imap(f, a.search({ 'label': 'AFTONBLADET' }, max=25)):
        c.update(words)

...

IIIF support

Images in the archive can either be downloaded and dealt with directly in full resolution or they can be cropped and scaled using the IIIF protocol.
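Cropping and scaling are expressed in the request URL itself. The IIIF Image API defines the path pattern {identifier}/{region}/{size}/{rotation}/{quality}.{format}, and the sketch below builds such a URL. The base URL and identifier are placeholders, since where the archive exposes its IIIF endpoint is not stated here, and iiif_image_url is a hypothetical helper rather than part of the kblab client.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation=0, quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    region: "full" or "x,y,w,h" in pixels
    size:   "max", or e.g. "500," to scale to a width of 500 pixels
            (servers implementing Image API 2.0 use "full" instead of "max")
    """
    # The identifier must be percent-encoded, including any slashes
    encoded = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Placeholder endpoint and identifier, for illustration only
url = iiif_image_url("https://example.org/iiif", "some-image",
                     region="0,0,1000,500", size="500,")
print(url)
```

Here the top-left 1000x500 pixel region is requested and scaled down to a width of 500 pixels.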

Manifests

For some packages, IIIF manifests can be accessed by adding /_manifest to a URI.
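A minimal sketch of deriving a manifest URI from a package URI follows. The package URI is a placeholder, since the exact form of package URIs is not described here, and manifest_uri is a hypothetical helper.

```python
def manifest_uri(package_uri):
    """Append /_manifest to a package URI, avoiding a doubled slash."""
    return package_uri.rstrip("/") + "/_manifest"


# Placeholder package URI, for illustration only
print(manifest_uri("https://betalab.kb.se/some-package-id/"))
```

Because manifests exist only for some packages, a request for the resulting URI may fail and should be handled accordingly.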

Project details

Version 0.0.16a0 (pre-release), uploaded Sep 12, 2020


