
A library for working with IAM-OnDB database


Introduction

iam-ondb-parser is a small Python library that simplifies working with the IAM On-Line Handwriting Database (IAM-OnDB). It consists of a set of iterators that provide convenient access to strokes, images, transcription text, metadata, and other information.

The library can be used to:

  • generate training examples
  • explore the database
  • extract data from the database

Pre-requisites

  • Python 3
  • a local copy of The IAM-OnDB dataset

Downloading the dataset

Register an account on the website that hosts the IAM-OnDB:

http://www.fki.inf.unibe.ch/DBs/iamOnDB/iLogin/index.php

Open a terminal, create a folder and change your current working directory to that folder:

mkdir iam_ondb
cd iam_ondb

Download all of the dataset files. You will be prompted for your password for each download (replace <user_name> with the user name you registered with):

curl -u <user_name> -o original-xml-part.tar.gz http://www.fki.inf.unibe.ch/DBs/iamOnDB/data/original-xml-part.tar.gz
curl -u <user_name> -o writers.xml http://www.fki.inf.unibe.ch/DBs/iamOnDB/data/writers.xml
curl -u <user_name> -o lineStrokes-all.tar.gz http://www.fki.inf.unibe.ch/DBs/iamOnDB/data/lineStrokes-all.tar.gz
curl -u <user_name> -o lineImages-all.tar.gz http://www.fki.inf.unibe.ch/DBs/iamOnDB/data/lineImages-all.tar.gz
curl -u <user_name> -o original-xml-all.tar.gz http://www.fki.inf.unibe.ch/DBs/iamOnDB/data/original-xml-all.tar.gz
curl -u <user_name> -o ascii-all.tar.gz http://www.fki.inf.unibe.ch/DBs/iamOnDB/data/ascii-all.tar.gz

Or download files using the web browser from here: http://www.fki.inf.unibe.ch/databases/iam-on-line-handwriting-database/download-the-iam-on-line-handwriting-database

Extract each gzipped archive into its own directory:

mkdir -p original-xml-part && tar -zxvf original-xml-part.tar.gz -C original-xml-part
mkdir -p lineStrokes-all && tar -zxvf lineStrokes-all.tar.gz -C lineStrokes-all
mkdir -p lineImages-all && tar -zxvf lineImages-all.tar.gz -C lineImages-all 
mkdir -p original-xml-all && tar -zxvf original-xml-all.tar.gz -C original-xml-all 
mkdir -p ascii-all && tar -zxvf ascii-all.tar.gz -C ascii-all
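
The extraction steps above can also be scripted in Python with the standard tarfile module. This is a sketch, not part of the library; it skips any archive that is not present:

```python
import os
import tarfile

ARCHIVES = ["original-xml-part", "lineStrokes-all", "lineImages-all",
            "original-xml-all", "ascii-all"]

def extract_all(directory="."):
    """Extract each <name>.tar.gz in directory into a subdirectory <name>."""
    for name in ARCHIVES:
        archive = os.path.join(directory, name + ".tar.gz")
        target = os.path.join(directory, name)
        if not os.path.isfile(archive):
            print("missing:", archive)
            continue
        os.makedirs(target, exist_ok=True)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)

# Run in the directory that holds the downloaded .tar.gz files.
extract_all()
```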

When all archives are extracted, the directory containing the dataset should have the following layout:

ascii-all/
    ascii/
lineImages-all/
    lineImages/
lineStrokes-all/
    lineStrokes/
original-xml-all/
    original/
original-xml-part/
    original/
writers.xml

Installation

Without virtualenv (not recommended)

pip install iam-ondb-parser

Using virtualenv

Determine the location of the Python 3 executable. On Linux:

which python3

Create a new virtual environment (make sure to use Python 3.5 or newer):

virtualenv --python=<path/to/python3/executable> venv

Activate the environment on Linux:

. venv/bin/activate

Activate the environment in the command prompt on Windows:

venv\Scripts\activate

Finally, install the library from PyPI:

pip install iam-ondb-parser

Install manually by cloning the repository

Clone the repository:

git clone https://github.com/X-rayLaser/iam-ondb-parser
cd iam-ondb-parser

Install dependencies:

pip install -r requirements.txt

Quick start

Create an instance of the IAMonDB class by providing the path to the location of the IAM-OnDB dataset:

from iam_ondb import IAMonDB, bounded_iterator
path_to_db = 'iam_ondb'
db = IAMonDB(path_to_db)

Iteration

Iterate over training examples in a for loop, exiting after the first iteration:

for stroke_set, image, line in db:
    print('transcription line: {}'.format(line))
    image.show()
    break

You can also use the iterator directly:

it = iter(db)
stroke_set, image, line = next(it)
stroke_set, image, line = next(it)
stroke_set, image, line = next(it)
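
When the examples are exhausted, next() raises StopIteration, so direct iteration is typically guarded with try/except. An illustrative pattern, using a plain list of tuples as a stand-in for db:

```python
# Stand-in for db: any iterable of (stroke_set, image, line) tuples works.
fake_db = [("strokes-1", "image-1", "line one"),
           ("strokes-2", "image-2", "line two")]

it = iter(fake_db)
while True:
    try:
        stroke_set, image, line = next(it)
    except StopIteration:
        break  # no more examples
    print('transcription line: {}'.format(line))
```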

The dataset is quite big, so instead of iterating over all examples you might want to look at only the first 500, say. This is easy to do with the bounded_iterator wrapper function.

The following snippet iterates over only 5 examples; for each one it shows the image and prints the corresponding line of transcription text:

for stroke_set, image, line in bounded_iterator(db, stop_index=5):
    print(stroke_set)
    print(line)
    image.show()
    input('Press Enter to see the next example')
    print()
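
The wrapper's behavior can be sketched with itertools.islice. This is a hypothetical reimplementation for illustration, not the library's actual code:

```python
from itertools import islice

def bounded_iterator(iterable, stop_index):
    """Yield at most stop_index items from iterable, then stop."""
    yield from islice(iterable, stop_index)

# Works with any iterable, e.g. a generator of fake examples.
examples = ('example-{}'.format(i) for i in range(1000))
first_five = list(bounded_iterator(examples, stop_index=5))
print(first_five)  # → ['example-0', 'example-1', 'example-2', 'example-3', 'example-4']
```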

Iterate over lines of transcription text:

for line in bounded_iterator(db.get_text_lines(), 10):
    print(line)

Similarly, you can iterate over transcriptions (transcription objects contain additional meta-information):

for transcription in bounded_iterator(db.get_transcriptions(), stop_index=5):
    print(transcription.text + '\n')
    print()

Similarly, iterate over images:

for image in bounded_iterator(db.get_images(), stop_index=5):
    print(image.size)
    image.show()

There are also methods that return the IDs of different kinds of objects.

Get all writer ids:

ids = list(db.get_writer_ids())
print(ids)

Get first 10 ids of images:

ids = list(bounded_iterator(db.get_image_ids(), 10))
print(ids)

Get first 10 ids of stroke sets:

ids = list(bounded_iterator(db.get_stroke_set_ids(), 10))
print(ids)

Get first 10 ids of lines:

ids = list(bounded_iterator(db.get_text_line_ids(), 10))
print(ids)
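
The ID iterators pair naturally with the random-access methods covered next. The sketch below shows the pattern with a minimal stand-in class (only the method names come from the real API; the class and its data are invented for illustration):

```python
class FakeDB:
    """Stand-in exposing the two IAMonDB methods used below."""
    def __init__(self):
        self._stroke_sets = {'a01-000u-0{}'.format(i): '<StrokeSet {}>'.format(i)
                             for i in range(1, 5)}

    def get_stroke_set_ids(self):
        yield from sorted(self._stroke_sets)

    def get_stroke_set(self, stroke_set_id):
        return self._stroke_sets[stroke_set_id]

db = FakeDB()
# Fetch the first three stroke sets by their IDs.
fetched = [db.get_stroke_set(i) for i in list(db.get_stroke_set_ids())[:3]]
print(fetched)  # → ['<StrokeSet 1>', '<StrokeSet 2>', '<StrokeSet 3>']
```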

Random access

Besides iterating over a set of objects, you can get a particular object by its ID.

Get an image:

db.get_image('a01-030z-01').show()

Get a stroke set:

db.get_stroke_set('a01-030z-01')

Get a line:

db.get_text_line('a01-030z-01')

Get information about a writer:

db.get_writer('10051')

