
NUM Miner (Tool to create open dataset for Handwritten Text Recognition)


NUMiner



NUMiner is a Python library that creates an MNIST-like training dataset for Handwritten Text Recognition research.

Installation

Use the package manager pip to install numiner.

$ pip install numiner

Use the package manager pipenv to install numiner.

$ pipenv install numiner

Use the package manager poetry to install numiner.

$ poetry add numiner

How To Use

The package has two main modes: sheet and letter.

sheet - takes <source>, a path to either a folder of scanned sheet images or a single image file, and saves the processed images to the <result> path

$ numiner -s/--sheet <source> <result>

letter - takes <source>, a path to either a folder of cropped raw images or a single image file, and saves the processed images to the <result> path

$ numiner -l/--letter <source> <result>

You can also override the default sheet labels by providing a JSON file:

$ numiner --labels path/to/labels.json -s path/to/source path/to/result
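
If you drive numiner from a Python script, the invocations above can be assembled as an argument list. This is only a sketch: `build_numiner_command` is a hypothetical helper, not part of numiner, and it uses only the `--sheet`, `--letter` and `--labels` flags shown above.

```python
from typing import List, Optional


def build_numiner_command(
    mode: str, source: str, result: str, labels: Optional[str] = None
) -> List[str]:
    """Build the argument list for a numiner run (hypothetical helper).

    mode   -- "sheet" or "letter", matching the two modes described above
    labels -- optional path to a .json file that overrides the default labels
    """
    flags = {"sheet": "--sheet", "letter": "--letter"}
    if mode not in flags:
        raise ValueError(f"mode must be 'sheet' or 'letter', got {mode!r}")
    cmd = ["numiner"]
    if labels is not None:
        cmd += ["--labels", labels]
    cmd += [flags[mode], source, result]
    return cmd


# Example: the same call as the --labels command line above.
print(build_numiner_command(
    "sheet", "path/to/source", "path/to/result", labels="path/to/labels.json"
))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once numiner is installed.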

To list all available options, run:

$ numiner --help

usage: numiner [-h] [-v] [-s <source> <result>] [-l <source> <result>] [-c <path>]

optional arguments:
  -h, --help                    show this help message and exit
  -v, --version                 show program's version number and exit
  --clean <path>
  -s/--sheet <source> <result>  a path to a folder or file that's holding the <source>
                                sheet image(s) & a path to a folder where all <result>
                                images will be saved
  -l/--letter <source> <result> a path to a folder or a file that's holding the cropped
                                image(s) & a path to a folder where all <result> images
                                will be saved
  --labels <path>               a path to .json file that's holding top to bottom, left
                                to right labels of the sheet with their ids

The convert subcommand has its own options:

$ numiner convert --help

usage: numiner convert [-h] -p <src> <dest> SIZE RATIO

positional arguments:
  SIZE                  number of images that each class contains
  RATIO                 test, train or percentage of the test data
                        in that case the rest of it will become
                        train data

optional arguments:
  -h, --help            show this help message and exit
  -p <src> <dest>, --paths <src> <dest>
                        source and destination paths
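
To make SIZE and RATIO concrete, here is a rough sketch of a per-class train/test split. It is not numiner's implementation: `split_classes` is a hypothetical function, and it assumes RATIO is given as a percentage between 0 and 100, which the help text does not spell out.

```python
import random
from typing import Dict, List, Tuple


def split_classes(
    images_by_class: Dict[str, List[str]],
    size: int,
    test_percent: float,
    seed: int = 0,
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Pick `size` images per class and split them into train and test sets.

    test_percent -- share of each class that goes to the test set (assumed
                    to be 0-100); the rest becomes train data.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train: Dict[str, List[str]] = {}
    test: Dict[str, List[str]] = {}
    for label, paths in images_by_class.items():
        if len(paths) < size:
            raise ValueError(f"class {label!r} has fewer than {size} images")
        chosen = rng.sample(paths, size)  # random subset without repeats
        n_test = round(size * test_percent / 100)
        test[label] = chosen[:n_test]
        train[label] = chosen[n_test:]
    return train, test


# Example: 8 images per class, a quarter of them held out for testing.
data = {c: [f"{c}_{i}.png" for i in range(10)] for c in ("a", "b")}
train, test = split_classes(data, size=8, test_percent=25)
print(len(train["a"]), len(test["a"]))
```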

Sample Sheet image

You can also get the empty sheet file from here.

Extracted letters from the sheet

Final image processing order

The processing follows the same approach EMNIST used when creating its dataset from the NIST SD images.

  1. The letter is extracted from the sheet.
  2. The original image is converted to a binary image.
  3. The letter is fitted into a square with a 2-pixel-wide border on each side, preserving its aspect ratio.
  4. The result of the previous step is resized to 28x28 and thresholded to produce the final image.
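
Steps 2 to 4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not numiner's actual code: the function names are hypothetical, images are nested lists of grayscale values with bright ink on a dark background, and nearest-neighbour sampling stands in for whatever interpolation the library really uses.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels, 0 (dark) to 255 (bright)


def binarize(img: Image, threshold: int = 128) -> Image:
    """Step 2: map every pixel to 0 or 255 (ink assumed bright on dark)."""
    return [[255 if p >= threshold else 0 for p in row] for row in img]


def fit_to_square(img: Image, border: int = 2) -> Image:
    """Step 3: crop to the ink's bounding box and centre it on a square
    canvas whose side is the longer box side plus `border` pixels per side.
    The letter is copied unscaled, so its aspect ratio is preserved."""
    rows = [y for y, row in enumerate(img) if any(row)]
    if not rows:
        raise ValueError("image contains no ink")
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    top, left = rows[0], cols[0]
    h, w = rows[-1] - top + 1, cols[-1] - left + 1
    side = max(h, w) + 2 * border
    canvas = [[0] * side for _ in range(side)]
    off_y, off_x = (side - h) // 2, (side - w) // 2
    for y in range(h):
        for x in range(w):
            canvas[off_y + y][off_x + x] = img[top + y][left + x]
    return canvas


def resize_nearest(img: Image, size: int = 28) -> Image:
    """Step 4a: resize a square image to size x size by nearest neighbour."""
    src = len(img)
    return [
        [img[y * src // size][x * src // size] for x in range(size)]
        for y in range(size)
    ]


def process_letter(img: Image, threshold: int = 128) -> Image:
    """Run steps 2-4 on one extracted letter."""
    square = fit_to_square(binarize(img, threshold))
    # Step 4b: with nearest-neighbour sampling the image is already binary;
    # the final threshold only matters with an interpolating resize.
    return binarize(resize_nearest(square), threshold)


# Example: a thin vertical stroke on a 10x6 canvas.
stroke = [[0] * 6 for _ in range(10)]
for y in range(1, 9):
    stroke[y][2] = 200
result = process_letter(stroke)
print(len(result), len(result[0]))
```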

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

If you want to read more about how this project came to life, you can check out my thesis report.

License

MIT

