Skip to main content

A framework and toolkit for automatic segmentation

Project description

Le Boucher d'Amsterdam

Boudams, or "Le boucher d'Amsterdam", is a deep-learning tool built for tokenizing Latin or Medieval French languages.

How to cite

An article has been published about this work : https://hal.archives-ouvertes.fr/hal-02154122v1

@unpublished{clerice:hal-02154122,
  TITLE = {{Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin}},
  AUTHOR = {Cl{\'e}rice, Thibault},
  URL = {https://hal.archives-ouvertes.fr/hal-02154122},
  NOTE = {working paper or preprint},
  YEAR = {2019},
  MONTH = Jun,
  KEYWORDS = {convolutional network ; scripta continua ; tokenization ; Old French ; word segmentation},
  PDF = {https://hal.archives-ouvertes.fr/hal-02154122/file/Evaluating_Deep_Learning_Methods_for_Tokenization_of_Scripta_Continua_in_Old_French_and_Latin%284%29.pdf},
  HAL_ID = {hal-02154122},
  HAL_VERSION = {v1},
}

How to

Install the usual way you install python stuff: python setup.py install (Python >= 3.6)).

The config file can be kickstarted using boudams template config.json, we recommend using the following settings :

  • linear-conv-no-pos for the model, as it is not limited by the input size;
  • normalize and lower to True depending on your dataset size.

The initial dataset is pretty small but if you want to build with your own, it's fairly simple : you need data in the following shape : "samesentence<TAB>same sentence" where the first element is the same than the second but with no space and they are separated by tabs (\t, marked here as <TAB>).

{
    "name": "model",
    "max_sentence_size": 150,
    "network": {
        "emb_enc_dim": 256,
        "enc_n_layers": 10,
        "enc_kernel_size": 3,
        "enc_dropout": 0.25
    },
    "model": "linear-conv-no-pos",
    "learner": {
        "lr_grace_periode": 2,
        "lr_patience": 2,
        "lr": 0.0001
    },
    "label_encoder": {
        "normalize": true,
        "lower": true
    },
    "datasets": {
        "test": "./test.tsv",
        "train": "./train.tsv",
        "dev": "./dev.tsv",
        "random": true
    }
}

The best architecture I find for medieval French was Conv to Linear without POS using the following setup:

{
    "network": {
        "emb_enc_dim": 256,
        "enc_n_layers": 10,
        "enc_kernel_size": 5,
        "enc_dropout": 0.25
    },
    "model": "linear-conv-no-pos",
    "batch_size": 64,
    "learner": {
        "lr_grace_periode": 2,
        "lr_patience": 2,
        "lr": 0.00005,
        "lr_factor": 0.5
    }
}

Credits

Inspirations, bits of code and source for being able to understand how Seq2Seq words or write my own Torch module come both from Ben Trevett and Enrique Manjavacas.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

boudams-0.1.2.tar.gz (27.5 kB view details)

Uploaded Source

Built Distribution

boudams-0.1.2-py2.py3-none-any.whl (33.9 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file boudams-0.1.2.tar.gz.

File metadata

  • Download URL: boudams-0.1.2.tar.gz
  • Upload date:
  • Size: 27.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.5

File hashes

Hashes for boudams-0.1.2.tar.gz
Algorithm Hash digest
SHA256 188bed46f9fb12e17947abaa17cc87b935a9fcacf9d19767aee2ce4529b5d3b4
MD5 6b823a7f726f4928b535c6ca0f04bd12
BLAKE2b-256 c751633b1249c92accd24e0f4695f6f594f513947c55d3daf0bd1243fa574cf1

See more details on using hashes here.

File details

Details for the file boudams-0.1.2-py2.py3-none-any.whl.

File metadata

  • Download URL: boudams-0.1.2-py2.py3-none-any.whl
  • Upload date:
  • Size: 33.9 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.5

File hashes

Hashes for boudams-0.1.2-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 bc43cbaee160571337ad1d0f610deac6e0b5797fe181e6d325c11196232ae7af
MD5 1ffecc732c2f13aa57b1e4fb59303798
BLAKE2b-256 3aa57ffce290e36a3152b23b22157aeac2bcca9d1d5d6f76428361c244cd3647

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page