
A framework and toolkit for automatic segmentation

Project description

Le Boucher d'Amsterdam

Boudams, or "Le boucher d'Amsterdam", is a deep-learning tool built for tokenizing Latin or Medieval French texts.

How to cite

An article about this work is available at: https://hal.archives-ouvertes.fr/hal-02154122v1

@unpublished{clerice:hal-02154122,
  TITLE = {{Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin}},
  AUTHOR = {Cl{\'e}rice, Thibault},
  URL = {https://hal.archives-ouvertes.fr/hal-02154122},
  NOTE = {working paper or preprint},
  YEAR = {2019},
  MONTH = Jun,
  KEYWORDS = {convolutional network ; scripta continua ; tokenization ; Old French ; word segmentation},
  PDF = {https://hal.archives-ouvertes.fr/hal-02154122/file/Evaluating_Deep_Learning_Methods_for_Tokenization_of_Scripta_Continua_in_Old_French_and_Latin%284%29.pdf},
  HAL_ID = {hal-02154122},
  HAL_VERSION = {v1},
}

How to

Install it the usual way for Python packages: python setup.py install (Python >= 3.6 is required).

The config file can be bootstrapped with boudams template config.json. We recommend the following settings:

  • linear-conv-no-pos for the model, as it is not limited by the input size;
  • normalize and lower set to True, depending on your dataset size.

The initial dataset is fairly small, but building your own is simple. Each line must have the shape "samesentence<TAB>same sentence": the first element is the second with all spaces removed, and the two are separated by a tab (\t, marked here as <TAB>).
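The line format described above can be produced from ordinary spaced text with a few lines of Python. This is only a minimal sketch: the helper names to_training_line and write_dataset are hypothetical and not part of boudams, and collapsing repeated whitespace is an assumption about what a clean dataset should look like.

```python
from typing import Iterable


def to_training_line(sentence: str) -> str:
    """Return one dataset line: the sentence without spaces, a tab, the spaced sentence.

    Hypothetical helper, not part of boudams.
    """
    # Collapse any run of whitespace (including tabs) into a single space,
    # so the only tab in the output is the column separator.
    spaced = " ".join(sentence.split())
    unspaced = spaced.replace(" ", "")
    return f"{unspaced}\t{spaced}"


def write_dataset(sentences: Iterable[str], path: str) -> None:
    """Write one training line per non-empty sentence to a TSV file."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            if sentence.strip():  # skip blank lines
                handle.write(to_training_line(sentence) + "\n")


example = to_training_line("same sentence")
```

Calling write_dataset(my_sentences, "./train.tsv") would then create a file matching the "train" entry of the config; the same can be done for the dev and test files.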

{
    "name": "model",
    "max_sentence_size": 150,
    "network": {
        "emb_enc_dim": 256,
        "enc_n_layers": 10,
        "enc_kernel_size": 3,
        "enc_dropout": 0.25
    },
    "model": "linear-conv-no-pos",
    "learner": {
        "lr_grace_periode": 2,
        "lr_patience": 2,
        "lr": 0.0001
    },
    "label_encoder": {
        "normalize": true,
        "lower": true
    },
    "datasets": {
        "test": "./test.tsv",
        "train": "./train.tsv",
        "dev": "./dev.tsv",
        "random": true
    }
}

The best architecture I found for medieval French was Conv to Linear without POS, using the following setup:

{
    "network": {
        "emb_enc_dim": 256,
        "enc_n_layers": 10,
        "enc_kernel_size": 5,
        "enc_dropout": 0.25
    },
    "model": "linear-conv-no-pos",
    "batch_size": 64,
    "learner": {
        "lr_grace_periode": 2,
        "lr_patience": 2,
        "lr": 0.00005,
        "lr_factor": 0.5
    }
}
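The setup above lists only some of the keys in a full config file. One way to apply it, assuming it is meant to be overlaid on a complete config such as the one generated by the template command, is a recursive dictionary merge. The sketch below is an illustration only: deep_update is a hypothetical helper, not a boudams function, and the base dictionary is a shortened stand-in for a real config.

```python
import json
from copy import deepcopy


def deep_update(base: dict, override: dict) -> dict:
    """Return a copy of base with override merged in, recursing into nested dicts.

    Hypothetical helper, not part of boudams.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge key by key instead of replacing.
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Shortened stand-in for a full config file (assumption for illustration).
base = {
    "name": "model",
    "network": {"enc_kernel_size": 3, "enc_dropout": 0.25},
    "learner": {"lr_grace_periode": 2, "lr_patience": 2, "lr": 0.0001},
}

# A subset of the setup reported above for medieval French.
best_setup = {
    "network": {"enc_kernel_size": 5},
    "batch_size": 64,
    "learner": {"lr": 0.00005, "lr_factor": 0.5},
}

merged = deep_update(base, best_setup)
config_text = json.dumps(merged, indent=4)  # ready to be written to config.json
```

Keys absent from the override (for example "name" or "enc_dropout") keep their original values, and the base dictionary itself is left unchanged.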

Credits

Inspiration, bits of code, and the material that helped me understand how Seq2Seq works and how to write my own Torch module come from both Ben Trevett and Enrique Manjavacas.


