A framework and toolkit for automatic segmentation
Project description
Le Boucher d'Amsterdam
Boudams, or "Le boucher d'Amsterdam", is a deep-learning tool for word segmentation (tokenization) of Latin and Medieval French texts.
How to cite
An article has been published about this work: https://hal.archives-ouvertes.fr/hal-02154122v1
@unpublished{clerice:hal-02154122,
    TITLE = {{Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin}},
    AUTHOR = {Cl{\'e}rice, Thibault},
    URL = {https://hal.archives-ouvertes.fr/hal-02154122},
    NOTE = {working paper or preprint},
    YEAR = {2019},
    MONTH = Jun,
    KEYWORDS = {convolutional network ; scripta continua ; tokenization ; Old French ; word segmentation},
    PDF = {https://hal.archives-ouvertes.fr/hal-02154122/file/Evaluating_Deep_Learning_Methods_for_Tokenization_of_Scripta_Continua_in_Old_French_and_Latin%284%29.pdf},
    HAL_ID = {hal-02154122},
    HAL_VERSION = {v1},
}
How to
Install it the way you usually install Python packages: python setup.py install (requires Python >= 3.6).
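For example, from a clone of the repository (installing from PyPI with pip should work as well, since the package is published there):

```bash
# install from a source checkout (requires Python >= 3.6)
python setup.py install

# or install the published package from PyPI
pip install boudams
```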
The config file can be kickstarted with boudams template config.json. We recommend the following settings: linear-conv-no-pos for the model, as it is not limited by the input size; and normalize and lower set to true, depending on your dataset size.
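For example (the command is the one named above; the recommended values then go into the generated JSON, as in the example configuration below):

```bash
# create a starter configuration file, then edit it to set
# "model": "linear-conv-no-pos" and the label_encoder options
boudams template config.json
```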
The initial dataset is fairly small, but building your own is simple: you need data of the following shape: "samesentence<TAB>same sentence", where the first element is the same as the second but with all spaces removed, and the two are separated by a tab (\t, marked here as <TAB>).
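As an illustration, here is a minimal Python sketch (not part of boudams; the file name and sentences are placeholders) that turns spaced sentences into training pairs of this shape:

```python
# Minimal sketch (not part of boudams): build "<unspaced><TAB><spaced>"
# pairs from plain sentences for a train/dev/test TSV file.

sentences = [
    "same sentence",
    "another example sentence",  # placeholder line
]

with open("train.tsv", "w", encoding="utf-8") as out:
    for spaced in sentences:
        unspaced = spaced.replace(" ", "")  # drop every space
        out.write(f"{unspaced}\t{spaced}\n")
```

Each sentence yields one TSV line, usable in the train, dev and test files referenced by the configuration below.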
{
"name": "model",
"max_sentence_size": 150,
"network": {
"emb_enc_dim": 256,
"enc_n_layers": 10,
"enc_kernel_size": 3,
"enc_dropout": 0.25
},
"model": "linear-conv-no-pos",
"learner": {
"lr_grace_periode": 2,
"lr_patience": 2,
"lr": 0.0001
},
"label_encoder": {
"normalize": true,
"lower": true
},
"datasets": {
"test": "./test.tsv",
"train": "./train.tsv",
"dev": "./dev.tsv",
"random": true
}
}
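With the configuration and the three TSV files in place, training goes through the boudams command line. The subcommand below is an assumption on my part, not taken from this README; check boudams --help for the actual name:

```bash
# assumed invocation, check `boudams --help` for the real subcommand
boudams train config.json
```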
The best architecture I found for Medieval French was Conv to Linear without POS, using the following setup:
{
"network": {
"emb_enc_dim": 256,
"enc_n_layers": 10,
"enc_kernel_size": 5,
"enc_dropout": 0.25
},
"model": "linear-conv-no-pos",
"batch_size": 64,
"learner": {
"lr_grace_periode": 2,
"lr_patience": 2,
"lr": 0.00005,
"lr_factor": 0.5
}
}
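The learner settings read like a standard PyTorch plateau scheduler. Here is a sketch of what lr, lr_patience and lr_factor plausibly correspond to (an assumption about boudams internals, shown only to explain the knobs; the model here is a dummy):

```python
import torch

# Dummy model and optimizer, only to illustrate the learner settings.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # "lr": 0.00005

# Assumption: lr_patience / lr_factor act like ReduceLROnPlateau,
# halving the learning rate after 2 epochs without dev-loss improvement.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", patience=2, factor=0.5
)

dev_loss = 1.0  # placeholder metric from an evaluation loop
scheduler.step(dev_loss)
```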
Credits
Inspiration, bits of code, and the resources that helped me understand how Seq2Seq works and write my own Torch modules came from both Ben Trevett and Enrique Manjavacas.
Project details
Hashes for boudams-0.1.2-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | bc43cbaee160571337ad1d0f610deac6e0b5797fe181e6d325c11196232ae7af
MD5 | 1ffecc732c2f13aa57b1e4fb59303798
BLAKE2b-256 | 3aa57ffce290e36a3152b23b22157aeac2bcca9d1d5d6f76428361c244cd3647