SacreMoses

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Sacremoses

License

MIT License.

Install

pip install -U sacremoses

NOTE: Sacremoses supports only Python 3 as of sacremoses>=0.0.41. If you are using Python 2, the last compatible version is sacremoses==0.0.40.
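
The version constraint in the note can be expressed as a small helper for a build or setup script. This is a minimal sketch; the function name `sacremoses_requirement` is hypothetical and is not part of sacremoses.

```python
import sys


def sacremoses_requirement(version_info=sys.version_info):
    """Return a pip requirement string matching the note above.

    Hypothetical helper: Python 3 gets the current release line,
    Python 2 is pinned to the last release that supported it.
    """
    if version_info[0] >= 3:
        return "sacremoses>=0.0.41"
    return "sacremoses==0.0.40"


print(sacremoses_requirement())
```

The returned string can be passed to pip as a single quoted argument.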

Usage (Python)

Tokenizer and Detokenizer

>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mt = MosesTokenizer(lang='en')
>>> text = 'This, is a sentence with weird\xbb symbols\u2026 appearing everywhere\xbf'
>>> expected_tokenized = 'This , is a sentence with weird \xbb symbols \u2026 appearing everywhere \xbf'
>>> tokenized_text = mt.tokenize(text, return_str=True)
>>> tokenized_text == expected_tokenized
True


>>> mt, md = MosesTokenizer(lang='en'), MosesDetokenizer(lang='en')
>>> sent = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [ ] & You're gonna shake it off? Don't?"
>>> expected_tokens = ['This', 'ain', '&apos;t', 'funny', '.', 'It', '&apos;s', 'actually', 'hillarious', ',', 'yet', 'double', 'Ls', '.', '&#124;', '&#91;', '&#93;', '&lt;', '&gt;', '&#91;', '&#93;', '&amp;', 'You', '&apos;re', 'gonna', 'shake', 'it', 'off', '?', 'Don', '&apos;t', '?']
>>> expected_detokens = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [] & You're gonna shake it off? Don't?"
>>> mt.tokenize(sent) == expected_tokens
True
>>> md.detokenize(expected_tokens) == expected_detokens
True
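
The expected tokens above show that the tokenizer escapes certain characters as XML entities by default. The following sketch reproduces only the mappings visible in that example, using a hypothetical `ESCAPES` table and two hypothetical helper functions. It illustrates the escape and unescape round trip and is not sacremoses' own implementation.

```python
# Entity table read off the expected_tokens example; illustrative only.
ESCAPES = [
    ("&", "&amp;"),   # first, so entities added later are not escaped twice
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_token(token):
    """Replace each special character with its entity."""
    for raw, entity in ESCAPES:
        token = token.replace(raw, entity)
    return token


def unescape_token(token):
    """Undo escape_token; reversed order restores '&amp;' last."""
    for raw, entity in reversed(ESCAPES):
        token = token.replace(entity, raw)
    return token


tokens = ["ain", "'t", "|", "[", "]", "<", ">", "&"]
escaped = [escape_token(t) for t in tokens]
restored = [unescape_token(t) for t in escaped]
print(escaped)
# ['ain', '&apos;t', '&#124;', '&#91;', '&#93;', '&lt;', '&gt;', '&amp;']
```

Handling the ampersand first when escaping and last when unescaping keeps the round trip lossless, even for input that already contains entity-like text.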

Truecaser

>>> from sacremoses import MosesTruecaser, MosesTokenizer

# Train a new truecaser from a 'big.txt' file.
>>> mtr = MosesTruecaser()
>>> mtok = MosesTokenizer(lang='en')

# Save the truecase model to 'big.truecasemodel' using `save_to`
>>> tokenized_docs = [mtok.tokenize(line) for line in open('big.txt')]
>>> mtr.train(tokenized_docs, save_to='big.truecasemodel')

# Save the truecase model to 'big.truecasemodel' after training
# (just in case you forgot to use `save_to`)
>>> mtr = MosesTruecaser()
>>> mtr.train('big.txt')
>>> mtr.save_model('big.truecasemodel')

# Truecase a string after training a model.
>>> mtr = MosesTruecaser()
>>> mtr.train('big.txt')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
['the', 'adventures', 'of', 'Sherlock', 'Holmes']

# Load a model and truecase a string using the trained model.
>>> mtr = MosesTruecaser('big.truecasemodel')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
['the', 'adventures', 'of', 'Sherlock', 'Holmes']
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", use_known=True)
['the', 'ADVENTURES', 'OF', 'SHERLOCK', 'HOLMES']
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True)
'the adventures of Sherlock Holmes'
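
To give an intuition for what a truecaser learns, the sketch below counts the surface forms of each word in a tokenized corpus and restores the most frequent one. It is a simplified toy, not the algorithm sacremoses implements; the names `train_casing`, `truecase`, and `use_first_token` are hypothetical, and skipping the sentence-initial token is an assumption made for this illustration.

```python
from collections import Counter, defaultdict


def train_casing(tokenized_docs, use_first_token=False):
    """Count how often each surface form occurs per lowercased token."""
    counts = defaultdict(Counter)
    for tokens in tokenized_docs:
        # Sentence-initial tokens are capitalized regardless of their
        # true case, so this toy skips them unless asked not to.
        start = 0 if use_first_token else 1
        for token in tokens[start:]:
            counts[token.lower()][token] += 1
    return counts


def truecase(tokens, counts):
    """Replace each token with its most frequent training form, if known."""
    result = []
    for token in tokens:
        forms = counts.get(token.lower())
        result.append(forms.most_common(1)[0][0] if forms else token)
    return result


docs = [
    "The adventures of Sherlock Holmes".split(),
    "I read the adventures of Sherlock Holmes".split(),
    "Yesterday Sherlock met Holmes".split(),
]
model = train_casing(docs)
cased = truecase("THE ADVENTURES OF SHERLOCK HOLMES".split(), model)
print(cased)
# ['the', 'adventures', 'of', 'Sherlock', 'Holmes']
```

Tokens that never appear in the training data are returned unchanged in this toy version.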

Normalizer

>>> from sacremoses import MosesPunctNormalizer
>>> mpn = MosesPunctNormalizer()
>>> mpn.normalize('THIS EBOOK IS OTHERWISE PROVIDED TO YOU “AS-IS.”')
'THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."'
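
Punctuation normalization of this kind maps typographic characters to plain ASCII equivalents. The sketch below shows the idea for curly double quotes only, using `str.translate`. The `QUOTE_TABLE` mapping and the `normalize_quotes` function are hypothetical and cover far less than the real normalizer.

```python
# Minimal illustrative mapping: curly double quotes to straight quotes.
QUOTE_TABLE = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_quotes(text):
    """Replace curly double quotes with ASCII double quotes."""
    return text.translate(QUOTE_TABLE)


result = normalize_quotes("PROVIDED TO YOU \u201cAS-IS.\u201d")
print(result)
# PROVIDED TO YOU "AS-IS."
```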

Usage (CLI)

The CLI pipeline feature was introduced in version 0.0.42. Because commands can be chained, the following global options must be set before the first command:

language
processes
encoding
quiet

$ pip install -U "sacremoses>=0.1"

$ sacremoses --help
Usage: sacremoses [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

Options:
  -l, --language TEXT      Use language specific rules when tokenizing
  -j, --processes INTEGER  No. of processes.
  -e, --encoding TEXT      Specify encoding of file.
  -q, --quiet              Disable progress bar.
  --version                Show the version and exit.
  -h, --help               Show this message and exit.

Commands:
  detokenize
  detruecase
  normalize
  tokenize
  train-truecase
  truecase

Pipeline

The example below chains the following commands:

normalize with the -c option to remove control characters.
tokenize with the -a option for aggressive dash split rules.
truecase with the -a option to indicate that the model is for ASR:
- if big.truemodel exists, load the model with the -m option,
- otherwise train a model and save it to the big.truemodel file with the -m option.
Finally, redirect the console output to the big.txt.norm.tok.true file.

cat big.txt | sacremoses -l en -j 4 \
    normalize -c tokenize -a truecase -a -m big.truemodel \
    > big.txt.norm.tok.true
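
The ordering in the command above (program name, then global options, then each command followed by its own options) can be assembled programmatically. The sketch below builds the same argument list in Python; `build_pipeline` is a hypothetical helper, and only flags shown in this document are used.

```python
import shlex


def build_pipeline(global_opts, stages):
    """Return an argv list: program, global options, then each command
    followed by its own options, in the order given."""
    argv = ["sacremoses", *global_opts]
    for command, options in stages:
        argv.append(command)
        argv.extend(options)
    return argv


argv = build_pipeline(
    ["-l", "en", "-j", "4"],
    [
        ("normalize", ["-c"]),
        ("tokenize", ["-a"]),
        ("truecase", ["-a", "-m", "big.truemodel"]),
    ],
)
command_line = shlex.join(argv)
print(command_line)
# sacremoses -l en -j 4 normalize -c tokenize -a truecase -a -m big.truemodel
```

The resulting list could be handed to `subprocess.run` with the input and output files opened as `stdin` and `stdout`, which mirrors the shell redirection in the example.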

Tokenizer

$ sacremoses tokenize --help
Usage: sacremoses tokenize [OPTIONS]

Options:
  -a, --aggressive-dash-splits   Triggers dash split rules.
  -x, --xml-escape               Escape special characters for XML.
  -p, --protected-patterns TEXT  Specify file with patters to be protected in
                                 tokenisation.
  -c, --custom-nb-prefixes TEXT  Specify a custom non-breaking prefixes file,
                                 add prefixes to the default ones from the
                                 specified language.
  -h, --help                     Show this message and exit.


$ sacremoses -l en -j 4 tokenize < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 24363.39it/s]

$ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns
$ sacremoses -l en -j 4 tokenize -p basic-protected-patterns < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s]

Detokenizer

$ sacremoses detokenize --help
Usage: sacremoses detokenize [OPTIONS]

Options:
  -x, --xml-unescape  Unescape special characters for XML.
  -h, --help          Show this message and exit.

$ sacremoses -l en -j 4 detokenize < big.txt.tok > big.txt.tok.detok
100%|██████████████████████████████████| 128457/128457 [00:16<00:00, 7931.26it/s]

Truecase

$ sacremoses truecase --help
Usage: sacremoses truecase [OPTIONS]

Options:
  -m, --modelfile TEXT            Filename to save/load the modelfile.
                                  [required]
  -a, --is-asr                    A flag to indicate that model is for ASR.
  -p, --possibly-use-first-token  Use the first token as part of truecase
                                  training.
  -h, --help                      Show this message and exit.

$ sacremoses -j 4 truecase -m big.model < big.txt.tok > big.txt.tok.true
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 14257.27it/s]

Detruecase

$ sacremoses detruecase --help
Usage: sacremoses detruecase [OPTIONS]

Options:
  -j, --processes INTEGER  No. of processes.
  -a, --is-headline        Whether the file are headlines.
  -e, --encoding TEXT      Specify encoding of file.
  -h, --help               Show this message and exit.

$ sacremoses -j 4 detruecase < big.txt.tok.true > big.txt.tok.true.detrue
100%|█████████████████████████████████| 128457/128457 [00:04<00:00, 26945.16it/s]

Normalize

$ sacremoses normalize --help
Usage: sacremoses normalize [OPTIONS]

Options:
  -q, --normalize-quote-commas  Normalize quotations and commas.
  -d, --normalize-numbers       Normalize number.
  -p, --replace-unicode-puncts  Replace unicode punctuations BEFORE
                                normalization.
  -c, --remove-control-chars    Remove control characters AFTER normalization.
  -h, --help                    Show this message and exit.

$ sacremoses -j 4 normalize < big.txt > big.txt.norm
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 13096.23it/s]

Release history

This version

0.1.1

Oct 30, 2023

0.1.0

Oct 30, 2023

0.0.53

May 3, 2022

0.0.52 yanked

May 3, 2022

Reason this release was yanked:

Broken.

0.0.51 yanked

May 2, 2022

Reason this release was yanked:

over-thinking

0.0.50 yanked

May 2, 2022

Reason this release was yanked:

over-thinking

0.0.49

Mar 15, 2022

0.0.48

Mar 15, 2022

0.0.47

Jan 9, 2022

0.0.46

Sep 25, 2021

0.0.45

Apr 19, 2021

0.0.44

Apr 3, 2021

0.0.43

May 4, 2020

0.0.42

May 4, 2020

0.0.41

Apr 14, 2020

0.0.40

Apr 13, 2020

0.0.39

Apr 13, 2020

0.0.38

Jan 6, 2020

0.0.35

Oct 3, 2019

0.0.34

Sep 20, 2019

0.0.33

Aug 14, 2019

0.0.32

Aug 14, 2019

0.0.31

Aug 6, 2019

0.0.30

Aug 6, 2019

0.0.29

Aug 6, 2019

0.0.28

Aug 6, 2019

0.0.27

Aug 6, 2019

0.0.26

Aug 6, 2019

0.0.25

Aug 6, 2019

0.0.24

Jul 29, 2019

0.0.22

Jul 16, 2019

0.0.21

Jul 16, 2019

0.0.20

Jul 16, 2019

0.0.19

Apr 12, 2019

0.0.18

Apr 12, 2019

0.0.17

Apr 12, 2019

0.0.16

Apr 12, 2019

0.0.15

Apr 12, 2019

0.0.14

Apr 12, 2019

0.0.13

Mar 19, 2019

0.0.12

Mar 19, 2019

0.0.11

Mar 19, 2019

0.0.10

Mar 7, 2019

0.0.9

Mar 6, 2019

0.0.8

Mar 6, 2019

0.0.7

Jan 14, 2019

0.0.5

Sep 20, 2018

0.0.4

Aug 7, 2018

0.0.3

Jun 19, 2018

0.0.2

Apr 24, 2018

0.0.1

Apr 20, 2018

0.0.0

Apr 20, 2018

Download files

Source Distribution

sacremoses-0.1.1.tar.gz (883.2 kB view details)

Uploaded Oct 30, 2023 Source

Built Distribution

sacremoses-0.1.1-py3-none-any.whl (897.5 kB view details)

Uploaded Oct 30, 2023 Python 3

File details

Details for the file sacremoses-0.1.1.tar.gz.

File metadata

Download URL: sacremoses-0.1.1.tar.gz
Upload date: Oct 30, 2023
Size: 883.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for sacremoses-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`b6fd5d3a766b02154ed80b962ddca91e1fd25629c0978c7efba21ebccf663934`
MD5	`db513aea014345ad8e76295ba058159f`
BLAKE2b-256	`1d51fbdc4af4f6e85d26169e28be3763fe50ddfd0d4bf8b871422b0788dcc4d2`
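
A downloaded archive can be checked against the SHA256 digest listed above. The sketch below uses Python's standard `hashlib`; the helper name `sha256_of` is hypothetical, and the archive is assumed to be in the current directory (the check is skipped if it is not).

```python
import hashlib
import os

# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "b6fd5d3a766b02154ed80b962ddca91e1fd25629c0978c7efba21ebccf663934"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = "sacremoses-0.1.1.tar.gz"  # assumed download location
if os.path.exists(archive):
    status = "OK" if sha256_of(archive) == EXPECTED_SHA256 else "MISMATCH"
    print(archive, status)
```

Reading in chunks keeps memory use constant regardless of the archive size.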

File details

Details for the file sacremoses-0.1.1-py3-none-any.whl.

File metadata

Download URL: sacremoses-0.1.1-py3-none-any.whl
Upload date: Oct 30, 2023
Size: 897.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for sacremoses-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`31e04c98b169bfd902144824d191825cd69220cdb4ae4bcf1ec58a7db5587b1a`
MD5	`c60f9116eca30734668c38ba1f09fb7f`
BLAKE2b-256	`0bf089ee2bc9da434bd78464f288fdb346bc2932f2ee80a90b2a4bbbac262c74`
