Jason Young's Tools.
This package contains several useful tools, some of which address problems in Natural Language Processing.
- Through pip
pip install young-tools
- Clone the repository locally
git clone https://github.com/Jason-Young-NLP/YoungTools.git
cd YoungTools
python setup.py build develop
All the executable modules can be executed by running the command
So far, young-tools provides three executable modules:
It is a corpus compiler that can be executed by running
young-tools-corpus. The command takes only one argument,
--configuration-path, which points to a configuration file containing all the parameters you set. The configuration file is written in a basic configuration language whose structure is similar to that of Microsoft Windows INI files.
You must provide the
main section, in which you should configure:
Before each run,
young-tools-corpus will read the
configuration-path and parse the
young-tools-corpus can process multiple corpora with different settings in a single run. The configurations of the different corpora in the
main section are separated by the separator
pipeline indicates the order in which the sub-corpus-compiler modules run. The module names are separated by the separator
&. If another instance of a module needs a different configuration, define a new section whose name is the module name with a suffix appended, for example
module_name_10. module_name must be the name of one of the sub-corpus-compiler modules.
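As a rough illustration of the pipeline convention described above, the following sketch splits a pipeline string on & and maps each section name back to its base module. The module names, the function name, and the assumption that the suffix is an underscore followed by digits are all hypothetical; the actual parsing in young-tools-corpus may differ.

```python
import re

# Hypothetical module names, used only for illustration.
KNOWN_MODULES = {"normalizer", "tokenizer"}


def parse_pipeline(pipeline):
    """Split a pipeline string on '&' and pair each section name with its base module."""
    steps = []
    for section in (part.strip() for part in pipeline.split("&")):
        if not section:
            continue
        # Assumption: an extra instance is named <module_name>_<number>.
        if section in KNOWN_MODULES:
            base = section
        else:
            base = re.sub(r"_\d+$", "", section)
        if base not in KNOWN_MODULES:
            raise ValueError(f"unknown module: {section}")
        steps.append((section, base))
    return steps
```

Here each step keeps both the section name (where its settings would be looked up) and the base module name (which code would run).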
corpus_directory specifies where the raw and compiled corpora are stored.
In corpus_directory, there may be several corpora (
corpora_names), and each corpus may have several languages (
languages), whose compiled file encodings can be determined by
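Since the configuration file is INI-like, a file of this kind can be read with Python's standard configparser. The sketch below is only an assumption about what a minimal single-corpus main section might look like: the option names come from the description above, while every value (and the module names in pipeline) is a placeholder, not a documented default.

```python
import configparser

# Placeholder configuration; all values are illustrative assumptions.
EXAMPLE_CONFIGURATION = """
[main]
pipeline = normalizer & tokenizer
corpus_directory = /path/to/corpus
corpora_names = train
languages = en
"""

parser = configparser.ConfigParser()
# Keep option names case sensitive instead of lowercasing them.
parser.optionxform = str
parser.read_string(EXAMPLE_CONFIGURATION)

main = parser["main"]
corpus_directory = main["corpus_directory"]
languages = main["languages"]
```

Setting optionxform to str is the standard way to stop configparser from lowercasing option names.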
young-tools-corpus has 5 sub-corpus-compiler-modules:
It can remove duplicate lines (
remove_dumplicate_lines) and lowercase the corpora (
granularity can be set to sentence or document. When
granularity is document, the document index, which indicates the start point of each document in the corpora, is written to the
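To make the two ideas above concrete, here is a minimal sketch of order-preserving duplicate removal with optional lowercasing, plus a helper that computes the start line of each document. Both function names and the in-memory list representation are assumptions made for illustration; they are not the module's actual interface.

```python
def remove_duplicate_lines(lines, lowercase=False):
    """Drop repeated lines while keeping the first occurrence of each."""
    seen = set()
    result = []
    for line in lines:
        text = line.lower() if lowercase else line
        if text not in seen:
            seen.add(text)
            result.append(text)
    return result


def document_index(documents):
    """Return the starting line number of each document in a concatenated corpus."""
    starts = []
    offset = 0
    for document in documents:
        starts.append(offset)
        offset += len(document)
    return starts
```

With lowercase enabled, lines that differ only in case count as duplicates, because the comparison happens after lowercasing.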
Normalizes the punctuation of the corpora.
Segments Chinese sentences using THULAC. If you need POS tagging, set
part_of_speech_tagging to true.
traditional_to_simplified may be useful in some situations.
Tokenizes sentences in different languages. If you need the hyphens to be converted, set
split_aggressive_hyphen to True.
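In the mosesdecoder tokenizer, aggressive hyphen splitting replaces a hyphen between two alphanumeric characters with " @-@ ". The sketch below assumes split_aggressive_hyphen follows the same convention; the function name is hypothetical and this is not the package's own code.

```python
import re

# A hyphen with an alphanumeric character on both sides.
# [^\W_] matches letters and digits but not the underscore.
_HYPHEN_BETWEEN_ALNUM = re.compile(r"([^\W_])-(?=[^\W_])")


def split_aggressive_hyphen(text):
    """Replace intra-word hyphens with ' @-@ ', Moses style."""
    return _HYPHEN_BETWEEN_ALNUM.sub(r"\1 @-@ ", text)
```

Hyphens that are not surrounded by alphanumeric characters, such as a standalone dash, are left untouched.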
This is a simple wrapper around subword-nmt.
apply_file_indices indicates the indices of the corpora on which BPE should be learned/applied in the
subword_indices indicates which languages of the corpora BPE should be applied to.
symbols_number is the number of merge operations, and
joint_learn specifies whether to learn the BPE jointly among the
Normalizer and Tokenizer are reimplementations of the corresponding mosesdecoder scripts.
It can generate the manipulation sequences between the corpus hypotheses and references by calculating the Levenshtein distance, and synthesize hypotheses for the references by extracting rules from the aligned hypotheses and references. These functions can be executed by running
young-tools-corpus with a subcommand of
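The idea of deriving a manipulation sequence from the Levenshtein distance can be sketched with the classic dynamic-programming table followed by a backtrace. The function name, the operation labels, and the token-list input are assumptions for illustration; the tool's actual output format is not described here.

```python
def edit_operations(hypothesis, reference):
    """Return (distance, operations) that turn hypothesis into reference."""
    n, m = len(hypothesis), len(reference)
    # dist[i][j] is the edit distance between hypothesis[:i] and reference[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete from hypothesis
                dist[i][j - 1] + 1,         # insert from reference
                dist[i - 1][j - 1] + cost,  # keep or substitute
            )

    # Walk back from the bottom-right corner to recover one optimal sequence.
    operations = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = hypothesis[i - 1] == reference[j - 1]
            if dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
                label = "keep" if same else "substitute"
                operations.append((label, hypothesis[i - 1], reference[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            operations.append(("delete", hypothesis[i - 1], None))
            i -= 1
        else:
            operations.append(("insert", None, reference[j - 1]))
            j -= 1
    operations.reverse()
    return dist[n][m], operations
```

When several optimal alignments exist, this backtrace prefers keep/substitute over delete, and delete over insert; a different tie-breaking order would give an equally valid sequence.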
young-tools-xml can convert an XML file into a plain-text file, or escape/unescape the file, by specifying the subcommand as
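For readers unfamiliar with these operations, the sketch below shows what XML-to-plain-text conversion and escaping/unescaping can look like using only the Python standard library. It illustrates the concepts, not how young-tools-xml is implemented; the helper name xml_to_plain is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape


def xml_to_plain(xml_text):
    """Concatenate all text content of an XML document, dropping the tags."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


# escape() replaces &, < and > with entities; unescape() reverses it.
escaped = escape("a < b & c")
restored = unescape(escaped)
```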
To be done.
Use it by simply importing the package:
import young_tools.pedestal as pedestal
The usage of each module in the
pedestal package is described as follows:
Timer records the system/process elapsed time.
Constant is a class that stores an unlimited number of constants.
InstancesChecker is a basic decorator that checks whether the parameters passed to a method are legal.
ANSIFormatter controls ANSI color strings. Use this class to format terminal output strings.
Logger records the log messages of the process and sends them to a log file or the terminal.
Argument is a simple wrapper around argparse.
Configurator is a simple wrapper around configparser, but Configurator is case sensitive.
UnicodeHandler has several methods that deal with Unicode strings and detect the encoding type.
A simple class that can redirect the stdout/stderr stream to a file.
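The pedestal classes themselves are not documented in detail here, so the following is only a standard-library sketch of the last idea, redirecting stdout to a file. The helper name run_with_stdout_redirected is hypothetical and is not part of young_tools.pedestal.

```python
import contextlib
import os
import tempfile


def run_with_stdout_redirected(func, path):
    """Call func() while everything it prints goes to the file at path."""
    with open(path, "w", encoding="utf-8") as stream:
        with contextlib.redirect_stdout(stream):
            func()


def read_text(path):
    """Read the redirected output back for inspection."""
    with open(path, encoding="utf-8") as stream:
        return stream.read()
```

contextlib.redirect_stderr works the same way for the stderr stream.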