A package for semi-automatic corpus building
Project description
Corpus Constructor
A tool for semi-automatic corpus construction for training NLP models. This project lets you define a domain corpus with a context-free grammar in BNF form. The grammar is used to produce sample sentences belonging to the defined language. These sentences are later expanded by textual data augmentation to form a domain-specific corpus that can be used to train machine learning models such as intent classifiers and NER models.
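To illustrate the idea of producing sample sentences from a context-free grammar, here is a minimal standalone sketch. It does not use the corpus_builder API; the GRAMMAR dictionary, the rule contents, and the function names expand, generate_sentence and generate_corpus are assumptions made up for this illustration, and the package itself may work differently.

```python
import random

# Hypothetical toy grammar (not part of corpus_builder): each non-terminal
# maps to a list of alternatives, and each alternative is a sequence of symbols.
GRAMMAR = {
    "<S>": [["<GREETING>", "<REQUEST>"], ["<REQUEST>"]],
    "<GREETING>": [["hello"], ["hi"]],
    "<REQUEST>": [["how", "do", "I", "<ACTION>", "<PACKAGE>"]],
    "<ACTION>": [["install"], ["remove"], ["upgrade"]],
    "<PACKAGE>": [["firefox"], ["vim"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol into a list of terminal words."""
    if symbol not in GRAMMAR:
        # Anything without a rule is treated as a terminal word.
        return [symbol]
    alternative = rng.choice(GRAMMAR[symbol])
    words = []
    for part in alternative:
        words.extend(expand(part, rng))
    return words


def generate_sentence(rng, start="<S>"):
    """Produce one sentence by expanding the start symbol."""
    return " ".join(expand(start, rng))


def generate_corpus(size, seed=0):
    """Produce `size` sentences; the seed makes the output reproducible."""
    rng = random.Random(seed)
    return [generate_sentence(rng) for _ in range(size)]


if __name__ == "__main__":
    for sentence in generate_corpus(5):
        print(sentence)
```

Every non-terminal in this toy grammar eventually reaches terminals, so the recursion always ends. A grammar whose rules only refer back to each other would need a depth limit.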
Installation
Currently the only installation method is to clone this repo and install the dependencies manually.
Usage
A simple example with a synthetic language is as follows:
from corpus_builder.rule import Rule
from corpus_builder.grammar import Grammar
from corpus_builder.builder import CorpusBuilder
root_rule = Rule('<S>', ('<A>', '<B>'))
a_rule = Rule('<A>', ['a', '<B>'])
b_rule = Rule('<B>', ('<A>', 'b', '<C>'))
c_rule = Rule('<C>', 'c')
rule_set = [root_rule, a_rule, b_rule, c_rule]
grammar = Grammar(rule_set, '<S>')
builder = CorpusBuilder(grammar, {'<S>': ['A', 'B']}, {'<C>': 'C'})
print(builder.create_sentence())
print(builder.create_corpus(5, 0))
A more complex example with natural language in the AskUbuntu corpus domain is in the ask_ubuntu.py file.
It is also possible to create a grammar from a text file in the expected format, then create the corpus builder and use it as follows:
from corpus_builder.builder_importer import from_text_file
builder = from_text_file('simple_domain.txt')
print(builder.create_sentence())
print(builder.create_corpus(5, 0))
Or by using the command-line utility:
python -m corpus_builder --input simple_domain.txt -n 5
Visualization
You can visualize the grammar you created and produce images of it by doing:
from corpus_builder.visualize import plot_grammar
plot_grammar(grammar)
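How plot_grammar draws the grammar is not described here. As a rough sketch of the general idea, the snippet below turns a set of rules into Graphviz DOT text, with one edge from each rule head to every symbol it can produce. The grammar_to_dot function and the dictionary layout are assumptions for illustration only and are not part of corpus_builder.

```python
def grammar_to_dot(rules):
    """Render rules as Graphviz DOT text.

    `rules` is a hypothetical layout: a dict mapping each non-terminal to a
    list of alternatives, where each alternative is a list of symbols.
    """
    lines = ["digraph grammar {"]
    seen = set()
    for head, alternatives in rules.items():
        for alternative in alternatives:
            for symbol in alternative:
                edge = (head, symbol)
                if edge in seen:
                    continue  # draw each head -> symbol edge only once
                seen.add(edge)
                lines.append(f'    "{head}" -> "{symbol}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative rules only; not the exact semantics of the example above.
    toy_rules = {
        "<S>": [["<A>", "<C>"]],
        "<A>": [["a"], ["a", "<A>"]],
        "<C>": [["c"]],
    }
    print(grammar_to_dot(toy_rules))
```

The resulting text can be saved to a file and rendered with any Graphviz-compatible tool.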
Hashes for corpus_builder-0.2.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 7277ea93a5dd281b25ea1b8dc58ebf5f10729ed24b430c0db28f52ace78c4726
MD5 | f7bea3fa23a57b44e7e2f177f5143402
BLAKE2b-256 | 8ff2b940edd0fb961caf9cbcc436dd02c296547bfb59ed079117b4135446e73b