A package for semi-automatic corpus building

Corpus Constructor

Tool for semi-automatic corpus construction for training NLP models. This project lets you define a domain corpus with a context-free grammar in BNF form. The grammar is used to produce sample sentences belonging to the defined language, which are later expanded by textual data augmentation to form a domain-specific corpus that can be used to train machine learning models such as intent classifiers and NER models.
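
To illustrate the general idea of producing sample sentences from a context-free grammar, here is a minimal, library-independent sketch. It does not use the corpus_builder API and is not how the package is implemented; the toy grammar, the dictionary representation, and the names `GRAMMAR`, `expand`, and `generate_sentence` are all assumptions made for illustration.

```python
import random

# Hypothetical toy grammar (not part of corpus_builder): each nonterminal
# maps to a list of alternative productions, and each production is a
# sequence of symbols. Symbols without an entry are treated as terminals.
GRAMMAR = {
    "<S>": [["<GREETING>", "<REQUEST>"]],
    "<GREETING>": [["hello"], ["hi"]],
    "<REQUEST>": [["how", "do", "I", "install", "<PACKAGE>"]],
    "<PACKAGE>": [["python"], ["git"]],
}


def expand(symbol, grammar, rng):
    """Recursively expand a symbol into a list of terminal tokens."""
    if symbol not in grammar:
        # Terminal symbol: emit it unchanged.
        return [symbol]
    # Nonterminal: pick one alternative at random and expand each part.
    production = rng.choice(grammar[symbol])
    tokens = []
    for part in production:
        tokens.extend(expand(part, grammar, rng))
    return tokens


def generate_sentence(grammar, start="<S>", seed=None):
    """Generate one sentence from the grammar, starting at `start`."""
    rng = random.Random(seed)
    return " ".join(expand(start, grammar, rng))


if __name__ == "__main__":
    for seed in range(3):
        print(generate_sentence(GRAMMAR, seed=seed))
```

Each call walks the grammar from the start symbol and picks one alternative per nonterminal, so repeated calls with different seeds yield different sentences from the same language. A grammar with unbounded recursion would need a depth limit, which this sketch omits.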

Installation

Currently, the only installation method is cloning this repo and installing the dependencies manually.

Usage

A simple example with a synthetic language is as follows:

from corpus_builder.rule import Rule
from corpus_builder.grammar import Grammar
from corpus_builder.builder import CorpusBuilder

root_rule = Rule('<S>', ('<A>', '<B>'))
a_rule = Rule('<A>', ['a', '<B>'])
b_rule = Rule('<B>', ('<A>', 'b', '<C>'))
c_rule = Rule('<C>', 'c')

rule_set = [root_rule, a_rule, b_rule, c_rule]

grammar = Grammar(rule_set, '<S>')

builder = CorpusBuilder(grammar, {'<S>': ['A', 'B']}, {'<C>': 'C'})
print(builder.create_sentence())
print(builder.create_corpus(5, 0))

A more complex example with natural language in the AskUbuntu corpus domain is in the ask_ubuntu.py file.

It is also possible to create a grammar from a text file in this format, then create the corpus builder and use it as follows:

from corpus_builder.builder_importer import from_text_file

builder = from_text_file('simple_domain.txt')
print(builder.create_sentence())
print(builder.create_corpus(5, 0))

Or by using the command-line utility:

python -m corpus_builder --input simple_domain.txt -n 5

Visualization

You can visualize the grammar you created and produce images of it by doing:

from corpus_builder.visualize import plot_grammar
plot_grammar(grammar)

