
A dataset generator for Rasa NLU


Chatette dataset generator


Chatette is a Python script that helps you generate training datasets for the Rasa NLU Python package. If you want to make large datasets of example data for Natural Language Understanding tasks without too much of a headache, Chatette is a project for you.

Specifically, Chatette implements a Domain Specific Language (DSL) that allows you to define templates to generate a large number of sentences. Those sentences are then saved in the input format of Rasa NLU.

The DSL used is a superset of the DSL defined by the excellent Chatito project created by Rodrigo Pimentel. (Note: the DSL is actually a superset of Chatito v2.1.x for Rasa NLU, not for all possible adapters.)

How to use Chatette?

Input and output data

The data that Chatette uses and generates is loaded from and saved to files. We thus have:

  • The input file containing the templates.

    There is no need for a specific file extension. The syntax of the DSL to make those templates is described in the syntax specification. Note that templates can be divided into several files, with one master file linking them all together (described in the syntax specification).

  • The output file, a JSON file containing data that can be directly fed to Rasa NLU.
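    For reference, Rasa NLU's JSON training data format groups examples under a rasa_nlu_data key. The sketch below is only illustrative: the sentence, intent, and entity values are made up, and the exact fields Chatette writes may differ slightly.

    {
      "rasa_nlu_data": {
        "common_examples": [
          {
            "text": "could you tell me where the toilets are?",
            "intent": "ask_toilet",
            "entities": [
              { "start": 28, "end": 35, "value": "toilets", "entity": "toilet" }
            ]
          }
        ]
      }
    }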

Running Chatette

To run Chatette, you will need to have Python installed. Chatette works with both Python 2.x and 3.x.

Install Chatette via pip:

pip install chatette

(Alternatively, you can clone the GitHub repository and run the file named run.py.)

Then simply run the following command:

python -m chatette.run <path_to_template>

or

python3 -m chatette.run <path_to_template>

You can specify the name of the output file as follows:

python -m chatette.run <path_to_template> -o <output_path.json>

or

python3 -m chatette.run <path_to_template> --output <output_path.json>

The generated dataset will then be saved to a file named output_path.json in the same directory as the input file. If you don't specify a name for the output file, the default is output.json.

You can also set the random generator seed using the program argument -s or --seed. The seed can be any text without spaces. If you execute Chatette twice on the exact same template with the same seed, the generated output is guaranteed to be exactly the same for both executions.
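For instance, with an arbitrary seed value such as my-seed (a hypothetical value chosen here for illustration), the two following invocations are equivalent and will produce identical datasets:

python -m chatette.run <path_to_template> -s my-seed

python -m chatette.run <path_to_template> --seed my-seed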

Chatette vs Chatito?

A perfectly legitimate question could be:

Why does Chatette exist when Chatito already fulfills the same purposes?

The reason comes from the different goals of the two projects:

Chatito aims for a generic yet powerful DSL that should stay simple. While this works perfectly well for small projects, the simplicity can become a burden as projects grow larger: your template file becomes overwhelmingly large, to the point where you get lost in it.

Chatette defines a more complex DSL to be able to manage larger projects. Here is a non-exhaustive list of features that can help with that:

  • Ability to break down templates into multiple files
  • Support for comments inside template files (Note: this is now possible in Chatito v2.1.x too)
  • Word group syntax that lets you define parts of sentences that are not necessarily generated in every example
  • Possibility to specify the probability of generating certain parts of a sentence
  • Choice syntax to avoid copy-pasting rules that differ only by a few words
  • Ability to define the value of each slot regardless of the generated example
  • Syntax for generating words with either case for the leading letter
  • Argument support, so that some templates can be filled in with given words
  • Indentation that only needs to be somewhat consistent
  • Support for synonyms

As previously mentioned, the DSL used by Chatette is a superset of the one used by Chatito. This means that input files written for Chatito are completely usable with Chatette (though not the other way around), so it is easy to migrate from Chatito to Chatette.

As an example, this Chatito data:

// This template defines different ways to ask for the location of toilets (Chatito version)
%[ask_toilet]('training': '3')
    ~[sorry?] ~[tell me] where the @[toilet#singular] is ~[please?]?
    ~[sorry?] ~[tell me] where the @[toilet#plural] are ~[please?]?

~[sorry]
    sorry
    Sorry
    excuse me
    Excuse me

~[tell me]
    ~[can you?] tell me
    ~[can you?] show me
~[can you]
    can you
    could you
    would you

~[please]
    please

@[toilet#singular]
    toilet
    loo
@[toilet#plural]
    toilets

could be directly given as input to Chatette, but this Chatette template would produce the same thing:

// This template defines different ways to ask for the location of toilets (Chatette version)
%[&ask_toilet](3)
    ~[sorry?] ~[tell me] where the {@[toilet#singular] is/@[toilet#plural] are} [please?]\?

~[sorry]
    sorry
    excuse me

~[tell me]
    ~[can you?] {tell/show} me
~[can you]
    {can/could/would} you

@[toilet#singular]
    toilet
    loo
@[toilet#plural]
    toilets

The Chatito version is arguably easier to read, but the Chatette version is shorter, which may be very useful when dealing with lots of templates and potential repetition.

Beware that, as always with machine learning, having too much data may cause your models to perform less well because of overfitting. While this script can be used to generate thousands upon thousands of examples, doing so is not advised for machine learning tasks.

Note that Chatette is named after Chatito, as -ette in French could be translated to -ita or -ito in Spanish.

Contributors

Many thanks to him!
