Chat bot sentences & story generator.
cross-words
==========================================
`cross-words` is a python module that allows you to easily create a corpus of documents with parameterized entities.
The main goal of `cross-words` is to offer an easy way to create either sentences or stories for use in chat bot training.
As of May 2018, it is mostly designed to be used with [Rasa NLU/Core](http://rasa.com/).
1. [Installation](#install)
2. [How to use this package](#usage)
# 1. Installation<a name="install"></a>
You can install it with pip:

    pip install cross-words

Or directly from GitHub if you want the latest development version:

    pip install git+https://github.com/data-chirps/cross-words.git

# 2. How to use this package<a name="usage"></a>
## cross-words DSL
`cross-words` is based on a simple yet powerful Domain Specific Language.
When used along with Rasa NLU/Core, it uses 3 concepts:
- **intents:** the objective of the chatbot's user (e.g. ask to book a restaurant, confirm a chatbot inquiry etc.)
- **entities:** specific parts of a sentence containing key information (e.g. which restaurant to book, how many people etc.)
- **aliases:** lists of synonyms that can be used interchangeably
More details are available at [Rasa NLU](https://nlu.rasa.com/tutorial.html).
Given a configuration file (.txt) containing all of the above, `cross-words` is able to generate many training sentences/conversations using combinations of sentence parts.
`cross-words` configuration files look like this:
```
Could I have the number of @[subject_filter] ~[owners] in @[geo_filter] @[time_filter]?
@[time_filter]
this month
this year
LTD
life to date
up to date
since release
since launch
since beginning of fiscal year
@[geo_filter]
France
Germany
US
United States
America
Canada
Italy
@[subject_filter]
birds
parrots
owl
dogs
cats
persian
~[owners]
owners
possessors
```
If asked for sentences, `cross-words` will generate a .md file whose first lines will be:
```
- Could I have the number of [birds](subject_filter) possessors in [Canada](geo_filter) [life to date](time_filter)?
- Could I have the number of [parrots](subject_filter) possessors in [United States](geo_filter) [since release](time_filter)?
- Could I have the number of [owl](subject_filter) possessors in [Italy](geo_filter) [up to date](time_filter)?
- Could I have the number of [owl](subject_filter) possessors in [Italy](geo_filter) [since release](time_filter)?
- Could I have the number of [dogs](subject_filter) owners in [United States](geo_filter) [LTD](time_filter)?
- Could I have the number of [dogs](subject_filter) owners in [Canada](geo_filter) [this year](time_filter)?
- Could I have the number of [cats](subject_filter) owners in [France](geo_filter) [this year](time_filter)?
- Could I have the number of [cats](subject_filter) owners in [US](geo_filter) [since release](time_filter)?
- Could I have the number of [cats](subject_filter) owners in [America](geo_filter) [this month](time_filter)?
- Could I have the number of [cats](subject_filter) owners in [Canada](geo_filter) [life to date](time_filter)?
```
This file is then ready to use as training input to Rasa NLU.
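The expansion from a template to annotated sentences can be sketched as follows. This is a minimal illustration, not the `cross-words` implementation: the `expand` function and the `PLACEHOLDER` pattern are hypothetical names, and the sketch assumes that entities are written out as `[value](entity)` while aliases are substituted as plain text, as in the output above.

```python
import itertools
import re

# Placeholder syntax as shown in the configuration example:
# @[name] marks an entity, ~[name] marks an alias.
PLACEHOLDER = re.compile(r"([@~])\[(\w+)\]")


def expand(template, entities, aliases):
    """Yield every sentence obtained by filling all placeholders (hypothetical helper)."""
    slots = PLACEHOLDER.findall(template)  # [(kind, name), ...] in order of appearance
    choices = [entities[name] if kind == "@" else aliases[name] for kind, name in slots]
    for combo in itertools.product(*choices):
        values = iter(combo)

        def fill(match):
            kind, name = match.group(1), match.group(2)
            value = next(values)
            # Entities keep a markdown annotation; aliases become plain text.
            return f"[{value}]({name})" if kind == "@" else value

        yield "- " + PLACEHOLDER.sub(fill, template)


template = "Could I have the number of @[subject_filter] ~[owners] in @[geo_filter]?"
entities = {"subject_filter": ["dogs", "cats"], "geo_filter": ["France", "US"]}
aliases = {"owners": ["owners", "possessors"]}

sentences = list(expand(template, entities, aliases))
for line in sentences:
    print(line)
```

With two alternatives for each of the three placeholders, the sketch yields 2 × 2 × 2 = 8 sentences.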
If asked for stories:
```
## Genereated Story 815310784239368
* acquisition{}
- utter_ask_time_filter
* acquisition{"time_filter": "since beginning of fiscal year"}
- slot{"time_filter": "since beginning of fiscal year"}
- utter_ask_geo_filter
* acquisition{"geo_filter": "America"}
- slot{"geo_filter": "America"}
- utter_ask_subject_filter
* acquisition{"subject_filter": "dogs"}
- slot{"subject_filter": "dogs"}
- action_acquisition
## Genereated Story 257661587723758
* acquisition{"time_filter": "since release", "geo_filter": "Germany"}
- slot{"time_filter": "since release"}
- slot{"geo_filter": "Germany"}
- utter_ask_subject_filter
* acquisition{"subject_filter": "owl"}
- slot{"subject_filter": "owl"}
- action_acquisition
## Genereated Story 877699493192194
* acquisition{"subject_filter": "parrots"}
- slot{"subject_filter": "parrots"}
- utter_ask_time_filter
* acquisition{"time_filter": "LTD"}
- slot{"time_filter": "LTD"}
- utter_ask_geo_filter
* acquisition{"geo_filter": "France"}
- slot{"geo_filter": "France"}
- action_acquisition
```
This file is then ready to use for training with Rasa Core.
## Generating files
`cross-words` mainly comes with 2 functions: `parse_input` and `generate`. All other functions are implementation details.
### generate(input_path, output_path="./xwords/outputs/", intent_string=None, output_prefix='', training_ratio=1.0, for_story=False, n_sub=None)
This is the main function of `cross-words`.
Given an input configuration file, it outputs all combinations of intents x entities x aliases into a .md file ready for training.
A few arguments let you tune its behavior:
- **input_path:** path to the configuration file *(string)*
- **output_path:** path to the output folder where train/test files will be written *(string)*
- **intent_string:** string to specify the intent at the beginning of sentence files (for Rasa NLU) or inside generated stories (for Rasa Core) *(string)*
- **output_prefix:** string to prepend to the names of the files that are written *(string)*
- **training_ratio:** ratio between train and test sets. If .7, 30% of all generated combinations will be reserved for a test file. If 1.0, no test file will be created. *(float)*
- **for_story:** whether to generate sentences (for Rasa NLU) or stories (for Rasa Core) *(bool)*
- **n_sub:** number of sentences/stories (incl. test) to be taken as a subsample of all possible combinations of intents x entities x aliases *(int)* (required when generating stories for Rasa Core)
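How `n_sub` and `training_ratio` interact can be sketched as below. This is only an illustration of the documented meaning of the two arguments, not the code `generate` actually runs: `subsample_and_split` and its `seed` argument are hypothetical, and the rounding rule is an assumption.

```python
import random


def subsample_and_split(items, training_ratio=1.0, n_sub=None, seed=0):
    """Return (train, test) lists drawn from `items`.

    Hypothetical helper mirroring the documented meaning of `n_sub` and
    `training_ratio`; it is not taken from the cross-words source.
    """
    rng = random.Random(seed)
    pool = list(items)
    if n_sub is not None and n_sub < len(pool):
        pool = rng.sample(pool, n_sub)  # subsample; test items are included in n_sub
    else:
        rng.shuffle(pool)
    n_train = round(len(pool) * training_ratio)  # assumed rounding rule
    return pool[:n_train], pool[n_train:]


combinations = [f"sentence {i}" for i in range(100)]
train, test = subsample_and_split(combinations, training_ratio=0.7, n_sub=50)
print(len(train), len(test))
```

With `training_ratio=1.0` the second list is empty, which corresponds to the case where no test file is created.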
### parse_input(input_path)
This function is provided as a helper for experimentation. It is the first function called by `generate`.
Given an input configuration file, it generates:
- a list of intents in the form
```
['intent_sentence_0', 'intent_sentence_1', ...]
e.g. from above:
['Could I have the number of @[subject_filter] ~[owners] in @[geo_filter] @[time_filter]?']
```
- a dictionary of entities in the form
```
{'entity_0': ['alternative_00', 'alternative_01', ...],
'entity_1': ['alternative_10', 'alternative_11', ...], ...}
e.g. from above:
{'time_filter': ['this month', 'this year', ...],
'geo_filter': ['France', 'Germany', ...], ...}
```
- a dictionary of synonyms in the form
```
{'alias_0': ['alternative_00', 'alternative_01', ...],
'alias_1': ['alternative_10', 'alternative_11', ...], ...}
e.g. from above:
{'owners': ['owners', 'possessors']}
```
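To experiment with these three structures without the package, a minimal parser for the configuration syntax shown earlier might look like the sketch below. It is not the `parse_input` implementation: `parse_config_text` is a hypothetical name, it works on a string rather than a file path, and it assumes that intent sentences are the lines that appear before the first `@[...]` or `~[...]` header.

```python
import re

# A header is a line made of a single placeholder: @[entity] or ~[alias].
HEADER = re.compile(r"^([@~])\[(\w+)\]$")


def parse_config_text(text):
    """Split configuration text into (intents, entities, aliases) (hypothetical helper)."""
    intents, entities, aliases = [], {}, {}
    current = None  # list currently receiving alternatives
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            kind, name = header.groups()
            target = entities if kind == "@" else aliases
            current = target.setdefault(name, [])
        elif current is None:
            intents.append(line)  # assumption: lines before the first header are intents
        else:
            current.append(line)
    return intents, entities, aliases


sample = """\
How many @[subject_filter] ~[owners] in @[geo_filter]?
@[geo_filter]
France
Germany
@[subject_filter]
dogs
cats
~[owners]
owners
possessors
"""

intents, entities, aliases = parse_config_text(sample)
print(intents)
print(entities)
print(aliases)
```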
## Combination logic
`cross-words` is designed to compute sentences by placing all entity and alias alternatives into all intents.
As a rule of thumb, the overall maximum number of generated sentences is in the order of:
nb<sub>intent sentences</sub> × avg. nb<sub>entity placeholders per intent sentence</sub> × avg. nb<sub>alternatives per entity</sub> × avg. nb<sub>alias placeholders per intent sentence</sub> × avg. nb<sub>alternatives per alias</sub>
As such, the created training files grow exponentially with the number of placeholders, hence the *n_sub* parameter available in **generate**.
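For a single intent sentence, a concrete count can be worked out by multiplying the number of alternatives of each placeholder it contains. The sketch below does this for the example configuration. It assumes each placeholder is filled independently with every one of its alternatives, and it is an independent calculation, not a statement about what `cross-words` computes internally.

```python
import math
import re

PLACEHOLDER = re.compile(r"[@~]\[(\w+)\]")

template = (
    "Could I have the number of @[subject_filter] ~[owners] "
    "in @[geo_filter] @[time_filter]?"
)
# Number of alternatives per placeholder in the example configuration.
alternatives = {"subject_filter": 6, "owners": 2, "geo_filter": 7, "time_filter": 8}

names = PLACEHOLDER.findall(template)
total = math.prod(alternatives[name] for name in names)
print(names)
print(total)  # 6 * 2 * 7 * 8 = 672
```

Adding one more placeholder with k alternatives multiplies the count by k, which is why subsampling quickly becomes necessary.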
In the specific case of stories (Rasa Core), `cross-words` will also use *information availability* as an additional combination dimension.
For example, the two stories below are based on a different initially available information set given by the user:
```
## Genereated Story 257661587723758
* acquisition{"time_filter": "since release", "geo_filter": "Germany"}
- slot{"time_filter": "since release"}
- slot{"geo_filter": "Germany"}
- utter_ask_subject_filter
* acquisition{"subject_filter": "owl"}
- slot{"subject_filter": "owl"}
- action_acquisition
## Genereated Story 877699493192194
* acquisition{"time_filter": "since release"}
- slot{"time_filter": "since release"}
- utter_ask_subject_filter
* acquisition{"subject_filter": "owl"}
- slot{"subject_filter": "owl"}
- utter_ask_geo_filter
* acquisition{"geo_filter": "Germany"}
- slot{"geo_filter": "Germany"}
- action_acquisition
```
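The role of the initially available information can be sketched as follows. This is an illustration built from the story layout shown above, not the `cross-words` implementation: `build_story`, its arguments, and the `## Story` header text are hypothetical, and the order in which missing slots are asked is passed in explicitly.

```python
import itertools
import json


def build_story(intent, slots, initially_given, ask_order, story_id):
    """Return the lines of one story (hypothetical helper)."""
    lines = [f"## Story {story_id}"]

    def user_turn(names):
        # One user message carrying the given slots, followed by one slot event each.
        payload = {name: slots[name] for name in names}
        lines.append(f"* {intent}" + json.dumps(payload))
        for name in names:
            lines.append("- slot" + json.dumps({name: slots[name]}))

    user_turn(list(initially_given))
    for name in ask_order:
        if name in initially_given:
            continue
        lines.append(f"- utter_ask_{name}")  # the bot asks for the missing slot
        user_turn([name])
    lines.append(f"- action_{intent}")
    return lines


slots = {"time_filter": "since release", "geo_filter": "Germany", "subject_filter": "owl"}
order = ["time_filter", "subject_filter", "geo_filter"]

story = build_story("acquisition", slots, ["time_filter", "geo_filter"], order, 1)
print("\n".join(story))

# Every subset of slots can serve as the initially available information.
subsets = [
    given
    for size in range(len(slots) + 1)
    for given in itertools.combinations(slots, size)
]
print(len(subsets))  # 2 ** 3 = 8 starting points for three slots
```

Each subset of slots gives a different opening message, so three slots already yield eight distinct story shapes before any entity value is varied.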