Generate labeled data for conversational AI.

About

putput is a library that generates labeled data for conversational AI. It's simple to use, highly customizable, and can handle big data generation on a consumer-grade laptop. putput takes minutes to set up and seconds to generate millions of labeled data points.

putput's labeled data could be used to:

  • train an ML model when you do not have real data.
  • augment training of specific patterns in an ML model.
  • test existing ML models against specific patterns.

putput provides an API to its Pipeline that specifies how to generate labeled data. It ships with presets that configure the Pipeline for common NLU providers such as LUIS and spaCy. putput excels at generating custom datasets, even for problems that have yet to be solved commercially and for which no publicly available datasets exist. For instance, check out this Jupyter notebook that uses putput to generate a dataset for multi-intent recognition and trains an LSTM with Keras to recognize multiple intents and entities.

Here is an example prediction from the LSTM trained with putput data:

[Image: example multi-intent prediction from the trained LSTM]

Note that the trained LSTM can deal with real-life complexity, such as handling multiple intents ("add" and "remove" groups) and disambiguating between the same word in different contexts (the quantity "ten" vs. "ten" in the item "ten chicken strips").

Installation

putput currently supports Python >= 3.5. To install the production release, execute pip install putput.

Samples

putput ships with several dockerized samples that show how to generate data.

  • Clone the repo: git clone https://github.com/michaelperel/putput.git
  • Move into the project directory: cd putput
  • Ensure docker is running: docker --version
  • Build the runtime environment: docker build -t putput .
  • Run one of the usage samples, for example: docker run putput smart_speaker or docker run putput restaurant.

putput also ships with annotated Jupyter notebooks in the samples/ directory that use putput to solve real-world NLU problems. Note: GitHub cannot correctly render certain graphics, so the notebooks should be viewed on nbviewer.

Development

There are various checks that Travis (our CI server) executes to ensure code quality. You can also run the checks locally:

  1. Install the development dependencies via: pip install -e .[dev]
  2. Run the linter: python setup.py pylint
  3. Run the type checker: python setup.py mypy
  4. Run the tests: python setup.py test

Alternatively, you can run all the steps via Docker: docker build --target=build -t putput .

Usage

putput is a pipeline that works by reshaping the pattern definition, a user-defined YAML file of patterns, into labeled data.

Example

Here is an example of a pattern definition that generates labeled data for a smart speaker.

base_tokens:
  - PERSONAL_PRONOUNS: [he, she]
  - SPEAKER: [cortana, siri, alexa, google]
token_patterns:
  - static:
    - WAKE:
      - [[hi, hey], SPEAKER]
    - PLAY:
      - [PERSONAL_PRONOUNS, [wants, would like], [to], [play]]
      - [[play]]
  - dynamic:
    - ARTIST
    - SONG
groups:
  - PLAY_SONG: [PLAY, SONG]
  - PLAY_ARTIST: [PLAY, ARTIST]
utterance_patterns:
  - [WAKE, PLAY_SONG]
  - [WAKE, PLAY_ARTIST]
  - [WAKE, 1-2, PLAY_SONG]

Focusing on the first utterance_pattern, [WAKE, PLAY_SONG], putput would generate hundreds of utterances, tokens, and groups of the form:

utterance - hi cortana he wants to play here comes the sun

[Image: the generated utterance with its token and group labels]

Pattern definition reference

In the pattern definition, the two most important sections are token_patterns and utterance_patterns. A token_pattern describes a sequence of components whose product constitutes a token. For instance, the sole token_pattern for the WAKE token is [[hi, hey], [cortana, siri, alexa, google]] (the base_token, SPEAKER, is replaced with its value [cortana, siri, alexa, google] at runtime). The product of this token_pattern:

  • hi cortana
  • hi siri
  • hi alexa
  • hi google
  • hey cortana
  • hey siri
  • hey alexa
  • hey google

represents the WAKE token.
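
As a rough sketch (not putput's internal implementation), this expansion is just the Cartesian product of the token_pattern's components:

from itertools import product

# Illustration only: expand the WAKE token_pattern (after base_token substitution)
# into every phrase it can produce.
wake_token_pattern = [['hi', 'hey'], ['cortana', 'siri', 'alexa', 'google']]
wake_phrases = [' '.join(words) for words in product(*wake_token_pattern)]
# ['hi cortana', 'hi siri', ..., 'hey google']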

Within the token_patterns section, there are static and dynamic sections. static means all of the token_patterns can be specified before the application runs. dynamic means the token_patterns will be specified at runtime. In our example, WAKE is defined underneath static because all of the ways to wake the smart speaker are known before runtime. ARTIST and SONG, however, are defined underneath dynamic because the artists and songs in your music catalog may change frequently. The values for these tokens can be passed in as arguments to Pipeline at runtime.
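
For example, the dynamic values are supplied as a mapping from token name to its phrases, mirroring the Pipeline call shown later in the Pipeline section:

# Supplied at runtime rather than in the pattern definition
dynamic_token_patterns_map = {
    'ARTIST': ('the beatles', 'kanye west'),
    'SONG': ('here comes the sun', 'stronger')
}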

Within each token_pattern, base_tokens may be used to avoid repeating the same components. For instance, in our example, we could potentially use PERSONAL_PRONOUNS in many different places, so we'd like to define it only once.

An utterance_pattern describes the product of tokens that make up an utterance. For instance, the first utterance_pattern, [WAKE, PLAY_SONG], expands to [WAKE, PLAY, SONG] (the group PLAY_SONG is replaced with its value [PLAY, SONG]) and is therefore a product of all of the products of the token_patterns for WAKE, PLAY, and SONG. Example utterances generated from this utterance_pattern would be:

  • hi cortana play here comes the sun
  • hi cortana he would like to play here comes the sun
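
In the same spirit as the token_pattern sketch above (again, not putput's actual code), the utterance_pattern's output is the product of the phrase sets of its tokens:

from itertools import product

# Illustration only: combine a few phrases from each token into utterances.
wake_phrases = ['hi cortana', 'hey siri']
play_phrases = ['play', 'he would like to play']
song_phrases = ['here comes the sun']
utterances = [' '.join(parts) for parts in product(wake_phrases, play_phrases, song_phrases)]
# includes 'hi cortana play here comes the sun' and
# 'hi cortana he would like to play here comes the sun'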

Within each utterance_pattern, groups may be used to avoid repeating the same tokens. For instance, in our example, we could potentially use PLAY_SONG in many different places, so we'd like to define it only once. Unlike base_tokens, putput keeps track of groups. To see why that matters, recall one potential output corresponding to the utterance_pattern, [WAKE, PLAY_SONG]:

[Image: the same utterance labeled with its groups]

Since PLAY_SONG is the only group in the utterance_pattern, the WAKE token is assigned the group NONE whereas the PLAY and SONG tokens are assigned the group PLAY_SONG.

Thinking in terms of commercial NLU providers, groups could be mapped to intents and tokens could be mapped to entities.

utterance_patterns and groups support range syntax. Looking at the last utterance_pattern, [WAKE, 1-2, PLAY_SONG], we see the range, 1-2. putput will expand this utterance_pattern into two utterance_patterns, [WAKE, PLAY_SONG] and [WAKE, WAKE, PLAY_SONG]. Ranges are inclusive and may also be specified as a single number, which would expand into one utterance_pattern.
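
Written out (the single-number case below assumes, consistent with the 1-2 example, that the number applies to the preceding token):

utterance_patterns:
  - [WAKE, 1-2, PLAY_SONG]  # expands to [WAKE, PLAY_SONG] and [WAKE, WAKE, PLAY_SONG]
  - [WAKE, 2, PLAY_SONG]    # expands to [WAKE, WAKE, PLAY_SONG]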

Finally, groups may be defined within groups. For instance:

groups:
  - PLAY_SONG: [PLAY, SONG]
  - WAKE_PLAY_SONG: [WAKE, PLAY_SONG, 10]

is valid syntax.

Single Intent Providers (LUIS, Rasa, Lex, etc.)

If your NLU provider only supports single-intent utterances, you can still use putput to generate utterances in the more familiar intent/entities paradigm. To specify single intents, simply add another level to the utterance_patterns section with the intent as the key and all of its utterance patterns beneath it. To specify entities, add a new section called entities with a list of the tokens that you want to be picked up as entities. For example:

base_tokens:
  - PERSONAL_PRONOUNS: [he, she]
  - SPEAKER: [cortana, siri, alexa, google]
token_patterns:
  - static:
    - WAKE:
      - [[hi, hey], SPEAKER]
    - PLAY:
      - [PERSONAL_PRONOUNS, [wants, would like], [to], [play]]
      - [[play]]
  - dynamic:
    - ARTIST
    - SONG
entities: [ARTIST, SONG] # Here we specify which tokens are our entities
utterance_patterns:
  - SONG_INTENT: # Here we specify our intents and which utterance patterns belong to them
    - [WAKE, PLAY, SONG]
    - [WAKE, 1-2, PLAY, SONG]
  - ARTIST_INTENT:
    - [WAKE, PLAY, ARTIST]

For a full example using the single-intent pattern, check out this LUIS example.

Pipeline

After writing the pattern definition, the final step in generating labeled data is instantiating putput's Pipeline and calling flow.

from pathlib import Path
from putput import Pipeline

# Path to the pattern definition YAML (filename here is illustrative)
pattern_def_path = Path('patterns.yml')

# Values for the dynamic ARTIST and SONG tokens, supplied at runtime
dynamic_token_patterns_map = {
    'SONG': ('here comes the sun', 'stronger'),
    'ARTIST': ('the beatles', 'kanye west')
}
p = Pipeline(pattern_def_path, dynamic_token_patterns_map=dynamic_token_patterns_map)
for utterance, tokens, groups in p.flow():
    print(utterance)
    print(tokens)
    print(groups)

flow yields results one utterance at a time. By default, each iteration yields the tuple (utterance, tokens, groups), but the results can be customized by specifying arguments to Pipeline. Some common use cases are limiting the size of the output, oversampling/undersampling utterance_patterns, specifying how tokens and groups are tokenized, etc. Customization of the Pipeline is extensive and is covered in the Pipeline's docs. Common preset configurations are covered in the preset docs.
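
Since flow is a generator, you can also cap or post-process the output yourself without touching these options; a minimal sketch:

# Minimal sketch: collect only the first 1,000 generated examples.
dataset = []
for i, (utterance, tokens, groups) in enumerate(p.flow()):
    if i >= 1000:
        break
    dataset.append((utterance, tokens, groups))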
