text generation for product recommendations using OpenCCG
pypolibox is a database-to-text generation (NLG) program built on Python 2.7, NLTK and Nicholas FitzGerald’s pydocplanner.
Using a database of technical books and some user input, pypolibox generates sentence descriptions. These descriptions are then used by the OpenCCG surface realiser to generate written sentences in German.
Install from PyPI
pip install pypolibox # prepend 'sudo' if needed
Install from source
git clone https://github.com/arne-cl/pypolibox.git
cd pypolibox
python setup.py install # prepend 'sudo' if needed
In order to generate sentences (instead of abstract sentence descriptions), you will need to install OpenCCG (tested with version 0.9.5). Make sure that at least tccg is in your $PATH. Under Linux, you’d have to add something like this to your .bashrc:
export PATH=/home/username/bin/openccg/bin:$PATH
export OPENCCG_HOME=/home/username/bin/openccg
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
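If you are unsure whether your shell picked up these settings, a quick check from Python can help. The following is only a sketch and not part of pypolibox: the helper names find_on_path and check_openccg_setup are made up for this example, and it assumes that tccg must be on your PATH and that both OPENCCG_HOME and JAVA_HOME should be set, as in the .bashrc lines above.

```python
import os


def find_on_path(program, path):
    """Return the full path of `program` if an executable file with that
    name exists in one of the directories listed in `path`, else None."""
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def check_openccg_setup(environ=None):
    """Return a list of human-readable problems (empty if none were found).

    Hypothetical helper: it only looks at the three settings mentioned
    in the installation notes, nothing else.
    """
    if environ is None:
        environ = os.environ
    problems = []
    if find_on_path("tccg", environ.get("PATH", "")) is None:
        problems.append("tccg not found in PATH")
    for variable in ("OPENCCG_HOME", "JAVA_HOME"):
        if not environ.get(variable):
            problems.append("%s is not set" % variable)
    return problems


if __name__ == "__main__":
    for problem in check_openccg_setup():
        print(problem)
```

An empty list means that all three settings were found. It does not prove that OpenCCG itself works.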
pypolibox can be used from the command line or from within a Python interpreter. To see all the available options, enter:
python pypolibox.py -h
To find books that are written in German and use the programming language Prolog, type:
python pypolibox.py --language German --proglang Prolog
or, if you prefer short but cryptic commands:
python pypolibox.py -l German -p Prolog
If you’re just interested in text plans (as opposed to generated sentences), add the -x or --xml command line option:
python pypolibox.py --language German --proglang Prolog --xml
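To illustrate how the long and short options relate, here is a small stand-alone sketch using argparse from the standard library. It is not pypolibox's actual parser (whether pypolibox uses argparse at all is an assumption), and it only covers the three options shown above; run python pypolibox.py -h for the real list.

```python
import argparse


def build_parser():
    """Build a toy parser with the three options mentioned in this README.

    Hypothetical stand-in: the real pypolibox parser has more options
    and may be implemented differently.
    """
    parser = argparse.ArgumentParser(prog="pypolibox.py")
    parser.add_argument("-l", "--language")
    parser.add_argument("-p", "--proglang")
    parser.add_argument("-x", "--xml", action="store_true")
    return parser


parser = build_parser()
long_form = parser.parse_args(
    ["--language", "German", "--proglang", "Prolog", "--xml"])
short_form = parser.parse_args(["-l", "German", "-p", "Prolog", "-x"])

# Both spellings end up as the same set of parsed values.
print(vars(long_form) == vars(short_form))
```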
Further usage examples can be found in the pypolibox.database.Query class documentation. If you’d like to access pypolibox from within a Python interpreter, you can simply use the same arguments. Instead of a string like -l German -p Prolog, you will have to provide your arguments as a list of strings:
Query(["-l", "German", "-p", "Prolog"])
This query would be equivalent to the command line queries above. pypolibox is built as a pipeline, where each important step is represented by a class. An instance of each class serves as the input of the next class in the pipeline, e.g.:
query = Query(["-l", "German", "-p", "Prolog"])
Results(query)
Books(Results(query))
...
TextPlans(AllMessages(AllPropositions(AllFacts(Books(Results(query))))))
If you instantiate a Query with your query arguments, you can use this Query instance as the input of a Results instance (which contains the data that the database provided for your query), which in turn can be used as the input of a Books instance etc.
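The chaining pattern can be sketched with toy classes. The Query, Results and Books classes below only borrow their names from pypolibox; their bodies are made-up stand-ins (including the placeholder book title) so that the example runs without the package or its database. The sketch also shows how shlex.split from the standard library turns a command line string into the list of strings that Query expects.

```python
import shlex


class Query(object):
    """Toy stand-in: stores the query arguments as a flag -> value dict."""

    def __init__(self, argv):
        # Assumes strictly alternating flags and values,
        # e.g. ["-l", "German", "-p", "Prolog"].
        self.args = dict(zip(argv[0::2], argv[1::2]))


class Results(object):
    """Toy stand-in: would hold the database rows matching a Query."""

    def __init__(self, query):
        # Placeholder row instead of a real database lookup.
        self.rows = [{"title": "Placeholder Book",
                      "language": query.args.get("-l"),
                      "proglang": query.args.get("-p")}]


class Books(object):
    """Toy stand-in: wraps the rows of a Results instance."""

    def __init__(self, results):
        self.titles = [row["title"] for row in results.rows]


# Turn a command line string into a list of strings ...
argv = shlex.split("-l German -p Prolog")

# ... and pass each instance on as the input of the next stage.
query = Query(argv)
results = Results(query)
books = Books(results)
print(argv)
print(books.titles)
```

Each constructor takes exactly one argument, the instance produced by the previous stage, which is what makes the nested one-line form possible.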
Of course, you wouldn’t want to chain all those classes just to retrieve textplans. To do so, simply use one of the functions provided in the debug module, either by running the debug.py file in the interpreter or by importing it:
import debug
debug.gen_textplans(["-l", "German", "-p", "Prolog"])
This function call would return the same results as the aforementioned command line calls. For further testing, try debug.testqueries and debug.error_testqueries, which basically are lists of predefined valid and invalid query arguments and which can be used to query the database (and see how errors are handled).
I used epydoc to document pypolibox. You can generate an HTML or PDF version by running these commands in pypolibox’s main directory:
mkdir -p doc/latex
epydoc --pdf --name pypolibox --output doc/latex src/pypolibox
to produce a PDF (doc/latex/api.pdf) and
epydoc --html --name pypolibox --graph all --output doc/html src/pypolibox
to produce a set of HTML files.
The pypolibox package contains the following modules:
- The pypolibox module is the main module, which is invoked from the command line.
- The database module handles the user input, queries the database and returns the results.
- The facts module converts those results into attribute value matrices.
- The propositions module evaluates those facts (positive, negative, neutral).
- The textplan module takes those propositions and turns them into messages. In contrast to propositions, messages do not contain duplicates and add comparative information. Rules are used to combine those messages into constituent sets and ultimately into one text plan. The textplan module also allows exporting those text plans in XML format.
- The rules module contains the rules used by the textplan module to combine messages into constituent sets and textplans, respectively.
- The messages module generates messages from propositions, which will be used by the textplan module.
- The lexicalize_messageblocks module is the “main” module of the lexicalization. For each message block in a textplan, it generates one or more possible lexicalizations which are then realized by the realization module.
- The lexicalization module generates lexicalizations (in HLDS-XML format) for each message, which are used by the lexicalize_messageblocks module to form lexicalizations of complete message blocks.
- A note on terminology: A message block in pypolibox is basically an instance of the Message class, e.g. an “id” message block. This “id” message block in turn consists of several messages, e.g. an “authors” message and a “title” message.
- The realization module takes a lexicalized phrase or sentence (in HLDS-XML format) and converts it into a surface realization (with the help of OpenCCG’s tccg executable).
- The hlds module converts textplans from a nltk.featstruct-based format to HLDS-XML and vice versa. In addition, the module can produce attribute-value matrices of these textplans as LaTeX/PDF files.
The code is licensed under GPL Version 3. The grammar fragment is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This software reimplements parts of the Java-based JPolibox text-generation software written by Alexandra Strelakova, Felix Dombek, Mathias Langer and Till Kolter. pypolibox also includes a heavily modified version of Nicholas FitzGerald’s pydocplanner, which he released under a Creative Commons license (not specified further). The German OpenCCG grammar fragment that comes with pypolibox was written by Martin Oltmann.
Release date: 30-Apr-2014
- pypolibox is now licensed under GPLv3
- OpenCCG grammar fragment (CC-BY-NC-SA 4.0 licensed) now shipped with code
- first release via PyPI
- got rid of configuration file
- fixed some errors in the documentation