Extraction de LExique par Variation d'Entropie - Lexicon extraction based on the variation of entropy

Project description

What is ELeVE?

ELeVE is a library for calculating a specialized language model from a corpus of text.

It uses statistics from the training corpus to calculate branching entropy and autonomy measures for n-grams of text. See [MagistrySagot2012] for a definition of these terms (autonomy is also called "nVBE", for "normalized entropy variation").
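
To make the first of these terms concrete, here is a minimal pure-Python sketch of right branching entropy: the Shannon entropy of the distribution of tokens that follow a given n-gram in the corpus. It does not use ELeVE and is not its implementation. The function name `right_branching_entropy`, the `</s>` end-of-sentence sentinel and the toy corpus are assumptions made for illustration. Autonomy as defined in [MagistrySagot2012] additionally involves entropy variation and normalization, which this sketch leaves out.

```python
import math
from collections import Counter


def right_branching_entropy(sentences, ngram):
    """Entropy (in bits) of the tokens observed right after `ngram`."""
    ngram = list(ngram)
    n = len(ngram)
    followers = Counter()
    for sentence in sentences:
        # Hypothetical sentinel: the sentence end counts as a possible successor.
        tokens = list(sentence) + ["</s>"]
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == ngram:
                followers[tokens[i + n]] += 1
    total = sum(followers.values())
    if total == 0:
        # The n-gram was never seen: no successor distribution to measure.
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in followers.values())


corpus = [
    ["I", "like", "New", "York", "city"],
    ["I", "like", "potatoes"],
    ["potatoes", "are", "fine"],
    ["New", "York", "is", "a", "fine", "city"],
]

# "New York" is followed by two different tokens, once each.
print(right_branching_entropy(corpus, ["New", "York"]))
# "New" is always followed by "York", so there is no uncertainty.
print(right_branching_entropy(corpus, ["New"]))
```

A high value means many different continuations are possible after the n-gram, which suggests a boundary; a low value means the next token is predictable.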

It was mainly developed for segmentation of Mandarin Chinese, but has also been used successfully in research on other tasks such as keyphrase extraction.

Full documentation is available at http://pythonhosted.org/eleve/.

In a nutshell

Here is a simple “getting started” example. First you have to train a model:

>>> from eleve import MemoryStorage
>>>
>>> storage = MemoryStorage()
>>>
>>> # Then the training itself:
>>> storage.add_sentence(["I", "like", "New", "York", "city"])
>>> storage.add_sentence(["I", "like", "potatoes"])
>>> storage.add_sentence(["potatoes", "are", "fine"])
>>> storage.add_sentence(["New", "York", "is", "a", "fine", "city"])

And then you can query it:

>>> storage.query_autonomy(["New", "York"])
2.0369977951049805
>>> storage.query_autonomy(["like", "potatoes"])
-0.3227022886276245

ELeVE also stores n-gram frequencies:

>>> storage.query_count(["New", "York"])
2
>>> storage.query_count(["New", "potatoes"])
0
>>> storage.query_count(["I", "like", "potatoes"])
1
>>> storage.query_count(["potatoes"])
2
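
For intuition, the counts above can be reproduced without ELeVE by enumerating every contiguous n-gram of each training sentence. The sketch below does this with `collections.Counter`. The names `count_ngrams` and `max_n` are hypothetical, and this is not a description of how ELeVE's storage works internally.

```python
from collections import Counter


def count_ngrams(sentences, max_n=4):
    """Count every contiguous n-gram (as a tuple) of length 1 to `max_n`."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # The range is empty when the sentence is shorter than n.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


sentences = [
    ["I", "like", "New", "York", "city"],
    ["I", "like", "potatoes"],
    ["potatoes", "are", "fine"],
    ["New", "York", "is", "a", "fine", "city"],
]
counts = count_ngrams(sentences)

print(counts[("New", "York")])
print(counts[("New", "potatoes")])  # a Counter returns 0 for unseen n-grams
print(counts[("I", "like", "potatoes")])
print(counts[("potatoes",)])
```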

Then you can use it for segmentation:

>>> from eleve import Segmenter
>>> s = Segmenter(storage)
>>> # Segments using up to 4-grams, assuming the same storage as before.
>>>
>>> s.segment(["What", "do", "you", "know", "about", "New", "York"])
[['What'], ['do'], ['you'], ['know'], ['about'], ['New', 'York']]
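
One way to picture segmentation is as a search for the split of the sentence whose chunks have the highest total score. The sketch below is a generic dynamic-programming segmenter over chunks of at most `max_n` tokens. It is an illustration under stated assumptions, not ELeVE's `Segmenter`: the function `best_segmentation`, the `score` callable and the toy scores are all hypothetical. With ELeVE, a scoring function could plausibly be built on `storage.query_autonomy`.

```python
def best_segmentation(tokens, score, max_n=4):
    """Split `tokens` into chunks of at most `max_n` tokens with maximal summed score."""
    # best[i] = (best total score for tokens[:i], start index of the last chunk)
    best = [(0.0, 0)] + [(float("-inf"), 0)] * len(tokens)
    for end in range(1, len(tokens) + 1):
        for start in range(max(0, end - max_n), end):
            total = best[start][0] + score(tokens[start:end])
            if total > best[end][0]:
                best[end] = (total, start)
    # Follow the back-pointers from the end of the sentence to recover the chunks.
    chunks, end = [], len(tokens)
    while end > 0:
        start = best[end][1]
        chunks.append(tokens[start:end])
        end = start
    return chunks[::-1]


def toy_score(chunk):
    """Hypothetical scores: one known phrase is rewarded, other multi-token chunks are penalized."""
    if chunk == ["New", "York"]:
        return 2.0
    return 0.0 if len(chunk) == 1 else -1.0


sentence = ["What", "do", "you", "know", "about", "New", "York"]
print(best_segmentation(sentence, toy_score))
```

With these toy scores the sketch keeps "New York" together and leaves every other token on its own.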

Installation

You will need some dependencies. On Ubuntu:

$ sudo apt-get install libboost-python-dev libboost-filesystem-dev libleveldb-dev

Then to install eleve:

$ pip install eleve

or, if you have a local clone of the source folder:

$ python setup.py install

Get the source

The source is hosted on GitHub:

$ git clone https://github.com/kodexlab/eleve

Contribute

Install the development environment:

$ git clone https://github.com/kodexlab/eleve
$ cd eleve
$ virtualenv ENV -p /usr/bin/python3
$ source ENV/bin/activate
$ pip install -r requirements.txt
$ pip install -r requirements.dev.txt

Pull requests are welcome!

To run tests:

$ make testall

To build the doc:

$ make doc

then open: docs/_build/html/index.html

Warning: eleve must be accessible on the Python path to run the tests (and to build the doc). To achieve this, you can install eleve as a link in the local virtualenv:

$ pip install -e .

(Note: this is recommended in pytest's good practices.)
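
A quick way to confirm that a package is visible on the Python path before running the tests is to ask `importlib` whether it can locate it. This is a small sketch using only the standard library; the helper name `is_importable` is hypothetical and not part of eleve.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the top-level module `module_name` can be found on the Python path."""
    return importlib.util.find_spec(module_name) is not None


# Expected to print True once `pip install -e .` has been run in the active virtualenv.
print(is_importable("eleve"))
```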

References

If you use eleve for academic work, please cite this paper:

[MagistrySagot2012] Magistry, P., & Sagot, B. (2012, July). Unsupervized word segmentation: the case for Mandarin Chinese. In Proceedings of the 50th Annual Meeting of the ACL: Short Papers - Volume 2 (pp. 383-387). http://www.aclweb.org/anthology/P12-2075

Project details

Release history

15.10.r2 (this version)
15.10.r1

Download files

eleve-15.10.r2.tar.gz (24.5 kB), source distribution, uploaded Oct 31, 2015
