
Extraction de LExique par Variation d'Entropie - Lexicon extraction based on the variation of entropy

What is ELeVE?

ELeVE is a library intended for computing an “autonomy estimate” score for substrings (all n-grams) in a corpus of text.

The autonomy score is based on the normalised variation of branching entropies (nVBE) of strings. See [MagistrySagot2012] for a definition of these terms.
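To give a rough feel for the quantities involved, here is a minimal sketch in plain Python, independent of ELeVE. It computes only the raw right branching entropy of an n-gram (the entropy of the token that follows it) and its variation when the n-gram is extended by one token. The function name, the `</s>` end marker and the use of base-2 logarithms are assumptions made for this illustration; the normalisation step and the left-hand direction described in [MagistrySagot2012] are left out, and the library's own implementation may differ.

```python
import math
from collections import Counter


def right_branching_entropy(sentences, ngram):
    """Entropy (in bits) of the token that follows `ngram` in `sentences`."""
    ngram = list(ngram)
    n = len(ngram)
    followers = Counter()
    for sentence in sentences:
        # Assumption: an end marker gives sentence-final n-grams a successor.
        tokens = list(sentence) + ["</s>"]
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == ngram:
                followers[tokens[i + n]] += 1
    total = sum(followers.values())
    if total == 0:
        return 0.0  # unseen n-gram: nothing to measure
    return sum(-(c / total) * math.log2(c / total) for c in followers.values())


corpus = [
    ["I", "like", "New", "York", "city"],
    ["I", "like", "potatoes"],
    ["potatoes", "are", "fine"],
    ["New", "York", "is", "a", "fine", "city"],
]

# "New" is always followed by "York", so what comes next is fully predictable.
h_new = right_branching_entropy(corpus, ["New"])
# "New York" is followed by "city" once and by "is" once.
h_new_york = right_branching_entropy(corpus, ["New", "York"])
# Entropy rises when "New" is extended to "New York", hinting at a word boundary.
variation = h_new_york - h_new
```

On this toy corpus the entropy rises after "New York", which is the kind of signal that the autonomy score builds on.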

ELeVE was developed mainly for unsupervised segmentation of Mandarin Chinese, but it is language independent and has been used successfully in research on other tasks such as keyphrase extraction.

Full documentation is available on http://pythonhosted.org/eleve/.

In a nutshell

Here is a quick getting-started example. First, train a model:

>>> from eleve import MemoryStorage
>>>
>>> storage = MemoryStorage()
>>>
>>> # Then the training itself:
>>> storage.add_sentence(["I", "like", "New", "York", "city"])
>>> storage.add_sentence(["I", "like", "potatoes"])
>>> storage.add_sentence(["potatoes", "are", "fine"])
>>> storage.add_sentence(["New", "York", "is", "a", "fine", "city"])

Then you can query it:

>>> storage.query_autonomy(["New", "York"])
2.0369977951049805
>>> storage.query_autonomy(["like", "potatoes"])
-0.3227022886276245

ELeVE also stores the occurrence count of each n-gram:

>>> storage.query_count(["New", "York"])
2
>>> storage.query_count(["New", "potatoes"])
0
>>> storage.query_count(["I", "like", "potatoes"])
1
>>> storage.query_count(["potatoes"])
2
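For intuition about what these counts mean, the sketch below reproduces them with the standard library only. It is not how ELeVE stores its data; `count_ngrams` and its `max_len` parameter are hypothetical names introduced for this illustration.

```python
from collections import Counter


def count_ngrams(sentences, max_len=4):
    """Count every contiguous n-gram of up to `max_len` tokens."""
    counts = Counter()
    for tokens in sentences:
        for start in range(len(tokens)):
            longest_end = min(start + max_len, len(tokens))
            for end in range(start + 1, longest_end + 1):
                counts[tuple(tokens[start:end])] += 1
    return counts


corpus = [
    ["I", "like", "New", "York", "city"],
    ["I", "like", "potatoes"],
    ["potatoes", "are", "fine"],
    ["New", "York", "is", "a", "fine", "city"],
]

counts = count_ngrams(corpus)
new_york = counts[("New", "York")]                   # seen in two sentences
new_potatoes = counts[("New", "potatoes")]           # never adjacent
i_like_potatoes = counts[("I", "like", "potatoes")]  # seen in one sentence
potatoes = counts[("potatoes",)]                     # seen in two sentences
```

A `Counter` returns 0 for keys it has never seen, which matches the behaviour shown above for an n-gram that does not occur in the corpus.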

You can then use it for segmentation, with an algorithm that looks for the solution that maximizes the nVBE of the resulting words:

>>> from eleve import Segmenter
>>> s = Segmenter(storage)
>>> # segment up to 4-grams, if we used the same storage as before.
>>>
>>> s.segment(["What", "do", "you", "know", "about", "New", "York"])
[['What'], ['do'], ['you'], ['know'], ['about'], ['New', 'York']]
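The search for a best-scoring segmentation can be sketched as a small dynamic program. This is an illustration only, not the code of `Segmenter`: the function name, the `max_len` parameter, the weighting of each word's score by its length and the toy scores are all assumptions made for the example. In real use the scores would come from a trained model.

```python
def best_segmentation(tokens, score, max_len=4):
    """Split `tokens` into words of at most `max_len` tokens, maximizing
    the sum of score(word) * len(word) (an assumed objective)."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: best total for tokens[:i]
    back = [0] * (n + 1)              # back[i]: start of the last word in that solution
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = tuple(tokens[start:end])
            candidate = best[start] + score(word) * len(word)
            if candidate > best[end]:
                best[end], back[end] = candidate, start
    # Walk the back-pointers from the end to recover the words.
    words, end = [], n
    while end > 0:
        start = back[end]
        words.append(list(tokens[start:end]))
        end = start
    return words[::-1]


# Hypothetical toy scores, not values produced by ELeVE.
toy_scores = {("New", "York"): 2.0}


def toy_score(word):
    if len(word) == 1:
        return 0.0  # single tokens are neutral
    return toy_scores.get(word, -1.0)  # unknown multi-token words are penalized


result = best_segmentation(
    ["What", "do", "you", "know", "about", "New", "York"], toy_score
)
```

With these toy scores, only "New York" is worth grouping, so every other token ends up as a word of its own.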

Installation

You will need some dependencies. On Ubuntu:

$ sudo apt-get install python3-dev libboost-python-dev libboost-filesystem-dev libleveldb-dev

Then to install eleve:

$ pip install eleve

Or, if you have a local clone of the source folder:

$ python setup.py install

Get the source

The source is hosted on GitHub:

$ git clone https://github.com/kodexlab/eleve

Contribute

Install the development environment:

$ git clone https://github.com/kodexlab/eleve
$ cd eleve
$ virtualenv ENV -p /usr/bin/python3
$ source ENV/bin/activate
$ pip install -r requirements.txt
$ pip install -r requirements.dev.txt

Pull requests are welcome!

To run tests:

$ make testall

To build the doc:

$ make doc

then open: docs/_build/html/index.html

Warning: eleve must be accessible on the Python path to run the tests (and to build the doc). To do that, you can install eleve as a link in the local virtualenv:

$ pip install -e .

(Note: this is the approach recommended in pytest's good practices.)

References

If you use eleve for an academic publication, please cite this paper:

[MagistrySagot2012]

Magistry, P., & Sagot, B. (2012, July). Unsupervized word segmentation: the case for Mandarin Chinese. In Proceedings of the 50th Annual Meeting of the ACL: Short Papers, Volume 2 (pp. 383-387). http://www.aclweb.org/anthology/P12-2075
