lexicons_builder, a tool to create lexicons

Project description

The lexicons_builder package aims to provide a basic API to create lexicons related to specific words.

Key principle: given the input words, the tool looks for synonyms or neighbors in the dictionaries or in the NLP model. For each newly retrieved term, it then looks again for its neighbors or synonyms, and so on.
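As a rough illustration of that loop (not the package's actual implementation), the expansion can be pictured as below, where get_related_terms is a hypothetical stand-in for a dictionary, NLP-model or WordNet lookup:

def expand(root_words, depth, get_related_terms):
    """Iteratively collect related terms: each pass looks up the
    neighbors of the terms found in the previous pass."""
    found = set(root_words)
    frontier = set(root_words)
    for _ in range(depth):
        next_frontier = set()
        for word in frontier:
            next_frontier.update(get_related_terms(word))
        # only keep terms we have not seen yet, then go one level deeper
        frontier = next_frontier - found
        found.update(frontier)
    return found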

The general method is implemented on 3 different supports:

  1. Synonym dictionaries (see the list of dictionaries in list_dictionnaries.rst)

  2. NLP language models

  3. WordNet (or WOLF)

Output can be a text file, an xlsx file, a turtle file, or a Graph object. See the QuickStart section for examples.

Full documentation is available on readthedocs.

Note

Feel free to raise an issue on GitHub if something isn’t working for you.

Installation

With pip

It is recommended to use a virtual environment.

$ python -m venv env
$ source env/bin/activate
$ pip install lexicons-builder

From source

To install the module from source:

$ pip install git+git://github.com/GuillaumeLNB/lexicons_builder

Download NLP models (optional)

Here’s a non-exhaustive list of websites where you can download NLP models manually. The models should be in word2vec or fastText format.

  • https://fauconnier.github.io/#data (French)

  • https://wikipedia2vec.github.io/wikipedia2vec/pretrained/ (Multilingual)

  • http://vectors.nlpl.eu/repository/ (Multilingual)

  • https://github.com/alexandres/lexvec#pre-trained-vectors (Multilingual)

  • https://fasttext.cc/docs/en/english-vectors.html (English / Multilingual)

  • https://github.com/mmihaltz/word2vec-GoogleNews-vectors (English)
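To sanity-check a downloaded model before passing it to lexicons_builder, it can usually be opened with gensim (gensim is used here purely for illustration; it is an assumption, not a documented requirement of the package):

>>> from gensim.models import KeyedVectors
>>> # binary word2vec model; use binary=False for plain-text .vec files (e.g. fastText vectors)
>>> model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
>>> model.most_similar("book", topn=5)  # nearest neighbours and their similarity scores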

Download WordNet

>>> import nltk
>>> nltk.download()
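
Called with no arguments, nltk.download() opens an interactive downloader; to fetch only the WordNet data non-interactively, you can name the corpora directly:

>>> import nltk
>>> nltk.download("wordnet")   # the WordNet corpus itself
>>> nltk.download("omw-1.4")   # Open Multilingual Wordnet, useful for non-English lookups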

Download WOLF (French WordNet) (optional)

$ # download WOLF (the French WordNet) if needed
$ wget https://gforge.inria.fr/frs/download.php/file/33496/wolf-1.0b4.xml.bz2
$ # (and extract it)
$ bzip2 -d wolf-1.0b4.xml.bz2

QuickStart

Command line interface (CLI)

To get related words from input words through the CLI, run:

$ python -m lexicons_builder <words>  \
      --lang <LANG>                 \
      --out-file <OUTFILE>          \
      --format <FORMAT>             \
      --depth <DEPTH>               \
      --nlp-model <NLP_MODEL_PATHS> \
      --web                         \
      --wordnet                     \
      --wolf-path <WOLF_PATH>       \
      --strict
With:
  • <words> The word(s) we want to get synonyms from

  • <LANG> The word language (e.g. fr, en, nl, …)

  • <DEPTH> The depth to which we want to dig into the models, websites, …

  • <OUTFILE> The file where the results will be stored

  • <FORMAT> The desired output format (txt with indentation, ttl, or xlsx)

At least ONE of the following options is needed:
  • --nlp-model <NLP_MODEL_PATHS> The path to the nlp model(s)

  • --web Search online for synonyms

  • --wordnet Search on WordNet using nltk

  • --wolf-path <WOLF_PATH> The path to WOLF (French wordnet)

Optional
  • --strict Remove non-relevant words

E.g. if we want to look for related terms linked to ‘eat’ and ‘drink’ on WordNet at a depth of 1, execute:

$ python -m lexicons_builder eat drink  \
      --lang        en                  \
      --out-file    test_en.txt         \
      --format      txt                 \
      --depth       1                   \
      --wordnet
$ # Note: the indentation reflects the depth at which each word was found
$ head test_en.txt
  drink
  eat
    absorb
    ade
    aerophagia
    alcohol
    alcoholic_beverage
    alcoholic_drink
    banquet
    bar_hop
    belt_down
    beverage
    bi
  ...

Python

To get related terms interactively through Python, run:

>>> from lexicons_builder import build_lexicon
>>> # search for related terms of 'book' and 'read' in English at depth 1 online
>>> output = build_lexicon(["book", "read"], 'en', 1, web=True)
...
>>> # we then get a graph object
>>> # output as a list
>>> output.to_list()
['PS', 'accept', 'accommodate', 'according to the rules', 'account book', 'accountability', 'accountancy', 'accountant', 'accounting', 'accounts', 'accuse', 'acquire', 'act', 'adjudge', 'admit', 'adopt', 'afl', 'agree', 'aim', "al-qur'an", 'album', 'allege', 'allocate', 'allow', 'analyse', 'analyze', 'annuaire', 'anthology', 'appear in reading', 'apply', 'appropriate', 'arrange', 'arrange for', 'arrest', 'articulate', 'ascertain' ...
>>> # output as rdf/turtle
>>> print(output)
@prefix ns1: <http://taxref.mnhn.fr/lod/property/> .
@prefix ns2: <urn:default:baseUri:#> .
@prefix ns3: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns2:PS ns1:isSynonymOf ns2:root_word_uri ;
    ns3:prefLabel "PS" ;
    ns2:comesFrom <synonyms.com> ;
    ns2:depth 1 .

ns2:accept ns1:isSynonymOf ns2:root_word_uri ;
    ns3:prefLabel "accept" ;
    ns2:comesFrom <synonyms.com> ;
    ns2:depth 1 .
...

>>> # output to an indented file
>>> output.to_text_file("filename.txt")
>>> with open("filename.txt") as f:
...     print(f.read(1000))
...
read
book
  PS
  accept
  accommodate
  according to the rules
  account book
  accountability
...
>>> # output to an xlsx file
>>> output.to_xlsx_file("results.xlsx")

>>> # full search with 2 nlp models, wordnet and on the web
>>> # download and extract google word2vec model
>>> # from https://github.com/mmihaltz/word2vec-GoogleNews-vectors
>>>
>>> # download and extract FastText models
>>> # from https://fasttext.cc/docs/en/english-vectors.html
>>>
>>> nlp_models = ["GoogleNews-vectors-negative300.bin", "wiki-news-300d-1M.vec"]
>>> output = build_lexicon(["book", "letter"], "en", 1, web=True, wordnet=True, nlp_model_paths=nlp_models)
>>> # can take a while
>>> len(output.to_list())
614
>>> # remove non-relevant words
>>> output.pop_non_relevant_words()
>>> len(output.to_list())
57
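
Since the graph serializes to standard turtle (as printed above), it can also be re-parsed and queried with rdflib; a minimal sketch, assuming str(output) yields the same turtle that print(output) shows:

>>> from rdflib import Graph
>>> g = Graph()
>>> g.parse(data=str(output), format="turtle")
>>> # list every term's label together with the depth at which it was found
>>> query = """
... SELECT ?label ?depth WHERE {
...     ?term <http://www.w3.org/2004/02/skos/core#prefLabel> ?label ;
...           <urn:default:baseUri:#depth> ?depth .
... }"""
>>> for label, depth in g.query(query):
...     print(depth, label)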

