lexicons_builder, a tool to create lexicons
Project description
The lexicons_builder package aims to provide a basic API to create lexicons related to specific words.
Key principle: Given the input words, it will look for synonyms or neighboors in the dictionnaries or in the NLP model. For each of the new retreiven terms, it will look again for its neighboors or synonyms and so on..
The general method is implemented on 3 different supports:
Synonyms dictionnaries (See list of the dictionnaries : ref:here <list_dictionnaries.rst>)
NLP language models
Output can be text file, xlsx file, turtle file or a Graph object. See <Quickstart> section for examples.
Full documentation available on readthedocs
Note
Feel free to raise an issue on GitHub if something isn’t working for you.
Installation
With pip
It is recommanded to use a virtual environment.
$ python -m venv env $ source env/bin/activate $ pip install lexicons-builder
From source
To install the module from source:
$ pip install git+git://github.com/GuillaumeLNB/lexicons_builder
Download NLP models (optionnal)
Here’s a non exhaustive list of websites where you can download NLP models manually. The models should be in word2vec or fasttext format.
Link |
Language(s) |
---|---|
French |
|
Multilingual |
|
Multilingual |
|
Multilingual |
|
English / Multilingual |
|
English |
Download wordnet
>>> import nltk >>> nltk.download()
Download WOLF (French WordNet) (optionnal)
$ # download WOLF (French wordnet if needed) $ wget https://gforge.inria.fr/frs/download.php/file/33496/wolf-1.0b4.xml.bz2 $ # (and extract it) $ bzip2 -d wolf-1.0b4.xml.bz2
QuickStart
Command line interface (CLI)
To get words from input words through CLI, run
$ python -m lexicons_builder <words> \ --lang <LANG> \ --out-file <OUTFILE> \ --format <FORMAT> \ --depth <DEPTH> \ --nlp-model <NLP_MODEL_PATHS> \ --web \ --wordnet \ --wolf-path <WOLF_PATH> \ --strict
- With:
<words> The word(s) we want to get synonyms from
<LANG> The word language (eg: fr, en, nl, …)
<DEPTH> The depth we want to dig in the models, websites, …
<OUTFILE> The file where the results will be stored
<FORMAT> The wanted output format (txt with indentation, ttl or xlsx)
- At least ONE of the following options is needed:
--nlp-model <NLP_MODEL_PATHS> The path to the nlp model(s)
--web Search online for synonyms
--wordnet Search on WordNet using nltk
--wolf-path <WOLF_PATH> The path to WOLF (French wordnet)
- Optional
--strict remove non relevant words
Eg: if we want to look for related terms linked to ‘eat’ and ‘drink’ on wordnet at a depth of 2, excecute:
$ python -m lexicons_builder eat drink \ --lang en \ --out-file test_en.txt \ --format txt \ --depth 1 \ --wordnet $ Note the indentation is linked to the depth a which the word was found $ head test_en.txt drink eat absorb ade aerophagia alcohol alcoholic_beverage alcoholic_drink banquet bar_hop belt_down beverage bi ...
Python
To get related terms interactively through Python, run
>>> from lexicons_builder import build_lexicon >>> # search for related terms of 'book' and 'read' in English at depth 1 online >>> output = build_lexicon(["book", "read"], 'en', 1, web=True) ... >>> # we then get a graph object >>> # output as a list >>> output.to_list() ['PS', 'accept', 'accommodate', 'according to the rules', 'account book', 'accountability', 'accountancy', 'accountant', 'accounting', 'accounts', 'accuse', 'acquire', 'act', 'adjudge', 'admit', 'adopt', 'afl', 'agree', 'aim', "al-qur'an", 'album', 'allege', 'allocate', 'allow', 'analyse', 'analyze', 'annuaire', 'anthology', 'appear in reading', 'apply', 'appropriate', 'arrange', 'arrange for', 'arrest', 'articulate', 'ascertain' ... >>> # output as rdf/turtle >>> print(output) @prefix ns1: <http://taxref.mnhn.fr/lod/property/> . @prefix ns2: <urn:default:baseUri:#> . @prefix ns3: <http://www.w3.org/2004/02/skos/core#> . @prefix xsd: <http://www.w3.org/2001/XMLSchema#> . ns2:PS ns1:isSynonymOf ns2:root_word_uri ; ns3:prefLabel "PS" ; ns2:comesFrom <synonyms.com> ; ns2:depth 1 . ns2:accept ns1:isSynonymOf ns2:root_word_uri ; ns3:prefLabel "accept" ; ns2:comesFrom <synonyms.com> ; ns2:depth 1 . ... >>> # output to an indented file >>> output.to_text_file("filename.txt") >>> with open("filename.txt") as f: ... print(f.read(1000)) ... read book PS accept accommodate according to the rules account book accountability ... >>> # output to xslx file >>> output.to_xlsx_file("results.xlsx") >>> # full search with 2 nlp models, wordnet and on the web >>> # download and extract google word2vec model >>> # from https://github.com/mmihaltz/word2vec-GoogleNews-vectors >>> >>> # download and extract FastText models >>> # from https://fasttext.cc/docs/en/english-vectors.html >>> >>> nlp_models = ["GoogleNews-vectors-negative300.bin", "wiki-news-300d-1M.vec"] >>> output = build_lexicon(["book", "letter"], "en", 1, web=True, wordnet=True, nlp_model_paths=nlp_models) >>> # can take a while >>> len(output.to_list()) 614 >>> # deleting non relevant words >>> output.pop_non_relevant_words() >>> len(output.to_list()) 57
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for lexicons_builder-0.3.2-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0ae3ac64a417b041a81ccb70e2a50b9467b4c9da55c246dd3297546b514020ad |
|
MD5 | a27c676466b89cf97c9c7f37095679db |
|
BLAKE2b-256 | 4f56ddfa8d7d9df6251079ef14eecd676cf17fb7023b768c9db22c1edd89aca5 |