ldt·PyPI

Linguistic diagnostics for word embeddings

These details have not been verified by PyPI

Project links

Homepage

Intended Audience
- Science/Research
License
- OSI Approved :: Apache Software License
Natural Language
- English
Operating System
- OS Independent
Programming Language
- Python
Topic
- Text Processing :: Linguistic

Project description

TLDR

LDT is a shiny new Python library for doing two things:

- querying many dictionaries through a unified interface for spelling normalization, lemmatization, morphological analysis, and retrieval of semantic relations from WordNet, Wiktionary, BabelNet, and more;
- using the above to explore and profile word embeddings, i.e. the cool distributional representations of words as vectors.

If you have never heard about word embeddings, you’re missing out: here’s an introduction. If you have, head over to the project website for some new research results. And if you don’t care about word embeddings, you can still use LDT as a supplement to NLTK, SpaCy, and other great NLP tools.

Note: LDT is in active development; all the dictionary functionality is already available. Scripts for running experiments and integration with the vecto library are coming in the next few weeks. Make sure you update your installation often!

Current functionality

LDT provides a unified Python interface for querying a large number of resources for natural language processing, including Wiktionary, BabelNet, WordNet, and a lot of new custom routines. A few quick highlights of the current functionality:

Retrieving related words from WordNet, Wiktionary, Wiktionary Thesaurus and BabelNet:

>>> import ldt
>>> wiktionary = ldt.dicts.semantics.Wiktionary()
>>> wiktionary.get_relation("white", relation="synonyms")
['pale', 'fair']
>>> wikisaurus = ldt.dicts.semantics.Wikisaurus()
>>> wikisaurus.get_relations("cat", relations="all")
{'synonyms': ['tabby', 'puss', 'cat', 'kitty', 'moggy', 'housecat', 'malkin', 'kitten', 'tom', 'grimalkin', 'pussy-cat', 'mouser', 'pussy', 'queen', 'tomcat', 'mog'],
 'hyponyms': [],
 'hypernyms': ['mammal', 'carnivore', 'vertebrate', 'feline', 'animal', 'creature'],
 'antonyms': [],
 'meronyms': []}
>>> babelnet = ldt.dicts.semantics.BabelNet()
>>> babelnet.get_relations("senator", relations="hypernyms")
{'hypernyms': ['legislative_assembly', 'metropolitan_see_of_milan', 'poltician', 'legislative_seat', 'senator_of_rome', 'band', 'the_upper_house', 'polictian', 'patres_conscripti', 'musical_ensemble', 'presbytery', 'politician', 'pol', 'solo_project', 'policymaker', 'political_figure', 'politican', 'policymakers', 'archbishop_emeritus_of_milan', 'deliberative_assemblies', 'ensemble', 'career_politics', 'soloproject', 'list_of_musical_ensembles', 'legislative', 'roman_senators', 'archbishopric_of_milan', 'politicain', 'rock_bands', 'section_leader', 'musical_organisation', 'music_band', 'four-piece', 'roman_catholic_archdiocese_of_milan', 'upper_house', 'archdiocese_of_milan', 'band_man', 'milanese_apostolic_catholic_church', 'legistrative_branch', 'group', 'solo-project', 'music_ensemble', 'law-makers', 'roman_senator', 'legislative_arm_of_government', 'solo_act', 'patronage', 'roman_catholic_archbishop_of_milan', 'bar_band', 'senate_of_rome', 'deliberative_body', 'see_of_milan', 'legislative_fiat', 'musical_group', 'ambrosian_catholic_church', 'legislature_of_orissa', 'legislative_branch_of_government', 'list_of_politicians', 'senatorial_lieutenant', 'roman_catholic_archdiocese_of_milano', 'legislature_of_odisha', 'bandmember', 'assembly', 'archdiocese_of_milano', 'bishop_of_milan', 'ensemble_music', 'solo_musician', 'musical_duo', 'legislative_branch_of_goverment', 'first_chamber', 'politicians', 'legislative_bodies', 'political_leaders', 'politico', 'music_group', 'legislative_body', 'career_politician', 'legislature', 'rock_group', 'legislative_power', 'diocese_of_milan', 'musical_ensembles', 'musical_organization', 'revising_chamber', 'archbishops_of_milan', 'political_leader', 'deliberative_assembly', 'conscript_fathers', 'five-piece', 'catholic_archdiocese_of_milan', 'pop_rock_band', 'senatrix', 'deliberative_organ', 'polit.', 'roman_senate', 'legislative_politics', 'bishopric_of_milan', 'legislative_branch', 'musical_band', 'archbishop_of_milan', 
'legislatures', 'general_assembly', 'musical_groups', 'instrumental_ensemble', 'politition', 'patres', 'upper_chamber', 'solo-act', 'conscripti', 'legislator']}
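
The idea of one interface over several resources can be illustrated with a minimal, self-contained sketch. It is not LDT's implementation: the classes `ToyDictionary` and `ToyMetaDictionary` and their data are hypothetical, and only the `get_relations` method name echoes the examples above.

```python
from typing import Dict, Iterable, List


class ToyDictionary:
    """Hypothetical in-memory resource: word -> relation -> related words."""

    def __init__(self, entries: Dict[str, Dict[str, List[str]]]):
        self.entries = entries

    def get_relations(self, word: str, relations: Iterable[str]) -> Dict[str, List[str]]:
        found = self.entries.get(word, {})
        return {rel: list(found.get(rel, [])) for rel in relations}


class ToyMetaDictionary:
    """Queries several resources and merges their answers per relation."""

    def __init__(self, resources: List[ToyDictionary]):
        self.resources = resources

    def get_relations(self, word, relations=("synonyms", "hypernyms")):
        merged = {rel: set() for rel in relations}
        for resource in self.resources:
            for rel, words in resource.get_relations(word, relations).items():
                merged[rel].update(words)
        # Sort the merged sets so that the result is deterministic.
        return {rel: sorted(words) for rel, words in merged.items()}


# Two toy resources with partially overlapping entries (made-up data).
wordnet_like = ToyDictionary({"cat": {"hypernyms": ["feline"], "synonyms": ["kitty"]}})
thesaurus_like = ToyDictionary({"cat": {"synonyms": ["kitty", "puss"]}})
meta = ToyMetaDictionary([wordnet_like, thesaurus_like])
print(meta.get_relations("cat"))
```

Merging into sets removes duplicates when two resources list the same word, which is one plausible way to combine answers; the real library may resolve overlaps differently.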

Derivational analysis:

>>> derivation_dict = ldt.dicts.derivation.DerivationAnalyzer()
>>> derivation_dict.analyze("kindness")
{'original_word': ['kindness'],
 'other': [],
 'prefixes': [],
 'related_words': ['kindhearted', 'kindly', 'in kind', 'kindliness', 'kinda', 'many-kinded', 'first-of-its-kind', 'kind of', 'kindful', 'kindless'],
 'roots': ['kind'],
 'suffixes': ['-ness']}
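
As a rough illustration of rule-based affix analysis, the sketch below strips a known prefix and/or suffix and accepts the split only if the remainder is a known word. The affix lists, the vocabulary, and the `analyze` function here are toy assumptions, not LDT's actual rules or data.

```python
# Toy affix lists and vocabulary, assumed for illustration only.
PREFIXES = ("un-", "re-")
SUFFIXES = ("-ness", "-ly", "-ful", "-less")
VOCABULARY = {"kind", "happy", "hope", "do"}


def analyze(word, vocabulary=VOCABULARY):
    """Return a prefix/root/suffix split if the root is a known word."""
    result = {"original_word": [word], "prefixes": [], "roots": [], "suffixes": []}
    for prefix in ("",) + PREFIXES:
        start = prefix.rstrip("-")
        if not word.startswith(start):
            continue
        for suffix in ("",) + SUFFIXES:
            end = suffix.lstrip("-")
            if not word.endswith(end):
                continue
            root = word[len(start): len(word) - len(end)]
            # Require at least one affix, and a root attested in the vocabulary.
            if (prefix or suffix) and root in vocabulary:
                result["roots"].append(root)
                if prefix:
                    result["prefixes"].append(prefix)
                if suffix:
                    result["suffixes"].append(suffix)
                return result
    return result


print(analyze("kindness"))
print(analyze("unkindness"))
```

Checking the root against a vocabulary keeps the rules from splitting words that merely end in an affix-like string.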

Reliable lemmatization with productive rules and Wiktionary/BabelNet, even for new words:

>>> morph_metadict = ldt.dicts.morphology.MorphMetaDict()
>>> morph_metadict.lemmatize("GPUs")
['GPU']
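
A minimal sketch of how productive rules combined with a dictionary lookup could yield such a lemma. The rule table, the set of known lemmas, and this `lemmatize` function are hypothetical; in LDT the lookup would be against resources such as Wiktionary or BabelNet rather than a hard-coded set.

```python
# Toy stand-in for a dictionary lookup (assumed data).
KNOWN_LEMMAS = {"GPU", "box", "city", "cat"}

# (ending, replacement) pairs for English plurals, most specific first.
PLURAL_RULES = (("ies", "y"), ("es", ""), ("s", ""))


def lemmatize(word, known=KNOWN_LEMMAS):
    """Return candidate lemmas that are confirmed by the known-lemma set."""
    if word in known:
        return [word]
    candidates = []
    for ending, replacement in PLURAL_RULES:
        if word.endswith(ending) and len(word) > len(ending):
            candidate = word[: -len(ending)] + replacement
            # Keep a rule's output only if the resource confirms it.
            if candidate in known and candidate not in candidates:
                candidates.append(candidate)
    return candidates


print(lemmatize("GPUs"))
print(lemmatize("cities"))
```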

Correcting (at least some) text pre-processing noise and normalizing the input:

>>> analyzer = ldt.dicts.normalize.Normalization()
>>> analyzer.normalize("%grammar")
{'lemmas': ['grammar'],
 'found_in': ['wordnet'],
 'word_categories': ['Misspellings'],
 'pos': ['noun']}
>>> analyzer.normalize("gram-mar")
{'found_in': ['wordnet'],
 'lemmas': ['grammar'],
 'word_categories': ['Misspellings'],
 'pos': ['noun']}
>>> analyzer.normalize("grammarlexicon")
{'found_in': ['wordnet'],
 'lemmas': ['grammar', 'lexicon'],
 'word_categories': ['Misspellings'],
 'pos': ['noun']}
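
The three cases above (stray symbols, spurious hyphens, glued words) can be approximated with the following sketch. It assumes a tiny vocabulary and a minimum subword length of 3; the function names and thresholds are illustrative and do not reproduce LDT's actual logic.

```python
import re

# Toy vocabulary and threshold, assumed for illustration.
VOCABULARY = {"grammar", "lexicon"}
MIN_SUBWORD_LENGTH = 3


def split_compound(word, vocabulary, min_length=MIN_SUBWORD_LENGTH):
    """Split a word into two known words, each at least min_length long."""
    for i in range(min_length, len(word) - min_length + 1):
        left, right = word[:i], word[i:]
        if left in vocabulary and right in vocabulary:
            return [left, right]
    return []


def normalize(word, vocabulary=VOCABULARY):
    if word in vocabulary:
        return {"lemmas": [word], "word_categories": []}
    # Strip non-letter noise from both edges, then drop hyphens.
    stripped = re.sub(r"^[^A-Za-z]+|[^A-Za-z]+$", "", word)
    dehyphenated = stripped.replace("-", "")
    for candidate in (stripped, dehyphenated):
        if candidate in vocabulary:
            return {"lemmas": [candidate], "word_categories": ["Misspellings"]}
    parts = split_compound(dehyphenated, vocabulary)
    if parts:
        return {"lemmas": parts, "word_categories": ["Misspellings"]}
    return {"lemmas": [], "word_categories": ["Missing"]}


for noisy in ("%grammar", "gram-mar", "grammarlexicon"):
    print(noisy, normalize(noisy))
```

The minimum subword length guards against splitting a word into very short fragments that happen to be dictionary entries.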

Trustworthy correction of frequent misspelling patterns, only for high-certainty cases:

>>> spellchecker_en = ldt.dicts.spellcheck.SpellcheckerEn()
>>> spellchecker_en.spelling_nazi("abritrary")
'arbitrary'
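
One way to restrict correction to high-certainty cases is to try a single misspelling pattern and accept the fix only when exactly one candidate is a known word. The sketch below does this for swapped adjacent letters; the vocabulary and the `fix_transposition` function are assumptions for illustration, not one of LDT's own patterns.

```python
# Toy vocabulary, assumed for illustration.
VOCABULARY = {"arbitrary", "ate", "tea"}


def fix_transposition(word, vocabulary=VOCABULARY):
    """Undo one swap of adjacent letters, but only if the fix is unambiguous."""
    if word in vocabulary:
        return word
    candidates = set()
    for i in range(len(word) - 1):
        swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if swapped in vocabulary:
            candidates.add(swapped)
    # More than one possible correction means low certainty: give up.
    return candidates.pop() if len(candidates) == 1 else None


print(fix_transposition("abritrary"))  # arbitrary
print(fix_transposition("tae"))        # None: both "ate" and "tea" fit
```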

Collecting all the available info about a word in one call:

>>> encapsulation = ldt.Word("encapsulation")
>>> encapsulation.pp_info()
======DERIVATIONAL INFO======
Stems :  capsulate, encapsulate, capsule
Suffixes :  -ion, -ate
Prefixes :  en-
OtherDerivation :
RelatedWords :  encapsulation, capsule review, glissonian capsule, capsular, capsulate
======SEMANTIC INFO======
Synonyms :  encapsulation
Antonyms :
Meronyms :
Hyponyms :
Hypernyms :  physical_process, status, condition, process
======EXTRA WORD CLASSES======
ProperNouns :  False
Noise :  False
Numbers :  False
URLs :  False
Hashtags :  False
Filenames :  False
ForeignWords :  False
Misspellings :  False
Missing :  False

Finding possible relations between a pair of words in one call:

>>> relation_analyzer = ldt.relations.RelationsInPair()
>>> relation_analyzer.analyze("black", "white")
{'Hyponyms': True,
 'SharedMorphForm': True,
 'SharedPOS': True,
 'Synonyms': True,
 'Antonyms': True,
 'ShortestPath': 0.058823529411764705,
 'Associations': True}
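
The ShortestPath value above equals 1/17, which is what a path-based similarity of the form 1 / (d + 1) gives for a path of length d = 16. Whether LDT computes the score exactly this way is an assumption; the sketch below only shows the general technique on a made-up toy taxonomy.

```python
from collections import deque

# Toy undirected taxonomy (made-up data, much smaller than WordNet).
GRAPH = {
    "black": ["achromatic_color"],
    "white": ["achromatic_color"],
    "achromatic_color": ["black", "white", "color"],
    "color": ["achromatic_color"],
}


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges or None."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


def path_similarity(graph, a, b):
    """Similarity in (0, 1]: 1 for identical nodes, smaller for longer paths."""
    distance = shortest_path_length(graph, a, b)
    return None if distance is None else 1 / (distance + 1)


print(path_similarity(GRAPH, "black", "white"))  # two edges apart, so 1/3
```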

The above functionality can be used in many NLP applications, for text pre-processing, and for large-scale analysis of potential relations between pairs of words. See the ldt.experiments.demo file for a toy example of such an analysis.

That last step can help you predict how your model will do on a particular task, and also give some ideas about how it can be improved. Check out the results of a large-scale experiment with 60 embeddings and 21 datasets.
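
To make the idea of profiling an embedding concrete, here is a small sketch: it finds the nearest neighbors of a word by cosine similarity and reports what share of them stand in each linguistic relation to it. The vectors, the relation table, and the function names are invented for illustration and are unrelated to the actual LDT experiment scripts.

```python
import math

# Made-up 3-dimensional vectors and relation labels.
VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "kitty": [0.8, 0.2, 0.1],
    "feline": [0.7, 0.3, 0.0],
    "car": [0.0, 0.1, 0.9],
}
RELATIONS = {("cat", "kitty"): "Synonyms", ("cat", "feline"): "Hypernyms"}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def neighbors(word, vectors, top_n=2):
    """Return the top_n words most similar to `word`, best first."""
    scored = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_n]]


def profile(word, vectors, relations, top_n=2):
    """Share of the word's neighbors that fall under each relation label."""
    found = neighbors(word, vectors, top_n)
    counts = {}
    for neighbor in found:
        label = relations.get((word, neighbor), "Unrelated")
        counts[label] = counts.get(label, 0) + 1
    return {label: count / len(found) for label, count in counts.items()}


print(neighbors("cat", VECTORS))
print(profile("cat", VECTORS, RELATIONS))
```

Aggregating such shares over a large vocabulary sample gives a per-embedding profile that can then be compared with task performance.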

See the Tutorial and API documentation for more details on all of these resources.

Quick links

Installation instructions

Project website

Tutorial

API reference

Published research results

Word embeddings leaderboard

Correlation of LD scores with downstream task performance

Support

If something doesn’t work, open an issue on GitHub.

Multilinguality

Yes, LDT is multilingual! At least, as far as querying semantic relations goes. LDT supports BabelNet, the largest multilingual dictionary resource available, so everything it contains is retrievable. Many of the other LDT modules (particularly morphology) are language-specific, and only English is fully supported at the moment. However, the infrastructure for adding other languages is already in place, so if you can find or create e.g. lists of affixes for your language, adding support for it should be straightforward. Get in touch if you’d like to get involved.

Legal caveat: LDT is open-source free software. No hamsters were harmed in its production, and no harm should come from its usage. However, no guarantees of any kind.

v 0.3.0, 2018-10-06.

experiments package:
- extracting vector neighborhoods with optional normalization;
- annotating vector neighborhoods with linguistic relations;
- analysing the results;
- automatically logging metadata for all experiments;
bug fixes.

v 0.2.1, 2018-09-25.

bug fixes.

v 0.2.0, 2018-09-24.

Tutorial;
19 LD variables, including ontology paths;
detection of antonymy with language-specific derivational patterns;
bug fixes.

v 0.1.0, 2018-08-15 – Initial release.

Retrieving lexicographic information from BabelNet, Wiktionary, Wikisaurus and English WordNet;
Retrieving morphological information from the same resources;
Lemmatization with WordNet and custom rules for English;
Custom rule-based analysis of productive suffixes and prefixes for English;
Parsing Wiktionary etymologies;
Custom compound splitting routines with filtering by subword length;
4 custom patterns for fixing frequent spelling mistakes.


Release history

- 0.4.0, Nov 16, 2018
- 0.3.9, Nov 4, 2018
- 0.3.0, Oct 9, 2018 (this version)
- 0.2.1, Sep 27, 2018
- 0.2.0, Sep 26, 2018
- 0.1.0, Sep 26, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ldt-0.3.0.tar.gz (4.6 MB)

Uploaded Oct 9, 2018 Source

Built Distribution

ldt-0.3.0-py3-none-any.whl (4.7 MB)

Uploaded Oct 9, 2018 Python 3

File details

Details for the file ldt-0.3.0.tar.gz.

File metadata

Download URL: ldt-0.3.0.tar.gz
Upload date: Oct 9, 2018
Size: 4.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/3.6.6

File hashes

Hashes for ldt-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`254b35ee0fff7cef216cf3df6af050cdec8416734c2335605e8eef5b8e6912ab`
MD5	`7f9c301a40be66981d68491e864de5ce`
BLAKE2b-256	`50a6ae2731799d0124396e965c9a8ca7242f4f09c53efda22cbeab584595326a`


File details

Details for the file ldt-0.3.0-py3-none-any.whl.

File metadata

Download URL: ldt-0.3.0-py3-none-any.whl
Upload date: Oct 9, 2018
Size: 4.7 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/3.6.6

File hashes

Hashes for ldt-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`de084ccb37685c0170a6de21eb9f685a488c04b08bc8e6cbf316f3c3c66fd2a6`
MD5	`b910472f3e199cea99889796b60a6ccb`
BLAKE2b-256	`0c603fb31a0151ce792e853110a07913d6b7828eec7a38b63c9dd8ca176c29af`
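
The SHA256 digests listed above can be checked against a downloaded file with the standard library. The following is a minimal sketch; the helper names are made up for this example, and the file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file without loading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For instance, `matches_published_hash("ldt-0.3.0.tar.gz", "254b35ee0fff7cef216cf3df6af050cdec8416734c2335605e8eef5b8e6912ab")` should return True for an intact copy of the source distribution saved under that name.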

