Builds a multisource English lexicon
🗽 CityLex: a free multisource English lexical database
CityLex is an English lexical database intended to replace or enhance databases like CELEX. It combines data from up to seven unique sources, including frequency norms, morphological analyses, and pronunciations. Since these have varying license conditions (some are proprietary, others restrict redistribution), we do not provide the database as is. Rather the user must generate a personal copy by executing a Python script, enabling whatever sources they wish to use.
Building your own CityLex
To install CityLex execute
pip install citylex
To see the available data sources and options, execute
citylex --help
To generate the lexicon, execute citylex with at least one source enabled using command-line flags. As most of the data is downloaded from online sources, an internet connection is normally required. The process takes roughly four minutes with all sources enabled; much of that time is spent downloading.
To generate a lexicon with all the sources that don't require manual downloads, execute
citylex --cmudict --elp --subtlex_uk --subtlex_us --udlexicons --unimorph --wikipron_uk --wikipron_us
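For scripted builds, the same invocation can be assembled from Python. The following is only a minimal sketch and not part of CityLex: the helper names `build_citylex_command` and `run_citylex` and the constant `AUTOMATIC_SOURCES` are hypothetical, and the flag names are simply those listed in the command above. It assumes the citylex executable is on your PATH when the command is actually run.

```python
import subprocess

# Sources that do not require manual downloads, as listed in the command above.
AUTOMATIC_SOURCES = [
    "cmudict", "elp", "subtlex_uk", "subtlex_us",
    "udlexicons", "unimorph", "wikipron_uk", "wikipron_us",
]


def build_citylex_command(sources):
    """Returns the argument list for a citylex run enabling `sources`."""
    if not sources:
        # citylex needs at least one source enabled.
        raise ValueError("at least one source must be enabled")
    return ["citylex"] + [f"--{source}" for source in sources]


def run_citylex(sources):
    """Runs citylex; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_citylex_command(sources), check=True)
```

Calling `run_citylex(AUTOMATIC_SOURCES)` would then be equivalent to typing the full command by hand, and passing a shorter list enables only a subset of sources.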
Two files are produced. The first, by default citylex.tsv, is a standard wide-format tab-separated values (TSV) file of the sort that can be read into Excel or R. Some fields (particularly pronunciations and morphological analyses) can have multiple entries per wordform. In this case, the entries are joined within a single field by a separator character.
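The wide-format file can also be read with Python's standard csv module. The sketch below is illustrative only: `read_wide_tsv` is a hypothetical helper, the separator used for multi-valued fields is passed in as `multi_sep` rather than assumed, and the column name in the usage note is a placeholder, so check the header row of your own citylex.tsv before relying on it.

```python
import csv


def read_wide_tsv(path, multi_columns, multi_sep):
    """Reads a wide-format TSV file into a list of dictionaries.

    Cells in `multi_columns` are split on `multi_sep` into lists, and
    empty cells become empty lists. All other cells are left as strings.
    """
    rows = []
    with open(path, newline="", encoding="utf-8") as source:
        # QUOTE_NONE keeps quotation marks inside cells as literal text.
        reader = csv.DictReader(source, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            for column in multi_columns:
                cell = row.get(column) or ""
                row[column] = cell.split(multi_sep) if cell else []
            rows.append(row)
    return rows
```

For example, `read_wide_tsv("citylex.tsv", ["some_pron_column"], sep)` would return one dictionary per wordform, with the named column converted to a list; both `"some_pron_column"` and `sep` must be replaced with the real column name and separator from your generated file.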
Advanced users may wish to make use of the second file, by default citylex.textproto, a text-format protocol buffer which provides a better representation of the repeated fields. To parse this file in Python, use the following snippet:
import citylex
lexicon = citylex.read_textproto("citylex.textproto")
This will parse the text-format data and populate lexicon. One can then use lexicon.entry like a Python dictionary.
Non-redistributable data sources
Not all CityLex data can be obtained automatically from online sources. If you wish to enable CELEX features, follow the instructions below.
This proprietary resource must be obtained from the Linguistic Data Consortium as LDC96L14.tgz. The file should be decompressed using
tar -xzf LDC96L14.tgz
This will produce a directory named
celex2. To enable CELEX2 features, use
--celex and pass the local path of this directory as an argument to
For more information
citylex.proto for the protocol buffer data structure
citylex.bib for references to the data sources used
To regenerate citylex_pb2.py you will need to install the Protocol Buffers C++ runtime for your platform, making sure the version number (e.g., the one returned by protoc --version) matches the version specified in requirements.txt. Then, run
protoc --python_out=. citylex.proto
The CityLex codebase is distributed under the Apache 2.0 license. Please see License.txt for details.
All other data sources bear the original licenses chosen by their creators; see citylex --help for more information.
CityLex was created by Kyle Gorman.