Builds a multi-source English lexicon

🗽 CityLex: a free English lexical database

CityLex is an English lexical database intended to replace or enhance databases like CELEX. It combines data from up to seven sources, including frequency norms, morphological analyses, and pronunciations. Since these sources have varying license conditions (some are proprietary, others restrict redistribution), we do not provide the database as is. Rather, the user must generate a personal copy by executing a Python script, enabling whichever sources they wish to use.

Building your own CityLex

To install CityLex, execute

pip install citylex

To see the available data sources and options, execute citylex --help.

To generate the lexicon, execute citylex with at least one source enabled using command-line flags. As most of the data is downloaded from online sources, an internet connection is normally required. The process takes roughly four minutes with all sources enabled; much of the time is spent downloading large files.

To generate a lexicon with all the sources that don't require manual downloads, execute

citylex --all-free

File formats

Two files are produced. The first, by default citylex.tsv, is a standard wide-format "tab-separated values" (TSV) file, of the sort that can be read into Excel or R. Some fields (particularly pronunciations and morphological analyses) can have multiple entries per wordform. In this case, the entries are separated using the ^ character.
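For those working in Python rather than Excel or R, the multi-valued fields can be split while reading the file. The following is a minimal sketch, not part of CityLex itself: the helper read_wide_tsv and the column names wordform and pron are hypothetical, so check the header row of your own citylex.tsv for the real field names.

```python
import csv
import io


def read_wide_tsv(handle, multi_fields, sep="^"):
    """Yields one dict per row, splitting the named multi-valued fields."""
    # QUOTE_NONE treats quote characters as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for field in multi_fields:
            value = row.get(field) or ""
            # An empty cell becomes an empty list rather than [""].
            row[field] = value.split(sep) if value else []
        yield row


# Hypothetical two-row sample standing in for citylex.tsv.
sample = "wordform\tpron\nread\tr iy d^r eh d\nthe\t\n"
rows = list(read_wide_tsv(io.StringIO(sample), ["pron"]))
print(rows[0]["pron"])
```

To read a real file, pass an open file handle, e.g. open("citylex.tsv", newline="", encoding="utf-8"), in place of the in-memory sample.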

Advanced users may wish to make use of the second file, by default citylex.textproto, a text-format protocol buffer which provides a better representation of the repeated fields. To parse this file in Python, use the following snippet:

import citylex

lexicon = citylex.read_textproto("citylex.textproto")

This will parse the text-format data and populate lexicon. One can then iterate over lexicon.entry like a Python dictionary.
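As an illustration of dictionary-style iteration, the sketch below tallies keys by their first letter. It assumes the keys of lexicon.entry are wordform strings; consult citylex.proto for the actual structure. The helper count_by_initial and the toy dictionary are hypothetical stand-ins so that the example runs without a generated lexicon.

```python
from collections import Counter


def count_by_initial(entries):
    """Counts the keys of a mapping by their first character, lowercased."""
    counts = Counter()
    for wordform in entries:  # Iterating a mapping yields its keys.
        if wordform:
            counts[wordform[0].lower()] += 1
    return counts


# Toy stand-in for lexicon.entry; the values are irrelevant here.
toy_entries = {"apple": None, "Avocado": None, "banana": None}
print(count_by_initial(toy_entries))
```

With real data, one would pass lexicon.entry from the snippet above instead of toy_entries.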

Non-redistributable data sources

Not all CityLex data can be obtained automatically from online sources. If you wish to enable CELEX features, follow the instructions below.

CELEX is a proprietary resource that must be obtained from the Linguistic Data Consortium as LDC96L14.tgz. The file should be decompressed using

tar -xzf LDC96L14.tgz

This will produce a directory named celex2. To enable CELEX2 features, use --celex and pass the local path of this directory as an argument to --celex_path.
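The same steps can be scripted from Python. This is a sketch under stated assumptions: the helpers extract_celex and citylex_celex_command are hypothetical and not part of CityLex, and only the --celex and --celex_path flags named above are used. Only extract archives obtained from a source you trust.

```python
import pathlib
import tarfile


def extract_celex(archive, dest):
    """Extracts a gzipped tarball into dest and returns the celex2 path."""
    dest = pathlib.Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    celex_dir = dest / "celex2"
    if not celex_dir.is_dir():
        raise FileNotFoundError(f"expected directory not found: {celex_dir}")
    return celex_dir


def citylex_celex_command(celex_dir):
    """Builds an argument list suitable for subprocess.run."""
    return ["citylex", "--celex", "--celex_path", str(celex_dir)]
```

For example, subprocess.run(citylex_celex_command(extract_celex("LDC96L14.tgz", ".")), check=True) would extract the archive into the current directory and then build the lexicon with CELEX enabled.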

For more information

citylex.proto for the protocol buffer data structure
citylex.bib for references to the data sources used

For contributors

To regenerate citylex_pb2.py, you will need to install the Protocol Buffers C++ runtime for your platform, making sure the version number (i.e., the one returned by protoc --version) matches that of protobuf in requirements.txt. Then run protoc --python_out=. citylex.proto.
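The version check can be automated with a small script. The sketch below assumes that protoc --version prints a line such as "libprotoc 3.21.12" and that requirements.txt pins the package with a line of the form protobuf==X.Y.Z; the latter is an assumption about this repository, and both helper names are hypothetical. Newer Protocol Buffers releases number the compiler and the Python package differently, so strict equality may be too strict for them.

```python
import re


def protoc_version(version_output):
    """Extracts the version from output such as 'libprotoc 3.21.12'."""
    match = re.search(r"(\d+(?:\.\d+)+)", version_output)
    if match is None:
        raise ValueError(f"no version found in: {version_output!r}")
    return match.group(1)


def pinned_protobuf_version(requirements_text):
    """Returns the version pinned with 'protobuf==X.Y.Z', or None if absent."""
    for line in requirements_text.splitlines():
        match = re.match(r"\s*protobuf\s*==\s*([\w.]+)", line)
        if match:
            return match.group(1)
    return None


# Sample inputs; real values come from `protoc --version` and requirements.txt.
compiler = protoc_version("libprotoc 3.21.12")
pinned = pinned_protobuf_version("requests\nprotobuf==3.21.12\n")
print(compiler == pinned)
```

In practice, the first argument would be the captured standard output of protoc --version and the second the contents of requirements.txt.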

License

The CityLex codebase is distributed under the Apache 2.0 license. Please see License.txt for details.

All other data sources bear their original licenses chosen by their creators; see citylex --help for more information.

Author

CityLex was created by Kyle Gorman.

Release history

0.1.15 (Feb 20, 2024; this version)
0.1.14 (Jan 11, 2024)
0.1.13 (Dec 12, 2023)
0.1.12 (Oct 11, 2023)
0.1.11 (May 29, 2023)
0.1.10 (Dec 27, 2022)
0.1.9 (Apr 25, 2021)
0.1.8 (Feb 10, 2021)
0.1.7 (Aug 7, 2020)
0.1.6 (May 21, 2020)
0.1.5 (Mar 9, 2020)
0.1.4 (Dec 10, 2019)
0.1.3 (Dec 8, 2019)
0.1.2 (Oct 14, 2019)
0.1.1 (Sep 27, 2019)

Download files

Source distribution: citylex-0.1.15.tar.gz (13.0 kB), uploaded Feb 20, 2024
Built distribution: citylex-0.1.15-py3-none-any.whl (13.6 kB, Python 3), uploaded Feb 20, 2024

Neither file was uploaded using Trusted Publishing. Both were uploaded via twine/4.0.1 CPython/3.10.13.

Hashes for citylex-0.1.15.tar.gz

Algorithm	Hash digest
SHA256	`a24dfe367968eb433e997975c64dde36bb15da7be227c214de0609645001883b`
MD5	`d210fd2d907b37243f9697b3ee2457c0`
BLAKE2b-256	`2fb94f6cb165637696003c9ec390c9dcb52c620adced56fa970ec31bbe7fcd68`

Hashes for citylex-0.1.15-py3-none-any.whl

Algorithm	Hash digest
SHA256	`a042a739dce2842acf61d4cf3640efd6fa4a73404c7f4a80a1d2bf4feba6b174`
MD5	`61c40043e11131ed34dc8274096d14dc`
BLAKE2b-256	`46c116cc464e6cb619f347c80ada1e3820cdf8fabeacd1c47150f232893efd67`
