
A library for creating JSONs and accessing the data for the CC-Canto and CC-CEDICT open source Chinese dictionaries.

Project description

A Python library to download, update, create and access keyed JSONs for the dictionaries CC-CEDICT and CC-Canto.

Modules

The core Python library consists of three files. parser.py parses the raw text files sourced from the CC-CEDICT and CC-Canto websites and creates the JSONs. update.py fetches the data from those websites and calls functions from parser.py to generate the JSONs in the right place. CC_Dict.py provides the class CC_Dict for easier programmatic access to the paths of the JSONs or to the data in them.

The two modules you'll most likely work with are update.py and CC_Dict.py.

CC_Dict.py

Core Class

from py_cc_dicts.CC_Dict import *

c = CC_Dict("CANTO") # Creates a CC_Dict object that can access the JSONs and dictionary data for CC-Canto. 
m = CC_Dict("CEDICT") # Creates a CC_Dict object that can access the JSONs and dictionary data for CC-CEDICT.
r = CC_Dict("READINGS") # Creates a CC_Dict object that can access the JSONs and readings data for the jyutping readings of CC-CEDICT as provided on the CC-Canto website.

# Each constructor downloads the data from the dictionary website into the current directory if it is not already there.

dicts = [CC_Dict("canto"), CC_Dict("cedict"), CC_Dict("readings")]
# The dictionary name is not case sensitive, so the above works as well.
c.get_data(key = None) 
m.get_data(key = None)
# Returns the dictionary data keyed by the input *key* as a dict

c = CC_Dict("CANTO", data_dir = "some dir") # Creates a CC_Dict and stores the data loaded from the website at *data_dir* if it does not already exist there

c = CC_Dict("CANTO", update = True)
m = CC_Dict("CEDICT", data_dir = "some dir", update = True)
# Force an update by downloading the data from the website and regenerating the JSONs, even if they already exist at *data_dir* (or in the current directory if no *data_dir* is given)
c2 = CC_Dict("CANTO", key = "traditional")
# Loads the dictionary data keyed by the input *key* into the CC_Dict's internal dict at creation

c2.dict # Produces the dict keyed by traditional

# You can also search with dict syntax.
c2["出發"]
# Produces:
{'traditional': '出發', 'simplified': '出发', 'pinyin': 'chu1 fa1', 'jyutping': 'ceot1 faat3', 'definitions': ['to depart']}

c2["貓"]
# Produces (since there are multiple entries for the same key, they're provided as a list):
[{'traditional': '貓', 'simplified': '猫', 'pinyin': 'mao1', 'jyutping': 'maau1', 'definitions': ['cat M: 只zhī [只]', '(dialect) to hide oneself', '(coll.) modem', "to arch one's back", 'to be drunk', 'to be high on drugs']}, 
{'traditional': '貓', 'simplified': '猫', 'pinyin': 'mao1', 'jyutping': 'maau4', 'definitions': ['cat M: 只zhī [只]', '(dialect) to hide oneself', '(coll.) modem', "to arch one's back", 'to be drunk', 'to be high on drugs']}, 
{'traditional': '貓', 'simplified': '猫', 'pinyin': 'mao1', 'jyutping': 'miu4', 'definitions': ['cat M: 只zhī [只]', '(dialect) to hide oneself', '(coll.) modem', "to arch one's back", 'to be drunk', 'to be high on drugs']}]

c2.keys()
c2.values()
c2.items()
# As CC_Dict is an extension of dict, common dict functions also work, although some might have unintended behaviour if key = "definitions" (see below)
c3 = CC_Dict("CANTO", key = "definitions")
# If the key given is "definitions", allows for the search of all definitions via dict syntax.

c3["some string"]
# This searches for and returns all definitions that contain the exact substring "some string" (as definitions are stored as strings)
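
The substring lookup described above can be pictured with a small dict subclass. This is only a minimal sketch under assumptions, not the library's implementation: the class name DefinitionIndex, the sample entries, and the choice to return an empty list when nothing matches are all hypothetical.

```python
class DefinitionIndex(dict):
    """Hypothetical stand-in for a dictionary searched by definition text."""

    def __getitem__(self, query):
        # Collect every entry that has at least one definition
        # containing the query as an exact substring.
        return [
            entry
            for entry in self.values()
            if any(query in definition for definition in entry["definitions"])
        ]


# Sample entries (abbreviated, for illustration only).
entries = {
    "出發": {"traditional": "出發", "definitions": ["to depart"]},
    "貓": {"traditional": "貓", "definitions": ["to be drunk", "to be high on drugs"]},
}

index = DefinitionIndex(entries)
print([entry["traditional"] for entry in index["depart"]])
print("depart" in index)  # membership tests still look at the real keys
```

In a dict subclass written this way, only the square-bracket lookup is overridden. Methods such as get() and the in operator still work on the stored keys, which is one way common dict functions can behave unexpectedly when searching by definition.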

update.py

Core Functions

from py_cc_dicts.update import *

load_latest_data() # Load to current working directory
load_latest_data("*insert path here*") # Load to provided path

# Load the raw zip files, the plain txt files and the JSONs for both CC-CEDICT and CC-Canto into the input directory if provided, else into the current working directory.
fetch_raw() 

# Downloads the zip files from the CC-CEDICT and CC-Canto websites to the *current working directory*
generate_jsons("path to zip directory")

# Takes the path to the directory where the raw data is stored and outputs the parsed JSONs for each key type to the *current working directory*
get_jsons(dir = "", dict_type = "")
get_raws(dir = "", dict_type = "")

# Search *dir* for JSONs or raw zip files of the input dict_type (CEDICT, CANTO), or of both if no dict_type is provided, and return a list of strings containing the paths to those files.
jsons_exists(dir = "")
raws_exists(dir = "")

# Check if the jsons or raw zip files exist in directory *dir*, or the current working directory if none provided.
clean_raws(dir = "")
clean_jsons(dir = "")

# Delete the raw zip files or JSONs from directory *dir*, or the current working directory if none provided.
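
The "load only if missing, unless an update is forced" behaviour described for data_dir and update can be sketched without the library. Everything below is an assumption made for illustration: the helper ensure_json, the file name canto_traditional.json, and the fake_build stand-in for downloading and parsing are hypothetical and do not reflect how update.py names or writes its files.

```python
import json
import tempfile
from pathlib import Path


def ensure_json(data_dir, name, build, update=False):
    """Return the path to <data_dir>/<name>.json, building it only when needed."""
    path = Path(data_dir) / f"{name}.json"
    if update or not path.exists():
        # Create the directory on demand and (re)write the JSON file.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(build(), ensure_ascii=False), encoding="utf-8")
    return path


calls = []


def fake_build():
    # Hypothetical stand-in for "download the raw data and parse it".
    calls.append(1)
    return {"出發": {"definitions": ["to depart"]}}


with tempfile.TemporaryDirectory() as tmp:
    ensure_json(tmp, "canto_traditional", fake_build)               # file missing: builds
    ensure_json(tmp, "canto_traditional", fake_build)               # file present: reuses
    ensure_json(tmp, "canto_traditional", fake_build, update=True)  # forced: rebuilds
    print(len(calls))
```

The point of the pattern is that the expensive step (here fake_build) runs only when the file is absent or an update is requested.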

parser.py

Constants

from py_cc_dicts.parser import *

DICT_TYPES = ["CEDICT", "CANTO"] # Valid Dictionary Codes, used throughout the program.
VALID_KEYS = {DICT_TYPES[0]: ["traditional", "simplified", "pinyin", "definitions", None],
               DICT_TYPES[1]: ["traditional", "simplified", "pinyin", "jyutping", "definitions", None]}  # Valid keys for CC_Dict, used for creation of JSONs

Core Functions

parse_cc_canto(filepath, key = "traditional")
parse_cc_cedict(filepath, key = "traditional", surnames = True)

# Parse the respective raw text file at *filepath* to produce a JSON with the given *key*. The *surnames* parameter is currently unused.
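
To make the parsing and keying step more concrete, here is a minimal sketch, not the code in parser.py. It assumes the raw line layout `traditional simplified [pinyin] /definition/.../` for CC-CEDICT, with an extra `{jyutping}` field before the definitions for CC-Canto, and lines beginning with `#` treated as comments. The names LINE_RE, parse_line and key_entries are hypothetical, and the sample lines are assembled from the entries shown earlier with definitions abbreviated.

```python
import re

# Assumed line layout: traditional simplified [pinyin] {jyutping} /def1/def2/
# The {jyutping} field is optional so the same pattern covers both layouts.
LINE_RE = re.compile(
    r"^(?P<traditional>\S+)\s+(?P<simplified>\S+)\s+"
    r"\[(?P<pinyin>[^\]]*)\]\s+"
    r"(?:\{(?P<jyutping>[^}]*)\}\s+)?"
    r"/(?P<definitions>.*)/\s*$"
)


def parse_line(line):
    """Parse one raw line into an entry dict, or return None for comments and junk."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = LINE_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["definitions"] = entry["definitions"].split("/")
    if entry["jyutping"] is None:
        del entry["jyutping"]  # no jyutping field on this line
    return entry


def key_entries(lines, key="traditional"):
    """Group entries under the chosen key; repeated keys become a list of entries."""
    keyed = {}
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        value = entry[key]
        if value not in keyed:
            keyed[value] = entry
        elif isinstance(keyed[value], list):
            keyed[value].append(entry)
        else:
            keyed[value] = [keyed[value], entry]
    return keyed


sample = [
    "# comment line",
    "出發 出发 [chu1 fa1] {ceot1 faat3} /to depart/",
    "貓 猫 [mao1] {maau1} /to be drunk/to be high on drugs/",
    "貓 猫 [mao1] {miu4} /to be drunk/to be high on drugs/",
]
keyed = key_entries(sample)
print(keyed["出發"])
print(len(keyed["貓"]))
```

Storing a single entry as a dict and repeated keys as a list mirrors the lookup results shown for CC_Dict above, where "出發" yields one dict and "貓" yields a list.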

Changelog

V 1.1

Can now access the jyutping readings data for CC-CEDICT as provided on the CC-Canto website.

r = CC_Dict("READINGS", key = "traditional")
r["試驗"]

# Returns:
{'traditional': '試驗', 'simplified': '试验', 'pinyin': 'shi4 yan4'}
