ZDic Parser Tool
A very simple Python library to scrape and parse Chinese character data from ZDic using BeautifulSoup.
This library was developed and tested with Python 3.12, but it may work on other versions as well.
Prerequisites
- Python 3.12 (recommended, but may work on older versions)
- pip (Python package manager)
Installation
To install the package, run:
```shell
pip install zdic-parser
```
Usage
The library provides a class called ZDicCharacterParser, which is used to fetch character data from ZDic.
The two key methods in this class are:
- `search()` → Synchronous (blocking)
- `search_async()` → Asynchronous (non-blocking)
Method Parameters
Both search() and search_async() accept the following parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `character` | `str` | Required | The Chinese character to search for. |
| `mode` | `str` | `"s"` | Determines whether to return information in Simplified (`"s"`) or Traditional (`"t"`) Chinese. |
| `timeout` | `int` | `5` | The request timeout (in seconds). |
Notes
- The `mode` parameter only affects the returned content, such as definitions being in Simplified (`"s"`) or Traditional (`"t"`) Chinese.
- You can search for both Simplified and Traditional characters regardless of the `mode` selected.
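The parameters above can be combined in a small wrapper. The following is a minimal sketch, not part of the library: `build_search_kwargs` and `lookup` are hypothetical helper names, and it assumes that `search()` accepts `mode` and `timeout` as keyword arguments under the names listed in the table.

```python
VALID_MODES = {"s", "t"}  # "s" = Simplified, "t" = Traditional


def build_search_kwargs(mode="s", timeout=5):
    """Validate the optional arguments and return them as keyword arguments."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be 's' or 't', got {mode!r}")
    if timeout <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return {"mode": mode, "timeout": timeout}


def lookup(character, mode="s", timeout=5):
    """Hypothetical helper: search one character and return the parser."""
    # Imported lazily so the validation helper works without the package installed.
    from zdic_parser import ZDicCharacterParser

    parser = ZDicCharacterParser()
    # Assumption: search() accepts `mode` and `timeout` as keyword arguments.
    parser.search(character, **build_search_kwargs(mode, timeout))
    return parser
```

Under these assumptions, `lookup("龙", mode="t", timeout=10)` would request Traditional Chinese content with a 10-second timeout.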
Synchronous search example
To perform a character search synchronously, we can use search():
```python
from zdic_parser import ZDicCharacterParser

# Example character to search
example = "你"

# Create an instance of the parser
parser = ZDicCharacterParser()

# Perform the search (defaults to Simplified Chinese mode)
parser.search(example)

# Print results
print(parser.character_info)
print(parser.definitions)
```
Asynchronous search example
To perform a character search asynchronously, we use search_async():
```python
import asyncio

from zdic_parser import ZDicCharacterParser

# Example character to search
example = "你"


async def main():
    # Create an instance of the parser
    parser = ZDicCharacterParser()

    # Perform the asynchronous search
    await parser.search_async(example)

    # Print results
    print(parser.character_info)
    print(parser.definitions)


# Run the asynchronous function
asyncio.run(main())
```
This is useful if we wish to parse multiple characters:
```python
import asyncio

from zdic_parser import ZDicCharacterParser

# List of characters to search
characters = ["你", "干", "吗"]


async def create_coroutines(character):
    parser = ZDicCharacterParser()
    await parser.search_async(character)
    return parser


async def main():
    tasks = [create_coroutines(char) for char in characters]
    parsers = await asyncio.gather(*tasks)

    # Print results / Do something with the results
    for parser in parsers:
        print(parser.character_info)


# Run the asynchronous function
asyncio.run(main())
```
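When the list of characters is long, it may be polite to cap how many requests run at once. The sketch below is one possible approach using `asyncio.Semaphore`; `search_many`, `zdic_search_one`, and `max_concurrent` are hypothetical names that are not part of the library.

```python
import asyncio


async def search_many(characters, search_one, max_concurrent=3):
    """Run search_one(character) for every character, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(character):
        # Only max_concurrent coroutines can be inside this block at once.
        async with semaphore:
            return await search_one(character)

    # gather() returns the results in the same order as the input list.
    return await asyncio.gather(*(guarded(char) for char in characters))


async def zdic_search_one(character):
    """Search a single character with the parser and return the parser."""
    # Imported lazily so search_many can be reused and tested without the package.
    from zdic_parser import ZDicCharacterParser

    parser = ZDicCharacterParser()
    await parser.search_async(character)
    return parser
```

It could then be called as `asyncio.run(search_many(["你", "干", "吗"], zdic_search_one, max_concurrent=2))`.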
Methods and Fields
Below is a list of the fields the ZDicCharacterParser class contains:
| Field | Data Type | Description |
|---|---|---|
| `character_info` | `dict` | Contains detailed information about a Chinese character. |
| `definitions` | `dict` | Contains definitions of the character. |
character_info structure
| Key | Data Type | Description |
|---|---|---|
| `img_src` | `str` (optional) | SVG of the character. |
| `pinyin` | `str` (optional) | Pinyin representation. |
| `zhuyin` | `str` (optional) | Zhuyin (Bopomofo) notation. |
| `radical` | `str` (optional) | Radical component. |
| `non_radical_stroke_count` | `int` (optional) | Stroke count excluding the radical. |
| `total_stroke_count` | `int` (optional) | Total stroke count. |
| `simple_trad` | `str` (optional) | Simplified and traditional forms. |
| `variant_characters` | `str` (optional) | Alternative character forms. |
| `unicode` | `str` (optional) | Unicode representation. |
| `character_structure` | `str` (optional) | Structural composition. |
| `stroke_order` | `str` (optional) | Stroke order data. |
| `wubi` | `str` (optional) | Wubi input method code. |
| `cangjie` | `str` (optional) | Cangjie input method code. |
| `zhengma` | `str` (optional) | Zhengma input method code. |
| `fcorners` | `int` (optional) | Four-corner input method code. |
definitions structure
| Key | Data Type | Description |
|---|---|---|
| `simple_defs` | `dict` | Basic definitions of the character. |
The ZDicCharacterParser class provides getters for all the aforementioned keys for convenience:
| Method | Returns | Description |
|---|---|---|
| `get_img_src()` | `str` (optional) | SVG of the character. |
| `get_pinyin()` | `str` (optional) | Pinyin representation of the character. |
| `get_zhuyin()` | `str` (optional) | Zhuyin (Bopomofo) notation. |
| `get_radical()` | `str` (optional) | Radical component of the character. |
| `get_non_radical_stroke_count()` | `int` (optional) | Stroke count excluding the radical. |
| `get_total_stroke_count()` | `int` (optional) | Total number of strokes in the character. |
| `get_simple_trad()` | `str` (optional) | Simplified and traditional forms of the character. |
| `get_variant_characters()` | `str` (optional) | Alternative character forms. |
| `get_unicode()` | `str` (optional) | Unicode representation of the character. |
| `get_character_structure()` | `str` (optional) | Structural composition of the character. |
| `get_stroke_order()` | `str` (optional) | Stroke order data. |
| `get_wubi()` | `str` (optional) | Wubi input method code. |
| `get_cangjie()` | `str` (optional) | Cangjie input method code. |
| `get_zhengma()` | `str` (optional) | Zhengma input method code. |
| `get_fcorners()` | `int` (optional) | Four-corner input method code. |
| `get_simple_defs()` | `dict` (optional) | Basic definitions of the character. |
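As an illustration of how the getters might be used together, here is a minimal sketch that builds a one-line summary from a parser that has already been searched. `summarize` is a hypothetical helper, and the stand-in object below uses placeholder values (not real ZDic data) only so the example runs offline.

```python
from types import SimpleNamespace


def summarize(parser):
    """Build a one-line summary, showing "n/a" for values that are None."""
    pinyin = parser.get_pinyin() or "n/a"
    radical = parser.get_radical() or "n/a"
    strokes = parser.get_total_stroke_count()
    stroke_text = "n/a" if strokes is None else str(strokes)
    return f"pinyin: {pinyin} | radical: {radical} | strokes: {stroke_text}"


# Stand-in with placeholder values; in practice a searched
# ZDicCharacterParser instance would be passed instead.
stub = SimpleNamespace(
    get_pinyin=lambda: None,
    get_radical=lambda: "R",
    get_total_stroke_count=lambda: 7,
)
print(summarize(stub))  # pinyin: n/a | radical: R | strokes: 7
```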
多音字 (Polyphonic Characters)
If a searched character is a 多音字 (polyphonic character), all available Pinyin and Zhuyin pronunciations will be returned as a comma-separated string:
```python
from zdic_parser import ZDicCharacterParser

# Example character to search
example = "和"

# Create an instance of the parser
parser = ZDicCharacterParser()

# Perform the search (defaults to Simplified Chinese mode)
parser.search(example)

print(parser.get_pinyin())  # Expected output: "hé, hè, huó, huò, hú"
print(parser.get_zhuyin())  # Expected output: "ㄏㄜˊ, ㄏㄜˋ, ㄏㄨㄛˊ, ㄏㄨㄛˋ, ㄏㄨˊ"
print(parser.get_variant_characters())  # Expected output: "咊, 咼, 惒, 盉, 訸, 鉌, 龢, 𤧗, 𥤉, 𧇮, 㕿, 𠰓"
```
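Because the readings arrive as a single comma-separated string, it can be convenient to split them into a list. The helper below is a minimal sketch; `split_readings` is a hypothetical name and not part of the library.

```python
def split_readings(value):
    """Turn a comma-separated string such as "hé, hè" into ["hé", "hè"]."""
    if not value:  # covers None and the empty string
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


readings = split_readings("hé, hè, huó, huò, hú")
print(readings)       # ['hé', 'hè', 'huó', 'huò', 'hú']
print(len(readings))  # 5
```

The same helper works for the Zhuyin and variant-character strings, and returns an empty list when the getter returns `None`.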
Static Methods
ZDicCharacterParser also provides static methods prefixed with fetch to fetch specific bits of information without the need to instantiate a ZDicCharacterParser object.
| Method | Returns | Description |
|---|---|---|
| `async fetch_img_src()` | `str` (optional) | SVG of the character. |
| `async fetch_pinyin()` | `str` (optional) | Pinyin representation of the character. |
| `async fetch_zhuyin()` | `str` (optional) | Zhuyin (Bopomofo) notation. |
| `async fetch_radical()` | `str` (optional) | Radical component of the character. |
| `async fetch_non_radical_stroke_count()` | `int` (optional) | Stroke count excluding the radical. |
| `async fetch_total_stroke_count()` | `int` (optional) | Total number of strokes in the character. |
| `async fetch_simple_trad()` | `str` (optional) | Simplified and traditional forms of the character. |
| `async fetch_variant_characters()` | `str` (optional) | Alternative character forms. |
| `async fetch_unicode()` | `str` (optional) | Unicode representation of the character. |
| `async fetch_character_structure()` | `str` (optional) | Structural composition of the character. |
| `async fetch_stroke_order()` | `str` (optional) | Stroke order data. |
| `async fetch_wubi()` | `str` (optional) | Wubi input method code. |
| `async fetch_cangjie()` | `str` (optional) | Cangjie input method code. |
| `async fetch_zhengma()` | `str` (optional) | Zhengma input method code. |
| `async fetch_fcorners()` | `int` (optional) | Four-corner input method code. |
| `async fetch_simple_defs()` | `dict` (optional) | Basic definitions of the character. |
For example, to fetch only the Pinyin of several characters concurrently:

```python
import asyncio

from zdic_parser import ZDicCharacterParser

# List of characters to search
characters = ["你", "干", "吗"]


async def create_coroutines(character):
    pinyin = await ZDicCharacterParser.fetch_pinyin(character)
    return pinyin


async def main():
    tasks = [create_coroutines(char) for char in characters]
    results = await asyncio.gather(*tasks)

    # Print results / Do something with the results
    for result in results:
        print(result)


# Run the asynchronous function
asyncio.run(main())
```
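Several static methods can also be awaited together for a single character. The following is a sketch only: `fetch_readings` is a hypothetical helper, and it assumes that `fetch_zhuyin()` takes the character as its sole argument in the same way `fetch_pinyin()` does in the example above.

```python
import asyncio


async def fetch_readings(character, source):
    """Fetch Pinyin and Zhuyin concurrently from an object exposing the fetch_* methods."""
    pinyin, zhuyin = await asyncio.gather(
        source.fetch_pinyin(character),
        source.fetch_zhuyin(character),
    )
    # Either value may be None when the page does not list it.
    return {"character": character, "pinyin": pinyin, "zhuyin": zhuyin}
```

With the package installed, this might be called as `asyncio.run(fetch_readings("和", ZDicCharacterParser))`. Passing the class as `source` keeps the helper easy to exercise with a stand-in object.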
Important Consideration:
When the `search` (or `search_async`) method is called, an HTTP request is sent to the corresponding ZDic page. The HTML is then scraped for information and collated.
However, not all information is available for every character. To indicate this, all methods are marked as returning optional values (`None` when unavailable).
For example, consider the character 𫵷. The only available information is:
- `radical`
- `non_radical_stroke_count`
- `total_stroke_count`
- `unicode`
- `character_structure`
- `cangjie`

In this case, calling any other getter or static method (e.g., `get_pinyin()` / `fetch_pinyin()`, `get_zhuyin()` / `fetch_zhuyin()`) will return `None`, since that data does not exist on the page.
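One way to cope with partially populated results is to check which fields are present before using them. The sketch below works on a plain dictionary shaped like `character_info`; `available_fields` and `missing_fields` are hypothetical helpers, the sample values are placeholders rather than real ZDic data, and it assumes unavailable entries are either stored as `None` or absent from the dictionary.

```python
def available_fields(character_info):
    """Return the sorted names of keys whose value is not None."""
    return sorted(key for key, value in character_info.items() if value is not None)


def missing_fields(character_info, expected):
    """Return the sorted names from `expected` that are None or absent."""
    return sorted(key for key in expected if character_info.get(key) is None)


# Placeholder data shaped like character_info (values are illustrative only).
sample = {
    "pinyin": None,
    "zhuyin": None,
    "radical": "R",
    "total_stroke_count": 7,
    "unicode": "U+0000",
}

print(available_fields(sample))
# ['radical', 'total_stroke_count', 'unicode']
print(missing_fields(sample, ["pinyin", "wubi", "radical"]))
# ['pinyin', 'wubi']
```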
Exceptions
The parser relies on the relatively static nature of ZDic's dictionary entries to extract the necessary information. However, if the structure of the site changes, the parsing algorithm may break.
In such cases, an `ElementIsMissingException` will be raised. This exception indicates that one of the following issues has occurred:
- The element's selector has changed.
- The website has been updated.
- The page URL is incorrect.
How to Handle This Exception
If you encounter an ElementIsMissingException:
- Check if ZDic's website structure has changed.
- Verify the page URL to ensure it's correct.
- Update the parser functions inside `src/utils.py` to match the new structure.
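In application code, the exception can be caught so that one failed page does not stop a larger job. This README does not state where `ElementIsMissingException` is importable from, so the sketch below takes the exception class as a parameter instead of guessing the import path; `safe_lookup` is a hypothetical helper.

```python
def safe_lookup(lookup, character, missing_exception):
    """Call lookup(character); return None if the page could not be parsed.

    `missing_exception` is the exception class to catch, for example the
    library's ElementIsMissingException once imported from the package.
    """
    try:
        return lookup(character)
    except missing_exception as exc:
        # Likely causes: changed selector, updated website, or incorrect URL.
        print(f"Could not parse the ZDic page for {character!r}: {exc}")
        return None
```

Any other exception type still propagates, so unrelated bugs are not silently hidden.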
I will try my best to monitor ZDic's page layout for drastic changes and release updates accordingly.