A Python package that asynchronously segments JSON data into TEI XML format.
ckip-2-tei
This project segments the title, body, and comments of a post stored as JSON data and writes the result as TEI XML. It uses asynchronous programming to speed up processing.
Installation
The source code is currently hosted on GitHub at: https://github.com/Taiwan-Social-Media-Corpus/ckip-2-tei
Binary installers for the latest released version are available at the Python Package Index (PyPI).
pip install ckip2tei
Documentation
1. Import module
from ckip2tei import generate_tei_xml
If you are working in a Jupyter Notebook, add these two lines first:
import nest_asyncio
nest_asyncio.apply()
Because ckip2tei is built on Python's asynchronous frameworks, it cannot run properly in a Jupyter Notebook by default: Jupyter (IPython ≥ 7.0) is already running an event loop. See the related question on StackOverflow for further details.
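The conflict can be reproduced with the standard library alone. The sketch below assumes that ckip2tei starts its own event loop with something like asyncio.run, which this page does not confirm. The names inner and outer are hypothetical, and outer stands in for the loop that Jupyter already runs.

```python
import asyncio


async def inner():
    # Stand-in for the library's asynchronous work (hypothetical).
    return "done"


async def outer():
    # Stand-in for Jupyter: we are already inside a running event loop here.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

nest_asyncio.apply() patches asyncio so that this kind of nested call is allowed, which is why the two lines above are needed in a notebook.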
2. Run pipeline
Provide the function generate_tei_xml with two arguments:
- post_data: the data to be segmented
- media: the source of the data
The post_data argument should be in the following format:
{
    "board": "Soft_Job",
    "id": "ABCD",
    "date": "1183186255",
    "title": "[請益] 最愛的程式?",
    "author": "Retr0327",
    "body": "這是一篇測試文章\n我喜歡 Python 和 TypeScript",
    "post_vote": {"推 (pos)": 2, "噓 (neg)": 0, "→ (neu)": 0},
    "comments": [
        {
            "type": "pos",
            "author": "Uncle",
            "content": "我愛 TypeScript",
            "order": "1",
        },
        {
            "type": "pos",
            "author": "Bob",
            "content": "我也很愛 Python",
            "order": "2",
        },
    ],
}
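Before running the pipeline, it can help to check that a dictionary has the shape shown above. The following is a minimal sketch, not part of ckip2tei: find_missing_keys, REQUIRED_KEYS, and COMMENT_KEYS are hypothetical names, and treating every key in the example as required is an assumption.

```python
# Keys taken from the example format above; whether the package
# actually requires all of them is an assumption.
REQUIRED_KEYS = {
    "board", "id", "date", "title", "author", "body", "post_vote", "comments",
}
COMMENT_KEYS = {"type", "author", "content", "order"}


def find_missing_keys(post_data):
    """Return the names of expected keys that are absent (hypothetical helper)."""
    missing = sorted(REQUIRED_KEYS - post_data.keys())
    for index, comment in enumerate(post_data.get("comments", [])):
        for key in sorted(COMMENT_KEYS - comment.keys()):
            missing.append(f"comments[{index}].{key}")
    return missing


# An incomplete post: only the board and a partial comment are given.
incomplete = {"board": "Soft_Job", "comments": [{"type": "pos"}]}
print(find_missing_keys(incomplete))
```

An empty list means every key from the example is present.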
Once both arguments are ready, call the function as follows:
post_data = {
    "board": "Soft_Job",
    "id": "ABCD",
    "date": "1183186255",
    "title": "[請益] 最愛的程式?",
    "author": "Retr0327",
    "body": "這是一篇測試文章\n我喜歡 Python 和 TypeScript",
    "post_vote": {"推 (pos)": 2, "噓 (neg)": 0, "→ (neu)": 0},
    "comments": [
        {
            "type": "pos",
            "author": "Uncle",
            "content": "我愛 TypeScript",
            "order": "1",
        },
        {
            "type": "pos",
            "author": "Bob",
            "content": "我也很愛 Python",
            "order": "2",
        },
    ],
}
generate_tei_xml(post_data, "ptt")
This prints:
<TEI.2>
  <teiHeader>
    <metadata name="media">ptt</metadata>
    <metadata name="author">Retr0327</metadata>
    <metadata name="id">ABCD</metadata>
    <metadata name="year">2007</metadata>
    <metadata name="board">Soft_Job</metadata>
    <metadata name="title">[請益] 最愛的程式?</metadata>
  </teiHeader>
  <text>
    <title author="Retr0327">
      <s>
        <w type="PARENTHESISCATEGORY">[</w>
        <w type="VB">請益</w>
        <w type="PARENTHESISCATEGORY">]</w>
        <w type="WHITESPACE"> </w>
        <w type="Dfa">最</w>
        <w type="VL">愛</w>
        <w type="DE">的</w>
        <w type="Na">程式</w>
        <w type="QUESTIONCATEGORY">?</w>
      </s>
    </title>
    <body author="Retr0327">
    ...
  </text>
</TEI.2>
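Each w element in the output carries a segmented word as its text and a part-of-speech tag in its type attribute. As a rough sketch of how such output might be read back, the example below parses a small fragment copied from the title above with the standard library. The helper tagged_words is a hypothetical name, and the fragment is hard-coded rather than taken from a real call to generate_tei_xml.

```python
import xml.etree.ElementTree as ET

# A well-formed fragment copied from the sample output above.
fragment = """
<title author="Retr0327">
  <s>
    <w type="Dfa">最</w>
    <w type="VL">愛</w>
    <w type="DE">的</w>
    <w type="Na">程式</w>
  </s>
</title>
"""


def tagged_words(xml_text):
    """Return (word, tag) pairs for every <w> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("type")) for w in root.iter("w")]


pairs = tagged_words(fragment)
for word, tag in pairs:
    print(word, tag)
```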
Contact Me
If you have any suggestions or questions, please do not hesitate to email me at lixingyang.dev@gmail.com