A Python package that asynchronously segments JSON data into TEI XML format.
ckip-2-tei
This project segments the title, body, and comments from a JSON file and writes them to a TEI XML file. It uses asynchronous programming to speed up processing.
Installation
The source code is currently hosted on GitHub at: https://github.com/Taiwan-Social-Media-Corpus/ckip-2-tei
Binary installers for the latest released version are available at the Python Package Index (PyPI).
pip install ckip2tei
Documentation
1. Import module
from ckip2tei import generate_tei_xml
If you are working in a Jupyter Notebook, add these two lines beforehand:
import nest_asyncio
nest_asyncio.apply()
These lines are needed because ckip2tei is built on Python's asynchronous frameworks and Jupyter (IPython ≥ 7.0) is already running an event loop, so ckip2tei cannot run properly there without the patch. See the related Stack Overflow question for further details.
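The conflict can be reproduced with the standard library alone. The sketch below does not use ckip2tei: segment and notebook_cell are hypothetical stand-ins, and it assumes the library starts its own event loop in a way comparable to asyncio.run, which the explanation above implies but does not spell out.

```python
import asyncio


async def segment():
    # Hypothetical stand-in for asynchronous segmentation work.
    await asyncio.sleep(0)
    return "done"


async def notebook_cell():
    # Simulates code executing inside an event loop that is already
    # running, as is the case in a Jupyter cell.
    coro = segment()
    try:
        # Starting a second loop from inside a running one is refused.
        asyncio.run(coro)
    except RuntimeError as err:
        coro.close()  # discard the coroutine that never started
        return str(err)
    return "no error"


message = asyncio.run(notebook_cell())
print(message)
```

nest_asyncio.apply() patches asyncio so that such nested calls are permitted, which is why the two extra lines are needed in a notebook.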
2. Run pipeline
Provide the function generate_tei_xml with two arguments:
- post_data: the data to be segmented
- media: the source of the data
The post_data argument should be in the following format:
{
    "board": "Soft_Job",
    "id": "ABCD",
    "date": "1183186255",
    "title": "[請益] 最愛的程式?",
    "author": "Retr0327",
    "body": "這是一篇測試文章\n我喜歡 Python 和 TypeScript",
    "post_vote": {"推 (pos)": 2, "噓 (neg)": 0, "→ (neu)": 0},
    "comments": [
        {
            "type": "pos",
            "author": "Uncle",
            "content": "我愛 TypeScript",
            "order": "1",
        },
        {
            "type": "pos",
            "author": "Bob",
            "content": "我也很愛 Python",
            "order": "2",
        },
    ],
}
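Before calling the pipeline, it can help to check that a post has the expected shape. The following is a minimal sketch, not part of ckip2tei: find_missing_keys is a hypothetical helper, and the key sets are copied from the example above on the assumption that every key shown there is expected.

```python
# Keys copied from the example format above. Whether ckip2tei strictly
# requires all of them is an assumption.
POST_KEYS = {
    "board", "id", "date", "title", "author", "body", "post_vote", "comments",
}
COMMENT_KEYS = {"type", "author", "content", "order"}


def find_missing_keys(post_data: dict) -> list:
    """Return the names of expected keys that are absent from post_data."""
    missing = sorted(POST_KEYS - post_data.keys())
    for index, comment in enumerate(post_data.get("comments", [])):
        for key in sorted(COMMENT_KEYS - comment.keys()):
            missing.append(f"comments[{index}].{key}")
    return missing


# A deliberately incomplete post, used only to exercise the check.
incomplete = {
    "board": "Soft_Job",
    "id": "ABCD",
    "title": "[請益] 最愛的程式?",
    "comments": [{"type": "pos", "author": "Uncle"}],
}
print(find_missing_keys(incomplete))
```

An empty list means every key from the example format is present.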
With both arguments prepared, call the function as follows:
post_data = {
    "board": "Soft_Job",
    "id": "ABCD",
    "date": "1183186255",
    "title": "[請益] 最愛的程式?",
    "author": "Retr0327",
    "body": "這是一篇測試文章\n我喜歡 Python 和 TypeScript",
    "post_vote": {"推 (pos)": 2, "噓 (neg)": 0, "→ (neu)": 0},
    "comments": [
        {
            "type": "pos",
            "author": "Uncle",
            "content": "我愛 TypeScript",
            "order": "1",
        },
        {
            "type": "pos",
            "author": "Bob",
            "content": "我也很愛 Python",
            "order": "2",
        },
    ],
}

generate_tei_xml(post_data, "ptt")
This prints:
<TEI.2>
  <teiHeader>
    <metadata name="media">ptt</metadata>
    <metadata name="author">Retr0327</metadata>
    <metadata name="id">ABCD</metadata>
    <metadata name="year">2007</metadata>
    <metadata name="board">Soft_Job</metadata>
    <metadata name="title">[請益] 最愛的程式?</metadata>
  </teiHeader>
  <text>
    <title author="Retr0327">
      <s>
        <w type="PARENTHESISCATEGORY">[</w>
        <w type="VB">請益</w>
        <w type="PARENTHESISCATEGORY">]</w>
        <w type="WHITESPACE"> </w>
        <w type="Dfa">最</w>
        <w type="VL">愛</w>
        <w type="DE">的</w>
        <w type="Na">程式</w>
        <w type="QUESTIONCATEGORY">?</w>
      </s>
    </title>
    <body author="Retr0327">
    ...
  </text>
</TEI.2>
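Once the XML is available as a string, the standard library can read it back. This sketch runs on a shortened, hand-copied excerpt of the output above (SAMPLE), with the elided body left out. It also assumes, without confirmation from ckip2tei itself, that the "date" field is a Unix timestamp in seconds, which would account for the year 2007 in the header.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Shortened excerpt of the output shown above. The <body> element is
# omitted because its content is elided in the original.
SAMPLE = """<TEI.2>
  <teiHeader>
    <metadata name="media">ptt</metadata>
    <metadata name="year">2007</metadata>
  </teiHeader>
  <text>
    <title author="Retr0327">
      <s>
        <w type="VB">請益</w>
        <w type="Dfa">最</w>
        <w type="VL">愛</w>
        <w type="DE">的</w>
        <w type="Na">程式</w>
      </s>
    </title>
  </text>
</TEI.2>"""

root = ET.fromstring(SAMPLE)

# Header fields as a dict, e.g. metadata["media"].
metadata = {m.get("name"): m.text for m in root.iter("metadata")}

# (word, part-of-speech tag) pairs in document order.
tokens = [(w.text, w.get("type")) for w in root.iter("w")]

# Assumption: "date" is seconds since the Unix epoch.
year_from_date = datetime.fromtimestamp(int("1183186255"), tz=timezone.utc).year

print(metadata)
print(tokens)
print(year_from_date)
```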
Contact Me
If you have any suggestions or questions, please do not hesitate to email me at lixingyang.dev@gmail.com.