A Python package that asynchronously segments JSON data into TEI XML format.

ckip-2-tei

This project segments the title, body, and comments of a post stored as JSON and writes the result to a TEI XML file. It uses asynchronous programming to keep the pipeline fast.

Installation

The source code is currently hosted on GitHub at: https://github.com/Taiwan-Social-Media-Corpus/ckip-2-tei

Binary installers for the latest released version are available at the Python Package Index (PyPI).

pip install ckip2tei

Documentation

1. Import module

from ckip2tei import generate_tei_xml

If you are working in Jupyter Notebook, add the following two lines beforehand:

import nest_asyncio
nest_asyncio.apply()

ckip2tei is built on Python's asynchronous framework, so it cannot run properly in Jupyter Notebook without this patch: Jupyter (IPython ≥ 7.0) is already running an event loop. See the related Stack Overflow question for further details.
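
The conflict can be reproduced with the standard library alone. The following is a minimal sketch, not ckip2tei's own code: the coroutine names `work` and `already_in_a_loop` are made up for illustration, and the outer coroutine stands in for a Jupyter cell in which a loop is already running.

```python
import asyncio


async def work() -> str:
    # Stand-in for any asynchronous task (hypothetical name).
    return "done"


async def already_in_a_loop() -> str:
    # Inside this coroutine an event loop is running, as in a Jupyter cell.
    coro = work()
    try:
        asyncio.run(coro)  # tries to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(already_in_a_loop())
print(message)
```

The printed message is the RuntimeError that asyncio raises when `asyncio.run()` is called from a running event loop. nest_asyncio patches asyncio so that such nested calls are allowed.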

2. Run pipeline

Provide the function generate_tei_xml with two arguments:

  • post_data: the data to be segmented
  • media: the source of the data

The post_data argument should be in the following format:

{
    "board": "Soft_Job",
    "id": "ABCD",
    "date": "1183186255",
    "title": "[請益] 最愛的程式?",
    "author": "Retr0327",
    "body": "這是一篇測試文章\n我喜歡 Python 和 TypeScript",
    "post_vote": {"推 (pos)": 2, "噓 (neg)": 0, "→ (neu)": 0},
    "comments": [
        {
            "type": "pos",
            "author": "Uncle",
            "content": "我愛 TypeScript",
            "order": "1",
        },
        {
            "type": "pos",
            "author": "Bob",
            "content": "我也很愛 Python",
            "order": "2",
        },
    ],
}
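
Before passing a post to the pipeline, it can help to check that it has the shape shown above. The following is a rough sketch under assumptions: `check_post_data` and `post_year` are hypothetical helpers, not part of ckip2tei, the key sets are simply read off the example, and treating `date` as a Unix timestamp in seconds is an inference from the sample value.

```python
from datetime import datetime, timezone

# Keys read off the example above; whether ckip2tei requires all of them
# is an assumption.
REQUIRED_KEYS = {"board", "id", "date", "title", "author", "body",
                 "post_vote", "comments"}
COMMENT_KEYS = {"type", "author", "content", "order"}


def check_post_data(post: dict) -> list:
    """Return a list of problems; an empty list means the shape matches."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(post))]
    for index, comment in enumerate(post.get("comments", [])):
        for key in sorted(COMMENT_KEYS - set(comment)):
            problems.append(f"comment {index} missing key: {key}")
    if not str(post.get("date", "")).isdigit():
        problems.append("date is not a Unix timestamp string")
    return problems


def post_year(post: dict) -> int:
    """Interpret `date` as a Unix timestamp (seconds) and return the UTC year."""
    return datetime.fromtimestamp(int(post["date"]), tz=timezone.utc).year


example = {
    "board": "Soft_Job",
    "id": "ABCD",
    "date": "1183186255",
    "title": "[請益] 最愛的程式?",
    "author": "Retr0327",
    "body": "這是一篇測試文章\n我喜歡 Python 和 TypeScript",
    "post_vote": {"推 (pos)": 2, "噓 (neg)": 0, "→ (neu)": 0},
    "comments": [
        {"type": "pos", "author": "Uncle", "content": "我愛 TypeScript", "order": "1"},
    ],
}

print(check_post_data(example))  # → []
print(post_year(example))        # → 2007
```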

With both arguments prepared, call the function as follows:

post_data = {
    "board": "Soft_Job",
    "id": "ABCD",
    "date": "1183186255",
    "title": "[請益] 最愛的程式?",
    "author": "Retr0327",
    "body": "這是一篇測試文章\n我喜歡 Python 和 TypeScript",
    "post_vote": {"推 (pos)": 2, "噓 (neg)": 0, "→ (neu)": 0},
    "comments": [
        {
            "type": "pos",
            "author": "Uncle",
            "content": "我愛 TypeScript",
            "order": "1",
        },
        {
            "type": "pos",
            "author": "Bob",
            "content": "我也很愛 Python",
            "order": "2",
        },
    ],
}

generate_tei_xml(post_data, "ptt")

This prints:

<TEI.2>
   <teiHeader>
      <metadata name="media">ptt</metadata>
      <metadata name="author">Retr0327</metadata>
      <metadata name="id">ABCD</metadata>
      <metadata name="year">2007</metadata>
      <metadata name="board">Soft_Job</metadata>
      <metadata name="title">[請益] 最愛的程式?</metadata>
   </teiHeader>
   <text>
      <title author="Retr0327">
         <s>
            <w type="PARENTHESISCATEGORY">[</w>
            <w type="VB">請益</w>
            <w type="PARENTHESISCATEGORY">]</w>
            <w type="WHITESPACE"> </w>
            <w type="Dfa">最</w>
            <w type="VL">愛</w>
            <w type="DE">的</w>
            <w type="Na">程式</w>
            <w type="QUESTIONCATEGORY">?</w>
         </s>
      </title>
      <body author="Retr0327">
        ...
   </text>
</TEI.2>
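
Output of this shape can be read back with the standard library. The sketch below assumes the XML is available as a string (how you capture it from ckip2tei is not shown here) and uses a shortened copy of the sample above. It collects each `<w>` element's text together with its `type` attribute, which holds the part-of-speech tag.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample output; a real document has more elements.
tei = """<TEI.2>
   <teiHeader>
      <metadata name="media">ptt</metadata>
      <metadata name="year">2007</metadata>
   </teiHeader>
   <text>
      <title author="Retr0327">
         <s>
            <w type="VB">請益</w>
            <w type="Na">程式</w>
         </s>
      </title>
   </text>
</TEI.2>"""

root = ET.fromstring(tei)

# Header fields as a plain dict: {"media": "ptt", "year": "2007"}
metadata = {m.get("name"): m.text for m in root.iter("metadata")}

# (word, part-of-speech tag) pairs in document order
tokens = [(w.text, w.get("type")) for w in root.iter("w")]

print(metadata)
print(tokens)
```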

Contact Me

If you have any suggestions or questions, please do not hesitate to email me at lixingyang.dev@gmail.com
