A utility for storing and reading files for Korean LM training.

Project description

ko_lm_dataformat

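A utility for storing and loading training data for Korean language models.
- Uses zstandard and ultrajson to speed up data loading and compression
- Stores document metadata alongside the text
The code is based on lm_dataformat, which is used by EleutherAI.
- Fixes some bugs
- Adds and modifies features for Korean (sentence splitter, text cleaner)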

Installation

Versions from 0.3.1 onward support Python 3.9 or later.

pip3 install ko_lm_dataformat

Usage

1. Write Data

1.1. Archive

The kss v1 sentence splitter is available.

import ko_lm_dataformat as kldf

ar = kldf.Archive("output_dir")
ar = kldf.Archive("output_dir", sentence_splitter=kldf.KssV1SentenceSplitter()) # Use sentence splitter

1.2. Adding data

Metadata can be attached to each document (e.g. title, url).
A single document is assumed as input (if a List[str] is passed instead of a str, it is treated as multiple sentences).
If split_sent=True, the document is split into sentences and stored as List[str].
If clean_sent=True, NFC normalization, control character removal, and whitespace cleanup are applied.

# doc_lst, othermetadata and otherrandomstuff are placeholders for your own data
for doc in doc_lst:
    ar.add_data(
        data=doc,
        meta={
            "source": "kowiki",
            "meta_key_1": [othermetadata, otherrandomstuff],
            "meta_key_2": True,
        },
        split_sent=False,
        clean_sent=False,
    )

# remember to commit at the end!
ar.commit()
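
To make the clean_sent steps listed above more concrete, here is a minimal sketch of that kind of cleanup using only the Python standard library. The function name `clean_sentence` is hypothetical and this is not the library's own implementation; it only illustrates NFC normalization, control character removal, and whitespace cleanup under the assumption that "control characters" means the Unicode `Cc` category.

```python
import re
import unicodedata


def clean_sentence(text: str) -> str:
    """Illustrative cleanup (hypothetical helper, not the library's code)."""
    # NFC composes decomposed Hangul jamo into precomposed syllables.
    text = unicodedata.normalize("NFC", text)
    # Turn every whitespace run (tabs, newlines, ...) into a single space
    # first, so those characters are not lost in the next step.
    text = re.sub(r"\s+", " ", text)
    # Drop the remaining control characters (Unicode category "Cc").
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    # Removing a control character can leave two spaces side by side.
    return re.sub(r" +", " ", text).strip()


if __name__ == "__main__":
    # Decomposed jamo for the first syllable, a NUL byte, and messy whitespace.
    print(clean_sentence("\u1112\u1161\u11ab글\x00  \t test\n"))
```

Normalizing before the other steps keeps the result stable whether the input text arrived in composed or decomposed form.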

2. Read Data

Calling rdr.stream_data(get_meta=True) returns each item as a (doc, meta) tuple.

import ko_lm_dataformat as kldf

rdr = kldf.Reader("output_dir")

for data in rdr.stream_data(get_meta=False):
  print(data)
  # "간단하게 설명하면, 언어를 통해 인간의 삶을 미적(美的)으로 형상화한 것이라고 볼...."


for data in rdr.stream_data(get_meta=True):
  print(data)
  # ("간단하게 설명하면, 언어를 통해 인간의 삶을 미적(美的)으로 형상화한 것이라고 볼....", {"source": "kowiki", ...})
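
Since the reader yields (doc, meta) tuples when get_meta=True, downstream code can work on any iterable of such pairs. The sketch below counts documents per source; `count_docs_by_source` is a hypothetical helper, and the "source" key is assumed only because it appears in the metadata example above.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple


def count_docs_by_source(pairs: Iterable[Tuple[Any, Dict[str, Any]]]) -> Counter:
    """Count documents per meta["source"] (hypothetical helper)."""
    counts: Counter = Counter()
    for _doc, meta in pairs:
        # Documents without a "source" key are grouped under "unknown".
        counts[meta.get("source", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Stand-in data; with the library this would instead be
    # count_docs_by_source(rdr.stream_data(get_meta=True)).
    sample = [
        ("doc one", {"source": "kowiki"}),
        ("doc two", {"source": "kowiki"}),
        ("doc three", {}),
    ]
    print(count_docs_by_source(sample))
```

Because the helper only iterates once over its input, it works with a streaming reader without loading the whole archive into memory.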

Release history

0.3.1 (this version): Oct 15, 2025
0.3.0: Jan 15, 2024
0.2.0: Sep 11, 2021
0.1.0: Jun 29, 2021
0.1.0rc3 (pre-release): Jun 24, 2021
0.1.0rc2 (pre-release): Jun 23, 2021
0.1.0rc1 (pre-release): Jun 19, 2021

Download files

Source Distribution

ko_lm_dataformat-0.3.1.tar.gz (9.3 kB)

Uploaded Oct 15, 2025 Source

Built Distribution

ko_lm_dataformat-0.3.1-py3-none-any.whl (10.0 kB)

Uploaded Oct 15, 2025 Python 3

File details

Details for the file ko_lm_dataformat-0.3.1.tar.gz.

File metadata

Download URL: ko_lm_dataformat-0.3.1.tar.gz
Upload date: Oct 15, 2025
Size: 9.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ko_lm_dataformat-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`cd7561a93e8f1fe3ff58233d6f2101175cd0ad4f0d1da6c9533d9b61c28cdece`
MD5	`daa22d1fcfc8b98f30c5ca339597cca2`
BLAKE2b-256	`9d4b57320348d4da80afef5c64a0f71d5babd28266ac92d1d8bdc77fbdbe3e96`
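
A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard library's `hashlib`; the helper name `sha256_of_file` and the local file path are assumptions for illustration.

```python
import hashlib
import os


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Digest copied from the table above; the path is a hypothetical
    # location for the downloaded source distribution.
    expected = "cd7561a93e8f1fe3ff58233d6f2101175cd0ad4f0d1da6c9533d9b61c28cdece"
    path = "ko_lm_dataformat-0.3.1.tar.gz"
    if os.path.exists(path):
        print("match" if sha256_of_file(path) == expected else "MISMATCH")
    else:
        print(f"{path} not found; download it first")
```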

Provenance

The following attestation bundles were made for ko_lm_dataformat-0.3.1.tar.gz:

Publisher: release-and-publish-pip.yml on monologg/ko_lm_dataformat

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ko_lm_dataformat-0.3.1.tar.gz
- Subject digest: cd7561a93e8f1fe3ff58233d6f2101175cd0ad4f0d1da6c9533d9b61c28cdece
- Sigstore transparency entry: 607815230
- Sigstore integration time: Oct 15, 2025
Source repository:
- Permalink: monologg/ko_lm_dataformat@77d0850047326da42586d248946254c144a4ed44
- Branch / Tag: refs/heads/master
- Owner: https://github.com/monologg
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-and-publish-pip.yml@77d0850047326da42586d248946254c144a4ed44
- Trigger Event: workflow_dispatch

File details

Details for the file ko_lm_dataformat-0.3.1-py3-none-any.whl.

File metadata

Download URL: ko_lm_dataformat-0.3.1-py3-none-any.whl
Upload date: Oct 15, 2025
Size: 10.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ko_lm_dataformat-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2717a2f30e105ef2628667f849516319996d94367959a22dab67af350cc120fd`
MD5	`b75d5d78b1a7c92f3ce4543c07ce8dff`
BLAKE2b-256	`7e36acf6b2dbacfd920e80d934de7dc6eba3f7bdf00fbf09a0c1397ac25f95aa`

Provenance

The following attestation bundles were made for ko_lm_dataformat-0.3.1-py3-none-any.whl:

Publisher: release-and-publish-pip.yml on monologg/ko_lm_dataformat

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ko_lm_dataformat-0.3.1-py3-none-any.whl
- Subject digest: 2717a2f30e105ef2628667f849516319996d94367959a22dab67af350cc120fd
- Sigstore transparency entry: 607815238
- Sigstore integration time: Oct 15, 2025
Source repository:
- Permalink: monologg/ko_lm_dataformat@77d0850047326da42586d248946254c144a4ed44
- Branch / Tag: refs/heads/master
- Owner: https://github.com/monologg
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-and-publish-pip.yml@77d0850047326da42586d248946254c144a4ed44
- Trigger Event: workflow_dispatch
