A utility for storing and reading files for Korean LM training.

Project description

ko_lm_dataformat


  • A utility for saving and loading training data for Korean language models

    • Uses zstandard and ultrajson for faster data loading and compression
    • Stores metadata for each document alongside the text
  • Based on lm_dataformat, used by EleutherAI

    • Fixes several bugs
    • Adds and adapts features for Korean (sentence splitter, text cleaner)

Installation

Versions after 0.3.1 support Python 3.9 and later.

pip3 install ko_lm_dataformat

Usage

1. Write Data

1.1. Archive

import ko_lm_dataformat as kldf

ar = kldf.Archive("output_dir")
ar = kldf.Archive("output_dir", sentence_splitter=kldf.KssV1SentenceSplitter()) # Use sentence splitter
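When a sentence splitter is passed, it handles sentence boundary detection for `split_sent=True` (`KssV1SentenceSplitter` is presumably backed by the kss library). For illustration only, a naive stand-in with the same role might look like this; the class and its `split` method are hypothetical, not kldf API:

```python
import re

class NaiveSentenceSplitter:
    """Hypothetical stand-in for a kldf sentence splitter:
    splits on sentence-final punctuation (or 다) followed by whitespace."""

    def split(self, text: str) -> list[str]:
        # Split where whitespace follows ., !, ?, or the Korean ending 다
        return [s for s in re.split(r"(?<=[.!?다])\s+", text.strip()) if s]

splitter = NaiveSentenceSplitter()
print(splitter.split("오늘은 날씨가 좋다. 내일은 비가 온다."))
# → ['오늘은 날씨가 좋다.', '내일은 비가 온다.']
```

A real splitter for Korean needs far more care (quotations, abbreviations, informal endings), which is why kldf delegates to a dedicated library instead of a regex.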

1.2. Adding data

  • Metadata can be attached to each document (e.g. title, URL)
  • Each call assumes a single document; passing a List[str] instead of a str is treated as a list of sentences
  • With split_sent=True, the document is split into sentences and stored as a List[str]
  • With clean_sent=True, NFC normalization, control character removal, and whitespace cleanup are applied

for doc in doc_lst:
    ar.add_data(
        data=doc,
        meta={
          "source": "kowiki",
          "meta_key_1": [othermetadata, otherrandomstuff],
          "meta_key_2": True
        },
        split_sent=False,
        clean_sent=False,
    )

# remember to commit at the end!
ar.commit()
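The `clean_sent=True` steps above (NFC normalization, control character removal, whitespace cleanup) can be approximated in plain Python. This is an illustrative sketch, not kldf's actual implementation:

```python
import re
import unicodedata

def clean_sentence(text: str) -> str:
    """Rough approximation of clean_sent=True."""
    # NFC normalization: decomposed Hangul jamo become composed syllables
    text = unicodedata.normalize("NFC", text)
    # Drop control characters (Unicode category Cc), keeping ordinary whitespace
    text = "".join(ch for ch in text if ch in "\t\n " or unicodedata.category(ch) != "Cc")
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(clean_sentence("한\u1100\u1161\x00  나다\t라  "))
```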

2. Read Data

  • Calling rdr.stream_data(get_meta=True) returns (doc, meta) tuples

import ko_lm_dataformat as kldf

rdr = kldf.Reader("output_dir")

for data in rdr.stream_data(get_meta=False):
  print(data)
  # "간단하게 설명하면, 언어를 통해 인간의 삶을 미적(美的)으로 형상화한 것이라고 볼...."


for data in rdr.stream_data(get_meta=True):
  print(data)
  # ("간단하게 설명하면, 언어를 통해 인간의 삶을 미적(美的)으로 형상화한 것이라고 볼....", {"source": "kowiki", ...})

Download files


Source Distribution

ko_lm_dataformat-0.3.1.tar.gz (9.3 kB)

Uploaded Source

Built Distribution


ko_lm_dataformat-0.3.1-py3-none-any.whl (10.0 kB)

Uploaded Python 3

File details

Details for the file ko_lm_dataformat-0.3.1.tar.gz.

File metadata

  • Download URL: ko_lm_dataformat-0.3.1.tar.gz
  • Upload date:
  • Size: 9.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ko_lm_dataformat-0.3.1.tar.gz
Algorithm Hash digest
SHA256 cd7561a93e8f1fe3ff58233d6f2101175cd0ad4f0d1da6c9533d9b61c28cdece
MD5 daa22d1fcfc8b98f30c5ca339597cca2
BLAKE2b-256 9d4b57320348d4da80afef5c64a0f71d5babd28266ac92d1d8bdc77fbdbe3e96


Provenance

The following attestation bundles were made for ko_lm_dataformat-0.3.1.tar.gz:

Publisher: release-and-publish-pip.yml on monologg/ko_lm_dataformat

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file ko_lm_dataformat-0.3.1-py3-none-any.whl.

File hashes

Hashes for ko_lm_dataformat-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 2717a2f30e105ef2628667f849516319996d94367959a22dab67af350cc120fd
MD5 b75d5d78b1a7c92f3ce4543c07ce8dff
BLAKE2b-256 7e36acf6b2dbacfd920e80d934de7dc6eba3f7bdf00fbf09a0c1397ac25f95aa


Provenance

The following attestation bundles were made for ko_lm_dataformat-0.3.1-py3-none-any.whl:

Publisher: release-and-publish-pip.yml on monologg/ko_lm_dataformat

