A utility for storing and reading files for Korean LM training.
ko_lm_dataformat
- A utility for storing and loading training data for Korean language models
- The code is based on lm_dataformat, which EleutherAI uses
  - Fixes some bugs
  - Adds and adapts features for Korean (sentence splitter, text cleaner)
Installation
Version 0.3.1 and later support Python 3.9 or higher.
pip3 install ko_lm_dataformat
Usage
1. Write Data
1.1. Archive
- The kss v1 sentence splitter can be used
import ko_lm_dataformat as kldf
ar = kldf.Archive("output_dir")
ar = kldf.Archive("output_dir", sentence_splitter=kldf.KssV1SentenceSplitter()) # Use sentence splitter
1.2. Adding data
- Metadata can be attached (e.g. title, url)
- The input is assumed to be a single document (if a List[str] is passed instead of a str, it is treated as multiple sentences)
- If split_sent=True, the document is split into sentences and stored as a List[str]
- If clean_sent=True, NFC normalization, control character removal, and whitespace cleanup are applied
for doc in doc_lst:
    ar.add_data(
        data=doc,
        meta={
            "source": "kowiki",
            "meta_key_1": [othermetadata, otherrandomstuff],
            "meta_key_2": True
        },
        split_sent=False,
        clean_sent=False,
    )

# remember to commit at the end!
ar.commit()
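To make the split_sent and clean_sent options more concrete, here is a minimal standalone sketch of the kind of preprocessing they describe. It is not the library's implementation: `clean_text`, `naive_split`, and `prepare` are hypothetical names, the regex splitter is a crude stand-in for kss, and the order (split first, then clean each sentence) is an assumption.

```python
import re
import unicodedata
from typing import List, Union


def clean_text(text: str) -> str:
    """Hypothetical stand-in for what clean_sent=True is described as doing."""
    # NFC normalization: compose decomposed Hangul jamo into full syllables.
    text = unicodedata.normalize("NFC", text)
    # Drop control/format characters (Unicode categories starting with "C"),
    # but keep whitespace such as \n and \t so the next step can collapse it.
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    # Whitespace cleanup: collapse runs into one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def naive_split(text: str) -> List[str]:
    """Crude stand-in for a real sentence splitter such as kss (assumption)."""
    # Split on whitespace that follows sentence-ending punctuation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def prepare(
    data: Union[str, List[str]],
    split_sent: bool = False,
    clean_sent: bool = False,
) -> Union[str, List[str]]:
    """Mimic the documented str vs. List[str] handling (order is assumed)."""
    # A str is one document; a list is treated as already-split sentences.
    if isinstance(data, str) and split_sent:
        data = naive_split(data)
    if clean_sent:
        if isinstance(data, str):
            data = clean_text(data)
        else:
            data = [clean_text(s) for s in data]
    return data


if __name__ == "__main__":
    raw = "\u1112\u1161\u11ab\u1100\u1173\u11af\x00 입니다.  \n다음 문장!"
    print(prepare(raw, split_sent=True, clean_sent=True))
```

For real Korean text, punctuation-based splitting breaks down quickly (quotes, abbreviations, sentences without final punctuation), which is why the library offers a kss-based splitter instead.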
2. Read Data
With rdr.stream_data(get_meta=True), each item is returned as a (doc, meta) tuple.
import ko_lm_dataformat as kldf
rdr = kldf.Reader("output_dir")
for data in rdr.stream_data(get_meta=False):
    print(data)
    # "간단하게 설명하면, 언어를 통해 인간의 삶을 미적(美的)으로 형상화한 것이라고 볼...."

for data in rdr.stream_data(get_meta=True):
    print(data)
    # ("간단하게 설명하면, 언어를 통해 인간의 삶을 미적(美的)으로 형상화한 것이라고 볼....", {"source": "kowiki", ...})
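Because the reader yields (doc, meta) tuples when get_meta=True, downstream code can work on any iterable of such tuples. The sketch below tallies documents per "source" key; `count_by_source` is a hypothetical helper, not part of ko_lm_dataformat, and the sample list only imitates the tuple shape described above.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple


def count_by_source(records: Iterable[Tuple[Any, Dict[str, Any]]]) -> Counter:
    """Count documents per meta["source"]; missing keys go under "unknown"."""
    counts: Counter = Counter()
    for _doc, meta in records:
        counts[meta.get("source", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Stand-in data with the same (doc, meta) shape as the reader output.
    sample = [
        ("doc a", {"source": "kowiki"}),
        ("doc b", {"source": "kowiki"}),
        ("doc c", {}),
    ]
    print(count_by_source(sample))
```

With an archive on disk, the same function could presumably be called as `count_by_source(rdr.stream_data(get_meta=True))`, since it consumes the stream lazily.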