A utility for storing and reading files for Korean LM training.
ko_lm_dataformat
- Utilities for storing data for Korean PLM.
- Code is based on lm_dataformat.
What has changed
Added features
- Sentence splitter (kss v1.3.1)
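Logic changes
- Unlike the original, the "text" field of each json record always holds exactly one document. When a List[str] is passed instead of a str, the original treated each str as a separate document; here each str is treated as a sentence of that one document.
- The original joined multiple documents with \n\n; ko_lm_dataformat removes that logic.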
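The two logic changes can be illustrated with a small stand-alone sketch. It does not call ko_lm_dataformat: upstream_text_field and ko_text_field are hypothetical helpers written only to contrast the two interpretations described above, and how the library actually serializes a list of sentences is an assumption here.

```python
from typing import List, Union


def upstream_text_field(data: Union[str, List[str]]) -> str:
    """Mimic the original behavior described above (hypothetical helper):
    a list is several documents, joined with a blank line."""
    if isinstance(data, str):
        return data
    return "\n\n".join(data)


def ko_text_field(data: Union[str, List[str]]) -> Union[str, List[str]]:
    """Mimic the changed behavior (hypothetical helper): the value is always
    one document; a list is kept as the sentences of that document."""
    if isinstance(data, str):
        return data
    return list(data)


sentences = ["첫 번째 문장입니다.", "두 번째 문장입니다."]
print(repr(upstream_text_field(sentences)))  # one string, joined with a blank line
print(repr(ko_text_field(sentences)))        # still a list: one document, two sentences
```

The practical consequence is that a caller who used to pass a list of documents in one call would, under the new rule, need to call add_data once per document.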
Basic Usage
To write:
import ko_lm_dataformat as kldf

# something(), somedocument, stuff, etc. are placeholders for your own data
ar = kldf.Archive('output_dir')
for x in something():
    # do other stuff
    ar.add_data(somedocument, meta={
        'example': stuff,
        'someothermetadata': [othermetadata, otherrandomstuff],
        'otherotherstuff': True
    })
# remember to commit at the end!
ar.commit()
To read:
import ko_lm_dataformat as kldf

rdr = kldf.Reader('input_dir_or_file')
for doc in rdr.stream_data(get_meta=False):
    # do something with the document
    pass
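To make the shape of the stored data concrete, here is a stand-in sketch that uses only the Python standard library. It assumes each record is one JSON object with a "text" field and a "meta" field, written one per line; that layout follows lm_dataformat and is an assumption, not something stated here. write_records and read_records are hypothetical helpers, not part of ko_lm_dataformat, and compression is left out.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator, List, Tuple, Union

Text = Union[str, List[str]]  # one document, as a string or as a list of sentences


def write_records(path: str, records: List[Tuple[Text, Dict[str, Any]]]) -> None:
    """Write one JSON object per line (assumed layout: "text" plus "meta")."""
    with open(path, "w", encoding="utf-8") as f:
        for text, meta in records:
            line = json.dumps({"text": text, "meta": meta}, ensure_ascii=False)
            f.write(line + "\n")


def read_records(path: str, get_meta: bool = False) -> Iterator[Any]:
    """Yield each document, or (document, meta) pairs when get_meta is True."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if get_meta:
                yield record["text"], record["meta"]
            else:
                yield record["text"]


path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
write_records(path, [
    ("안녕하세요.", {"source": "example"}),
    (["문장 하나.", "문장 둘."], {"source": "example"}),
])
docs = list(read_records(path))
docs_with_meta = list(read_records(path, get_meta=True))
print(docs)
```

The get_meta flag in this sketch only mirrors the name used in the reading example; whether the library returns metadata as a tuple alongside the text should be checked against its source.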