LM Ops Tool for Korean

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Data-Modori: LM Ops Tool for Korean

Data-Modori is a platform that collects data from a variety of sources and assembles the pieces into a single, unified dataset, so that you can reach the information you need from one place.

- Data Integration: We collect data from various sources and integrate it into one central hub for your convenience.
- Flexible Analysis: Use advanced analysis tools to explore your data and gain new insights and perspectives.
- Customized Results: Organize and present data according to your requirements, delivering tailored results.
- User-Friendly Interface: An intuitive, easy-to-use interface lets users work with data without requiring advanced knowledge.

Installation

Get the source code from GitHub:

git clone https://github.com/teamreboott/data-modori
cd data-modori

Run the following command to install the latest basic data_juicer version in editable mode:

pip install -v -e .

Some OPs depend on third-party libraries that are very large or have limited platform compatibility. You can install these optional dependencies as needed:

pip install -v -e .  # install minimal dependencies, which support the basic functions
pip install -v -e ".[tools]"  # install a subset of tools dependencies

The dependency options are listed below:

Tag	Description
`.` or `.[mini]`	Install minimal dependencies for basic Data-Modori.
`.[all]`	Install all optional dependencies (including minimal dependencies and all of the following).
`.[sci]`	Install all dependencies for all OPs.
`.[dist]`	Install dependencies for distributed data processing. (Experimental)
`.[dev]`	Install dependencies for developing the package as contributors.
`.[tools]`	Install dependencies for dedicated tools, such as quality classifiers.
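
The tags in the table can be combined in a single install, since pip accepts several extras inside one pair of brackets, separated by commas. The following is a minimal sketch that builds such a command; the helper name `pip_install_args` is hypothetical and is not part of Data-Modori, and the tag names are simply copied from the table.

```python
import sys

# Tags copied from the table above; "mini" is the minimal set.
KNOWN_EXTRAS = {"mini", "all", "sci", "dist", "dev", "tools"}


def pip_install_args(extras=()):
    """Build the argument list for an editable install with optional extras.

    Hypothetical helper for illustration only.
    """
    unknown = sorted(set(extras) - KNOWN_EXTRAS)
    if unknown:
        raise ValueError(f"unknown dependency tag(s): {', '.join(unknown)}")
    # pip accepts several extras in one bracket, separated by commas.
    target = "." if not extras else ".[" + ",".join(extras) + "]"
    return [sys.executable, "-m", "pip", "install", "-v", "-e", target]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(pip_install_args(["tools", "dist"])))
```

Passing the list to `subprocess.run` from the repository root would perform the install. Passing the target as a single list element also avoids shell globbing of the square brackets.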

Data Processing

Run the process_data.py tool with your config file as the argument to process your dataset:

python tools/process_data.py --config configs/process.yaml

Note: Some operators rely on third-party models or resources that are not stored locally. The first run of such operators may be slow, because they must first download those resources into a cache directory. The default cache directory is ~/.cache/data_juicer. To change it, set the shell environment variable DATA_JUICER_CACHE_HOME to another directory; DATA_JUICER_MODELS_CACHE and DATA_JUICER_ASSETS_CACHE can be changed in the same way:

# cache home
export DATA_JUICER_CACHE_HOME="/path/to/another/directory"
# cache models
export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models"
# cache assets
export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets"
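
To check which directories a given environment would select, the lookup can be sketched in Python. This is an illustration, not the library's own code: the function name `resolve_cache_dirs` is hypothetical, and the fallback of `models` and `assets` to subfolders of the cache home is an assumption suggested by the example paths above.

```python
import os
from pathlib import Path


def resolve_cache_dirs(env=None):
    """Return the cache home, models, and assets directories as Paths.

    Hypothetical helper; the real resolution logic may differ.
    """
    env = os.environ if env is None else env
    # Default cache home as stated in the note above.
    home = Path(env.get("DATA_JUICER_CACHE_HOME", "~/.cache/data_juicer")).expanduser()
    # Assumption: the sub-caches default to folders under the cache home.
    models = Path(env.get("DATA_JUICER_MODELS_CACHE", home / "models")).expanduser()
    assets = Path(env.get("DATA_JUICER_ASSETS_CACHE", home / "assets")).expanduser()
    return {"home": home, "models": models, "assets": assets}


if __name__ == "__main__":
    for name, path in resolve_cache_dirs().items():
        print(f"{name}: {path}")
```

Passing a plain dictionary instead of `os.environ` makes it easy to preview the effect of an `export` before applying it in the shell.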

Data Analysis

Run the analyze_data.py tool with your config file as the argument to analyse your dataset:

python tools/analyze_data.py --config configs/analyser.yaml

Note: The Analyser only computes stats for Filter ops, so any extra Mapper or Deduplicator ops are ignored during analysis.
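
To preview which entries of an operator list the Analyser would take into account, the list can be split as in the sketch below. It assumes that Filter ops can be recognised by a `_filter` suffix, as in `language_id_score_filter`; that naming rule, the function name `ops_seen_by_analyser`, and the two `example_*` op names are assumptions made for illustration.

```python
def ops_seen_by_analyser(process):
    """Split an operator list into ops the Analyser uses and ops it skips."""
    used, skipped = [], []
    for entry in process:
        # Each entry is a single-key mapping: {op_name: {arguments}}.
        (name,) = entry.keys()
        # Assumption: Filter ops are recognisable by a "_filter" suffix.
        (used if name.endswith("_filter") else skipped).append(name)
    return used, skipped


process = [
    {"language_id_score_filter": {"lang": "en"}},
    {"example_mapper": {}},        # hypothetical op name
    {"example_deduplicator": {}},  # hypothetical op name
]
used, skipped = ops_seen_by_analyser(process)
print("analysed:", used)
print("ignored:", skipped)
```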

Data Visualization

Run the app.py tool to visualize your dataset in your browser.
Note: This is only available when Data-Modori is installed from source.

streamlit run app.py

Build Up Config Files

Config files specify some global arguments and an operator list for the data process. You need to set:
- Global arguments: input/output dataset path, number of workers, etc.
- Operator list: the operators used to process the dataset, together with their arguments.
You can build your own config files as follows:
- ➖: Start from our example config file config_all.yaml, which includes all ops with their default arguments. Remove the ops you won't use and adjust the arguments of the rest.
- ➕: Write your own config file from scratch. You can refer to our example config file config_all.yaml, the op documents, and the advanced Build-Up Guide for developers.
- Besides the YAML files, you can also specify one or several parameters on the command line, which override the values in the YAML files:

python xxx.py --config configs/process.yaml --language_id_score_filter.lang=ko

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: './data/test.json'  # path to your dataset directory or file
export_path: './output/test.jsonl'

np: 4  # number of subprocesses used to process your dataset
text_keys: 'content'

# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'en'
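
The effect of a command-line override such as `--language_id_score_filter.lang=ko` on this config can be sketched in plain Python. This is not the tool's actual argument parser: the function `apply_override` is hypothetical, the config is written as a dictionary that mirrors the YAML above, and the overridden value is kept as a string.

```python
import copy


def apply_override(config, override):
    """Return a copy of config with one 'op.param=value' override applied.

    Hypothetical helper; the real parser may behave differently.
    """
    key, _, value = override.lstrip("-").partition("=")
    op_name, _, param = key.partition(".")
    if not op_name or not param:
        raise ValueError(f"expected 'op.param=value', got {override!r}")
    updated = copy.deepcopy(config)
    for entry in updated["process"]:
        if op_name in entry:
            # A bare `- op_name:` entry in YAML loads as None, so normalise it.
            entry[op_name] = dict(entry[op_name] or {}, **{param: value})
            return updated
    raise KeyError(f"operator {op_name!r} is not in the process list")


# Dictionary equivalent of the YAML example above.
config = {
    "project_name": "demo-process",
    "dataset_path": "./data/test.json",
    "export_path": "./output/test.jsonl",
    "np": 4,
    "text_keys": "content",
    "process": [{"language_id_score_filter": {"lang": "en"}}],
}

merged = apply_override(config, "--language_id_score_filter.lang=ko")
print(merged["process"][0])
```

The original `config` is left untouched, so the YAML values and the overridden values can be compared side by side.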

License

Data-Modori is released under Apache License 2.0.

Contributing

We are in a rapidly developing field and greatly welcome contributions of new features, bug fixes, and better documentation. Please refer to the How-to Guide for Developers.

Acknowledgement

Data-Modori is used across various LLM products and research initiatives, including industrial LLMs from Teamreboott AI TEAM(AR), such as AUT for trade and AUW for work.

We look forward to more of your experience, suggestions and discussions for collaboration!

Data-Modori thanks and refers to several community projects, including data-juicer, Huggingface-Datasets, Bloom, Pile, Megatron-LM, DeepSpeed, Arrow, Ray, Beam, LM-Harness, and HELM.

References

If you find our work useful for your research or development, please kindly cite the following paper.

@misc{chen2023datajuicer,
title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
year={2023},
eprint={2309.02033},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

Release history

0.1.1
Dec 14, 2023

0.1.0 (this version)
Dec 14, 2023

Download files

Source Distribution

py-data-modori-0.1.0.tar.gz (81.9 kB view details)

Uploaded Dec 14, 2023 Source

Built Distribution

py_data_modori-0.1.0-py3-none-any.whl (12.4 kB view details)

Uploaded Dec 14, 2023 Python 3

File details

Details for the file py-data-modori-0.1.0.tar.gz.

File metadata

Download URL: py-data-modori-0.1.0.tar.gz
Upload date: Dec 14, 2023
Size: 81.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for py-data-modori-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`19c8fcfe9c104aac95f31502f50f054767722a3a6ee7424c39db292f36b19d3c`
MD5	`7e0ada55a09ed70b873ddb7b1dc87852`
BLAKE2b-256	`7aaf30fe4cbf37de14fe5f0b3ef6bbcb3098906e81188ae30c2e9bb8a7c91888`
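
A downloaded archive can be checked against the published SHA256 digest with Python's standard `hashlib` module. The sketch below is one way to do it; the function names are hypothetical, and the file is assumed to have been downloaded already to a path you supply.

```python
import hashlib

# SHA256 digest published above for py-data-modori-0.1.0.tar.gz.
EXPECTED_SHA256 = "19c8fcfe9c104aac95f31502f50f054767722a3a6ee7424c39db292f36b19d3c"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected one."""
    return sha256_of(path) == expected.lower()
```

Calling `matches_published_hash("py-data-modori-0.1.0.tar.gz")` on the downloaded source distribution should return `True` if the file is intact.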

File details

Details for the file py_data_modori-0.1.0-py3-none-any.whl.

File metadata

Download URL: py_data_modori-0.1.0-py3-none-any.whl
Upload date: Dec 14, 2023
Size: 12.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for py_data_modori-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4c76b2a3ee3bcac2f12b18ff50989cb2a6610bcec83cb4e127122f4b7ec202c8`
MD5	`8da255976e8cabfbb634ce33b879ea05`
BLAKE2b-256	`8e4fa8d2d3ebec6a133b48980b518653ed1e92b933169aab530be9d8fd1fca8b`
