A lightweight data management and preprocessing tool.
Project description
Cogdata
Install
pip install cogdata --index-url https://test.pypi.org/simple
sudo install_unrarlib.sh
Directory Structure
.
├── cogdata_task_task1
│ ├── cogdata_config.json (indicating a task path)
│ ├── merged.bin
│ ├── dataset1
│ │ ├── dataset1.bin
│ │ └── meta_info.json
│ └── dataset2
│ ├── dataset2.bin
│ └── meta_info.json
├── dataset1
│ ├── cogdata_info.json (indicating a dataset path)
│ ├── dataset1.json
│ └── dataset1.rar
└── dataset2
├── cogdata_info.json
├── dataset2.json
└── dataset2.zip
Pipeline
The motivation of this project is to provide lightweight APIs for large-scale NN-based data-processing, e.g. ImageTokenization. The abstraction has 3 parts:
- Dataset: Raw dataset from other organization in various formats, e.g. rar, zip, etc. The information are recorded at
cogdata_info.jsonin its split folder. - Task: A task is a collection of "configs, results for different datsets, logs, merged results, and evenly split results". The config of a task are recorded in
cogdata_info.json. The states (processed, hanging/running, unprocessed)of a dataset in this tasks are inmeta_info.json. - DataSaver: The format of saved results. The first option is our
BinSaver, which saves plain bytes with fixed length. It can be read or memmap very fast. The config of DataSaver are also with the task incogdata_info.json.
Commands
cogdata create_dataset [-h] [--description DESCRIPTION] --data_files DATA_FILES [DATA_FILES ...] --data_format DATA_FORMAT [--text_files TEXT_FILES [TEXT_FILES ...]] [--text_format TEXT_FORMAT] name
Alias: cogdata data .... data_format is chosen from class names in cogdata.datasets, e.g. StreamingRarDataset. Texts related options are optional for text-image datasets.
cogdata create_task [-h] [--description DESCRIPTION] --task_type TASK_TYPE --saver_type SAVER_TYPE [--length_per_sample LENGTH_PER_SAMPLE] [--img_sizes IMG_SIZES [IMG_SIZES ...]] [--txt_len TXT_LEN]
[--dtype {int32,int64,float32,uint8,bool}]
task_id
Customized Tasks
Add --extra_code PATH_TO_CODE after cogdata (e.g., cogdata --extra_code ../examples/convert2tar_task.py [task or process] to execute and register your own task before running the command. See examples/ for details.
TODO List
- 验证create task任务对应task和saver的参数是否传全且合理 [wendi]
- 将现有的cogview数据纳入管理,并测试 [zhuoyi]
- 增加tokenization task中多个imgsize的处理 [mingding]
- 增加在不修改源代码的基础上register args task saver dataset的功能 [mingding]
- 上传至真实的pypi,公开仓库 [mingding]
- sphinx 注释文档撰写 [yuxiang]
- 整理单元测试,只使用小的testcase [wendi]
- PPT [yuxiang]
- 视频介绍 [yuxiang]
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cogdata-0.0.4.tar.gz.
File metadata
- Download URL: cogdata-0.0.4.tar.gz
- Upload date:
- Size: 553.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.53.0 CPython/3.8.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4a842991b7c1da3f2fda8e38d0728f7e420d8c99cbe26442ba4d12c2458ce884
|
|
| MD5 |
0108b47b477e042f74ec601155e2ce58
|
|
| BLAKE2b-256 |
ae824624b29d050c607325cf22035b5460f94c878ad2a0814d586b962fbf3d4b
|
File details
Details for the file cogdata-0.0.4-py3-none-any.whl.
File metadata
- Download URL: cogdata-0.0.4-py3-none-any.whl
- Upload date:
- Size: 561.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.53.0 CPython/3.8.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
19033718e3256b0516f4f63f557110847073fd7692026c77182c20ad9958acd7
|
|
| MD5 |
a6595d3b2733215c72e85c6b023045df
|
|
| BLAKE2b-256 |
cfbc336f4ad5ad0788930ec6e477557f25b6ab2a6117bbb7e15eaeab61397516
|