A lightweight data management and preprocessing tool.
Cogdata
Install
pip install cogdata
sudo `which install_unrarlib.sh`
Directory Structure
.
├── cogdata_task_task1
│   ├── cogdata_config.json (indicating a task path)
│   ├── merged.bin
│   ├── dataset1
│   │   ├── dataset1.bin
│   │   └── meta_info.json
│   └── dataset2
│       ├── dataset2.bin
│       └── meta_info.json
├── dataset1
│   ├── cogdata_info.json (indicating a dataset path)
│   ├── dataset1.json
│   └── dataset1.rar
└── dataset2
    ├── cogdata_info.json
    ├── dataset2.json
    └── dataset2.zip
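The layout above marks each folder's role with a file: cogdata_info.json for a dataset path and cogdata_config.json for a task path. A minimal sketch of how a script could tell the two apart is shown below. It is not cogdata's own code; `classify_folders` and `build_demo_tree` are hypothetical helpers written only for this illustration, and the marker files are created empty.

```python
import json
import tempfile
from pathlib import Path

# Marker file names are taken from the annotated tree above.
DATASET_MARKER = "cogdata_info.json"
TASK_MARKER = "cogdata_config.json"


def classify_folders(root):
    """Return (dataset_dirs, task_dirs) found directly under root."""
    datasets, tasks = [], []
    for child in sorted(Path(root).iterdir()):
        if not child.is_dir():
            continue
        if (child / TASK_MARKER).is_file():
            tasks.append(child.name)
        elif (child / DATASET_MARKER).is_file():
            datasets.append(child.name)
    return datasets, tasks


def build_demo_tree(root):
    """Create a bare copy of the layout above (hypothetical helper)."""
    root = Path(root)
    for name in ("dataset1", "dataset2"):
        (root / name).mkdir()
        (root / name / DATASET_MARKER).write_text(json.dumps({}))
    task = root / "cogdata_task_task1"
    task.mkdir()
    (task / TASK_MARKER).write_text(json.dumps({}))
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(classify_folders(build_demo_tree(tmp)))
```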
Pipeline
The motivation of this project is to provide lightweight APIs for large-scale NN-based data-processing, e.g. ImageTokenization. The abstraction has 3 parts:
- Dataset: A raw dataset from another organization, in one of various formats, e.g. rar, zip, etc. Its information is recorded in cogdata_info.json in the dataset's folder.
- Task: A task is a collection of "configs, results for different datasets, logs, merged results, and evenly split results". The config of a task is recorded in cogdata_info.json. The state (processed, hanging/running, unprocessed) of each dataset in the task is recorded in meta_info.json.
- DataSaver: The format of the saved results. The first option is our BinSaver, which saves plain bytes with fixed length. The output can be read or memory-mapped very quickly. The DataSaver config is stored with the task in cogdata_info.json.
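To illustrate why fixed-length plain bytes are fast to read, the sketch below writes a few equally sized samples to a file and then fetches one by offset through a memory map. It uses only the Python standard library and does not reproduce BinSaver's actual layout: the little-endian 4-byte integers, the sample length of 4, and the helper names are all assumptions made for this example.

```python
import mmap
import os
import struct
import tempfile

LENGTH_PER_SAMPLE = 4  # hypothetical: number of values in each sample
ITEM_FORMAT = "i"      # hypothetical: 4-byte signed integer (like int32)
SAMPLE_FORMAT = f"<{LENGTH_PER_SAMPLE}{ITEM_FORMAT}"
SAMPLE_BYTES = struct.calcsize(SAMPLE_FORMAT)


def write_samples(path, samples):
    """Append every sample as raw bytes; all samples have the same size."""
    with open(path, "wb") as f:
        for sample in samples:
            f.write(struct.pack(SAMPLE_FORMAT, *sample))


def count_samples(path):
    """With fixed-length records, the count follows from the file size."""
    return os.path.getsize(path) // SAMPLE_BYTES


def read_sample(path, index):
    """Jump straight to sample `index` without reading earlier samples."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = index * SAMPLE_BYTES
            return list(struct.unpack_from(SAMPLE_FORMAT, mm, offset))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.bin")
        write_samples(path, [[1, 2, 3, 4], [5, 6, 7, 8]])
        print(count_samples(path), read_sample(path, 1))
```

Because every record has the same size, locating a sample is a single multiplication, and the operating system only pages in the bytes that are actually touched.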
Commands
cogdata create_dataset [-h] [--description DESCRIPTION] --data_files DATA_FILES [DATA_FILES ...] --data_format DATA_FORMAT [--text_files TEXT_FILES [TEXT_FILES ...]] [--text_format TEXT_FORMAT] name
Alias: cogdata data ...
data_format is chosen from the class names in cogdata.datasets, e.g. StreamingRarDataset. Text-related options are optional; they are meant for text-image datasets.
cogdata create_task [-h] [--description DESCRIPTION] --task_type TASK_TYPE --saver_type SAVER_TYPE [--length_per_sample LENGTH_PER_SAMPLE] [--img_sizes IMG_SIZES [IMG_SIZES ...]] [--txt_len TXT_LEN]
[--dtype {int32,int64,float32,uint8,bool}]
task_id
Alias: cogdata task ...
task_type and saver_type are chosen from the class names in cogdata, e.g. ImageTextTokenizationTask or BinarySaver.
cogdata process [-h] --task_id TASK_ID [--nproc NPROC] [--dataloader_num_workers DATALOADER_NUM_WORKERS]
[--batch_size BATCH_SIZE] [--ratio RATIO]
[datasets [datasets ...]]
The i-th process will be bound to the i-th GPU.
cogdata merge [-h] --task_id TASK_ID
Merge all the processed data.
cogdata list [-h] [--task_id TASK_ID]
List all the current datasets in this folder.
cogdata clean [-h] [--task_id TASK_ID]
Clean the unfinished states of the task.
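The commands above can be chained into one run. The sketch below only assembles the command lines and prints them; it does not execute cogdata. The step order (create a dataset, create a task, process, merge), the names dataset1 and task1, the file dataset1.rar, and the value 2 for --nproc are assumptions for illustration. The flags and class names are the ones listed above.

```python
import shlex


def cogdata_cmd(*args):
    """Build an argv list for one cogdata invocation."""
    return ["cogdata", *map(str, args)]


def build_workflow(dataset="dataset1", task_id="task1"):
    """Return the four command lines of a hypothetical end-to-end run."""
    return [
        cogdata_cmd(
            "create_dataset",
            "--data_files", f"{dataset}.rar",
            "--data_format", "StreamingRarDataset",
            dataset,
        ),
        cogdata_cmd(
            "create_task",
            "--task_type", "ImageTextTokenizationTask",
            "--saver_type", "BinarySaver",
            task_id,
        ),
        cogdata_cmd("process", "--task_id", task_id, "--nproc", 2, dataset),
        cogdata_cmd("merge", "--task_id", task_id),
    ]


if __name__ == "__main__":
    for cmd in build_workflow():
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print(shlex.join(cmd))
```

Each list could be passed to subprocess.run once the arguments match your own data.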
Customized Tasks
Add --extra_code PATH_TO_CODE after cogdata (e.g., cogdata --extra_code ../examples/convert2tar_task.py [task or process]) to execute your code and register your own task before the command runs. See examples/ for details.
TODO List
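- Support processing text in multiple different formats
- Write more detailed Sphinx docstring documentation
- Finer-grained parameter management; generalize tokenization
- Slides (PPT) & video introduction
- Merge video processing [Wenyi]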
- Merge Object detection [Zhuoyi]