
A lightweight data management and preprocessing tool.


Cogdata

Install

pip install cogdata
sudo `which install_unrarlib.sh`

Directory Structure

.
├── cogdata_task_task1
│   ├── cogdata_config.json (indicating a task path)
│   ├── merged.bin
│   ├── dataset1
│   │   ├── dataset1.bin
│   │   └── meta_info.json
│   └── dataset2
│       ├── dataset2.bin
│       └── meta_info.json
├── dataset1
│   ├── cogdata_info.json (indicating a dataset path)
│   ├── dataset1.json
│   └── dataset1.rar
└── dataset2
    ├── cogdata_info.json
    ├── dataset2.json
    └── dataset2.zip
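
The marker files in the tree above are enough to tell dataset folders from task folders. The following is a minimal sketch of that idea, not cogdata's own implementation: the function name `find_cogdata_dirs` is hypothetical, and it relies only on the two marker file names shown in the tree.

```python
import os

def find_cogdata_dirs(root):
    """Classify the immediate subfolders of `root` by their marker file.

    Hypothetical helper (not part of cogdata): a folder holding
    cogdata_info.json is treated as a dataset path, and a folder holding
    cogdata_config.json is treated as a task path.
    """
    datasets, tasks = [], []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            continue
        if os.path.isfile(os.path.join(path, "cogdata_info.json")):
            datasets.append(name)
        elif os.path.isfile(os.path.join(path, "cogdata_config.json")):
            tasks.append(name)
    return datasets, tasks
```

Called on the root of the layout above, this would return the two dataset folder names and the single task folder name as two separate lists.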

Pipeline

The motivation of this project is to provide lightweight APIs for large-scale NN-based data processing, e.g. image tokenization. The abstraction has three parts:

  • Dataset: A raw dataset from another organization, in one of various formats, e.g. rar or zip. Its information is recorded in cogdata_info.json in its own separate folder.
  • Task: A task is a collection of "configs, results for different datasets, logs, merged results, and evenly split results". The config of a task is recorded in cogdata_config.json. The state (processed, hanging/running, unprocessed) of each dataset in the task is recorded in meta_info.json.
  • DataSaver: The format of the saved results. The first option is our BinarySaver, which saves plain bytes with a fixed length per sample. It can be read or memory-mapped very quickly. The DataSaver config is stored together with the task config in cogdata_config.json.
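
To illustrate why fixed-length records are fast to read, here is a minimal sketch using only the Python standard library. It is not the BinarySaver code: the function names, the little-endian int32 layout, and the toy file are all assumptions made for the example. The point is that sample `i` starts at byte offset `i * record_size`, so a memory-mapped file gives random access without scanning.

```python
import mmap
import os
import struct
import tempfile

def write_fixed_length(path, samples, length_per_sample):
    """Write every sample as exactly `length_per_sample` little-endian int32 values.

    Assumed layout for illustration only; the real on-disk format may differ.
    """
    fmt = f"<{length_per_sample}i"
    with open(path, "wb") as f:
        for sample in samples:
            if len(sample) != length_per_sample:
                raise ValueError("every sample must have exactly length_per_sample values")
            f.write(struct.pack(fmt, *sample))

def read_sample(path, index, length_per_sample):
    """Read one sample by memory-mapping the file and jumping to its offset."""
    fmt = f"<{length_per_sample}i"
    record_size = struct.calcsize(fmt)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            if not 0 <= index < len(mm) // record_size:
                raise IndexError("sample index out of range")
            # No scan is needed: the offset follows directly from the index.
            return list(struct.unpack_from(fmt, mm, index * record_size))

def demo():
    """Round-trip three toy samples and read back the middle one."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        write_fixed_length(path, [[1, 2, 3], [4, 5, 6], [7, 8, 9]], 3)
        return read_sample(path, 1, 3)
```

Because no per-sample header or index is needed, reading stays cheap however large the file grows, at the cost of requiring every sample to have the same length.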

Commands

cogdata create_dataset  [-h] [--description DESCRIPTION] --data_files DATA_FILES [DATA_FILES ...] --data_format DATA_FORMAT [--text_files TEXT_FILES [TEXT_FILES ...]] [--text_format TEXT_FORMAT] name

Alias: cogdata data ... The data_format is chosen from the class names in cogdata.datasets, e.g. StreamingRarDataset. The text-related options are optional and are meant for text-image datasets.

cogdata create_task [-h] [--description DESCRIPTION] --task_type TASK_TYPE --saver_type SAVER_TYPE [--length_per_sample LENGTH_PER_SAMPLE] [--img_sizes IMG_SIZES [IMG_SIZES ...]] [--txt_len TXT_LEN]
                           [--dtype {int32,int64,float32,uint8,bool}]
                           task_id

Alias: cogdata task ... The task_type and saver_type are chosen from the class names in cogdata, e.g. ImageTextTokenizationTask or BinarySaver.

cogdata process [-h] --task_id TASK_ID [--nproc NPROC] [--dataloader_num_workers DATALOADER_NUM_WORKERS]
                       [--batch_size BATCH_SIZE] [--ratio RATIO]
                       [datasets [datasets ...]]

The i-th process will be bound to the i-th GPU.

cogdata merge [-h] --task_id TASK_ID

Merge all the processed data.

cogdata list [-h] [--task_id TASK_ID]

List all the current datasets in this folder.

cogdata clean [-h] [--task_id TASK_ID]

Clean the unfinished states of the task.
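
The commands above are typically run in sequence: create a dataset, create a task, process, then merge. The sketch below only assembles those command lines as argument lists, using the flags from the usage strings above. The helper name `build_workflow` and the argument values (dataset1, dataset1.rar, task1) are placeholders, and it is an assumption that the required flags alone are sufficient for a given task type.

```python
import shlex

def build_workflow(dataset_name, data_file, task_id, nproc=1):
    """Return the four cogdata command lines as argument lists.

    Hypothetical helper: it only builds the commands and does not run them.
    """
    return [
        ["cogdata", "create_dataset",
         "--data_files", data_file,
         "--data_format", "StreamingRarDataset",
         dataset_name],
        ["cogdata", "create_task",
         "--task_type", "ImageTextTokenizationTask",
         "--saver_type", "BinarySaver",
         task_id],
        ["cogdata", "process",
         "--task_id", task_id,
         "--nproc", str(nproc),
         dataset_name],
        ["cogdata", "merge", "--task_id", task_id],
    ]

# Print the commands for review; pass each list to subprocess.run to execute.
for cmd in build_workflow("dataset1", "dataset1.rar", "task1", nproc=2):
    print(shlex.join(cmd))
```

Keeping each command as a list avoids shell-quoting problems if a dataset name or path contains spaces.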

Customized Tasks

Add --extra_code PATH_TO_CODE after cogdata (e.g., cogdata --extra_code ../examples/convert2tar_task.py [task or process]) to execute your code and register your own task before the command runs. See examples/ for details.

TODO List

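  • Support text processing for multiple different formats
  • Write more detailed Sphinx docstring documentation
  • Finer-grained parameter management; generalize tokenization
  • Slides (PPT) and video introduction
  • Merge video processing [Wenyi]
  • Merge object detection [Zhuoyi]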
