Skip to main content

A lightweight data management and preprocessing tool.

Project description

Cogdata

Install

pip install cogdata --index-url https://test.pypi.org/simple
sudo install_unrarlib.sh

Directory Structure

.
├── cogdata_task_task1
│   ├── cogdata_config.json (indicating a task path)
│   ├── merged.bin
│   ├── dataset1
│   │   ├── dataset1.bin
│   │   └── meta_info.json
│   └── dataset2
│       ├── dataset2.bin
│       └── meta_info.json
├── dataset1
│   ├── cogdata_info.json (indicating a dataset path)
│   ├── dataset1.json
│   └── dataset1.rar
└── dataset2
    ├── cogdata_info.json
    ├── dataset2.json
    └── dataset2.zip

Pipeline

The motivation of this project is to provide lightweight APIs for large-scale NN-based data-processing, e.g. ImageTokenization. The abstraction has 3 parts:

  • Dataset: Raw dataset from other organization in various formats, e.g. rar, zip, etc. The information are recorded at cogdata_info.json in its split folder.
  • Task: A task is a collection of "configs, results for different datsets, logs, merged results, and evenly split results". The config of a task are recorded in cogdata_info.json. The states (processed, hanging/running, unprocessed)of a dataset in this tasks are in meta_info.json.
  • DataSaver: The format of saved results. The first option is our BinSaver, which saves plain bytes with fixed length. It can be read or memmap very fast. The config of DataSaver are also with the task in cogdata_info.json.

Commands

cogdata create_dataset  [-h] [--description DESCRIPTION] --data_files DATA_FILES [DATA_FILES ...] --data_format DATA_FORMAT [--text_files TEXT_FILES [TEXT_FILES ...]] [--text_format TEXT_FORMAT] name

Alias: cogdata data .... data_format is chosen from class names in cogdata.datasets, e.g. StreamingRarDataset. Texts related options are optional for text-image datasets.

cogdata create_task [-h] [--description DESCRIPTION] --task_type TASK_TYPE --saver_type SAVER_TYPE [--length_per_sample LENGTH_PER_SAMPLE] [--img_sizes IMG_SIZES [IMG_SIZES ...]] [--txt_len TXT_LEN]
                           [--dtype {int32,int64,float32,uint8,bool}]
                           task_id

Customized Tasks

Add --extra_code PATH_TO_CODE after cogdata (e.g., cogdata --extra_code ../examples/convert2tar_task.py [task or process] to execute and register your own task before running the command. See examples/ for details.

TODO List

  • 验证create task任务对应task和saver的参数是否传全且合理 [wendi]
  • 将现有的cogview数据纳入管理,并测试 [zhuoyi]
  • 增加tokenization task中多个imgsize的处理 [mingding]
  • 增加在不修改源代码的基础上register args task saver dataset的功能 [mingding]
  • 上传至真实的pypi,公开仓库 [mingding]
  • sphinx 注释文档撰写 [yuxiang]
  • 整理单元测试,只使用小的testcase [wendi]
  • PPT [yuxiang]
  • 视频介绍 [yuxiang]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cogdata-0.0.4.tar.gz (553.6 kB view hashes)

Uploaded Source

Built Distribution

cogdata-0.0.4-py3-none-any.whl (561.3 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page