A lightweight data management and preprocessing tool.

Cogdata

Install

pip install cogdata --index-url https://test.pypi.org/simple
sudo install_unrarlib.sh

Directory Structure

.
├── cogdata_task_task1
│   ├── cogdata_config.json (indicating a task path)
│   ├── merged.bin
│   ├── dataset1
│   │   ├── dataset1.bin
│   │   └── meta_info.json
│   └── dataset2
│       ├── dataset2.bin
│       └── meta_info.json
├── dataset1
│   ├── cogdata_info.json (indicating a dataset path)
│   ├── dataset1.json
│   └── dataset1.rar
└── dataset2
    ├── cogdata_info.json
    ├── dataset2.json
    └── dataset2.zip
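
The marker files in the tree above are enough to tell dataset folders from task folders. The following is a minimal sketch of that idea, not cogdata's own code: `classify_folders` is a hypothetical helper name, and the only facts taken from the layout are that `cogdata_info.json` marks a dataset path and `cogdata_config.json` marks a task path.

```python
from pathlib import Path

DATASET_MARKER = "cogdata_info.json"    # marks a dataset folder (from the tree above)
TASK_MARKER = "cogdata_config.json"     # marks a task folder (from the tree above)


def classify_folders(root):
    """Return (datasets, tasks): sorted names of the direct subfolders of
    `root` that contain the dataset marker or the task marker.

    Hypothetical helper for illustration; cogdata may detect paths differently.
    """
    datasets, tasks = [], []
    for child in sorted(Path(root).iterdir()):
        if not child.is_dir():
            continue  # plain files at the top level are ignored
        if (child / DATASET_MARKER).is_file():
            datasets.append(child.name)
        if (child / TASK_MARKER).is_file():
            tasks.append(child.name)
    return datasets, tasks
```

Calling `classify_folders(".")` in a directory laid out like the tree above would list `dataset1` and `dataset2` as datasets and `cogdata_task_task1` as a task.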

Pipeline

This project provides lightweight APIs for large-scale, neural-network-based data processing, e.g. ImageTokenization. The abstraction has three parts:

  • Dataset: A raw dataset from another organization, in one of various formats (e.g. rar, zip). Its information is recorded in cogdata_info.json in its own folder.
  • Task: A task is a collection of "configs, results for different datasets, logs, merged results, and evenly split results". The config of a task is recorded in cogdata_config.json (see the directory structure above). The state (processed, hanging/running, unprocessed) of each dataset in the task is recorded in meta_info.json.
  • DataSaver: The format of the saved results. The first option is our BinSaver, which saves plain bytes with a fixed length per sample, so the file can be read or memory-mapped very quickly. The DataSaver config is stored with the task config in cogdata_config.json.
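
To illustrate why fixed-length records are fast to read, here is a minimal sketch using only the Python standard library. It is not BinSaver's implementation: the record layout (4 little-endian int32 values per sample) and the helper names `write_samples` and `read_sample` are assumptions made for the example.

```python
import mmap
import struct

LENGTH_PER_SAMPLE = 4   # assumed number of values per sample
ITEM_FORMAT = "i"       # assumed dtype: 32-bit signed integer
SAMPLE_STRUCT = struct.Struct(f"<{LENGTH_PER_SAMPLE}{ITEM_FORMAT}")


def write_samples(path, samples):
    """Store every sample as exactly SAMPLE_STRUCT.size raw bytes, back to back."""
    with open(path, "wb") as f:
        for sample in samples:
            f.write(SAMPLE_STRUCT.pack(*sample))


def read_sample(path, index):
    """Memory-map the file and decode sample `index` directly at its byte offset."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            n_samples = len(mm) // SAMPLE_STRUCT.size
            if not 0 <= index < n_samples:
                raise IndexError(index)
            # Fixed length means the offset is a simple multiplication.
            return SAMPLE_STRUCT.unpack_from(mm, index * SAMPLE_STRUCT.size)
```

Because every sample occupies the same number of bytes, sample `i` starts at byte `i * SAMPLE_STRUCT.size`, so no index file or scan is needed to reach it.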

Commands

cogdata create_dataset [-h] [--description DESCRIPTION] --data_files DATA_FILES [DATA_FILES ...] --data_format DATA_FORMAT [--text_files TEXT_FILES [TEXT_FILES ...]] [--text_format TEXT_FORMAT] name

Alias: cogdata data .... The data_format value is chosen from the class names in cogdata.datasets, e.g. StreamingRarDataset. The text-related options are optional and are intended for text-image datasets.
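
When the command is driven from a script, the usage line above can be assembled programmatically. The sketch below only builds the argument list and does not run cogdata; `create_dataset_command` is a hypothetical helper, and the example values (`dataset1`, `dataset1.rar`, `StreamingRarDataset`) are taken from this page.

```python
import shlex


def create_dataset_command(name, data_files, data_format,
                           description=None, text_files=None, text_format=None):
    """Build the argv list for `cogdata create_dataset` from the usage line.

    Hypothetical helper. The multi-valued options are placed before the
    single-valued --data_format so that the trailing positional `name`
    cannot be mistaken for another file.
    """
    argv = ["cogdata", "create_dataset"]
    if description is not None:
        argv += ["--description", description]
    if text_files:
        argv += ["--text_files", *text_files]
    if text_format is not None:
        argv += ["--text_format", text_format]
    argv += ["--data_files", *data_files, "--data_format", data_format]
    argv.append(name)
    return argv


cmd = create_dataset_command(
    name="dataset1",
    data_files=["dataset1.rar"],
    data_format="StreamingRarDataset",
)
print(shlex.join(cmd))
# cogdata create_dataset --data_files dataset1.rar --data_format StreamingRarDataset dataset1
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where cogdata is installed.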

cogdata create_task [-h] [--description DESCRIPTION] --task_type TASK_TYPE --saver_type SAVER_TYPE [--length_per_sample LENGTH_PER_SAMPLE] [--img_sizes IMG_SIZES [IMG_SIZES ...]] [--txt_len TXT_LEN]
                           [--dtype {int32,int64,float32,uint8,bool}]
                           task_id
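
To make the required and optional arguments of create_task explicit, the usage line above can be mirrored with argparse. This is a sketch derived from the usage text only, not cogdata's actual parser: the integer types are assumptions, `ExampleTask` is a placeholder task type, and whether `BinSaver` is the exact value accepted by `--saver_type` is also an assumption.

```python
import argparse


def build_create_task_parser():
    """Mirror of the `create_task` usage line (illustrative, not cogdata's parser)."""
    p = argparse.ArgumentParser(prog="cogdata create_task")
    p.add_argument("--description")
    p.add_argument("--task_type", required=True)
    p.add_argument("--saver_type", required=True)
    p.add_argument("--length_per_sample", type=int)      # int is an assumption
    p.add_argument("--img_sizes", type=int, nargs="+")   # int is an assumption
    p.add_argument("--txt_len", type=int)                # int is an assumption
    p.add_argument("--dtype",
                   choices=["int32", "int64", "float32", "uint8", "bool"])
    p.add_argument("task_id")
    return p


args = build_create_task_parser().parse_args([
    "--task_type", "ExampleTask",   # placeholder value
    "--saver_type", "BinSaver",     # assumed spelling of the saver name
    "--length_per_sample", "64",
    "--img_sizes", "256",
    "--dtype", "int32",
    "task1",
])
print(args.task_id, args.img_sizes, args.dtype)
```

Only `--task_type`, `--saver_type`, and the positional `task_id` are mandatory; every bracketed option in the usage line may be omitted.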

Customized Tasks

Add --extra_code PATH_TO_CODE after cogdata (e.g., cogdata --extra_code ../examples/convert2tar_task.py [task or process]) to execute your code and register your own task before the command runs. See examples/ for details.
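
The "execute and register" step can be pictured as running the user's file with a registration hook in scope. The sketch below is a generic illustration of that pattern, not cogdata's mechanism: `TASK_REGISTRY`, `register`, `load_extra_code`, and `MyCustomTask` are all hypothetical names, and the real registration API is shown in examples/.

```python
import tempfile
from pathlib import Path

TASK_REGISTRY = {}  # hypothetical registry: task name -> class


def register(cls):
    """Hypothetical decorator that records a task class under its class name."""
    TASK_REGISTRY[cls.__name__] = cls
    return cls


def load_extra_code(path):
    """Execute a user file so that its @register'ed classes land in the registry."""
    source = Path(path).read_text()
    namespace = {"register": register, "__name__": "extra_code"}
    exec(compile(source, str(path), "exec"), namespace)


# Stand-in for a user file such as the one passed to --extra_code.
user_code = '''
@register
class MyCustomTask:
    def process(self, sample):
        return sample
'''

with tempfile.TemporaryDirectory() as tmp:
    extra = Path(tmp) / "my_task.py"
    extra.write_text(user_code)
    load_extra_code(extra)

print(sorted(TASK_REGISTRY))
# ['MyCustomTask']
```

Running the extra code before the command is parsed is what would let a newly registered task name be accepted as a task type without editing the package source.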

TODO List

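  • Verify that the task and saver arguments passed to create task are complete and reasonable [wendi]
  • Bring the existing cogview data under management and test it [zhuoyi]
  • Add handling of multiple image sizes in the tokenization task [mingding]
  • Add the ability to register args, tasks, savers, and datasets without modifying the source code [mingding]
  • Upload to the real PyPI and make the repository public [mingding]
  • Write the Sphinx docstring documentation [yuxiang]
  • Clean up the unit tests so that they use only small test cases [wendi]
  • Slides (PPT) [yuxiang]
  • Video introduction [yuxiang]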

Download files

Source Distribution

cogdata-0.0.4.tar.gz (553.6 kB view details)

Uploaded Source

Built Distribution

cogdata-0.0.4-py3-none-any.whl (561.3 kB view details)

Uploaded Python 3

File details

Details for the file cogdata-0.0.4.tar.gz.

File metadata

  • Download URL: cogdata-0.0.4.tar.gz
  • Upload date:
  • Size: 553.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.53.0 CPython/3.8.8

File hashes

Hashes for cogdata-0.0.4.tar.gz
Algorithm Hash digest
SHA256 4a842991b7c1da3f2fda8e38d0728f7e420d8c99cbe26442ba4d12c2458ce884
MD5 0108b47b477e042f74ec601155e2ce58
BLAKE2b-256 ae824624b29d050c607325cf22035b5460f94c878ad2a0814d586b962fbf3d4b

File details

Details for the file cogdata-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: cogdata-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 561.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.53.0 CPython/3.8.8

File hashes

Hashes for cogdata-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 19033718e3256b0516f4f63f557110847073fd7692026c77182c20ad9958acd7
MD5 a6595d3b2733215c72e85c6b023045df
BLAKE2b-256 cfbc336f4ad5ad0788930ec6e477557f25b6ab2a6117bbb7e15eaeab61397516
