Scalable SWE datasets based on OSS

ossdata 🚀

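ossdata is a command-line tool that syncs SWE (software engineering) datasets from Hugging Face, such as SWE-bench, to a private OSS (Object Storage Service) bucket, and provides convenient queries and a unified read interface. Each record can carry a docker_image field so that different agent frameworks can run rollouts in the same way.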


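🌟 Features

  • One-step upload: load a dataset from Hugging Face or a JSON Lines file and push it to OSS.
  • Version management: split@revision forms a unique version, so multiple versions of a dataset can coexist.
  • Flexible queries
    • List all datasets
    • List all versions of a dataset
    • List all instance_id values in a version
    • Fetch a field value by instance_id and key (e.g. problem_statement, patch)
  • Structured storage: data is organized in OSS as /{name}/{version}/{instance_id}.json, which makes it easy to plug into training and evaluation pipelines.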
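
The storage layout above can be illustrated with a short Python sketch. The helpers `object_key` and `parse_object_key` are hypothetical and are not part of ossdata. The sketch assumes that the version and instance ID contain no `/`, while the dataset name may contain one (as in princeton-nlp/SWE-bench), and it drops the leading slash of the documented layout.

```python
from typing import Tuple


def object_key(name: str, version: str, instance_id: str) -> str:
    """Build the key for one instance: {name}/{version}/{instance_id}.json."""
    return f"{name}/{version}/{instance_id}.json"


def parse_object_key(key: str) -> Tuple[str, str, str]:
    """Split a key back into (name, version, instance_id).

    Splitting from the right keeps dataset names that contain a slash intact.
    """
    stem = key.lstrip("/")
    if not stem.endswith(".json"):
        raise ValueError(f"not an instance object: {key!r}")
    stem = stem[: -len(".json")]
    name, version, instance_id = stem.rsplit("/", 2)
    return name, version, instance_id


key = object_key("princeton-nlp/SWE-bench", "test", "pandas__pandas-44271")
print(key)
print(parse_object_key(key))
```

Splitting from the right is the design choice that matters here: Hugging Face dataset names usually have the form org/name, so a left-to-right split would cut the name in half.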

📦 Installation and configuration

pip install git+http://gitlab.alibaba-inc.com/Qwen-Coder/ossdata.git

export OSS_ACCESS_KEY_ID=""
export OSS_ACCESS_KEY_SECRET=""
export OSS_REGION="ap-southeast-1"
export OSS_ENDPOINT="https://oss-ap-southeast-1-internal.aliyuncs.com"
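
Before running any command it can help to confirm that all four variables are set. The following is a minimal sketch of such a check; `load_oss_config` is a hypothetical helper, and how ossdata itself reads these variables is not documented here. The credential values in the example are placeholders.

```python
import os
from typing import Dict, Mapping

# The four variables exported in the configuration step.
REQUIRED_VARS = (
    "OSS_ACCESS_KEY_ID",
    "OSS_ACCESS_KEY_SECRET",
    "OSS_REGION",
    "OSS_ENDPOINT",
)


def load_oss_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the OSS settings, failing early if any is missing or empty."""
    missing = [var for var in REQUIRED_VARS if not env.get(var)]
    if missing:
        raise RuntimeError("missing OSS settings: " + ", ".join(missing))
    return {var: env[var] for var in REQUIRED_VARS}


# Example with placeholder credentials instead of the real environment.
example_env = {
    "OSS_ACCESS_KEY_ID": "placeholder-id",
    "OSS_ACCESS_KEY_SECRET": "placeholder-secret",
    "OSS_REGION": "ap-southeast-1",
    "OSS_ENDPOINT": "https://oss-ap-southeast-1-internal.aliyuncs.com",
}
config = load_oss_config(example_env)
print(config["OSS_REGION"])
```

Calling `load_oss_config()` with no argument checks the real process environment. Empty strings, as in the export template above, count as missing.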

🛠️ Usage

1. Upload a dataset to OSS

ossdata upload \
  --name "princeton-nlp/SWE-bench" \
  --split "test" \
  [--docker-image-prefix "code-agi-sg-docker-registry-vpc.ap-southeast-1.cr.aliyuncs.com/eflops/swe-rebench:"] \
  [--revision "{revision}"]
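  • If name ends with .jsonl, it is treated as a JSON Lines file; otherwise the dataset is loaded from Hugging Face.
  • The data is uploaded to OSS in shards and indexed as /{name}/{version}/{instance_id}.json. If revision is provided, the version is recorded as {split}@{revision}; otherwise it is just {split}.
  • If docker-image-prefix is provided, each record gets a docker_image field whose value is {docker-image-prefix}{instance_id}.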
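
The three rules above can be sketched in Python as follows. The function names are hypothetical and this is not ossdata's actual implementation; the revision string "abc123" is an invented example value.

```python
from typing import Any, Dict, Optional


def is_jsonl_source(name: str) -> bool:
    """A name ending in .jsonl is a local JSON Lines file, not a Hugging Face dataset."""
    return name.endswith(".jsonl")


def make_version(split: str, revision: Optional[str] = None) -> str:
    """Return {split}@{revision} when a revision is given, else just {split}."""
    return f"{split}@{revision}" if revision else split


def with_docker_image(record: Dict[str, Any], prefix: Optional[str] = None) -> Dict[str, Any]:
    """Return a copy of the record, adding docker_image when a prefix is given."""
    if not prefix:
        return dict(record)
    return {**record, "docker_image": f"{prefix}{record['instance_id']}"}


prefix = "code-agi-sg-docker-registry-vpc.ap-southeast-1.cr.aliyuncs.com/eflops/swe-rebench:"
record = {"instance_id": "pandas__pandas-44271"}

print(is_jsonl_source("princeton-nlp/SWE-bench"))
print(make_version("test"))
print(make_version("test", "abc123"))  # "abc123" is an invented revision
print(with_docker_image(record, prefix)["docker_image"])
```

Because the prefix in the upload example ends with a colon, the instance ID becomes the image tag.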

2. List datasets

ossdata ls

Example output:

princeton-nlp/SWE-bench
swebench/verified

3. List all versions of a dataset

ossdata ls --name "SWE-Env/SWE-Env"

Example output:

v1
test

4. List all instance IDs in a version

ossdata ls --name "princeton-nlp/SWE-bench_Verified" --version "test"

Example output:

pandas__pandas-44271
scipy__scipy-16864
numpy__numpy-12039
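
The three listing levels can also be driven from a script. The sketch below builds the `ossdata ls` command line from the flags shown above and parses the output. It assumes the tool prints one entry per line, as in the example outputs, and exits with a non-zero status on failure; the helper names are hypothetical, and the rule that --version needs --name is inferred from the three documented forms.

```python
import subprocess
from typing import List, Optional


def build_ls_command(name: Optional[str] = None, version: Optional[str] = None) -> List[str]:
    """Build the argv for the dataset, version, or instance listing."""
    if version is not None and name is None:
        raise ValueError("--version requires --name")
    cmd = ["ossdata", "ls"]
    if name is not None:
        cmd += ["--name", name]
    if version is not None:
        cmd += ["--version", version]
    return cmd


def parse_listing(stdout: str) -> List[str]:
    """Turn line-oriented output into a list, skipping blank lines."""
    return [line.strip() for line in stdout.splitlines() if line.strip()]


def run_ls(name: Optional[str] = None, version: Optional[str] = None) -> List[str]:
    """Run ossdata ls and return its entries (requires ossdata on PATH)."""
    result = subprocess.run(
        build_ls_command(name, version),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_listing(result.stdout)


print(build_ls_command("princeton-nlp/SWE-bench_Verified", "test"))
print(parse_listing("v1\ntest\n"))
```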

5. Get an instance

Any field of the dataset can be requested, such as problem_statement, patch, or docker_image.

ossdata get \
  --instance-id "pandas__pandas-44271" \
  --name "princeton-nlp/SWE-bench" \
  --version "test" \
  [--key "problem_statement"]
  • If key is provided, only that field is printed; otherwise the full JSON content is printed.
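
The key rule can be pictured as a small selection step applied to the stored JSON record. This is a sketch under assumptions: `render_instance` is a hypothetical helper, the record contents are placeholders, and the exact output formatting of ossdata is not specified here.

```python
import json
from typing import Any, Dict, Optional


def render_instance(record: Dict[str, Any], key: Optional[str] = None) -> str:
    """Return one field when key is given, otherwise the whole record as JSON."""
    if key is None:
        return json.dumps(record, ensure_ascii=False, indent=2)
    if key not in record:
        raise KeyError(f"field {key!r} not found in instance")
    value = record[key]
    # Strings are returned as-is; other values are serialized as JSON.
    return value if isinstance(value, str) else json.dumps(value, ensure_ascii=False)


# Placeholder record; real instances carry the fields of the source dataset.
record = {
    "instance_id": "pandas__pandas-44271",
    "problem_statement": "placeholder problem text",
    "patch": "placeholder patch text",
}

print(render_instance(record, "problem_statement"))
print(render_instance(record))
```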

Built with ❤️ for AI coding research.
