Scalable SWE datasets based on OSS
ossdata 🚀
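ossdata is a command-line tool that efficiently syncs SWE (software engineering) datasets from Hugging Face, such as SWE-bench, to a private OSS (Object Storage Service) bucket, and provides convenient queries and a unified read interface. Each record can carry a docker_image field so that different agent frameworks can run rollouts in a uniform way.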
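🌟 Features
- One-command upload: load a dataset from Hugging Face or a JSON Lines file and push it to OSS.
- Version management: split@revision forms a unique version, so multiple versions of a dataset can be managed side by side.
- Flexible queries:
  - List all datasets
  - List all versions of a dataset
  - List all instance_id values in a version
  - Fetch a field value (such as problem_statement or patch) by instance_id and key
- Structured storage: data is organized in OSS as /{name}/{version}/{instance_id}.json, which makes it easy to plug into training and evaluation pipelines.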
📦 Installation and configuration
pip install git+http://gitlab.alibaba-inc.com/Qwen-Coder/ossdata.git
export OSS_ACCESS_KEY_ID=""
export OSS_ACCESS_KEY_SECRET=""
export OSS_REGION="ap-southeast-1"
export OSS_ENDPOINT="https://oss-ap-southeast-1-internal.aliyuncs.com"
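Before running any command, it can help to confirm that all four variables are set. The following is a minimal sketch and not part of ossdata: the helper name missing_oss_vars is hypothetical, and it only checks that each variable is present and non-empty, since how ossdata itself reads these settings is not described here.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = (
    "OSS_ACCESS_KEY_ID",
    "OSS_ACCESS_KEY_SECRET",
    "OSS_REGION",
    "OSS_ENDPOINT",
)


def missing_oss_vars(env=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_oss_vars()
    if missing:
        print("Missing OSS settings:", ", ".join(missing))
    else:
        print("All OSS settings are present.")
```

Note that the export lines above leave the two credential values empty, so this check reports them as missing until real values are filled in.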
🛠️ Usage
1. Upload a dataset to OSS
ossdata upload \
--name "princeton-nlp/SWE-bench" \
--split "test" \
[--docker-image-prefix "code-agi-sg-docker-registry-vpc.ap-southeast-1.cr.aliyuncs.com/eflops/swe-rebench:"] \
[--revision "{revision}"]
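- If name ends with .jsonl, it is treated as a JSON Lines file; otherwise the dataset is loaded from Hugging Face.
- The data is uploaded to OSS in shards and indexed as /{name}/{version}/{instance_id}.json. If revision is provided, version is recorded as {split}@{revision}; otherwise it is just {split}.
- If docker-image-prefix is provided, each record gets a docker_image field whose value is {docker-image-prefix}{instance_id}.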
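The naming rules in the notes above can be sketched in a few lines of Python. This is an illustration of the rules as stated, not the actual ossdata implementation; the function names build_version, object_key, and with_docker_image are hypothetical, and the revision abc123 and prefix registry.example/swe: in the usage lines are placeholders.

```python
def build_version(split, revision=None):
    """Return '{split}@{revision}' when a revision is given, else '{split}'."""
    return f"{split}@{revision}" if revision else split


def object_key(name, version, instance_id):
    """Build the documented storage path /{name}/{version}/{instance_id}.json."""
    return f"/{name}/{version}/{instance_id}.json"


def with_docker_image(record, docker_image_prefix=None):
    """Return a copy of the record, adding docker_image when a prefix is given."""
    out = dict(record)
    if docker_image_prefix:
        # docker_image is the prefix followed directly by the instance_id.
        out["docker_image"] = f"{docker_image_prefix}{record['instance_id']}"
    return out


# Usage with placeholder values (assumptions, not real revisions or registries).
version = build_version("test", "abc123")  # "test@abc123"
key = object_key("princeton-nlp/SWE-bench", version, "pandas__pandas-44271")
record = with_docker_image(
    {"instance_id": "pandas__pandas-44271"}, "registry.example/swe:"
)
```

Because the prefix and instance_id are joined with no separator, the prefix should already end with the tag delimiter, as the trailing colon in the upload example suggests.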
2. List datasets
ossdata ls
Example output:
princeton-nlp/SWE-bench
swebench/verified
3. List all versions of a dataset
ossdata ls --name "SWE-Env/SWE-Env"
Example output:
v1
test
4. List all instance IDs in a version
ossdata ls --name "princeton-nlp/SWE-bench_Verified" --version "test"
Example output:
pandas__pandas-44271
scipy__scipy-16864
numpy__numpy-12039
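The three forms of ossdata ls correspond to the three levels of the /{name}/{version}/{instance_id}.json layout. The sketch below shows that mapping over an in-memory list of keys. It is an assumption about how such a listing could work, not ossdata's actual code: the functions parse_key and ls are hypothetical, and the sketch assumes that only name may contain slashes while version and instance_id do not.

```python
def parse_key(key):
    """Split '/{name}/{version}/{instance_id}.json' into its three parts."""
    parts = key.strip("/").split("/")
    if len(parts) < 3 or not parts[-1].endswith(".json"):
        raise ValueError(f"unexpected key layout: {key!r}")
    name = "/".join(parts[:-2])  # name may itself contain '/'
    version = parts[-2]
    instance_id = parts[-1][: -len(".json")]
    return name, version, instance_id


def ls(keys, name=None, version=None):
    """Mimic the three listing levels: datasets, versions, instance IDs."""
    parsed = [parse_key(k) for k in keys]
    if name is None:
        # No name given: list dataset names.
        return sorted({n for n, _, _ in parsed})
    if version is None:
        # Name only: list that dataset's versions.
        return sorted({v for n, v, _ in parsed if n == name})
    # Name and version: list the instance IDs stored under them.
    return sorted(i for n, v, i in parsed if n == name and v == version)
```

Treating everything before the last two path components as the dataset name is what lets names such as princeton-nlp/SWE-bench keep their slash.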
5. Get an instance
Any field of the dataset can be requested, such as problem_statement, patch, or docker_image.
ossdata get \
--instance-id "pandas__pandas-44271" \
--name "princeton-nlp/SWE-bench" \
--version "test" \
[--key "problem_statement"]
- If key is provided, only that field is printed; otherwise the full JSON content is printed.
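The key behavior described above can be illustrated on a single stored record. This is a minimal sketch under stated assumptions, not ossdata's implementation: get_field is a hypothetical name, the record is assumed to be one JSON object, and raising KeyError for an unknown field is a choice made for this example only.

```python
import json


def get_field(record_json, key=None):
    """Return one field when key is given, else the whole record as JSON text."""
    record = json.loads(record_json)
    if key is None:
        # No key: return the full JSON content.
        return json.dumps(record, ensure_ascii=False, indent=2)
    if key not in record:
        raise KeyError(f"field {key!r} not found in record")
    value = record[key]
    # Strings are returned as-is; other values are serialized as JSON.
    return value if isinstance(value, str) else json.dumps(value, ensure_ascii=False)


# Usage with a made-up record (the field values are placeholders).
raw = json.dumps({"instance_id": "pandas__pandas-44271", "problem_statement": "example"})
statement = get_field(raw, "problem_statement")
full_record = get_field(raw)
```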
Built with ❤️ for AI coding research.