DSLighting 2.3.2 - Bug Fix: load_data() now properly supports built-in dataset names
<div align="center">
# DSLighting
**End-to-End Data Science Agent**
[📚 Documentation](https://luckyfan-cs.github.io/dslighting-web/api/getting-started.html) |
[🚀 Quick Start](#-quick-start) |
[💻 GitHub](https://github.com/usail-hkust/dslighting) |
[🐛 Issues](https://github.com/usail-hkust/dslighting/issues)
</div>
---
## ✨ Features
- 🤖 **Agent workflows**: automated execution of data science tasks
- 🔍 **Discovery API**: explore and learn about all available prompts and operators
- 📊 **Data management**: a unified system for data loading and task configuration
- 🔧 **Flexible configuration**: supports multiple LLM models (OpenAI, GLM, DeepSeek, Qwen, and others)
- 📝 **Full tracing**: automatically records task execution and results
- 🧩 **Extensible architecture**: easily add custom tasks and workflows
- 🎯 **Full DSAT inheritance**: inherits all DSAT workflow prompts and operators
---
## 🚀 Quick Start
### 1. Installation
```bash
pip install dslighting python-dotenv
```
### 2. Configure environment variables
Create a `.env` file:
```bash
# .env
# Default model to use (required)
LLM_MODEL=glm-4
# Per-model configuration (JSON)
LLM_MODEL_CONFIGS='{
  "glm-4": {
    "api_key": ["your-key-1", "your-key-2"],
    "api_base": "https://open.bigmodel.cn/api/paas/v4",
    "temperature": 0.7,
    "provider": "openai"
  },
  "openai/deepseek-ai/DeepSeek-V3": {
    "api_key": ["sk-siliconflow-key-1", "sk-siliconflow-key-2"],
    "api_base": "https://api.siliconflow.cn/v1",
    "temperature": 1.0
  },
  "gpt-4o": {
    "api_key": "sk-your-openai-api-key",
    "api_base": "https://api.openai.com/v1",
    "temperature": 0.7
  }
}'
```
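To sanity-check the `.env` values before running a task, the JSON in `LLM_MODEL_CONFIGS` can be parsed by hand. The following is a minimal sketch, not DSLighting's own loader: the helper name `resolve_model_config` is hypothetical, and it assumes only what the example above shows (a JSON object keyed by model name, where `api_key` is either a single string or a list of strings).

```python
import json
import os


def resolve_model_config(environ=None):
    """Return (model_name, config) for the model named by LLM_MODEL.

    Hypothetical helper for checking a .env file by hand; it is not part
    of the dslighting package.
    """
    environ = os.environ if environ is None else environ
    model = environ.get("LLM_MODEL")
    if not model:
        raise ValueError("LLM_MODEL is not set")

    configs = json.loads(environ.get("LLM_MODEL_CONFIGS", "{}"))
    if model not in configs:
        raise KeyError(f"no entry for {model!r} in LLM_MODEL_CONFIGS")

    config = dict(configs[model])
    # api_key may be a single string or a list of keys; normalize to a list.
    keys = config.get("api_key", [])
    config["api_key"] = [keys] if isinstance(keys, str) else list(keys)
    return model, config


if __name__ == "__main__":
    sample_env = {
        "LLM_MODEL": "gpt-4o",
        "LLM_MODEL_CONFIGS": json.dumps(
            {"gpt-4o": {"api_key": "sk-your-openai-api-key", "temperature": 0.7}}
        ),
    }
    name, cfg = resolve_model_config(sample_env)
    print(name, cfg["api_key"], cfg["temperature"])
```

After `load_dotenv()` has run, calling `resolve_model_config()` with no arguments reads the same two variables from the process environment.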
**Supported model providers:**
- OpenAI (GPT-4, GPT-3.5)
- Zhipu AI (GLM-4)
- SiliconFlow (DeepSeek, Qwen, Kimi, and others)
- Any service compatible with the OpenAI API
### 3. Run a task
**Option 1: Global configuration (recommended when running multiple tasks)**
```python
from dotenv import load_dotenv
load_dotenv()
import dslighting
# Configure once; the settings apply globally
dslighting.setup(
    data_parent_dir="/path/to/data/competitions",
    registry_parent_dir="/path/to/registry"
)
# Create the agent
agent = dslighting.Agent()
# Run a task (only task_id is needed)
result = agent.run(task_id="bike-sharing-demand")
print("✅ Task complete!")
print(f"Result: {result}")
```
**Option 2: Explicit paths (clear and unambiguous)**
```python
from dotenv import load_dotenv
load_dotenv()
import dslighting
agent = dslighting.Agent()
result = agent.run(
    task_id="bike-sharing-demand",
    data_dir="/path/to/data/competitions/bike-sharing-demand",
    registry_dir="/path/to/registry/bike-sharing-demand"
)
```
**Option 3: Built-in dataset (simplest)**
```python
from dotenv import load_dotenv
load_dotenv()
import dslighting
# No path configuration needed; run directly
result = dslighting.run_agent(task_id="bike-sharing-demand")
```
**Option 4: Load the data first (allows inspection before running)**
```python
from dotenv import load_dotenv
load_dotenv()
import dslighting
# Load the data first and inspect it
data = dslighting.load_data(
    "/path/to/data/competitions/bike-sharing-demand",
    registry_dir="/path/to/registry/bike-sharing-demand"
)
# Inspect the data
print(data.show())
# Run once everything looks right
agent = dslighting.Agent()
result = agent.run(data)
```
### 4. View results
```python
print(f"Workspace: {result.workspace_path}")
print(f"Score: {result.score}")
```
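Beyond the score, the workspace directory holds the files the agent produced. As a sketch, the helper below lists submission files under a workspace. The `sandbox/submission_*.csv` pattern is taken from the manual-grading snippet in the FAQ of this README; `find_submissions` itself is a hypothetical name, not a dslighting function.

```python
from pathlib import Path


def find_submissions(workspace_path):
    """Return submission CSVs in a workspace, sorted by file name.

    Assumes the layout <workspace>/sandbox/submission_*.csv.
    """
    sandbox = Path(workspace_path) / "sandbox"
    return sorted(sandbox.glob("submission_*.csv"))
```

A typical call would be `find_submissions(result.workspace_path)`; an empty list means no submission file matched the assumed pattern.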
---
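## 🔍 Discovery API - Explore available components
DSLighting 2.0 provides a Discovery API for exploring and understanding all available prompts and operators.
### Quick exploration
```python
import dslighting
# List every available component in one call
dslighting.explore()
```
Example output: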
```
================================================================================
DSLighting 2.0 - Component Explorer
================================================================================
🗣️ Available Prompts
--------------------------------------------------------------------------------
NATIVE (8 items):
- PromptBuilder
- StructuredPromptBuilder
- create_modeling_prompt
- create_eda_prompt
...
AIDE (2 items):
- create_improve_prompt
- create_debug_prompt
AUTOKAGGLE (7 items):
- get_deconstructor_prompt
- get_phase_planner_prompt
...
💪 Available Operators
--------------------------------------------------------------------------------
LLM (4 items):
- GenerateCodeAndPlanOperator
- PlanOperator
- ReviewOperator
- SummarizeOperator
CODE (1 items):
- ExecuteAndTestOperator
```
### List components by category
```python
# List all prompts
all_prompts = dslighting.list_prompts()
for category, functions in all_prompts.items():
    print(f"{category}: {len(functions)} prompts")
# List the prompts in one category
aide_prompts = dslighting.list_prompts(category="aide")
print(f"AIDE prompts: {aide_prompts['aide']}")
# List all operators
all_ops = dslighting.list_operators()
for category, names in all_ops.items():
    print(f"{category}: {len(names)} operators")
# List the operators in one category
llm_ops = dslighting.list_operators(category="llm")
print(f"LLM operators: {llm_ops['llm']}")
```
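Because the listing calls above are iterated as category-to-names mappings, a keyword search over them is straightforward. The sketch below assumes that shape (a `dict` mapping a category string to a list of component name strings); `search_components` is a hypothetical helper, and the sample catalog only reuses names from the explorer output shown earlier.

```python
def search_components(catalog, keyword):
    """Return sorted (category, name) pairs whose name contains keyword.

    `catalog` is assumed to map category -> list of name strings.
    """
    needle = keyword.lower()
    return sorted(
        (category, name)
        for category, names in catalog.items()
        for name in names
        if needle in name.lower()
    )


# Sample data for illustration only; not the full component list.
sample_catalog = {
    "aide": ["create_improve_prompt", "create_debug_prompt"],
    "autokaggle": ["get_deconstructor_prompt", "get_phase_planner_prompt"],
}
print(search_components(sample_catalog, "debug"))
```

If the values returned by `dslighting.list_prompts()` are plain name strings, the same helper should work on that result directly; that is an assumption worth checking against the installed version.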
### Get detailed information
```python
# Get detailed information about a prompt
from dslighting.prompts import get_prompt_info
info = get_prompt_info("create_improve_prompt")
print(f"Name: {info['name']}")
print(f"Category: {info['category']}")
print(f"Description: {info['description']}")
print("Inputs:")
for input_param in info['inputs']:
    print(f"  - {input_param['name']} ({input_param['type']})")
    print(f"    {input_param['description']}")
    print(f"    Required: {input_param['required']}")
print(f"\nExample:\n{info['example']}")
```
Example output (the structure of `info`):
```python
{
    "name": "create_improve_prompt",
    "category": "aide",
    "description": "Create improvement prompt for AIDE workflow iteration",
    "workflow": "AIDE - Iterative code generation with review",
    "inputs": [
        {
            "name": "task_context",
            "type": "Dict[str, Any]",
            "description": "Task context containing goal and I/O requirements",
            "required": True,
            "fields": {
                "goal_and_data": "str - Task goal and data overview",
                "io_instructions": "str - Critical I/O requirements"
            }
        },
        {
            "name": "memory_summary",
            "type": "str",
            "description": "Summary of past attempts from memory",
            "required": True
        }
        # ... more input parameters
    ],
    "outputs": "A formatted prompt string",
    "output_format": "str - Structured prompt with role, context, and instructions",
    "example": """
from dslighting.prompts.aide_prompt import create_improve_prompt
# Input
task_context = {
    "goal_and_data": "Predict bike rental demand using historical data",
    "io_instructions": "Output must be saved to 'predictions.csv' with columns: datetime, count"
}
memory_summary = "Attempt 1 used linear regression with RMSE 0.65"
previous_code = "import pandas as pd\\nmodel = LinearRegression()..."
previous_analysis = "The model achieved RMSE 0.65 but underpredicts peak hours"
# Call
prompt = create_improve_prompt(
    task_context=task_context,
    memory_summary=memory_summary,
    previous_code=previous_code,
    previous_analysis=previous_analysis
)
# Returns formatted prompt string with all context
"""
}
```
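The `required` flags in this structure make it possible to check a set of arguments before calling a prompt function. A minimal sketch, assuming `info["inputs"]` is a list of dictionaries with `name` and `required` keys as in the example above; `missing_required_inputs` is a hypothetical helper, not part of dslighting.

```python
def missing_required_inputs(info, kwargs):
    """Return names of required inputs in `info` that `kwargs` lacks."""
    return [
        spec["name"]
        for spec in info.get("inputs", [])
        if spec.get("required", False) and spec["name"] not in kwargs
    ]


# Trimmed copy of the structure shown above, for illustration only.
sample_info = {
    "name": "create_improve_prompt",
    "inputs": [
        {"name": "task_context", "type": "Dict[str, Any]", "required": True},
        {"name": "memory_summary", "type": "str", "required": True},
    ],
}
print(missing_required_inputs(sample_info, {"task_context": {}}))
```

An empty list means every input marked as required has been supplied.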
```python
# Get detailed information about an operator
from dslighting.operators import get_operator_info
info = get_operator_info("PlanOperator")
print(f"Name: {info['name']}")
print(f"Category: {info['category']}")
print(f"Description: {info['description']}")
print(f"Async: {info.get('async', False)}")
print(f"Required Services: {info.get('requires_services', [])}")
print(f"\nExample:\n{info['example']}")
```
### Use cases
**Scenario 1: Explore the available workflow prompts**
```python
# Inspect the prompts of the AIDE workflow
from dslighting.prompts import get_prompt_info
aide_prompts = [
    "create_improve_prompt",
    "create_debug_prompt"
]
for prompt_name in aide_prompts:
    info = get_prompt_info(prompt_name)
    print(f"\n{prompt_name}:")
    print(f"  Description: {info['description']}")
    print(f"  Inputs: {[inp['name'] for inp in info['inputs']]}")
```
**Scenario 2: Choose a suitable operator**
```python
# Compare the LLM operators
from dslighting.operators import get_operator_info
llm_ops = ["PlanOperator", "GenerateCodeAndPlanOperator", "ReviewOperator"]
for op_name in llm_ops:
    info = get_operator_info(op_name)
    print(f"\n{op_name}:")
    print(f"  Description: {info['description']}")
    print(f"  Input: {info['inputs']}")
    print(f"  Output: {info['outputs']}")
```
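**Scenario 3: Learn how to use a component**
```python
from dslighting.prompts import get_prompt_info
from dslighting.operators import get_operator_info
# Get a complete usage example
info = get_prompt_info("create_improve_prompt")
print(info['example'])  # ready to copy and run
info = get_operator_info("ReviewOperator")
print(info['example'])  # includes the full initialization and call code
```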
---
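## 📖 Core concepts
### Data system
DSLighting uses a unified data management system:
- **LoadedData**: the core data container, wrapping the dataset and the task configuration
- **TaskDetection**: automatically identifies the task type (kaggle, open_ended, datasci)
- **Registry**: manages task configuration and scoring rules
**Inspect the data structure:**
```python
data = dslighting.load_data(...)
print(data.show())
```
The output includes:
- Task ID and type
- Data directory structure
- CSV file information
- Task description and evaluation metric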
### Agent configuration
```python
# Use the default configuration
agent = dslighting.Agent()
# Equivalent to:
agent = dslighting.Agent(
    workflow="aide",        # workflow type
    model="gpt-4o-mini",    # LLM model (read from .env)
    temperature=0.7,        # sampling temperature
    max_iterations=5        # maximum number of iterations
)
```
---
## 🔧 Advanced configuration
### Custom tasks
Create your own data science task:
**Directory layout:**
```
your-project/
├── data/competitions/
│   └── your-task-name/
│       └── prepared/
│           ├── public/      # train.csv, test.csv, sampleSubmission.csv
│           └── private/     # test_answer.csv
│
└── registry/
    └── your-task-name/
        ├── config.yaml      # task configuration
        ├── description.md   # task description
        └── grade.py         # grading script (optional)
```
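The layout above can be created with a few lines of `pathlib`. This is only a convenience sketch: `scaffold_task` is a hypothetical helper, it creates empty directories and placeholder files, and the data files and the real `config.yaml` contents still have to be supplied by hand.

```python
from pathlib import Path


def scaffold_task(project_root, task_name):
    """Create the data and registry skeleton for a custom task."""
    root = Path(project_root)
    prepared = root / "data" / "competitions" / task_name / "prepared"
    registry = root / "registry" / task_name

    for directory in (prepared / "public", prepared / "private", registry):
        directory.mkdir(parents=True, exist_ok=True)

    # Placeholder files; files that already exist are left untouched.
    for file_name in ("config.yaml", "description.md"):
        (registry / file_name).touch(exist_ok=True)

    return prepared, registry
```

For example, `scaffold_task("your-project", "your-task-name")` produces the tree shown above without the CSV files and without the optional `grade.py`.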
**Example config.yaml:**
```yaml
id: your-task-name
name: Your Task Display Name
competition_type: simple
awards_medals: false
description: your-task-name/description.md
dataset:
  answers: your-task-name/prepared/private/test_answer.csv
  sample_submission: your-task-name/prepared/public/sampleSubmission.csv
grader:
  name: rmsle  # or accuracy, f1, mae, etc.
```
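For reference, `rmsle` conventionally denotes root mean squared logarithmic error. The sketch below implements that standard definition, the square root of the mean of `(log(1 + predicted) - log(1 + actual))^2`; it is not taken from DSLighting's grader, whose exact implementation may differ.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error for non-negative values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is computed on logarithms, the metric penalizes relative rather than absolute differences, which suits count-like targets such as rental demand.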
**Run the custom task:**
```python
# Assumes `agent` was created as in the Quick Start
result = agent.run(
    task_id="your-task-name",
    data_dir="/path/to/data/competitions",
    registry_dir="/path/to/registry"
)
```
### FAQ
**Q: Why does the output show "Score: N/A"?**
A: This is a known issue in DSLighting. Automatic grading is currently not enabled, so the submission has to be graded manually:
```python
from pathlib import Path
import dslighting
from mlebench.grade import grade_csv
from dsat.benchmark.mle import MLEBenchmarkRegistry

# `result` is the object returned by agent.run(...)
registry_dir = Path(dslighting.__file__).parent / "registry"
registry = MLEBenchmarkRegistry(registry_dir=str(registry_dir))
competition = registry.get_competition("bike-sharing-demand")
# Path(...) keeps this working whether workspace_path is a str or a Path
submission_files = list(Path(result.workspace_path).glob("sandbox/submission_*.csv"))
if submission_files:
    report = grade_csv(submission_files[0], competition)
    print(f"✅ Actual score: {report.score}")
```
**Q: Is `load_dotenv()` required?**
A: Yes. Call `load_dotenv()` before importing `dslighting` so that the `.env` configuration is loaded.
---
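## 📚 Documentation
For detailed documentation, see:
- **[Quick Start guide](https://luckyfan-cs.github.io/dslighting-web/api/getting-started.html)** - complete installation, configuration, and usage tutorial
- **[Discovery API guide](DISCOVERY_API_GUIDE.md)** - explore and learn about all available prompts and operators
- **[Data system documentation](https://luckyfan-cs.github.io/dslighting-web/api/data-system.html)** - a closer look at data management and the core components
- **[GitHub project](https://github.com/usail-hkust/dslighting)** - source code and issue tracker
- **[Release notes](RELEASE_NOTES_2.1.0.md)** - what's new in DSLighting 2.1.0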
---
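## 🤝 Contributing
Code contributions, bug reports, and suggestions are all welcome.
1. Fork the project
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request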
---
## 📄 License
This project is released under the [AGPL-3.0 license](LICENSE).
---
## 📞 Contact
- **Issues**: [GitHub Issues](https://github.com/usail-hkust/dslighting/issues)
- **Documentation**: [https://luckyfan-cs.github.io/dslighting-web/](https://luckyfan-cs.github.io/dslighting-web/)
- **PyPI**: [https://pypi.org/project/dslighting/](https://pypi.org/project/dslighting/)
---
<div align="center">
**If this project helps you, please give it a ⭐️**
Made with ❤️ by [USAIL Lab](https://github.com/usail-hkust)
</div>
**Source Distribution**
`dslighting-2.3.2.tar.gz` (220.9 kB)
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.1

| Algorithm | Hash digest |
|---|---|
| SHA256 | `3eef01cd794a2aad6c915a69f0e406e4a942817e33fd093bc31a6cc2e7f6de6f` |
| MD5 | `88126500f63535e681fe93d6e24ae35d` |
| BLAKE2b-256 | `7ab9a72957d8a54a439bd7eb9c264ce07a76315b499439961118136cc1a066e6` |