
Python SDK for Dify Dataset API


Dify Knowledge Base SDK

A comprehensive Python SDK for interacting with Dify's Knowledge Base API. This SDK provides easy-to-use methods for managing datasets (knowledge bases), documents, segments, and metadata through Dify's REST API.

Features

  • 📚 Complete API Coverage: Support for all Dify Knowledge Base API endpoints
  • 🔐 Authentication: Secure API key-based authentication
  • 📄 Document Management: Create, update, and delete documents from text or files
  • 🗂️ Dataset Operations: Full CRUD operations for knowledge bases
  • ✂️ Segment Control: Manage document segments (chunks) with fine-grained control
  • 🏷️ Knowledge Tags: Create and manage knowledge tags for dataset organization
  • 📊 Metadata Support: Create and manage custom metadata fields
  • 🔍 Advanced Retrieval: Multiple search methods (semantic, full-text, hybrid)
  • 🔗 Batch Operations: Efficient batch processing for documents and metadata
  • 🌐 HTTP Client: Built on httpx for fast, reliable HTTP communication
  • ⚠️ Error Handling: Comprehensive error handling with custom exceptions
  • 📈 Progress Monitoring: Track document indexing progress with detailed status
  • 🛡️ Retry Mechanisms: Built-in retry logic for network resilience
  • 🔒 Type Safety: Full type hints with Pydantic models
  • 📱 Rich Examples: Comprehensive examples covering all use cases

Installation

pip install dify-dataset-sdk

Quick Start

from dify_dataset_sdk import DifyDatasetClient

# Initialize the client
client = DifyDatasetClient(api_key="your-api-key-here")

# Create a new dataset (knowledge base)
dataset = client.create_dataset(
    name="My Knowledge Base",
    permission="only_me"
)

# Create a document from text
doc_response = client.create_document_by_text(
    dataset_id=dataset.id,
    name="Sample Document",
    text="This is a sample document for the knowledge base.",
    indexing_technique="high_quality"
)

# List all documents
documents = client.list_documents(dataset.id)
print(f"Total documents: {documents.total}")

# Close the client
client.close()

Configuration

API Key

Get your API key from the Dify knowledge base API page:

  1. Go to your Dify knowledge base
  2. Navigate to the API section in the left sidebar
  3. Generate or copy your API key from the API Keys section
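Hard-coding keys in source files makes them easy to leak. A common alternative is to read the key from an environment variable; a minimal helper sketch (the DIFY_API_KEY variable name is just a convention, not something the SDK requires):

```python
import os

def load_api_key(env_var: str = "DIFY_API_KEY") -> str:
    """Read the Dify API key from the environment; fail fast if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} before creating the client")
    return key

# client = DifyDatasetClient(api_key=load_api_key())
```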

Base URL

By default, the SDK uses https://api.dify.ai as the base URL. You can customize this:

client = DifyDatasetClient(
    api_key="your-api-key",
    base_url="https://your-custom-dify-instance.com",
    timeout=60.0  # Custom timeout in seconds
)

Core Features

Dataset Management

# Create a dataset
dataset = client.create_dataset(
    name="Technical Documentation",
    permission="only_me",
    description="Internal technical docs"
)

# List datasets with pagination
datasets = client.list_datasets(page=1, limit=20)

# Delete a dataset
client.delete_dataset(dataset_id)

Document Operations

From Text

# Create document from text
doc_response = client.create_document_by_text(
    dataset_id=dataset_id,
    name="API Documentation",
    text="Complete API documentation content...",
    indexing_technique="high_quality",
    process_rule_mode="automatic"
)

From File

# Create document from file
doc_response = client.create_document_by_file(
    dataset_id=dataset_id,
    file_path="./documentation.pdf",
    indexing_technique="high_quality"
)

Custom Processing Rules

# Custom processing configuration
process_rule_config = {
    "rules": {
        "pre_processing_rules": [
            {"id": "remove_extra_spaces", "enabled": True},
            {"id": "remove_urls_emails", "enabled": True}
        ],
        "segmentation": {
            "separator": "###",
            "max_tokens": 500
        }
    }
}

doc_response = client.create_document_by_file(
    dataset_id=dataset_id,
    file_path="document.txt",
    process_rule_mode="custom",
    process_rule_config=process_rule_config
)

Segment Management

# Create segments
segments_data = [
    {
        "content": "First segment content",
        "answer": "Answer for first segment",
        "keywords": ["keyword1", "keyword2"]
    },
    {
        "content": "Second segment content",
        "answer": "Answer for second segment",
        "keywords": ["keyword3", "keyword4"]
    }
]

segments = client.create_segments(dataset_id, document_id, segments_data)

# List segments
segments = client.list_segments(dataset_id, document_id)

# Update a segment
client.update_segment(
    dataset_id=dataset_id,
    document_id=document_id,
    segment_id=segment_id,
    segment_data={
        "content": "Updated content",
        "keywords": ["updated", "keywords"],
        "enabled": True
    }
)

# Delete a segment
client.delete_segment(dataset_id, document_id, segment_id)

Knowledge Tags Management

# Create knowledge tags
tag = client.create_knowledge_tag(name="Technical Documentation")
dept_tag = client.create_knowledge_tag(name="Engineering Department")

# Bind datasets to tags
client.bind_dataset_to_tag(dataset_id, [tag.id, dept_tag.id])

# List all knowledge tags
tags = client.list_knowledge_tags()

# Get tags for a specific dataset
dataset_tags = client.get_dataset_tags(dataset_id)

# Filter datasets by tags
filtered_datasets = client.list_datasets(tag_ids=[tag.id])

Metadata Management

# Create metadata fields
category_field = client.create_metadata_field(
    dataset_id=dataset_id,
    field_type="string",
    name="category"
)

priority_field = client.create_metadata_field(
    dataset_id=dataset_id,
    field_type="number",
    name="priority"
)

# Update document metadata
metadata_operations = [
    {
        "document_id": document_id,
        "metadata_list": [
            {
                "id": category_field.id,
                "value": "technical",
                "name": "category"
            },
            {
                "id": priority_field.id,
                "value": "5",
                "name": "priority"
            }
        ]
    }
]

client.update_document_metadata(dataset_id, metadata_operations)

Advanced Retrieval

# Semantic search
results = client.retrieve(
    dataset_id=dataset_id,
    query="How to implement authentication?",
    retrieval_config={
        "search_method": "semantic_search",
        "top_k": 5,
        "score_threshold": 0.7
    }
)

# Hybrid search (combining semantic and full-text)
results = client.retrieve(
    dataset_id=dataset_id,
    query="API documentation",
    retrieval_config={
        "search_method": "hybrid_search",
        "top_k": 10,
        "rerank_model": {
            "model": "rerank-multilingual-v2.0",
            "mode": "reranking_model"
        }
    }
)

# Full-text search
results = client.retrieve(
    dataset_id=dataset_id,
    query="database configuration",
    retrieval_config={"search_method": "full_text_search", "top_k": 5}
)

Progress Monitoring

# Monitor document indexing progress
status = client.get_document_indexing_status(dataset_id, batch_id)

if status.data:
    indexing_info = status.data[0]
    print(f"Status: {indexing_info.indexing_status}")
    print(f"Progress: {indexing_info.completed_segments}/{indexing_info.total_segments}")
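The snippet above checks the status once; in practice you usually poll until indexing finishes. A sketch of such a loop, assuming the indexing_status field reaches "completed" when done (real code should also handle terminal error states):

```python
import time

def wait_for_indexing(client, dataset_id, batch_id, poll_interval=2.0, timeout=300.0):
    """Poll indexing status until the batch completes or the deadline passes."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = client.get_document_indexing_status(dataset_id, batch_id)
        info = status.data[0] if status.data else None
        if info and info.indexing_status == "completed":
            return info
        time.sleep(poll_interval)
    raise TimeoutError(f"Indexing batch {batch_id} did not complete in {timeout}s")
```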

Error Handling

The SDK provides comprehensive error handling with specific exception types:

from dify_dataset_sdk.exceptions import (
    DifyAPIError,
    DifyAuthenticationError,
    DifyValidationError,
    DifyNotFoundError,
    DifyConflictError,
    DifyServerError,
    DifyConnectionError,
    DifyTimeoutError
)

try:
    dataset = client.create_dataset(name="Test Dataset")
except DifyAuthenticationError:
    print("Invalid API key")
except DifyValidationError as e:
    print(f"Validation error: {e}")
except DifyConflictError as e:
    print(f"Conflict: {e}")  # e.g., duplicate dataset name
except DifyAPIError as e:
    print(f"API error: {e}")
    print(f"Status code: {e.status_code}")
    print(f"Error code: {e.error_code}")

Advanced Usage

For more advanced scenarios, see the examples directory in the repository.

Key Advanced Features

Batch Processing

Process multiple documents efficiently with parallel operations:

from concurrent.futures import ThreadPoolExecutor

def upload_document(file_path):
    return client.create_document_by_file(
        dataset_id=dataset_id,
        file_path=file_path,
        indexing_technique="high_quality"
    )

# Parallel document upload
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(upload_document, file) for file in file_list]
    results = [future.result() for future in futures]
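In the snippet above, one failed upload raises from future.result() and aborts the whole batch. A variant that records per-file failures instead (an illustrative helper, not part of the SDK):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def upload_all(upload_fn, file_list, max_workers=3):
    """Run upload_fn over file_list in parallel; return (successes, failures)."""
    successes, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(upload_fn, f): f for f in file_list}
        for future in as_completed(futures):
            path = futures[future]
            try:
                successes.append((path, future.result()))
            except Exception as exc:
                failures.append((path, exc))
    return successes, failures
```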

Error Handling with Retry

Implement robust error handling with automatic retry:

from dify_dataset_sdk.exceptions import DifyTimeoutError, DifyConnectionError
import time

def safe_operation_with_retry(operation, max_retries=3):
    for attempt in range(max_retries):
        try:
            return operation()
        except (DifyTimeoutError, DifyConnectionError) as e:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                time.sleep(wait_time)
                continue
            raise e

Health Monitoring

Monitor SDK performance and API health:

import time

class SDKMonitor:
    def __init__(self, client):
        self.client = client
        self.metrics = {"requests": 0, "errors": 0, "avg_response_time": 0.0}

    def health_check(self):
        self.metrics["requests"] += 1
        try:
            start_time = time.time()
            self.client.list_datasets(limit=1)  # lightweight call to verify connectivity
            response_time = time.time() - start_time
            return {"status": "healthy", "response_time": response_time}
        except Exception as e:
            self.metrics["errors"] += 1
            return {"status": "unhealthy", "error": str(e)}

API Reference

Client Configuration

DifyDatasetClient(
    api_key: str,           # Required: Your Dify API key
    base_url: str,          # Optional: API base URL (default: "https://api.dify.ai")
    timeout: float          # Optional: Request timeout in seconds (default: 30.0)
)

Supported File Types

The SDK supports uploading the following file types:

  • txt - Plain text files
  • md, markdown - Markdown files
  • pdf - PDF documents
  • html - HTML files
  • xlsx - Excel spreadsheets
  • docx - Word documents
  • csv - CSV files
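Before a bulk upload it can be cheaper to filter out unsupported files up front rather than letting the API reject them. A small sketch based on the extension list above:

```python
from pathlib import Path

# Extensions accepted by the SDK for document upload (per the list above)
SUPPORTED_EXTENSIONS = {".txt", ".md", ".markdown", ".pdf", ".html", ".xlsx", ".docx", ".csv"}

def filter_supported(paths):
    """Keep only files whose extension the SDK can upload."""
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS]
```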

Rate Limits

Please respect Dify's API rate limits. The SDK includes automatic error handling for rate limit responses.
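If you do hit a limit, the server typically answers HTTP 429, and DifyAPIError exposes a status_code attribute (see Error Handling above). An application-level backoff can then look like this (illustrative; the generic except clause stands in for DifyAPIError so the sketch is self-contained):

```python
import time

def call_with_rate_limit_backoff(operation, max_retries=5, base_delay=1.0):
    """Retry `operation` on HTTP 429 responses with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as e:  # in real code, catch DifyAPIError
            if getattr(e, "status_code", None) == 429 and attempt < max_retries - 1:
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
                continue
            raise
```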

Development

Setup

# Clone the repository
git clone https://github.com/LeekJay/dify-dataset-sdk.git
cd dify-dataset-sdk

# Install dependencies
pip install -e ".[dev]"

Running Tests

# Run all tests
pytest

# Run specific test file
python tests/test_all_39_apis.py

# Run with verbose output
pytest -v

Code Formatting

# Format code
ruff format dify_dataset_sdk/

# Check and fix issues
ruff check --fix dify_dataset_sdk/

# Type checking
mypy dify_dataset_sdk/

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Changelog

v0.3.0

  • Initial Release Features:
    • Full Dify Knowledge Base API support (39 endpoints)
    • Complete CRUD operations for datasets, documents, segments, and metadata
    • Knowledge tags management for dataset organization
    • Advanced retrieval methods (semantic, full-text, hybrid)
    • Comprehensive error handling with custom exceptions
    • Type-safe models with Pydantic
    • File upload support for multiple formats
    • Progress monitoring and indexing status tracking
    • Batch processing capabilities
    • Retry mechanisms and connection resilience
    • Rich example collection covering all use cases
    • Production-ready monitoring and health checks
    • Multi-language documentation (English and Chinese)

