
Khora

Ad-hoc Dagster pipelines for data fetching using AI/LLM prompts and agentic AI.

Overview

Khora is a Python package that enables the creation of dynamic data pipelines using Dagster, powered by AI agents built with LangGraph and LangChain. It can fetch data from various sources, including:

  • APIs (REST endpoints with full HTTP method support)
  • Websites (advanced web scraping with Playwright: handles JavaScript, captures screenshots, runs custom scripts)
  • Google Docs/Sheets (with service account authentication)

Features

  • 🤖 AI-powered data fetching using natural language prompts
  • 🔄 Dynamic pipeline generation based on descriptions
  • 🛠️ Support for multiple data sources:
    • APIs (REST endpoints)
    • Web scraping with Playwright (handles JavaScript-rendered content)
    • Google Docs and Sheets
  • 🎭 Advanced web scraping capabilities:
    • JavaScript execution
    • Screenshot capture
    • Custom selectors
    • Wait conditions
  • 📊 Integration with Dagster for orchestration
  • 🐳 Docker support for easy deployment
  • ✅ Comprehensive test coverage

Installation

Using uv (recommended)

uv pip install khora

Using pip

pip install khora

Development Installation

git clone https://github.com/yourusername/khora.git
cd khora
uv pip install -e ".[dev]"

Configuration

  1. Copy the environment template:
cp .env.example .env
  2. Edit .env and add your credentials:
  • OPENAI_API_KEY: Your OpenAI API key
  • GOOGLE_CREDENTIALS_PATH: Path to Google service account credentials (for Google Docs/Sheets)
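
These values are read from the environment at runtime. Projects like this typically load the .env file with python-dotenv; as a purely illustrative sketch (not khora's actual loading code), parsing simple KEY=VALUE lines by hand looks like this:

```python
import os


def load_dotenv_minimal(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file and export them."""
    loaded: dict[str, str] = {}
    try:
        with open(path) as fh:
            for raw in fh:
                line = raw.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue  # skip blanks, comments, and malformed lines
                key, _, value = line.partition("=")
                loaded[key.strip()] = value.strip().strip('"').strip("'")
    except FileNotFoundError:
        pass  # no .env file is fine; fall back to the real environment
    os.environ.update(loaded)
    return loaded
```

In practice, prefer a maintained library such as python-dotenv, which also handles quoting, multiline values, and variable expansion.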

Usage

Basic Example

import asyncio

from khora.agents import DataFetcherAgent, PipelineBuilderAgent
from khora.utils.data_models import DataRequest, DataSourceType

# Initialize agents
fetcher = DataFetcherAgent(openai_api_key="your-key")
builder = PipelineBuilderAgent(openai_api_key="your-key")

# Create a data request
request = DataRequest(
    source_type=DataSourceType.API,
    prompt="Fetch current weather data for San Francisco",
    source_config={
        "url": "https://api.weather.com/v1/current"
    }
)

# Fetch data (fetch_data is a coroutine, so run it in an event loop)
response = asyncio.run(fetcher.fetch_data(request))
print(response.data)

Creating Dynamic Pipelines

# Describe your pipeline in natural language
description = """
Create a pipeline that:
1. Fetches cryptocurrency prices from CoinGecko API
2. Scrapes latest crypto news from CoinDesk
3. Reads analysis from a Google Sheet
"""

# Generate pipeline configuration
config = builder.analyze_pipeline_request(description)

# Build the Dagster pipeline from the configuration
pipeline = builder.build_pipeline(config)
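
Khora uses an LLM to turn a free-form description like the one above into discrete pipeline steps. The core idea can be illustrated with a plain, hypothetical parser for numbered lists (not khora's actual implementation, which relies on the model rather than a regex):

```python
import re


def extract_steps(description: str) -> list[str]:
    """Pull numbered steps ("1. ...", "2. ...") out of a pipeline description."""
    steps = []
    for line in description.splitlines():
        match = re.match(r"^\s*\d+[.)]\s+(.*\S)\s*$", line)
        if match:
            steps.append(match.group(1))
    return steps
```

Applied to the description above, this yields the three step strings, each of which would then be mapped to a data source (API, scraper, or Google Sheet).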

Running Dagster UI

dagster dev -f src/khora/pipelines/definitions.py

Then navigate to http://localhost:3000 to see the Dagster UI.

Docker Usage

Build the image

docker build -t khora:latest .

Run the container

docker run -p 3000:3000 \
  -e OPENAI_API_KEY=your-key \
  -v $(pwd)/.env:/app/.env \
  khora:latest

Testing

Run the test suite:

pytest tests/

With coverage:

pytest tests/ --cov=khora --cov-report=html

Requirements

  • Python 3.12 (required)
  • Playwright browsers (automatically installed during setup)

CI/CD

The project uses GitHub Actions for CI/CD with two main workflows:

Main CI Workflow (ci.yml)

  1. Runs tests on Python 3.12
  2. Checks code formatting with Black and Ruff
  3. Performs type checking with mypy
  4. Builds and tests the Docker image
  5. Uploads coverage reports to Codecov

Publish Workflow (publish.yml)

Automatically publishes to PyPI when version tags are pushed:

  • Triggered by pushing tags matching v* pattern (e.g., v0.0.2)
  • Runs full test suite and quality checks
  • Builds and publishes package to PyPI
  • Uses PYPI_API_TOKEN secret for authentication
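
The v* trigger is a glob, so any tag starting with "v" matches; the tags the release script actually produces follow the stricter vMAJOR.MINOR.PATCH shape. A small illustrative check (not part of the workflow itself) for validating a tag before pushing it:

```python
import re

# Tags like v0.0.2 that should trigger the publish workflow
SEMVER_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def is_release_tag(tag: str) -> bool:
    """Return True for strict vMAJOR.MINOR.PATCH release tags."""
    return SEMVER_TAG.match(tag) is not None
```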

Project Structure

khora/
├── src/khora/
│   ├── agents/         # AI agents for data fetching and pipeline building
│   ├── pipelines/      # Dagster pipeline definitions
│   ├── tools/          # Tools for different data sources
│   └── utils/          # Utilities and data models
├── tests/              # Test suite
├── .github/workflows/  # CI/CD configuration
├── Dockerfile          # Container definition
└── pyproject.toml      # Project configuration

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Run tests and linting: pytest && black . && ruff check .
  5. Commit your changes: git commit -m "Add feature"
  6. Push to your fork: git push origin feature-name
  7. Create a pull request

License

MIT License - see LICENSE file for details.

Support

For issues and questions:

  • Open an issue on GitHub
  • Check the documentation
  • Review existing discussions

Roadmap

  • Add support for more data sources (databases, S3, etc.)
  • Implement data transformation capabilities
  • Add scheduling and monitoring features
  • Create a web UI for pipeline management
  • Support for more LLM providers

Releasing

Quick Release (Recommended)

Use the automated release script:

# Create and push a patch release (0.0.1 -> 0.0.2)
python scripts/create_release.py patch --push

# Create a minor release (0.0.1 -> 0.1.0)
python scripts/create_release.py minor

# Create a major release (0.0.1 -> 1.0.0)
python scripts/create_release.py major

# Preview what would happen
python scripts/create_release.py patch --dry-run
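
The patch/minor/major levels map onto semantic versioning in the usual way. A minimal sketch of that arithmetic (the real scripts/bump_version.py may differ in details such as pre-release handling):

```python
def bump(version: str, level: str) -> str:
    """Bump a MAJOR.MINOR.PATCH version string at the given level."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump level: {level}")
```

For example, bump("0.0.1", "patch") gives "0.0.2", while bump("0.0.1", "minor") resets the patch component and gives "0.1.0".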

Step-by-Step Release

  1. Bump version:

    python scripts/bump_version.py patch
    
  2. Create git tag and push:

    git add .
    git commit -m "Bump version to 0.0.2"
    git tag v0.0.2
    git push origin main --tags
    
  3. Automatic publishing: The publish workflow will automatically:

    • Run all tests and quality checks
    • Build the package
    • Publish to PyPI

Setup PyPI Token

To enable publishing, add your PyPI API token as a GitHub secret:

  1. Create an API token on PyPI
  2. Add it as PYPI_API_TOKEN in your repository secrets

Version

Current version: 0.0.1
