Ad-hoc Dagster pipelines for data fetching using AI/LLM prompts and agentic AI
Khora
Overview
Khora is a Python package that enables the creation of dynamic data pipelines using Dagster, powered by AI agents built with LangGraph and LangChain. It can fetch data from various sources including:
- APIs (REST endpoints with full HTTP method support)
- Websites (advanced web scraping using Playwright - handles JavaScript, takes screenshots, executes custom scripts)
- Google Docs/Sheets (with service account authentication)
Features
- 🤖 AI-powered data fetching using natural language prompts
- 🔄 Dynamic pipeline generation based on descriptions
- 🛠️ Support for multiple data sources:
  - APIs (REST endpoints)
  - Web scraping with Playwright (handles JavaScript-rendered content)
  - Google Docs and Sheets
- 🎭 Advanced web scraping capabilities:
  - JavaScript execution
  - Screenshot capture
  - Custom selectors
  - Wait conditions
- 📊 Integration with Dagster for orchestration
- 🐳 Docker support for easy deployment
- ✅ Comprehensive test coverage
Installation
Using uv (recommended)
uv pip install khora
Using pip
pip install khora
Development Installation
git clone https://github.com/yourusername/khora.git
cd khora
uv pip install -e ".[dev]"
Configuration
- Copy the environment template:
cp .env.example .env
- Edit .env and add your credentials:
  - OPENAI_API_KEY: Your OpenAI API key
  - GOOGLE_CREDENTIALS_PATH: Path to Google service account credentials (for Google Docs/Sheets)
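The check below is a minimal sketch of validating these two settings before starting a pipeline. It assumes the variables are already exported in the process environment (loading the .env file itself is left to whatever loader you use). The `read_settings` helper is hypothetical and not part of Khora, and treating GOOGLE_CREDENTIALS_PATH as optional is an assumption based on it only being needed for Google Docs/Sheets.

```python
import os

# Assumption: only the OpenAI key is always required; the Google path is
# needed only when Google Docs/Sheets sources are used.
REQUIRED = ("OPENAI_API_KEY",)
OPTIONAL = ("GOOGLE_CREDENTIALS_PATH",)


def read_settings(env=None):
    """Return the configured settings, raising if a required one is missing.

    Hypothetical helper for illustration; it is not part of the khora package.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    for name in OPTIONAL:
        if env.get(name):
            settings[name] = env[name]
    return settings


if __name__ == "__main__":
    try:
        print(sorted(read_settings()))
    except RuntimeError as exc:
        print(exc)
```

Passing a plain dictionary instead of relying on `os.environ` keeps the check easy to exercise in tests.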
Usage
Basic Example
import asyncio
from khora.agents import DataFetcherAgent, PipelineBuilderAgent
from khora.utils.data_models import DataRequest, DataSourceType
# Initialize agents
fetcher = DataFetcherAgent(openai_api_key="your-key")
builder = PipelineBuilderAgent(openai_api_key="your-key")
# Create a data request
request = DataRequest(
    source_type=DataSourceType.API,
    prompt="Fetch current weather data for San Francisco",
    source_config={
        "url": "https://api.weather.com/v1/current"
    },
)
# fetch_data is awaited, so it has to run inside an async function
async def main():
    # Fetch data
    response = await fetcher.fetch_data(request)
    print(response.data)
asyncio.run(main())
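Because the fetch call is awaited, several requests can in principle be issued concurrently. The sketch below illustrates the pattern with `asyncio.gather`. `StubRequest` and `StubFetcher` are hypothetical stand-ins written for this example so that it runs without API keys or network access; they are not Khora classes, and whether the real agent is safe to call concurrently is an assumption to verify.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StubRequest:
    """Hypothetical stand-in for a data request (not khora's DataRequest)."""
    prompt: str
    source_config: dict = field(default_factory=dict)


class StubFetcher:
    """Hypothetical stand-in for a fetcher agent; makes no network or LLM calls."""

    async def fetch_data(self, request):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return {"prompt": request.prompt, "url": request.source_config.get("url")}


async def fetch_all(fetcher, requests):
    """Run every fetch concurrently; results come back in request order."""
    return await asyncio.gather(*(fetcher.fetch_data(r) for r in requests))


def run_demo():
    requests = [
        StubRequest("Fetch weather", {"url": "https://example.com/weather"}),
        StubRequest("Fetch news", {"url": "https://example.com/news"}),
    ]
    return asyncio.run(fetch_all(StubFetcher(), requests))


if __name__ == "__main__":
    for result in run_demo():
        print(result)
```

`asyncio.gather` preserves the order of its inputs, so each result can be matched back to its request by position.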
Creating Dynamic Pipelines
# Describe your pipeline in natural language
description = """
Create a pipeline that:
1. Fetches cryptocurrency prices from CoinGecko API
2. Scrapes latest crypto news from CoinDesk
3. Reads analysis from a Google Sheet
"""
# Generate pipeline configuration
config = builder.analyze_pipeline_request(description)
# Build the pipeline from the generated configuration (uses `builder` from the basic example)
pipeline = builder.build_pipeline(config)
Running Dagster UI
dagster dev -f src/khora/pipelines/definitions.py
Then navigate to http://localhost:3000 to see the Dagster UI.
Docker Usage
Build the image
docker build -t khora:latest .
Run the container
docker run -p 3000:3000 \
  -e OPENAI_API_KEY=your-key \
  -v "$(pwd)/.env:/app/.env" \
  khora:latest
Testing
Run the test suite:
pytest tests/
With coverage:
pytest tests/ --cov=khora --cov-report=html
Requirements
- Python 3.12 (required)
- Playwright browsers (automatically installed during setup)
CI/CD
The project uses GitHub Actions for CI/CD with two main workflows:
Main CI Workflow (ci.yml)
- Runs tests on Python 3.12
- Checks code formatting with Black and Ruff
- Performs type checking with mypy
- Builds and tests the Docker image
- Uploads coverage reports to Codecov
Publish Workflow (publish.yml)
Automatically publishes to PyPI when version tags are pushed:
- Triggered by pushing tags matching the v* pattern (e.g., v0.0.2)
- Runs the full test suite and quality checks
- Builds and publishes the package to PyPI
- Uses the PYPI_API_TOKEN secret for authentication
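To see which tag names a v* pattern would select, the sketch below approximates the match with Python's `fnmatch` module. This is only an approximation of the tag filter: GitHub Actions applies its own glob rules (for example, `*` there does not match `/`), and the `is_release_tag` helper is hypothetical rather than part of the project.

```python
from fnmatch import fnmatchcase


def is_release_tag(tag, pattern="v*"):
    """Return True if the tag matches the glob pattern (case-sensitive).

    Hypothetical helper; it approximates, but does not reproduce,
    the tag filter that GitHub Actions applies.
    """
    return fnmatchcase(tag, pattern)


if __name__ == "__main__":
    for tag in ("v0.0.2", "v1.0.0-rc1", "0.0.2", "release-1"):
        print(tag, is_release_tag(tag))
```

Note that a pattern this broad also matches tags such as `vnext`, so a stricter pattern may be worth considering if non-release tags start with "v".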
Project Structure
khora/
├── src/khora/
│ ├── agents/ # AI agents for data fetching and pipeline building
│ ├── pipelines/ # Dagster pipeline definitions
│ ├── tools/ # Tools for different data sources
│ └── utils/ # Utilities and data models
├── tests/ # Test suite
├── .github/workflows/ # CI/CD configuration
├── Dockerfile # Container definition
└── pyproject.toml # Project configuration
Contributing
- Fork the repository
- Create a feature branch:
  git checkout -b feature-name
- Make your changes and add tests
- Run tests and linting:
  pytest && black . && ruff check .
- Commit your changes:
  git commit -m "Add feature"
- Push to your fork:
  git push origin feature-name
- Create a pull request
License
MIT License - see LICENSE file for details.
Support
For issues and questions:
- Open an issue on GitHub
- Check the documentation
- Review existing discussions
Roadmap
- Add support for more data sources (databases, S3, etc.)
- Implement data transformation capabilities
- Add scheduling and monitoring features
- Create a web UI for pipeline management
- Support for more LLM providers
Releasing
Quick Release (Recommended)
Use the automated release script:
# Create and push a patch release (0.0.1 -> 0.0.2)
python scripts/create_release.py patch --push
# Create a minor release (0.0.1 -> 0.1.0)
python scripts/create_release.py minor
# Create a major release (0.0.1 -> 1.0.0)
python scripts/create_release.py major
# Preview what would happen
python scripts/create_release.py patch --dry-run
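The version arithmetic in the comments above (0.0.1 to 0.0.2, 0.1.0, or 1.0.0) can be sketched as follows. This is an illustration of the bump rules only, assuming plain MAJOR.MINOR.PATCH version strings; the `bump_version` function here is hypothetical and is not the code in scripts/bump_version.py or scripts/create_release.py.

```python
def bump_version(version, part):
    """Return the next version for a 'major', 'minor', or 'patch' bump.

    Assumes a plain MAJOR.MINOR.PATCH string with integer components.
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


if __name__ == "__main__":
    for part in ("patch", "minor", "major"):
        print(part, bump_version("0.0.1", part))
```

Bumping a higher component resets the lower ones to zero, which is why a minor bump from 0.0.1 yields 0.1.0 rather than 0.1.1.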
Step-by-Step Release
- Bump version:
  python scripts/bump_version.py patch
- Create git tag and push:
  git add .
  git commit -m "Bump version to 0.0.2"
  git tag v0.0.2
  git push origin main --tags
- Automatic publishing: The publish workflow will automatically:
  - Run all tests and quality checks
  - Build the package
  - Publish to PyPI
Setup PyPI Token
To enable publishing, add your PyPI API token as a GitHub secret:
- Create an API token on PyPI
- Add it as PYPI_API_TOKEN in your repository secrets
Version
Current version: 0.0.1
File details
Details for the file khora-0.0.1.tar.gz.
File metadata
- Download URL: khora-0.0.1.tar.gz
- Upload date:
- Size: 108.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ed512cbeae83aa29f53a88734cb9a76d66a4bf7a1e33630ab6a4aad29e5c0b5 |
| MD5 | 668c705db552be71172a8934a38461ee |
| BLAKE2b-256 | ec3069278465ccba91d560713074b9926b0650a679ca73417858edfa336679bf |
File details
Details for the file khora-0.0.1-py3-none-any.whl.
File metadata
- Download URL: khora-0.0.1-py3-none-any.whl
- Upload date:
- Size: 20.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d2ca01c97352d48ee21eb1e2c981d1986f0d0c446d84e2a928d568de9573c781 |
| MD5 | 9f939fd4e37fe547471d9c274cb16180 |
| BLAKE2b-256 | 0c86bef330658238390a631992642d87960cdf6dd903225a69682ca937913617 |
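The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library. The helper names `sha256_of` and `matches_published` are hypothetical, and the local file path is whatever location you downloaded the archive to.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published value."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


if __name__ == "__main__":
    import sys

    # Usage: python verify.py <downloaded-file> <published-sha256>
    if len(sys.argv) == 3:
        print(matches_published(sys.argv[1], sys.argv[2]))
```

SHA256 is the digest to prefer for this check; MD5 is listed for completeness but is not suitable for detecting deliberate tampering.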