SQLAlchemy dialect for Arc time-series database with multi-database support for Apache Superset

Arc + Apache Superset Integration

This package provides the custom SQLAlchemy dialect and configuration for integrating Apache Superset with the Arc time-series data warehouse.

Features

  • Custom SQLAlchemy dialect for Arc API
  • Authentication via API keys (set and forget)
  • Full SQL support for dashboard creation
  • Auto-discovery of Arc tables and measurements
  • Docker Compose integration

Quick Start

Option 1: Install in Existing Superset

# Install the Arc dialect package
pip install arc-superset-dialect

# Restart Superset
superset run -h 0.0.0.0 -p 8088

Option 2: Docker Image (Recommended for Production)

Use the included Dockerfile to build a Superset image with Arc JSON dialect pre-installed:

# Build the image
docker build -t superset-arc-json:latest .

# Run Superset with Arc JSON support
docker run -d \
  -p 8088:8088 \
  -v superset_home:/app/superset_home \
  --name superset-arc \
  superset-arc-json:latest

# Check logs
docker logs -f superset-arc

The Dockerfile includes:

  • Arc JSON dialect pre-installed in Superset's venv
  • pip properly configured in the virtual environment
  • Superset's default initialization process

Connect to Arc

  1. Access Superset: http://localhost:8088 (admin/admin)

  2. Create Arc Connection:

    • Go to Settings → Database Connections
    • Click + Database
    • Select Other as database type
    • Use this connection string:
      arc+json://YOUR_API_KEY@arc-api:8000/default
      
  3. Replace YOUR_API_KEY with your Arc token (see below)

Getting Your API Key

Arc creates an initial admin token on first startup. Check the logs:

# Docker
docker logs arc-api | grep "Initial admin token"

# Native/systemd
journalctl -u arc | grep "Initial admin token"

Or create a new token using the Arc CLI or Python:

# Using Python directly
DB_PATH="./data/arc.db" python3 -c "
from api.auth import AuthManager
auth = AuthManager(db_path='./data/arc.db')
token = auth.create_token('superset-integration', description='Superset dashboard access')
print(f'Token: {token}')
"

Save the returned token - it's only shown once!
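
The token is sent as a standard Bearer header on every request the dialect makes. As a quick sanity check (illustrative only; auth_headers is a hypothetical helper name, not part of arc-superset-dialect), you can build the header yourself and hit Arc's health endpoint:

```python
# Illustrative helper: build the Authorization header Arc expects.
# auth_headers is a hypothetical name, not part of the dialect's API.

def auth_headers(token: str) -> dict:
    """Return the Bearer-auth headers Arc expects on every request."""
    return {"Authorization": f"Bearer {token}"}

# Example usage (needs the requests package and a running Arc instance):
# import requests
# resp = requests.get("http://localhost:8000/health",
#                     headers=auth_headers("YOUR_API_KEY"))
```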

Multi-Database Support

Arc supports multiple databases (namespaces) within a single instance. In Superset, databases are exposed as schemas:

Database Structure

Schema: default
  ├── cpu (CPU metrics)
  ├── mem (Memory metrics)
  └── disk (Disk metrics)

Schema: production
  ├── cpu
  ├── mem
  └── disk

Schema: staging
  ├── cpu
  ├── mem
  └── disk

Using Databases in Superset

  1. View All Databases (Schemas):

    • When creating a dataset, select the schema (database) from the dropdown
    • Each database appears as a separate schema in Superset
  2. Query Specific Database:

    -- Query default database
    SELECT * FROM cpu WHERE timestamp > NOW() - INTERVAL 1 HOUR
    
    -- Query specific database
    SELECT * FROM production.cpu WHERE timestamp > NOW() - INTERVAL 1 HOUR
    
    -- Cross-database joins
    SELECT
        p.timestamp,
        p.usage_idle as prod_cpu,
        s.usage_idle as staging_cpu
    FROM production.cpu p
    JOIN staging.cpu s ON p.timestamp = s.timestamp
    WHERE p.timestamp > NOW() - INTERVAL 1 HOUR
    

Available Commands

-- List all databases
SHOW DATABASES;

-- List all tables in current database
SHOW TABLES;

Available Tables

The dialect auto-discovers all tables using SHOW TABLES. Common examples:

  • cpu - CPU metrics
  • mem - Memory metrics
  • disk - Disk metrics
  • Any custom measurements you've ingested

Example Queries

Basic Queries

-- Recent CPU metrics (use 'timestamp' column)
SELECT timestamp, host, usage_idle, usage_user
FROM cpu
WHERE timestamp > NOW() - INTERVAL 1 HOUR
ORDER BY timestamp DESC
LIMIT 100;

-- Average CPU usage by host
SELECT host, AVG(usage_idle) as avg_idle
FROM cpu
WHERE timestamp > NOW() - INTERVAL 24 HOUR
GROUP BY host
ORDER BY avg_idle DESC;

Time-Series Aggregation

Arc supports DuckDB's powerful time functions:

-- Time bucket aggregation (5-minute intervals)
SELECT
    time_bucket(INTERVAL '5 minutes', timestamp) as bucket,
    host,
    AVG(usage_idle) as avg_idle,
    MAX(usage_user) as max_user
FROM cpu
WHERE timestamp > NOW() - INTERVAL 6 HOUR
GROUP BY bucket, host
ORDER BY bucket DESC;

-- Daily aggregation with DATE_TRUNC
SELECT
    DATE_TRUNC('day', timestamp) as day,
    host,
    AVG(usage_idle) as avg_cpu_idle,
    COUNT(*) as samples
FROM cpu
WHERE timestamp > NOW() - INTERVAL 7 DAY
GROUP BY day, host
ORDER BY day DESC;
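
The 5-minute time_bucket aggregation above simply floors each timestamp to the start of its bucket. The same semantics can be sketched in plain Python (illustrative only; this is not how Arc or DuckDB computes it internally):

```python
from datetime import datetime, timedelta

def time_bucket(width: timedelta, ts: datetime) -> datetime:
    """Floor ts to the start of its bucket, mirroring DuckDB's time_bucket()."""
    # Work in seconds since the Unix epoch and floor to the bucket width.
    epoch = datetime(1970, 1, 1)
    seconds = (ts - epoch).total_seconds()
    width_s = width.total_seconds()
    return epoch + timedelta(seconds=(seconds // width_s) * width_s)

# 12:07:30 falls in the bucket that starts at 12:05:00
bucket = time_bucket(timedelta(minutes=5), datetime(2024, 1, 1, 12, 7, 30))
```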

Join Queries

Join multiple measurements for correlated analysis:

-- Correlate CPU and Memory usage
SELECT
    c.timestamp,
    c.host,
    c.usage_idle as cpu_idle,
    m.used_percent as mem_used
FROM cpu c
JOIN mem m ON c.timestamp = m.timestamp AND c.host = m.host
WHERE c.timestamp > NOW() - INTERVAL 10 MINUTE
ORDER BY c.timestamp DESC
LIMIT 1000;

Window Functions

-- Moving average over last 6 data points
SELECT
    timestamp,
    host,
    usage_idle,
    AVG(usage_idle) OVER (
        PARTITION BY host
        ORDER BY timestamp
        ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) as moving_avg
FROM cpu
ORDER BY timestamp DESC
LIMIT 100;
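
The ROWS BETWEEN 5 PRECEDING AND CURRENT ROW frame above is a trailing average over up to 6 points per host. The same computation looks like this in Python (an illustrative sketch of the window semantics for a single host's ordered values):

```python
from collections import deque

def moving_avg(values, window=6):
    """Trailing average over up to `window` points, like
    AVG(x) OVER (ORDER BY ts ROWS BETWEEN 5 PRECEDING AND CURRENT ROW)."""
    buf = deque(maxlen=window)  # keeps only the last `window` values
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

avgs = moving_avg([90, 80, 70, 60, 50, 40, 30])
```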

Architecture

Superset → Custom Dialect → HTTP Requests → Arc API → DuckDB → Parquet Files → MinIO/S3

  • Superset: dashboards and SQL queries
  • Custom Dialect: translates SQLAlchemy calls into HTTP requests
  • HTTP Requests: authenticated with your API key
  • Arc API: query engine
  • DuckDB: columnar execution
  • Parquet Files: compacted storage
  • MinIO/S3: object storage

Connection String Format

arc+json://API_KEY@HOST:PORT/DATABASE

Examples:

# Local development
arc+json://your-api-key@localhost:8000/default

# Docker Compose
arc+json://your-api-key@arc-api:8000/default

# Remote server
arc+json://your-api-key@arc.example.com:8000/default
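
The connection string follows standard URL syntax, so its parts can be sanity-checked with Python's urllib before pasting it into Superset (a convenience check only; the dialect does not require this):

```python
from urllib.parse import urlsplit

# Break an Arc connection string into its components.
url = urlsplit("arc+json://your-api-key@arc-api:8000/default")

scheme = url.scheme               # "arc+json"
api_key = url.username            # "your-api-key"
host, port = url.hostname, url.port  # "arc-api", 8000
database = url.path.lstrip("/")   # "default"
```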

Troubleshooting

Connection Issues

  • Verify your API key is correct: Test with curl -H "Authorization: Bearer YOUR_KEY" http://host:8000/health
  • Check that Arc API is running: docker logs arc-api or check systemd logs
  • Ensure network connectivity: Superset and Arc must be on the same network or accessible via hostname

Query Issues

  • Check Arc API logs: docker logs arc-api to see query execution
  • Verify table names exist: Use SHOW TABLES in Arc to list available measurements
  • Use LIMIT clauses: Always limit large queries for performance
  • Add time filters: Time-series queries should filter by time range

Performance Tips

  • Always use LIMIT for exploratory queries to avoid loading millions of rows
  • Add time range filters for time-series data: WHERE timestamp > NOW() - INTERVAL 1 HOUR
  • Use column names correctly: Arc stores timestamps in timestamp column (not time)
  • Leverage query cache: Arc caches query results for 60 seconds by default
  • Use compacted partitions: Arc automatically merges small files for 10-50x faster queries
  • Optimize aggregations: Use time_bucket() for time-series bucketing instead of DATE_TRUNC when possible
  • Partition filtering: Include host or other tag filters to reduce data scanned
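
The 60-second query cache mentioned above can be pictured as a simple TTL map keyed by query text. This is an illustrative sketch of the caching behavior, not Arc's actual implementation:

```python
import time

class TTLCache:
    """Minimal TTL cache: entries expire `ttl` seconds after insertion."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock   # injectable for testing
        self._store = {}     # query text -> (expiry, result)

    def get(self, query):
        entry = self._store.get(query)
        if entry is None or entry[0] < self.clock():
            return None      # missing or expired
        return entry[1]

    def put(self, query, result):
        self._store[query] = (self.clock() + self.ttl, result)
```

Repeating the same query within the TTL returns the cached result instead of re-scanning Parquet files, which is why identical dashboard refreshes inside the window are fast.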

Package Contents

  • arc_dialect.py - Custom SQLAlchemy dialect for Arc API
  • setup.py - PyPI package configuration
  • MANIFEST.in - Package file manifest
  • README.md - This documentation
  • PUBLISHING.md - Publishing guide for PyPI

Security Notes

  • API keys are stored encrypted in Superset's database
  • All queries go through Arc's authentication system
  • Change the default Superset admin password in production
  • Set a strong SUPERSET_SECRET_KEY in production
  • Use HTTPS for production deployments

Advanced Features

Schema Support

Arc supports multiple schemas/databases:

-- Query specific database
SELECT * FROM my_database.cpu LIMIT 10;

-- List all databases
SHOW DATABASES;

Advanced DuckDB Features

Arc leverages DuckDB's full SQL capabilities:

-- List all measurements (tables)
SHOW TABLES;

-- Get table schema
DESCRIBE cpu;

-- Count records by measurement
SELECT COUNT(*) as total_records FROM cpu;

-- Complex aggregations with FILTER
SELECT
    host,
    COUNT(*) FILTER (WHERE usage_idle > 90) as idle_count,
    COUNT(*) FILTER (WHERE usage_idle < 50) as busy_count,
    AVG(usage_idle) as avg_idle
FROM cpu
WHERE timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY host;

Development

To modify or extend the dialect:

  1. Clone the repository:

    git clone https://github.com/basekick-labs/arc-superset-dialect.git
    cd arc-superset-dialect
    
  2. Edit arc_dialect.py

  3. Install locally for testing:

    pip install -e .
    
  4. Test with Superset:

    superset run -h 0.0.0.0 -p 8088
    
  5. Submit a PR if you add improvements!

License

Same license as Arc Core (AGPL-3.0)
