Skip to main content

Extract events from Splunk journal archives to raw format (JSON, CSV, Parquet)

Project description

Splunk DDSS Extractor

Convert Splunk self-hosted storage archives from compressed journal format to raw format.

Overview

Splunk DDSS Extractor is a Python library (with a Rust native extension for performance) that processes Splunk journal archives, extracts events, and converts them to common formats for analysis and long-term storage.

Key highlights:

  • Rust-accelerated decoding via PyO3 native extension (~2.7x faster than pure Python)
  • Automatic compression detection (.zst, .gz, uncompressed)
  • Multiple output formats: JSON Lines, CSV, Parquet
  • Streaming S3 support (read/write directly, no temp files)
  • CLI tool and Python API

Based on the concept from fionera/splunker, extended with streaming S3 support, multiple output formats, Rust performance, and production-ready features.

Installation

pip install splunk-ddss-extractor

With optional extras:

pip install splunk-ddss-extractor[s3]       # S3 streaming support (boto3)
pip install splunk-ddss-extractor[parquet]   # Parquet output (pyarrow)
pip install splunk-ddss-extractor[cli]       # CLI dependencies (click)

Pre-built wheels with the native Rust extension are available for Linux x86_64 and aarch64 (Python 3.10-3.13). The library falls back to a pure-Python decoder if the native extension is unavailable.

Quick Start

CLI

# Extract to JSON Lines
splunk-extract -i journal.zst -o output.json -f ndjson

# Extract to CSV
splunk-extract -i journal.zst -o output.csv -f csv

# Extract to Parquet
splunk-extract -i journal.zst -o output.parquet -f parquet

# S3 to local (streaming)
splunk-extract -i s3://bucket/path/journal.zst -o output.json

# S3 to S3 (streaming, no temp files)
splunk-extract -i s3://bucket/input/journal.zst -o s3://bucket/output/data.json

# Stdin/stdout
cat journal.zst | splunk-extract > output.json

# Enable debug tracing
splunk-extract -i journal.zst -o output.json --trace

Python API

from splunk_ddss_extractor.extractor import Extractor

extractor = Extractor()

# All compression formats are auto-detected
extractor.extract("journal.zst", "output.json", "ndjson")    # Zstandard
extractor.extract("journal.gz", "output.csv", "csv")          # Gzip
extractor.extract("journal", "output.parquet", "parquet")      # Uncompressed

# Streaming S3 (no temp files)
extractor.extract("s3://bucket/journal.zst", "s3://bucket/output.json", "ndjson")

Low-level Decoder (Advanced)

The native decoder provides direct access to the binary journal format:

from splunk_ddss_extractor import NativeJournalDecoder
import zstandard as zstd

with open("journal.zst", "rb") as f:
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(f) as reader:
        decoder = NativeJournalDecoder(reader=reader)
        while decoder.scan():
            event = decoder.get_event()
            print(f"Host: {decoder.host()}")
            print(f"Source: {decoder.source()}")
            print(f"Sourcetype: {decoder.source_type()}")
            print(f"Timestamp: {event.index_time}")
            print(f"Message: {event.message_string()}")

A pure-Python JournalDecoder is also available with the same interface, useful when the native extension cannot be built.

Architecture

splunk_ddss_extractor/
├── src/splunk_ddss_extractor/    # Python package
│   ├── __init__.py
│   ├── decoder.py                # Pure-Python journal decoder
│   ├── native_decoder.py         # Rust-backed decoder (NativeJournalDecoder)
│   ├── extractor.py              # High-level extraction API
│   ├── main.py                   # CLI entry point
│   └── utils.py                  # Output writers, logging
├── rust/                         # Native extension (PyO3 + maturin)
│   ├── Cargo.toml
│   └── src/
│       ├── lib.rs                # PyO3 module
│       ├── decoder.rs            # Rust journal decoder
│       └── varint.rs             # Variable-length integer parsing
├── tests/
├── docker/
└── scripts/

Components:

  1. Native Decoder (rust/ + native_decoder.py) - Rust extension for high-throughput decoding via scan_batch, wrapped in Python for ergonomic use
  2. Python Decoder (decoder.py) - Pure-Python fallback with identical interface
  3. Extractor (extractor.py) - High-level API with auto-compression, S3 streaming, multi-format output
  4. CLI (main.py) - Command-line interface via splunk-extract

Output Formats

Format Extension Use Case
JSON Lines .json / .jsonl Streaming, log aggregation
CSV .csv Spreadsheets, simple analytics
Parquet .parquet Columnar analytics, data lakes

Example JSON Lines output:

{"timestamp": 1761643257, "host": "server01", "source": "s3://logs/app.log", "sourcetype": "aws:cloudtrail", "message": "{\"eventVersion\":\"1.08\",...}"}

Integration Examples

AWS Lambda (streaming S3-to-S3):

from splunk_ddss_extractor.extractor import Extractor

def lambda_handler(event, context):
    extractor = Extractor()
    count = extractor.extract(
        input_path="s3://input-bucket/journal.zst",
        output_path="s3://output-bucket/output.json",
        output_format="ndjson",
    )
    return {"statusCode": 200, "events_extracted": count}

Docker:

make docker
docker run -v /path/to/data:/data splunk-ddss-extractor:latest \
    splunk-extract -i /data/journal.zst -o /data/output.json

Development

Setup

# Full setup (venv + deps + Rust extension)
make dev-setup

# Or step by step:
make venv
source venv/bin/activate
make install
make rust-build-release

Commands

make test              # Run Python tests
make test-coverage     # Tests with coverage report
make rust-test         # Run Rust unit tests
make rust-build        # Build Rust extension (debug)
make rust-build-release # Build Rust extension (release)
make docker            # Build Docker image
make check             # Run all checks
make env               # Show all available commands

Versioning and Releases

Version is tracked in three files kept in sync by the bump script:

  • pyproject.toml (Python package version)
  • rust/Cargo.toml (Rust crate version)
  • src/splunk_ddss_extractor/__init__.py (runtime __version__)
make version           # Show current version
make bump-patch        # 0.3.0 -> 0.3.1
make bump-minor        # 0.3.0 -> 0.4.0
make bump-major        # 0.3.0 -> 1.0.0
make release           # Bump patch, commit, tag, and push
make release-minor     # Bump minor, commit, tag, and push
make release-major     # Bump major, commit, tag, and push

The publish.yml GitHub Actions workflow builds native wheels for Linux x86_64 and aarch64 (Python 3.10-3.13) and publishes to PyPI when a GitHub release is created.

Performance

Decoder Throughput
Pure Python (JournalDecoder) ~29K events/s
Rust native (NativeJournalDecoder) ~80K events/s

The Extractor class uses the native decoder by default when available.

Credits

Based on the concept from fionera/splunker (Go). This Python/Rust implementation adds streaming S3 support, multiple output formats, native performance, and a production-ready API.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

splunk_ddss_extractor-0.4.1.tar.gz (21.2 kB view details)

Uploaded Source

File details

Details for the file splunk_ddss_extractor-0.4.1.tar.gz.

File metadata

  • Download URL: splunk_ddss_extractor-0.4.1.tar.gz
  • Upload date:
  • Size: 21.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for splunk_ddss_extractor-0.4.1.tar.gz
Algorithm Hash digest
SHA256 52b4bdbc9756be2a09a2c28f1cf251e6b001d10075b69a51e42d0a13e6f87dd1
MD5 5fd1e309005d3a34cc2ecb4aa913c0ce
BLAKE2b-256 604057bef37c7023506c86fda8f459d2a40f1b2d467268c9b23d4f4fefbc06ae

See more details on using hashes here.

Provenance

The following attestation bundles were made for splunk_ddss_extractor-0.4.1.tar.gz:

Publisher: publish.yml on ponquersohn/splunk_ddss_extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page