Extract events from Splunk journal archives to raw format (JSON, CSV, Parquet)

Splunk DDSS Extractor

Convert Splunk Dynamic Data Self Storage (DDSS) archives from the compressed journal format to a raw format (JSON Lines, CSV, or Parquet).

Overview

Splunk DDSS Extractor is a Python library that reads Splunk journal archives, extracts the events they contain, and writes them in a raw format for easier analysis and long-term storage. Use it as a library in your own applications and data pipelines, or as a CLI tool.

Note: This project is based on the concept from fionera/splunker, reimplemented in Python with additional features for production use.

Features

  • Automatic compression detection (.zst, .gz, uncompressed)
  • Extract events with full metadata (host, source, sourcetype, timestamps)
  • Multiple output formats (JSON Lines, CSV, Parquet)
  • Streaming processing for memory efficiency
  • Simple Python API and CLI
  • Docker support for containerized deployments
  • Integrates with AWS Lambda, ECS, or any Python environment
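
To illustrate the compression detection listed above, here is a minimal sketch that classifies a stream by its leading magic bytes. This is an assumption about one way such detection can work, not the library's actual code, which may rely on file extensions instead; `detect_compression` is a hypothetical helper.

```python
import gzip

# Magic numbers defined by the gzip (RFC 1952) and Zstandard (RFC 8878) formats.
GZIP_MAGIC = b"\x1f\x8b"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def detect_compression(head: bytes) -> str:
    """Classify a stream by its first bytes: 'zst', 'gz', or 'none'."""
    if head.startswith(ZSTD_MAGIC):
        return "zst"
    if head.startswith(GZIP_MAGIC):
        return "gz"
    return "none"


# Sample inputs: real gzip data, a bare Zstandard magic number, and plain bytes.
gz_sample = gzip.compress(b"journal bytes")
zst_sample = ZSTD_MAGIC + b"\x00" * 8  # magic number only, not a decodable frame
raw_sample = b"journal bytes"

kinds = [detect_compression(s[:4]) for s in (gz_sample, zst_sample, raw_sample)]
print(kinds)
```

Sniffing the content rather than the file name keeps detection working for objects whose names carry no extension, at the cost of reading the first few bytes before choosing a decompressor.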

Quick Start

Using the Makefile (Recommended)

# Show all available commands
make env

# Complete development setup (venv + dependencies)
make dev-setup

# Run tests
make test

# Build Docker image
make docker

Manual Setup

Installation

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

# Optional: Install Parquet support
pip install pyarrow

Basic Usage

Extract a journal file:

from splunk_ddss_extractor.extractor import Extractor

extractor = Extractor()

# Extract to JSON Lines
extractor.extract(
    input_path='/path/to/journal.zst',
    output_path='output.json',
    output_format='ndjson'
)

# Extract to CSV
extractor.extract(
    input_path='/path/to/journal.zst',
    output_path='output.csv',
    output_format='csv'
)

# Extract to Parquet
extractor.extract(
    input_path='/path/to/journal.zst',
    output_path='output.parquet',
    output_format='parquet'
)

# Extract from S3 to a local file (streamed, no temporary local copy)
extractor.extract(
    input_path='s3://bucket/path/journal.zst',
    output_path='output.json',
    output_format='ndjson'
)

# Extract from local to S3
extractor.extract(
    input_path='/path/to/journal.zst',
    output_path='s3://bucket/output/data.json',
    output_format='ndjson'
)
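
The examples above pass both local paths and `s3://` URIs through the same `input_path` and `output_path` parameters. As a rough sketch of how such paths can be told apart, the hypothetical helper below splits a path into a location type, bucket, and key using only the standard library. It is an illustration, not the library's own routing logic.

```python
from urllib.parse import urlparse


def split_path(path: str):
    """Return ('s3', bucket, key) for s3:// URIs, else ('local', None, path)."""
    parsed = urlparse(path)
    if parsed.scheme == "s3":
        # netloc holds the bucket; the key is the path without its leading slash.
        return ("s3", parsed.netloc, parsed.path.lstrip("/"))
    return ("local", None, path)


print(split_path("s3://bucket/path/journal.zst"))
print(split_path("/path/to/journal.zst"))
```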

Low-level streaming (advanced):

from splunk_ddss_extractor.decoder import JournalDecoder
import zstandard as zstd

# For low-level access, decoder needs an uncompressed stream
# If reading a compressed file, decompress it first:
with open('/path/to/journal.zst', 'rb') as compressed_file:
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(compressed_file) as reader:
        decoder = JournalDecoder(reader=reader)
        while decoder.scan():
            event = decoder.get_event()
            print(f"Host: {decoder.host()}")
            print(f"Source: {decoder.source()}")
            print(f"Sourcetype: {decoder.source_type()}")
            print(f"Timestamp: {event.index_time}")
            print(f"Message: {event.message_string()}")

# For uncompressed journal files:
with open('/path/to/journal', 'rb') as f:
    decoder = JournalDecoder(reader=f)
    while decoder.scan():
        event = decoder.get_event()
        # Process event...
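
The scan loop above can be wrapped in a generator so that callers receive one dictionary per event. The sketch below uses only the decoder methods shown above (`scan()`, `get_event()`, `host()`, `source()`, `source_type()`, `index_time`, `message_string()`). `iter_events`, `FakeDecoder`, and `FakeEvent` are hypothetical names, and the fake classes exist only so the example runs without a journal file. Mapping `index_time` to the `timestamp` key is an assumption.

```python
from dataclasses import dataclass


def iter_events(decoder):
    """Yield one dict per event from any object with the decoder interface."""
    while decoder.scan():
        event = decoder.get_event()
        yield {
            "timestamp": event.index_time,  # assumption: index time as timestamp
            "host": decoder.host(),
            "source": decoder.source(),
            "sourcetype": decoder.source_type(),
            "message": event.message_string(),
        }


@dataclass
class FakeEvent:
    """Stand-in for a decoded event (illustration only)."""
    index_time: int
    text: str

    def message_string(self) -> str:
        return self.text


class FakeDecoder:
    """Stand-in for JournalDecoder so the sketch runs without a journal."""

    def __init__(self, events):
        self._events = iter(events)
        self._current = None

    def scan(self) -> bool:
        self._current = next(self._events, None)
        return self._current is not None

    def get_event(self):
        return self._current

    def host(self) -> str:
        return "server01"

    def source(self) -> str:
        return "/var/log/app.log"

    def source_type(self) -> str:
        return "app"


records = list(iter_events(FakeDecoder([FakeEvent(1234567890, "Event data")])))
print(records)
```

Because the generator yields events one at a time, memory use stays flat regardless of journal size, provided the consumer does not collect everything into a list as the demonstration does.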

Docker Usage

# Build image
make docker

# Run with local file
docker run -v /path/to/data:/data ghcr.io/ponquersohn/splunk_ddss_extractor:latest

# Use in your own Dockerfile
FROM ghcr.io/ponquersohn/splunk_ddss_extractor:latest
COPY your_script.py /app/
CMD ["python", "/app/your_script.py"]

Architecture

This is a Python library with the following components:

  1. Journal Decoder - Low-level decoder for Splunk's binary journal format
  2. Extractor Interface - High-level API for common extraction tasks
  3. Output Writers - Support for JSON, CSV, and Parquet formats
  4. Compression Detection - Automatic detection and handling of .zst, .gz formats

Integration Options:

  • Direct Python import in your applications
  • AWS Lambda functions for serverless processing
  • ECS/Fargate tasks for batch processing
  • Docker containers for isolated environments
  • Local scripts for one-off extractions
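
For the AWS Lambda option, a handler typically starts by turning an S3 event notification into object URIs. The sketch below shows that step with the standard library only. `journal_uris_from_event` is a hypothetical helper and the bucket and key names are made up. Object keys arrive URL-encoded in S3 notifications, hence the `unquote_plus` call.

```python
from urllib.parse import unquote_plus


def journal_uris_from_event(event: dict) -> list:
    """Build s3:// URIs from the records of an S3 event notification."""
    uris = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in notifications ('+' stands for a space).
        key = unquote_plus(record["s3"]["object"]["key"])
        uris.append(f"s3://{bucket}/{key}")
    return uris


# Made-up event in the shape S3 sends to Lambda (trimmed to the fields used).
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "archive-bucket"},
                "object": {"key": "frozen+data/journal.zst"},
            }
        }
    ]
}

uris = journal_uris_from_event(sample_event)
print(uris)
```

Each resulting URI could then be passed as `input_path` to `Extractor.extract()`, as in the Basic Usage examples.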

See CLAUDE.md for detailed architecture documentation.

Development

Quick Commands

# Run tests
make test

# Run tests with coverage
make test-coverage

# Build Docker image
make docker

# Test Docker locally
make docker-run

# Run all checks (tests)
make check

# Clean temporary files
make clean

Manual Commands

# Run tests
pytest tests/

# Code formatting
black src/ tests/

# Local Docker testing
cd docker
docker-compose up

Configuration

When integrating with AWS or other environments, you can set these environment variables:

  • OUTPUT_FORMAT: Output format - json, csv, or parquet (default: json)
  • LOG_LEVEL: Logging level (default: INFO)
  • AWS_REGION: AWS region for S3 operations (default: us-east-1)
  • S3_BUCKET: S3 bucket name (for S3 integrations)
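
As a minimal sketch of how an integration might read these variables, the hypothetical `load_config` helper below applies the documented defaults and rejects unknown output formats. The variable names and defaults come from the list above; the function itself and its validation rule are assumptions, not part of the library.

```python
import os

# Accepted values as documented for OUTPUT_FORMAT.
VALID_FORMATS = {"json", "csv", "parquet"}


def load_config(env=None) -> dict:
    """Read settings from a mapping (os.environ by default) with defaults."""
    env = os.environ if env is None else env
    output_format = env.get("OUTPUT_FORMAT", "json")
    if output_format not in VALID_FORMATS:
        raise ValueError(f"Unsupported OUTPUT_FORMAT: {output_format!r}")
    return {
        "output_format": output_format,
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "aws_region": env.get("AWS_REGION", "us-east-1"),
        "s3_bucket": env.get("S3_BUCKET"),  # no default; None when unset
    }


defaults = load_config({})
custom = load_config({"OUTPUT_FORMAT": "parquet", "S3_BUCKET": "archive-bucket"})
print(defaults)
print(custom)
```

Accepting a mapping argument keeps the helper easy to test without touching the real process environment.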

Output Formats

JSON Lines (default)

{"timestamp": 1234567890, "host": "server01", "source": "/var/log/app.log", "sourcetype": "app", "message": "Event data"}
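
Each line of a JSON Lines file is an independent JSON document, so the output can be consumed one record at a time. A short sketch using only the standard library, with the sample record above held in an in-memory stream instead of a real output file (`read_json_lines` is a hypothetical helper):

```python
import io
import json


def read_json_lines(stream):
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for an output file such as output.json.
sample = io.StringIO(
    '{"timestamp": 1234567890, "host": "server01", '
    '"source": "/var/log/app.log", "sourcetype": "app", '
    '"message": "Event data"}\n'
)

events = list(read_json_lines(sample))
print(events[0]["host"], events[0]["message"])
```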

CSV

timestamp,host,source,sourcetype,message
1234567890,server01,/var/log/app.log,app,"Event data"
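
The message column is quoted, which matters because log events can contain commas or line breaks. The sketch below reads such output with the standard `csv` module. The second row, with a two-line message, is a made-up example added to show that quoted fields survive embedded newlines.

```python
import csv
import io

# In-memory stand-in for output.csv; newline="" lets the csv module
# handle line endings inside quoted fields itself.
sample = io.StringIO(
    "timestamp,host,source,sourcetype,message\n"
    '1234567890,server01,/var/log/app.log,app,"Event data"\n'
    '1234567891,server01,/var/log/app.log,app,"line one\nline two"\n',
    newline="",
)

rows = list(csv.DictReader(sample))

# CSV carries no type information, so numeric columns come back as strings.
timestamps = [int(row["timestamp"]) for row in rows]
print(timestamps)
print(repr(rows[1]["message"]))
```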

Parquet

Columnar format optimized for analytics (requires pyarrow).

Credits

This project is inspired by and based on the concept from fionera/splunker, a Go implementation for extracting Splunk journal files. This Python implementation extends the original concept with:

  • Streaming S3 support (no temporary files)
  • Multiple output formats (JSON Lines, CSV, Parquet)
  • Python library API for easy integration
  • Docker and AWS deployment options

License

Proprietary

Contributing

See CLAUDE.md for development guidelines.

Download files

Source Distribution

splunk_ddss_extractor-0.3.0.tar.gz (20.7 kB view details)

Uploaded Source

File details

Details for the file splunk_ddss_extractor-0.3.0.tar.gz.

File metadata

  • Download URL: splunk_ddss_extractor-0.3.0.tar.gz
  • Upload date:
  • Size: 20.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for splunk_ddss_extractor-0.3.0.tar.gz:

  • SHA256: 130ed8d3530ef383332424788d5ae753c5c99c21c7d41bdd2fc0e50d2d8a903c
  • MD5: f0f2b52d16974fa17ab8cab4863023bb
  • BLAKE2b-256: 5728f1d2fc0258a5eacffb3e008cd0176fe0e79e3773c399a8db1116c1665925

Provenance

The following attestation bundles were made for splunk_ddss_extractor-0.3.0.tar.gz:

Publisher: publish.yml on ponquersohn/splunk_ddss_extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
