# Splunk DDSS Extractor

Convert Splunk self-hosted storage archives from compressed journal format to raw formats (JSON Lines, CSV, Parquet).
## Overview

Splunk DDSS Extractor is a Python library that processes Splunk journal archives, extracts events, and converts them to raw formats for easier analysis and long-term storage. Use it in your own applications, in data pipelines, or as a CLI tool.

**Note:** This project is based on the concept from fionera/splunker, reimplemented in Python with additional features for production use.
## Features
- Automatic compression detection (.zst, .gz, uncompressed)
- Extract events with full metadata (host, source, sourcetype, timestamps)
- Multiple output formats (JSON Lines, CSV, Parquet)
- Streaming processing for memory efficiency
- Simple Python API and CLI interface
- Docker support for containerized deployments
- Integrates with AWS Lambda, ECS, or any Python environment
## Quick Start

### Using the Makefile (Recommended)

```shell
# Show all available commands
make env

# Complete development setup (venv + dependencies)
make dev-setup

# Run tests
make test

# Build the Docker image
make docker
```
### Manual Setup

```shell
# Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

# Optional: install Parquet support
pip install pyarrow
```
### Basic Usage

Extract a journal file:

```python
from splunk_ddss_extractor.extractor import Extractor

extractor = Extractor()

# Extract to JSON Lines
extractor.extract(
    input_path='/path/to/journal.zst',
    output_path='output.json',
    output_format='ndjson',
)

# Extract to CSV
extractor.extract(
    input_path='/path/to/journal.zst',
    output_path='output.csv',
    output_format='csv',
)

# Extract to Parquet
extractor.extract(
    input_path='/path/to/journal.zst',
    output_path='output.parquet',
    output_format='parquet',
)

# Extract from S3 to a local file (streaming, no temporary download)
extractor.extract(
    input_path='s3://bucket/path/journal.zst',
    output_path='output.json',
    output_format='ndjson',
)

# Extract from a local file to S3
extractor.extract(
    input_path='/path/to/journal.zst',
    output_path='s3://bucket/output/data.json',
    output_format='ndjson',
)
```
Low-level streaming (advanced):

```python
import zstandard as zstd

from splunk_ddss_extractor.decoder import JournalDecoder

# The decoder needs an uncompressed stream; if the journal is
# zstd-compressed, wrap the file in a streaming decompressor first.
with open('/path/to/journal.zst', 'rb') as compressed_file:
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(compressed_file) as reader:
        decoder = JournalDecoder(reader=reader)
        while decoder.scan():
            event = decoder.get_event()
            print(f"Host: {decoder.host()}")
            print(f"Source: {decoder.source()}")
            print(f"Sourcetype: {decoder.source_type()}")
            print(f"Timestamp: {event.index_time}")
            print(f"Message: {event.message_string()}")

# Uncompressed journal files can be read directly:
with open('/path/to/journal', 'rb') as f:
    decoder = JournalDecoder(reader=f)
    while decoder.scan():
        event = decoder.get_event()
        # Process event...
```
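The `scan()`/`get_event()` loop above can be wrapped in a small generator so callers iterate events directly. This sketch is decoder-agnostic: it works with any object exposing those two methods, so it is shown here without importing the library:

```python
def iter_events(decoder):
    """Yield events from a decoder until scan() reports no more data."""
    while decoder.scan():
        yield decoder.get_event()
```

Usage would then read `for event in iter_events(decoder): ...`, which composes naturally with generator tooling such as `itertools`.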
## Docker Usage

```shell
# Build the image
make docker

# Run with a local file
docker run -v /path/to/data:/data ghcr.io/ponquersohn/splunk_ddss_extractor:latest
```

To use the image in your own Dockerfile:

```dockerfile
FROM ghcr.io/ponquersohn/splunk_ddss_extractor:latest
COPY your_script.py /app/
CMD ["python", "/app/your_script.py"]
```
## Architecture

This is a Python library with the following components:

- **Journal Decoder** - low-level decoder for Splunk's binary journal format
- **Extractor Interface** - high-level API for common extraction tasks
- **Output Writers** - support for JSON Lines, CSV, and Parquet formats
- **Compression Detection** - automatic detection and handling of .zst and .gz inputs
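As an illustration of how the compression-detection step can work, here is a sketch that inspects a file's leading magic bytes. The function name is hypothetical and the library's actual implementation may differ:

```python
def detect_compression(path):
    """Guess compression from a file's leading magic bytes."""
    with open(path, 'rb') as f:
        magic = f.read(4)
    if magic.startswith(b'\x28\xb5\x2f\xfd'):  # zstd frame magic number
        return 'zst'
    if magic.startswith(b'\x1f\x8b'):          # gzip magic number
        return 'gz'
    return 'uncompressed'
```

Magic-byte sniffing is preferable to trusting the file extension, since archived journals are sometimes renamed in transit.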
### Integration Options
- Direct Python import in your applications
- AWS Lambda functions for serverless processing
- ECS/Fargate tasks for batch processing
- Docker containers for isolated environments
- Local scripts for one-off extractions
See CLAUDE.md for detailed architecture documentation.
## Development

### Quick Commands

```shell
# Run tests
make test

# Run tests with coverage
make test-coverage

# Build the Docker image
make docker

# Test Docker locally
make docker-run

# Run all checks (tests)
make check

# Clean temporary files
make clean
```
### Manual Commands

```shell
# Run tests
pytest tests/

# Code formatting
black src/ tests/

# Local Docker testing
cd docker
docker-compose up
```
## Configuration

When integrating with AWS or other environments, you may use these environment variables:

- `OUTPUT_FORMAT`: output format - `json`, `csv`, or `parquet` (default: `json`)
- `LOG_LEVEL`: logging level (default: `INFO`)
- `AWS_REGION`: AWS region for S3 operations (default: `us-east-1`)
- `S3_BUCKET`: S3 bucket name (for S3 integrations)
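A minimal sketch of consuming these variables in an integration script; the `load_config` helper and its return shape are illustrative, not part of the library's API:

```python
import os

def load_config(env=None):
    """Collect extractor settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    return {
        'output_format': env.get('OUTPUT_FORMAT', 'json'),
        'log_level': env.get('LOG_LEVEL', 'INFO'),
        'aws_region': env.get('AWS_REGION', 'us-east-1'),
        's3_bucket': env.get('S3_BUCKET'),  # no default; required for S3 integrations
    }
```

Passing a plain dict instead of `os.environ` keeps the helper easy to unit-test.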
## Output Formats

### JSON Lines (default)

```json
{"timestamp": 1234567890, "host": "server01", "source": "/var/log/app.log", "sourcetype": "app", "message": "Event data"}
```

### CSV

```csv
timestamp,host,source,sourcetype,message
1234567890,server01,/var/log/app.log,app,"Event data"
```

### Parquet

Columnar format optimized for analytics (requires `pyarrow`).
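To make the row shape concrete, here is how a single event record with the fields shown above serializes into the two text formats using only the standard library. This illustrates the layout, not the library's actual writers:

```python
import csv
import io
import json

event = {"timestamp": 1234567890, "host": "server01",
         "source": "/var/log/app.log", "sourcetype": "app",
         "message": "Event data"}

# JSON Lines: one JSON object per line
ndjson_line = json.dumps(event)

# CSV: a header row followed by one row per event
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(event))
writer.writeheader()
writer.writerow(event)
csv_text = buf.getvalue()
```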
## Credits

This project is inspired by fionera/splunker, a Go implementation for extracting Splunk journal files. This Python implementation extends the original concept with:

- Streaming S3 support (no temporary files)
- Multiple output formats (JSON Lines, CSV, Parquet)
- A Python library API for easy integration
- Docker and AWS deployment options
## License

Proprietary

## Contributing

See CLAUDE.md for development guidelines.