# stix2tabular

Convert STIX cyber threat intelligence bundles to Pandas DataFrames.
## Installation

```shell
pip install stix2tabular
```
## Quick Start

```python
from stix2tabular import stix_to_tables, save_tables

tables = stix_to_tables("enterprise-attack.json")

print(tables.keys())
# → dict_keys(['attack-pattern', 'intrusion-set', 'malware', 'tool', 'relationships', ...])

print(tables["malware"].head())
#                 id     type       name  ...
# 0  malware--abc123  malware  CHOPSTICK  ...
# 1  malware--def456  malware    X-Agent  ...

# Save to Parquet for later use
save_tables(tables, "attack_tables/")
```
## Before / After

**Before** (without stix2tabular):
```python
import json
import pandas as pd

with open("enterprise-attack.json") as f:
    bundle = json.load(f)

objects_by_type = {}
relationships = []

for obj in bundle["objects"]:
    obj_type = obj.get("type")
    if obj_type == "marking-definition":
        continue
    if obj_type == "relationship":
        relationships.append({
            "id": obj["id"],
            "type": obj["type"],
            "relationship_type": obj["relationship_type"],
            "source_ref": obj["source_ref"],
            "target_ref": obj["target_ref"],
            "created": obj.get("created"),
            "modified": obj.get("modified"),
        })
        continue
    if obj_type not in objects_by_type:
        objects_by_type[obj_type] = []
    row = {}
    for key, value in obj.items():
        row[key] = value
    objects_by_type[obj_type].append(row)

tables = {}
for obj_type, rows in objects_by_type.items():
    tables[obj_type] = pd.DataFrame(rows)
tables["relationships"] = pd.DataFrame(relationships)

# Still missing: sightings, SCO handling, STIX 2.0 embedded observables,
# deduplication, multi-bundle merging, error handling...
```
**After** (with stix2tabular):

```python
from stix2tabular import stix_to_tables

tables = stix_to_tables("enterprise-attack.json")
```
## What You Get

```python
tables = stix_to_tables("enterprise-attack.json")

# One DataFrame per STIX type
tables["attack-pattern"]   # 680 rows × 15 columns
tables["intrusion-set"]    # 138 rows × 12 columns
tables["malware"]          # 490 rows × 14 columns
tables["tool"]             #  78 rows × 11 columns
tables["campaign"]         #  23 rows × 10 columns

# Relationships as a lean edge table
tables["relationships"]    # 18,400 rows × 9 columns

# Sightings
tables["sightings"]        # 42 rows × 8 columns

# SCO types (when include_scos=True)
tables["ipv4-addr"]        # 12 rows × 4 columns
```
## API Reference

### `stix_to_tables(source, include_scos=True)`

Convert STIX bundles into a dict of Pandas DataFrames.

- `source`: `str | list[str] | list[dict]`
  - File path (`.json`): reads and parses a single file
  - Directory path: globs all `*.json` files and merges them into one set of tables
  - `list[str]`: each string is parsed as a full STIX bundle JSON document
  - `list[dict]`: each dict is treated as an already-parsed STIX bundle
- `include_scos`: `bool` (default `True`)
  - When `True`, STIX Cyber-observable Objects (IP addresses, domain names, file hashes, etc.) get their own DataFrames
  - When `False`, only SDOs, relationships, and sightings are included
- Returns: `dict[str, pd.DataFrame]`

### `save_tables(tables, directory)`

Save all DataFrames to a directory as Parquet files.

- `tables`: dict returned by `stix_to_tables()`
- `directory`: path to the output directory (created if it doesn't exist)
- Writes one `{type}.parquet` file per key (e.g., `malware.parquet`, `relationships.parquet`)

### `load_tables(directory)`

Load DataFrames from a directory of Parquet files.

- `directory`: path to a directory containing `.parquet` files from `save_tables()`
- Returns: `dict[str, pd.DataFrame]` — dict keys derived from the filenames
## Working with the Data

```python
# All techniques used by APT28
rels = tables["relationships"]
apt28_id = tables["intrusion-set"].query("name == 'APT28'")["id"].iloc[0]
technique_ids = rels.query(
    "source_ref == @apt28_id and relationship_type == 'uses'"
)["target_ref"]
techniques = tables["attack-pattern"][
    tables["attack-pattern"]["id"].isin(technique_ids)
]["name"]

# Most common relationship types
tables["relationships"]["relationship_type"].value_counts()

# Explode aliases to find all names for threat actors
tables["intrusion-set"].explode("aliases")[["name", "aliases"]]

# Merge bundles from a directory of STIX feeds
tables = stix_to_tables("/path/to/stix_feeds/")

# Join source names onto relationships for a denormalized view
import pandas as pd

rels = tables["relationships"].copy()
names = pd.concat([df[["id", "name"]] for df in tables.values() if "name" in df.columns])
rels = rels.merge(names, left_on="source_ref", right_on="id", suffixes=("", "_source"))
rels = rels.merge(names, left_on="target_ref", right_on="id", suffixes=("", "_target"))
```
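With toy rows (the STIX IDs and frame contents below are made up for illustration), the denormalized join pattern above produces one readable row per relationship:

```python
import pandas as pd

# Hypothetical lookup table of object names, as built from the per-type frames
names = pd.DataFrame({
    "id": ["intrusion-set--a1", "attack-pattern--b2"],
    "name": ["APT28", "Spearphishing Attachment"],
})
# One toy edge from the relationships table
rels = pd.DataFrame({
    "source_ref": ["intrusion-set--a1"],
    "target_ref": ["attack-pattern--b2"],
    "relationship_type": ["uses"],
})

# First merge attaches the source object's name, the second the target's
rels = rels.merge(names, left_on="source_ref", right_on="id").rename(
    columns={"name": "source_name"}
)
rels = rels.merge(
    names, left_on="target_ref", right_on="id", suffixes=("_src", "_tgt")
).rename(columns={"name": "target_name"})

print(list(rels["source_name"]), list(rels["target_name"]))
# → ['APT28'] ['Spearphishing Attachment']
```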
## Saving and Loading

The library includes built-in Parquet persistence for lossless round-tripping:

```python
from stix2tabular import stix_to_tables, save_tables, load_tables

tables = stix_to_tables("enterprise-attack.json")

# Save all DataFrames to a directory (one .parquet file per type)
save_tables(tables, "output/attack_tables/")
# Creates: attack-pattern.parquet, intrusion-set.parquet, malware.parquet,
#          relationships.parquet, sightings.parquet, ...

# Load them back — identical DataFrames, including list/dict columns
tables = load_tables("output/attack_tables/")
```

Parquet preserves Python lists and dicts natively — no manual serialization step, no data loss.

**CSV note:** if you need CSV, you'll need to serialize list/dict columns yourself before exporting:

```python
import json

df = tables["malware"].copy()
for col in df.columns:
    df[col] = df[col].apply(lambda x: json.dumps(x) if isinstance(x, (list, dict)) else x)
df.to_csv("malware.csv", index=False)
```
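Reading such a CSV back requires the inverse step: parsing the JSON-encoded cells. A self-contained sketch of the full round-trip with toy data (an in-memory buffer stands in for `malware.csv`):

```python
import io
import json

import pandas as pd

# Toy frame with a list column, as the malware table would have
df = pd.DataFrame({"name": ["CHOPSTICK"], "aliases": [["SPLM", "Xagent"]]})

# Serialize list/dict cells to JSON strings before writing CSV
out = df.copy()
for col in out.columns:
    out[col] = out[col].apply(lambda x: json.dumps(x) if isinstance(x, (list, dict)) else x)
buf = io.StringIO()
out.to_csv(buf, index=False)

# Read back and parse JSON-looking cells into Python objects again
buf.seek(0)
back = pd.read_csv(buf)

def parse_cell(x):
    if isinstance(x, str) and x[:1] in ("[", "{"):
        try:
            return json.loads(x)
        except json.JSONDecodeError:
            return x
    return x

back = back.apply(lambda col: col.map(parse_cell))
print(back["aliases"][0])  # → ['SPLM', 'Xagent']
```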
## Comparison with stix2nx
| Need | Use |
|---|---|
| Graph traversal, centrality | stix2nx |
| Filtering, aggregation, ML | stix2tabular |
| Both | Install both |
Same input API. Same STIX version support. Independent libraries — no cross-dependency.
## Running Tests

```shell
# Install dev dependencies
pip install -e ".[dev]"

# Run all tests (the integration test downloads live ATT&CK data,
# falling back to a curated subset if offline)
pytest

# Run in offline mode (uses the curated ~1 MB ATT&CK subset only; no network needed)
STIX2TABULAR_OFFLINE=true pytest

# Regenerate the curated subset from the latest ATT&CK release (requires network)
python tests/data/build_subset.py
```
## STIX Version Support

Supports both STIX 2.0 and STIX 2.1 bundles. STIX 2.0 `observed-data` objects with embedded observables are automatically extracted into their respective per-type DataFrames when `include_scos=True`.
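For illustration only (a sketch, not the library's internal code): STIX 2.0 embeds observables inside `observed-data` under local string keys, whereas STIX 2.1 makes SCOs top-level objects referenced by id. Grouping the embedded objects by `type` yields per-type rows like those `include_scos=True` produces:

```python
# A hypothetical STIX 2.0 observed-data object (id made up for the example)
observed_20 = {
    "type": "observed-data",
    "id": "observed-data--11111111-1111-4111-8111-111111111111",
    "objects": {  # STIX 2.0 style: observables keyed by local index
        "0": {"type": "ipv4-addr", "value": "198.51.100.7"},
        "1": {"type": "domain-name", "value": "example.test"},
    },
}

# Group the embedded observables by their STIX type
rows_by_type = {}
for obs in observed_20["objects"].values():
    rows_by_type.setdefault(obs["type"], []).append(dict(obs))

print(sorted(rows_by_type))  # → ['domain-name', 'ipv4-addr']
```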
## License
MIT