
stix2tabular

Convert STIX cyber threat intelligence bundles to Pandas DataFrames.

Installation

pip install stix2tabular

Quick Start

from stix2tabular import stix_to_tables, save_tables

tables = stix_to_tables("enterprise-attack.json")

print(tables.keys())
# → dict_keys(['attack-pattern', 'intrusion-set', 'malware', 'tool', 'relationships', ...])

print(tables["malware"].head())
#                 id     type       name  ...
# 0  malware--abc123  malware  CHOPSTICK  ...
# 1  malware--def456  malware    X-Agent  ...

# Save to Parquet for later use
save_tables(tables, "attack_tables/")

Before / After

Before (without stix2tabular):

import json
import pandas as pd

with open("enterprise-attack.json") as f:
    bundle = json.load(f)

objects_by_type = {}
relationships = []

for obj in bundle["objects"]:
    obj_type = obj.get("type")
    if obj_type == "marking-definition":
        continue
    if obj_type == "relationship":
        relationships.append({
            "id": obj["id"],
            "type": obj["type"],
            "relationship_type": obj["relationship_type"],
            "source_ref": obj["source_ref"],
            "target_ref": obj["target_ref"],
            "created": obj.get("created"),
            "modified": obj.get("modified"),
        })
        continue
    if obj_type not in objects_by_type:
        objects_by_type[obj_type] = []
    row = {}
    for key, value in obj.items():
        row[key] = value
    objects_by_type[obj_type].append(row)

tables = {}
for obj_type, rows in objects_by_type.items():
    tables[obj_type] = pd.DataFrame(rows)
tables["relationships"] = pd.DataFrame(relationships)
# Still missing: sightings, SCO handling, STIX 2.0 embedded observables,
# deduplication, multi-bundle merging, error handling...

After (with stix2tabular):

from stix2tabular import stix_to_tables

tables = stix_to_tables("enterprise-attack.json")

What You Get

tables = stix_to_tables("enterprise-attack.json")

# One DataFrame per STIX type
tables["attack-pattern"]     # 680 rows × 15 columns
tables["intrusion-set"]      # 138 rows × 12 columns
tables["malware"]            # 490 rows × 14 columns
tables["tool"]               # 78 rows × 11 columns
tables["campaign"]           # 23 rows × 10 columns

# Relationships as a lean edge table
tables["relationships"]      # 18,400 rows × 9 columns

# Sightings
tables["sightings"]          # 42 rows × 8 columns

# SCO types (when include_scos=True)
tables["ipv4-addr"]          # 12 rows × 4 columns

API Reference

stix_to_tables(source, include_scos=True)

Convert STIX bundles into a dict of Pandas DataFrames.

  • source: str | list[str] | list[dict]
    • File path (.json): reads and parses a single file
    • Directory path: globs all *.json files, merges into one set of tables
    • list[str]: each string is parsed as a full STIX bundle JSON
    • list[dict]: each dict is treated as a parsed STIX bundle
  • include_scos: bool (default True)
    • When True, STIX Cyber-observable Objects (IP addresses, domain names, file hashes, etc.) get their own DataFrames
    • When False, only SDOs, relationships, and sightings are included
  • Returns: dict[str, pd.DataFrame]
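The list[dict] form is useful when bundles arrive over an API rather than from disk. A minimal sketch of the expected input shape, using hypothetical IDs (the bundle layout follows the STIX 2.1 specification; nothing here is this library's internal API):

```python
# A minimal STIX 2.1 bundle as a plain dict -- the shape accepted
# by the list[dict] form of `source`.
bundle = {
    "type": "bundle",
    "id": "bundle--11111111-1111-4111-8111-111111111111",
    "objects": [
        {
            "type": "malware",
            "spec_version": "2.1",
            "id": "malware--22222222-2222-4222-8222-222222222222",
            "created": "2024-01-01T00:00:00.000Z",
            "modified": "2024-01-01T00:00:00.000Z",
            "name": "CHOPSTICK",
            "is_family": True,
        },
        {
            "type": "relationship",
            "spec_version": "2.1",
            "id": "relationship--33333333-3333-4333-8333-333333333333",
            "relationship_type": "uses",
            "source_ref": "intrusion-set--44444444-4444-4444-8444-444444444444",
            "target_ref": "malware--22222222-2222-4222-8222-222222222222",
        },
    ],
}

# The object types present determine which DataFrames come back
types = {obj["type"] for obj in bundle["objects"]}
print(sorted(types))
# → ['malware', 'relationship']
```

Passing `[bundle]` to stix_to_tables would then produce a DataFrame per type here: a malware table plus the relationships edge table.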

save_tables(tables, directory)

Save all DataFrames to a directory as Parquet files.

  • tables: dict returned by stix_to_tables()
  • directory: path to output directory (created if it doesn't exist)
  • Writes one {type}.parquet file per key (e.g., malware.parquet, relationships.parquet)

load_tables(directory)

Load DataFrames from a directory of Parquet files.

  • directory: path to directory containing .parquet files from save_tables()
  • Returns: dict[str, pd.DataFrame] — dict keys derived from filenames

Working with the Data

# All techniques used by APT28
rels = tables["relationships"]
apt28_id = tables["intrusion-set"].query("name == 'APT28'")["id"].iloc[0]
technique_ids = rels.query(
    "source_ref == @apt28_id and relationship_type == 'uses'"
)["target_ref"]
techniques = tables["attack-pattern"][
    tables["attack-pattern"]["id"].isin(technique_ids)
]["name"]

# Most common relationship types
tables["relationships"]["relationship_type"].value_counts()

# Explode aliases to find all names for threat actors
tables["intrusion-set"].explode("aliases")[["name", "aliases"]]

# Merge bundles from a directory of STIX feeds
tables = stix_to_tables("/path/to/stix_feeds/")

# Join source and target names onto relationships for a denormalized view;
# renaming before each merge keeps the two name columns unambiguous
import pandas as pd

rels = tables["relationships"].copy()
names = pd.concat([df[["id", "name"]] for df in tables.values() if "name" in df.columns])
rels = rels.merge(names.rename(columns={"id": "source_ref", "name": "source_name"}), on="source_ref")
rels = rels.merge(names.rename(columns={"id": "target_ref", "name": "target_name"}), on="target_ref")
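The endpoint-name join can be sanity-checked on toy frames (hypothetical IDs and names, not real ATT&CK data); renaming the lookup columns before each merge keeps the source and target names distinct:

```python
import pandas as pd

# Hypothetical edge table and name lookup, mirroring the shapes above
rels = pd.DataFrame({
    "source_ref": ["intrusion-set--a"],
    "target_ref": ["malware--b"],
    "relationship_type": ["uses"],
})
names = pd.DataFrame({
    "id": ["intrusion-set--a", "malware--b"],
    "name": ["APT28", "CHOPSTICK"],
})

# Rename the lookup to match each endpoint column, then merge on it
out = (
    rels
    .merge(names.rename(columns={"id": "source_ref", "name": "source_name"}), on="source_ref")
    .merge(names.rename(columns={"id": "target_ref", "name": "target_name"}), on="target_ref")
)
print(out[["source_name", "relationship_type", "target_name"]].to_dict("records"))
# → [{'source_name': 'APT28', 'relationship_type': 'uses', 'target_name': 'CHOPSTICK'}]
```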

Saving and Loading

The library includes built-in Parquet persistence for lossless round-tripping:

from stix2tabular import stix_to_tables, save_tables, load_tables

tables = stix_to_tables("enterprise-attack.json")

# Save all DataFrames to a directory (one .parquet file per type)
save_tables(tables, "output/attack_tables/")
# Creates: attack-pattern.parquet, intrusion-set.parquet, malware.parquet,
#          relationships.parquet, sightings.parquet, ...

# Load them back — identical DataFrames, including list/dict columns
tables = load_tables("output/attack_tables/")

Parquet preserves Python lists and dicts natively, so no manual serialization step is needed and round-trips are lossless.

CSV note: If you need CSV, you'll need to serialize list/dict columns yourself before exporting:

import json
df = tables["malware"].copy()
for col in df.columns:
    df[col] = df[col].apply(lambda x: json.dumps(x) if isinstance(x, (list, dict)) else x)
df.to_csv("malware.csv", index=False)
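Reading such a CSV back requires the inverse step. A minimal sketch on toy data (not part of the library's API) that re-parses any JSON-looking cells:

```python
import io
import json
import pandas as pd

# Toy CSV with a JSON-serialized list column, as produced by the
# json.dumps step above
csv_text = 'name,aliases\nAPT28,"[""Fancy Bear"", ""Sofacy""]"\n'
df = pd.read_csv(io.StringIO(csv_text))

def maybe_loads(x):
    # Only parse strings that look like JSON containers
    if isinstance(x, str) and x[:1] in ("[", "{"):
        return json.loads(x)
    return x

df["aliases"] = df["aliases"].apply(maybe_loads)
print(df["aliases"].iloc[0])
# → ['Fancy Bear', 'Sofacy']
```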

Comparison with stix2nx

Need                          Use
Graph traversal, centrality   stix2nx
Filtering, aggregation, ML    stix2tabular
Both                          Install both

Same input API. Same STIX version support. Independent libraries — no cross-dependency.

Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests (integration test downloads live ATT&CK data, falls back to curated subset if offline)
pytest

# Run in offline mode (uses curated ~1MB ATT&CK subset only, no network needed)
STIX2TABULAR_OFFLINE=true pytest

# Regenerate the curated subset from latest ATT&CK (requires network)
python tests/data/build_subset.py

STIX Version Support

Supports both STIX 2.0 and STIX 2.1 bundles. STIX 2.0 observed-data objects with embedded observables are automatically extracted into their respective type DataFrames when include_scos=True.
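For reference, the two shapes differ as follows (hypothetical IDs; the object layouts come from the STIX 2.0 and 2.1 specifications, and the extraction line at the end is illustrative, not this library's implementation):

```python
# STIX 2.0: observables embedded inside observed-data under "objects"
observed_20 = {
    "type": "observed-data",
    "id": "observed-data--aaaaaaaa-1111-4111-8111-111111111111",
    "objects": {
        "0": {"type": "ipv4-addr", "value": "198.51.100.7"},
    },
}

# STIX 2.1: observables are top-level SCOs, linked via object_refs
observed_21 = {
    "type": "observed-data",
    "id": "observed-data--bbbbbbbb-2222-4222-8222-222222222222",
    "object_refs": ["ipv4-addr--cccccccc-3333-4333-8333-333333333333"],
}

# Illustrative extraction of the embedded 2.0 observables: each value
# under "objects" becomes a row in its type's DataFrame
extracted = list(observed_20.get("objects", {}).values())
print(extracted[0]["type"])
# → ipv4-addr
```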

License

MIT
