PCAP → ML tensor extraction for network intrusion detection research.
pcap2tensor
A fast, streaming, production-grade Python library for turning raw packet captures into training-ready tensors. Built for NIDS researchers tired of rolling their own extraction pipeline for every paper.
Why this exists
Every ML-based network intrusion detection paper reinvents the same pipeline:
- Parse a PCAP
- Extract per-packet features (size, inter-arrival time, direction, TCP flags, ...)
- Slide a window across the sequence
- Save as a tensor
Most implementations are one-file scripts that cannot handle PCAPs larger than RAM, expose no clean extension points for custom features, and crash on the first malformed packet. pcap2tensor is that pipeline, packaged properly: streaming, extensible, and published on PyPI.
Install
pip install pcap2tensor
Requires Python ≥ 3.9. Depends on Scapy, PyTorch, NumPy, and tqdm.
Quickstart
from pcap2tensor import extract
tensor = extract("capture.pcap", features="aegis-6d", window_size=1000, stride=500)
print(tensor.shape) # torch.Size([num_windows, 1000, 6])
Feed straight into any sequence model — Transformer, LSTM, SSM, CNN.
Large PCAPs
Processing is streamed in chunks, so the full PCAP is never loaded into memory:
from pcap2tensor import PCAPExtractor

extractor = PCAPExtractor(
    features="aegis-6d",
    window_size=1000,
    stride=500,
    chunk_size=2_000_000,  # flush every 2M packets
)

# Option A: save chunked .pt files
extractor.save("massive.pcap", output_dir="./tensors/")

# Option B: stream chunks into your training loop
for chunk in extractor.extract_chunks("massive.pcap"):
    train_step(chunk)
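To make the chunked flow concrete, here is a minimal pure-Python sketch of how windows can be cut from a packet stream while only a bounded buffer stays in memory. It is an illustration under assumptions, not the library's internals: `iter_window_chunks` is a hypothetical name, trailing partial windows are dropped, and `stride` is assumed to be no larger than `window_size`.

```python
def iter_window_chunks(rows, window_size, stride, chunk_size):
    """Yield lists of windows, flushing once every `chunk_size` input rows.

    Hypothetical helper for illustration; not part of pcap2tensor.
    `rows` is any iterable of per-packet feature rows.
    """
    if not 0 < stride <= window_size:
        raise ValueError("expected 0 < stride <= window_size")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    buffer = []   # rows still needed by windows that are not complete yet
    windows = []  # windows completed since the last flush
    seen = 0      # input rows consumed so far

    for row in rows:
        buffer.append(row)
        seen += 1
        # Emit a window whenever enough rows are buffered, then slide by `stride`.
        while len(buffer) >= window_size:
            windows.append(buffer[:window_size])
            del buffer[:stride]
        # Flush at chunk boundaries so memory use stays bounded.
        if seen % chunk_size == 0 and windows:
            yield windows
            windows = []

    # Final flush. Rows left in `buffer` form a partial window and are dropped.
    if windows:
        yield windows
```

Because the overlap between consecutive windows is carried in `buffer`, windows that straddle a chunk boundary are not lost, and the total window count matches what a single non-chunked pass would produce.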
Parallel batch
from pcap2tensor import batch_extract
batch_extract("./pcaps/", output_dir="./tensors/", features="aegis-6d", workers=8)
From the CLI:
pcap2tensor batch ./pcaps/ -o ./tensors/ -n 8
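For readers who want to see the shape of such a batch run, the following is a rough sketch of a one-PCAP-per-process fan-out built on `ProcessPoolExecutor`. It only uses the `extract` call shown in the Quickstart; the helper names (`plan_jobs`, `convert_one`, `run_batch`) and the `<stem>.pt` output naming are assumptions made for the example and may differ from what `batch_extract` actually does.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def plan_jobs(input_dir, output_dir):
    """Pair every .pcap in input_dir with a .pt path in output_dir (assumed naming)."""
    out = Path(output_dir)
    return [
        (str(src), str(out / (src.stem + ".pt")))
        for src in sorted(Path(input_dir).glob("*.pcap"))
    ]


def convert_one(job):
    """Worker: runs in its own process and converts a single PCAP."""
    # Imported here so each worker process loads the heavy dependencies itself.
    import torch
    from pcap2tensor import extract

    src, dst = job
    tensor = extract(src, features="aegis-6d", window_size=1000, stride=500)
    torch.save(tensor, dst)
    return dst


def run_batch(input_dir, output_dir, workers=8):
    """Convert every PCAP in input_dir using a pool of worker processes."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    jobs = plan_jobs(input_dir, output_dir)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, jobs))
```

On platforms that start workers with `spawn` (Windows, macOS), call `run_batch` from under an `if __name__ == "__main__":` guard.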
Feature presets
| Preset | Dim | Features |
|---|---|---|
| basic-3d | 3 | size, IAT, direction |
| aegis-6d | 6 | size, IAT, direction, TCP window, TCP flags, payload ratio |
| extended-10d | 10 | aegis-6d + protocol one-hot (TCP/UDP/ICMP/other) |
| full-13d | 13 | extended-10d + destination port category (well-known/registered/dynamic) |
The aegis-6d preset matches the feature set in AEGIS (Ferrel, 2026) — a TVD-HL-SSM architecture achieving F1 0.9952 on encrypted traffic detection at 262 μs inference latency.
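The dimension counts in the preset table follow from the two categorical encodings: four protocol slots take aegis-6d from 6 to 10, and three port categories take it from 10 to 13. The sketch below shows one plausible way to compute them. The function names, the column order, and the choice of one-hot for the port category are assumptions for illustration; the protocol numbers and port ranges are the standard IANA values.

```python
# IANA protocol numbers, in the column order assumed here: TCP, UDP, ICMP.
_PROTOCOL_ORDER = (6, 17, 1)


def protocol_one_hot(proto):
    """Return a 4-slot one-hot: [TCP, UDP, ICMP, other]."""
    vec = [0.0, 0.0, 0.0, 0.0]
    if proto in _PROTOCOL_ORDER:
        vec[_PROTOCOL_ORDER.index(proto)] = 1.0
    else:
        vec[3] = 1.0  # anything else, e.g. ICMPv6 (58), lands in "other" here
    return vec


def port_category_one_hot(port):
    """Return a 3-slot one-hot: [well-known, registered, dynamic]."""
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    if port <= 1023:          # well-known ports: 0-1023
        index = 0
    elif port <= 49151:       # registered ports: 1024-49151
        index = 1
    else:                     # dynamic/private ports: 49152-65535
        index = 2
    vec = [0.0, 0.0, 0.0]
    vec[index] = 1.0
    return vec
```

How the library itself treats ICMPv6 or packets without a transport-layer port is not stated here, so check the preset source before relying on this mapping.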
Custom features
A Feature is any stateful callable returning a float or a flat list of floats. Subclass Feature, implement __call__, optionally override reset if you hold state:
import math
from collections import Counter

from scapy.layers.inet import TCP

from pcap2tensor import PCAPExtractor, Feature, Size, IAT, Direction


class PayloadEntropy(Feature):
    name = "payload_entropy"
    dim = 1

    def __call__(self, pkt):
        # Non-TCP packets and empty payloads score 0.0.
        payload = bytes(pkt[TCP].payload) if TCP in pkt else b""
        if not payload:
            return 0.0
        counts = Counter(payload)
        n = len(payload)
        # Shannon entropy in bits per byte, divided by 8 (the maximum) to land in [0, 1].
        return -sum((c / n) * math.log2(c / n) for c in counts.values()) / 8.0


extractor = PCAPExtractor(
    features=[Size(), IAT(), Direction(), PayloadEntropy()],
)
tensor = extractor.extract("capture.pcap")
Return a list[float] and set dim accordingly for multi-valued features (e.g. one-hots).
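As a sketch of the stateful, multi-valued case, the example below returns two floats per packet and overrides `reset` to clear its state. To keep it runnable without the library installed, it defines a stand-in `Feature` base class; in real use you would import `Feature` from pcap2tensor instead. The `reset` signature and the `GapStats` feature itself are assumptions for illustration. `pkt.time` is the capture timestamp that Scapy attaches to each packet.

```python
class Feature:
    """Stand-in for pcap2tensor.Feature, only so this sketch runs on its own."""

    name = "feature"
    dim = 1

    def reset(self):
        pass

    def __call__(self, pkt):
        raise NotImplementedError


class GapStats(Feature):
    """Hypothetical feature: [gap since previous packet, running mean gap]."""

    name = "gap_stats"
    dim = 2  # must match the length of the returned list

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear state so timing from one PCAP does not leak into the next.
        self._last_time = None
        self._count = 0
        self._mean = 0.0

    def __call__(self, pkt):
        now = float(pkt.time)
        gap = 0.0 if self._last_time is None else now - self._last_time
        self._last_time = now
        # Incremental mean: avoids storing every gap seen so far.
        self._count += 1
        self._mean += (gap - self._mean) / self._count
        return [gap, self._mean]
```

With the real base class, `GapStats()` would be passed in the `features=[...]` list like any built-in feature and would contribute 2 to `feature_dim`.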
CLI
# Single PCAP
pcap2tensor extract capture.pcap -o ./tensors/
# Parallel batch over a directory
pcap2tensor batch ./pcaps/ -o ./tensors/ -n 8
# List presets
pcap2tensor presets
# Override everything
pcap2tensor extract capture.pcap -f extended-10d -w 2000 -s 1000 -c 5000000
Design
| Concern | How it's handled |
|---|---|
| Memory | Streaming PcapReader, chunked flush every chunk_size packets |
| Malformed packets | Caught per packet and silently skipped, so a 4-hour run doesn't die on one bad packet |
| Flow state | Per-Feature instance, auto-reset between PCAPs |
| Parallelism | ProcessPoolExecutor for batch mode |
| IPv6 | First-class (IPv6 src/dst, port extraction, protocol number) |
| Reproducibility | Same PCAP + same config = bit-identical tensor output |
| Output format | PyTorch .pt on disk, torch.Tensor in memory |
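The per-packet error handling in the table can be pictured as a loop that builds one feature row per packet and drops any packet whose extraction raises. The sketch below is an assumption about the general pattern, not the library's code; `extract_rows` and the optional `on_skip` callback are hypothetical names.

```python
def extract_rows(packets, features, on_skip=None):
    """Yield one flat feature row per packet, skipping packets that raise.

    Hypothetical helper for illustration; not part of pcap2tensor.
    """
    for index, pkt in enumerate(packets):
        try:
            row = []
            for feature in features:
                value = feature(pkt)
                if isinstance(value, (int, float)):
                    row.append(float(value))                # scalar feature
                else:
                    row.extend(float(v) for v in value)     # multi-valued feature
        except Exception as exc:
            # One bad packet should not abort a long run; report it if asked.
            if on_skip is not None:
                on_skip(index, exc)
            continue
        yield row
```

One caveat of this pattern: if a stateful feature updates its state before a later feature raises, that update is kept even though the row is discarded. The optional callback gives callers a way to count skipped packets instead of losing them without a trace.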
Performance
Single-core throughput with aegis-6d on a modern x86 machine is roughly 50–120k packets/sec, with TCP-heavy captures slower than UDP-heavy ones. With 8 workers in batch mode, processing 100 GB+ of PCAPs per hour is achievable. The bottleneck is Scapy parsing, not feature extraction.
Output shape
Every extractor produces tensors of shape:
(num_windows, window_size, feature_dim)
where feature_dim = sum(f.dim for f in features). For aegis-6d, that's 6.
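The shape can be estimated ahead of time from the packet count and the window settings. This small sketch assumes the usual sliding-window convention in which a trailing partial window is dropped rather than padded; whether pcap2tensor pads or drops is not stated here, so treat `num_windows` as an estimate. `output_shape` is a hypothetical helper.

```python
def output_shape(num_packets, window_size, stride, feature_dims):
    """Return (num_windows, window_size, feature_dim) for a packet sequence.

    Assumes partial trailing windows are dropped (no padding).
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    feature_dim = sum(feature_dims)
    if num_packets < window_size:
        num_windows = 0
    else:
        num_windows = (num_packets - window_size) // stride + 1
    return (num_windows, window_size, feature_dim)


# 10,000 packets, Quickstart settings, six 1-dimensional features:
print(output_shape(10_000, 1000, 500, [1] * 6))  # (19, 1000, 6)
```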
Citation
If you use this library in research, please cite the companion paper:
@article{ferrel2026aegis,
  title   = {AEGIS: Adversarial Entropy-Guided Immune System --
             Thermodynamic State Space Models for Zero-Day Network
             Evasion Detection},
  author  = {Ferrel, Vickson},
  journal = {arXiv preprint arXiv:2604.02149},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.02149}
}
License
MIT © Vickson Ferrel — Vixero Technology Enterprise
Built in Sarawak. For network defenders everywhere. 🛡️