Internet Measurement Research Utilities (meta-package)
tmautils
A collection of Python utilities for Internet measurement research.
tma could stand for Traffic Measurement and Analysis, like the academic conference, or Too Much Analysis, depending on how frustrated you are with your research. (You get to choose.)
Installation
Install everything:
```shell
pip install tmautils[all]
```
Install only the sub-packages you need:
```shell
pip install tmautils[rdap,pki]  # specific sub-packages (+ core automatically)
pip install tmautils            # just core infrastructure
```
Available extras: bgp, dns, enrich-ip, pki, rdap, all.
You can also install sub-packages directly by name:
```shell
pip install tmautils-rdap       # same as tmautils[rdap]
pip install tmautils-bgp
pip install tmautils-dns
pip install tmautils-enrich-ip
pip install tmautils-pki
pip install tmautils-core       # same as bare tmautils
```
All packages share the tmautils.* namespace.
Quick Start
tmautils provides two types of APIs:
- Utility classes: self-contained tools that manage their own directories and logging (e.g., `RevocationChecker`, `CzdsDownloadUtil`), and
- Standalone functions: stateless helpers you call directly (e.g., `get_cert()`, `request_with_retry()`).
As a basic example, you can use get_cert() (a function) to download the TLS certificate presented by a server, and RevocationChecker (a utility class) to check whether said certificate has been revoked:
```python
from pathlib import Path

from tmautils.pki import get_cert, RevocationChecker

# (run inside an async function / event loop)
cert = await get_cert("www.example.com")       # using the get_cert function
checker = RevocationChecker(                   # instantiating RevocationChecker
    working_root=Path("/tmp"),
)
result = await checker.check_cert_chain(cert)  # using RevocationChecker's API
```
What's Included
tmautils is divided into sub-packages, grouped roughly by the functionality of the utilities and functions they contain. Sub-packages and their user-facing utilities and functions are listed below.
tmautils.core
tmautils.core contains core infrastructure used by all other sub-packages, but many exports are useful for writing custom user code. Provided by tmautils-core.
| Utility / Function | What it does |
|---|---|
| `IOHelper` | Directory structure and logging for utilities |
| `LogHelper` | Logging configuration |
| `AsyncRateLimiter` | Rate limiter with concurrency and throughput limits |
| `RetryConfig` | Configuration for HTTP retry behavior |
| `get_logger_from_helper()` | Get the configured or no-op logger from a `LogHelper` |
| `run_coro_sync()` | Run async code from a sync context |
| `request_with_retry()` | HTTP requests with retry and backoff |
| `get_with_retry()` | Convenience wrapper for GET with retry |
| `gzip_file()`, `gunzip_file()` | Compress/decompress a file with gzip |
| `try_convert_ip()` | Try to convert an IP address string to an IP address object |
| `is_ipv4()`, `is_ipv6()` | Check if a string is a valid IPv4/IPv6 address |
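For a flavor of what small helpers like `try_convert_ip()` and `is_ipv4()` do, here is a minimal sketch using only the standard library; the actual tmautils signatures may differ:

```python
import ipaddress
from typing import Optional, Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]

def try_convert_ip(value: str) -> Optional[IPAddress]:
    """Return an IP address object, or None if the string is not a valid IP."""
    try:
        return ipaddress.ip_address(value.strip())
    except ValueError:
        return None

def is_ipv4(value: str) -> bool:
    return isinstance(try_convert_ip(value), ipaddress.IPv4Address)

def is_ipv6(value: str) -> bool:
    return isinstance(try_convert_ip(value), ipaddress.IPv6Address)

print(is_ipv4("192.0.2.1"))         # True
print(is_ipv6("2001:db8::1"))       # True
print(try_convert_ip("not-an-ip"))  # None
```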
tmautils.db
tmautils.db contains storage backends. Also provided by tmautils-core.
| Utility / Function | What it does |
|---|---|
| `BufferedWriter` | Batched writes to storage backends |
| `DuckDbStore` | DuckDB storage interface |
| `DuckDbBackend` | DuckDB backend for `BufferedWriter` |
| `DuckLakeStore` | DuckLake storage interface |
| `DuckLakeBackend` | DuckLake backend for `BufferedWriter` |
| `ParquetBackend` | Parquet backend for `BufferedWriter` |
| `DuckDbInetLpmIndex` | In-memory LPM index for fast IP lookups |
| `pydantic_to_arrow()` | Convert Pydantic models to Arrow tables |
tmautils.bgp
Provided by tmautils-bgp.
| Utility / Function | What it does |
|---|---|
| `PyasnUtil` | IP-to-ASN lookups (wraps pyasn) |
| `ASdbCategoryUtil` | AS categorization using Stanford ASdb |
| `CaidaAsOrgInfoUtil` | AS-to-organization mapping using CAIDA's AS2Org |
tmautils.rdap
Provided by tmautils-rdap.
| Utility / Function | What it does |
|---|---|
| `RdapClient` | RDAP queries for domains, IPs, ASNs, entities, and nameservers |
tmautils.dns
Provided by tmautils-dns.
| Utility / Function | What it does |
|---|---|
| `AsyncDnsPythonUtil` | Async DNS resolution (wraps dnspython) |
| `CzdsDownloadUtil` | Download ICANN CZDS zone files |
| `OpenIntelZoneStreamUtil` | Subscribe to OpenINTEL ZoneStream |
| `dns_msg_semantic_hash()` | Compute a semantic hash of DNS messages |
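The details of `dns_msg_semantic_hash()` are specific to tmautils, but the general idea of a "semantic" hash (hashing a canonicalized representation so that record order and name case do not change the digest) can be sketched with the standard library. The function name and tuple layout below are illustrative, not the tmautils API:

```python
import hashlib

def semantic_hash(records: list) -> str:
    """Hash a set of (name, rtype, rdata) tuples so that record order
    and name case do not affect the digest."""
    canonical = sorted(
        (name.lower().rstrip("."), rtype.upper(), rdata)
        for name, rtype, rdata in records
    )
    payload = "\n".join("|".join(t) for t in canonical).encode()
    return hashlib.sha256(payload).hexdigest()

a = semantic_hash([("Example.COM.", "a", "192.0.2.1"),
                   ("example.com", "NS", "ns1.example.net")])
b = semantic_hash([("example.com", "NS", "ns1.example.net"),
                   ("example.com", "A", "192.0.2.1")])
print(a == b)  # True: same records, different order and case
```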
tmautils.enrich_ip
Provided by tmautils-enrich-ip.
| Utility / Function | What it does |
|---|---|
| `IPApiBatchUtil` | Interact with ip-api.com's batch API |
| `IPInfoLiteUtil` | Interact with IPinfo's Lite dataset |
| `IpInfoPrivacyUtil` | Privacy detection using IPinfo's database |
| `IpInfoCarrierUtil` | Mobile carrier lookup using IPinfo's database |
| `ChromePrefetchUtil` | Check if an IP address belongs to Chrome Prefetch Proxy |
tmautils.pki
Provided by tmautils-pki.
| Utility / Function | What it does |
|---|---|
| `RevocationChecker` | Check certificate revocation (OCSP/CRL) |
| `create_ssl_context()` | Create a configurable SSL context |
| `get_cert()` | Fetch the TLS certificate from a server |
| `get_cert_chain()` | Fetch the full certificate chain from a server |
| `fetch_issuer_cert()` | Fetch the issuer certificate via the AIA extension |
| `fetch_issuer_chain()` | Build the certificate chain from leaf to root |
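For intuition, a configurable client-side SSL context of the kind `create_ssl_context()` produces can be sketched with the standard library. Treat this as an illustration under assumed parameters, not the tmautils implementation; disabling verification is a common need in measurement work, where you want to capture certificates even from misconfigured servers:

```python
import ssl

def make_context(verify: bool = True,
                 min_tls: ssl.TLSVersion = ssl.TLSVersion.TLSv1_2) -> ssl.SSLContext:
    """Build a client-side SSL context with optional verification."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = min_tls
    if not verify:
        # Needed to connect to servers with invalid/self-signed certificates
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx

ctx = make_context(verify=False)
print(ctx.verify_mode == ssl.CERT_NONE)  # True
```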
Writing Your First Program
As you've probably noticed by now, there is no "one way" to use tmautils: which utility or function you use is driven by your use case, and the supported options vary between individual utilities. However, I have tried to include useful documentation with each utility and function.
That said, there are two "patterns" you will see across the API:

- All utilities support modification of storage/logging behavior through constructor kwargs. See "Self-Contained Utilities" below for details.
- I/O-heavy APIs are written with an async-first approach, but in most cases a sync wrapper is provided. See "Async-First Approach" below for details.
The boring stuff
Why does this library exist?
This library grew out of code written by me (Sulyab) for my PhD research. At some point, it made sense to pull the reusable bits into a common place and establish a directory structure to keep track of what data goes where. Days of debugging led to the addition of logging features, days of fighting with the GIL led to the addition of multiprocessing helpers, and so on.
Philosophy
There are three main tenets that shaped the design choices of this library:
- Do NOT reinvent the wheel (when sufficiently good wheels exist). There are several great Python packages out there that can help with specific Internet measurement analysis tasks, like pyasn for IP-to-ASN lookups. In such cases, it is preferable to write thin wrappers around such packages (such as `PyasnUtil` for `pyasn`).
- However, sometimes reinventing the wheel makes more sense. Usually this happens when the "best" Python package available for a use case is not "sufficiently" good according to certain criteria. For example, it may lack an `async` API or aggressive caching, two reasons why `RevocationChecker` exists instead of relying on pki-tools for certificate revocation lookups.
- Solve the problem at hand first; the "perfect" solution can come later. This is an important one, and the one that may affect you, the user, the most. As mentioned, each utility in this library was written to address specific needs at the time. As such, some utilities may support feature X but not feature Y, and some pieces may be more or less polished than others. However, I am always looking to improve the code and add more features, so please consider contributing!
Design Choices
While this library is a grab bag of utilities, I have tried to keep some design choices consistent throughout:
Self-Contained Utilities
Each utility in this library is designed to be "self-contained". Typically, you pass a working_root parameter when you instantiate a utility, say AbcUtil. This action will create a directory named AbcUtil in working_root, with subdirectories such as logs, raw and cache. (There will be another level of subdirectories if there are multiple instances of the same utility.) The following is an example:
```python
from pathlib import Path

from tmautils.dns import CzdsDownloadUtil

czds = CzdsDownloadUtil(working_root=Path("/data"))
# Creates: /data/CzdsDownloadUtil/raw/, /data/CzdsDownloadUtil/logs/, etc.
```
Under the hood, most of this work is done by IOHelper, which takes care of the directory structure and instantiates LogHelper to take care of logging.
You can pass additional arguments to IOHelper to modify some default behavior, such as making subdirectories symlinks, and turning off file logging. Example:
```python
czds = CzdsDownloadUtil(
    working_root=Path("/data"),
    # The following arguments are passed to IOHelper:
    # make /data/CzdsDownloadUtil/raw a symlink to ~/downloads/czds
    raw_dir_symlink_to=Path("~/downloads/czds/"),
    # IOHelper in turn passes the following argument to LogHelper
    logging_kwargs={
        "file_level": None,  # no file logging
    },
)
```
If you wish, you can delegate the storage/logging management of your own program to tmautils by instantiating IOHelper directly:
```python
from pathlib import Path

from tmautils.core import IOHelper

io = IOHelper(
    "MyProgram",
    working_root=Path("/data"),
    # By default, IOHelper will configure the subdirectories:
    # raw/, logs/, processed/, results/
)

# Access storage
out_path = io.raw / "hello.txt"  # io.raw is a pathlib.Path
out_path.write_text("Hello, World!")

# Use logger
io.logger.warning("I have no clue what I am doing.")
```
Async-First Approach
Many utilities in this library do I/O-heavy work (e.g., downloading zone files, making HTTP requests, checking certificate revocation status). For better concurrency, I/O-heavy utilities are written with an async-first approach. In such cases, the API also provides a sync version in case you do not want to bother with asyncio:
```python
result = await util.fetch_data("param")  # async (inside an event loop)
result = util.fetch_data_sync("param")   # sync wrapper
```
Async methods use the base name (`fetch_data()`), and sync wrappers append `_sync` (`fetch_data_sync()`). The sync wrappers simply call `run_coro_sync()` on the async method.
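The wrapper pattern can be sketched as follows; this is a simplified version under stated assumptions (the real `run_coro_sync` may handle nested event loops differently):

```python
import asyncio

def run_coro_sync(coro):
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)  # no loop running: safe to start one
    # Calling a blocking sync wrapper from inside a running loop would deadlock
    raise RuntimeError("run_coro_sync() called from within an event loop")

async def fetch_data(param: str) -> str:
    await asyncio.sleep(0)  # stand-in for real I/O
    return f"result for {param}"

def fetch_data_sync(param: str) -> str:
    return run_coro_sync(fetch_data(param))

print(fetch_data_sync("param"))  # result for param
```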
DuckDB as the Middle Layer for Database Stuff
After experimenting with different database backends, the current direction is to use DuckDB as a unified intermediate layer. DuckDB supports several popular backends (CSV, Parquet, SQLite, and even remote sources) and allows executing SQL queries against them. This lets us write code at the level of DuckDB connections rather than add support for each backend separately.
New code uses the following flow: Pydantic models -> Arrow table -> DuckDB insert. There is a fair amount of instrumentation around this which you can find in tmautils.db. For a good example of how to write a utility using this pattern, see OpenIntelZoneStreamUtil.
Some useful tools:
- Writing rows one-by-one to a database is slow. `BufferedWriter` accumulates rows in memory and flushes them in batches, working with multiple backends (thanks to DuckDB).
- For Longest Prefix Matching (LPM) lookups, we use `DuckDbInetLpmIndex`, which builds an in-memory LPM index to speed up lookups. Once built, the index can be used in Python code or DuckDB SQL (via UDFs).
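The batching idea behind `BufferedWriter` can be sketched in plain Python; this toy version flushes to a callback instead of a DuckDB-backed store, so the names and constructor here are illustrative, not the tmautils API:

```python
class BatchingWriter:
    """Accumulate rows in memory and flush them in batches (conceptual sketch)."""

    def __init__(self, flush_fn, batch_size: int = 1000):
        self.flush_fn = flush_fn      # called with a list of rows per batch
        self.batch_size = batch_size
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Flush any remaining rows; also called on batch-size boundaries
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

batches = []
writer = BatchingWriter(batches.append, batch_size=2)
for i in range(5):
    writer.write(i)
writer.flush()  # don't forget the final partial batch
print(batches)  # [[0, 1], [2, 3], [4]]
```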
License
This project is licensed under MPL-2.0 (Mozilla Public License 2.0).
What this means in practice:
- If you modify an existing file, your modifications must remain MPL-2.0.
- You can license new files however you want. (But I won't merge them into tmautils unless they are MPL-2.0.)
- You can use this code alongside code under other licenses.
Contributing
Contributions to tmautils are highly welcome! After all, there is much to do in Internet measurement research.
Since all the "real" code lives in subpackages, you probably want to contribute to their corresponding repos. If you want to submit a new subpackage to the tmautils metapackage, let me know.
Before writing code, please familiarize yourself with the philosophy and design choices, and try to follow them, or talk to me about how they are stupid and we should do things differently. I want this library to be the best version of itself.
AI Policy: I don't consider AI tool usage any different from IDE usage. This also means that you are responsible for the code you write and you should inspect every line of code written by an LLM. This policy is currently (slightly) relaxed for tests.