
Internet Measurement Research Utilities (meta-package)


tmautils

A collection of Python utilities for Internet measurement research.

tma could stand for Traffic Measurement and Analysis, like the academic conference, or Too Much Analysis, depending on how frustrated you are with your research. (You get to choose.)

Installation

Install everything:

pip install tmautils[all]

Install only the sub-packages you need:

pip install tmautils[rdap,pki]      # specific sub-packages (+ core automatically)
pip install tmautils                # just core infrastructure

Available extras: bgp, dns, enrich-ip, pki, rdap, all.

You can also install sub-packages directly by name:

pip install tmautils-rdap           # same as tmautils[rdap]
pip install tmautils-bgp
pip install tmautils-dns
pip install tmautils-enrich-ip
pip install tmautils-pki
pip install tmautils-core           # same as bare tmautils

All packages share the tmautils.* namespace.

Quick Start

tmautils provides two types of APIs:

  • Utility classes: self-contained tools that manage their own directories and logging (e.g., RevocationChecker, CzdsDownloadUtil), and
  • Standalone functions: stateless helpers you call directly (e.g., get_cert(), request_with_retry())

As a basic example, you can use get_cert() (a function) to download the TLS certificate presented by a server, and RevocationChecker (a utility class) to check whether said certificate has been revoked:

from tmautils.pki import get_cert, RevocationChecker
from pathlib import Path

cert = await get_cert("www.example.com") # using get_cert function

checker = RevocationChecker( # instantiating RevocationChecker
    working_root=Path("/tmp"),
)
result = await checker.check_cert_chain(cert) # using RevocationChecker's API

What's Included

tmautils is divided into sub-packages, grouped roughly by the functionality of the utilities and functions they contain. Sub-packages and their user-facing utilities and functions are listed below.

tmautils.core

tmautils.core contains core infrastructure used by all other sub-packages, but many of its exports are also useful for writing custom user code. Provided by tmautils-core.

Utility / Function What it does
IOHelper Directory structure and logging for utilities
LogHelper Logging configuration
AsyncRateLimiter Rate limiter with concurrency and throughput limits
RetryConfig Configuration for HTTP retry behavior
get_logger_from_helper() Get the configured or no-op logger from LogHelper
run_coro_sync() Run async code from sync context
request_with_retry() HTTP requests with retry and backoff
get_with_retry() Convenience wrapper for GET with retry
gzip_file(), gunzip_file() Compress/decompress a file with gzip
try_convert_ip() Try to convert an IP address string to an IP address object
is_ipv4(), is_ipv6() Check if string is valid IPv4/IPv6 address
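request_with_retry() and get_with_retry() follow the classic retry-with-exponential-backoff pattern. Their actual signatures (and RetryConfig's fields) are documented with the package; the following stdlib-only sketch, in which every name is hypothetical, only illustrates the underlying pattern:

```python
import time

def retry_with_backoff(fn, *, attempts=4, base_delay=0.5, exc=(Exception,)):
    """Call fn(), retrying on failure with exponential backoff.

    Illustrative only: tmautils' request_with_retry() wraps HTTP
    requests and is configured via RetryConfig; this sketch just
    shows the general retry-with-backoff idea.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except exc:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Example: a flaky operation that succeeds on the third call
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # ok
```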

tmautils.db

tmautils.db contains storage backends. Also provided by tmautils-core.

Utility / Function What it does
BufferedWriter Batched writes to storage backends
DuckDbStore DuckDB storage interface
DuckDbBackend DuckDB backend for BufferedWriter
DuckLakeStore DuckLake storage interface
DuckLakeBackend DuckLake backend for BufferedWriter
ParquetBackend Parquet backend for BufferedWriter
DuckDbInetLpmIndex In-memory LPM index for fast IP lookups
pydantic_to_arrow() Convert Pydantic models to Arrow tables
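BufferedWriter's real API and backends live in tmautils.db; the sketch below is a stdlib-only illustration of the buffer-and-flush idea it implements. The TinyBufferedWriter class and its methods are hypothetical, and sqlite3 stands in for the DuckDB-based backends:

```python
import sqlite3

class TinyBufferedWriter:
    """Accumulate rows in memory and flush them in batches.

    A minimal illustration of the idea behind BufferedWriter; the
    real class supports multiple backends and a different API.
    """
    def __init__(self, conn, table, columns, batch_size=1000):
        self.conn, self.table, self.columns = conn, table, columns
        self.batch_size = batch_size
        self.buffer = []

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        placeholders = ", ".join("?" for _ in self.columns)
        self.conn.executemany(
            f"INSERT INTO {self.table} VALUES ({placeholders})", self.buffer
        )
        self.conn.commit()
        self.buffer.clear()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (ip TEXT, rtt_ms REAL)")
writer = TinyBufferedWriter(conn, "measurements", ("ip", "rtt_ms"), batch_size=2)
writer.add(("192.0.2.1", 12.3))
writer.add(("192.0.2.2", 45.6))   # hits batch_size: flushed in one executemany
writer.add(("192.0.2.3", 7.8))
writer.flush()                    # flush the remainder explicitly
```

Batching matters because each per-row insert pays transaction and round-trip overhead; one executemany per batch amortizes it.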

tmautils.bgp

Provided by tmautils-bgp.

Utility / Function What it does
PyasnUtil IP to ASN lookups (wraps pyasn)
ASdbCategoryUtil AS categorization using Stanford ASdb
CaidaAsOrgInfoUtil AS to Organization mapping using CAIDA's AS2Org

tmautils.rdap

Provided by tmautils-rdap.

Utility / Function What it does
RdapClient RDAP queries for domains, IPs, ASNs, entities, and nameservers

tmautils.dns

Provided by tmautils-dns.

Utility / Function What it does
AsyncDnsPythonUtil Async DNS resolution (wraps dnspython)
CzdsDownloadUtil Download ICANN CZDS zone files
OpenIntelZoneStreamUtil Subscribe to OpenINTEL ZoneStream
dns_msg_semantic_hash() Compute semantic hash of DNS messages

tmautils.enrich_ip

Provided by tmautils-enrich-ip.

Utility / Function What it does
IPApiBatchUtil Interact with ip-api.com's batch API
IPInfoLiteUtil Interact with IPinfo's Lite dataset
IpInfoPrivacyUtil Privacy detection using IPinfo's database
IpInfoCarrierUtil Mobile carrier lookup using IPinfo's database
ChromePrefetchUtil Check if an IP address belongs to Chrome Prefetch Proxy

tmautils.pki

Provided by tmautils-pki.

Utility / Function What it does
RevocationChecker Check certificate revocation (OCSP/CRL)
create_ssl_context() Create configurable SSL context
get_cert() Fetch TLS certificate from a server
get_cert_chain() Fetch full certificate chain from a server
fetch_issuer_cert() Fetch issuer certificate via AIA extension
fetch_issuer_chain() Build certificate chain from leaf to root

Writing Your First Program

As you've probably noticed by now, there is no "one way" to use tmautils: which utility/function you use will be driven by your use case, and the options supported vary by individual utility. However, I have tried to include useful documentation with each utility/function.

That said, there are two "patterns" you will see across the API:

  • All utilities support modification of storage/logging behavior through constructor kwargs. See Self-Contained Utilities below for details.
  • I/O-heavy APIs are written with an async-first approach, but in most cases a sync wrapper is provided. See Async-First Approach below for details.

The boring stuff

Why does this library exist?

This library grew out of code written by me (Sulyab) for my PhD research. At some point, it made sense to pull the reusable bits out into a common place and establish some directory structure for data, to keep track of what goes where. Days of debugging led to the addition of logging features, days of fighting with the GIL led to the addition of multiprocessing helpers, and so on.

Philosophy

There are three main tenets that shaped the design choices of this library:

  1. Do NOT reinvent the wheel (when sufficiently good wheels exist). There are several great Python packages out there that can help with specific Internet measurement analysis tasks, like pyasn for IP to ASN lookups. In such cases, it is preferable to write thin wrappers around such packages (such as PyasnUtil for pyasn).

  2. However, sometimes reinventing the wheel makes more sense. Usually this happens when the "best" Python package available for a use case is not "sufficiently" good according to certain criteria. For example, it may lack an async API or aggressive caching, two reasons why RevocationChecker exists instead of relying on pki-tools for certificate revocation lookups.

  3. Solve the problem at hand first; the "perfect" solution can come later. This is an important one, and the one that may affect you, the user, the most. As mentioned, each utility in this library was written to address specific needs at the time. As such, some utilities may support feature X but not feature Y, and some "pieces" may be more or less polished than others. However, I am always looking to improve the code and add more features, so please consider contributing!

Design Choices

While this library is a grab bag of utilities, I have tried to keep some design choices consistent throughout:

Self-Contained Utilities

Each utility in this library is designed to be "self-contained". Typically, you pass a working_root parameter when you instantiate a utility, say AbcUtil. This creates a directory named AbcUtil under working_root, with subdirectories such as logs, raw, and cache. (There will be another level of subdirectories if there are multiple instances of the same utility.) The following is an example:

from tmautils.dns import CzdsDownloadUtil
from pathlib import Path
czds = CzdsDownloadUtil(working_root=Path("/data"))
# Creates: /data/CzdsDownloadUtil/raw/, /data/CzdsDownloadUtil/logs/, etc.

Under the hood, most of this work is done by IOHelper, which takes care of the directory structure and instantiates LogHelper to take care of logging.

You can pass additional arguments to IOHelper to modify some default behavior, such as making subdirectories symlinks, and turning off file logging. Example:

czds = CzdsDownloadUtil(
    working_root=Path("/data"),
    # The following arguments are passed to IOHelper
    # make /data/CzdsDownloadUtil/raw a symlink to ~/downloads/czds
    raw_dir_symlink_to=Path("~/downloads/czds/"),
    # IOHelper in turn passes the following argument to LogHelper
    logging_kwargs={
        "file_level": None, # No file logging
    }
)

If you wish, you can delegate the storage/logging management of your custom program to IOHelper by instantiating it directly:

from tmautils.core import IOHelper
io = IOHelper(
    "MyProgram",
    working_root=Path("/data"),
    # by default, IOHelper will configure the subdirectories:
    # raw/, logs/, processed/, results/
)
# Access storage
out_path = io.raw / "hello.txt" # Returns a pathlib.Path object
out_path.write_text("Hello, World!")
# Use logger
io.logger.warning("I have no clue what I am doing.")

Async-First Approach

Many utilities in this library do I/O-heavy work (e.g., downloading zone files, making HTTP requests, checking certificate revocation status). For better concurrency, I/O-heavy utilities are written with an async-first approach. In such cases, the API also provides a sync version in case you do not want to bother with asyncio:

result = await util.fetch_data("param")    # Async
result = util.fetch_data_sync("param")     # Sync wrapper

Async methods use the base name (fetch_data()), and sync wrappers append _sync (fetch_data_sync()). The sync wrappers simply call run_coro_sync() on the async method.
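A minimal sketch of what such a wrapper does, assuming run_coro_sync() simply runs the coroutine to completion when no event loop is active (the tmautils version may handle already-running loops differently; ExampleUtil is hypothetical):

```python
import asyncio

def run_coro_sync(coro):
    """Run a coroutine to completion from synchronous code.

    Illustrative sketch of a run_coro_sync-style helper: refuse to
    block inside a running event loop, otherwise start one.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)  # no loop running: safe to start one
    raise RuntimeError("run_coro_sync() called from a running event loop")

class ExampleUtil:
    async def fetch_data(self, param):
        await asyncio.sleep(0)  # stand-in for real async I/O
        return f"data for {param}"

    def fetch_data_sync(self, param):  # sync wrapper, per the naming convention
        return run_coro_sync(self.fetch_data(param))

print(ExampleUtil().fetch_data_sync("example"))  # data for example
```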

DuckDB as the Middle Layer for Database Stuff

After experimenting with different database backends, the current direction is to use DuckDB as a unified intermediate layer. DuckDB supports several popular backends (CSV, Parquet, SQLite, and even remote sources) and allows executing SQL queries against them. This lets us write code at the level of DuckDB connections rather than add support for each backend separately.

New code uses the following flow: Pydantic models -> Arrow table -> DuckDB insert. There is a fair amount of instrumentation around this, which you can find in tmautils.db. For a good example of how to write a utility using this pattern, see OpenIntelZoneStreamUtil.

Some useful tools:

  • Writing rows one-by-one to a database is slow. BufferedWriter accumulates rows in memory and flushes them in batches, working with multiple backends (thanks to DuckDB).
  • For Longest Prefix Matching (LPM) lookups, we use DuckDbInetLpmIndex, which builds an in-memory LPM index to speed up lookups. Once built, the index can be used from Python code or DuckDB SQL (via UDFs).
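As a rough illustration of what an LPM lookup computes, here is a naive linear scan over stdlib ipaddress objects; the function and the prefix table are hypothetical, and DuckDbInetLpmIndex uses a proper index rather than a scan:

```python
import ipaddress

def lpm_lookup(prefixes, ip):
    """Longest-prefix match: return the value of the most specific
    prefix covering ip, or None if no prefix matches.

    Naive O(n) scan for illustration only; at scale you want an
    index (trie, sorted intervals, etc.), which is what
    DuckDbInetLpmIndex provides.
    """
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, value in prefixes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, value)
    return best[1] if best else None

# Hypothetical prefix-to-ASN table (documentation ranges, RFC 5737)
prefix_to_asn = {
    "192.0.2.0/24": 64500,
    "192.0.2.0/28": 64501,   # more specific: wins for 192.0.2.5
    "198.51.100.0/24": 64502,
}
print(lpm_lookup(prefix_to_asn, "192.0.2.5"))    # 64501
print(lpm_lookup(prefix_to_asn, "192.0.2.200"))  # 64500
```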

License

This project is licensed under MPL-2.0 (Mozilla Public License 2.0).

What this means in practice:

  • If you modify an existing file, your modifications must remain MPL-2.0.
  • You can license new files however you want. (But I won't merge them into tmautils unless they are MPL-2.0.)
  • You can use this code alongside code under other licenses.

Contributing

Contributions to tmautils are highly welcome! After all, there is much to do in Internet measurement research.

Since all the "real" code lives in subpackages, you probably want to contribute to their corresponding repos. If you want to submit a new subpackage to the tmautils metapackage, let me know.

Before writing code, please familiarize yourself with the philosophy and design choices, and try to follow them, or talk to me about how they are stupid and we should do things differently. I want this library to be the best version of itself.

AI Policy: I don't consider AI tool usage any different from IDE usage. This also means that you are responsible for the code you write and you should inspect every line of code written by an LLM. This policy is currently (slightly) relaxed for tests.
