Essential utilities for working with files, directories, and data.
🚪 Doorway
Essential data wrangling utilities.
Contributions are welcome!
Overview
Doorway is a general-purpose library for working with files and wrangling data. The common utilities have zero dependencies.
Get started with doorway by installing it with $ pip install doorway or by cloning this repository.
Experimental features require $ pip install "doorway[extras]"
Features
Doorway includes the following features:
- Partial "fast" hashing of files
- Stale file detection
- Atomic file writing and overwriting via a separate temporary file that is moved into place
- Path name modification
- Context managers
- Environment Vars & Validation
- String Colors
- String Formatting
- Downloading with a progress bar
- Deterministic dataset partitioning using
As well as the following experimental features:
- Arbitrary URI interop and handling
- Proxy downloader -- multithreading, auto-retry, auto proxy retrieval and rotation.
Atomic IO
Write or overwrite a file without the risk of leaving it half-written or corrupted if something fails midway.
from doorway import AtomicOpen, AtomicPath
# write to a temporary file, then move it into place once the context closes
with AtomicOpen("my/file.txt", "w") as f:
    f.write("Hello, World!")
# generate a temporary path and raise an exception if the file does not exist when the context closes
with AtomicPath("my/file.txt") as path:
    with open(path, "w") as f:  # `doorway.AtomicPath` + open ~= `doorway.AtomicOpen`
        f.write("Hello, World!")
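The write-then-move idea described above can be sketched with the standard library alone. This is a minimal illustration, not doorway's implementation: the name `atomic_open` is hypothetical, and the sketch assumes the temporary file lives in the destination directory so that the final `os.replace` stays on one filesystem.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_open(path, mode="w"):
    """Yield a handle to a temporary file; move it over `path` only on success.

    Hypothetical helper for illustration, not part of doorway.
    """
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # create the temp file next to the destination so the final rename
    # does not cross a filesystem boundary
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, mode) as f:
            yield f
        # the handle is closed at this point; swap the finished file into place
        os.replace(tmp_path, path)
    except BaseException:
        # the body failed: discard the partial file and leave `path` untouched
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the body of the `with` block raises, the destination keeps its previous contents and no temporary file is left behind.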
Fast Hashing
from doorway import hash_file
# normal hashing
hash_full = hash_file("my/large.file", hash_mode="full", hash_algo="md5")
# fast hashing, uses start, middle and end bytes
hash_fast = hash_file("my/large.file", hash_mode="fast", hash_algo="md5")
# `doorway.hash_bytes`, `doorway.hash_str`, `doorway.hash_file_is_valid` are also available
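To make the trade-off of "fast" hashing concrete, here is a rough sketch of hashing only the start, middle and end of a file. It is an assumption-laden illustration, not doorway's algorithm: the name `hash_file_partial`, the default chunk size, and the choice to mix the file size into the hash are all hypothetical.

```python
import hashlib
import os


def hash_file_partial(path, chunk_size=65536, algo="md5"):
    """Hash the file size plus the first, middle and last `chunk_size` bytes.

    Hypothetical helper for illustration, not part of doorway.
    """
    size = os.path.getsize(path)
    hasher = hashlib.new(algo)
    # mixing in the size makes files of different lengths hash differently
    hasher.update(str(size).encode("utf-8"))
    with open(path, "rb") as f:
        if size <= 3 * chunk_size:
            # small file: the three windows would overlap, so read everything
            hasher.update(f.read())
        else:
            for offset in (0, (size - chunk_size) // 2, size - chunk_size):
                f.seek(offset)
                hasher.update(f.read(chunk_size))
    return hasher.hexdigest()
```

The speed comes at a cost: a change that falls entirely outside the sampled windows (and keeps the size the same) goes undetected, so a partial hash is a quick change indicator, not an integrity check.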
Stale Files
from doorway import stalefile_generate, stalefile_is_stale
# generate a file if it is stale (doesn't exist or the hash is different)
stalefile_generate(
    make_file_fn=lambda path: open(path, "w").close(),
    path="my/file.txt",
    hash="some-long-hash-fffffffffffffffffffffffffff",
)
# check if a file is stale (doesn't exist or the hash is different)
is_stale = stalefile_is_stale(
    path="my/file.txt",
    hash="some-long-hash-fffffffffffffffffffffffffff",
)
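The staleness rule stated in the comments above (missing file, or a hash that differs) can be sketched as follows. This is only an illustration under assumptions: the helper names are hypothetical, and the sketch assumes the hash is a SHA-256 digest of the full file contents, which the source does not specify.

```python
import hashlib
import os


def file_sha256(path):
    """Return the SHA-256 hex digest of the full file contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def is_stale(path, expected_hash):
    """True when `path` is missing or its contents do not match `expected_hash`."""
    return not os.path.isfile(path) or file_sha256(path) != expected_hash


def ensure_fresh(make_file_fn, path, expected_hash):
    """Call `make_file_fn(path)` only when the file is stale, then verify it."""
    if is_stale(path, expected_hash):
        make_file_fn(path)
        if is_stale(path, expected_hash):
            raise RuntimeError(f"generated file does not match expected hash: {path}")
    return path
```

Because the generator function only runs when the check fails, repeated calls with an up-to-date file are cheap.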
Path Modification
from doorway import path_basename_modify
# `basename` is the full last part of the path e.g. "file.1.2.txt"
# `name` is the basename without the `extension` e.g. "file.1.2" or "file" (depending on the `name_contains_sep` parameter)
path = "my/path/to/file.1.2.txt"
p0 = path_basename_modify(
    path,
    ext="csv",
    name_prefix="a_",
    name_suffix="_b",
    name_contains_sep=True,
)
p1 = path_basename_modify(
    path,
    ext="csv",
    name_prefix="a_",
    name_suffix="_b",
    name_contains_sep=False,
)
assert p0 == "my/path/to/a_file.1.2_b.csv"
assert p1 == "my/path/to/a_file_b.csv"
# see `doorway.basename_modify` for the non-path version of this
# see `doorway.basename_split_ext` for splitting the name from the extension
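The effect of `name_contains_sep` is easiest to see in a small stand-alone sketch. The function below is hypothetical and not doorway's code; it assumes "/"-separated paths and that the only difference between the two modes is whether the name is split from the extension at the last dot or at the first dot.

```python
import posixpath


def modify_basename(path, ext=None, prefix="", suffix="", name_contains_sep=True):
    """Rebuild `path` with a new extension and a prefix/suffix around the name.

    Hypothetical helper for illustration, not part of doorway.
    """
    directory, basename = posixpath.split(path)
    if "." not in basename:
        name, old_ext = basename, ""
    elif name_contains_sep:
        # split at the last dot: "file.1.2.txt" -> ("file.1.2", "txt")
        name, old_ext = basename.rsplit(".", 1)
    else:
        # split at the first dot: "file.1.2.txt" -> ("file", "1.2.txt")
        name, old_ext = basename.split(".", 1)
    new_ext = old_ext if ext is None else ext
    new_basename = f"{prefix}{name}{suffix}"
    if new_ext:
        new_basename += f".{new_ext}"
    return posixpath.join(directory, new_basename)


path = "my/path/to/file.1.2.txt"
print(modify_basename(path, ext="csv", prefix="a_", suffix="_b", name_contains_sep=True))
print(modify_basename(path, ext="csv", prefix="a_", suffix="_b", name_contains_sep=False))
```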
File Downloading
Download files atomically (built on the Atomic IO utilities), with a progress bar if tqdm is installed.
from doorway import io_download
# download a file with a progress bar
io_download(
    src_url="https://example.com/file.txt",
    dst_path="my/file.txt",
    exists_mode="error",  # options are: "error" (default), "overwrite", "skip"
    progress=True,  # (default)
)
Context Managers
Context managers for various tasks.
import os
from doorway import ctx_temp_environ
# temporarily set an environment variable
with ctx_temp_environ(TEMP_VAR="temp_value"):
    assert os.environ.get("TEMP_VAR", None) == "temp_value"
assert os.environ.get("TEMP_VAR", None) is None
# most of the other context managers are similar
# all gracefully handle exceptions and undo their changes
# - ctx_temp_wd: temporarily update the working directory
# - ctx_temp_attr: temporarily update an attribute on an object
# - ctx_temp_sys_args: temporarily update sys.argv
# - ctx_no_stdout: temporarily suppress stdout
# - ctx_no_stderr: temporarily suppress stderr
# - ctx_do_undo: do something (run "do" fn), then undo it (run "undo" fn)
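The "do, then always undo" pattern that these context managers share can be sketched with `contextlib`. The example below is a hypothetical stand-in for a temporary-environment manager, written for illustration only; it is not doorway's implementation.

```python
import os
from contextlib import contextmanager


@contextmanager
def temp_environ(**overrides):
    """Temporarily set environment variables, restoring the old state on exit.

    Hypothetical helper for illustration, not part of doorway.
    """
    # remember the previous values; None marks "was not set"
    previous = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        # runs on normal exit and on exceptions alike
        for key, old in previous.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old
```

Putting the restore step in `finally` is what makes the manager undo its changes even when the body raises.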
Environment Variables
Parse environment variables and validate their values.
from doorway import EnvVar, ctx_temp_environ
MY_VAR = EnvVar.env_int(
    "MY_VAR",
    default=42,
    validator=EnvVar.validator_min_max(0, 100),
)
# get the value of the environment variable (and validate)
with ctx_temp_environ(MY_VAR="7"):
    assert MY_VAR.get() == 7
# get the default value if the variable is not set (and validate)
# - if no default was set, an exception is raised
assert MY_VAR.get() == 42
# override the default value for this call (and validate)
with ctx_temp_environ(MY_VAR="8"):
    assert MY_VAR.get(default=43) == 8
assert MY_VAR.get(default=43) == 43
# force a value (and validate)
with ctx_temp_environ(MY_VAR="9"):
    assert MY_VAR.get(override=99) == 99
assert MY_VAR.get(override=99) == 99
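For readers who want to see the parse-then-validate flow without the library, here is a minimal stand-alone sketch. The function `env_int` and its parameters are hypothetical, and the sketch assumes that defaults are validated in the same way as values read from the environment.

```python
import os

_MISSING = object()  # sentinel so that None can still be passed as a default


def env_int(name, default=_MISSING, min_value=None, max_value=None):
    """Read an integer environment variable and check it against optional bounds.

    Hypothetical helper for illustration, not part of doorway.
    """
    raw = os.environ.get(name)
    if raw is None:
        if default is _MISSING:
            raise KeyError(f"environment variable {name!r} is not set and has no default")
        value = default
    else:
        value = int(raw)  # raises ValueError for non-integer text
    if min_value is not None and value < min_value:
        raise ValueError(f"{name}={value} is below the minimum {min_value}")
    if max_value is not None and value > max_value:
        raise ValueError(f"{name}={value} is above the maximum {max_value}")
    return value
```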
ANSI Colors
from doorway import c
print(f"{c.lRED}This is red {c.RST}This is reset to defaults")
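Under the hood, colored terminal output is just a pair of ANSI escape sequences wrapped around the text. The sketch below shows the idea with plain strings; the constant and function names are hypothetical, and it assumes `lRED` corresponds to bright red (SGR code 91), which the source does not confirm.

```python
RED = "\033[91m"  # SGR code 91: bright red foreground (assumed mapping)
RST = "\033[0m"   # SGR code 0: reset all attributes


def colorize(text, color=RED):
    """Wrap `text` so it prints in `color` and resets afterwards."""
    return f"{color}{text}{RST}"


print(colorize("This is red"), "This is reset to defaults")
```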
Human Readable Byte Conversion
from doorway import fmt_bytes_to_human
assert fmt_bytes_to_human(1025**0, base=1024) == "1.000 B"
assert fmt_bytes_to_human(1025**0, base=1000) == "1.000 B"
assert fmt_bytes_to_human(1025**1, base=1024) == "1.001 KiB"
assert fmt_bytes_to_human(1025**1, base=1000) == "1.025 kB" # lowercase is more correct for kB in this case
assert fmt_bytes_to_human(1025**2, base=1024) == "1.002 MiB"
assert fmt_bytes_to_human(1025**2, base=1000) == "1.051 MB"
assert fmt_bytes_to_human(1025**3, base=1024) == "1.003 GiB"
assert fmt_bytes_to_human(1025**3, base=1000) == "1.077 GB"
assert fmt_bytes_to_human(1025**4, base=1024) == "1.004 TiB"
assert fmt_bytes_to_human(1025**4, base=1000) == "1.104 TB"
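The conversion itself is a repeated division by the base until the value drops below it. The following is a rough sketch of that loop, not doorway's implementation; the function name, the unit tables, and the fixed three decimal places are assumptions chosen to resemble the output format shown above.

```python
def bytes_to_human(num_bytes, base=1024, decimals=3):
    """Format a byte count using binary (KiB) or decimal (kB) unit prefixes.

    Hypothetical helper for illustration, not part of doorway.
    """
    if base == 1024:
        units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    elif base == 1000:
        units = ["B", "kB", "MB", "GB", "TB", "PB"]
    else:
        raise ValueError("base must be 1000 or 1024")
    value = float(num_bytes)
    index = 0
    # step up one unit at a time until the value fits, or units run out
    while value >= base and index < len(units) - 1:
        value /= base
        index += 1
    return f"{value:.{decimals}f} {units[index]}"


print(bytes_to_human(1536))             # binary prefixes
print(bytes_to_human(1500, base=1000))  # decimal prefixes
```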
Deterministic Dataset Partitioning
Split datapoints deterministically but pseudo-randomly into buckets.
This is useful when a dataset grows or shrinks over time but each datapoint must always land in the same split/bucket/shard. (A standard seeded shuffle CANNOT do this.)
The example below has only 9 datapoints, which does not show the full benefit of these functions. They are most useful for much larger datasets, where the assignment to buckets follows the expected statistical distribution.
from doorway import sharded, sharded_weighted
# create 3 buckets and assign values to them pseudo-randomly
shards = sharded(
    values=[1, 2, 3, 4, 5, 6, 7, 8, 9],
    num_shards=3,
    shard_key=lambda x: str(x),
)
print(shards)  # [[2, 5, 6, 7], [1, 3], [4, 8, 9]]
# create 3 buckets with relative weights 8:3:1 and assign values to them pseudo-randomly
train, test, split = sharded_weighted(
    values=[1, 2, 3, 4, 5, 6, 7, 8, 9],
    shard_weights=[8, 3, 1],
    shard_key=str,
)
print(train, test, split)  # [2, 6, 9, 7, 8, 1, 3] [4, 5] []
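A common way to get this kind of stable assignment is to hash each datapoint's key and take the result modulo the number of shards. The sketch below illustrates that general technique; it is not doorway's algorithm, the function names are hypothetical, and it will not reproduce the exact shard contents printed above.

```python
import hashlib


def shard_index(key, num_shards):
    """Map a string key to a shard number that depends only on the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_values(values, num_shards, shard_key=str):
    """Distribute `values` into `num_shards` lists based on a hash of each key."""
    shards = [[] for _ in range(num_shards)]
    for value in values:
        shards[shard_index(shard_key(value), num_shards)].append(value)
    return shards


small = shard_values(range(1, 10), num_shards=3)
large = shard_values(range(1, 1001), num_shards=3)
# every datapoint keeps its shard when the dataset grows
print(all(set(s) <= set(l) for s, l in zip(small, large)))
```

Because the shard depends only on the key, adding or removing other datapoints never moves an existing one, which is the property a seeded shuffle lacks.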
Proxy Downloader
# EXPERIMENTAL, API may change -- `pip install doorway[extras]`
from doorway.x import ProxyDownloader, proxy_download
# collect a default list of proxies and cache them
# - see: `proxies_register_scraper` for registering custom proxies, or pass in a list of proxies directly to this
downloader = ProxyDownloader()
# download a single file with a randomly chosen proxy
path = downloader.download(
    "https://example.com/file.txt",  # src
    "my/file.txt",  # dst
    exists_mode="error",  # options are: "error" (default), "overwrite", "skip"
)
# download multiple files in parallel while rotating proxies and retrying if any fail
failed = downloader.download_threaded(
    [
        ("https://example.com/file1.txt", "my/file1.txt"),
        ("https://example.com/file2.txt", "my/file2.txt"),
        ("https://example.com/file3.txt", "my/file3.txt"),
    ],
    exists_mode="error",  # options are: "error" (default), "overwrite", "skip"
    verbose=True,  # (default is False)
    ignore_failures=True,  # (default is False)
)
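The combination of threading, proxy rotation and retries can be illustrated with a small generic sketch. Everything here is hypothetical and not doorway's implementation: `download_all` takes a caller-supplied `fetch(src, dst, proxy)` function, so no real network or proxy API is assumed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def download_all(jobs, proxies, fetch, attempts=3, max_workers=4):
    """Run `fetch(src, dst, proxy)` for each (src, dst) job in a thread pool.

    Each failed attempt moves on to the next proxy in a shared rotation.
    Returns the list of jobs that still failed after `attempts` tries.
    Hypothetical helper for illustration, not part of doorway.
    """
    rotation = cycle(proxies)
    lock = threading.Lock()

    def next_proxy():
        # the iterator is shared between threads, so guard each step
        with lock:
            return next(rotation)

    def run(job):
        src, dst = job
        for _ in range(attempts):
            try:
                fetch(src, dst, next_proxy())
                return None  # success
            except Exception:
                continue  # try again with the next proxy
        return job  # all attempts failed

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run, jobs))
    return [job for job in results if job is not None]
```

Returning the failed jobs instead of raising mirrors the "ignore failures" style of use, where the caller decides what to do with whatever could not be downloaded.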
Proxy Issues?
The scrape logic used to obtain the proxy list will probably go out of date. You can override the default scrape logic by registering a new scrape function.
from doorway.x import proxies_register_scraper
@proxies_register_scraper(name='my_proxy_source', is_default=True)
def custom_proxy_scraper(proxy_type):
    # you should respect this setting
    assert proxy_type in ('all', 'http', 'https')
    # proxies is a list of dictionaries, where each dictionary only has one entry:
    # - the key is the protocol
    # - the value is the matching full url
    return [
        {'HTTP': 'http://<my-http-proxy>.com'},
        {'HTTPS': 'https://<my-https-proxy>.com'},
    ]
URI Handling
Interop between URIs from different locations, e.g. S3, Local, HTTP, etc.
# TODO: W.I.P
TODO
- More Docs
- More Examples
- More APIs
File details
Details for the file doorway-0.6.3.tar.gz.
File metadata
- Download URL: doorway-0.6.3.tar.gz
- Upload date:
- Size: 40.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 094438b30882e4194bf36162d899d513e7d671bb28e5d5bf4a082c4cff8f3f03 |
| MD5 | e85c42d14547308679ff6b61634d063d |
| BLAKE2b-256 | 8dcdaedf9c1018807f31fbb2e1bb960dae3a91f83e1f218bcc63c22081be83c4 |
File details
Details for the file doorway-0.6.3-py3-none-any.whl.
File metadata
- Download URL: doorway-0.6.3-py3-none-any.whl
- Upload date:
- Size: 49.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2be4e1a26dadd0606fda19c20242eeb0550d42212fffb94b6caa52bd68ee4802 |
| MD5 | e01617998c78f9acd69c08082c5105d8 |
| BLAKE2b-256 | f4c29e472981d9bfb14c3ab5e3a5d2a0ab54bab1779331c90526f44d31fa1fd4 |