A tool for displaying and manipulating Web Request+Response (WRR) files of Private Passive Web Archive (pwebarc) project
What?
wrrarms (pwebarc-wrrarms) is a tool for displaying and manipulating Web Request+Response (WRR) files of the Personal Private Passive Web Archive (pwebarc) project, as produced by the pWebArc browser extension.
Quickstart
Installation
- Install with:
  pip install pwebarc-wrrarms
  and run as
  wrrarms --help
- Alternatively, install it via Nix:
  nix-env -i -f ./default.nix
  wrrarms --help
- Alternatively, run without installing:
  alias wrrarms="python3 -m wrrarms"
  wrrarms --help
How to build a hierarchy of latest versions of all URLs
Assuming you keep your WRR dumps in ~/pwebarc/raw, you can generate a wget-like file hierarchy of symlinks under ~/pwebarc/latest pointing to the latest version of each URL in ~/pwebarc/raw with
wrrarms organize --symlink --latest --output hupq --to ~/pwebarc/latest --and "status|== 200C" ~/pwebarc/raw
or, using a slightly better format:
wrrarms organize --symlink --latest --output hupnq --to ~/pwebarc/latest --and "status|== 200C" ~/pwebarc/raw
Personally, I prefer the flat format, as I dislike deep file hierarchies and it makes new dumps easier to see and filter in the ranger file browser:
wrrarms organize --symlink --latest --output flat --to ~/pwebarc/latest --and "status|== 200C" ~/pwebarc/raw
If you have a lot of WRR files, all of the above commands could be rather slow, so if you want to keep your tree updated in real time you should instead use the two-stage --stdin0 pipeline shown in the examples section below.
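To make the "wget-like hierarchy" a bit more concrete, here is a rough Python sketch of how a URL could map to a relative path under a hupq-style layout. It follows the path_parts and wget_parts definitions given in the wrrarms get section, but it is only an illustration: the function names are hypothetical, and the query part, the abbrev limits, and the quoting done by pp_to_path are all left out.

```python
from urllib.parse import unquote, urlsplit


def path_parts(raw_path: str) -> list[str]:
    """Split on "/", unquote each component, drop empty ones, resolve "." and ".."."""
    parts: list[str] = []
    for component in raw_path.split("/"):
        component = unquote(component)
        if component in ("", "."):
            continue
        if component == "..":
            if parts:
                parts.pop()
            continue
        parts.append(component)
    return parts


def wget_parts(raw_path: str) -> list[str]:
    """path_parts plus "index.html" when the raw path ends in a slash."""
    parts = path_parts(raw_path)
    return parts + ["index.html"] if raw_path.endswith("/") else parts


def rough_hupq_path(url: str) -> str:
    """Simplified stand-in for the hupq format: hostname/wget_parts.wrr (no query, no quoting)."""
    split = urlsplit(url)
    return "/".join([split.hostname or ""] + wget_parts(split.path)) + ".wrr"
```

Under these assumptions, https://www.example.org/index.html would end up at www.example.org/index.html.wrr below the destination directory.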
How do I open WRR files with xdg-open? How do I generate previews for them?
See the script sub-directory for examples.
What is left TODO
- Rendering into static website mirrors a-la wget -k.
  Currently, the extension archives everything except WebSockets data, but wrrarms+pandoc only work well for dumps of mostly plain-text websites (which is the main use case I use this whole thing for: scrape a website, mass-convert everything to PDFs via some pandoc magic, then index those with recoll).
- Converter from HAR, WARC, and PCAP files into WRR.
- Converter from WRR to WARC.
- Data de-duplication between different WRR files.
- Non-dumb server with time+URL index and replay, i.e. a local Wayback Machine.
- Full text indexing and search.
Usage
wrrarms
A tool to pretty-print, compute and print values from, search, organize (programmatically rename/move/symlink/hardlink files), (WIP: check, deduplicate, and edit) pWebArc WRR (WEBREQRES, Web REQuest+RESponse) archive files.
Terminology: a reqres (Reqres when a Python type) is an instance of a structure representing an HTTP request+response pair with some additional metadata.
- options:
  - --version: show program's version number and exit
  - -h, --help: show this help message and exit
  - --markdown: show help messages formatted in Markdown
- subcommands:
  - {pprint,get,run,stream,find,organize,import}
    - pprint: pretty-print given WRR files
    - get: print values produced by computing given expressions on a given WRR file
    - run: spawn a process with generated temporary files produced by given expressions computed on given WRR files as arguments
    - stream: produce a stream of structured lists containing values produced by computing given expressions on given WRR files, a generalized wrrarms get
    - find: print paths of WRR files matching specified criteria
    - organize: programmatically rename/move/hardlink/symlink WRR files based on their contents
    - import: convert other archive formats into WRR files
wrrarms pprint
Pretty-print given WRR files to stdout.
- positional arguments:
  - PATH: inputs, can be a mix of files and directories (which will be traversed recursively)
- options:
  - -u, --unabridged: print all data in full
  - --abridged: shorten long strings for brevity (useful when you want to visually scan through batch data dumps) (default)
  - --stdin0: read zero-terminated PATHs from stdin, these will be processed after PATHs specified as command-line arguments
- error handling:
  - --errors {fail,skip,ignore}: when an error occurs:
    - fail: report failure and stop the execution (default)
    - skip: report failure but skip the reqres that produced it from the output and continue
    - ignore: skip, but don't report the failure
- filters:
  - --or EXPR: only print reqres which match any of these expressions...
  - --and EXPR: ... and all of these expressions, both can be specified multiple times, both use the same expression format as wrrarms get --expr, which see
- file system path ordering:
  - --paths-given-order: argv and --stdin0 PATHs are processed in the order they are given (default)
  - --paths-sorted: argv and --stdin0 PATHs are processed in lexicographic order
  - --paths-reversed: argv and --stdin0 PATHs are processed in reverse lexicographic order
  - --walk-fs-order: recursive file system walk is done in the order readdir(2) gives results (default)
  - --walk-sorted: recursive file system walk is done in lexicographic order
  - --walk-reversed: recursive file system walk is done in reverse lexicographic order
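The --stdin0 and --paths-* options above can be pictured with a short Python sketch. This is an illustration under stated assumptions, not wrrarms' actual code: the function names are hypothetical, and it assumes that argv and stdin PATHs are sorted together when a sorted mode is selected.

```python
def split_zero_terminated(data: bytes) -> list[str]:
    """Decode NUL-terminated paths, e.g. the output of `find -print0` or `wrrarms find -z`."""
    return [
        chunk.decode("utf-8", errors="surrogateescape")
        for chunk in data.split(b"\0")
        if chunk
    ]


def order_paths(argv_paths, stdin_paths, mode="given"):
    """Combine both path sources; stdin paths are placed after the argv ones."""
    paths = list(argv_paths) + list(stdin_paths)
    if mode == "sorted":
        return sorted(paths)
    if mode == "reversed":
        return sorted(paths, reverse=True)
    return paths  # "given": keep the order in which the paths arrived
```

In a real script the bytes would come from sys.stdin.buffer.read(). NUL termination is what makes this safe for file names containing spaces or newlines.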
wrrarms get
Compute output values by evaluating expressions EXPRs on a given reqres stored at PATH, then print them to stdout terminating each value as specified.
- positional arguments:
  - PATH: input WRR file path
- options:
  - -e EXPR, --expr EXPR: an expression to compute; can be specified multiple times in which case computed outputs will be printed sequentially, see also "output" options below; (default: response.body|es); each EXPR describes a state-transformer (pipeline) which starts from value None and evaluates a script built from the following:
    - constants and functions:
      - es: replace None value with an empty string ""
      - eb: replace None value with an empty byte string b""
      - false: replace None value with False
      - true: replace None value with True
      - missing: True if the value is None
      - 0: replace None value with 0
      - 1: replace None value with 1
      - not: apply logical not to value
      - len: apply len to value
      - str: cast value to str or fail
      - bytes: cast value to bytes or fail
      - bool: cast value to bool or fail
      - int: cast value to int or fail
      - float: cast value to float or fail
      - quote: URL-percent-encoding quote value
      - quote_plus: URL-percent-encoding quote value and replace spaces with + symbols
      - unquote: URL-percent-encoding unquote value
      - unquote_plus: URL-percent-encoding unquote value and replace + symbols with spaces
      - sha256: compute hex(sha256(value.encode("utf-8")))
      - ==: apply == arg, arg is cast to the same type as the current value
      - !=: apply != arg, similarly
      - <: apply < arg, similarly
      - <=: apply <= arg, similarly
      - >: apply > arg, similarly
      - >=: apply >= arg, similarly
      - prefix: take first arg characters
      - suffix: take last arg characters
      - abbrev: leave the current value as is if its length is less than or equal to arg characters, otherwise take first arg/2 followed by last arg/2 characters
      - abbrev_each: apply abbrev arg to each element in a value list
      - replace: replace all occurrences of the first argument in the current value with the second argument, casts arguments to the same type as the current value
      - pp_to_path: encode path_parts list into a POSIX path, quoting as little as needed
      - qsl_urlencode: encode parsed query list into a URL's query component str
      - qsl_to_path: encode query list into a POSIX path, quoting as little as needed
    - reqres fields, these work the same way as constants above, i.e. they replace current value of None with field's value, if reqres is missing the field in question, which could happen for response* fields, the result is None:
      - version: WEBREQRES format version; int
      - source: +-separated list of applications that produced this reqres; str
      - protocol: protocol; e.g. "HTTP/1.1", "HTTP/2.0"; str
      - request.started_at: request start time in seconds since 1970-01-01 00:00; Epoch
      - request.method: request HTTP method; e.g. "GET", "POST", etc; str
      - request.url: request URL, including the fragment/hash part; str
      - request.headers: request headers; list[tuple[str, bytes]]
      - request.complete: is request body complete?; bool
      - request.body: request body; bytes
      - response.started_at: response start time in seconds since 1970-01-01 00:00; Epoch
      - response.code: HTTP response code; e.g. 200, 404, etc; int
      - response.reason: HTTP response reason; e.g. "OK", "Not Found", etc; usually empty for Chromium and filled for Firefox; str
      - response.headers: response headers; list[tuple[str, bytes]]
      - response.complete: is response body complete?; bool
      - response.body: response body; Firefox gives raw bytes, Chromium gives UTF-8 encoded strings; bytes | str
      - finished_at: request completion time in seconds since 1970-01-01 00:00; Epoch
      - websocket: a list of WebSocket frames
    - derived attributes:
      - fs_path: file system path for the WRR file containing this reqres; str or None
      - qtime: alias for request.started_at; mnemonic: "reQuest TIME"; seconds since UNIX epoch; decimal float
      - qtime_ms: qtime in milliseconds rounded down to nearest integer; milliseconds since UNIX epoch; int
      - qtime_msq: three least significant digits of qtime_ms; int
      - qyear: year number of gmtime(qtime) (UTC year number of qtime); int
      - qmonth: month number of gmtime(qtime); int
      - qday: day of the month of gmtime(qtime); int
      - qhour: hour of gmtime(qtime) in 24h format; int
      - qminute: minute of gmtime(qtime); int
      - qsecond: second of gmtime(qtime); int
      - stime: response.started_at if there was a response, finished_at otherwise; mnemonic: "reSponse TIME"; seconds since UNIX epoch; decimal float
      - stime_ms: stime in milliseconds rounded down to nearest integer; milliseconds since UNIX epoch; int
      - stime_msq: three least significant digits of stime_ms; int
      - syear: similar to qyear, but for stime; int
      - smonth: similar to qmonth, but for stime; int
      - sday: similar to qday, but for stime; int
      - shour: similar to qhour, but for stime; int
      - sminute: similar to qminute, but for stime; int
      - ssecond: similar to qsecond, but for stime; int
      - ftime: alias for finished_at; seconds since UNIX epoch; decimal float
      - ftime_ms: ftime in milliseconds rounded down to nearest integer; milliseconds since UNIX epoch; int
      - ftime_msq: three least significant digits of ftime_ms; int
      - fyear: similar to syear, but for ftime; int
      - fmonth: similar to smonth, but for ftime; int
      - fday: similar to sday, but for ftime; int
      - fhour: similar to shour, but for ftime; int
      - fminute: similar to sminute, but for ftime; int
      - fsecond: similar to ssecond, but for ftime; int
      - status: "NR" if there was no response, str(response.code) + "C" if response was complete, str(response.code) + "N" otherwise; str
      - method: alias for request.method; str
      - raw_url: alias for request.url; str
      - net_url: raw_url with Punycode UTS46 IDNA encoded hostname, unsafe characters quoted, and without the fragment/hash part; this is the URL that actually gets sent to the server; str
      - scheme: scheme part of raw_url; e.g. http, https, etc; str
      - raw_hostname: hostname part of raw_url as it is recorded in the reqres; str
      - net_hostname: hostname part of raw_url, encoded as Punycode UTS46 IDNA; this is what actually gets sent to the server; ASCII str
      - hostname: net_hostname decoded back into UNICODE; this is the canonical hostname representation for which IDNA-encoding and decoding are bijective; str
      - rhostname: hostname with the order of its parts reversed; e.g. "www.example.org" -> "org.example.www"; str
      - port: port part of raw_url; int or None
      - netloc: netloc part of raw_url; i.e., in the most general case, <username>:<password>@<hostname>:<port>; str
      - raw_path: raw path part of raw_url as it is recorded in the reqres; e.g. "https://www.example.org" -> "", "https://www.example.org/" -> "/", "https://www.example.org/index.html" -> "/index.html"; str
      - path_parts: component-wise unquoted "/"-split raw_path with empty components removed and dots and double dots interpreted away; e.g. "https://www.example.org" -> [], "https://www.example.org/" -> [], "https://www.example.org/index.html" -> ["index.html"], "https://www.example.org/skipped/.//../used/" -> ["used"]; list[str]
      - wget_parts: path_parts + ["index.html"] if raw_path ends in a slash, path_parts otherwise; this is what wget does in wget -mpk; list[str]
      - raw_query: query part of raw_url (i.e. everything after the ? character and before the # character) as it is recorded in the reqres; str
      - query_parts: parsed (and component-wise unquoted) raw_query; list[tuple[str, str]]
      - query_ne_parts: query_parts with empty query parameters removed; list[tuple[str, str]]
      - oqm: optional query mark: ? character if query is non-empty, an empty string otherwise; str
      - fragment: fragment (hash) part of the url; str
      - ofm: optional fragment mark: # character if fragment is non-empty, an empty string otherwise; str
    - a compound expression built by piping (|) the above, for example:
      - net_url|sha256
      - net_url|sha256|prefix 4
      - path_parts|pp_to_path
      - query_parts|qsl_to_path|abbrev 128
      - response.complete: this will print the value of response.complete or None, if there was no response
      - response.complete|false: this will print response.complete or False
      - response.body|eb: this will print response.body or an empty byte string, if there was no response
- output:
  - --not-terminated: don't terminate output values with anything, just concatenate them (default)
  - -l, --lf-terminated: terminate output values with \n (LF) newline characters
  - -z, --zero-terminated: terminate output values with \0 (NUL) bytes
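The "state-transformer" description of EXPR can be hard to picture, so here is a minimal Python sketch of how such a pipeline might be evaluated. It is not the actual wrrarms implementation: evaluate, FUNCTIONS, and the flat fields dictionary are hypothetical, only a handful of the functions listed above are covered, and integer division is assumed for the arg/2 in abbrev.

```python
import hashlib


def abbrev(value, n):
    """Keep value as is if it is short enough, else keep the first and last n // 2 items."""
    if len(value) <= n:
        return value
    half = n // 2
    return value[:half] + value[len(value) - half:]


# A small subset of the documented functions; each takes the current value first.
FUNCTIONS = {
    "es": lambda v: "" if v is None else v,
    "eb": lambda v: b"" if v is None else v,
    "false": lambda v: False if v is None else v,
    "len": lambda v: len(v),
    "sha256": lambda v: hashlib.sha256(v.encode("utf-8")).hexdigest(),
    "prefix": lambda v, n: v[: int(n)],
    "abbrev": lambda v, n: abbrev(v, int(n)),
    # the argument is cast to the type of the current value before comparing
    "==": lambda v, arg: v == type(v)(arg),
    ">": lambda v, arg: v > type(v)(arg),
}


def evaluate(expr: str, fields: dict):
    """Start from None and thread the value through each "|"-separated step."""
    value = None
    for step in expr.split("|"):
        name, *args = step.split()
        if name in FUNCTIONS:
            value = FUNCTIONS[name](value, *args)
        elif value is None:
            # field names behave like constants: they only replace a None value
            value = fields.get(name)
    return value
```

With fields = {"status": "200C"}, evaluate("status|== 200C", fields) yields True, which is the shape of the filter used with --and throughout this document.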
wrrarms run
Compute output values by evaluating expressions EXPRs for each of NUM reqres stored at PATHs, dump the results into newly generated temporary files terminating each value as specified, spawn a given COMMAND with given arguments ARGs and the resulting temporary file paths appended as the last NUM arguments, wait for it to finish, delete the temporary files, exit with the return code of the spawned process.
- positional arguments:
  - COMMAND: command to spawn
  - ARG: additional arguments to give to the COMMAND
  - PATH: input WRR file paths to be mapped into new temporary files
- options:
  - -e EXPR, --expr EXPR: the expression to compute, can be specified multiple times, see wrrarms get --expr for more info; (default: response.body|es)
  - -n NUM, --num-args NUM: number of PATHs (default: 1)
- output:
  - --not-terminated: don't terminate output values with anything, just concatenate them (default)
  - -l, --lf-terminated: terminate output values with \n (LF) newline characters
  - -z, --zero-terminated: terminate output values with \0 (NUL) bytes
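The sequence described for wrrarms run (write temporary files, spawn, wait, clean up, propagate the exit code) can be sketched in Python roughly as follows. run_with_temp_files is a hypothetical name, and the byte strings passed in stand for the already computed expression values; how wrrarms itself does this is not shown here.

```python
import os
import subprocess
import tempfile


def run_with_temp_files(command: list[str], contents: list[bytes]) -> int:
    """Write each value to its own temporary file, append the paths to command, run it."""
    paths: list[str] = []
    try:
        for data in contents:
            fd, path = tempfile.mkstemp(prefix="wrr_sketch_")
            paths.append(path)  # remember the path first so cleanup always sees it
            with os.fdopen(fd, "wb") as f:
                f.write(data)
        # the temporary paths become the last len(contents) arguments
        return subprocess.run(command + paths).returncode
    finally:
        for path in paths:
            os.unlink(path)
```

For instance, run_with_temp_files(["diff", "-u"], [old_body, new_body]) would mirror the wrrarms run -n 2 -- diff -u example, given two byte strings old_body and new_body.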
wrrarms stream
Compute given expressions for each of given WRR files, encode them into a requested format, and print the result to stdout.
- positional arguments:
  - PATH: inputs, can be a mix of files and directories (which will be traversed recursively)
- options:
  - -u, --unabridged: print all data in full
  - --abridged: shorten long strings for brevity (useful when you want to visually scan through batch data dumps) (default)
  - --format {py,cbor,json,raw}: generate output in:
    - py: Pythonic Object Representation aka repr (default)
    - cbor: CBOR (RFC8949)
    - json: JavaScript Object Notation aka JSON; binary data can't be represented, UNICODE replacement characters will be used
    - raw: concatenate raw values; termination is controlled by *-terminated options
  - -e EXPR, --expr EXPR: an expression to compute, see wrrarms get --expr for more info on expression format, can be specified multiple times (default: []); to dump all the fields of a reqres, specify "."
  - --stdin0: read zero-terminated PATHs from stdin, these will be processed after PATHs specified as command-line arguments
- error handling:
  - --errors {fail,skip,ignore}: when an error occurs:
    - fail: report failure and stop the execution (default)
    - skip: report failure but skip the reqres that produced it from the output and continue
    - ignore: skip, but don't report the failure
- filters:
  - --or EXPR: only print reqres which match any of these expressions...
  - --and EXPR: ... and all of these expressions, both can be specified multiple times, both use the same expression format as wrrarms get --expr, which see
- --format=raw output:
  - --not-terminated: don't terminate raw output values with anything, just concatenate them
  - -l, --lf-terminated: terminate raw output values with \n (LF) newline characters (default)
  - -z, --zero-terminated: terminate raw output values with \0 (NUL) bytes
- file system path ordering:
  - --paths-given-order: argv and --stdin0 PATHs are processed in the order they are given (default)
  - --paths-sorted: argv and --stdin0 PATHs are processed in lexicographic order
  - --paths-reversed: argv and --stdin0 PATHs are processed in reverse lexicographic order
  - --walk-fs-order: recursive file system walk is done in the order readdir(2) gives results (default)
  - --walk-sorted: recursive file system walk is done in lexicographic order
  - --walk-reversed: recursive file system walk is done in reverse lexicographic order
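The difference between the three --walk-* modes can be illustrated with os.walk. This is only a sketch with a hypothetical walk_files helper: it sorts entries within each directory and lists a directory's files before descending into its sub-directories, which may differ from how wrrarms orders its own traversal.

```python
import os


def walk_files(root: str, order: str = "fs"):
    """Yield file paths under root; order is "fs", "sorted", or "reversed"."""
    for dirpath, dirnames, filenames in os.walk(root):
        if order == "sorted":
            dirnames.sort()  # in-place, so os.walk descends in this order
            filenames.sort()
        elif order == "reversed":
            dirnames.sort(reverse=True)
            filenames.sort(reverse=True)
        # with "fs", names stay in whatever order the file system returned them
        for name in filenames:
            yield os.path.join(dirpath, name)
```

The "fs" order is cheapest but not reproducible across file systems, which is one reason to request a sorted walk when the output feeds something order-sensitive.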
wrrarms find
Print paths of WRR files matching specified criteria.
- positional arguments:
  - PATH: inputs, can be a mix of files and directories (which will be traversed recursively)
- options:
  - --stdin0: read zero-terminated PATHs from stdin, these will be processed after PATHs specified as command-line arguments
- error handling:
  - --errors {fail,skip,ignore}: when an error occurs:
    - fail: report failure and stop the execution (default)
    - skip: report failure but skip the reqres that produced it from the output and continue
    - ignore: skip, but don't report the failure
- filters:
  - --or EXPR: only output paths to reqres which match any of these expressions...
  - --and EXPR: ... and all of these expressions, both can be specified multiple times, both use the same expression format as wrrarms get --expr, which see
- output:
  - -l, --lf-terminated: output absolute paths of matching WRR files terminated with \n (LF) newline characters to stdout (default)
  - -z, --zero-terminated: output absolute paths of matching WRR files terminated with \0 (NUL) bytes to stdout
- file system path ordering:
  - --paths-given-order: argv and --stdin0 PATHs are processed in the order they are given (default)
  - --paths-sorted: argv and --stdin0 PATHs are processed in lexicographic order
  - --paths-reversed: argv and --stdin0 PATHs are processed in reverse lexicographic order
  - --walk-fs-order: recursive file system walk is done in the order readdir(2) gives results (default)
  - --walk-sorted: recursive file system walk is done in lexicographic order
  - --walk-reversed: recursive file system walk is done in reverse lexicographic order
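The way --or and --and combine, together with the status attribute they are usually applied to, can be sketched as below. The helper names are hypothetical, a reqres is represented by a plain dict, and treating an empty --or list as "match everything" is an assumption rather than documented behavior.

```python
from typing import Callable, Optional, Sequence


def status(code: Optional[int], complete: bool) -> str:
    """"NR" without a response, otherwise the code followed by C(omplete) or N(ot complete)."""
    if code is None:
        return "NR"
    return f"{code}{'C' if complete else 'N'}"


def matches(
    reqres: dict,
    or_filters: Sequence[Callable[[dict], bool]],
    and_filters: Sequence[Callable[[dict], bool]],
) -> bool:
    """True if any --or filter matches (when some are given) and every --and filter matches."""
    any_ok = any(f(reqres) for f in or_filters) if or_filters else True
    return any_ok and all(f(reqres) for f in and_filters)
```

For example, the pair --and "status|== 200C" --and "response.body|len|> 1024" corresponds to two and_filters that must both hold for a file to be listed.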
wrrarms organize
Parse given WRR files into their respective reqres and then rename/move/hardlink/symlink each file to DESTINATION with the new path derived from each reqres' metadata.
Operations that could lead to accidental data loss are not permitted.
E.g. wrrarms organize --move will not overwrite any files, which is why the default --output contains %(num)d.
- positional arguments:
  - PATH: inputs, can be a mix of files and directories (which will be traversed recursively)
- options:
  - --dry-run: perform a trial run without actually performing any changes
  - -q, --quiet: don't log computed updates to stderr
  - -t DESTINATION, --to DESTINATION: destination directory, when unset each source PATH must be a directory which will be treated as its own DESTINATION
  - -o FORMAT, --output FORMAT: format describing generated output paths, an alias name or "format:" followed by a custom pythonic %-substitution string:
    - available aliases and corresponding %-substitutions:
      - default: %(syear)d/%(smonth)02d/%(sday)02d/%(shour)02d%(sminute)02d%(ssecond)02d%(stime_msq)03d_%(qtime_ms)s_%(method)s_%(net_url|sha256|prefix 4)s_%(status)s_%(hostname)s.%(num)d.wrr (default)
      - short: %(syear)d/%(smonth)02d/%(sday)02d/%(stime_ms)d_%(qtime_ms)s.%(num)d.wrr
      - surl: %(scheme)s/%(netloc)s/%(path_parts|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path)s
      - url: %(netloc)s/%(path_parts|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path)s
      - surl_msn: %(scheme)s/%(netloc)s/%(path_parts|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path)s_%(method)s_%(status)s.%(num)d.wrr
      - url_msn: %(netloc)s/%(path_parts|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path)s_%(method)s_%(status)s.%(num)d.wrr
      - shpq: %(scheme)s/%(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path|abbrev 120)s.wrr
      - hpq: %(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path|abbrev 120)s.wrr
      - shpq_msn: %(scheme)s/%(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path|abbrev 100)s_%(method)s_%(status)s.%(num)d.wrr
      - hpq_msn: %(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path|abbrev 100)s_%(method)s_%(status)s.%(num)d.wrr
      - shupq: %(scheme)s/%(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path|abbrev 120)s.wrr
      - hupq: %(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path|abbrev 120)s.wrr
      - shupq_msn: %(scheme)s/%(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path|abbrev 100)s_%(method)s_%(status)s.%(num)d.wrr
      - hupq_msn: %(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path|abbrev 100)s_%(method)s_%(status)s.%(num)d.wrr
      - srhupq: %(scheme)s/%(rhostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path|abbrev 120)s.wrr
      - rhupq: %(rhostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path|abbrev 120)s.wrr
      - srhupq_msn: %(scheme)s/%(rhostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path|abbrev 100)s_%(method)s_%(status)s.%(num)d.wrr
      - rhupq_msn: %(rhostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_parts|qsl_to_path|abbrev 100)s_%(method)s_%(status)s.%(num)d.wrr
      - shupnq: %(scheme)s/%(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_ne_parts|qsl_to_path|abbrev 120)s.wrr
      - hupnq: %(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_ne_parts|qsl_to_path|abbrev 120)s.wrr
      - shupnq_msn: %(scheme)s/%(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_ne_parts|qsl_to_path|abbrev 100)s_%(method)s_%(status)s.%(num)d.wrr
      - hupnq_msn: %(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path)s%(oqm)s%(query_ne_parts|qsl_to_path|abbrev 100)s_%(method)s_%(status)s.%(num)d.wrr
      - flat: %(hostname)s/%(wget_parts|abbrev_each 120|pp_to_path|replace / __|abbrev 120)s%(oqm)s%(query_ne_parts|qsl_to_path|abbrev 100)s_%(method)s_%(net_url|sha256|prefix 4)s_%(status)s.wrr
    - available substitutions:
      - num: number of times the resulting output path was encountered before; adding this parameter to your --output format will ensure all generated file names will be unique
      - all expressions of wrrarms get --expr, which see
  - --stdin0: read zero-terminated PATHs from stdin, these will be processed after PATHs specified as command-line arguments
- error handling:
  - --errors {fail,skip,ignore}: when an error occurs:
    - fail: report failure and stop the execution (default)
    - skip: report failure but skip the reqres that produced it from the output and continue
    - ignore: skip, but don't report the failure
- filters:
  - --or EXPR: only work on reqres which match any of these expressions...
  - --and EXPR: ... and all of these expressions, both can be specified multiple times, both use the same expression format as wrrarms get --expr, which see
- output:
  - --no-output: don't print anything to stdout (default)
  - -l, --lf-terminated: output absolute paths of newly produced files terminated with \n (LF) newline characters to stdout
  - -z, --zero-terminated: output absolute paths of newly produced files terminated with \0 (NUL) bytes to stdout
- action:
  - --move: move source files under DESTINATION (default)
  - --copy: copy source files to files under DESTINATION
  - --hardlink: create hardlinks from source files to paths under DESTINATION
  - --symlink: create symlinks from source files to paths under DESTINATION
- updates:
  - --keep: disallow replacements and overwrites for any existing files under DESTINATION (default); broken symlinks are allowed to be replaced; if source and target directories are the same then some files can still be renamed into previously non-existing names; all other updates are disallowed
  - --latest: replace files under DESTINATION if stime_ms for the source reqres is newer than the same value for reqres stored at the destination
- batching and caching:
  - --batch-number INT: batch at most this many IO actions together (default: 1024); making this larger improves performance at the cost of increased memory consumption; setting it to zero will force all IO actions to be applied immediately
  - --cache-number INT: cache stat(2) information about this many files in memory (default: 4096); making this larger improves performance at the cost of increased memory consumption; setting this to a too small number will likely force wrrarms into repeatedly performing lots of stat(2) system calls on the same files; setting this to a value smaller than --batch-number will not improve memory consumption very much since batched IO actions also cache information about their own files
  - --lazy: sets --cache-number and --batch-number to positive infinity; most useful in combination with --symlink --latest in which case it will force wrrarms to compute the desired file system state first and then perform disk writes in a single batch
- file system path ordering:
  - --paths-given-order: argv and --stdin0 PATHs are processed in the order they are given (default when --keep)
  - --paths-sorted: argv and --stdin0 PATHs are processed in lexicographic order
  - --paths-reversed: argv and --stdin0 PATHs are processed in reverse lexicographic order (default when --latest)
  - --walk-fs-order: recursive file system walk is done in the order readdir(2) gives results (default when --keep)
  - --walk-sorted: recursive file system walk is done in lexicographic order
  - --walk-reversed: recursive file system walk is done in reverse lexicographic order (default when --latest)
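To show how an --output format turns into file names, here is a small Python sketch that renders the short alias with plain %-substitution and derives the time fields as described in the wrrarms get section. It rests on assumptions: time_fields and render_paths are hypothetical helpers, timestamps are passed as Decimal seconds, and num is computed by counting how often the same path (rendered with num set to 0) was seen before, which may differ from what wrrarms does internally.

```python
import time
from collections import Counter
from decimal import Decimal

# The `short` alias, copied from the list of aliases above.
SHORT = "%(syear)d/%(smonth)02d/%(sday)02d/%(stime_ms)d_%(qtime_ms)s.%(num)d.wrr"


def time_fields(prefix: str, epoch: Decimal) -> dict:
    """Derive <prefix>time_ms, <prefix>time_msq, <prefix>year, ... from seconds since epoch."""
    tm = time.gmtime(int(epoch))
    ms = int(epoch * 1000)  # rounded down to whole milliseconds
    return {
        prefix + "time_ms": ms,
        prefix + "time_msq": ms % 1000,  # three least significant digits
        prefix + "year": tm.tm_year,
        prefix + "month": tm.tm_mon,
        prefix + "day": tm.tm_mday,
        prefix + "hour": tm.tm_hour,
        prefix + "minute": tm.tm_min,
        prefix + "second": tm.tm_sec,
    }


def render_paths(times, template: str = SHORT) -> list[str]:
    """Render one path per (qtime, stime) pair, bumping num on collisions."""
    seen: Counter = Counter()
    result = []
    for qtime, stime in times:
        fields = {**time_fields("q", qtime), **time_fields("s", stime)}
        key = template % {**fields, "num": 0}
        result.append(template % {**fields, "num": seen[key]})
        seen[key] += 1
    return result
```

Two reqres with identical timestamps would get paths ending in .0.wrr and .1.wrr, which is the role num plays in keeping generated names unique.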
wrrarms import
Parse data in each INPUT PATH into reqres and dump them under DESTINATION with paths derived from their metadata, similar to organize.
Internally, this shares most of the code with organize, but unlike organize this holds the whole reqres in memory until it's written out to disk.
- file formats:
  - {mitmproxy}
    - mitmproxy: convert other archive formats into WRR files
wrrarms import mitmproxy
- positional arguments:
  - PATH: inputs, can be a mix of files and directories (which will be traversed recursively)
- options:
  - --dry-run: perform a trial run without actually performing any changes
  - -q, --quiet: don't log computed updates to stderr
  - -t DESTINATION, --to DESTINATION: destination directory
  - -o FORMAT, --output FORMAT: format describing generated output paths, an alias name or "format:" followed by a custom pythonic %-substitution string; same as wrrarms organize --output, which see
  - --stdin0: read zero-terminated PATHs from stdin, these will be processed after PATHs specified as command-line arguments
- error handling:
  - --errors {fail,skip,ignore}: when an error occurs:
    - fail: report failure and stop the execution (default)
    - skip: report failure but skip the reqres that produced it from the output and continue
    - ignore: skip, but don't report the failure
- filters:
  - --or EXPR: only import reqres which match any of these expressions...
  - --and EXPR: ... and all of these expressions, both can be specified multiple times, both use the same expression format as wrrarms get --expr, which see
- output:
  - --no-output: don't print anything to stdout (default)
  - -l, --lf-terminated: output absolute paths of newly produced files terminated with \n (LF) newline characters to stdout
  - -z, --zero-terminated: output absolute paths of newly produced files terminated with \0 (NUL) bytes to stdout
- file system path ordering:
  - --paths-given-order: argv and --stdin0 PATHs are processed in the order they are given (default)
  - --paths-sorted: argv and --stdin0 PATHs are processed in lexicographic order
  - --paths-reversed: argv and --stdin0 PATHs are processed in reverse lexicographic order
  - --walk-fs-order: recursive file system walk is done in the order readdir(2) gives results (default)
  - --walk-sorted: recursive file system walk is done in lexicographic order
  - --walk-reversed: recursive file system walk is done in reverse lexicographic order
Examples
- Pretty-print all reqres in ../dumb_server/pwebarc-dump using an abridged (for ease of reading and rendering) verbose textual representation:
  wrrarms pprint ../dumb_server/pwebarc-dump
- Pipe response body from a given WRR file to stdout:
  wrrarms get ../dumb_server/pwebarc-dump/path/to/file.wrr
- Get first 4 characters of a hex digest of sha256 hash computed on the URL without the fragment/hash part:
  wrrarms get -e "net_url|sha256|prefix 4" ../dumb_server/pwebarc-dump/path/to/file.wrr
- Pipe response body from a given WRR file to stdout, but less efficiently, by generating a temporary file and giving it to cat:
  wrrarms run cat ../dumb_server/pwebarc-dump/path/to/file.wrr
  Thus wrrarms run can be used to do almost anything you want, e.g.
  wrrarms run less ../dumb_server/pwebarc-dump/path/to/file.wrr
  wrrarms run -- sort -R ../dumb_server/pwebarc-dump/path/to/file.wrr
  wrrarms run -n 2 -- diff -u ../dumb_server/pwebarc-dump/path/to/file-v1.wrr ../dumb_server/pwebarc-dump/path/to/file-v2.wrr
- List paths of all WRR files from ../dumb_server/pwebarc-dump that contain only complete 200 OK responses with bodies larger than 1K:
  wrrarms find --and "status|== 200C" --and "response.body|len|> 1024" ../dumb_server/pwebarc-dump
- Rename all WRR files in ../dumb_server/pwebarc-dump/default according to their metadata using --output default (see the wrrarms organize section for its definition; the default format is designed to be human-readable while causing almost no collisions, so the num substitution parameter almost always stays equal to 0, making things nice and deterministic):
  wrrarms organize ../dumb_server/pwebarc-dump/default
  alternatively, just show what would be done
  wrrarms organize --dry-run ../dumb_server/pwebarc-dump/default
- The output of wrrarms organize --zero-terminated can be piped into wrrarms organize --stdin0 to perform complex updates. E.g. the following will move new reqres from ../dumb_server/pwebarc-dump to ~/pwebarc/raw, renaming them with --output default; the for loop is there to preserve profiles:
  for arg in ../dumb_server/pwebarc-dump/* ; do
    wrrarms organize --zero-terminated --to ~/pwebarc/raw/"$(basename "$arg")" "$arg"
  done > changes
  then, we can reuse changes to symlink all new files from ~/pwebarc/raw to ~/pwebarc/all using --output hupq_msn, which would show most of the URL in the file name:
  wrrarms organize --stdin0 --symlink --to ~/pwebarc/all --output hupq_msn < changes
  and then, we can reuse changes again and use them to update ~/pwebarc/latest, filling it with symlinks pointing to the latest 200 OK complete reqres from ~/pwebarc/raw, similar to what wget -r would produce (except wget would do network requests and produce response bodies, while this will build a file system tree of symlinks to WRR files in ~/pwebarc/raw):
  wrrarms organize --stdin0 --symlink --latest --to ~/pwebarc/latest --output hupq --and "status|== 200C" < changes
- wrrarms organize --move is de-duplicating when possible, while --copy, --hardlink, and --symlink are non-duplicating when possible, i.e.:
  wrrarms organize --copy --to ~/pwebarc/copy1 ~/pwebarc/original
  wrrarms organize --copy --to ~/pwebarc/copy2 ~/pwebarc/original
  wrrarms organize --hardlink --to ~/pwebarc/copy3 ~/pwebarc/original
  # noops
  wrrarms organize --copy --to ~/pwebarc/copy1 ~/pwebarc/original
  wrrarms organize --hardlink --to ~/pwebarc/copy1 ~/pwebarc/original
  wrrarms organize --copy --to ~/pwebarc/copy2 ~/pwebarc/original
  wrrarms organize --hardlink --to ~/pwebarc/copy2 ~/pwebarc/original
  wrrarms organize --copy --to ~/pwebarc/copy3 ~/pwebarc/original
  wrrarms organize --hardlink --to ~/pwebarc/copy3 ~/pwebarc/original
  # de-duplicate
  wrrarms organize --move --to ~/pwebarc/all ~/pwebarc/original ~/pwebarc/copy1 ~/pwebarc/copy2 ~/pwebarc/copy3
  will produce ~/pwebarc/all which has each duplicated file stored only once. Similarly,
  wrrarms organize --symlink --output hupq_msn --to ~/pwebarc/pointers ~/pwebarc/original
  wrrarms organize --symlink --output shupq_msn --to ~/pwebarc/schemed ~/pwebarc/original
  # noop
  wrrarms organize --symlink --output hupq_msn --to ~/pwebarc/pointers ~/pwebarc/original ~/pwebarc/schemed
  will produce ~/pwebarc/pointers which has each symlink only once.
Advanced examples
- Pretty-print all reqres in ../dumb_server/pwebarc-dump by dumping their whole structure into an abridged Pythonic Object Representation (repr):
  wrrarms stream --expr . ../dumb_server/pwebarc-dump
  wrrarms stream -e . ../dumb_server/pwebarc-dump
- Pretty-print all reqres in ../dumb_server/pwebarc-dump using the unabridged verbose textual representation:
  wrrarms pprint --unabridged ../dumb_server/pwebarc-dump
  wrrarms pprint -u ../dumb_server/pwebarc-dump
- Pretty-print all reqres in ../dumb_server/pwebarc-dump by dumping their whole structure into the unabridged Pythonic Object Representation (repr) format:
  wrrarms stream --unabridged --expr . ../dumb_server/pwebarc-dump
  wrrarms stream -ue . ../dumb_server/pwebarc-dump
- Produce a JSON list of [<file path>, <time it finished loading in milliseconds since UNIX epoch>, <URL>] tuples (one per reqres) and pipe it into jq for indented and colored output:
  wrrarms stream --format=json -ue fs_path -e finished_at -e request.url ../dumb_server/pwebarc-dump | jq .
- Similarly, but produce a CBOR output:
  wrrarms stream --format=cbor -ue fs_path -e finished_at -e request.url ../dumb_server/pwebarc-dump | less
- Concatenate all response bodies of all the requests in ../dumb_server/pwebarc-dump:
  wrrarms stream --format=raw --not-terminated -ue "response.body|es" ../dumb_server/pwebarc-dump | less
- Print all unique visited URLs, one per line:
  wrrarms stream --format=raw --lf-terminated -ue request.url ../dumb_server/pwebarc-dump | sort | uniq
- Same idea, but using NUL bytes while processing, and printing two URLs per line:
  wrrarms stream --format=raw --zero-terminated -ue request.url ../dumb_server/pwebarc-dump | sort -z | uniq -z | xargs -0 -n2 echo
How to handle binary data
Trying to use response bodies produced by wrrarms stream --format=json is likely to result in garbled data, as JSON can't represent raw sequences of bytes, so binary data has to be encoded into UNICODE using replacement characters:
wrrarms stream --format=json -ue . ../dumb_server/pwebarc-dump/path/to/file.wrr | jq .
The most generic solution to this is to use --format=cbor instead, which would produce a verbose CBOR representation equivalent to the one used by --format=json but with binary data preserved as-is:
wrrarms stream --format=cbor -ue . ../dumb_server/pwebarc-dump/path/to/file.wrr | less
Or you could just dump raw response bodies separately:
wrrarms stream --format=raw -ue response.body ../dumb_server/pwebarc-dump/path/to/file.wrr | less
wrrarms get ../dumb_server/pwebarc-dump/path/to/file.wrr | less
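Why JSON output garbles binary bodies can be seen with nothing but Python's standard json module. The sketch below only demonstrates the general problem; to_json_safe is a hypothetical helper, and decoding with errors="replace" is an assumption about what "using replacement characters" amounts to, not a description of wrrarms internals.

```python
import json


def to_json_safe(body: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for anything that is not valid UTF-8."""
    return body.decode("utf-8", errors="replace")


def roundtrips(body: bytes) -> bool:
    """True if the body survives being pushed through JSON and re-encoded."""
    restored = json.loads(json.dumps(to_json_safe(body)))
    return restored.encode("utf-8") == body
```

A plain-text body round-trips unchanged, while something like the first bytes of a PNG file does not, because the original byte 0x89 is replaced by U+FFFD and cannot be recovered. Formats that carry byte strings natively, such as CBOR, avoid the problem.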
Hashes for pwebarc_wrrarms-0.8-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 16a3960f8eda7e8137150c570289dc77fa0092997a35d3b0425823cf7da7d30a
MD5 | 6ebe7930f742af3d662c737e8475f79f
BLAKE2b-256 | c1ea5d460200f990eee04e164f15f2822185f507b72abdf7ca9fd891565b1e19