A package to read parted files on disk.
Project description
SplitFileReader, SplitFileWriter, and HTTPFileReader
A collection of tools designed to produce and consume data across multiple files as though they were single,
cohesive files. The modules are fully file-like, and completely transparent to anything that a Python object might
require when manipulating files. zipfile.ZipFile
, tarfile.TarFile
, io.TextIOWrapper
, and many other classes
interact with all of these classes natively with no modification whatsoever.
SplitFileReader
A python module to transparently read files that have been split on disk, without combining them. Exposes the
readable
, read
, writable
, write
, tellable
, tell
, seekable
, seek
, open
and close
methods, as well
as a Context Manager and an Iterable.
Usage
Simple Example
List all of the files within a TAR file that has been broken into multiple parts.
import tarfile
from split_file_reader import SplitFileReader
filepaths = [
"./files/archives/files.tar.000",
"./files/archives/files.tar.001",
"./files/archives/files.tar.002",
"./files/archives/files.tar.003",
]
with SplitFileReader(filepaths) as sfr:
with tarfile.open(fileobj=sfr, mode="r") as tf:
for tff in tf.filelist:
print("File in archive: ", tff.name)
Text files.
The SplitFileReader
and SplitFileWriter
works only on binary data, but do support the use of the io.TextIOWrapper
.
The SplitFileReader
may also be given a glob for the filepaths.
import glob
from io import TextIOWrapper
from split_file_reader import SplitFileReader
file_glob = glob.glob("./files/plaintext/Adventures_In_Wonderland.txt.*")
with SplitFileReader(file_glob) as sfr:
with TextIOWrapper(sfr) as text_wrapper:
for line in text_wrapper:
print(line, end='')
SplitFileWriter
A python module to produce parted files on disk. Exposes the
readable
, read
, writable
, write
, tellable
, tell
, seekable
, seek
, open
and close
functions, as well
as a Context Manager.
Usage
This module is the conceptual counterpart to the SplitFileReader, in that it produces data that has been binary split.
Simple Example
with SplitFileWriter("split.zip.", 500_000) as sfw:
with zipfile.ZipFile(file=sfw, mode='w') as zipf:
for root, dirs, files in os.walk("./"):
for file in files:
if file.startswith("random_payload"):
zipf.write(os.path.join(root, file))
Complex file names and mixed file objects are permitted via generators instead.
Memory-only file-like operations
Sometimes it behooves one to keep track of files strictly in memory, ensuring that the various file sizes remain managable. This is a way to process each chunk after producing a large file.
import zipfile
from io import BytesIO
from split_file_reader import SplitFileWriter
chunks = []
xml_file = """<xml>payload</xml>"""
def gen(lst):
while True:
lst.append(BytesIO())
yield lst[-1]
with SplitFileWriter(gen(chunks), 1_000_000) as sfw:
with zipfile.ZipFile(sfw, "w") as zip_file:
zip_file.writestr("test.xml", xml_file)
for i, chunk in enumerate(chunks):
print(f'chunk {i}: {len(chunk.getvalue())}')
See this stackoverflow question for a for a specific example.
Alternatively, rather than keeping a list of every BytesIO object, after the yield one could manage that buffer on demand, such as by saving it to disk or sending it via HTTP.
HTTPFileReader
A python module to transparently read files that are hosted on a remote HTTP
server. Exposes the readable
, read
, writable
, write
, tellable
,
tell
, seekable
, seek
, open
and close
functions, as well as a Context
Manager.
This class is fast for skipping over large amounts of data and downloading only select parts. It is inefficient for extracting or processing entire files.
Usage
This module is the equivalent of builtins.open
for a file hosted on an HTTP server.
Simple Example
List all of the files within a Tar file that is remotely hosted.
import tarfile
import requests
from split_file_reader.http_file_reader import HTTPFileReader
with requests.Session() as session:
with HTTPFileReader(url="http://localhost/file.tar", session=session) as hfr:
with tarfile.open(fileobj=hfr, mode="r") as tf:
for tff in tf.filelist:
print("File in archive: ", tff.name)
Use cases
Github and Gitlab Large File Size
Github and Gitlab (as well as other file repositories) impose file size limits. By parting these files into
sufficiently small chunks, the SplitFileReader
will be able to make transparent use of them, as though they were a
single cohesive file. This will remove any requirements to host these files with pre-fetch or pre-commit scripts, or
any other "setup" mechanism to make use of them.
The HTTPFileReader
will be able to make direct use of the files as they are hosted on Gitlab or Github, without
downloading the entire archive.
These two classes can be mixed as well. For example, to access just one file from a Tar file that is hosted on Gitlab that has been split into parts:
with requests.Session() as ses:
with HTTPFileReader(url="https://gitlab.com/.../files.zip.000?inline=false",session=ses,) as hfr0, \
HTTPFileReader(url="https://gitlab.com/.../files.zip.001?inline=false",session=ses,) as hfr1, \
HTTPFileReader(url="https://gitlab.com/.../files.zip.002?inline=false",session=ses,) as hfr2:
with SplitFileReader(
files=[hfr0, hfr1, hfr2, ]
) as sfr:
with zipfile.ZipFile(file=sfr, mode='r') as zf:
zff = zf.open("random_payload_3.bin")
...
Random Access
These reader objects allows for random access of the data, allowing for Tar or Zip files to be extracted without first combining them. They do not, however, enforce any data encoding; wrap a TextIOWrapper if text is needed, or any other wrapper for any other sort of data encoding.
sfr = split_file_reader.open(filepaths)
with zipfile.ZipFile(sfr, "r") as zf:
print(zf.filelist)
sfr.close()
Or, for text files:
with SplitFileReader(filepaths) as sfr:
with io.TextIOWrapper(sfr, encoding="utf-8") as tiow:
for line in tiow:
print(line, end='')
Command Line Invocation
The module may be used via the command line for the reading of certain archive types. Presently, only Tar
and Zip formats are supported, and they must have been split via the split
command, or other binary split mechanism.
These files can be produced on the command line via split
, f.x., zip - *.bin | split -d -n3 - -b 50000 archive.zip.
usage: [-h] [-a {zip,z,tar,t,tgz,tbz,txz}] [-p <password>]
(-t | -l | -x <destination> | -r <filename>)
<filepath> [<filepath> ...]
Identify and process parted archives without manual concat. This command line
capability provides supports only Tar and Zip files; but not 7z or Rar.
Designed to work for files that have been split via the `split` utility, or
any other binary cut; but does not support Zip's built-in split capability.
The python module supports any arbitrarily split files, regardless of type.
positional arguments:
<filepath> In-order list of the parted files on disk. Use shell
expansion, such as ./files.zip.*
optional arguments:
-h, --help show this help message and exit
-a {zip,z,tar,t,tgz,tbz,txz}, --archive {zip,z,tar,t,tgz,tbz,txz}
Archive type, either zip, tar, tgz, tbz, or txz
-p <password>, --password <password>
Zip password, if needed
-t, --test Test the archive, using the module's built-in test.
-l, --list List all the payload files in the archive.
-x <destination>, --extract <destination>
Extract the entire archive to filepath <destination>.
-r <filename>, --read <filename>
Read out payload file contents to stdout.
Examples
To display the contents of the Zip files included in the test suite of this module, run
python3 -m split_file_reader -azip --list ./files/archives/files.zip.*
The bash autoexpansion of the *
wildcard will fill in the files in order, correctly. --list
will print out the
names of the payload files within the zip archive, and the -azip
flag instructs the module to expect the Zip
archive type.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file split_file_reader-0.1.4.tar.gz
.
File metadata
- Download URL: split_file_reader-0.1.4.tar.gz
- Upload date:
- Size: 34.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 45bf503999640dd8c38a200878a5f4e9d6486ac57b26c953c13b5eea4da8852a |
|
MD5 | aae0cfaee21ddba05350a736dddbafea |
|
BLAKE2b-256 | 9702eda254e6828ba5165bb5ad3ff3ba679ded1c700b5fa4a806d52ff7b51701 |
File details
Details for the file split_file_reader-0.1.4-py3-none-any.whl
.
File metadata
- Download URL: split_file_reader-0.1.4-py3-none-any.whl
- Upload date:
- Size: 56.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | b113844ff733707de95c874d48f390f8fa0ec82a4c33f61edd61f4bd37ffd625 |
|
MD5 | 3aa22d79fc5a0dc34eab5dc258493c89 |
|
BLAKE2b-256 | fd7456ef41594942ad21bd77e035fb906faa0eca2a37477b4ae8158539be85ab |