Virtually split a file based on a pattern.

vsplit

vsplit can be used to "virtually" split a file. This is similar to the UNIX split command, but with the key difference being that vsplit does not write the chunks of the file to disk. Instead, the offsets and lengths of the file chunks are computed and made available for downstream processing.

The use case is processing pieces of a file in parallel while keeping I/O overhead low. With the traditional UNIX split approach, the file data are first read from disk and then each chunk is written to a separate file. This requires I/O of the entire dataset twice before the intended processing can even begin. For a small local file this is rarely a problem, but if the storage is on a networked filesystem (as is common in HPC clusters) and the file is large, the additional overhead can be significant. An obvious alternative is to first find the offsets and lengths of the chunks that split would have written to disk and give these directly to the program doing the actual processing. That program can open the large file, seek to the offset of its chunk, and read just the number of bytes in its chunk.
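The seek-and-read step described above needs nothing beyond the Python standard library. The following is a minimal sketch; the function name read_chunk is made up for this illustration and is not part of vsplit:

```python
def read_chunk(filename, offset, length):
    """Return exactly one chunk: `length` bytes starting at byte `offset`."""
    with open(filename, "rb") as fp:
        fp.seek(offset)         # Jump straight to the start of the chunk.
        return fp.read(length)  # Read only this chunk's bytes.
```

Each parallel worker would call this with its own offset and length, so no worker reads more of the file than it needs.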

Installation

$ pip install vsplit

Or if you have uv installed, you can use it to run vsplit directly:

$ uvx vsplit ....

Example data file

The examples below use the SARS-CoV-2 sequences.fasta file downloaded (and uncompressed) from GISAID on April 19, 2025. The file is about 0.5TB (511,448,191,537 bytes).

Basic usage

You provide a desired number of chunks

The simplest usage is to just print details of the file chunks. You give a pattern to split the file on and the number of chunks you want and get TAB-separated output with (zero-based) file offsets and chunk lengths:

$ vsplit --pattern \> --n-chunks 10 sequences.fasta
0               51596653008
51596653008     51192803792
102789456800    51405791696
154195248496    51180888528
205376137024    51315241424
256691378448    51195830736
307887209184    51176833488
359064042672    51299463632
410363506304    51177325008
461540831312    49907360225

Note: the backslash in the \> is to prevent the > from redirecting the shell output, and I have adjusted the spacing on the TAB-separated output to better line things up.

You can pipe that output into awk '{sum += $2} END {print sum}' if you want to quickly confirm that the sum of all the chunk lengths is 511,448,191,537.
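If you would rather do this check in Python, here is a small sketch that also confirms that each chunk starts where the previous one ended. The function names are invented for this example, and the check only makes sense for plain output like that shown above (options that drop chunks or bytes will leave gaps):

```python
def parse_chunks(text):
    """Parse offset/length output into a list of (offset, length) tuples."""
    chunks = []
    for line in text.splitlines():
        if line.strip():
            # Only the first two whitespace-separated fields are used.
            offset, length = line.split()[:2]
            chunks.append((int(offset), int(length)))
    return chunks


def covers_file(chunks, file_size):
    """Return True if the chunks are contiguous and span `file_size` bytes."""
    expected_offset = 0
    for offset, length in chunks:
        if offset != expected_offset:
            return False
        expected_offset += length
    return expected_offset == file_size
```

For example, covers_file(parse_chunks(output), 511448191537) should be True for the ten-chunk output above, given that its lengths sum to the file size.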

You provide a desired chunk size

Instead of giving a number of chunks, you can give a chunk size:

$ vsplit --pattern \> --chunk-size 50000000000 sequences.fasta
0               50014110719
50014110719     50225394687
100239505406    50092671999
150332177405    50210952191
200543129596    50089251839
250632381435    50038522879
300670904314    50020840447
350691744761    50124010495
400815755256    50052318207
450868073463    50000001102
500868074565    10580116972

Note that the requested number of chunks or chunk size are just your suggestions. The actual values in the split output will depend heavily on where the pattern is present in the data as well as the suggested values (which determine where vsplit jumps to in the file to look for the next splitting pattern).
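To see why the requested size is only a suggestion, consider the general "jump, then scan for the pattern" idea. The sketch below works on an in-memory bytes object for brevity. It is not vsplit's implementation, it will not reproduce the exact offsets shown above, and the name virtual_split is invented for this example:

```python
def virtual_split(data, pattern, chunk_size):
    """Return (offset, length) pairs for chunks of `data` split at `pattern`.

    Illustrative only: from the start of each chunk, jump ahead by
    `chunk_size` bytes, then scan forward to the next `pattern`.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunks = []
    start = 0
    while start < len(data):
        # Look for the pattern at or after the suggested chunk end.
        next_start = data.find(pattern, start + chunk_size)
        if next_start == -1:
            next_start = len(data)  # No more matches: run to the end.
        chunks.append((start, next_start - start))
        start = next_start
    return chunks
```

Every chunk ends at the first match at or after the suggested boundary, so actual lengths are never shorter than the suggestion (except for the final chunk) and are often a little longer.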

Printing initial data from the chunk

If you want a look at the start of the chunks that are found, you can provide a --prefix length and that many bytes from the start of the chunk will be printed.

$ vsplit --prefix 20 --pattern \> --chunk-size 50000000000 sequences.fasta
0               50014110719 >hCoV-19/Australia/N
50014110719     50225394687 >hCoV-19/Chongqing/Y
100239505406    50092671999 >hCoV-19/England/ALD
150332177405    50210952191 >hCoV-19/USA/CA-CZB-
200543129596    50089251839 >hCoV-19/USA/VA-VCUV
250632381435    50038522879 >hCoV-19/England/ALD
300670904314    50020840447 >hCoV-19/Scotland/QE
350691744761    50124010495 >hCoV-19/France/GES-
400815755256    50052318207 >hCoV-19/USA/CA-CDC-
450868073463    50000001102 >hCoV-19/USA/TX-CDC-
500868074565    10580116972 >hCoV-19/USA/MN-CDC-

The file is not examined by line, so your pattern can span lines

In the above example, we are splitting on >. In a FASTA file it might be more reliable to split on the pattern "\n>" (i.e., a newline followed by a >). You can embed the newline in the pattern as follows:

$ vsplit --prefix 20 --pattern '"\n>"' --eval-pattern --chunk-size 50000000000 sequences.fasta
0               50014110718 '>hCoV-19/Australia/N'
50014110718     50048541694 '\n>hCoV-19/Chongqing/'
100062652412    50094859262 '\n>hCoV-19/Australia/'
150157511674    50089751550 '\n>hCoV-19/USA/IL-CDC'
200247263224    50000002084 '\n>hCoV-19/Brazil/SP-'
250247265308    50172081150 '\n>hCoV-19/USA/CA-CDC'
300419346458    50200663038 '\n>hCoV-19/USA/TN-VUM'
350620009496    50222236670 '\n>hCoV-19/USA/WI-CDC'
400842246166    50183771134 '\n>hCoV-19/England/PL'
451026017300    50220909566 '\n>hCoV-19/Mexico/VER'
501246926866    10201264671 '\n>hCoV-19/USA/SC-CDC'

In the above, the extra --eval-pattern argument tells vsplit to use Python's eval function to evaluate the string pattern. This allows you to use the regular backslash escaping to specify any string. The single quotes are used to make sure the double-quoted string is passed through to Python instead of being interpreted by your shell.
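To make the effect of evaluation concrete, here is what Python's built-in eval does with the text the shell passes through in the command above. This only illustrates eval itself and assumes nothing about vsplit's internals beyond what is described above:

```python
# The text the shell passes for --pattern '"\n>"': five characters,
# including the double quotes and a backslash followed by the letter n.
raw = r'"\n>"'
assert len(raw) == 5

# Evaluating that text as a Python expression gives a two-character
# string: a real newline followed by ">".
pattern = eval(raw)
assert pattern == "\n>"
assert len(pattern) == 2
```

Because eval runs whatever expression it is given, only use --eval-pattern with patterns you wrote yourself.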

Splitting on a byte pattern

You can also split on a byte pattern using Python's convention of putting a b before your string pattern:

$ vsplit --prefix 20 --pattern 'b"\n>"' --eval-pattern --chunk-size 50000000000 sequences.fasta
# Output is identical to the command above.

Skipping the initial chunk

In the previous examples where we split on \n>, you can see that the initial chunk (beginning >hCoV-19/Australia/N) does not start with the split pattern. That's because vsplit jumps to its first offset in your file without even looking at the initial data (that's kind of the point, after all). The default is to return the details of the initial chunk (before the first instance of the pattern is located). If you don't want this initial chunk to be returned, you can use --skip-zero-chunk:

$ vsplit --skip-zero-chunk --prefix 20 --pattern '"\n>"' --eval-pattern --chunk-size 50000000000 sequences.fasta
100062652412    50094859262 '\n>hCoV-19/Australia/'
150157511674    50089751550 '\n>hCoV-19/USA/IL-CDC'
200247263224    50000002084 '\n>hCoV-19/Brazil/SP-'
250247265308    50172081150 '\n>hCoV-19/USA/CA-CDC'
300419346458    50200663038 '\n>hCoV-19/USA/TN-VUM'
350620009496    50222236670 '\n>hCoV-19/USA/WI-CDC'
400842246166    50183771134 '\n>hCoV-19/England/PL'
451026017300    50220909566 '\n>hCoV-19/Mexico/VER'
501246926866    10201264671 '\n>hCoV-19/USA/SC-CDC'

Dropping a prefix from the matched pattern

Also in the previous examples where we split on \n>, we don't actually want the leading newline to be part of the chunk. Use --remove-prefix to give the number of leading characters of the matched pattern to drop from each returned chunk:

$ vsplit --remove-prefix 1 --prefix 20 --pattern '"\n>"' --eval-pattern --chunk-size 50000000000 sequences.fasta
0               50014110718 '>hCoV-19/Australia/N'
50014110719     50225394686 '>hCoV-19/Chongqing/Y'
100239505406    50092671998 '>hCoV-19/England/ALD'
150332177405    50210952190 '>hCoV-19/USA/CA-CZB-'
200543129596    50089251838 '>hCoV-19/USA/VA-VCUV'
250632381435    50038522878 '>hCoV-19/England/ALD'
300670904314    50020840446 '>hCoV-19/Scotland/QE'
350691744761    50124010494 '>hCoV-19/France/GES-'
400815755256    50052318206 '>hCoV-19/USA/CA-CDC-'
450868073463    50000001101 '>hCoV-19/USA/TX-CDC-'
500868074565    10580116972 '>hCoV-19/USA/MN-CDC-'

Getting chunk information into your program

So much for printing the basic information about chunks. The next step is to make use of this information in your program. You can obviously save the above TAB-separated output to a file, count the number of lines (i.e., chunks), and run your program once for each chunk, passing an argument to indicate which chunk to read. Then your program simply opens the chunk offset/length file, reads to the line with the offset and length for the given chunk, opens the file, seeks to its offset, and reads just the correct amount of data (as given by the length).

This is pretty straightforward, but it's a bit fiddly in several ways (mostly because you will need to keep track of how much data you have read).
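The bookkeeping in question looks roughly like the following when you want to iterate over the lines of a chunk by hand. This is a sketch using only the standard library, with a function name chosen for this example:

```python
def iter_chunk_lines(filename, offset, length):
    """Yield the lines of one chunk without reading past the chunk's end."""
    remaining = length
    with open(filename, "rb") as fp:
        fp.seek(offset)
        while remaining > 0:
            # readline(n) returns at most n bytes, so the final line is
            # truncated at the chunk boundary if the chunk ends mid-line.
            line = fp.readline(remaining)
            if not line:
                break  # Reached the end of the file early.
            remaining -= len(line)
            yield line
```

Forgetting to decrement the remaining byte count, or reading a whole line when fewer bytes remain, makes a worker stray into its neighbour's chunk.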

To make your life easier, vsplit offers several mechanisms to get data chunks to your program.

Two things are needed: 1) a mechanism for reading the chunk given the filename, and the chunk offset and length, and 2) a way to pass the filename, offset, and length to your program.

Reading chunks

If you are writing Python, you can use the FileChunk class to read your data. Given variables filename, offset, and length, you can write, for example:

from vsplit import FileChunk

with FileChunk(filename, offset, length) as fp:
    for line in fp:
        print(line)

# or

with FileChunk(filename, offset, length) as fp:
    print(fp.read())

Here fp is a file-like object that will return just the data from the chunk of the original (virtually split) file.

Passing chunk information to your script

To help with the issue of getting chunk information to your program, vsplit gives you three options. All three will print commands for you. You can store them in a file as a shell script, or pipe them into a shell process or into GNU parallel to run the commands directly.

Using command line arguments

Suppose you write a program process-chunk that accepts three arguments, --filename, --chunk-offset, and --chunk-length. You can ask vsplit to print commands to call your program for each chunk:

$ vsplit --command 'process-chunk --filename [F] --chunk-offset [O] --chunk-length [L]' \
         --pattern \> --n-chunks 3 sequences.fasta
process-chunk --filename sequences.fasta --chunk-offset 0 --chunk-length 170570581519
process-chunk --filename sequences.fasta --chunk-offset 170570581519 --chunk-length 170633512463
process-chunk --filename sequences.fasta --chunk-offset 341204093982 --chunk-length 170244097555

In the above, you use [x] markers on your command line to indicate things that should be replaced with per-chunk information. The full set of indicators is

[F]: The (shell-quoted) filename.
[I]: The (zero-based) chunk index.
[0I]: The (zero-based) chunk index, but padded with leading zeroes.
[L]: The chunk length.
[N]: The overall number of chunks found.
[O]: The chunk offset.
[C]: The (shell-quoted) name of the file with the TAB-separated chunk offset/lengths.

All of these are available, but any particular program will use only a subset. Note that there is no reliance on Python here: your program can be written in any language and can open the file and read its data however it likes. But if you are in Python you can use the FileChunk class described above.
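For illustration, a process-chunk program accepting those three arguments might look like the sketch below. Everything here is hypothetical: process-chunk is just the placeholder name used above, and counting > characters stands in for whatever real work your program does:

```python
import argparse


def parse_args(argv=None):
    """Parse the three arguments used in the example command above."""
    parser = argparse.ArgumentParser(description="Process one file chunk.")
    parser.add_argument("--filename", required=True)
    parser.add_argument("--chunk-offset", type=int, required=True)
    parser.add_argument("--chunk-length", type=int, required=True)
    return parser.parse_args(argv)


def count_records(filename, offset, length, block_size=1 << 20):
    """Count ">" bytes in the chunk, reading it in bounded blocks."""
    count = 0
    remaining = length
    with open(filename, "rb") as fp:
        fp.seek(offset)
        while remaining > 0:
            block = fp.read(min(remaining, block_size))
            if not block:
                break  # Reached the end of the file early.
            count += block.count(b">")
            remaining -= len(block)
    return count


def main(argv=None):
    args = parse_args(argv)
    print(count_records(args.filename, args.chunk_offset, args.chunk_length))
```

To use this as a script, call main() at the bottom of the file. Reading in blocks keeps memory use flat even when a chunk is many gigabytes.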

Using environment variables

Alternatively, you might prefer to run vsplit with --env and a command to have it print env (see man env) commands to set environment variables that your program can then examine and use to get its chunk:

$ vsplit --env --command process-chunk --pattern \> --n-chunks 3 sequences.fasta
env VSPLIT_INPUT_FILENAME=sequences.fasta VSPLIT_N_CHUNKS=3 \
    VSPLIT_CHUNK_OFFSETS_FILENAME=/var/folders/zw/s4wf68h12lxfcx1ggjmf0nq40000gn/T/tmp8tzjf1dk/chunks.tsv \
    VSPLIT_CHUNK_INDEX=0 VSPLIT_LENGTH=170570581519 VSPLIT_OFFSET=0 process-chunk
env VSPLIT_INPUT_FILENAME=sequences.fasta VSPLIT_N_CHUNKS=3 \
    VSPLIT_CHUNK_OFFSETS_FILENAME=/var/folders/zw/s4wf68h12lxfcx1ggjmf0nq40000gn/T/tmp8tzjf1dk/chunks.tsv \
    VSPLIT_CHUNK_INDEX=1 VSPLIT_LENGTH=170633512463 VSPLIT_OFFSET=170570581519 process-chunk
env VSPLIT_INPUT_FILENAME=sequences.fasta VSPLIT_N_CHUNKS=3 \
    VSPLIT_CHUNK_OFFSETS_FILENAME=/var/folders/zw/s4wf68h12lxfcx1ggjmf0nq40000gn/T/tmp8tzjf1dk/chunks.tsv \
    VSPLIT_CHUNK_INDEX=2 VSPLIT_LENGTH=170244097555 VSPLIT_OFFSET=341204093982 process-chunk

Note that the chunk offsets filename has been set by vsplit (in a directory created by the Python tempfile.mkdtemp function). You can pass an explicit filename via --chunk-offsets-filename, if you prefer. This file will eventually be removed by your operating system. vsplit cannot remove it because your script might need it (if you are relying on the chunk index variable as opposed to the offset and length variables).

If your script is in Python, there is a convenience function for reading the environment variables and getting you a FileChunk instance. E.g.:

from vsplit import chunk_from_env

with chunk_from_env() as fp:
    for line in fp:
        print(line)
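If your program cannot import vsplit, or is written in another language, the same information can be read straight from the environment. A minimal Python sketch, using the variable names shown in the env commands above (the function name is invented for this example):

```python
import os


def read_vsplit_env(environ=os.environ):
    """Return (filename, offset, length) from the variables set via --env."""
    return (
        environ["VSPLIT_INPUT_FILENAME"],
        int(environ["VSPLIT_OFFSET"]),
        int(environ["VSPLIT_LENGTH"]),
    )
```

Environment variables are always strings, so the offset and length must be converted to integers before seeking and reading.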

Using a SLURM job array

If you will run your script under SLURM, you can use the --sbatch argument to ask vsplit to print an sbatch command that will submit a job array to launch a task for each chunk:

$ vsplit --sbatch --chunk-offsets-filename chunks.tsv --command process-chunk --pattern \> \
         --n-chunks 100 sequences.fasta
env VSPLIT_INPUT_FILENAME=sequences.fasta VSPLIT_N_CHUNKS=100 VSPLIT_CHUNK_OFFSETS_FILENAME=chunks.tsv \
    sbatch --array=0-99 \
    --export VSPLIT_INPUT_FILENAME,VSPLIT_N_CHUNKS,VSPLIT_CHUNK_OFFSETS_FILENAME process-chunk

As in the previous example, the chunk details will be communicated by environment variables (--env is implied if you use --sbatch).

If your script is written in Python, you can use the chunk_from_env function to read its chunk, exactly as above. The chunk index is obtained from the SLURM_ARRAY_TASK_ID environment variable that SLURM sets for each job array task, and its offset and length are taken from the corresponding line in the chunks TSV file (chunks.tsv in this example).

When using --sbatch, you must explicitly specify (via --chunk-offsets-filename) the location for vsplit to store the chunk offset/length information. Obviously, this will need to be a file that will be accessible to your SLURM jobs once they are started, otherwise your script will not be able to determine its chunk details.
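For a script that does not use vsplit's Python helper, the lookup under SLURM can be sketched as follows. This assumes the chunks file holds one line per chunk, in chunk order, with the offset and length as its first two TAB-separated fields and no header line. The function name is invented for this example:

```python
import os


def slurm_chunk(environ=os.environ):
    """Return (filename, offset, length) for this SLURM array task."""
    index = int(environ["SLURM_ARRAY_TASK_ID"])
    chunks_filename = environ["VSPLIT_CHUNK_OFFSETS_FILENAME"]
    with open(chunks_filename) as fp:
        for line_number, line in enumerate(fp):
            if line_number == index:
                offset, length = line.split()[:2]
                return environ["VSPLIT_INPUT_FILENAME"], int(offset), int(length)
    raise IndexError(f"No line for chunk {index} in {chunks_filename}")
```

If the chunks file is not readable from the compute node, the open call fails, which is why its location matters.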

You can specify additional arguments to be given to sbatch using the --sbatch-args option (see man sbatch for the many options, including specification of output files). Because vsplit does not actually run your command (it just prints it), you can always insert additional sbatch arguments manually before running the command.

Additional details

If vsplit is slow

If vsplit is running for a long time (more than a handful of seconds) without printing anything, it means your pattern is not being found. The most likely cause of this is forgetting to use --eval-pattern to cause embedded backslash indicators to be evaluated.

Buffer size

vsplit has a --buffer-size argument that can be used to set the size of the chunks it reads when looking for your pattern. The default is the optimal filesystem I/O block size (obtained from os.stat by Python). If the pieces of your file (as delimited by your pattern) tend to be bigger than this, you can increase this value to possibly get a speed gain. vsplit will typically be very fast so you are not likely to need this option.

To give an example, the sequences.fasta file used in the examples above contains SARS-CoV-2 genome sequences that are each about 30,000 characters. The > separator between FASTA sequences will therefore only be found after reading ~30,000 characters, so if the default buffer size is 4096, there will typically be seven reads before the next > pattern is found. You can see the default buffer size (for the filesystem where you run the command from) in the help text for --buffer-size when you run vsplit --help.
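The arithmetic behind that estimate, and a way to inspect the block size yourself, can be sketched as below. It assumes the relevant os.stat field is st_blksize, which Python documents as the preferred block size for efficient filesystem I/O and which is not available on every platform. Whether vsplit reads exactly this field is an assumption:

```python
import os


def preferred_block_size(path=".", fallback=4096):
    """Return the filesystem's preferred I/O block size for `path`.

    Falls back to `fallback` on platforms without st_blksize.
    """
    return getattr(os.stat(path), "st_blksize", fallback)


def reads_per_record(record_length, buffer_size):
    """Approximate number of buffer-sized reads needed to span one record."""
    return record_length / buffer_size
```

With 30,000-character records, reads_per_record(30_000, 4096) is about 7.3, whereas a 32,000-byte buffer spans a whole record in a single read.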

Here's example timing for identifying 100 chunks using the default buffer size and a 32K one:

$ time vsplit --pattern \> --n-chunks 100 sequences.fasta > /dev/null

________________________________________________________
Executed in    3.90 secs    fish           external
   usr time    2.00 secs    0.44 millis    2.00 secs
   sys time    1.62 secs    2.21 millis    1.62 secs

$ time vsplit --buffer-size 32000 --pattern \> --n-chunks 100 sequences.fasta > /dev/null

________________________________________________________
Executed in   70.06 millis    fish           external
   usr time   38.81 millis    0.39 millis   38.42 millis
   sys time   14.16 millis    1.73 millis   12.44 millis

Maximum pattern length

vsplit reads chunks of the file in its search for your pattern. If the pattern is not found in a chunk, it will read more of the file and examine that. To guard against the situation where the chunk it reads ends in the middle of your pattern, it prepends the final part of the current chunk to the next chunk. By default, the number of bytes kept is the length of your pattern minus one. In the case of a fixed-length pattern, this will always be sufficient to ensure your pattern is not missed. But there is a --max-pattern-length option that you can set to give an alternate value. This will be useful when support for regular expression patterns is implemented.
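The carry-over technique described above can be sketched in a few lines. This illustrates the general idea for a fixed-length pattern and is not vsplit's actual code. The function name and default buffer size are chosen for this example:

```python
def find_pattern(fp, pattern, buffer_size=4096):
    """Return the offset of the next `pattern` in binary file `fp`, or -1.

    The search starts at the current position of `fp`.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    keep = len(pattern) - 1  # Bytes carried over from one read to the next.
    base = fp.tell()         # File offset of the first byte in `window`.
    window = b""
    while True:
        block = fp.read(buffer_size)
        if not block:
            return -1  # End of file without a match.
        window += block
        index = window.find(pattern)
        if index != -1:
            return base + index
        # Keep only the tail that could be the start of a match that
        # straddles the boundary with the next read.
        drop = max(0, len(window) - keep)
        base += drop
        window = window[drop:]
```

Keeping len(pattern) - 1 bytes is enough because a match that straddles a boundary can have at most that many of its bytes in the earlier read.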

Todo

Make it possible to use a regular expression to split the file.
