
Module to split file of any size into multiple chunks



filesplit

File splitting made easy for Python programmers!

A Python module that can split files of any size into multiple chunks and merge them back. It works on both structured and unstructured files. The split files are numbered from 1 to n as follows:

[filename]_1.ext, [filename]_2.ext, …, [filename]_n.ext
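The naming pattern above can be sketched with the standard library alone. The helpers `split_names` and `split_index` are hypothetical names introduced for illustration and are not part of filesplit; the sketch assumes the extension is whatever `os.path.splitext` reports.

```python
import os
import re


def split_names(source_path, n):
    """Build the names [filename]_1.ext ... [filename]_n.ext for a source path.

    Hypothetical helper: it only mirrors the documented naming pattern.
    """
    stem, ext = os.path.splitext(os.path.basename(source_path))
    return ["{0}_{1}{2}".format(stem, i, ext) for i in range(1, n + 1)]


def split_index(name):
    """Return the numeric suffix of a split file name, for numeric sorting."""
    match = re.search(r"_(\d+)(\.[^.]*)?$", name)
    if match is None:
        raise ValueError("not a split file name: {0}".format(name))
    return int(match.group(1))


names = split_names("/data/report.csv", 3)
ordered = sorted(["report_10.csv", "report_2.csv", "report_1.csv"], key=split_index)
```

Sorting by `split_index` rather than by plain string order keeps `report_10.csv` after `report_2.csv` when listing split files by hand.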

System Requirements

Operating System: Windows/Linux/Mac

Python version: 3

Changelog

v3.0.2

  • Bug fix for the module producing an infinite number of empty split files when the split size provided is greater than the file size

v3.0.1

  • Bug fix for the module throwing an exception when newline is set to True and include_header is set to False

v3.0.0

Here is what changed from previous versions

  • v3.0.0 is not backward compatible with previous versions. This is a deliberate, forward-looking change.

  • FileSplit class has been renamed to Filesplit

  • Added logging functionality

  • The splitbyencoding() method has been removed and its functionality has been moved to the split() method

  • Added support for splitting unstructured files including binary files.

  • Merge functionality has been introduced to merge the split files back.

  • Performance optimizations.

Usage

The module is available on PyPI and can be installed using pip:

pip install filesplit

Create an instance

from fsplit.filesplit import Filesplit

fs = Filesplit()

With the instance created, the following functionalities can be leveraged.

split()

Method that splits the file into multiple chunks. This method accepts the following arguments:

file (str) - Path to the source file (Required)

split_size (int) - Split size in bytes (Required). Each split will correspond to the size provided.

output_dir (str) - Directory to write the split files (Optional). If not provided, the current directory will be used.

callback (callable) - Callback function (Optional). The callback should accept two arguments, func(str, int): the full path to the split file and the split file size in bytes. It is called after each split file is written.

example:

def split_cb(f, s):
    print("file: {0}, size: {1}".format(f, s))

fs.split(file="/path/to/source/file", split_size=900000, output_dir="/path/to/output/dir", callback=split_cb)
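Instead of printing, a callback can also accumulate the reported splits for later use. The sketch below is one way to do that; `SplitCollector` is a hypothetical name, and it assumes split() accepts any callable taking (path, size), as the argument description suggests.

```python
class SplitCollector:
    """Callable that records every (path, size) pair it is called with."""

    def __init__(self):
        self.splits = []

    def __call__(self, path, size):
        # Same two arguments as the documented callback: path, size in bytes.
        self.splits.append((path, size))

    @property
    def total_bytes(self):
        """Sum of all split sizes reported so far."""
        return sum(size for _, size in self.splits)


collector = SplitCollector()

# With the Filesplit instance created earlier, the collector would be passed as:
# fs.split(file="/path/to/source/file", split_size=900000,
#          output_dir="/path/to/output/dir", callback=collector)
```

After a split, `collector.splits` holds the reported paths and sizes. In the default binary mode one would expect `collector.total_bytes` to equal the source file size, but that expectation is an assumption and not something the description guarantees.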

By default, split() works in binary mode and preserves the encoding and line endings of the source file, which suits most use cases. For finer control over the splits, the following additional keyword arguments can be passed:

newline (bool) - (Optional) When set to True, split files will not carry any incomplete lines. This flag can be helpful when splitting a structured file.

include_header (bool) - (Optional) When set to True, the first line of the source file is treated as a header and each split will include it. This flag can be helpful when splitting a structured file.

encoding (str) - (Optional) When provided, the splits are handled in text mode with the specified encoding. The file is read and the split files are written with the same encoding. This can be useful for text files and requires the source file encoding to be known beforehand.

split_file_encoding (str) - (Optional) Set this when the split files should use a different encoding from the source. Note: if split_file_encoding is specified, encoding must be specified as well.
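The rule that split_file_encoding requires encoding can be checked before split() is called. The following is a minimal sketch; `build_split_options` is a hypothetical helper, not part of filesplit, and it only assembles the keyword arguments described above.

```python
def build_split_options(newline=False, include_header=False,
                        encoding=None, split_file_encoding=None):
    """Collect the optional split() keyword arguments into a dict.

    Hypothetical helper: enforces the documented rule that
    split_file_encoding needs encoding to be set as well.
    """
    if split_file_encoding is not None and encoding is None:
        raise ValueError("split_file_encoding requires encoding to be set as well")

    options = {}
    if newline:
        options["newline"] = True
    if include_header:
        options["include_header"] = True
    if encoding is not None:
        options["encoding"] = encoding
    if split_file_encoding is not None:
        options["split_file_encoding"] = split_file_encoding
    return options


# Example: split a UTF-8 structured file on line boundaries, repeating the header.
options = build_split_options(newline=True, include_header=True, encoding="utf-8")

# The options would then be forwarded to the documented method:
# fs.split(file="/path/to/source/file", split_size=900000,
#          output_dir="/path/to/output/dir", **options)
```

Only the flags that are actually set are forwarded, so calling the helper with no arguments leaves split() on its default binary mode.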

The split process creates a manifest file fs_manifest.csv in the output directory. This manifest file is required for the merge operation.
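Because the merge depends on the manifest, it can be worth confirming that the file is present before moving or merging the splits. This sketch uses only the standard library; `find_manifest` is a hypothetical helper, and the only fact taken from the description is the file name fs_manifest.csv.

```python
import os

MANIFEST_NAME = "fs_manifest.csv"  # name stated in the description above


def find_manifest(split_dir):
    """Return the manifest path inside split_dir, or None if it is missing."""
    candidate = os.path.join(split_dir, MANIFEST_NAME)
    return candidate if os.path.isfile(candidate) else None
```

A return value of None suggests that the directory was not produced by split(), or that the manifest was moved or deleted.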

merge()

Method that merges the split files into a single file. This method requires the manifest file generated by split(), along with the split files, and accepts the following arguments:

input_dir (str) - Path to the directory containing split files (Required)

output_file (str) - Path to the final output file (Optional). If not provided, the final merged filename is derived from the split filename and placed in the same input directory.

manifest_file (str) - Path to the manifest file (Optional). If not provided, the process will look for the file within input_dir.

callback (callable) - Callback function (Optional). The callback should accept two arguments, func(str, int): the full path to the final output file and the file size in bytes.

cleanup (bool) - (Optional) If True, all the split files and the manifest file will be deleted after the merge, leaving behind only the merged file.

example:

def merge_cb(f, s):
    print("file: {0}, size: {1}".format(f, s))

fs.merge(input_dir="/path/to/split/files/dir", callback=merge_cb)
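One way to check a split and merge round trip is to compare checksums of the original and merged files. This is a sketch using the standard library hashlib module; `sha256_of` and `files_match` are hypothetical helpers, and the expectation that the two files match byte for byte assumes the default binary mode without include_header.

```python
import hashlib


def sha256_of(path, block_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def files_match(first_path, second_path):
    """True when both files have the same SHA-256 digest."""
    return sha256_of(first_path) == sha256_of(second_path)
```

Reading in blocks keeps memory use flat, which matters for the large files this module is meant to handle.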
