Project description

large_file_splitter

Overview

  • This tool treats a file too large to fit in memory as a single string, performs a split on that string, and stores the resulting chunks as separate files. A minimal sketch of the underlying idea is shown below.
  • Documentation is under construction.
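
For intuition, here is a minimal sketch of the general streaming-split technique (an illustration only, not this library's actual implementation): read the file in cache_size-byte blocks, carry over the tail of the buffer so a delimiter spanning two blocks is still detected, and yield complete chunks one at a time (delimiter dropped, as with div_mode = "delete").

def stream_split(path, split_bytes, cache_size):
	"""Illustrative generator: split a large file on a binary delimiter."""
	buf = b""
	with open(path, "rb") as f:
		while True:
			block = f.read(cache_size)
			if not block:
				break
			buf += block
			parts = buf.split(split_bytes)
			# Every part except the last is a complete chunk; the last part
			# may still end with a partial delimiter, so carry it over.
			# (Note: buf grows until a delimiter appears in the stream.)
			for part in parts[:-1]:
				yield part
			buf = parts[-1]
	yield buf  # whatever follows the final delimiter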

Usage

import large_file_splitter

# Split a large file [large_file_splitter]
large_file_splitter.split(
	"dummy_large_file.txt",	# File to split
	split_str = "SPLIT_MARK\r\n",	# Delimiter string (processed internally as binary, so a single character is not recommended: it may split a multi-byte character in the middle)
	div_mode = "start",	# How to handle the delimiter (delete: not included in the output; start: prepended to the next chunk; end: appended to the previous chunk)
	output_filename_frame = "./output/div_%d.txt",	# Output filename template (%d is replaced with an integer automatically)
	cache_size = 10 * 1024 * 1024	# Size of the data block held in memory at once (in bytes; available memory should be at least several times this size)
)
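
As a concrete illustration of the div_mode options, the following hypothetical demo writes a small input file and splits it (the expected outputs follow from the documented semantics; chunk numbering is assumed to start at 0):

import os
import large_file_splitter

# Hypothetical demo input: three chunks separated by two delimiters.
with open("dummy_large_file.txt", "w", newline="") as f:
	f.write("aaaSPLIT_MARK\r\nbbbSPLIT_MARK\r\nccc")

os.makedirs("./output", exist_ok=True)	# make sure the output directory exists
large_file_splitter.split(
	"dummy_large_file.txt",
	split_str = "SPLIT_MARK\r\n",
	div_mode = "delete",	# try "start" or "end" to see where the delimiter is kept
	output_filename_frame = "./output/div_%d.txt",
	cache_size = 1024
)
# Expected outputs with div_mode = "delete" (assuming numbering starts at 0):
#   ./output/div_0.txt -> "aaa"
#   ./output/div_1.txt -> "bbb"
#   ./output/div_2.txt -> "ccc"
# With "start" the delimiter would lead div_1 and div_2; with "end" it would trail div_0 and div_1.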

Usage (in a for loop)

import large_file_splitter

# Split a large file (for-loop version) [large_file_splitter]
for one_str in large_file_splitter.for_split(
	"dummy_large_file.txt",	# File to split
	split_str = "SPLIT_MARK\r\n",	# Delimiter string (processed internally as binary, so a single character is not recommended: it may split a multi-byte character in the middle)
	div_mode = "start",	# How to handle the delimiter (delete: not included in the output; start: prepended to the next chunk; end: appended to the previous chunk)
	cache_size = 1024	# Size of the data block held in memory at once (in bytes; available memory should be at least several times this size)
):
	# Some processing using the string `one_str`
	print(one_str)
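
The generator form is convenient for computing streaming statistics without writing intermediate files. A sketch using the documented for_split parameters:

import large_file_splitter

# Count chunks and total characters without loading the whole file at once.
n_chunks = 0
n_chars = 0
for one_str in large_file_splitter.for_split(
	"dummy_large_file.txt",
	split_str = "SPLIT_MARK\r\n",
	div_mode = "delete",
	cache_size = 10 * 1024 * 1024
):
	n_chunks += 1
	n_chars += len(one_str)
print(n_chunks, "chunks,", n_chars, "characters")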

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
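
Alternatively, the package can be installed from PyPI with pip (the project name is inferred from the distribution files below):

pip install large-file-splitter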

Source Distribution

large-file-splitter-0.2.0.tar.gz (4.3 kB)

Built Distribution

large_file_splitter-0.2.0-py3-none-any.whl (5.4 kB)

File details

Details for the file large-file-splitter-0.2.0.tar.gz.

File metadata

  • Download URL: large-file-splitter-0.2.0.tar.gz
  • Upload date:
  • Size: 4.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.64.1 CPython/3.8.8

File hashes

Hashes for large-file-splitter-0.2.0.tar.gz

  • SHA256: 669fa633eac6d037ce408cb8c18884d8f5b30d77d515729d611e18bb9101f132
  • MD5: f52e135c16a61e0665af7c88cdf3d593
  • BLAKE2b-256: 24f10df82eb4383175159fb5192a951f66cba7afeb044a7aa2560b33043428c0

See more details on using hashes here.

File details

Details for the file large_file_splitter-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: large_file_splitter-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 5.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.64.1 CPython/3.8.8

File hashes

Hashes for large_file_splitter-0.2.0-py3-none-any.whl

  • SHA256: b44e9f87d1ad3297dd075c86d3a0d92bb7f69bc27aef307974d6c7670b84d9e5
  • MD5: d9e36f43daa81b469cfa09a5b5221b2b
  • BLAKE2b-256: 501596a4815381b4e08d5eb544ab8f712aaeb07effd416b08b2488b1d23ee5ac

See more details on using hashes here.
