large_file_splitter
Overview
- Treats a file too large to fit in memory as a single string, splits that string on a separator, and stores each resulting piece as a separate file.
- This description is still under construction.
Usage
import large_file_splitter

# Split a large file into separate files [large_file_splitter]
large_file_splitter.split(
    "dummy_large_file.txt",  # File to split
    # Separator string. It is matched as bytes internally, so a single-character
    # separator is not recommended: it may split inside a multi-byte character.
    split_str="SPLIT_MARK\r\n",
    # Separator handling. delete: dropped from the output; start: prepended to
    # the next chunk; end: appended to the previous chunk.
    div_mode="start",
    # Output filename template (an integer is inserted automatically at %d)
    output_filename_frame="./output/div_%d.txt",
    # Size in bytes of the data block processed in memory; available memory
    # must be at least several times this size.
    cache_size=10 * 1024 * 1024,
)
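The three `div_mode` settings and the byte-level matching mentioned in the comments can be pictured on a small in-memory value. The following is only an illustrative sketch, not the package's implementation: `split_with_mode` is a hypothetical helper, and its treatment of empty pieces simply follows `bytes.split`, which the description above does not specify.

```python
from typing import List


def split_with_mode(data: bytes, sep: bytes, mode: str = "delete") -> List[bytes]:
    """Hypothetical helper: split `data` on `sep`, attaching `sep` according to `mode`."""
    if mode not in ("delete", "start", "end"):
        raise ValueError(f"unknown mode: {mode!r}")
    pieces = data.split(sep)
    last = len(pieces) - 1
    result = []
    for i, piece in enumerate(pieces):
        if mode == "start" and i > 0:
            piece = sep + piece  # separator opens the following chunk
        elif mode == "end" and i < last:
            piece = piece + sep  # separator closes the preceding chunk
        result.append(piece)
    return result


sep = "SPLIT_MARK\r\n".encode("utf-8")
text = "aSPLIT_MARK\r\nbSPLIT_MARK\r\nc".encode("utf-8")
for mode in ("delete", "start", "end"):
    print(mode, split_with_mode(text, sep, mode))

# Byte-level pitfall: in Shift_JIS the second byte of "ソ" is 0x5C, the same
# byte as an ASCII backslash, so a one-byte separator cuts the character in half.
print(split_with_mode("ソ".encode("shift_jis"), b"\\"))  # [b'\x83', b'']
```

A longer separator such as `SPLIT_MARK\r\n` makes this kind of accidental match inside a multi-byte character far less likely, which is the reason given above for avoiding single-character separators.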
Usage (in a for loop)
import large_file_splitter

# Split a large file (for-loop version) [large_file_splitter]
for one_str in large_file_splitter.for_split(
    "dummy_large_file.txt",  # File to split
    # Separator string. It is matched as bytes internally, so a single-character
    # separator is not recommended: it may split inside a multi-byte character.
    split_str="SPLIT_MARK\r\n",
    # Separator handling. delete: dropped from the output; start: prepended to
    # the next chunk; end: appended to the previous chunk.
    div_mode="start",
    # Size in bytes of the data block processed in memory; available memory
    # must be at least several times this size.
    cache_size=1024,
):
    # Some processing using the string `one_str`
    print(one_str)
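To show why a fixed `cache_size` is enough even when a separator falls across two reads, here is a minimal sketch of block-wise splitting. It is an assumption about the general technique, not the code behind `for_split`: `iter_pieces` is a hypothetical name, it yields bytes rather than strings, and it always drops the separator (the `delete` behaviour).

```python
import io
from typing import BinaryIO, Iterator


def iter_pieces(stream: BinaryIO, sep: bytes, cache_size: int = 1024) -> Iterator[bytes]:
    """Hypothetical helper: yield the pieces of `stream` separated by `sep`,
    reading at most `cache_size` bytes per read."""
    if not sep:
        raise ValueError("separator must not be empty")
    buffer = b""
    while True:
        block = stream.read(cache_size)
        if not block:
            break
        buffer += block
        # Everything before the last separator is complete. The tail stays in
        # the buffer because the next block may finish a separator that
        # started at the end of this one.
        *complete, buffer = buffer.split(sep)
        yield from complete
    yield buffer  # whatever follows the final separator


data = b"first<SEP>second<SEP>third"
# cache_size=4 forces the 5-byte separator to straddle read boundaries
for piece in iter_pieces(io.BytesIO(data), b"<SEP>", cache_size=4):
    print(piece.decode("utf-8"))
```

In this sketch the buffer grows until a separator is found, so each individual piece still has to fit in memory; only the file as a whole does not.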
Hashes for large-file-splitter-0.2.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 669fa633eac6d037ce408cb8c18884d8f5b30d77d515729d611e18bb9101f132
MD5 | f52e135c16a61e0665af7c88cdf3d593
BLAKE2b-256 | 24f10df82eb4383175159fb5192a951f66cba7afeb044a7aa2560b33043428c0
Hashes for large_file_splitter-0.2.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b44e9f87d1ad3297dd075c86d3a0d92bb7f69bc27aef307974d6c7670b84d9e5
MD5 | d9e36f43daa81b469cfa09a5b5221b2b
BLAKE2b-256 | 501596a4815381b4e08d5eb544ab8f712aaeb07effd416b08b2488b1d23ee5ac