Python library for duplicate line removal, written in Rust
About The Project
This library is used to manipulate the lines of files. To achieve speed and efficiency, the library is written in Rust.
There are two functions in the library:
- compute_unique_lines - This function takes a list of input file paths and an output file path, iterates over the input files, and writes each unique line to the output file.
- compute_added_lines - This function takes two input paths, first_file_path and second_file_path, plus an output_file_path, and writes to the output file only the lines that appear in the second file but not in the first.
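The semantics of the two functions can be illustrated with a small pure-Python reference. This is only a sketch for small files, not the library's Rust implementation: the names unique_lines_reference and added_lines_reference are hypothetical, everything is held in memory, and the output order and the handling of repeated lines in the second file are assumptions that the library does not document here.

```python
import typing


def unique_lines_reference(
    file_paths: typing.List[str],
    output_file_path: str,
) -> None:
    """Write each distinct line found across file_paths exactly once."""
    seen: typing.Set[bytes] = set()
    with open(output_file_path, 'wb') as output_file:
        for file_path in file_paths:
            with open(file_path, 'rb') as input_file:
                for line in input_file:
                    line = line.rstrip(b'\n')
                    # Only the first occurrence of a line is written
                    if line not in seen:
                        seen.add(line)
                        output_file.write(line + b'\n')


def added_lines_reference(
    first_file_path: str,
    second_file_path: str,
    output_file_path: str,
) -> None:
    """Write lines that appear in the second file but not in the first."""
    with open(first_file_path, 'rb') as first_file:
        first_lines = {line.rstrip(b'\n') for line in first_file}

    with open(second_file_path, 'rb') as second_file:
        with open(output_file_path, 'wb') as output_file:
            for line in second_file:
                line = line.rstrip(b'\n')
                # Assumption: repeated lines in the second file are kept as-is
                if line not in first_lines:
                    output_file.write(line + b'\n')
```

Both helpers keep a full set of lines in memory, which is exactly the cost the library avoids on large inputs.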
Built With
Performance
Deduplicating
Library | Function | Time | Peak Memory |
---|---|---|---|
GNU Sort | sort -u -o output 500mb_one 500mb_two | 37.35 s | 8,261 MB |
PyDeduplines | compute_unique_lines('./workdir', ['500mb_one', '500mb_two'], 'output', 16) | 4.55 s | 685 MB |
Added Lines
Library | Function | Time | Peak Memory |
---|---|---|---|
GNU Sort | comm -1 -3 <(sort 500mb_one) <(sort 500mb_two) > output.txt | 26.53 s | 4,132 MB |
PyDeduplines | compute_added_lines('./workdir', '500mb_one', '500mb_two', 'output', 16) | 3.95 s | 314 MB |
Installation
pip3 install PyDeduplines
Documentation
def compute_unique_lines(
    working_directory: str,
    file_paths: typing.List[str],
    output_file_path: str,
    number_of_splits: int,
    number_of_threads: int = 0,
) -> None: ...
- working_directory - Path to a directory to work in. The split files are created in this directory.
- file_paths - A list of input file paths to iterate over and compute unique lines for.
- output_file_path - The path where the unique lines will be written.
- number_of_splits - The number of smaller split files each input file is divided into. This parameter is the core idea of the library: the more splits, the lower the peak memory consumption. Keep in mind that more splits also mean more files open at the same time.
- number_of_threads - Number of parallel threads. A value of 0 means using as many cores as possible. A value greater than 1 causes multiple splits to be made on each input file.
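The trade-off behind number_of_splits can be sketched in pure Python. The following is an illustration under assumptions, not the library's actual Rust code: the function name unique_lines_by_splits, the split file names and the use of zlib.crc32 as the routing hash are all hypothetical. The idea shown is that identical lines always land in the same split, so each split can be deduplicated on its own and only one split needs to be in memory at a time.

```python
import os
import typing
import zlib


def unique_lines_by_splits(
    working_directory: str,
    file_paths: typing.List[str],
    output_file_path: str,
    number_of_splits: int,
) -> None:
    os.makedirs(working_directory, exist_ok=True)
    split_paths = [
        os.path.join(working_directory, f'split_{index}')
        for index in range(number_of_splits)
    ]

    # Phase 1: route every line to a split chosen by a hash of its content.
    # All split files are open at once, hence the open-files cost of many splits.
    split_files = [open(path, 'wb') for path in split_paths]
    try:
        for file_path in file_paths:
            with open(file_path, 'rb') as input_file:
                for line in input_file:
                    line = line.rstrip(b'\n')
                    index = zlib.crc32(line) % number_of_splits
                    split_files[index].write(line + b'\n')
    finally:
        for split_file in split_files:
            split_file.close()

    # Phase 2: deduplicate one split at a time, so peak memory is bounded
    # by the largest split rather than by the whole input.
    with open(output_file_path, 'wb') as output_file:
        for path in split_paths:
            with open(path, 'rb') as split_file:
                unique_lines = set(split_file)
            output_file.writelines(unique_lines)
            os.remove(path)
```

In this sketch the output order is arbitrary, because lines are regrouped by hash and then written from a set.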
def compute_added_lines(
    working_directory: str,
    first_file_path: str,
    second_file_path: str,
    output_file_path: str,
    number_of_splits: int,
    number_of_threads: int = 0,
) -> None: ...
- working_directory - Path to a directory to work in. The split files are created in this directory.
- first_file_path - Path to the first file to iterate over.
- second_file_path - Path to the second file, which is searched for lines that do not exist in the first file.
- output_file_path - Path to the output file, which receives the lines that appear in the second file but not in the first.
- number_of_splits - The number of smaller split files each input file is divided into. This parameter is the core idea of the library: the more splits, the lower the peak memory consumption. Keep in mind that more splits also mean more files open at the same time.
- number_of_threads - Number of parallel threads. A value of 0 means using as many cores as possible. A value greater than 1 causes multiple splits to be made on each input file.
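The same splitting idea extends to finding added lines. If both files are partitioned with the same hash, a line can only match lines in the split with the same index, so each pair of splits can be compared independently. The sketch below assumes this approach for illustration only. The names partition_lines and added_lines_by_splits, the split file prefixes and the crc32 routing are hypothetical and not taken from the library.

```python
import os
import typing
import zlib


def partition_lines(
    file_path: str,
    working_directory: str,
    prefix: str,
    number_of_splits: int,
) -> typing.List[str]:
    """Split one file into number_of_splits files keyed by a hash of each line."""
    split_paths = [
        os.path.join(working_directory, f'{prefix}_{index}')
        for index in range(number_of_splits)
    ]
    split_files = [open(path, 'wb') for path in split_paths]
    try:
        with open(file_path, 'rb') as input_file:
            for line in input_file:
                line = line.rstrip(b'\n')
                index = zlib.crc32(line) % number_of_splits
                split_files[index].write(line + b'\n')
    finally:
        for split_file in split_files:
            split_file.close()
    return split_paths


def added_lines_by_splits(
    working_directory: str,
    first_file_path: str,
    second_file_path: str,
    output_file_path: str,
    number_of_splits: int,
) -> None:
    os.makedirs(working_directory, exist_ok=True)
    first_splits = partition_lines(
        first_file_path, working_directory, 'first', number_of_splits,
    )
    second_splits = partition_lines(
        second_file_path, working_directory, 'second', number_of_splits,
    )

    with open(output_file_path, 'wb') as output_file:
        # Equal lines share a split index, so only matching pairs are compared
        for first_split, second_split in zip(first_splits, second_splits):
            with open(first_split, 'rb') as first_file:
                first_lines = set(first_file)
            with open(second_split, 'rb') as second_file:
                added_lines = set(second_file) - first_lines
            output_file.writelines(added_lines)
            os.remove(first_split)
            os.remove(second_split)
```

Because each pair of splits is independent, the pairs could also be handed to separate workers, which is one plausible way a thread count could be put to use.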
Usage
import pydeduplines

pydeduplines.compute_unique_lines(
    working_directory='tmp',
    file_paths=[
        '500mb_one',
        '500mb_two',
    ],
    output_file_path='output',
    number_of_splits=4,
)

pydeduplines.compute_added_lines(
    working_directory='tmp',
    first_file_path='500mb_one',
    second_file_path='500mb_two',
    output_file_path='output',
    number_of_splits=4,
)
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Gal Ben David - gal@intsights.com
Project Link: https://github.com/intsights/PyDeduplines