Python library for duplicate line removal, written in Rust

About The Project

This library removes duplicate lines from files and finds the lines that one file adds relative to another. To achieve speed and efficiency, the library is written in Rust.

The library exposes two functions:

  • compute_unique_lines - Takes a list of input file paths and an output file path, iterates over the input files and writes each unique line to the output file.
  • compute_added_lines - Takes first_file_path, second_file_path and output_file_path, and writes to the output file only the lines that appear in the second file but not in the first.

Both functions also take a working directory and split settings, which are described in the Documentation section.
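
To make the two behaviours concrete, here is a minimal pure-Python sketch of the same semantics. It is not the library's implementation: the names `read_lines`, `reference_unique_lines` and `reference_added_lines` are hypothetical, everything is held in memory, and the order of output lines and the handling of duplicates inside the second file are assumptions, since the description above does not specify them.

```python
import os
import tempfile
from typing import Iterator, List


def read_lines(path: str) -> Iterator[str]:
    """Yield the lines of a text file without their trailing newline."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def reference_unique_lines(file_paths: List[str], output_file_path: str) -> None:
    """Write each distinct line found across the inputs exactly once."""
    seen = set()
    with open(output_file_path, "w", encoding="utf-8") as out:
        for path in file_paths:
            for line in read_lines(path):
                if line not in seen:
                    seen.add(line)
                    out.write(line + "\n")


def reference_added_lines(
    first_file_path: str, second_file_path: str, output_file_path: str
) -> None:
    """Write the lines of the second file that never occur in the first."""
    first_lines = set(read_lines(first_file_path))
    with open(output_file_path, "w", encoding="utf-8") as out:
        for line in read_lines(second_file_path):
            # Assumption: duplicates inside the second file are kept as-is.
            if line not in first_lines:
                out.write(line + "\n")


# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.txt")
    second = os.path.join(tmp, "second.txt")
    output = os.path.join(tmp, "output.txt")
    with open(first, "w", encoding="utf-8") as handle:
        handle.write("apple\nbanana\napple\n")
    with open(second, "w", encoding="utf-8") as handle:
        handle.write("banana\ncherry\n")

    reference_unique_lines([first, second], output)
    print(list(read_lines(output)))  # ['apple', 'banana', 'cherry']

    reference_added_lines(first, second, output)
    print(list(read_lines(output)))  # ['cherry']
```

A set-based version like this needs memory proportional to the number of distinct lines, which is the cost the library's split-based design is meant to reduce.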

Performance

Deduplicating

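| Library      | Function                                                                     | Time   | Peak memory |
|--------------|------------------------------------------------------------------------------|--------|-------------|
| GNU Sort     | sort -u -o output 500mb_one 500mb_two                                        | 37.35s | 8,261 MB    |
| PyDeduplines | compute_unique_lines('./workdir', ['500mb_one', '500mb_two'], 'output', 16)  | 4.55s  | 685 MB      |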

Added Lines

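| Library      | Function                                                                | Time   | Peak memory |
|--------------|-------------------------------------------------------------------------|--------|-------------|
| GNU Sort     | comm -1 -3 <(sort 500mb_one) <(sort 500mb_two) > output.txt             | 26.53s | 4,132 MB    |
| PyDeduplines | compute_added_lines('./workdir', '500mb_one', '500mb_two', 'output', 16) | 3.95s  | 314 MB      |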

Installation

pip3 install PyDeduplines

Documentation

def compute_unique_lines(
    working_directory: str,
    file_paths: typing.List[str],
    output_file_path: str,
    number_of_splits: int,
    number_of_threads: int = 0,
) -> None: ...

  • working_directory - Path to a directory to work in. Every split file is created in this directory.
  • file_paths - A list of input file paths to iterate over and compute unique lines for.
  • output_file_path - Path of the file the unique lines are written to.
  • number_of_splits - The number of smaller splits made from each input file. This parameter is the core idea of the library: the more splits, the lower the peak memory consumption. Keep in mind that more splits also mean more files open at the same time.
  • number_of_threads - Number of parallel threads. 0 means use as many cores as possible. A value greater than 1 causes multiple splits on each input file.

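
The trade-off described for number_of_splits can be illustrated with a small sketch. It assumes that lines are routed to split files by a hash of their content, so that identical lines always land in the same split; the source does not spell out the routing scheme, so treat this as one plausible reading rather than the library's actual algorithm. The helper names `split_by_hash` and `unique_lines_via_splits` and the `split_<i>` file names are hypothetical.

```python
import os
import tempfile
import zlib
from typing import List


def split_by_hash(
    file_paths: List[str], working_directory: str, number_of_splits: int
) -> List[str]:
    """Route every input line to one split file chosen by a hash of its content."""
    split_paths = [
        os.path.join(working_directory, f"split_{index}")
        for index in range(number_of_splits)
    ]
    # All split files are open at once: more splits means more open files.
    splits = [open(path, "wb") for path in split_paths]
    try:
        for file_path in file_paths:
            with open(file_path, "rb") as source:
                for raw in source:
                    line = raw.rstrip(b"\n")
                    index = zlib.crc32(line) % number_of_splits
                    splits[index].write(line + b"\n")
    finally:
        for split in splits:
            split.close()
    return split_paths


def unique_lines_via_splits(
    file_paths: List[str],
    working_directory: str,
    output_file_path: str,
    number_of_splits: int,
) -> None:
    """Deduplicate one split at a time, so only one split's lines are in memory."""
    split_paths = split_by_hash(file_paths, working_directory, number_of_splits)
    with open(output_file_path, "wb") as out:
        for split_path in split_paths:
            seen = set()
            with open(split_path, "rb") as split:
                for raw in split:
                    if raw not in seen:
                        seen.add(raw)
                        out.write(raw)
            os.remove(split_path)


# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    one = os.path.join(tmp, "one.txt")
    two = os.path.join(tmp, "two.txt")
    output = os.path.join(tmp, "output.txt")
    with open(one, "wb") as handle:
        handle.write(b"a\nb\na\n")
    with open(two, "wb") as handle:
        handle.write(b"b\nc")
    unique_lines_via_splits([one, two], tmp, output, number_of_splits=4)
    with open(output, "rb") as handle:
        print(sorted(handle))  # [b'a\n', b'b\n', b'c\n']
```

Because equal lines share a split, each split can be deduplicated independently, and the in-memory set never holds more than one split's distinct lines. The output order follows split order, not input order.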
def compute_added_lines(
    working_directory: str,
    first_file_path: str,
    second_file_path: str,
    output_file_path: str,
    number_of_splits: int,
    number_of_threads: int = 0,
) -> None: ...

  • working_directory - Path to a directory to work in. Every split file is created in this directory.
  • first_file_path - Path to the first file to iterate over.
  • second_file_path - Path to the file whose lines are checked against the first file.
  • output_file_path - Path of the output file, which receives the lines that appear in the second file but not in the first.
  • number_of_splits - The number of smaller splits made from each input file. This parameter is the core idea of the library: the more splits, the lower the peak memory consumption. Keep in mind that more splits also mean more files open at the same time.
  • number_of_threads - Number of parallel threads. 0 means use as many cores as possible. A value greater than 1 causes multiple splits on each input file.
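
For the added-lines case, a split-based approach could look like the sketch below. As before, hash-based routing is an assumption and not something the source confirms, and the names `split_file_by_hash` and `added_lines_via_splits` and the `first_<i>` / `second_<i>` file names are hypothetical. Both files are split with the same hash, so a line from the second file only has to be compared with the matching split of the first file.

```python
import os
import tempfile
import zlib
from typing import List


def split_file_by_hash(
    file_path: str, working_directory: str, prefix: str, number_of_splits: int
) -> List[str]:
    """Write each line of one file into the split selected by its content hash."""
    split_paths = [
        os.path.join(working_directory, f"{prefix}_{index}")
        for index in range(number_of_splits)
    ]
    splits = [open(path, "wb") for path in split_paths]
    try:
        with open(file_path, "rb") as source:
            for raw in source:
                line = raw.rstrip(b"\n")
                splits[zlib.crc32(line) % number_of_splits].write(line + b"\n")
    finally:
        for split in splits:
            split.close()
    return split_paths


def added_lines_via_splits(
    working_directory: str,
    first_file_path: str,
    second_file_path: str,
    output_file_path: str,
    number_of_splits: int,
) -> None:
    """Write lines of the second file that are absent from the first, split by split."""
    first_splits = split_file_by_hash(
        first_file_path, working_directory, "first", number_of_splits
    )
    second_splits = split_file_by_hash(
        second_file_path, working_directory, "second", number_of_splits
    )
    with open(output_file_path, "wb") as out:
        for first_split, second_split in zip(first_splits, second_splits):
            # Only the matching split of the first file is loaded into memory.
            with open(first_split, "rb") as handle:
                known = set(handle)
            with open(second_split, "rb") as handle:
                for raw in handle:
                    if raw not in known:
                        out.write(raw)
            os.remove(first_split)
            os.remove(second_split)


# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.txt")
    second = os.path.join(tmp, "second.txt")
    output = os.path.join(tmp, "output.txt")
    with open(first, "wb") as handle:
        handle.write(b"a\nb\nc\n")
    with open(second, "wb") as handle:
        handle.write(b"b\nd\ne\na\n")
    added_lines_via_splits(tmp, first, second, output, number_of_splits=4)
    with open(output, "rb") as handle:
        print(sorted(handle))  # [b'd\n', b'e\n']
```

Peak memory here is bounded by the largest split of the first file, which is why a higher split count lowers it.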

Usage

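import pydeduplines


# Write every unique line from the two input files to 'output'.
# number_of_threads is omitted, so it defaults to 0 (use as many cores as possible).
pydeduplines.compute_unique_lines(
    working_directory='tmp',
    file_paths=[
        '500mb_one',
        '500mb_two',
    ],
    output_file_path='output',
    number_of_splits=4,
)

# Write the lines that appear in '500mb_two' but not in '500mb_one' to 'output'.
pydeduplines.compute_added_lines(
    working_directory='tmp',
    first_file_path='500mb_one',
    second_file_path='500mb_two',
    output_file_path='output',
    number_of_splits=4,
)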

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - gal@intsights.com

Project Link: https://github.com/intsights/PyDeduplines
