pydeduplines·PyPI

Python library for a duplicate lines removal written in C++

Table of Contents
About The Project
Documentation
Usage
License
Contact

About The Project

PyDeduplines is a library for manipulating the lines of files. It is written in C++ for speed and efficiency. For deduplication, it uses a hash set implementation called Parallel Hashmap, which is fast and memory efficient. The library consists of two functions:

compute_unique_lines - This function takes a list of input file paths and an output file path, iterates over each input file, and writes the unique lines to the output file.
compute_added_lines - This function takes first_file_path, second_file_path and output_file_path, and writes to the output file only the lines that appear in the second file but not in the first.
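
As a point of reference, the semantics of the two functions can be sketched in pure Python with in-memory sets. This only illustrates the behavior described above and is not the library's C++ implementation. The helper names `unique_lines_reference` and `added_lines_reference` are hypothetical, and the output order (order of first occurrence) and the removal of repeated added lines are assumptions, since the source does not state them.

```python
import typing


def unique_lines_reference(
    file_paths: typing.List[str],
    output_file_path: str,
) -> None:
    # Remember every line already written so each distinct line is written once.
    seen: typing.Set[bytes] = set()
    with open(output_file_path, 'wb') as output_file:
        for file_path in file_paths:
            with open(file_path, 'rb') as input_file:
                for line in input_file:
                    line = line.rstrip(b'\n')
                    if line not in seen:
                        seen.add(line)
                        output_file.write(line + b'\n')


def added_lines_reference(
    first_file_path: str,
    second_file_path: str,
    output_file_path: str,
) -> None:
    # Collect the lines of the first file, then keep only the lines of the
    # second file that are absent from that set.
    with open(first_file_path, 'rb') as first_file:
        first_lines = {line.rstrip(b'\n') for line in first_file}
    # Assumption: a line added several times is written only once.
    written: typing.Set[bytes] = set()
    with open(second_file_path, 'rb') as second_file:
        with open(output_file_path, 'wb') as output_file:
            for line in second_file:
                line = line.rstrip(b'\n')
                if line not in first_lines and line not in written:
                    written.add(line)
                    output_file.write(line + b'\n')
```

Both helpers hold every distinct line in memory at once, which is the cost the library's split-based approach is meant to reduce.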

Built With

Performance

Deduplicating

Library	Function	Time	Peak Memory
GNU Sort	sort -u -o output 500mb_one 500mb_two	53.35 s	9,376 MB
PyDeduplines	compute_unique_lines(['500mb_one', '500mb_two'], 'output', 4)	17.31 s	1,040 MB

Added Lines

Library	Function	Time	Peak Memory
GNU Sort	comm -13 <(sort 500mb_one -u) <(sort 500mb_two -u)	52.04 s	9,376 MB
PyDeduplines	compute_added_lines('500mb_one', '500mb_two', 'output', 4)	6.91 s	681 MB
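
The source does not say how the figures above were measured. The following is a minimal sketch of one way to collect comparable numbers on a Unix system, using only the standard library. The helper name `measure` is hypothetical, and `resource` is not available on Windows.

```python
import resource
import time
import typing


def measure(
    function: typing.Callable[..., typing.Any],
    *args: typing.Any,
    **kwargs: typing.Any,
) -> typing.Tuple[float, int]:
    # Wall-clock time of a single call.
    start = time.perf_counter()
    function(*args, **kwargs)
    elapsed_seconds = time.perf_counter() - start
    # ru_maxrss is the peak resident set size of the whole process so far
    # (kilobytes on Linux, bytes on macOS), not of this call alone.
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return elapsed_seconds, peak_rss
```

Because the peak value covers the lifetime of the process, each measured call is best run in a fresh interpreter so that earlier work does not inflate the result.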

Prerequisites

To compile this package, you need GCC with C++ support and the Python development package installed.

Fedora

sudo dnf install python3-devel gcc-c++

Ubuntu 20.04

sudo apt install python3-dev build-essential

Installation

pip3 install PyDeduplines

Documentation

class FilesDeduplicator:
  def __init__(
      self,
      working_directory: str,
      number_of_threads: int,
  ) -> None

working_directory - The path of a directory to work in. Every split file is created in this directory.
number_of_threads - The number of threads to execute in parallel. 0 means the number of available CPU cores. Any value greater than 1 produces multiple splits of each input file.
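
The two constructor arguments can be prepared with the standard library, as in the sketch below. The helper names are hypothetical, and rejecting negative values is an assumption, since the source only documents the meaning of 0 and of values greater than 1.

```python
import os
import tempfile


def resolve_number_of_threads(number_of_threads: int) -> int:
    # Hypothetical helper mirroring the documented rule that 0 means
    # "use the number of available CPU cores".
    if number_of_threads < 0:
        # Assumption: negative values are not meaningful.
        raise ValueError('number_of_threads must be 0 or greater')
    if number_of_threads == 0:
        # os.cpu_count() may return None when the count cannot be determined.
        return os.cpu_count() or 1
    return number_of_threads


def make_working_directory() -> str:
    # Create a fresh, empty directory that can be passed as working_directory.
    return tempfile.mkdtemp(prefix='pydeduplines_')
```

A dedicated, empty working directory keeps the temporary split files apart from the input and output files.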

def compute_unique_lines(
    self,
    file_paths: typing.List[str],
    output_file_path: str,
    number_of_splits: int,
) -> None

file_paths - A list of input file paths to iterate over when computing the unique lines.
output_file_path - An output file path that will be filled with the unique lines.
number_of_splits - Each input file is split into this many smaller files. This parameter is the whole idea of the library: the more splits, the lower the peak memory consumption. Keep in mind that more splits also mean more disk I/O.
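
The source does not describe how lines are assigned to splits. One common way to make splitting lower peak memory is to route each line by a hash of its content, so that identical lines always land in the same split and each split can be deduplicated on its own. The sketch below illustrates that idea in pure Python. The function names, the `split_<index>` file names, and the use of CRC32 are all assumptions, not the library's implementation.

```python
import os
import typing
import zlib


def split_by_hash(
    file_paths: typing.List[str],
    working_directory: str,
    number_of_splits: int,
) -> typing.List[str]:
    # Route every line to one of number_of_splits files based on a hash of
    # its content, so identical lines always land in the same split.
    split_paths = [
        os.path.join(working_directory, f'split_{index}')
        for index in range(number_of_splits)
    ]
    split_files = [open(path, 'wb') for path in split_paths]
    try:
        for file_path in file_paths:
            with open(file_path, 'rb') as input_file:
                for line in input_file:
                    line = line.rstrip(b'\n')
                    index = zlib.crc32(line) % number_of_splits
                    split_files[index].write(line + b'\n')
    finally:
        for split_file in split_files:
            split_file.close()
    return split_paths


def unique_lines_with_splits(
    file_paths: typing.List[str],
    working_directory: str,
    output_file_path: str,
    number_of_splits: int,
) -> None:
    split_paths = split_by_hash(file_paths, working_directory, number_of_splits)
    with open(output_file_path, 'wb') as output_file:
        for split_path in split_paths:
            # Only one split's distinct lines are held in memory at a time.
            with open(split_path, 'rb') as split_file:
                unique_lines = set(split_file)
            output_file.writelines(unique_lines)
            os.remove(split_path)
```

With this scheme the set held in memory shrinks roughly in proportion to the number of splits, while every line is written to disk and read back once more, which matches the trade-off between memory and disk I/O described above.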

def compute_added_lines(
    self,
    first_file_path: str,
    second_file_path: str,
    output_file_path: str,
    number_of_splits: int,
) -> None

first_file_path - A file path to iterate over.
second_file_path - A file path to iterate over and look for lines that do not exist in the first file.
output_file_path - An output file path that will be filled with the lines that appear in the second file but not in the first.
number_of_splits - Each input file is split into this many smaller files. This parameter is the whole idea of the library: the more splits, the lower the peak memory consumption. Keep in mind that more splits also mean more disk I/O.

Usage

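import pydeduplines

# Create a deduplicator. Split files are written to working_directory;
# number_of_threads=0 uses all available CPU cores.
file_deduplicator = pydeduplines.FilesDeduplicator(
    working_directory='/home/wavenator/work/PyDeduplines/tmp',
    number_of_threads=0,
)

# Write the unique lines of both input files to 'output'.
file_deduplicator.compute_unique_lines(
    file_paths=[
        '500mb_one',
        '500mb_two',
    ],
    output_file_path='output',
    number_of_splits=4,
)

# Write the lines that appear in '500mb_two' but not in '500mb_one'.
# This reuses the 'output' path from the call above; pass a different
# path to keep both results.
file_deduplicator.compute_added_lines(
    first_file_path='500mb_one',
    second_file_path='500mb_two',
    output_file_path='output',
    number_of_splits=4,
)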

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - gal@intsights.com

Project Link: https://github.com/intsights/PyDeduplines

Release history

Version	Release date
0.6.1	Feb 14, 2024
0.6.0	Jan 17, 2023
0.5.0	Feb 2, 2022
0.4.0	Sep 23, 2021
0.3.1	Jul 7, 2021
0.3.0	Jul 4, 2021
0.2.0	Aug 24, 2020 (this version)
0.1.5	Aug 22, 2020
0.1.4	Jun 3, 2020
0.1.3	Mar 16, 2020
0.1.2	Mar 16, 2020
0.1.1	Feb 19, 2020
0.1.0	Feb 4, 2020

Download files


Source Distribution

PyDeduplines-0.2.0.tar.gz (347.6 kB)

Uploaded Aug 24, 2020 Source

File details

Details for the file PyDeduplines-0.2.0.tar.gz.

File metadata

Download URL: PyDeduplines-0.2.0.tar.gz
Upload date: Aug 24, 2020
Size: 347.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.3.1 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for PyDeduplines-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`a282548c2245f3f21f6e9a6dfb003ffd341b224f07f1d477fb4721e6efea84cb`
MD5	`025ee0fd1421f37db9717033d39be834`
BLAKE2b-256	`5c6c70fcf7dbd3829232fdecbbeb0437aafef6f9fb0ffa9521a1260c6f04919d`
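
A downloaded archive can be checked against the SHA256 digest listed above with the standard `hashlib` module. This is a minimal sketch: the helper names are hypothetical, and the archive is assumed to have been saved locally under its published file name.

```python
import hashlib

# SHA256 digest published for PyDeduplines-0.2.0.tar.gz.
EXPECTED_SHA256 = 'a282548c2245f3f21f6e9a6dfb003ffd341b224f07f1d477fb4721e6efea84cb'


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    # Hash the file in chunks so large archives are not loaded into memory.
    digest = hashlib.sha256()
    with open(path, 'rb') as input_file:
        for chunk in iter(lambda: input_file.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected: str = EXPECTED_SHA256) -> bool:
    # Compare case-insensitively, since hex digests may be upper or lower case.
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_published_hash('PyDeduplines-0.2.0.tar.gz')` should return True for an intact download of this release.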

