Python library for a duplicate lines removal written in C++

About The Project

PyDeduplines is a library for manipulating the lines of files. It is written in C++ for speed and efficiency. For deduplication, it uses Parallel Hashmap, a fast and memory-efficient hash set implementation. The library provides two functions:

  • compute_unique_lines - Takes a list of input file paths and an output file path, iterates over each input file, and writes every unique line to the output file.
  • compute_added_lines - Takes first_file_path, second_file_path and output_file_path, and writes to the output file only the lines that appear in the second file but not in the first.
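
The semantics of the two functions can be illustrated with a minimal in-memory sketch. This is not the library's implementation: the names unique_lines and added_lines are hypothetical, the sketch works on lists of strings instead of files, and the output order and the handling of repeated lines in the second file are assumptions, since they are not specified here.

```python
from typing import Iterable, List


def unique_lines(files: Iterable[List[str]]) -> List[str]:
    """Return each distinct line once, in first-seen order (order is an assumption)."""
    seen = set()
    result = []
    for lines in files:
        for line in lines:
            if line not in seen:
                seen.add(line)
                result.append(line)
    return result


def added_lines(first: List[str], second: List[str]) -> List[str]:
    """Return the lines of `second` that never occur in `first`."""
    first_set = set(first)
    return [line for line in second if line not in first_set]


if __name__ == "__main__":
    print(unique_lines([["a", "b", "a"], ["b", "c"]]))
    print(added_lines(["a", "b"], ["b", "c", "d"]))
```

Holding every line in a single set is what makes peak memory grow with the input size, which is the cost the library's splitting strategy is meant to reduce.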

Built With

Performance

Deduplicating

Library       Function                                                       Time    Peak Memory
GNU Sort      sort -u -o output 500mb_one 500mb_two                          53.35s  9,376 MB
PyDeduplines  compute_unique_lines(['500mb_one', '500mb_two'], 'output', 4)  17.31s  1,040 MB

Added Lines

Library       Function                                                    Time    Peak Memory
GNU Sort      comm -13 <(sort 500mb_one -u) <(sort 500mb_two -u)          52.04s  9,376 MB
PyDeduplines  compute_added_lines('500mb_one', '500mb_two', 'output', 4)  6.91s   681 MB

Prerequisites

To compile this package, you need GCC and the Python development package installed.

  • Fedora
sudo dnf install python3-devel gcc-c++
  • Ubuntu 20.04
sudo apt install python3-dev build-essential

Installation

pip3 install PyDeduplines

Documentation

class FilesDeduplicator:
  def __init__(
      self,
      working_directory: str,
      number_of_threads: int,
  ) -> None
  • working_directory - A path to a directory to work in. Every split file is created in this directory.
  • number_of_threads - The number of threads to run in parallel. 0 means the number of available CPU cores. Any number of threads greater than 1 produces multiple splits of each input file.
def compute_unique_lines(
    self,
    file_paths: typing.List[str],
    output_file_path: str,
    number_of_splits: int,
) -> None
  • file_paths - A list of input file paths to iterate over when computing the unique lines.
  • output_file_path - An output file path that will be filled with the unique lines.
  • number_of_splits - Each input file is split into this many smaller files. This parameter is the whole idea of the library: the more splits, the lower the peak memory consumption. Keep in mind that more splits also mean more disk I/O.
def compute_added_lines(
    self,
    first_file_path: str,
    second_file_path: str,
    output_file_path: str,
    number_of_splits: int,
) -> None
  • first_file_path - A file path to iterate over.
  • second_file_path - A file path to iterate over and look for lines that do not exist in the first file.
  • output_file_path - An output file path that will be filled with the lines that appeared in the second file but not in the first.
  • number_of_splits - Each input file is split into this many smaller files. This parameter is the whole idea of the library: the more splits, the lower the peak memory consumption. Keep in mind that more splits also mean more disk I/O.
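
The memory and disk I/O trade-off behind number_of_splits can be sketched in pure Python. The sketch below is an assumption about the general technique (hash partitioning into split files, then deduplicating one split at a time), not a description of the library's C++ internals. The function name dedup_with_splits, the split file names, and the use of zlib.crc32 as the partitioning hash are all hypothetical.

```python
import os
import zlib
from typing import List


def dedup_with_splits(
    input_paths: List[str],
    output_path: str,
    number_of_splits: int,
    working_directory: str,
) -> None:
    """Write every distinct line of the inputs to output_path exactly once."""
    # Hypothetical split file names; one split file per bucket.
    split_paths = [
        os.path.join(working_directory, f"split_{index}")
        for index in range(number_of_splits)
    ]
    split_files = [open(path, "wb") for path in split_paths]
    try:
        # Pass 1: route each line to a bucket chosen by its hash, so that
        # identical lines always land in the same split file.
        for input_path in input_paths:
            with open(input_path, "rb") as source:
                for line in source:
                    line = line.rstrip(b"\n")
                    index = zlib.crc32(line) % number_of_splits
                    split_files[index].write(line + b"\n")
    finally:
        for split_file in split_files:
            split_file.close()

    # Pass 2: deduplicate one split at a time. Only the set for the current
    # split is held in memory, which is why more splits lower peak memory.
    with open(output_path, "wb") as output:
        for split_path in split_paths:
            seen = set()
            with open(split_path, "rb") as split:
                for line in split:
                    if line not in seen:
                        seen.add(line)
                        output.write(line)
            os.remove(split_path)
```

Every line is written to disk once and read back once, so the extra disk I/O is the price paid for keeping each in-memory set small. In this sketch the output order follows the buckets rather than the input order.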

Usage

import pydeduplines

file_deduplicator = pydeduplines.FilesDeduplicator(
  working_directory='/home/wavenator/work/PyDeduplines/tmp',
  number_of_threads=0,
)
file_deduplicator.compute_unique_lines(
    file_paths=[
        '500mb_one',
        '500mb_two',
    ],
    output_file_path='output',
    number_of_splits=4,
)

file_deduplicator.compute_added_lines(
    first_file_path='500mb_one',
    second_file_path='500mb_two',
    output_file_path='output',
    number_of_splits=4,
)

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - gal@intsights.com

Project Link: https://github.com/intsights/PyDeduplines
