Python library for a duplicate lines removal written in C++
About The Project
PyDeduplines is a library for manipulating the lines of files. It is written in C++ for speed and efficiency. For deduplication, it uses Parallel Hashmap, a fast and memory-efficient hash set implementation. The library provides two functions:
- compute_unique_lines - This function takes a list of input file paths and an output file path, iterates over each of the input files, and writes the unique lines to the output file.
- compute_added_lines - This function takes three arguments, first_file_path, second_file_path, and output_file_path, and writes to the output file only the lines that appear in the second file but not in the first.
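The intended results of the two functions can be illustrated with a small in-memory reference in plain Python. This is only a sketch of the semantics described above, not the library's C++ implementation. The names unique_lines and added_lines are hypothetical, and two behaviors are assumptions because the description does not state them: the sketch keeps first-seen order, and it keeps repeated lines of the second file as they are.

```python
from typing import Iterable, List


def unique_lines(lines: Iterable[str]) -> List[str]:
    """Return each distinct line once (first-seen order is an assumption)."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


def added_lines(first: Iterable[str], second: Iterable[str]) -> List[str]:
    """Return the lines of `second` that never occur in `first`.

    Repeated lines in `second` are kept as-is (an assumption of this sketch).
    """
    first_seen = set(first)
    return [line for line in second if line not in first_seen]
```

A reference like this holds every line in memory, so it is only practical for small inputs, for example when checking the library's output on a test file.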
Built With
Performance
Deduplicating
Library | Function | Time | Peak Memory |
---|---|---|---|
GNU Sort | sort -u -o output 500mb_one 500mb_two | 53.35s | 9,376mb |
PyDeduplines | compute_unique_lines(['500mb_one', '500mb_two'], 'output', 4) | 17.31s | 1,040mb |
Added Lines
Library | Function | Time | Peak Memory |
---|---|---|---|
GNU Sort | comm -13 <(sort 500mb_one -u) <(sort 500mb_two -u) | 52.04s | 9,376mb |
PyDeduplines | compute_added_lines('500mb_one', '500mb_two', 'output', 4) | 6.91s | 681mb |
Prerequisites
To compile this package, you need GCC and the Python development package installed.
- Fedora
sudo dnf install python3-devel gcc-c++
- Ubuntu 20.04
sudo apt install python3-dev build-essential
Installation
pip3 install PyDeduplines
Documentation
class FilesDeduplicator:
    def __init__(
        self,
        working_directory: str,
        number_of_threads: int,
    ) -> None
- working_directory - The path of a directory to work in. Every split file is created in this directory.
- number_of_threads - The number of threads to execute in parallel. 0 means the number of available CPU cores. Any number of threads greater than 1 produces multiple splits of each input file.
def compute_unique_lines(
    self,
    file_paths: typing.List[str],
    output_file_path: str,
    number_of_splits: int,
) -> None
- file_paths - A list of input file paths to iterate over when computing the unique lines.
- output_file_path - The path of the output file that will be filled with the unique lines.
- number_of_splits - Each input file is split into this many smaller files. This parameter is the core idea of the library: the more splits, the lower the peak memory consumption. Keep in mind that more splits also mean more disk I/O.
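The documentation does not say how the splitting is implemented. One common way to trade memory for disk I/O is hash partitioning: every line is routed to a split file chosen by its hash, so identical lines always meet in the same split and each split can be deduplicated on its own. The following is a minimal pure-Python sketch under that assumption. The function dedup_with_splits, the split_N file names, and the use of CRC32 are illustrative choices, not PyDeduplines' API or algorithm.

```python
import os
import zlib
from typing import List


def dedup_with_splits(
    file_paths: List[str],
    output_file_path: str,
    number_of_splits: int,
    working_directory: str,
) -> None:
    """Illustrative split-based deduplication (hypothetical helper)."""
    split_paths = [
        os.path.join(working_directory, f"split_{index}")
        for index in range(number_of_splits)
    ]
    splits = [open(path, "wb") for path in split_paths]
    try:
        # Pass 1: route each line to a split chosen by its hash, so that
        # identical lines always land in the same split file.
        for file_path in file_paths:
            with open(file_path, "rb") as input_file:
                for line in input_file:
                    line = line.rstrip(b"\n")
                    index = zlib.crc32(line) % number_of_splits
                    splits[index].write(line + b"\n")
    finally:
        for split in splits:
            split.close()

    # Pass 2: deduplicate one split at a time. Only the set for the
    # current split is held in memory.
    with open(output_file_path, "wb") as output_file:
        for path in split_paths:
            seen = set()
            with open(path, "rb") as split_file:
                for line in split_file:
                    if line not in seen:
                        seen.add(line)
                        output_file.write(line)
            os.remove(path)
```

In this sketch, peak memory is bounded by the largest split rather than by the whole input, while every line is written to disk and read back once more, which matches the trade-off described for number_of_splits.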
def compute_added_lines(
    self,
    first_file_path: str,
    second_file_path: str,
    output_file_path: str,
    number_of_splits: int,
) -> None
- first_file_path - The path of the first file to iterate over.
- second_file_path - The path of the file to iterate over while looking for lines that do not exist in the first file.
- output_file_path - The path of the output file that will be filled with the lines that appear in the second file but not in the first.
- number_of_splits - Each input file is split into this many smaller files. This parameter is the core idea of the library: the more splits, the lower the peak memory consumption. Keep in mind that more splits also mean more disk I/O.
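Splitting can also serve a two-file comparison: if both files are partitioned with the same hash function, a line from the second file only needs to be checked against the matching split of the first file. The sketch below illustrates that idea in plain Python. It is an assumption about how such a comparison could work, not the library's code; the names _partition and added_lines_with_splits, the file prefixes, and CRC32 are hypothetical choices. Repeated lines of the second file are kept as they are, which is also an assumption.

```python
import os
import zlib
from typing import List


def _partition(
    file_path: str,
    prefix: str,
    number_of_splits: int,
    working_directory: str,
) -> List[str]:
    """Write every line of file_path into a split file chosen by its hash."""
    paths = [
        os.path.join(working_directory, f"{prefix}_{index}")
        for index in range(number_of_splits)
    ]
    handles = [open(path, "wb") for path in paths]
    try:
        with open(file_path, "rb") as input_file:
            for line in input_file:
                line = line.rstrip(b"\n")
                index = zlib.crc32(line) % number_of_splits
                handles[index].write(line + b"\n")
    finally:
        for handle in handles:
            handle.close()
    return paths


def added_lines_with_splits(
    first_file_path: str,
    second_file_path: str,
    output_file_path: str,
    number_of_splits: int,
    working_directory: str,
) -> None:
    """Illustrative split-based "added lines" computation (hypothetical helper)."""
    first_splits = _partition(
        first_file_path, "first", number_of_splits, working_directory
    )
    second_splits = _partition(
        second_file_path, "second", number_of_splits, working_directory
    )
    with open(output_file_path, "wb") as output_file:
        # Splits with the same index hold the same hash bucket of both files.
        for first_split, second_split in zip(first_splits, second_splits):
            with open(first_split, "rb") as first_file:
                known = set(first_file)
            with open(second_split, "rb") as second_file:
                for line in second_file:
                    if line not in known:
                        output_file.write(line)
            os.remove(first_split)
            os.remove(second_split)
```

Only one split of the first file is loaded into memory at a time, so more splits lower the peak memory at the cost of writing and rereading both inputs.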
Usage
import pydeduplines

file_deduplicator = pydeduplines.FilesDeduplicator(
    working_directory='/home/wavenator/work/PyDeduplines/tmp',
    number_of_threads=0,
)
file_deduplicator.compute_unique_lines(
    file_paths=[
        '500mb_one',
        '500mb_two',
    ],
    output_file_path='output',
    number_of_splits=4,
)
file_deduplicator.compute_added_lines(
    first_file_path='500mb_one',
    second_file_path='500mb_two',
    output_file_path='output',
    number_of_splits=4,
)
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Gal Ben David - gal@intsights.com
Project Link: https://github.com/intsights/PyDeduplines