Python library for duplicate line removal, written in C++
About The Project
PyDeduplines is a library for deduplicating one or more files, line by line. It is written in C++ for speed and efficiency, and it uses mimalloc, an allocator written by Microsoft, for efficient memory allocation. Deduplication relies on Parallel Hashmap, a hash set implementation that is fast and memory efficient. In the benchmark below, on a 500 MB input, the library was about twice as fast as GNU `sort -u` and used about half the peak memory.
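The hash-set approach described above can be illustrated with a short pure-Python sketch. This is not the library's C++ implementation: the function name `deduplicate_lines_sketch` is hypothetical, the built-in `set` stands in for Parallel Hashmap, and stripping line endings before comparison is an assumption made here so that a final line without a newline still matches earlier copies.

```python
def deduplicate_lines_sketch(input_paths, output_path):
    """Write each distinct line from input_paths to output_path, in first-seen order.

    Illustrative only: a Python set plays the role of the C++ hash set.
    Returns the number of distinct lines written.
    """
    seen = set()
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                for line in src:
                    # Assumption: lines are compared without their line ending.
                    key = line.rstrip(b"\r\n")
                    if key not in seen:
                        seen.add(key)
                        out.write(key + b"\n")
    return len(seen)
```

Unlike `sort -u`, a hash-set pass does not need to sort the input, so lines come out in the order they were first seen. The trade-off is that every distinct line has to be remembered while the files are read.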
Built With
Performance
CPU
| Library | Text Size | Function | Time | Improvement Factor |
|---|---|---|---|---|
| GNU Sort | 500 MB | `sort -u -o output input` | 40.37s | 1.0x |
| PyDeduplines | 500 MB | `pydeduplines.deduplicate_lines(['input'], 'output')` | 18.54s | 2.17x |
Memory
| Library | Text Size | Function | Peak RSS Memory (bytes) | Improvement Factor |
|---|---|---|---|---|
| GNU Sort | 500 MB | `sort -u -o output input` | 4802100 | 1.0x |
| PyDeduplines | 500 MB | `pydeduplines.deduplicate_lines(['input'], 'output')` | 2345932 | 2.05x |
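The improvement factors in the tables above are the baseline measurement divided by the PyDeduplines measurement. The sketch below shows that arithmetic, plus one possible way to time a single call. The helper names `time_call` and `improvement_factor` are hypothetical and are not part of the library, and the README does not say how the published figures were actually collected.

```python
import time


def time_call(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def improvement_factor(baseline, candidate):
    """How many times smaller the candidate measurement is than the baseline."""
    return baseline / candidate


# Figures copied from the CPU and Memory tables.
cpu_factor = improvement_factor(40.37, 18.54)
memory_factor = improvement_factor(4802100, 2345932)
print(f"CPU: {cpu_factor:.3f}x, memory: {memory_factor:.3f}x")
```

The CPU ratio works out to roughly 2.177, which the table lists as 2.17x, and the memory ratio to roughly 2.047, listed as 2.05x.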
Prerequisites
To compile this package, you need GCC and the Python development package installed.
- Fedora
sudo dnf install python3-devel gcc-c++
- Ubuntu 18.04
sudo apt install python3-dev g++-9
Installation
pip3 install PyDeduplines
Usage
import pydeduplines
# Reads the input files line by line and writes each line to the output
# file only the first time it is seen.
pydeduplines.deduplicate_lines(['input'], 'output')
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Gal Ben David - gal@intsights.com
Project Link: https://github.com/intsights/PyDeduplines