Python library for duplicate line removal, written in C++
About The Project
PyDeduplines is a library for deduplicating multiple files, line by line. It is written in C++ for speed and memory efficiency. It uses mimalloc, an allocator written by Microsoft, for memory allocation, and a hash set implementation called Parallel Hashmap, which is fast and memory efficient, for the deduplication itself. In the benchmarks below, on a 500 MB input, it runs about twice as fast as GNU sort -u and uses about half the peak memory.
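To illustrate the general idea of hash-set deduplication, here is a minimal pure-Python sketch. It is not the library's C++ implementation: the function name `deduplicate_lines_sketch`, the use of Python's built-in `set` in place of Parallel Hashmap, and the choice to keep lines in first-seen order are all assumptions made for illustration.

```python
from typing import Iterable


def deduplicate_lines_sketch(input_paths: Iterable[str], output_path: str) -> int:
    """Write each distinct line from the inputs once, in first-seen order.

    Hypothetical helper for illustration only; returns the number of
    unique lines written.
    """
    seen = set()  # stand-in for the hash set used by the real library
    written = 0
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                for line in src:
                    # Compare lines without their terminators so that a
                    # final line lacking a newline still matches.
                    key = line.rstrip(b"\r\n")
                    if key in seen:
                        continue
                    seen.add(key)
                    out.write(key + b"\n")
                    written += 1
    return written
```

Every unique line stays in memory for the whole run, so peak memory in this sketch grows with the number of distinct lines rather than with total input size.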
Built With
Performance
CPU
| Library | Text Size | Function | Time | Improvement Factor |
|---|---|---|---|---|
| GNU Sort | 500 MB | sort -u -o output input | 40.37s | 1.0x |
| PyDeduplines | 500 MB | pydeduplines.deduplicate_lines(['input'], 'output') | 18.54s | 2.17x |
Memory
| Library | Text Size | Function | Peak RSS Memory (bytes) | Improvement Factor |
|---|---|---|---|---|
| GNU Sort | 500 MB | sort -u -o output input | 4802100 | 1.0x |
| PyDeduplines | 500 MB | pydeduplines.deduplicate_lines(['input'], 'output') | 2345932 | 2.05x |
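Figures like the ones in the tables above could be collected with a small harness along these lines. This is a sketch, not the benchmark script used for this project: `measure_command` and `improvement_factor` are hypothetical helpers, and the sketch assumes a Unix-like system where Python's `resource` module is available. Note that `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS.

```python
import resource
import subprocess
import time


def measure_command(argv):
    """Run argv as a child process and return (elapsed_seconds, peak_rss).

    peak_rss is the largest resident set size among all child processes
    this interpreter has waited for so far, so measure each command in a
    fresh interpreter when comparing tools.
    """
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss


def improvement_factor(baseline, candidate):
    """Ratio of the baseline measurement to the candidate measurement."""
    return baseline / candidate
```

For example, `measure_command(["sort", "-u", "-o", "output", "input"])` would time the GNU Sort baseline, and `improvement_factor(4802100, 2345932)` divides the two peak-memory figures from the Memory table.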
Prerequisites
To compile this package, you need GCC and the Python development package installed.
- Fedora
sudo dnf install python3-devel gcc-c++
- Ubuntu 18.04
sudo apt install python3-dev g++-9
Installation
pip3 install PyDeduplines
Usage
import pydeduplines
# Reads the input files line by line and writes each line to the output
# file only the first time it is encountered.
pydeduplines.deduplicate_lines(['input'], 'output')
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Gal Ben David - gal@intsights.com
Project Link: https://github.com/intsights/PyDeduplines