Python library for duplicate line removal, written in C++

About The Project

PyDeduplines is a library for deduplicating multiple files line by line. It is written in C++ for speed and efficiency, and uses mimalloc, an allocator developed by Microsoft, for efficient memory allocation. For the deduplication itself, the library uses Parallel Hashmap, a fast and memory-efficient hash set implementation. In the benchmarks on a 500mb text file, the library is roughly twice as fast as sort -u and uses about half the memory.
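To illustrate the hash-set approach described above, here is a minimal pure-Python sketch. It is not the library's C++ implementation: the function name `deduplicate_lines_sketch` is hypothetical, and keeping lines in first-seen order and normalizing line endings are assumptions made for the example.

```python
from typing import Iterable


def deduplicate_lines_sketch(input_paths: Iterable[str], output_path: str) -> int:
    """Write each distinct line from the inputs to output_path exactly once.

    Hypothetical illustration only; lines are kept in first-seen order and
    the number of unique lines written is returned.
    """
    seen = set()  # lines already written, across all input files
    written = 0
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                for line in src:
                    # Strip the line terminator so a final line without a
                    # trailing newline still matches earlier occurrences.
                    key = line.rstrip(b"\r\n")
                    if key in seen:
                        continue
                    seen.add(key)
                    out.write(key + b"\n")
                    written += 1
    return written
```

The set holds every unique line in memory, which is why the choice of hash set and allocator matters for large inputs.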

Built With

  • mimalloc
  • Parallel Hashmap

Performance

| Library | Text Size | Function | Time | Improvement Factor |
|---|---|---|---|---|
| sort | 500mb | sort -u -o output input | 40.37s | 1.0x |
| PyDeduplines | 500mb | pydeduplines.deduplicate_lines(['input'], 'output') | 18.54s | 2.17x |

| Library | Text Size | Function | Peak RSS Memory (bytes) | Improvement Factor |
|---|---|---|---|---|
| sort | 500mb | sort -u -o output input | 4802100 | 1.0x |
| PyDeduplines | 500mb | pydeduplines.deduplicate_lines(['input'], 'output') | 2345932 | 2.05x |
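The Improvement Factor column appears to be the baseline measurement divided by the PyDeduplines measurement. The short sketch below recomputes it from the figures in the tables; the helper name `improvement_factor` is hypothetical and the formula is an assumption inferred from the numbers, not something the project states.

```python
def improvement_factor(baseline: float, candidate: float) -> float:
    """Return how many times smaller the candidate measurement is than the baseline."""
    if candidate <= 0:
        raise ValueError("candidate measurement must be positive")
    return baseline / candidate


# Figures copied from the benchmark tables above.
time_factor = improvement_factor(40.37, 18.54)        # seconds
memory_factor = improvement_factor(4802100, 2345932)  # peak RSS as reported

print(f"time: {time_factor:.2f}x, memory: {memory_factor:.2f}x")
```

Both results agree with the published factors to within rounding of the last digit.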


To compile this package, you need GCC and the Python development package installed.

  • Fedora
sudo dnf install python3-devel gcc-c++
  • Ubuntu 18.04
sudo apt install python3-dev g++-9


pip3 install PyDeduplines


import pydeduplines

# Reads the input files line by line and writes each line to the output
# file only the first time it is encountered.
pydeduplines.deduplicate_lines(['input'], 'output')


Distributed under the MIT License. See LICENSE for more information.


Gal Ben David -

Project Link:

Files for PyDeduplines, version 0.1.1: PyDeduplines-0.1.1.tar.gz (237.6 kB, source distribution)