Python library for a duplicate lines removal written in C++

About The Project

PyDeduplines is a library for deduplicating multiple files, line by line. It is written in C++ for speed and efficiency, and it uses mimalloc, an allocator written by Microsoft, for efficient memory allocation. For the deduplication itself, the library uses Parallel Hashmap, a hash set implementation that is fast and memory efficient. In the benchmarks listed under Performance, run on a 500 MB text file, the library is about twice as fast as GNU sort -u and uses about half the peak memory.
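
The hash-set approach described above can be illustrated in pure Python. This is only a conceptual sketch, not the library's C++ implementation: the function name `deduplicate_lines_sketch` is hypothetical, Python's built-in `set` stands in for Parallel Hashmap, and comparing lines without their terminator is an assumption.

```python
import os
import tempfile


def deduplicate_lines_sketch(input_paths, output_path):
    """Write each distinct line from the input files once, in first-seen order.

    Hypothetical pure-Python illustration; not the PyDeduplines implementation.
    Returns the number of distinct lines written.
    """
    seen = set()  # stands in for the hash set of lines already written
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                for line in src:
                    # Drop the terminator so a final line without a newline
                    # still matches earlier occurrences of the same text.
                    key = line.rstrip(b"\r\n")
                    if key not in seen:
                        seen.add(key)
                        out.write(key + b"\n")
    return len(seen)


# Small demonstration on two temporary files that share some lines.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.txt")
    second = os.path.join(tmp, "second.txt")
    output = os.path.join(tmp, "output.txt")
    with open(first, "wb") as f:
        f.write(b"apple\nbanana\napple\n")
    with open(second, "wb") as f:
        f.write(b"banana\ncherry")
    unique = deduplicate_lines_sketch([first, second], output)
    with open(output, "rb") as f:
        print(unique, f.read())
```

Every distinct line has to stay in memory for the whole run, so memory use grows with the number of unique lines rather than with the total input size.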

Performance

CPU

Library      | Text Size | Function                                            | Time    | Improvement Factor
GNU Sort     | 500 MB    | sort -u -o output input                             | 40.37 s | 1.0x
PyDeduplines | 500 MB    | pydeduplines.deduplicate_lines(['input'], 'output') | 18.54 s | 2.17x

Memory

Library      | Text Size | Function                                            | Peak RSS Memory (bytes) | Improvement Factor
GNU Sort     | 500 MB    | sort -u -o output input                             | 4802100                 | 1.0x
PyDeduplines | 500 MB    | pydeduplines.deduplicate_lines(['input'], 'output') | 2345932                 | 2.05x
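
The source does not say how these figures were collected. As one possible way to take comparable measurements on a Unix-like system, the sketch below times a single call and reads the process's peak resident set size from the standard `resource` module. The helper name `measure` is hypothetical, and `sorted` is used only as a stand-in workload.

```python
import resource
import time


def measure(func, *args, **kwargs):
    """Return (result, elapsed_seconds, peak_rss) for one call of func.

    peak_rss is the process-wide high-water mark reported by getrusage:
    kilobytes on Linux, bytes on macOS. It never decreases, so each
    candidate should be measured in a fresh process for a fair comparison.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_rss


# Stand-in workload; replace with the call being benchmarked.
result, elapsed, peak_rss = measure(sorted, ["b", "a", "b"])
print(f"elapsed: {elapsed:.6f} s, peak RSS: {peak_rss}")
```

Because `ru_maxrss` uses different units on different platforms, it is worth recording the platform alongside any numbers gathered this way.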

Prerequisites

To compile this package, you need GCC and the Python development package installed.

  • Fedora
sudo dnf install python3-devel gcc-c++
  • Ubuntu 18.04
sudo apt install python3-dev g++-9

Installation

pip3 install PyDeduplines

Usage

import pydeduplines

# Reads the input files line by line and writes each line to the output
# file only the first time it is encountered.
pydeduplines.deduplicate_lines(['input'], 'output')
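
After a run, the result can be sanity-checked with a few lines of plain Python. The helper below is a hypothetical sketch, not part of PyDeduplines: it assumes lines are compared as bytes without their terminator, and it loads every line into memory, so it is only suitable for small files.

```python
def check_deduplicated(input_paths, output_path):
    """Return True if the output has no repeated lines and contains
    exactly the set of distinct lines found across the input files.

    Hypothetical verification helper; not part of the PyDeduplines API.
    """
    expected = set()
    for path in input_paths:
        with open(path, "rb") as src:
            expected.update(line.rstrip(b"\r\n") for line in src)
    with open(output_path, "rb") as out:
        produced = [line.rstrip(b"\r\n") for line in out]
    no_repeats = len(produced) == len(set(produced))
    return no_repeats and set(produced) == expected
```

For the call shown in the usage above, this would be invoked as `check_deduplicated(['input'], 'output')`.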

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - gal@intsights.com

Project Link: https://github.com/intsights/PyDeduplines

Source Distribution

PyDeduplines-0.1.2.tar.gz (237.7 kB view details)

Uploaded Source

File details

Details for the file PyDeduplines-0.1.2.tar.gz.

File metadata

  • Download URL: PyDeduplines-0.1.2.tar.gz
  • Size: 237.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.7

File hashes

Hashes for PyDeduplines-0.1.2.tar.gz
Algorithm Hash digest
SHA256 85d15e886f53a38c4b908a91adc0107bcd9801e192e42755b6d115fd92a94aef
MD5 1c5758b0999ff9e60ae9d7cc6f17395e
BLAKE2b-256 b3ba98530d684a580181dfe5ce0d317a48ac91aabe7aeaaa0d5f498302115bcf
