Skip to main content

Optimized code for text de-duplication, written in Rust

Project description

dupekit

Raison d'être: Home for the Rust code used for text deduplication.

Install

  • Locally: This code is auto-magically built by uv via Cargo and Maturin. You might need to install them (e.g., brew install maturin rust on macOS).
  • Cluster: This code is compiled as part of the Docker build (uv pip install -e ... step): Maturin builds the Rust code and places it in the system site-packages (e.g., /home/ray/anaconda3/lib/python3.11/site-packages/dupekit/dupekit.abi3.so).

[!NOTE] What about making dupekit a hybrid Python/Rust Maturin workspace? We tried and experienced issues getting the Docker build to work while keeping it simple—a simple Rust workspace helps keep the setup clean.

[!NOTE] Building from source requires a Rust toolchain (Cargo). Pre-built wheels are available from GitHub Releases for users who don't want to compile locally.

Benchmarking

The goal of these benchmarks is to test different ways of marshaling large text content between Python and Rust "foreign function interface" (wiki:FFI). These tests are designed to isolate the overhead of marshaling from the actual Rust computation (by doing minimal processing in Rust).

Dataset: 1 shard of HuggingFaceFW/fineweb-edu/sample/10BT (2.15 GB Parquet file, benchmarked on 250k out of 726k documents)

Install:

uv sync --all-packages --extra=benchmark --group dev

Benchmark (Takes a few minutes):

uv run pytest lib/dupekit/tests/bench/test_dedupe.py --run-benchmark --benchmark-min-rounds=20
uv run pytest lib/dupekit/tests/bench/test_marshaling.py --run-benchmark
uv run pytest lib/dupekit/tests/bench/test_batch_tuning.py --run-benchmark
uv run pytest lib/dupekit/tests/bench/test_io.py --run-benchmark
uv run pytest lib/dupekit/tests/bench/test_hashing.py --run-benchmark
uv run pytest lib/dupekit/tests/bench/test_minhash.py --run-benchmark

Note: Run separated by type of benchmark (otherwise results are mixed within one table)

Footprint (Note: sampling the stack might taint the mem measurements, so we disable benchmarking):

uv run pytest lib/dupekit/tests/bench/test_marshaling.py \
  --run-benchmark \
  --benchmark-disable \
  --memray \
  --native \
  --most-allocations=0

Results

Dedup: Rust vs. Python

---------------------------------------------------------------------------- benchmark 'Documents: Exact Deduplication': 2 tests ----------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean            StdDev              Median               IQR            Outliers       OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_deduplication[rust-documents]         3.9872 (1.0)        5.2516 (1.0)        4.2341 (1.0)      0.1949 (1.0)        4.2247 (1.0)      0.2845 (1.0)          52;2  236.1805 (1.0)         188           1
test_deduplication[python-documents]     133.8747 (33.58)    157.3844 (29.97)    139.6233 (32.98)    7.7842 (39.94)    135.3300 (32.03)    9.6717 (34.00)         5;0    7.1621 (0.03)         20           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------- benchmark 'Documents: Hash Generation': 2 tests ---------------------------------------------------------------------------
Name (time in ms)                       Min                 Max                Mean            StdDev              Median               IQR            Outliers       OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_hashing[rust-documents]         2.2755 (1.0)        2.4842 (1.0)        2.3041 (1.0)      0.0301 (1.0)        2.2938 (1.0)      0.0381 (1.0)          50;9  434.0169 (1.0)         375           1
test_hashing[python-documents]     130.0445 (57.15)    132.3783 (53.29)    130.7795 (56.76)    0.6663 (22.13)    130.5910 (56.93)    0.6259 (16.44)         5;3    7.6465 (0.02)         20           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------- benchmark 'Paragraphs: Exact Deduplication': 2 tests ----------------------------------------------------------------------------
Name (time in ms)                              Min                 Max                Mean             StdDev              Median                IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_deduplication[rust-paragraphs]        85.4666 (1.0)      109.3652 (1.0)       90.2916 (1.0)       6.8294 (1.0)       87.3405 (1.0)       2.0275 (1.0)           4;4  11.0752 (1.0)          20           1
test_deduplication[python-paragraphs]     303.0885 (3.55)     342.9836 (3.14)     321.3022 (3.56)     13.8377 (2.03)     329.4886 (3.77)     25.1111 (12.39)         9;0   3.1123 (0.28)         20           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------- benchmark 'Paragraphs: Hash Generation': 2 tests ---------------------------------------------------------------------------
Name (time in ms)                        Min                 Max                Mean             StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_hashing[rust-paragraphs]        23.9739 (1.0)       26.9860 (1.0)       25.3823 (1.0)       0.4419 (1.0)       25.3160 (1.0)      0.2099 (1.0)           5;5  39.3975 (1.0)          38           1
test_hashing[python-paragraphs]     247.5415 (10.33)    321.4654 (11.91)    255.3421 (10.06)    19.0653 (43.15)    249.1948 (9.84)     1.7899 (8.53)          2;2   3.9163 (0.10)         20           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Marshaling

-------------------------------------------------------------------------------------------- benchmark: 7 tests -------------------------------------------------------------------------------------------
Name (time in ms)                    Min                   Max                  Mean             StdDev                Median                IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_arrow_giant                 86.4414 (1.0)         96.0537 (1.01)        90.0259 (1.0)       2.8582 (31.54)       90.4363 (1.0)       4.1787 (27.96)         3;0  11.1079 (1.0)          11           1
test_arrow_small                 94.4010 (1.09)        94.6679 (1.0)         94.5616 (1.05)      0.0906 (1.0)         94.5570 (1.05)      0.1494 (1.0)           5;0  10.5751 (0.95)         11           1
test_dicts_batched_stream     3,975.1581 (45.99)    3,979.7102 (42.04)    3,977.7639 (44.18)     1.8357 (20.26)    3,978.3399 (43.99)     2.8370 (18.98)         2;0   0.2514 (0.02)          5           1
test_dicts_batch              4,398.7191 (50.89)    4,421.9632 (46.71)    4,410.0489 (48.99)     8.7694 (96.78)    4,411.2232 (48.78)    12.0295 (80.50)         2;0   0.2268 (0.02)          5           1
test_dicts_loop               4,411.8727 (51.04)    4,457.0985 (47.08)    4,431.9081 (49.23)    19.8323 (218.86)   4,430.5465 (48.99)    35.6846 (238.78)        2;0   0.2256 (0.02)          5           1
test_rust_structs             4,449.5728 (51.47)    4,479.8173 (47.32)    4,465.2999 (49.60)    14.1041 (155.65)   4,472.5336 (49.46)    24.8971 (166.60)        3;0   0.2239 (0.02)          5           1
test_arrow_tiny               7,023.5789 (81.25)    7,064.2094 (74.62)    7,044.9691 (78.25)    19.4414 (214.55)   7,047.1538 (77.92)    37.8036 (252.96)        1;0   0.1419 (0.01)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

PyArrow Batch Size

--------------------------------------------------------------------------------------------- benchmark: 11 tests ----------------------------------------------------------------------------------------------
Name (time in ms)                         Min                   Max                  Mean             StdDev                Median                IQR            Outliers      OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_arrow_batch_sizes[8192]          28.6030 (1.0)         32.0178 (1.07)        29.2802 (1.0)       0.8846 (4.75)        28.9333 (1.0)       0.7970 (3.52)          5;3  34.1528 (1.0)          34           1
test_arrow_batch_sizes[16384]         28.7303 (1.00)        30.8987 (1.03)        29.3111 (1.00)      0.5447 (2.92)        29.1404 (1.01)      0.5907 (2.61)          9;2  34.1168 (1.00)         33           1
test_arrow_batch_sizes[4096]          28.8488 (1.01)        30.1474 (1.01)        29.2876 (1.00)      0.3776 (2.03)        29.2212 (1.01)      0.6339 (2.80)         12;0  34.1441 (1.00)         34           1
test_arrow_batch_sizes[2048]          29.1493 (1.02)        30.4442 (1.02)        29.5710 (1.01)      0.3013 (1.62)        29.5505 (1.02)      0.3483 (1.54)         10;1  33.8169 (0.99)         32           1
test_arrow_batch_sizes[32768]         29.2200 (1.02)        29.9410 (1.0)         29.5896 (1.01)      0.1863 (1.0)         29.5706 (1.02)      0.2423 (1.07)         11;0  33.7956 (0.99)         34           1
test_arrow_batch_sizes[65536]         30.3973 (1.06)        31.3805 (1.05)        30.9409 (1.06)      0.2453 (1.32)        30.9829 (1.07)      0.2263 (1.0)           9;3  32.3197 (0.95)         33           1
test_arrow_batch_sizes[131072]        30.7074 (1.07)        33.1845 (1.11)        31.4322 (1.07)      0.6799 (3.65)        31.1102 (1.08)      0.8739 (3.86)          6;1  31.8145 (0.93)         32           1
test_arrow_batch_sizes[1024]          30.7724 (1.08)        32.6049 (1.09)        31.6173 (1.08)      0.5506 (2.96)        31.6311 (1.09)      0.9233 (4.08)         13;0  31.6283 (0.93)         30           1
test_arrow_batch_sizes[512]           33.8866 (1.18)        36.2981 (1.21)        34.5224 (1.18)      0.6189 (3.32)        34.2960 (1.19)      0.5087 (2.25)          6;3  28.9667 (0.85)         29           1
test_arrow_batch_sizes[128]           51.0530 (1.78)        56.3190 (1.88)        53.5492 (1.83)      1.6124 (8.65)        53.7474 (1.86)      2.3557 (10.41)         7;0  18.6744 (0.55)         18           1
test_arrow_batch_sizes[1]          2,781.2088 (97.23)    2,812.2547 (93.93)    2,797.8572 (95.55)    11.6892 (62.74)    2,801.0024 (96.81)    15.3956 (68.03)         2;0   0.3574 (0.01)          5           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I/O

------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------
Name (time in s)          Min               Max              Mean            StdDev            Median               IQR            Outliers     OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rust_native       1.6757 (1.0)      1.6848 (1.0)      1.6794 (1.0)      0.0035 (1.26)     1.6783 (1.0)      0.0047 (1.73)          2;0  0.5955 (1.0)           5           1
test_arrow_giant       2.9501 (1.76)     2.9570 (1.76)     2.9521 (1.76)     0.0028 (1.0)      2.9511 (1.76)     0.0027 (1.0)           1;0  0.3387 (0.57)          5           1
test_arrow_small       3.3476 (2.00)     3.6588 (2.17)     3.5583 (2.12)     0.1241 (44.48)    3.5726 (2.13)     0.1289 (47.18)         1;0  0.2810 (0.47)          5           1
test_dicts_loop_io     7.3664 (4.40)     7.3913 (4.39)     7.3837 (4.40)     0.0101 (3.63)     7.3871 (4.40)     0.0113 (4.14)          1;0  0.1354 (0.23)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Hashing

--------------------------------------------------------------------------------------- benchmark: 6 tests ---------------------------------------------------------------------------------------
Name (time in ms)                     Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_hash_rust_xxh3_64_batch       4.4886 (1.0)       4.9466 (1.0)       4.5860 (1.0)      0.0616 (1.57)      4.5939 (1.0)      0.0957 (2.52)         74;1  218.0558 (1.0)         210           1
test_hash_rust_xxh3_64_scalar      5.0276 (1.12)      5.3367 (1.08)      5.1276 (1.12)     0.0393 (1.0)       5.1307 (1.12)     0.0379 (1.0)         41;12  195.0244 (0.89)        190           1
test_hash_rust_xxh3_128            6.1686 (1.37)      6.5772 (1.33)      6.2901 (1.37)     0.1098 (2.79)      6.2334 (1.36)     0.1731 (4.56)         37;0  158.9811 (0.73)        160           1
test_hash_rust_blake3             28.7743 (6.41)     29.0392 (5.87)     28.8919 (6.30)     0.0593 (1.51)     28.8799 (6.29)     0.0709 (1.87)         10;1   34.6118 (0.16)         35           1
test_hash_rust_blake2             54.1043 (12.05)    55.0271 (11.12)    54.4180 (11.87)    0.3711 (9.43)     54.1916 (11.80)    0.7337 (19.34)         5;0   18.3763 (0.08)         19           1
test_hash_python_blake2b          84.0109 (18.72)    84.1698 (17.02)    84.0611 (18.33)    0.0465 (1.18)     84.0469 (18.30)    0.0595 (1.57)          3;0   11.8961 (0.05)         12           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Mem Footprin (sorted from high to low):

Allocation results for lib/dupekit/tests/bench/test_marshaling.py::test_rust_structs at the high watermark

	 📦 Total memory allocated: 4.3GiB
	 📏 Total allocations: 21
	 📊 Histogram of allocation sizes: | ▃█▁▃|

Allocation results for lib/dupekit/tests/bench/test_marshaling.py::test_dicts_batch at the high watermark

	 📦 Total memory allocated: 3.3GiB
	 📏 Total allocations: 20
	 📊 Histogram of allocation sizes: |  ▁█▂|

Allocation results for lib/dupekit/tests/bench/test_marshaling.py::test_dicts_loop at the high watermark

	 📦 Total memory allocated: 3.3GiB
	 📏 Total allocations: 19
	 📊 Histogram of allocation sizes: |  ▁█▂|

Allocation results for lib/dupekit/tests/bench/test_marshaling.py::test_arrow_giant at the high watermark

	 📦 Total memory allocated: 64.9MiB
	 📏 Total allocations: 36
	 📊 Histogram of allocation sizes: |▅█   |

Allocation results for lib/dupekit/tests/bench/test_marshaling.py::test_dicts_batched_stream at the high watermark

	 📦 Total memory allocated: 28.1MiB
	 📏 Total allocations: 7
	 📊 Histogram of allocation sizes: |█▄▄▄▄|

Allocation results for lib/dupekit/tests/bench/test_marshaling.py::test_arrow_tiny at the high watermark

	 📦 Total memory allocated: 22.0MiB
	 📏 Total allocations: 37
	 📊 Histogram of allocation sizes: |█▇   |

Allocation results for lib/dupekit/tests/bench/test_marshaling.py::test_arrow_small at the high watermark

	 📦 Total memory allocated: 551.7KiB
	 📏 Total allocations: 42
	 📊 Histogram of allocation sizes: |▂█▁  |

Statement of attribution:

  • This code was seeded from nelson-liu/rbloom-gcs.
  • Bloom filters were originally proposed in (Bloom, 1970). Furthermore, this implementation makes use of a constant recommended by (L'Ecuyer, 1999) for redistributing the entropy of a single hash over multiple integers using a linear congruential generator.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

marin_dupekit-0.1.2.dev202606111008.tar.gz (29.9 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

marin_dupekit-0.1.2.dev202606111008-cp311-abi3-manylinux_2_28_x86_64.whl (4.8 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.28+ x86-64

marin_dupekit-0.1.2.dev202606111008-cp311-abi3-manylinux_2_28_aarch64.whl (4.4 MB view details)

Uploaded CPython 3.11+manylinux: glibc 2.28+ ARM64

marin_dupekit-0.1.2.dev202606111008-cp311-abi3-macosx_11_0_arm64.whl (4.2 MB view details)

Uploaded CPython 3.11+macOS 11.0+ ARM64

marin_dupekit-0.1.2.dev202606111008-cp311-abi3-macosx_10_12_x86_64.whl (4.6 MB view details)

Uploaded CPython 3.11+macOS 10.12+ x86-64

File details

Details for the file marin_dupekit-0.1.2.dev202606111008.tar.gz.

File metadata

File hashes

Hashes for marin_dupekit-0.1.2.dev202606111008.tar.gz
Algorithm Hash digest
SHA256 48fadef6f5b03bada20e100839c16bbcd801defd8f87784f497a1db54614df70
MD5 b8f73dc815f96eb7fe724022c05dd7f3
BLAKE2b-256 4ffef3df10c01b222292eea0ac74ce0ca28374f670e47156c03e372e5a231038

See more details on using hashes here.

Provenance

The following attestation bundles were made for marin_dupekit-0.1.2.dev202606111008.tar.gz:

Publisher: dupekit-release-wheels.yaml on marin-community/marin

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file marin_dupekit-0.1.2.dev202606111008-cp311-abi3-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for marin_dupekit-0.1.2.dev202606111008-cp311-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 de3904420c492ad145b479b077f0033f15128de81700a94d47675bc0d422f162
MD5 36de32a7dabafd4608b765e498a25f1a
BLAKE2b-256 b04457e534fc118125b6c28ba31dd36344a09908e13391b43a63575dd82df431

See more details on using hashes here.

Provenance

The following attestation bundles were made for marin_dupekit-0.1.2.dev202606111008-cp311-abi3-manylinux_2_28_x86_64.whl:

Publisher: dupekit-release-wheels.yaml on marin-community/marin

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file marin_dupekit-0.1.2.dev202606111008-cp311-abi3-manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for marin_dupekit-0.1.2.dev202606111008-cp311-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 14b688d263e5e743c255320df63c93f7af964eccc04a57dc042987efc8471fd6
MD5 22835a7ee5ee0e3474bc39b234299498
BLAKE2b-256 e4626cc6bef47e88a4143377df82850a21a6c7ff16fac8bd98a851cc1efbf48a

See more details on using hashes here.

Provenance

The following attestation bundles were made for marin_dupekit-0.1.2.dev202606111008-cp311-abi3-manylinux_2_28_aarch64.whl:

Publisher: dupekit-release-wheels.yaml on marin-community/marin

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file marin_dupekit-0.1.2.dev202606111008-cp311-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for marin_dupekit-0.1.2.dev202606111008-cp311-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 3c4fdc17d6ea25181792258acf6dbc534def2bf3d8a77ab70bacd93efb7a0089
MD5 60b5003bef04b7b4c88040932eb86ff2
BLAKE2b-256 ab3f097481b2bf727f91e5785da742e151e34b522c5a6d1aa55e65b0e3767424

See more details on using hashes here.

Provenance

The following attestation bundles were made for marin_dupekit-0.1.2.dev202606111008-cp311-abi3-macosx_11_0_arm64.whl:

Publisher: dupekit-release-wheels.yaml on marin-community/marin

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file marin_dupekit-0.1.2.dev202606111008-cp311-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for marin_dupekit-0.1.2.dev202606111008-cp311-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 3a1874b21214927bf1064ca9b48bc3a90cd8bab2debdf30908a6d60dcc1d850e
MD5 5dd8e18b3fecc5f99d9bf9401c7fb1a7
BLAKE2b-256 9cc71768e90fdbcd585d740a03705f09c37b18700d64cae4e0732fa7f76f24a6

See more details on using hashes here.

Provenance

The following attestation bundles were made for marin_dupekit-0.1.2.dev202606111008-cp311-abi3-macosx_10_12_x86_64.whl:

Publisher: dupekit-release-wheels.yaml on marin-community/marin

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page