
Project description

Introduction

A deduplication filter library offering common dedup schemes; easy to develop with and extremely fast.

Deduplication Schemes

| Type | Filter | Implementation | Strengths | Drawbacks | Eviction strategy |
| --- | --- | --- | --- | --- | --- |
| Memory | MemoryFilter | In-memory set types | High accuracy | No persistence | Random removal |
| File | FileFilter | File storage plus set types | High accuracy | Heavy local memory and disk usage | Removal by file-pointer range |
| Redis | RedisBloomFilter / AsyncRedisBloomFilter | Redis Bitmap plus the bloom-filter algorithm | Very small memory footprint | False positives possible; elements hard to delete | Random removal |
| Redis | RedisStringFilter / AsyncRedisStringFilter | Redis String data structure | No false positives; expiry supports query dedup and confirmation | Heavy resource usage; values should be compressed and given an expiry where possible | Expiry (TTL) |
| Redis | RedisSetFilter / AsyncRedisSetFilter | Redis Set data structure | High accuracy | Fairly heavy resource usage | Random removal |
| Redis | RedisSortedSetFilter / AsyncRedisSortedSetFilter | Redis SortedSet data structure | High accuracy | Fairly heavy resource usage | Removal by score |
| SQL | SQLFilter | Primary key of a SQL relational table | High accuracy | Poor performance at large dedup scale | Removal by time |
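The Redis bloom-filter rows above rest on one idea: each value is hashed to several bit offsets in a bitmap, and a value counts as seen only when all of its bits are already set. A minimal pure-Python sketch of that principle (a `bytearray` stands in for a Redis Bitmap; this is an illustration, not dupfilter's actual code):

```python
import hashlib

BITS = 1 << 16     # total bits in the bitmap (Redis SETBIT would address a key)
HASH_COUNT = 4     # number of hash functions (k)

def _offsets(value: str):
    # derive k bit offsets from salted SHA-1 digests of the value
    for seed in range(HASH_COUNT):
        digest = hashlib.sha1(f"{seed}:{value}".encode()).hexdigest()
        yield int(digest, 16) % BITS

def insert(bitmap: bytearray, value: str) -> None:
    # set all k bits for this value
    for off in _offsets(value):
        bitmap[off // 8] |= 1 << (off % 8)

def exists(bitmap: bytearray, value: str) -> bool:
    # seen only if every one of the k bits is already set
    return all(bitmap[off // 8] & (1 << (off % 8)) for off in _offsets(value))

bitmap = bytearray(BITS // 8)
print(exists(bitmap, "1"))   # False: no bits set yet
insert(bitmap, "1")
print(exists(bitmap, "1"))   # True: all k bits now set
```

This also shows why deletion is hard (a bit may be shared by several values) and why false positives are possible (an unseen value's bits may all have been set by other values).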

Features

  1. Multiple schemes to cover different scenarios.
  2. Lua-script-based batch operations for high speed.
  3. Async support, for quick integration into async code and frameworks.

Examples

RedisBloomFilter

import redis
from dupfilter import RedisBloomFilter

server = redis.Redis(host="127.0.0.1", port=6379)
rbf = RedisBloomFilter(server=server, key="bf", block_num=2)
print(rbf.exists_many(["1", "2", "3"]))  # check before inserting
rbf.insert_many(["1", "2", "3"])
print(rbf.exists_many(["1", "2", "3"]))  # check again after inserting

AsyncRedisBloomFilter

import asyncio
import aioredis
from dupfilter import AsyncRedisBloomFilter


async def test():
    server = aioredis.from_url('redis://127.0.0.1:6379/0')
    arbf = AsyncRedisBloomFilter(server, key='bf')
    stats = await arbf.exists_many(["1", "2", "3"])
    print(stats)
    await arbf.insert_many(["1", "2", "3"])
    stats = await arbf.exists_many(["1", "2", "3"])
    print(stats)


asyncio.run(test())

DefaultFilter

In a project, an outer-level parameter may decide whether deduplication runs at all. To keep the calling logic consistent in both cases, a default (pass-through) filter class is provided.

from dupfilter import MemoryFilter
from dupfilter import DefaultFilter

is_dup = True  # global switch for deduplication
if is_dup:
    flr = MemoryFilter()
else:
    flr = DefaultFilter(default_stat=False)

print(flr.exists("1"))
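The internals of such a pass-through filter can be sketched in a few lines — a hypothetical re-implementation for illustration, not dupfilter's actual DefaultFilter:

```python
class PassThroughFilter:
    """Hypothetical pass-through filter: consults no state and always
    returns the configured stat, so callers keep a single code path
    whether or not deduplication is enabled."""

    def __init__(self, default_stat: bool = False):
        self.default_stat = default_stat

    def exists(self, value) -> bool:
        # nothing is ever recorded; the fixed stat is always returned
        return self.default_stat

    def exists_many(self, values) -> list:
        return [self.default_stat for _ in values]

flr = PassThroughFilter(default_stat=False)
print(flr.exists("1"))                    # False: nothing counts as seen
print(flr.exists_many(["1", "2", "3"]))   # [False, False, False]
```

With `default_stat=False` every value is treated as new (nothing is filtered out); with `default_stat=True` every value is treated as a duplicate.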

FilterCounter

Aggregates deduplication results for evaluation and counting.

from dupfilter import MemoryFilter
from dupfilter import FilterCounter
flt = MemoryFilter()
flt_counter = FilterCounter()
values = ['1', '2', '3']
for value in values:
    flt_counter.insert_stat(flt.exists(value))

# evaluate and count the collected stats
print(flt_counter.any(), flt_counter.all(), flt_counter.count())
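The any/all/count semantics can be read as follows — a hypothetical re-implementation for illustration, not the library's actual code:

```python
class StatCounter:
    """Hypothetical counter over per-value duplicate stats (booleans)."""

    def __init__(self):
        self.stats = []

    def insert_stat(self, stat: bool) -> None:
        self.stats.append(stat)

    def any(self) -> bool:       # at least one value was a duplicate
        return any(self.stats)

    def all(self) -> bool:       # every value was a duplicate
        return all(self.stats)

    def count(self) -> int:      # how many values were duplicates
        return sum(self.stats)

counter = StatCounter()
for stat in [True, False, True]:
    counter.insert_stat(stat)
print(counter.any(), counter.all(), counter.count())  # True False 2
```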

Others

The remaining filters work the same way as the examples above.

Related Libraries

  1. redis: redis / aioredis
  2. mysql: pymysql / aiomysql
  3. sqlite: sqlite3
  4. oracle: cx_Oracle / cx_Oracle_async

Roadmap

  1. Complete the reset logic for some of the deduplication schemes.

About the Author

  1. Email: 1194542196@qq.com
  2. WeChat: hu1194542196



Download files

Download the file for your platform.

Source Distribution

dupfilter-0.0.5.tar.gz (13.1 kB)

Uploaded Source

Built Distribution

dupfilter-0.0.5-py3-none-any.whl (17.7 kB)

Uploaded Python 3

File details

Details for the file dupfilter-0.0.5.tar.gz.

File metadata

  • Download URL: dupfilter-0.0.5.tar.gz
  • Upload date:
  • Size: 13.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.5

File hashes

Hashes for dupfilter-0.0.5.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 5124bfc61b83fc14898d7d92f89ced9af360732c6681ca652f1954b9a429ffb2 |
| MD5 | 66d7873adff5bc23eeaa24be5eabd110 |
| BLAKE2b-256 | f96465b08506342f1edff2904a708441074e94279a71410ca5abc63b78354a06 |


File details

Details for the file dupfilter-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: dupfilter-0.0.5-py3-none-any.whl
  • Upload date:
  • Size: 17.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.5

File hashes

Hashes for dupfilter-0.0.5-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 93d879807ea9c60781274e7400db2f6f84de86ae93cb04efb41f01d63ecf7c03 |
| MD5 | a98612319b6170df5552b3347bb924c2 |
| BLAKE2b-256 | defaa7fb296fa9e4e2bb78458a4d74e7ae1271605528beae26c9be5bf9a9b6a1 |

