DashText is a Text Modal Data Library

DashText Python Library

DashText is a Python package for DashVector's sparse-dense (hybrid) semantic search. It provides a set of text utilities and an integrated encoder named SparseVectorEncoder.

Installation

To install the DashText Client, simply run:

pip install dashtext

QuickStart

SparseVector Encoding

With the default models, DashText makes it easy to convert a text corpus into sparse vectors.

from dashtext import SparseVectorEncoder

# Initialize an encoder instance and load a default DashText model
encoder = SparseVectorEncoder.default('zh')

# Encode a new document (for upsert to DashVector)
document = "向量检索服务DashVector基于达摩院自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务。"
print(encoder.encode_documents(document))
# {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}

# Encode a query (for search in DashVector)
query = "什么是向量检索服务?"
print(encoder.encode_queries(query))

# For comparison, encode_queries(document) on the document above returns:
# {380823393: 0.08361891359384604, 414191989: 0.09229860190522488, 565176162: 0.04535506923676476, 904594806: 0.020073288360284405, 1005505802: 0.027556881447714194, 1169440797: 0.04022365461249135, 1240922502: 0.050572420319144815, 1313971048: 0.01574978858878569, 1317077351: 0.03899710322573238, 1490140460: 0.03401309416846664, 1574737055: 0.03240084602715354, 1760434515: 0.11848476345398339, 2045788977: 0.09625917015244072, 2141666983: 0.11848476345398339, 2509543087: 0.05570020739487387, 3180265193: 0.023553249869916984, 3845702398: 0.05542717955003807, 4106887295: 0.05123100463915489}

SparseVector Parameters

The SparseVectorEncoder class is based on the BM25 algorithm, so it takes the parameters that BM25 requires along with parameters for the text-processing utilities.

  • b: Document length normalization used by BM25 (default: 0.75).
  • k1: Term frequency saturation used by BM25 (default: 1.2).
  • tokenize_function: Tokenization function, such as SentencePiece or GPTTokenizer in Transformers; its output may be a string array or an integer array (default: Jieba).
  • hash_function: Hash function used when tokens need to be converted to numbers after tokenization (default: None).
  • hash_bucket_function: Bucketing function used when hash values need to be divided into a finite number of buckets (default: None).

Example:

from dashtext import SparseVectorEncoder
from dashtext import TextTokenizer

tokenizer = TextTokenizer().from_pretrained("Jieba", stop_words=True)

encoder = SparseVectorEncoder(b=0.75, k1=1.2, tokenize_function=tokenizer.tokenize)
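
To make the roles of b and k1 concrete, here is a minimal sketch of the textbook BM25 term-frequency component. It is not taken from DashText's source, and the function name bm25_tf_weight is hypothetical; DashText's exact weighting may differ.

```python
def bm25_tf_weight(tf: float, doc_len: float, avg_doc_len: float,
                   k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 term-frequency component (without the IDF factor)."""
    # b controls how strongly document length rescales the term frequency.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding weight.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Saturation: 10 occurrences weigh more than 1, but far less than 10 times more.
w1 = bm25_tf_weight(tf=1, doc_len=100, avg_doc_len=100)
w10 = bm25_tf_weight(tf=10, doc_len=100, avg_doc_len=100)

# Length normalization: the same tf counts for less in a longer document.
w_long = bm25_tf_weight(tf=1, doc_len=400, avg_doc_len=100)

# With b = 0, document length is ignored entirely.
w_long_b0 = bm25_tf_weight(tf=1, doc_len=400, avg_doc_len=100, b=0.0)
```

Raising k1 lets repeated terms keep adding weight for longer, while raising b penalizes long documents more strongly.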

Reference

Encode Documents

encode_documents(texts: Union[str, List[str], List[int], List[List[int]]]) -> Union[Dict, List[Dict]]

Parameters Type Required Description
texts Union[str, List[str], List[int], List[List[int]]] Yes One of:
  • str: a single text
  • List[str]: multiple texts
  • List[int]: hash representation of a single text
  • List[List[int]]: hash representation of multiple texts

Example:

# single text
texts1 = "DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成"
result = encoder.encode_documents(texts1)

# multiple texts
texts2 = ["DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
          "从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力"]
result = encoder.encode_documents(texts2)

# hash representation of a single text
texts3 = [1218191817, 2673099881, 2982218203, 3422996809]
result = encoder.encode_documents(texts3)

# hash representation of multiple texts
texts4 = [[1218191817, 2673099881, 2982218203, 3422996809], [2673099881, 2982218203, 3422996809, 771291085, 741580288]]
result = encoder.encode_documents(texts4)

# result example for a single text (list inputs return a list of such dicts)
# {59256732: 0.7340568742689919, 863271227: 0.7340568742689919, 904594806: 0.7340568742689919, 942054413: 0.7340568742689919, 1169440797: 0.8466352922575744, 1314384716: 0.7340568742689919, 1554119115: 0.7340568742689919, 1736403260: 0.7340568742689919, 2029341792: 0.7340568742689919, 2141666983: 0.7340568742689919, 2367386033: 0.7340568742689919, 2549501804: 0.7340568742689919, 3869223639: 0.7340568742689919, 4130523965: 0.7340568742689919, 4162843804: 0.7340568742689919, 4202556960: 0.7340568742689919}
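
The List[int] input forms expect tokens that have already been mapped to integers. The sketch below shows one way such a hash representation could be produced. The hash that DashText applies by default is not stated here, so CRC32 is only a stand-in, and hash_tokens and to_bucket are hypothetical helper names.

```python
import zlib
from typing import List


def hash_tokens(tokens: List[str]) -> List[int]:
    """Map each token to an unsigned 32-bit integer with CRC32 (illustrative choice)."""
    return [zlib.crc32(token.encode("utf-8")) for token in tokens]


def to_bucket(hash_value: int, num_buckets: int = 2 ** 20) -> int:
    """Fold a hash value into a fixed number of buckets."""
    return hash_value % num_buckets


# Identical tokens always map to the same integer.
tokens = ["vector", "search", "vector"]
hashed = hash_tokens(tokens)
bucketed = [to_bucket(h) for h in hashed]
```

Functions of this kind play the role described for the hash_function and hash_bucket_function parameters, although the exact callable signatures those parameters expect are an assumption here.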

Encode Queries

encode_queries(texts: Union[str, List[str], List[int], List[List[int]]]) -> Union[Dict, List[Dict]]
The input format is the same as the encode_documents method.

Example:

# single text
texts = "什么是向量检索服务?"
result = encoder.encode_queries(texts)
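
Both encode_documents and encode_queries return dicts that map an integer index to a weight, so a query and a document can be compared on the indices they share. The following is a minimal sketch of that comparison as an inner product. How DashVector actually ranks results is not described here, and sparse_dot is a hypothetical helper.

```python
from typing import Dict


def sparse_dot(query_vec: Dict[int, float], doc_vec: Dict[int, float]) -> float:
    """Inner product of two sparse vectors stored as {index: weight} dicts."""
    # Iterate over the smaller dict and look up matches in the larger one.
    small, large = sorted((query_vec, doc_vec), key=len)
    return sum(weight * large[idx] for idx, weight in small.items() if idx in large)


# Toy vectors with made-up indices and weights, not real encoder output.
doc_vec = {101: 0.7, 202: 0.9, 303: 0.7}
query_vec = {202: 0.5, 404: 0.5}
score = sparse_dot(query_vec, doc_vec)
```

Only index 202 appears in both vectors, so it is the only term that contributes to the score.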

Train / Dump / Load DashText Model

Train

train(corpus: Union[str, List[str], List[int], List[List[int]]]) -> None

Parameters Type Required Description
corpus Union[str, List[str], List[int], List[List[int]]] Yes One of:
  • str: a single text
  • List[str]: multiple texts
  • List[int]: hash representation of a single text
  • List[List[int]]: hash representation of multiple texts

Example:

corpus = [
    "向量检索服务DashVector基于达摩院自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务",
    "DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
    "从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力",
    "简单灵活、开箱即用的SDK,使用极简代码即可实现向量管理",
    "自研向量相似性比对算法,快速高效稳定服务",
    "Schema-free设计,通过Schema实现任意条件下的组合过滤查询"
]

encoder.train(corpus)

# use dump method to check parameters
encoder.dump("./dump_paras.json")
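
What train stores is not specified here. As a rough illustration, a BM25 model generally needs per-token document frequencies and the average document length, which the sketch below collects from an already tokenized corpus. The function name collect_bm25_stats and the toy corpus are hypothetical, and the layout of DashText's dumped JSON may look different.

```python
from collections import Counter
from typing import Dict, List, Tuple


def collect_bm25_stats(tokenized_corpus: List[List[str]]) -> Tuple[Dict[str, int], float]:
    """Return (document frequency per token, average document length)."""
    doc_freq: Counter = Counter()
    total_len = 0
    for tokens in tokenized_corpus:
        total_len += len(tokens)
        doc_freq.update(set(tokens))  # count each token once per document
    avg_doc_len = total_len / len(tokenized_corpus) if tokenized_corpus else 0.0
    return dict(doc_freq), avg_doc_len


toy_corpus = [
    ["vector", "search", "service"],
    ["vector", "management", "sdk", "api"],
    ["similarity", "search"],
]
doc_freq, avg_doc_len = collect_bm25_stats(toy_corpus)
```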

Dump and Load

dump(path: str) -> None
load(path: str) -> None

Parameters Type Required Description
path str Yes For dump: the path of the JSON file to write the model parameters to. For load: the path or URL of the JSON file to read the model parameters from.

The path may be relative or absolute, but it must point to a specific file, for example "./test_dump.json". A URL must start with "http://" or "https://".

Example:

# dump model
encoder.dump("./model.json")

# load model from path
encoder.load("./model.json")

# load model from url
encoder.load("https://example.com/model.json")
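
The path rules above can be checked before calling dump or load. This is a minimal sketch of such a check, not part of DashText: describe_model_source is a hypothetical helper, and the library may validate paths differently.

```python
import os


def describe_model_source(path: str) -> str:
    """Classify a model location as 'url' or 'file', rejecting directory paths."""
    if path.startswith(("http://", "https://")):
        return "url"
    # The path must name a file, so reject anything that is clearly a directory.
    if path.endswith(("/", os.sep)) or os.path.isdir(path):
        raise ValueError(f"expected a file path, got a directory: {path!r}")
    return "file"


local_kind = describe_model_source("./model.json")
remote_kind = describe_model_source("https://example.com/model.json")
```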

Default DashText Models

If you want to use the default BM25 model of SparseVectorEncoder, you can call the default method.

default(name: str = 'zh') -> "SparseVectorEncoder"
Parameters Type Required Description
name str No Name of the default model to load. Chinese and English models are currently supported: 'zh' for Chinese (the default) and 'en' for English.

Example:

# default method
from dashtext import SparseVectorEncoder

encoder = SparseVectorEncoder.default()

# with a default model, you can encode documents and queries directly
encoder.encode_documents("DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成")
encoder.encode_queries("什么是向量检索服务?")

Combining Sparse and Dense Encodings for Hybrid Search in DashVector

combine_sparse_and_dense(sparse_vector: Dict, dense_vector: List[float], alpha: float) -> Tuple[List[float], Dict]

Parameters Type Required Description
sparse_vector Dict Yes Sparse vector generated by the encode_documents or encode_queries method.
dense_vector List[float] Yes Dense vector represented as a list of floats.
alpha float Yes Weight between the sparse and dense vectors, a float between 0 and 1: 0 uses the sparse vector only, and 1 uses the dense vector only.

Example:

from dashtext import combine_sparse_and_dense

dense_vector = [0.02428389742874429,0.02036450577918233,0.00758973862139133,-0.060652585776971274,0.03321684423003758,-0.019009049500375488,0.015808212986566556,0.0037662904132509424,-0.0178332320055069]
sparse_vector = encoder.encode_documents("DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成")

# use a convex combination to generate the hybrid vectors
scaled_dense_vector, scaled_sparse_vector = combine_sparse_and_dense(sparse_vector, dense_vector, 0.8)

# result example
# scaled_dense_vector: [0.019427117942995432, 0.016291604623345866, 0.006071790897113065, -0.04852206862157702, 0.026573475384030067, -0.01520723960030039, 0.012646570389253245, 0.003013032330600754, -0.014266585604405522]
# scaled_sparse_vector: {59256732: 0.14681137485379836, 863271227: 0.14681137485379836, 904594806: 0.14681137485379836, 942054413: 0.14681137485379836, 1169440797: 0.16932705845151483, 1314384716: 0.14681137485379836, 1554119115: 0.14681137485379836, 1736403260: 0.14681137485379836, 2029341792: 0.14681137485379836, 2141666983: 0.14681137485379836, 2367386033: 0.14681137485379836, 2549501804: 0.14681137485379836, 3869223639: 0.14681137485379836, 4130523965: 0.14681137485379836, 4162843804: 0.14681137485379836, 4202556960: 0.14681137485379836}
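
The example values above are consistent with scaling the dense vector by alpha and the sparse vector by 1 - alpha. The sketch below reproduces that arithmetic in plain Python. It is inferred from the example output rather than from the library's source, and convex_combine is a hypothetical name.

```python
from typing import Dict, List, Tuple


def convex_combine(sparse_vector: Dict[int, float], dense_vector: List[float],
                   alpha: float) -> Tuple[List[float], Dict[int, float]]:
    """Scale the dense vector by alpha and the sparse vector by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [value * alpha for value in dense_vector]
    scaled_sparse = {idx: weight * (1.0 - alpha) for idx, weight in sparse_vector.items()}
    return scaled_dense, scaled_sparse


# First dense component and first sparse entry from the example above.
dense, sparse = convex_combine({59256732: 0.7340568742689919}, [0.02428389742874429], 0.8)
```

With alpha = 0.8, the dense component keeps 80% of its magnitude and the sparse weight keeps 20%, which agrees with the first entries of the result example.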

License

This project is licensed under the Apache License (Version 2.0).

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • dashtext-0.0.4-cp311-cp311-win_amd64.whl (6.7 MB): CPython 3.11, Windows x86-64
  • dashtext-0.0.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.2 MB): CPython 3.11, manylinux: glibc 2.17+ x86-64
  • dashtext-0.0.4-cp311-cp311-macosx_10_9_x86_64.whl (6.6 MB): CPython 3.11, macOS 10.9+ x86-64
  • dashtext-0.0.4-cp310-cp310-win_amd64.whl (6.7 MB): CPython 3.10, Windows x86-64
  • dashtext-0.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.9 MB): CPython 3.10, manylinux: glibc 2.17+ x86-64
  • dashtext-0.0.4-cp310-cp310-macosx_10_9_x86_64.whl (6.7 MB): CPython 3.10, macOS 10.9+ x86-64
  • dashtext-0.0.4-cp39-cp39-win_amd64.whl (6.7 MB): CPython 3.9, Windows x86-64
  • dashtext-0.0.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.2 MB): CPython 3.9, manylinux: glibc 2.17+ x86-64
  • dashtext-0.0.4-cp39-cp39-macosx_10_9_x86_64.whl (6.6 MB): CPython 3.9, macOS 10.9+ x86-64
  • dashtext-0.0.4-cp38-cp38-win_amd64.whl (6.7 MB): CPython 3.8, Windows x86-64
  • dashtext-0.0.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.9 MB): CPython 3.8, manylinux: glibc 2.17+ x86-64
  • dashtext-0.0.4-cp38-cp38-macosx_10_9_x86_64.whl (6.7 MB): CPython 3.8, macOS 10.9+ x86-64
  • dashtext-0.0.4-cp37-cp37m-win_amd64.whl (6.7 MB): CPython 3.7m, Windows x86-64
  • dashtext-0.0.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.9 MB): CPython 3.7m, manylinux: glibc 2.17+ x86-64
  • dashtext-0.0.4-cp37-cp37m-macosx_10_9_x86_64.whl (6.7 MB): CPython 3.7m, macOS 10.9+ x86-64
