DashText is a Text Modal Data Library

DashText Python Library

DashText is a Python package for DashVector's sparse-dense (hybrid) semantic search. It provides a set of text utilities and an integrated tool named SparseVectorEncoder.

Installation

To install the DashText Client, simply run:

pip install dashtext

QuickStart

SparseVector Encoding

With the default models in DashText, converting a text corpus to sparse vectors takes only a few lines.

from dashtext import SparseVectorEncoder

# Initialize an encoder instance and load a default DashText model
encoder = SparseVectorEncoder.default('zh')

# Encode a new document (for upsert to DashVector)
document = "向量检索服务DashVector基于达摩院自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务。"
print(encoder.encode_documents(document))
# {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}

# Encode a query (for search in DashVector)
query = "什么是向量检索服务?"
print(encoder.encode_queries(query))
# Prints a dict of {token hash: weight}, in the same format as the document encoding above; the keys are the hashes of the query's tokens.
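
Document and query encodings are meant to be compared with each other at search time. As a rough illustration, the sketch below assumes the score is the dot product over shared token IDs, which is a common convention for sparse retrieval but is not stated in this README. The helper `sparse_dot` and the sample vectors are hypothetical.

```python
from typing import Dict


def sparse_dot(query_vec: Dict[int, float], doc_vec: Dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {token_id: weight} dicts."""
    # Iterate over the smaller dict so the cost depends on the shorter vector.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[i] for i, w in query_vec.items() if i in doc_vec)


# Illustrative values only; real vectors would come from
# encode_queries / encode_documents.
doc_vec = {101: 0.5, 202: 0.25, 303: 0.75}
query_vec = {202: 2.0, 404: 1.0}
score = sparse_dot(query_vec, doc_vec)
print(score)
```

Only token IDs present in both vectors contribute, so a query and a document that share no tokens score zero under this assumption.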

SparseVector Parameters

The SparseVectorEncoder class is based on the BM25 algorithm, so it accepts the parameters that BM25 requires, along with parameters for text processing:

  • b: Document length normalization required by BM25 (default: 0.75).
  • k1: Term frequency saturation required by BM25 (default: 1.2).
  • tokenize_function: Tokenization function, such as SentencePiece or GPTTokenizer in Transformers; its output may be a string array or an integer array (default: Jieba).
  • hash_function: Hash function used when tokens need to be converted to numbers after tokenization (default: None).
  • hash_bucket_function: Bucketing function used when hash values need to be divided into a finite number of buckets (default: None).

Example:

from dashtext import SparseVectorEncoder
from dashtext import TextTokenizer

tokenizer = TextTokenizer().from_pretrained("Jieba", stop_words=True)

encoder = SparseVectorEncoder(b=0.75, k1=1.2, tokenize_function=tokenizer.tokenize)
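
To make the roles of b and k1 concrete, here is a minimal sketch of the term-frequency component of textbook BM25, without the IDF factor. It is not taken from DashText, whose exact formula may differ; the function name `bm25_term_weight` is hypothetical.

```python
def bm25_term_weight(tf: float, doc_len: float, avg_doc_len: float,
                     k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 term-frequency component (without the IDF factor)."""
    # b blends between no length normalization (b=0) and full normalization (b=1).
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of a term stop adding weight.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


average = bm25_term_weight(tf=1, doc_len=100, avg_doc_len=100)
long_doc = bm25_term_weight(tf=1, doc_len=200, avg_doc_len=100)
repeated = bm25_term_weight(tf=50, doc_len=100, avg_doc_len=100)
print(average, long_doc, repeated)
```

In this form, a term in a longer-than-average document gets a lower weight, and the weight of a heavily repeated term approaches but never reaches k1 + 1.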

Reference

Encode Documents

encode_documents(texts: Union[str, List[str], List[int], List[List[int]]]) -> Union[Dict, List[Dict]]

Parameters | Type | Required | Description
texts | str, List[str], List[int], or List[List[int]] | Yes | str: a single text; List[str]: multiple texts; List[int]: hash representation of a single text; List[List[int]]: hash representation of multiple texts

Example:
# single text
texts1 = "DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成"
result = encoder.encode_documents(texts1)

# multiple texts
texts2 = ["DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
          "从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力"]
result = encoder.encode_documents(texts2)

# hash representation of a single text
texts3 = [1218191817, 2673099881, 2982218203, 3422996809]
result = encoder.encode_documents(texts3)

# hash representation of multiple texts
texts4 = [[1218191817, 2673099881, 2982218203, 3422996809], [2673099881, 2982218203, 3422996809, 771291085, 741580288]]
result = encoder.encode_documents(texts4)

# result example (for the single text, texts1)
# {59256732: 0.7340568742689919, 863271227: 0.7340568742689919, 904594806: 0.7340568742689919, 942054413: 0.7340568742689919, 1169440797: 0.8466352922575744, 1314384716: 0.7340568742689919, 1554119115: 0.7340568742689919, 1736403260: 0.7340568742689919, 2029341792: 0.7340568742689919, 2141666983: 0.7340568742689919, 2367386033: 0.7340568742689919, 2549501804: 0.7340568742689919, 3869223639: 0.7340568742689919, 4130523965: 0.7340568742689919, 4162843804: 0.7340568742689919, 4202556960: 0.7340568742689919}
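
The integer inputs above are token hashes. The README does not say which hash function produced them, so the following is only a sketch of how such a list could be built, assuming whitespace tokenization and `zlib.crc32` as a stand-in 32-bit hash. The names `tokens_to_hashes` and `num_buckets` are hypothetical, and hashes built this way are not expected to match the IDs used by the default models.

```python
import zlib
from typing import List, Optional


def tokens_to_hashes(tokens: List[str], num_buckets: Optional[int] = None) -> List[int]:
    """Map each token to an unsigned 32-bit integer, optionally folded into buckets."""
    hashes = [zlib.crc32(token.encode("utf-8")) for token in tokens]
    if num_buckets is not None:
        # Bucketing bounds the ID space at the cost of occasional collisions.
        hashes = [h % num_buckets for h in hashes]
    return hashes


tokens = "vector search service".split()
hashed = tokens_to_hashes(tokens)
bucketed = tokens_to_hashes(tokens, num_buckets=1000)
print(hashed, bucketed)
```

The two steps loosely mirror the hash_function and hash_bucket_function roles: first turn tokens into numbers, then optionally divide those numbers into a finite set of buckets.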

Encode Queries

encode_queries(texts: Union[str, List[str], List[int], List[List[int]]]) -> Union[Dict, List[Dict]]

The input format is the same as for the encode_documents method.

Example:

# single text
texts = "什么是向量检索服务?"
result = encoder.encode_queries(texts)

Train / Dump / Load DashText Model

Train

train(corpus: Union[str, List[str], List[int], List[List[int]]]) -> None

Parameters | Type | Required | Description
corpus | str, List[str], List[int], or List[List[int]] | Yes | str: a single text; List[str]: multiple texts; List[int]: hash representation of a single text; List[List[int]]: hash representation of multiple texts

Example:

corpus = [
    "向量检索服务DashVector基于达摩院自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务",
    "DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
    "从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力",
    "简单灵活、开箱即用的SDK,使用极简代码即可实现向量管理",
    "自研向量相似性比对算法,快速高效稳定服务",
    "Schema-free设计,通过Schema实现任意条件下的组合过滤查询"
]

encoder.train(corpus)

# use the dump method to inspect the trained parameters
encoder.dump("./dump_paras.json")
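
For intuition about what training on a corpus involves, the sketch below gathers the corpus statistics that BM25 generally relies on: document count, average document length, and per-token document frequency. This is an assumption about BM25 in general; the README does not specify what DashText stores in the dumped JSON. The function `collect_bm25_stats` and the pre-tokenized sample corpus are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def collect_bm25_stats(tokenized_corpus: List[List[str]]) -> Tuple[int, float, Dict[str, int]]:
    """Return (document count, average document length, document frequency per token)."""
    if not tokenized_corpus:
        raise ValueError("corpus must contain at least one document")
    doc_count = len(tokenized_corpus)
    avg_doc_len = sum(len(doc) for doc in tokenized_corpus) / doc_count
    doc_freq: Counter = Counter()
    for doc in tokenized_corpus:
        # Count each token once per document, regardless of repetitions.
        doc_freq.update(set(doc))
    return doc_count, avg_doc_len, dict(doc_freq)


# Illustrative, already tokenized documents.
corpus_tokens = [
    ["vector", "search", "service"],
    ["vector", "management", "sdk"],
    ["fast", "vector", "search"],
]
doc_count, avg_doc_len, doc_freq = collect_bm25_stats(corpus_tokens)
print(doc_count, avg_doc_len, doc_freq["vector"], doc_freq["search"])
```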

Dump and Load

dump(path: str) -> None
load(path: str) -> None

Parameters | Type | Required | Description
path | str | Yes | For dump: the file path to which the model parameters are written as JSON. For load: the file path or URL of a JSON file from which the model parameters are read.

The path can be either relative or absolute, but it must point to a specific file, for example "./test_dump.json". A URL must start with "http://" or "https://".

Example:

# dump model
encoder.dump("./model.json")

# load model from path
encoder.load("./model.json")

# load model from url
encoder.load("https://example.com/model.json")
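
Since load accepts either a file path or a URL, a caller may want to tell the two apart before loading, for example to decide whether network access is needed. The helper below is a hypothetical sketch based only on the "http://" and "https://" rule stated above; it is not part of DashText.

```python
from urllib.parse import urlparse


def classify_model_source(path: str) -> str:
    """Return 'url' for http(s) locations and 'file' for everything else."""
    scheme = urlparse(path).scheme.lower()
    return "url" if scheme in ("http", "https") else "file"


print(classify_model_source("./model.json"))
print(classify_model_source("https://example.com/model.json"))
```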

Default DashText Models

If you want to use the default BM25 model of SparseVectorEncoder, you can call the default method.

default(name: str = 'zh') -> "SparseVectorEncoder"

Parameters | Type | Required | Description
name | str | No | Name of the default model. Chinese ('zh', the default) and English ('en') models are currently supported.

Example:

# default method
from dashtext import SparseVectorEncoder

encoder = SparseVectorEncoder.default()

# using default model, you can directly encode documents and queries
encoder.encode_documents("DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成")
encoder.encode_queries("什么是向量检索服务?")

Combining Sparse and Dense Encodings for Hybrid Search in DashVector

combine_sparse_and_dense(sparse_vector: Dict, dense_vector: List[float], alpha: float) -> Tuple[List[float], Dict]

Parameters | Type | Required | Description
sparse_vector | Dict | Yes | Sparse vector generated by the encode_documents or encode_queries method.
dense_vector | List[float] | Yes | Dense vector represented as a list of floats.
alpha | float | Yes | Weight between 0 and 1 that balances the sparse and dense vectors: 0 means sparse vectors only, 1 means dense vectors only.

Example:

from dashtext import combine_sparse_and_dense

dense_vector = [0.02428389742874429,0.02036450577918233,0.00758973862139133,-0.060652585776971274,0.03321684423003758,-0.019009049500375488,0.015808212986566556,0.0037662904132509424,-0.0178332320055069]
sparse_vector = encoder.encode_documents("DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成")

# using convex combination to generate hybrid vector
scaled_dense_vector, scaled_sparse_vector = combine_sparse_and_dense(sparse_vector, dense_vector, 0.8)

# result example
# scaled_dense_vector: [0.019427117942995432, 0.016291604623345866, 0.006071790897113065, -0.04852206862157702, 0.026573475384030067, -0.01520723960030039, 0.012646570389253245, 0.003013032330600754, -0.014266585604405522]
# scaled_sparse_vector: {59256732: 0.14681137485379836, 863271227: 0.14681137485379836, 904594806: 0.14681137485379836, 942054413: 0.14681137485379836, 1169440797: 0.16932705845151483, 1314384716: 0.14681137485379836, 1554119115: 0.14681137485379836, 1736403260: 0.14681137485379836, 2029341792: 0.14681137485379836, 2141666983: 0.14681137485379836, 2367386033: 0.14681137485379836, 2549501804: 0.14681137485379836, 3869223639: 0.14681137485379836, 4130523965: 0.14681137485379836, 4162843804: 0.14681137485379836, 4202556960: 0.14681137485379836}
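
The result example above is consistent with a plain convex combination: every dense component is multiplied by alpha and every sparse weight by 1 - alpha. The sketch below reproduces that arithmetic in pure Python using a few values from the example. The function `scale_sparse_and_dense` is a hypothetical stand-in, not the library's implementation.

```python
from typing import Dict, List, Tuple


def scale_sparse_and_dense(sparse_vector: Dict[int, float], dense_vector: List[float],
                           alpha: float) -> Tuple[List[float], Dict[int, float]]:
    """Scale the dense vector by alpha and the sparse vector by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [value * alpha for value in dense_vector]
    scaled_sparse = {key: value * (1.0 - alpha) for key, value in sparse_vector.items()}
    return scaled_dense, scaled_sparse


# A few components copied from the example above.
dense = [0.02428389742874429, -0.060652585776971274]
sparse = {59256732: 0.7340568742689919, 1169440797: 0.8466352922575744}
scaled_dense, scaled_sparse = scale_sparse_and_dense(sparse, dense, 0.8)
print(scaled_dense, scaled_sparse)
```

With alpha = 0.8 the dense side keeps 80% of its magnitude and the sparse side 20%, so dense similarity dominates the hybrid score.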

License

This project is licensed under the Apache License (Version 2.0).

Source Distributions

No source distribution files are available for this release.

Built Distributions

File | Size | Python | Platform
dashtext-0.0.2-cp311-cp311-win_amd64.whl | 6.7 MB | CPython 3.11 | Windows x86-64
dashtext-0.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 7.2 MB | CPython 3.11 | manylinux: glibc 2.17+ x86-64
dashtext-0.0.2-cp311-cp311-macosx_11_0_arm64.whl | 6.6 MB | CPython 3.11 | macOS 11.0+ ARM64
dashtext-0.0.2-cp311-cp311-macosx_10_9_x86_64.whl | 6.6 MB | CPython 3.11 | macOS 10.9+ x86-64
dashtext-0.0.2-cp310-cp310-win_amd64.whl | 6.7 MB | CPython 3.10 | Windows x86-64
dashtext-0.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 9.9 MB | CPython 3.10 | manylinux: glibc 2.17+ x86-64
dashtext-0.0.2-cp310-cp310-macosx_10_9_x86_64.whl | 6.7 MB | CPython 3.10 | macOS 10.9+ x86-64
dashtext-0.0.2-cp39-cp39-win_amd64.whl | 6.7 MB | CPython 3.9 | Windows x86-64
dashtext-0.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 7.2 MB | CPython 3.9 | manylinux: glibc 2.17+ x86-64
dashtext-0.0.2-cp39-cp39-macosx_11_0_arm64.whl | 6.6 MB | CPython 3.9 | macOS 11.0+ ARM64
dashtext-0.0.2-cp39-cp39-macosx_10_9_x86_64.whl | 6.6 MB | CPython 3.9 | macOS 10.9+ x86-64
dashtext-0.0.2-cp38-cp38-win_amd64.whl | 6.7 MB | CPython 3.8 | Windows x86-64
dashtext-0.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 9.9 MB | CPython 3.8 | manylinux: glibc 2.17+ x86-64
dashtext-0.0.2-cp38-cp38-macosx_10_9_x86_64.whl | 6.7 MB | CPython 3.8 | macOS 10.9+ x86-64
dashtext-0.0.2-cp37-cp37m-win_amd64.whl | 6.7 MB | CPython 3.7m | Windows x86-64
dashtext-0.0.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 9.9 MB | CPython 3.7m | manylinux: glibc 2.17+ x86-64
dashtext-0.0.2-cp37-cp37m-macosx_10_9_x86_64.whl | 6.7 MB | CPython 3.7m | macOS 10.9+ x86-64
