A set of data tools in Python

Project description

demeterchain

Retrieve articles with a retriever
and extract answers from the retrieved articles.

Installation

Install via PyPI

pip install demeterchain

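To avoid downloading many unneeded packages, demeterchain does not pull in its optional dependencies.
Install the additional packages you need for your own use case.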

Workflow overview

Building the retriever

graph TD;
Load["Load local documents"]-->Split["Split documents"];
Split-->Build["Build the retriever"];

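Retrieval workflow

graph TD;
Question["User enters a question"]-->Hyde["(Optional) Expand the question with HyDE"];
Hyde-->Retrieve["Retrieve relevant articles"];
Retrieve-->Extract["Look for an answer in each retrieved article"];
Extract-->Summary["(Optional) Summarize and consolidate all answers"];
Summary-->Return["Return the result"];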
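
The retrieval workflow above can be sketched as a small function that chains the steps together. This is only an illustration of the control flow, not demeterchain's API: `answer_question`, `toy_retrieve`, and `toy_extract` are hypothetical names, and passing the original (unexpanded) question to the extraction step is an assumption.

```python
def answer_question(question, retrieve, extract_answer,
                    expand=None, summarize=None, k=3):
    """Hypothetical pipeline mirroring the retrieval workflow graph."""
    # (Optional) expand the question, e.g. with HyDE, before retrieval.
    query = expand(question) if expand else question
    # Retrieve the k most relevant articles.
    articles = retrieve(query, k)
    # Look for an answer in each article; None means "no answer found".
    answers = [extract_answer(question, article) for article in articles]
    answers = [a for a in answers if a is not None]
    # (Optional) summarize and consolidate all answers.
    return summarize(question, answers) if summarize else answers


# Toy stand-ins for a real retriever and QA model (assumptions for the demo).
ARTICLES = ["Taipei is the capital of Taiwan.", "BM25 is a ranking function."]

def toy_retrieve(query, k):
    # Rank articles by the number of words they share with the query.
    words = set(query.lower().split())
    def overlap(article):
        return len(words & set(article.lower().rstrip(".").split()))
    return sorted(ARTICLES, key=overlap, reverse=True)[:k]

def toy_extract(question, article):
    # Pretend the model only finds an answer in articles mentioning "capital".
    return article if "capital" in article else None

result = answer_question("What is the capital of Taiwan?",
                         toy_retrieve, toy_extract, k=2)
```

Both optional steps are plain callables here, so leaving `expand` or `summarize` unset simply skips that node of the graph.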

Usage

Use examples/demo.ipynb for a quick test.
Use examples/complete_demo.ipynb for a complete test.

Features

This section does not cover every feature;
it describes only a selection of them.

TextSplitter

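Splits documents into chunks.

  • separator : when set, the document is split only where the separator occurs; when not set, the document is split by length
  • chunk_size : the target chunk length; when a separator is used, some chunks may be longer or shorter than chunk_size
  • chunk_overlap : the length of the overlap between adjacent chunks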
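
To make the interaction between the three parameters concrete, here is a minimal sketch of one way such a splitter could behave. It is not demeterchain's TextSplitter implementation: `split_text` is a hypothetical function, and as a simplifying assumption it applies chunk_overlap only in length-based mode.

```python
def split_text(text, chunk_size=100, chunk_overlap=0, separator=None):
    """Illustrative splitter (hypothetical, not the library's code)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    if separator is None:
        # Length-based mode: each chunk starts (chunk_size - chunk_overlap)
        # characters after the previous one, so neighbours share an overlap.
        step = chunk_size - chunk_overlap
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            if start + chunk_size >= len(text):
                break  # the last chunk already reaches the end of the text
            start += step
        return chunks

    # Separator mode: cut only at separators, greedily merging pieces
    # while the merged chunk still fits within chunk_size.
    chunks, current = [], ""
    for piece in (p for p in text.split(separator) if p):
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = piece  # may exceed chunk_size if the piece is long
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


by_length = split_text("abcdefghij", chunk_size=4, chunk_overlap=2)
by_separator = split_text("aa.bbbbbbbb.c.d", chunk_size=4, separator=".")
```

In the separator example the piece "bbbbbbbb" cannot be cut, so it becomes a chunk longer than chunk_size, while "aa" ends up shorter. This is the behaviour the chunk_size note describes.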

PyseriniBM25Retriever

A retriever built on Pyserini's BM25.
It requires JDK 11:

sudo apt-get update
sudo apt-get install openjdk-11-jdk

It also requires Pyserini and faiss-cpu:

pip install pyserini==0.22.1 faiss-cpu==1.7.4

RankBM25Retriever

A retriever built on BM25Okapi from rank_bm25.
It requires rank_bm25:

pip install rank_bm25
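
For readers unfamiliar with BM25, the following pure-Python sketch shows the Okapi BM25 scoring idea that such a retriever relies on. It does not call rank_bm25 or demeterchain: `bm25_scores` and `retrieve_top_k` are hypothetical helpers, the whitespace-tokenized documents are toy data, and the IDF uses the non-negative `log(1 + (N - n + 0.5) / (n + 0.5))` variant, so rank_bm25's own IDF handling may differ.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25 (sketch)."""
    n_docs = len(tokenized_docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_top_k(query_tokens, tokenized_docs, k=1):
    """Return the indices of the k highest-scoring documents."""
    scores = bm25_scores(query_tokens, tokenized_docs)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


docs = [
    ["the", "cat", "sat"],
    ["dogs", "bark", "loudly"],
    ["the", "cat", "chased", "the", "dog"],
]
top = retrieve_top_k(["dogs", "bark"], docs, k=1)
```

Only the second document contains the query terms, so it is the one returned. Chinese text would need a word segmenter before this step, since BM25 operates on tokens.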

QAModelConfig

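Parameters used when loading the model.

  • model_name_or_path : str, a local path or the path of a model on Hugging Face; "NchuNLP/taide-qa" is recommended
  • template : see examples/demo.ipynb
  • device_map : str, the device on which the model is placed
  • dtype : str, the data type used to load the model; one of float32, float16, bfloat16
  • quantize : str, quantizes the model; two options are available
    • bitsandbytes : equivalent to load_in_8bit
    • bitsandbytes-nf4 : equivalent to load_in_4bit with nf4
  • use_flash_attention : bool, whether to enable flash_attention_2 (see the flash-attention installation instructions)
  • noanswer_str : see examples/demo.ipynb
  • noanswer_ids : see examples/demo.ipynb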
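
The two quantize options can be read as shorthand for bitsandbytes loading flags. The sketch below spells out that mapping using the flag names from Hugging Face transformers' BitsAndBytesConfig (`load_in_8bit`, `load_in_4bit`, `bnb_4bit_quant_type`). The helper functions themselves are hypothetical, and how demeterchain applies these flags internally is not shown here.

```python
SUPPORTED_DTYPES = ("float32", "float16", "bfloat16")


def quantization_flags(quantize=None):
    """Translate a `quantize` option into bitsandbytes-style flags (sketch)."""
    if quantize is None:
        return {}  # no quantization requested
    if quantize == "bitsandbytes":
        return {"load_in_8bit": True}
    if quantize == "bitsandbytes-nf4":
        return {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}
    raise ValueError(f"unsupported quantize option: {quantize!r}")


def check_dtype(dtype):
    """Reject dtype strings other than the three documented ones."""
    if dtype not in SUPPORTED_DTYPES:
        raise ValueError(f"dtype must be one of {SUPPORTED_DTYPES}, got {dtype!r}")
    return dtype


flags_8bit = quantization_flags("bitsandbytes")
flags_nf4 = quantization_flags("bitsandbytes-nf4")
```

In a transformers-based loader these flags would typically be passed to `BitsAndBytesConfig(**flags)`. That step is omitted so the example runs without a GPU or extra packages.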

QAConfig

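Parameters used when retrieving articles and answering questions.

  • retrieve_k : int, the number of articles the retriever returns
  • batch_size : int, the number of articles the model processes at once; adjust it to your GPU's VRAM
  • max_length : int, the maximum input length the model accepts; adjust it to your GPU's VRAM (default: 768)
  • max_new_tokens : int, the maximum length of the model's prediction
  • num_beams : int, the number of candidate answers kept during generation, used to improve decoding accuracy
  • answer_strategy : how the answer for a single article is chosen
    • best : the model's best predicted result
    • longest : the longest of the answers generated by the model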
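
The difference between the two answer_strategy values is easiest to see on a concrete set of candidates. The sketch below assumes each article yields several (answer, score) pairs, for example from beam search with num_beams > 1. `pick_answer` and the sample scores are hypothetical and only illustrate the selection rule.

```python
def pick_answer(candidates, strategy="best"):
    """Choose one answer from (text, score) candidates (illustrative)."""
    if not candidates:
        return None
    if strategy == "best":
        # Highest model score wins.
        return max(candidates, key=lambda c: c[1])[0]
    if strategy == "longest":
        # Longest answer text wins, regardless of score.
        return max(candidates, key=lambda c: len(c[0]))[0]
    raise ValueError(f"unknown answer_strategy: {strategy!r}")


# Made-up candidates with log-probability-like scores (higher is better).
candidates = [
    ("Taipei", -0.2),
    ("Taipei City, Taiwan", -0.9),
    ("Taiwan", -1.4),
]
best = pick_answer(candidates, "best")
longest = pick_answer(candidates, "longest")
```

With these candidates, "best" favors the short, high-confidence answer, while "longest" favors the more detailed one even though the model scored it lower.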

Download files

Source Distribution

demeterchain-1.0.2.tar.gz (22.3 kB)

Uploaded Source

Built Distribution

demeterchain-1.0.2-py3-none-any.whl (28.3 kB)

Uploaded Python 3

File details

Details for the file demeterchain-1.0.2.tar.gz.

File metadata

  • Download URL: demeterchain-1.0.2.tar.gz
  • Size: 22.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.12

File hashes

Hashes for demeterchain-1.0.2.tar.gz
Algorithm Hash digest
SHA256 63b937b871f4064940a95e3759edc275bda9f0f5051d0884f5b672115ccfdd80
MD5 701f2271fdf8cf0cb5a844c69293e5ea
BLAKE2b-256 78ab89f942c05b590e5317efed120c43158079bf05b07618903701601708ed6f

File details

Details for the file demeterchain-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: demeterchain-1.0.2-py3-none-any.whl
  • Size: 28.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.12

File hashes

Hashes for demeterchain-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 4cd0829130cd68147b5d373f6d5310ba3b3c010063eb7a7bfc207e16adcaaa06
MD5 ce594a4567adbed9c02560db669e2a79
BLAKE2b-256 b12704dffd3e70ed712b080f8afcbc775f1714648cf8904b3f218d8d32f1ab58
