Skip to main content

A semantic search CLI tool

Project description

(Work-in-progress documentation; the tool is not yet ready)

# Semantra

Semantra is a multipurpose tool for semantically searching documents. Query by meaning rather than just by matching text.

The tool, made to run on the command line, analyzes specified text and PDF files on your computer and launches a local web search application for interactively querying them. The purpose of Semantra is to make running a specialized semantic search engine easy, friendly, configurable, and private/secure.

Semantra is built for individuals seeking needles in haystacks — journalists sifting through leaked documents on deadline, researchers seeking insights within papers, students engaging with literature by querying themes, historians connecting events across books, and so forth.

## Questions

### Can it use ChatGPT?

No, and this is by design.

Semantra does not use any generative models like ChatGPT. It is built only to query text semantically without any layers on top to attempt explaining, summarizing, or synthesizing results. Generative language models occasionally produce outwardly plausible but ultimately incorrect information, placing the burden of verification on the user. Semantra treats primary source material as the only source of truth and endeavors to show that a human-in-the-loop search experience on top of simpler embedding models is more serviceable to users.

### What are embeddings

Assume a mystical machine exists that converts text you give it to a number. “Dog” → 352. “Cat” → 361. “Toasters are cool” → 6,723,412. The closer the meaning of the text you give it, the closer the output number will be. Here, “Dog” and “Cat” are really close at only 9 apart, while the text “Toasters are cool” is off by millions.

Well, these mystical machines are essentially machine learning models. These models are fed vast amounts of text and “learn” to output numbers that correspond to meaning by setting up internal networks of weights and adjusting them slightly over many millions of iterations based on if text fits expected patterns or not. In a rudimentary example, “dog” and “cat” may be relatively interchangeable in sentences talking about pets and so the model infers they are similar.

The mystical machine above only outputs one number, but in practice more are needed to meaningfully encode text semantically. Two numbers enables placing text in two-dimensional space. Three numbers gives you a three-dimensional representation, which provides more directions along which placed text can be close together (x ↔ y, y ↔ z, z ↔ x).

These output numbers are called “embeddings” and are treated as vectors which can be pictured as arrows pointing in a direction. To convert text to numbers is to “embed” the text into the “vector space.” Multiple passages of text are semantically similar according to embedding models if their embedding vectors point in similar directions. Though it’s hard to picture, popular embedding models typically output embeddings that are hundreds or even thousands of dimensions!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

semantra-0.0.2.tar.gz (10.6 kB view details)

Uploaded Source

Built Distribution

semantra-0.0.2-py3-none-any.whl (10.5 kB view details)

Uploaded Python 3

File details

Details for the file semantra-0.0.2.tar.gz.

File metadata

  • Download URL: semantra-0.0.2.tar.gz
  • Upload date:
  • Size: 10.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.16

File hashes

Hashes for semantra-0.0.2.tar.gz
Algorithm Hash digest
SHA256 990fe1cb64905fac04f8e18f33e620ac0f1f718f5b792b92a275b78697a5c231
MD5 dbc7315ffd7b38a3e73a78761fc2ca48
BLAKE2b-256 103f1356e80215ac29dc9b00b44adce3a0b7f1c50c625b27edf82969ef132408

See more details on using hashes here.

File details

Details for the file semantra-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: semantra-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 10.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.16

File hashes

Hashes for semantra-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 bc8f483daa4300abd4d4a6bc9cf33eea4bba472693a0a26fcd95ce74db987fd5
MD5 a35a3d46245290befec1d459de320688
BLAKE2b-256 9ecc6d1d2a033614150fa64e9e49e6556a32d054839c7511bb330a7129192443

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page