
LangSearch: Easily create semantic-search-based LLM applications for your own data

Spiders

WebSpider

Usage

from langsearch.spiders.webspider import WebSpider


class MyWebSpider(WebSpider):
    name = "my_web_spider"

Settings for WebSpider

  1. LANGSEARCH_WEB_SPIDER_START_URLS: list of seed URLs
  2. LANGSEARCH_WEB_SPIDER_LINK_EXTRACTOR_ALLOW: list of regex patterns that absolute URLs must match to be extracted and followed
  3. LANGSEARCH_WEB_SPIDER_LINK_EXTRACTOR_DENY: list of regex patterns. Links matching any of these will not be extracted or followed. Takes precedence over LANGSEARCH_WEB_SPIDER_LINK_EXTRACTOR_ALLOW

If you already have a list of target URLs, use the following settings in settings.py.

# You can also write code in settings.py e.g. to load START_URLS from a file
LANGSEARCH_WEB_SPIDER_START_URLS = ["<first_link>", "<second_link>", ... ]
LANGSEARCH_WEB_SPIDER_LINK_EXTRACTOR_ALLOW = []
# We use an all-matching regex to ensure that only links in START_URLS are downloaded and no further links are followed
LANGSEARCH_WEB_SPIDER_LINK_EXTRACTOR_DENY = [".*"]

Here is an example for the second use case, where you don't have a list of target URLs but want to auto-discover links by crawling from some seed URLs.

LANGSEARCH_WEB_SPIDER_START_URLS = ["https://docs.python.org"]
# Follow links to docs.python.org only. Links to wiki.python.org, for example, will not be extracted or followed.
LANGSEARCH_WEB_SPIDER_LINK_EXTRACTOR_ALLOW = [r"docs\.python\.org"]
# Deny old documentation versions and non-English pages
LANGSEARCH_WEB_SPIDER_LINK_EXTRACTOR_DENY = [
    r"docs\.python\.org(?!/3/)",
    r"/(es|fr|ja|ko|pt-br|tr|zh-cn|zh-tw)/",
]
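To sanity-check allow/deny patterns before a crawl, you can apply them with Python's re module. A minimal sketch, assuming the link extractor uses re.search semantics and that DENY takes precedence over ALLOW (the example URLs are illustrative):

import re

ALLOW = [r"docs\.python\.org"]
DENY = [r"docs\.python\.org(?!/3/)", r"/(es|fr|ja|ko|pt-br|tr|zh-cn|zh-tw)/"]

def is_followed(url):
    # DENY takes precedence over ALLOW
    if any(re.search(pattern, url) for pattern in DENY):
        return False
    return any(re.search(pattern, url) for pattern in ALLOW)

print(is_followed("https://docs.python.org/3/library/re.html"))  # True
print(is_followed("https://docs.python.org/2/library/re.html"))  # False: old version
print(is_followed("https://docs.python.org/es/3/"))              # False: non-English
print(is_followed("https://wiki.python.org/moin/"))              # False: not allowed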

FileSpider

Usage

from langsearch.spiders.filespider import FileSpider


class MyFileSpider(FileSpider):
    name = "my_file_spider"

Settings for FileSpider

  1. LANGSEARCH_FILE_SPIDER_START_FOLDERS: list of folders to start from
  2. LANGSEARCH_FILE_SPIDER_FOLLOW_SUBFOLDERS: boolean that indicates whether documents in subfolders will be fetched
  3. LANGSEARCH_FILE_SPIDER_FOLLOW_SYMLINKS: boolean that indicates whether symbolic links will be followed
  4. LANGSEARCH_FILE_SPIDER_ALLOW: list of regex patterns that the absolute file path (including extension) must match for the file to be fetched
  5. LANGSEARCH_FILE_SPIDER_DENY: files whose absolute path (including extension) matches any regex in this list are not fetched. Takes precedence over LANGSEARCH_FILE_SPIDER_ALLOW. See the sketch after this list.
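
Putting these together, a minimal settings.py sketch for indexing a local folder might look like this (the paths and patterns are illustrative):

# settings.py
LANGSEARCH_FILE_SPIDER_START_FOLDERS = ["/home/user/documents"]
LANGSEARCH_FILE_SPIDER_FOLLOW_SUBFOLDERS = True
LANGSEARCH_FILE_SPIDER_FOLLOW_SYMLINKS = False
# Fetch only PDF and Markdown files
LANGSEARCH_FILE_SPIDER_ALLOW = [r"\.pdf$", r"\.md$"]
# DENY takes precedence: skip anything inside hidden folders
LANGSEARCH_FILE_SPIDER_DENY = [r"/\."]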

Middlewares

Spider Middlewares

RegexFilterMiddleware

Usage

Include the following in your settings.py

SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,   # Disable scrapy's OffsiteMiddleware
    'langsearch.middlewares.spider_middlewares.RegexFilterMiddleware': 500,    # Use langsearch's RegexFilterMiddleware instead
}
Settings
  1. LANGSEARCH_REGEX_FILTER_MIDDLEWARE_ALLOW: list of regex patterns that request URLs must match to be processed (analogous to the spider ALLOW settings above)
  2. LANGSEARCH_REGEX_FILTER_MIDDLEWARE_DENY: requests whose URLs match any regex in this list are dropped. Takes precedence over LANGSEARCH_REGEX_FILTER_MIDDLEWARE_ALLOW. See the sketch after this list.
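
A sketch of these two settings in settings.py, assuming they carry the same allow/deny semantics as the spider link-extractor settings above (the patterns are illustrative):

LANGSEARCH_REGEX_FILTER_MIDDLEWARE_ALLOW = [r"docs\.python\.org"]
# DENY takes precedence over ALLOW
LANGSEARCH_REGEX_FILTER_MIDDLEWARE_DENY = [r"/whatsnew/"]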

Pipelines

Include the following in your settings.py

# The import path for assemble and the pipeline classes may differ; adjust as needed
from langsearch.pipelines import assemble, TextPipeline, AudioVideoPipeline, OtherPipeline

ITEM_PIPELINES = {
    # The item that you send down the pipeline must have the fields "body", "text" and "url".
    # This pipeline detects the item type and sends it down the processors that handle that type;
    # it must be the first one in the list.
    "langsearch.pipelines.DetectItemTypePipeline": 100,
    # assemble() builds a pipeline, placing all components in the correct order.
    # Attributes in the item are transparently passed through.
    **assemble(pipelines=[TextPipeline,        # Add the pipelines for the types you want extracted
                          AudioVideoPipeline,  # This one extracts text from audio and video
                          OtherPipeline,
                          ],
               ),
}
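
For reference, a rough sketch of the item shape that DetectItemTypePipeline expects, based on the field names in the comments above (the concrete item class and field types are assumptions):

item = {
    "url": "https://example.com/page",      # where the content came from
    "body": b"<html>...</html>",            # raw downloaded bytes (assumed)
    "text": "plain text extracted so far",  # textual content (assumed)
}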

Available settings

LANGSEARCH_TRAFILATURA_PIPELINE_EXTRACT_ARGUMENTS = {...}

LANGSEARCH_WHISPER_PIPELINE_ALLOWED_LANGUAGES = [...]
LANGSEARCH_WHISPER_PIPELINE_MODEL = ...

LANGSEARCH_TEXT_LANGUAGE_FILTER_PIPELINE_ALLOWED_LANGUAGES = [...]

LANGSEARCH_STORE_ITEM_PIPELINE_EMBEDDING_MODEL = langsearch.EmbeddingModel.GPT3  # None disables local embedding; Weaviate's automatic embedding is used instead
LANGSEARCH_STORE_ITEM_PIPELINE_WEAVIATE_BASE_URL = "http://localhost:..."  # If not specified, the corresponding env variable is used
LANGSEARCH_STORE_ITEM_PIPELINE_DATABASE_URL = "http://localhost:..."  # If not specified, the corresponding env variable is used
LANGSEARCH_STORE_ITEM_PIPELINE_WEAVIATE_CLASS = ...  # If not specified, the BOT_NAME setting is used
LANGSEARCH_STORE_ITEM_PIPELINE_DUPLICATE_CUTOFF = ...  # Default: 95
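
As a concrete sketch, a typical store configuration might restrict items to English and point at a local Weaviate instance (8080 is Weaviate's usual default port; all values here are illustrative):

LANGSEARCH_TEXT_LANGUAGE_FILTER_PIPELINE_ALLOWED_LANGUAGES = ["en"]
LANGSEARCH_STORE_ITEM_PIPELINE_WEAVIATE_BASE_URL = "http://localhost:8080"
LANGSEARCH_STORE_ITEM_PIPELINE_DUPLICATE_CUTOFF = 95  # the documented default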

DryRunPipeline

We additionally make a DryRunPipeline available that simply dumps the URLs to a file. This is useful for checking that your allow/deny rules work as expected.
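
A minimal sketch of enabling it in settings.py (the import path of DryRunPipeline is an assumption; only DetectItemTypePipeline's path is documented above):

ITEM_PIPELINES = {
    "langsearch.pipelines.DetectItemTypePipeline": 100,
    # Assumed path: dumps the URLs of crawled items to a file instead of storing them
    "langsearch.pipelines.DryRunPipeline": 200,
}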

Philosophy for pipeline classes

  1. Some pipelines have generic components that could have many implementations. For example, we use LLMs to generate answers, and there are many LLMs to choose from. If a standard interface already exists that abstracts over the various implementations, and that interface provides everything we need, we use it to write flexible pipeline classes. The exact implementation is then specified via settings or environment variables, as in the sketch below.
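
The pattern looks roughly like this (an illustrative sketch, not LangSearch's actual code): pipeline classes depend only on a generic interface, and the concrete implementation is resolved from a setting or environment variable.

import os

class LLM:
    """Generic interface shared by all LLM implementations."""
    def complete(self, prompt: str) -> str:
        raise NotImplementedError

class EchoLLM(LLM):
    """Trivial stand-in implementation, for illustration only."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def get_llm() -> LLM:
    # Pipeline code never imports a concrete class directly;
    # the implementation is chosen by a setting or env variable.
    implementations = {"echo": EchoLLM}
    return implementations[os.environ.get("LLM_IMPLEMENTATION", "echo")]()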
