
Tourist Framework


Tourist🤳


An open-source, low-cost, serverless application for SERP extraction and web scraping.

Work on LLM projects without worrying about credits, subscriptions, or rate-limits. Tourist is a free alternative to many mainstream SERP API services. Run Tourist on your machine or deploy it into your own AWS account.

[!IMPORTANT]
Tourist is in early development. Features and APIs may change unexpectedly.

Overview

[Architecture diagram: tourist-architecture]

Tourist has both Service and Client components. The Service (HTTP API) handles requests from the Client (your app, agent, or scraper scripts). You're in control of both components! None of your data is ever processed or stored by third parties.

Service

Local deployment (for testing...)

[!TIP]
Docker is recommended for running Tourist locally to handle dependencies for headless browsing.

  1. Have Docker installed
  2. docker pull ghcr.io/pogzyb/tourist:latest
  3. docker run -p 8000:8000 ghcr.io/pogzyb/tourist:latest

If the service came up correctly, you should see:

INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Check the docs at http://localhost:8000/docs
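
As a quick sanity check (a minimal sketch, not part of the Tourist client), you can also confirm the API is reachable from Python:

# Hedged sketch: confirm the local service responds.
# Assumes the `requests` package is installed in your environment.
import requests

resp = requests.get("http://localhost:8000/docs")
print(resp.status_code)  # expect 200 once the service has started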

AWS deployment (for real...)

Deploy your own instance of Tourist into AWS with Terraform:

  1. Have Docker installed
  2. Clone this repo
  3. Have an AWS account with credentials copied to .env.aws in the root of this project
  4. make tourist-iac-interactive
  5. terraform apply - deploys the infrastructure into your AWS account

Use your endpoint: https://<uuid>.execute-api.us-east-1.amazonaws.com/main (available in terraform outputs)

[!WARNING]
Tourist uses serverless infrastructure to keep costs extremely low; however, costs will not be $0.00 and will scale with how heavily you use your API.

[!IMPORTANT]
Tourist uses the X-SECRET authorization header to protect your API; set this value in terraform/aws/terraform.tfvars.
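
For example (a hedged sketch; substitute your own endpoint and secret), the Python client can be pointed at the deployed API with the same secret:

# Hedged sketch: connect the client to your deployed endpoint.
# Replace <uuid> with the value from your terraform outputs and use the
# secret you set in terraform/aws/terraform.tfvars (sent as the X-SECRET header).
from tourist.core import TouristScraper

scraper = TouristScraper(
    "https://<uuid>.execute-api.us-east-1.amazonaws.com/main",
    secret="the-secret-you-set",
)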

Client

Build your own LLM tools, web scraping apps, or automated testing workflows with the Tourist client.

Python

You can use the Python client to interact with your Tourist service. Check out the examples folder for the complete code.

pip install tourist
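
After installing, a minimal quick-start (a sketch based on the client methods used in the examples below) looks like this:

# Hedged quick-start sketch against the local service started above.
from tourist.core import TouristScraper

scraper = TouristScraper("http://localhost:8000", secret="doesntmatterlocally")

# Fetch a single page; the response dict includes the page's "source_html".
page = scraper.get_page("https://www.example.com")
print(len(page.get("source_html", "")))

# Run a search; each result dict includes a "content" field.
results = scraper.get_serp("serverless web scraping", max_results=3)
print([r["content"][:80] for r in results])
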
LLM Tools

For example, create a LangChain Tool for your LLM Agent.

from bs4 import BeautifulSoup as bs
from tourist.core import TouristScraper
from langchain_core.tools import tool

# Assumes you're running locally,
# change this to your cloud endpoint if you've deployed via terraform.
scraper = TouristScraper(
    "http://localhost:8000",
    secret="doesntmatterlocally",  # authorization secret  
    concurrency=1,  # control concurrent searches/scrapes
)


@tool
def scrape_tool(url: str) -> str:
    """
    A web scraper. 
    Useful for when you need to answer questions related to the contents of a website or URL.
    Input should be a URL.
    """
    results = scraper.get_page(url)
    if source_html := results.get("source_html"):
        soup = bs(source_html, "html.parser")
        return soup.get_text()
    else:
        return "Could not scrape that page. Try again."


@tool
def search_tool(query: str) -> str:
    """
    A search tool.
    Useful for when you need to answer questions about current events, people, places, or things.
    Input should be a search query.
    """
    results = scraper.get_serp(query, max_results=3)
    return " ".join([r["content"] for r in results])


# ... use the tools
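
As a hedged sketch of that last step (assuming a LangChain chat model such as langchain-openai's ChatOpenAI, which is not part of Tourist), the tools can be bound to a model:

# Hedged sketch: hand the tools to a LangChain chat model.
# Assumes the langchain-openai package and an OPENAI_API_KEY are available;
# any chat model that supports .bind_tools() works the same way.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
llm_with_tools = llm.bind_tools([scrape_tool, search_tool])

reply = llm_with_tools.invoke("Summarize the text at https://www.example.com")
print(reply.tool_calls)  # the model's requested tool invocations, if any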

Selenium

You can submit selenium scripts (human or AI-generated) to Tourist for execution.

from pprint import pprint

from tourist.core import TouristScraper


scraper = TouristScraper("http://localhost:8000", "no-secret")

my_selenium_code = """
# `driver`, a selenium `webdriver.Chrome` object, is available in globals
driver.get("https://www.example.com")
html = driver.page_source
# any key:values stored in `actions_output` will be available in the response
actions_output["html"] = html
actions_output["current_url"] = driver.current_url
"""

result = scraper.get_page_with_actions(my_selenium_code)
assert result["current_url"] == "https://www.example.com/"
pprint(result["html"])

Contributions

This is an open-source project. Please consider contributing improvements or features related to your specific use case; chances are someone else is facing the same issue or limitation. Some ready-to-pick-up tasks are marked in the source code as TODO/Contribution: ....

To run the Tourist service on your local machine for testing or prototyping:

  1. Have Docker installed
  2. Clone this repo
  3. Add your contributions/modifications
  4. make tourist-local - builds the container from the source code in src/

Credits

Components of this repository were influenced by these projects! Check them out.

