Apify-Haystack integration

The Apify-Haystack integration allows easy interaction between the Apify platform and Haystack.

Apify is a platform for web scraping, data extraction, and web automation. It provides serverless applications called Actors for tasks such as crawling websites and scraping Facebook, Instagram, and Google results.

Haystack offers an ecosystem of tools for building, managing, and deploying search engines and LLM applications.

Installation

Apify-haystack is available on PyPI as the apify-haystack package.

pip install apify-haystack
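
To confirm that the install succeeded, one option is to look up the installed distribution's version with the standard library. This is a minimal sketch; the helper name installed_version is hypothetical and not part of apify-haystack.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("apify-haystack")
print(found if found else "apify-haystack is not installed")
```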

Examples

Crawl a website using Apify's Website Content Crawler and convert it to Haystack Documents

You need an Apify account and an API token to run this example. You can sign up for a free Apify account and get your API token there.
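
A script can fail early with a clear message when the token is missing, rather than partway through an Actor call. The sketch below assumes the token is stored in the APIFY_API_TOKEN environment variable; read_apify_token is a hypothetical helper, not part of apify-haystack.

```python
import os
from typing import Mapping, Optional


def read_apify_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Apify API token from the environment, or raise a clear error."""
    # Accept an explicit mapping so the check is easy to exercise without touching os.environ.
    source = os.environ if env is None else env
    token = source.get("APIFY_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError("APIFY_API_TOKEN is not set; get a token from your Apify account first.")
    return token
```

Calling read_apify_token() before creating any components surfaces a missing token right away.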

In the example below, set the APIFY_API_TOKEN environment variable (for example, in a .env file) and run the script:

from dotenv import load_dotenv
from haystack import Document

from apify_haystack import ApifyDatasetFromActorCall

# Load APIFY_API_TOKEN from a .env file (or export it in your shell before running)
load_dotenv()

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 3,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}


def dataset_mapping_function(dataset_item: dict) -> Document:
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


actor = ApifyDatasetFromActorCall(
    actor_id=actor_id, run_input=run_input, dataset_mapping_function=dataset_mapping_function
)
print(f"Calling the Apify Actor {actor_id} ... crawling will take some time ...")
print("You can monitor the progress at: https://console.apify.com/actors/runs")

dataset = actor.run().get("documents")

print(f"Loaded {len(dataset)} documents from the Apify Actor {actor_id}:")
for d in dataset:
    print(d)
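
The mapping function in the example above reads the "text" and "url" keys with dict.get, so a page without extracted text would yield a Document whose content is None. One way to handle that is to normalize items first. The following is a sketch using only the standard library; item_to_fields and the sample items are hypothetical, and the returned keys mirror the content and meta arguments passed to Document above.

```python
from typing import Optional


def item_to_fields(dataset_item: dict) -> Optional[dict]:
    """Turn a raw dataset item into Document keyword arguments, or None to skip it."""
    text = (dataset_item.get("text") or "").strip()
    if not text:
        return None  # nothing to index for this page
    return {"content": text, "meta": {"url": dataset_item.get("url")}}


# Illustrative sample items, not real crawl output.
items = [
    {"url": "https://haystack.deepset.ai/", "text": "  Haystack docs  "},
    {"url": "https://haystack.deepset.ai/", "text": None},
]
fields = [f for f in (item_to_fields(i) for i in items) if f is not None]
print(fields)
```

Each resulting dict can be unpacked into a Document, for example Document(**fields[0]).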

More examples

See the examples directory for more examples. Here are a few of them:

  • Load a dataset from Apify and convert it to a Haystack Document
  • Call Website Content Crawler and convert the data into Haystack Documents
  • Crawl websites, retrieve text content, and store it in the InMemoryDocumentStore
  • Retrieval-Augmented Generation (RAG): Extracting text from a website & question answering
  • Analyze Your Instagram Comments’ Vibe with Apify and Haystack

Support

If you find a bug or run into a problem, please submit an issue on GitHub. For questions, ask on Stack Overflow or in GitHub Discussions, or join our Discord server.

Contributing

Your code contributions are welcome. If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see CONTRIBUTING.md.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
