Intelligent Market Monitoring

fraudcrawler

Fraudcrawler is an intelligent market monitoring tool that searches the web for products, extracts product details, and classifies them using LLMs. It combines search APIs, web scraping, and AI to automate product discovery and relevance assessment.

Features

  • Asynchronous pipeline - Products move through search, extraction, and classification stages independently
  • Multiple search engines - Google Search, Google Shopping, and more...
  • Search term enrichment - Automatically find related terms and expand your search
  • Product extraction - Get structured product data via Zyte API
  • LLM classification - Assess product relevance using OpenAI API with custom prompts
  • Marketplace filtering - Focus searches on specific domains
  • Deduplication - Avoid reprocessing previously collected URLs
  • CSV export - Results saved with timestamps for easy tracking

Prerequisites

  • Python 3.11 or higher
  • API keys for:
    • SerpAPI - Google search results
    • Zyte API - Product data extraction
    • OpenAI API - Product classification
    • DataForSEO (optional) - Search term enrichment

Installation

python3.11 -m venv .venv
source .venv/bin/activate
pip install fraudcrawler

Using Poetry, from a clone of the repository:

poetry install

Configuration

Create a .env file with your API credentials (see .env.example for a template):

SERPAPI_KEY=your_serpapi_key
ZYTEAPI_KEY=your_zyte_key
OPENAIAPI_KEY=your_openai_key
DATAFORSEO_USER=your_user  # optional
DATAFORSEO_PWD=your_pwd    # optional
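Before launching a run it can help to confirm that the required credentials are actually visible to the process. The following is a minimal sketch, not part of fraudcrawler: the helper name `missing_keys` is hypothetical, and it only inspects the process environment, so it assumes the values from .env have already been loaded or exported.

```python
import os

# Required variable names are taken from the .env listing above.
REQUIRED_KEYS = ("SERPAPI_KEY", "ZYTEAPI_KEY", "OPENAIAPI_KEY")


def missing_keys(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; not a fraudcrawler function.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing credentials:", ", ".join(missing))
```

The DataForSEO variables are left out of the check because they are optional.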

Usage

Basic Configuration

For a complete working example, see fraudcrawler/launch_demo_pipeline.py. After setting up the necessary parameters you can launch and analyse the results with:

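# Run pipeline (client and parameters are set up as in launch_demo_pipeline.py;
# await must be called from inside an async function)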
await client.run(
    search_term=search_term,
    search_engines=search_engines,
    language=language,
    location=location,
    deepness=deepness,
    excluded_urls=excluded_urls,
)

# Load results
df = client.load_results()
print(df.head())
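Because `client.run` is awaited, the call has to live inside a coroutine that is started with `asyncio.run`. The sketch below shows only that wrapping pattern. `StandInClient` is a hypothetical placeholder written for this example, not the real fraudcrawler client, and the plain-string arguments are assumptions; see fraudcrawler/launch_demo_pipeline.py for the actual setup.

```python
import asyncio


class StandInClient:
    """Hypothetical stand-in for the real client; NOT part of fraudcrawler."""

    async def run(self, **params):
        # Pretend to do asynchronous work, then remember the parameters.
        await asyncio.sleep(0)
        self.params = params

    def load_results(self):
        # The real client returns a DataFrame; a list keeps this sketch dependency-free.
        return [self.params["search_term"]]


async def main():
    client = StandInClient()
    await client.run(search_term="sildenafil", language="de", location="ch")
    return client.load_results()


# asyncio.run creates the event loop, runs main() to completion, and closes the loop.
results = asyncio.run(main())
print(results)
```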

Advanced Configuration

Search term enrichment - Find and search related terms:

from fraudcrawler import Enrichment

deepness.enrichment = Enrichment(
    additional_terms=5,
    additional_urls_per_term=10
)

Marketplace filtering - Focus on specific domains:

from fraudcrawler import Host

marketplaces = [
    Host(name="International", domains="zavamed.com,apomeds.com"),
    Host(name="National", domains="netdoktor.ch,nobelpharma.ch"),
]

await client.run(..., marketplaces=marketplaces)

Exclude domains - Exclude specific domains from your results:

excluded_urls = [
    Host(name="Compendium", domains="compendium.ch"),
]

await client.run(..., excluded_urls=excluded_urls)
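Both `marketplaces` and `excluded_urls` describe domains as a comma-separated string. To illustrate what matching a URL against such a string could look like, here is a minimal sketch. It is not fraudcrawler's internal logic; the function name `host_matches` and the subdomain rule are assumptions made for this example.

```python
from urllib.parse import urlsplit


def host_matches(url, domains):
    """Check whether the URL's host equals, or is a subdomain of, a listed domain.

    `domains` is a comma-separated string, mirroring the Host(domains=...) format.
    Hypothetical helper for illustration; not a fraudcrawler function.
    """
    hostname = (urlsplit(url).hostname or "").lower()
    wanted = [d.strip().lower() for d in domains.split(",") if d.strip()]
    # Require an exact match or a dot boundary so "notzavamed.com" does not match.
    return any(hostname == d or hostname.endswith("." + d) for d in wanted)


print(host_matches("https://www.zavamed.com/de/item", "zavamed.com,apomeds.com"))
```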

Skip previously collected URLs:

previously_collected_urls = [
    "https://example.com/product1",
    "https://example.com/product2",
]

await client.run(..., previously_collected_urls=previously_collected_urls)
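The list of previously collected URLs can be built from an earlier results file instead of being typed by hand. The sketch below assumes the CSV has a column named `url`; that column name is a guess, so check the header of your own results file. `urls_from_csv` is a hypothetical helper, not a fraudcrawler function.

```python
import csv


def urls_from_csv(path, column="url"):
    """Collect the distinct, non-empty values of `column`, keeping first-seen order.

    The default column name "url" is an assumption about the results file.
    """
    seen = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            value = (row.get(column) or "").strip()
            if value:
                seen.setdefault(value, None)  # dicts preserve insertion order
    return list(seen)
```

The returned list can then be passed as `previously_collected_urls`.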

Redis cache - Set REDIS_USE_CACHE=true and run a Redis server to cache API and scraping calls (Searcher, Enricher, Zyte, Workflow).

View all results from a client instance:

client.print_available_results()

Output

Results are saved as CSV files in data/results/ with the naming pattern:

<search_term>_<language_code>_<location_code>_<timestamp>.csv

Example: sildenafil_de_ch_20250115143022.csv
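When post-processing many result files, the naming pattern can be split back into its parts. This is a minimal sketch: `parse_result_name` is a hypothetical helper, and the timestamp format `YYYYMMDDHHMMSS` is inferred from the example file name rather than documented.

```python
from datetime import datetime
from pathlib import Path


def parse_result_name(filename):
    """Split '<search_term>_<language_code>_<location_code>_<timestamp>.csv' into parts."""
    stem = Path(filename).stem
    # Split from the right so underscores inside the search term survive.
    search_term, language, location, stamp = stem.rsplit("_", 3)
    return {
        "search_term": search_term,
        "language": language,
        "location": location,
        # Assumed format: YYYYMMDDHHMMSS, inferred from the example file name.
        "timestamp": datetime.strptime(stamp, "%Y%m%d%H%M%S"),
    }


info = parse_result_name("sildenafil_de_ch_20250115143022.csv")
print(info["search_term"], info["timestamp"].isoformat())
# prints: sildenafil 2025-01-15T14:30:22
```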

The CSV includes product details, URLs, and classification scores from your workflows.

Development

For detailed contribution guidelines, see CONTRIBUTING.md.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Architecture

Fraudcrawler uses an asynchronous pipeline where products can be at different processing stages simultaneously. Product A might be in classification while Product B is still being scraped. This is enabled by async workers for each stage (Search, Context Extraction, Processing) using httpx.AsyncClient.
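The staged design described above can be sketched with `asyncio.Queue` objects connecting one worker per stage. This is an illustration of the general pattern only, not fraudcrawler's implementation: the stage names, the sentinel, and the placeholder `asyncio.sleep(0)` standing in for HTTP or LLM calls are all assumptions.

```python
import asyncio

DONE = object()  # sentinel that tells a stage the upstream stage has finished


async def stage(name, inbox, outbox):
    """Tag each item with the stage name and pass it downstream."""
    while True:
        item = await inbox.get()
        if item is DONE:
            await outbox.put(DONE)  # propagate shutdown to the next stage
            return
        await asyncio.sleep(0)  # placeholder for real I/O (HTTP request, LLM call)
        await outbox.put(item + [name])


async def run_pipeline(urls):
    found, extracted, classified = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    workers = [
        asyncio.create_task(stage("extract", found, extracted)),
        asyncio.create_task(stage("classify", extracted, classified)),
    ]
    # The "search" stage is simulated by feeding URLs in directly.
    for url in urls:
        await found.put([url, "search"])
    await found.put(DONE)
    await asyncio.gather(*workers)

    # Drain the final queue until the sentinel arrives.
    results = []
    while True:
        item = await classified.get()
        if item is DONE:
            return results
        results.append(item)


print(asyncio.run(run_pipeline(["https://example.com/a", "https://example.com/b"])))
```

Because each stage pulls from its own queue, an item can be classified while a later item is still waiting for extraction, which is the behavior the paragraph above describes.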

Async Setup

For more details on the async design, see the httpx documentation.
