LLM-friendly scraper for media and text from social media and the open web.

Project description

scrapeMM: Multimodal Web Retrieval

A simple web scraper that asynchronously retrieves webpages and social media content, fetching text together with media (images and videos).

The library aims to help developers and researchers easily access multimodal data from the web and use it for LLM processing.

Usage

from scrapemm import retrieve

url = "https://example.com"
# Fetch the page; the result is a MultimodalSequence of text and media
result = retrieve(url)
result.render()
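
The description mentions asynchronous retrieval, while the snippet above calls retrieve() without await. This page does not say whether retrieve() returns a coroutine in this version, so the following is a minimal sketch that tolerates both cases. The helper names retrieve_any and retrieve_many are hypothetical and not part of scrapeMM; the retrieval function is passed in as an argument so the sketch makes no assumption about the library's signature beyond "takes a URL".

```python
import asyncio
import inspect
from typing import Any, Callable, Iterable, List


async def retrieve_any(retrieve_fn: Callable[[str], Any], url: str) -> Any:
    """Call retrieve_fn(url) and await the result only if it is awaitable."""
    result = retrieve_fn(url)
    if inspect.isawaitable(result):
        result = await result
    return result


async def retrieve_many(retrieve_fn: Callable[[str], Any],
                        urls: Iterable[str]) -> List[Any]:
    """Retrieve several URLs; results are returned in input order.

    The calls overlap only if retrieve_fn is a coroutine function.
    A plain synchronous retrieve_fn still works but runs one call at a time.
    """
    return await asyncio.gather(*(retrieve_any(retrieve_fn, u) for u in urls))


# Hypothetical usage with scrapeMM (assumes the package is installed):
#   from scrapemm import retrieve
#   results = asyncio.run(retrieve_many(retrieve, ["https://example.com"]))
```

Passing the function in keeps the sketch testable with a stub and avoids guessing at scrapeMM internals.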

How it works

Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence

The MultimodalSequence is a sequence of Markdown-formatted text and media provided by the ezMM library.
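
To make the idea of "a sequence of Markdown-formatted text and media" concrete, here is a toy sketch that splits a Markdown string into alternating text and image items. The Text and MediaRef classes and the split_markdown function are hypothetical illustrations; they are not the ezMM classes and say nothing about how MultimodalSequence is actually implemented.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    markdown: str


@dataclass
class MediaRef:
    kind: str  # only "image" is produced by this sketch
    url: str


# Matches Markdown image syntax: ![alt text](url)
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def split_markdown(markdown: str) -> List[Union[Text, MediaRef]]:
    """Split Markdown into an ordered list of text chunks and image references."""
    items: List[Union[Text, MediaRef]] = []
    pos = 0
    for match in IMAGE_PATTERN.finditer(markdown):
        text = markdown[pos:match.start()].strip()
        if text:
            items.append(Text(text))
        items.append(MediaRef("image", match.group(1)))
        pos = match.end()
    tail = markdown[pos:].strip()
    if tail:
        items.append(Text(tail))
    return items
```

Keeping text and media in one ordered list preserves where each image appeared relative to the surrounding text, which is the property a multimodal LLM prompt usually needs.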

Web scraping is done with Firecrawl.

Supported Proprietary APIs

  • ✅ X/Twitter
  • ✅ Telegram
  • ⏳ Facebook
  • ⏳ Instagram
  • ⏳ Threads
  • ⏳ TikTok
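
A scraper that mixes generic web scraping with platform APIs needs some way to decide which path a URL takes. The following is a minimal sketch of hostname-based routing for the two platforms marked ✅ above. The host table, the classify_url function and the returned labels are assumptions made for illustration; the page does not describe how scrapeMM routes URLs internally.

```python
from urllib.parse import urlparse

# Hypothetical host-to-platform table (assumption, not taken from scrapeMM).
PLATFORM_HOSTS = {
    "x.com": "x",
    "twitter.com": "x",
    "t.me": "telegram",
}


def classify_url(url: str) -> str:
    """Return a platform label for known hosts, or "web" for everything else."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return PLATFORM_HOSTS.get(host, "web")
```

In this sketch any host not in the table, including other subdomains of the listed sites, falls back to the generic "web" path.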

Download files

Source Distribution

scrapemm-0.1.1.tar.gz (16.5 kB)

Uploaded Source

Built Distribution

scrapemm-0.1.1-py3-none-any.whl (20.6 kB)

Uploaded Python 3

File details

Details for the file scrapemm-0.1.1.tar.gz.

File metadata

  • Download URL: scrapemm-0.1.1.tar.gz
  • Upload date:
  • Size: 16.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.1.1.tar.gz

Algorithm    Hash digest
SHA256       1391bc24c1e66d3724fe1193fe4f1b0c8670863c1f4f699173784fe25a8e7b87
MD5          9c71710eb120cbecafede39c4d8ed759
BLAKE2b-256  9d1c15fc9e05c5b5773883afa646490b0edb10619db202ef4396e81945ba5354
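
The published digests can be used to check a downloaded file before installing it. Below is a minimal sketch using Python's standard hashlib module. The function names sha256_of and matches_sha256 and the local file path in the usage comment are hypothetical; the expected digest in that comment is the SHA256 value listed for the source distribution.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Hypothetical usage, assuming the sdist was saved to the current directory:
#   expected = "1391bc24c1e66d3724fe1193fe4f1b0c8670863c1f4f699173784fe25a8e7b87"
#   print(matches_sha256("scrapemm-0.1.1.tar.gz", expected))
```

The same check applies to the wheel by substituting its file name and its listed SHA256 digest.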

File details

Details for the file scrapemm-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: scrapemm-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 20.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.1.1-py3-none-any.whl

Algorithm    Hash digest
SHA256       70f883e24cf6f392428519153cd04892d9451af2efdb8c4277fce084e30c013b
MD5          b18b431caa990229cb9794ce32ff1d39
BLAKE2b-256  6cfa0af94a60c15e9d9f4b13371422d311d2ba79a4916cb00f1e3ef3ffd96dc6
