
LLM-friendly scraper for media and text from social media and the open web.


scrapeMM: Multimodal Web Retrieval

A simple web scraper that asynchronously retrieves webpages and social media content, fetching text along with media (images and videos).

This library aims to help developers and researchers access multimodal data from the web and use it for LLM processing.

Usage

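from scrapemm import retrieve

# Any webpage or supported social media URL
url = "https://example.com"

# Fetch the page's text and media as a MultimodalSequence
result = retrieve(url)

# Render the retrieved sequence
result.render()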
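
Because retrieval is described as asynchronous, several URLs can in principle be fetched concurrently. The sketch below shows that pattern with plain `asyncio`. It deliberately does not call scrapeMM: `fake_retrieve` and `retrieve_all` are hypothetical stand-ins, and whether `retrieve` itself must be awaited is not stated here.

```python
import asyncio


async def fake_retrieve(url: str) -> str:
    """Hypothetical stand-in for a scraper call; performs no network I/O."""
    await asyncio.sleep(0)  # yield control, as a real network request would
    return f"content of {url}"


async def retrieve_all(urls: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and preserves input order
    return await asyncio.gather(*(fake_retrieve(u) for u in urls))


pages = asyncio.run(
    retrieve_all(["https://example.com/a", "https://example.com/b"])
)
```

With a real asynchronous scraper in place of the stand-in, the total wait would be roughly that of the slowest page rather than the sum of all pages.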

How it works

Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence

MultimodalSequence, a class provided by the ezMM library, is a sequence of Markdown-formatted text and media.
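
To make the idea of an interleaved text-and-media sequence concrete, here is a minimal sketch of such a structure. It is not ezMM's implementation: the `Media` class, the `to_markdown` helper, and the `<kind:n>` placeholder format are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Media:
    """Hypothetical media item; not ezMM's actual class."""
    kind: str    # "image" or "video"
    source: str  # path or URL of the media file


# A sequence item is either a Markdown string or a media item
Item = Union[str, Media]


def to_markdown(items: list[Item]) -> str:
    """Flatten the sequence into one string, replacing each media
    item with a numbered placeholder such as <image:1>."""
    parts: list[str] = []
    counters: dict[str, int] = {}
    for item in items:
        if isinstance(item, Media):
            counters[item.kind] = counters.get(item.kind, 0) + 1
            parts.append(f"<{item.kind}:{counters[item.kind]}>")
        else:
            parts.append(item)
    return " ".join(parts)


sequence = ["# Title", Media("image", "photo.jpg"), "Some **bold** caption."]
flat = to_markdown(sequence)
```

Keeping media as separate items rather than inlining them lets the text be passed to an LLM as-is while the media files are attached alongside it.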

Web scraping is done with Firecrawl.

Supported Proprietary APIs

  • ✅ X/Twitter
  • ✅ Telegram
  • ⏳ Facebook
  • ⏳ Instagram
  • ⏳ Threads
  • ⏳ TikTok
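
One plausible way to combine platform APIs with general scraping is to route each URL by hostname. The sketch below assumes ✅ marks integrations that are already available and uses Firecrawl as the fallback for everything else. The `API_HOSTS` table and `choose_backend` function are hypothetical and do not reflect scrapeMM's internal code.

```python
from urllib.parse import urlparse

# Hypothetical mapping from hostname to a platform-specific integration
API_HOSTS = {
    "x.com": "x",
    "twitter.com": "x",
    "t.me": "telegram",
}


def choose_backend(url: str) -> str:
    """Return the integration name for a URL, or "firecrawl" as fallback."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.twitter.com like twitter.com
    return API_HOSTS.get(host, "firecrawl")
```

For example, `choose_backend("https://t.me/some_channel/5")` selects the Telegram integration, while an ordinary news article URL falls through to the general scraper.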
