LLM-friendly scraper for media and text from social media and the open web.

Project description

scrapeMM: Multimodal Web Retrieval

A simple web scraper that asynchronously retrieves webpages and social media content, fetching text along with media (images and videos).

This library helps developers and researchers access multimodal data from the web and use it for LLM processing.

Usage

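import asyncio

from scrapemm import retrieve

url = "https://example.com"

# retrieve() is asynchronous; asyncio.run() starts an event loop,
# runs the call to completion, and closes the loop afterwards.
result = asyncio.run(retrieve(url))

# Render the retrieved MultimodalSequence
result.render()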
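
Because retrieval is asynchronous, several URLs can be fetched concurrently. The following is a minimal sketch of that pattern, not part of scrapeMM itself: `retrieve_many` and `max_concurrency` are hypothetical names, and `retrieve` is replaced by an offline stub so the example runs without network access or API keys. In real use, import `retrieve` from `scrapemm` instead.

```python
import asyncio


# Stand-in for scrapemm.retrieve so the sketch runs offline (hypothetical stub).
# In real use: `from scrapemm import retrieve`.
async def retrieve(url: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"content of {url}"


async def retrieve_many(urls, max_concurrency=3):
    """Fetch several URLs concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch(url):
        async with semaphore:
            try:
                return await retrieve(url)
            except Exception as exc:
                # Return the error so one failure does not cancel the batch.
                return exc

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch(u) for u in urls))


urls = ["https://example.com/a", "https://example.com/b"]
results = asyncio.run(retrieve_many(urls))
```

Capping concurrency with a semaphore is one way to stay within the rate limits of scraping services and social media APIs; the right limit depends on your accounts and is an assumption here.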

scrapeMM will ask you for the API keys needed for the social media integrations; you may skip any you don't need. You will also be prompted to choose a password, which is used to encrypt the file that stores these secrets.

How it works

Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence

The MultimodalSequence is a sequence of Markdown-formatted text and media provided by the ezMM library.
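
To illustrate the idea of a sequence that interleaves Markdown text with media, here is a toy model. It is only a sketch of the concept: `MediaRef` and `to_markdown` are hypothetical names invented for this example and are not the ezMM API, whose actual classes and methods may differ.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class MediaRef:
    """Hypothetical placeholder for an image or video; not an ezMM class."""
    kind: str    # "image" or "video"
    source: str  # URL or file path


Item = Union[str, MediaRef]


def to_markdown(items: List[Item]) -> str:
    """Render text as-is and media as Markdown image or link syntax."""
    parts = []
    for item in items:
        if isinstance(item, MediaRef):
            if item.kind == "image":
                parts.append(f"![image]({item.source})")
            else:
                parts.append(f"[{item.kind}]({item.source})")
        else:
            parts.append(item)
    return "\n\n".join(parts)


sequence = [
    "# Post title",
    MediaRef("image", "https://example.com/cat.jpg"),
    "Caption text.",
]
markdown = to_markdown(sequence)
```

Keeping media in their original position between text blocks preserves the context an LLM needs to relate, for example, a caption to the image it describes.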

Web scraping is done with Firecrawl and Decodo.

Supported Proprietary APIs

  • ✅ X/Twitter
  • ✅ Telegram
  • ✅ Bluesky
  • ✅ TikTok
  • ⚠️ Facebook (works only intermittently, and only with yt-dlp and Decodo)
  • ⚠️ Instagram (videos supported; images not yet)
  • ⚠️ YouTube (works intermittently)
  • ⏳ Threads
  • ⏳ Reddit
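
A library with both platform integrations and generic scraping needs some way to decide which path a URL takes. The sketch below shows one common approach, matching on the hostname. It is an assumption for illustration and not scrapeMM's actual dispatch logic: `PLATFORM_HOSTS` and `pick_platform` are hypothetical names, and the hostname list is illustrative rather than exhaustive.

```python
from urllib.parse import urlparse

# Hypothetical mapping from hostnames to platform names; illustrative only.
PLATFORM_HOSTS = {
    "x.com": "X/Twitter",
    "twitter.com": "X/Twitter",
    "t.me": "Telegram",
    "bsky.app": "Bluesky",
    "tiktok.com": "TikTok",
}


def pick_platform(url):
    """Return the platform name for a URL, or None for generic web scraping."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    for domain, platform in PLATFORM_HOSTS.items():
        # Match the domain itself or any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return platform
    return None
```

A URL that matches no known platform returns `None`, which a caller could treat as the signal to fall back to general web scraping.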

Download files

Source Distribution

scrapemm-0.3.3.tar.gz (32.2 kB)

Uploaded Source

Built Distribution

scrapemm-0.3.3-py3-none-any.whl (39.0 kB)

Uploaded Python 3

File details

Details for the file scrapemm-0.3.3.tar.gz.

File metadata

  • Download URL: scrapemm-0.3.3.tar.gz
  • Upload date:
  • Size: 32.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.3.3.tar.gz:

  • SHA256: 96cfc5022316b77a6f7a78967e0c7ecd04978c17db50cf05ba437a6349b8d14a
  • MD5: 13e6b88a03c2983b191e840447d22e05
  • BLAKE2b-256: e2d526c5804ebca89d7d0be93bf73e5f4b78ee74bb41bfbf101f2821734d98fd

File details

Details for the file scrapemm-0.3.3-py3-none-any.whl.

File metadata

  • Download URL: scrapemm-0.3.3-py3-none-any.whl
  • Upload date:
  • Size: 39.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.3.3-py3-none-any.whl:

  • SHA256: 16ec687e68ce3c1487a13db69ad21764d41284d5256edf25c6755a1f6d9e6f0a
  • MD5: 2d7ac0c9dba420b3b9657ac183dcf126
  • BLAKE2b-256: ceca3d8b9e1e14a3ca2a6f913f528ba67bdf7283ff839e9bd4f09c9864da9169
