LLM-friendly scraper for media and text from social media and the open web.

Project description

scrapeMM: Multimodal Web Retrieval

A simple web scraper that asynchronously retrieves webpages and social media content, fetching text together with media (images and videos).

This library aims to help developers and researchers easily access multimodal data from the web and use it for LLM processing.

Usage

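import asyncio

from scrapemm import retrieve

url = "https://example.com"
# retrieve() is awaited; asyncio.run() creates an event loop, runs the call to completion, and closes the loop
result = asyncio.run(retrieve(url))
result.render()  # render the retrieved MultimodalSequence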

scrapeMM will prompt you for the API keys needed for the social media integrations; you may skip any you don't need. You will also be prompted to choose a password, which secures the stored secrets in an encrypted file.

How it works

Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence

The MultimodalSequence is a sequence of Markdown-formatted text and media provided by the ezMM library.
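
To make the idea of an interleaved text-and-media sequence concrete, here is a toy model. It is not ezMM's API: `ToySequence`, `MediaRef`, and their methods are hypothetical names invented for illustration, and the real `MultimodalSequence` may be structured quite differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class MediaRef:
    """Hypothetical placeholder for one media item inside a sequence."""
    kind: str    # "image" or "video"
    source: str  # where the media came from, e.g. a URL or file path


@dataclass
class ToySequence:
    """Toy model: Markdown text chunks and media references in reading order."""
    items: List[Union[str, MediaRef]] = field(default_factory=list)

    def text(self) -> str:
        # Join only the text chunks, skipping media items.
        return "\n".join(i for i in self.items if isinstance(i, str))

    def media(self, kind: Optional[str] = None) -> List[MediaRef]:
        # Return all media items, optionally filtered by kind.
        return [
            i for i in self.items
            if isinstance(i, MediaRef) and (kind is None or i.kind == kind)
        ]


if __name__ == "__main__":
    seq = ToySequence([
        "# Post title",
        MediaRef("image", "https://example.com/photo.jpg"),
        "Some **Markdown** body text.",
        MediaRef("video", "https://example.com/clip.mp4"),
    ])
    print(seq.text())
    print([m.source for m in seq.media("video")])
```

Keeping text and media in one ordered list preserves where each image or video appeared relative to the surrounding text, which is the property an LLM prompt built from the page would typically need.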

Web scraping is done with Firecrawl and Decodo.

Supported Proprietary APIs

  • ✅ X/Twitter
  • ✅ Telegram
  • ✅ Bluesky
  • ✅ TikTok
  • ✅ YouTube (works intermittently)
  • ⚠️ Facebook (videos supported, images not yet)
  • ⚠️ Instagram (videos supported, images not yet)
  • ⏳ Threads
  • ⏳ Reddit
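
Before calling `retrieve()` on a social media link, it can be handy to check which entry of the list above a URL falls under. The lookup below is a sketch only: the support labels are copied from the list, but the hostname table, the `PLATFORMS` name, and the `platform_status` helper are assumptions of this example and say nothing about how scrapeMM routes URLs internally.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse

# Hostname -> (platform, status). Statuses mirror the list above;
# the hostname mapping itself is an assumption of this sketch.
PLATFORMS = {
    "x.com": ("X/Twitter", "supported"),
    "twitter.com": ("X/Twitter", "supported"),
    "t.me": ("Telegram", "supported"),
    "bsky.app": ("Bluesky", "supported"),
    "tiktok.com": ("TikTok", "supported"),
    "youtube.com": ("YouTube", "intermittent"),
    "youtu.be": ("YouTube", "intermittent"),
    "facebook.com": ("Facebook", "videos only"),
    "instagram.com": ("Instagram", "videos only"),
    "threads.net": ("Threads", "planned"),
    "reddit.com": ("Reddit", "planned"),
}


def platform_status(url: str) -> Optional[Tuple[str, str]]:
    """Return (platform, status) for a known social media URL, else None."""
    host = (urlparse(url).hostname or "").lower()
    for domain, info in PLATFORMS.items():
        # Match the bare domain or any subdomain such as www. or m.
        if host == domain or host.endswith("." + domain):
            return info
    return None


if __name__ == "__main__":
    for link in ("https://www.youtube.com/watch?v=abc", "https://example.com"):
        print(link, "->", platform_status(link))
```

URLs that return `None` are ordinary webpages as far as this sketch is concerned, i.e. the case the description attributes to general web scraping.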

Download files

Source Distribution

scrapemm-0.3.6.tar.gz (33.5 kB)

Uploaded Source

Built Distribution

scrapemm-0.3.6-py3-none-any.whl (39.6 kB)

Uploaded Python 3

File details

Details for the file scrapemm-0.3.6.tar.gz.

File metadata

  • Download URL: scrapemm-0.3.6.tar.gz
  • Upload date:
  • Size: 33.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.3.6.tar.gz
Algorithm    Hash digest
SHA256       d75408e6ca0afffea4da1b927b80db9df6f72c37cf86773df5a8722c2f4acb12
MD5          dd1f7ef6dd983bca604c37db205fc4b6
BLAKE2b-256  3bc4a1661b9387f53cc12f39c908d338818a56e42ac0276e2bb17e1f8533501f

File details

Details for the file scrapemm-0.3.6-py3-none-any.whl.

File metadata

  • Download URL: scrapemm-0.3.6-py3-none-any.whl
  • Upload date:
  • Size: 39.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.3.6-py3-none-any.whl
Algorithm    Hash digest
SHA256       c0538542cf6155b82a4ec0679f8749954880a558d4c4f05aa6b6e5bb92fa91ff
MD5          99214cb1e454f49aa0a00ec0fb4d1805
BLAKE2b-256  7d90efa0529f377493be9ab58a2a9553c3032509f668631c69609a3cfb7bf200
