Flexible multimodal scraper for social media and the open web.

scrapeMM: Multimodal Web Retrieval

A simple web scraper that asynchronously retrieves webpages and social media content, fetching text together with media (images and videos).

This library helps developers and researchers access multimodal data from the web and use it for LLM processing.

Usage

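import asyncio

from scrapemm import retrieve

url = "https://example.com"

async def main():
    # retrieve() is awaitable and resolves to a MultimodalSequence
    return await retrieve(url)

# asyncio.run() creates an event loop, runs main() to completion, and closes the loop
result = asyncio.run(main())
result.render()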

scrapeMM will ask you for the API keys needed for the social media integrations. You may skip any you don't need. You will also be prompted to choose a password, which is used to secure the secrets in an encrypted file.
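
Because retrieve is awaitable, as the usage snippet above shows, several URLs can in principle be fetched concurrently with standard asyncio tools. The following is a minimal sketch of that pattern, not part of scrapeMM itself: fake_retrieve is a hypothetical stand-in so the example runs offline, and retrieve_all and max_concurrency are names invented for illustration. Passing scrapemm.retrieve as retrieve_fn should work under the assumption that it takes a single URL string, as in the usage snippet.

```python
import asyncio


async def fake_retrieve(url: str) -> str:
    """Hypothetical stand-in for scrapemm.retrieve so the sketch runs offline."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"content of {url}"


async def retrieve_all(urls, retrieve_fn=fake_retrieve, max_concurrency=3):
    """Retrieve several URLs concurrently; results keep the input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        # At most max_concurrency retrievals are in flight at any time
        async with semaphore:
            return await retrieve_fn(url)

    # return_exceptions=True returns exceptions as results instead of raising the first one
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)


urls = ["https://example.com/a", "https://example.com/b"]
results = asyncio.run(retrieve_all(urls))
for url, result in zip(urls, results):
    print(url, "->", result)
```

The semaphore bounds the number of simultaneous requests, which is a common courtesy toward rate-limited scraping backends and social media APIs.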

How it works

Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence

MultimodalSequence, a class provided by the ezMM library, is a sequence of Markdown-formatted text and media.

Web scraping is done with Firecrawl and Decodo.

Supported Platforms

Social Media

  • ✅ X/Twitter
  • ✅ Telegram
  • ✅ Bluesky
  • ✅ TikTok
  • ✅ YouTube
  • (✅️) Instagram: works for most content
  • ⏳ Facebook: works for videos but not yet for images
  • ❌ Threads: TBD
  • ❌ Reddit: TBD

Archiving Services

  • ❌ Perma.cc
  • ❌ Archive.today
  • ❌ Wayback Machine, Internet Archive (web.archive.org)
  • ❌ AwesomeScreenshot.com
  • ⏳ MediaVault (mvau.lt): works for images but not yet for videos
  • ❌ Ghost Archive (ghostarchive.org)
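
When processing a batch of URLs, it can be handy to check up front which of them fall under a platform listed above. The following is a minimal sketch under stated assumptions: it is not part of scrapeMM, the statuses are transcribed from the lists above, and the hostnames chosen for each platform as well as the names STATUS_BY_HOST and support_status are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Status per hostname, transcribed from the lists above.
# The hostname chosen for each platform is an assumption of this sketch.
STATUS_BY_HOST = {
    "x.com": "supported",
    "twitter.com": "supported",
    "t.me": "supported",
    "bsky.app": "supported",
    "tiktok.com": "supported",
    "youtube.com": "supported",
    "instagram.com": "partial",
    "facebook.com": "partial",
    "mvau.lt": "partial",
    "threads.net": "unsupported",
    "reddit.com": "unsupported",
    "perma.cc": "unsupported",
    "archive.today": "unsupported",
    "web.archive.org": "unsupported",
    "awesomescreenshot.com": "unsupported",
    "ghostarchive.org": "unsupported",
}


def support_status(url: str) -> str:
    """Return 'supported', 'partial', 'unsupported', or 'generic' for a URL."""
    host = (urlparse(url).hostname or "").lower()
    for domain, status in STATUS_BY_HOST.items():
        # Match the domain itself or any subdomain (e.g. www., m., old.)
        if host == domain or host.endswith("." + domain):
            return status
    # Not one of the listed platforms: treat it as an ordinary web page
    return "generic"


print(support_status("https://www.youtube.com/watch?v=abc"))
print(support_status("https://example.com"))
```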

Download files

Source Distribution

scrapemm-0.4.2.tar.gz (33.9 kB)


Built Distribution

scrapemm-0.4.2-py3-none-any.whl (40.5 kB)


File details

Details for the file scrapemm-0.4.2.tar.gz.

File metadata

  • Download URL: scrapemm-0.4.2.tar.gz
  • Size: 33.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.4.2.tar.gz
Algorithm     Hash digest
SHA256        5ef9fcc823f9d7a8cfe21a2def0a1c5f32ef1fc2ca464a31161bf40d4ea67e39
MD5           7af138c054a25d70a559a8d2394eb516
BLAKE2b-256   0669e4e8aa3f85c16d458650ea8279bee35a23ea31c2c00b47a265bd40c29b8c

File details

Details for the file scrapemm-0.4.2-py3-none-any.whl.

File metadata

  • Download URL: scrapemm-0.4.2-py3-none-any.whl
  • Size: 40.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.4.2-py3-none-any.whl
Algorithm     Hash digest
SHA256        0a9b751e05cf19e75a311c9f58328891f4cc9311e532d765e85c4d6a98a87f06
MD5           176e938d0a4528ca08987c5250c708d1
BLAKE2b-256   5e5a449f8b6a7846e2d107694c89dc93aab63d78e2e1ab41aea09a3acdd126bc
