Flexible multimodal scraper for social media and the open web.

Project description

scrapeMM: Multimodal Web Retrieval

A simple web scraper that asynchronously retrieves webpages and social media content, fetching text along with media (images and videos).

This library aims to help developers and researchers easily access multimodal data from the web and use it for LLM processing.

Setup

  • If you want to download videos, installing ffmpeg is highly recommended. In Conda, you can install it with conda install -c conda-forge ffmpeg.
  • If you want to scrape Perma.cc archive records, install Playwright with pip install playwright and then run playwright install.
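
To confirm that the optional tools listed above are available before scraping, a small check like the following may help. This is a minimal sketch using only the Python standard library; the function name check_optional_dependencies is made up for illustration and is not part of scrapeMM. It only detects whether the ffmpeg binary is on PATH and whether the playwright package is importable; it does not verify that playwright install has downloaded the browsers.

```python
import importlib.util
import shutil


def check_optional_dependencies() -> dict[str, bool]:
    """Report which optional tools from the Setup list are available.

    Hypothetical helper for illustration; not part of scrapeMM.
    """
    return {
        # ffmpeg is an external binary, so look it up on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # playwright is a Python package, so check whether it is importable.
        "playwright": importlib.util.find_spec("playwright") is not None,
    }


if __name__ == "__main__":
    for name, available in check_optional_dependencies().items():
        print(f"{name}: {'found' if available else 'missing'}")
```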

Usage

import asyncio

from scrapemm import retrieve

url = "https://example.com"
# Assuming retrieve() is an async function, asyncio.run() creates and closes the event loop for us.
result = asyncio.run(retrieve(url))
result.render()

scrapeMM will ask you for the API keys needed for the social media integrations. You may skip them if you don't need them. You will also be prompted to choose a password that is used to secure the secrets in an encrypted file.
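
Because retrieval is asynchronous, several URLs can be fetched concurrently. The following is a minimal sketch of one way to do that, assuming retrieve is an async function that takes a single URL as in the usage example. The helper retrieve_all and its max_concurrency parameter are hypothetical and not part of scrapeMM; the retrieval function is passed in as an argument so the helper works with any coroutine function.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def retrieve_all(
    urls: Iterable[str],
    retrieve_fn: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 4,
) -> dict[str, Any]:
    """Run retrieve_fn over many URLs with bounded concurrency.

    Hypothetical helper, not part of scrapeMM. A failed retrieval is stored
    as its exception object instead of aborting the whole batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Any:
        async with semaphore:  # at most max_concurrency calls in flight
            return await retrieve_fn(url)

    url_list = list(urls)
    results = await asyncio.gather(
        *(fetch_one(u) for u in url_list), return_exceptions=True
    )
    return dict(zip(url_list, results))


# Possible usage with scrapeMM (assumes the package is installed):
#   from scrapemm import retrieve
#   results = asyncio.run(retrieve_all(["https://example.com"], retrieve))
```

Capping the number of in-flight requests is a design choice to stay polite towards the scraped services; adjust the limit to whatever your API quotas allow.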

How it works

Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence

The MultimodalSequence is a sequence of Markdown-formatted text and media provided by the ezMM library.

Web scraping is done with Firecrawl and Decodo.

Supported Platforms

Social Media

  • ✅ X/Twitter
  • ✅ Telegram
  • ✅ Bluesky
  • ✅ TikTok
  • ✅ YouTube
  • (✅️) Instagram: works for most content
  • ⏳ Facebook: works for videos but not for images yet
  • ❌ Threads: TBD
  • ❌ Reddit: TBD

Archiving Services

  • ❌ Perma.cc
  • ❌ Archive.today
  • ❌ Wayback Machine, Internet Archive (web.archive.org)
  • ❌ AwesomeScreenshot.com
  • ⏳ MediaVault (mvau.lt): Works for images but not for videos yet
  • ❌ Ghost Archive (ghostarchive.org)
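
To check ahead of time how well a given URL is likely to be handled, the lists above can be turned into a simple lookup. The sketch below is illustrative only: the function platform_status and the table PLATFORM_STATUS are hypothetical, the status labels are copied from the lists above, and the domain names (other than those given in the lists) are assumptions. It does not reflect how scrapeMM routes URLs internally.

```python
from urllib.parse import urlparse

# Status labels copied from the lists above; most domain names are assumptions.
PLATFORM_STATUS = {
    "x.com": "supported",
    "twitter.com": "supported",
    "t.me": "supported",
    "bsky.app": "supported",
    "tiktok.com": "supported",
    "youtube.com": "supported",
    "instagram.com": "mostly supported",
    "facebook.com": "partial (videos only)",
    "threads.net": "not supported",
    "reddit.com": "not supported",
    "perma.cc": "not supported",
    "web.archive.org": "not supported",
    "mvau.lt": "partial (images only)",
    "ghostarchive.org": "not supported",
}


def platform_status(url: str) -> str:
    """Return the listed support status for the platform hosting `url`."""
    host = (urlparse(url).hostname or "").lower()
    for domain, status in PLATFORM_STATUS.items():
        # Match the bare domain or any subdomain such as "www.".
        if host == domain or host.endswith("." + domain):
            return status
    return "unlisted (generic web page)"


if __name__ == "__main__":
    for example in ("https://www.youtube.com/watch?v=abc", "https://example.com"):
        print(example, "->", platform_status(example))
```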

Download files

Source Distribution

scrapemm-0.5.5.tar.gz (47.6 kB)

Uploaded Source

Built Distribution

scrapemm-0.5.5-py3-none-any.whl (57.6 kB)

Uploaded Python 3

File details

Details for the file scrapemm-0.5.5.tar.gz.

File metadata

  • Download URL: scrapemm-0.5.5.tar.gz
  • Upload date:
  • Size: 47.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.5.5.tar.gz
Algorithm Hash digest
SHA256 7e5775c37d583e4c5a1804a894aba58ab744a9f06fb8380663ae16ce468d72e6
MD5 e8c6d737c7871f5f0b2447fee1362ae8
BLAKE2b-256 4e71bec4e2a464cb292e8b1025873220b0fe7c07deea0b0123728337d813e2e5

File details

Details for the file scrapemm-0.5.5-py3-none-any.whl.

File metadata

  • Download URL: scrapemm-0.5.5-py3-none-any.whl
  • Upload date:
  • Size: 57.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.5.5-py3-none-any.whl
Algorithm Hash digest
SHA256 b07d1dda343b8805c752b2f03378aa764bffbc0bda98644593c9bf1bc9c7cee5
MD5 8350049c673ef4bd941c798ba2f8d7bf
BLAKE2b-256 2f5c56a18235f0220e8fdd630512e600987d1899ed1f6e7228d96d987359e928
