
Flexible multimodal scraper for social media and the open web.

scrapeMM: Multimodal Web Retrieval

A simple web scraper that asynchronously retrieves webpages and social media content, fetching text along with media (images and videos).

This library aims to help developers and researchers easily access multimodal data from the web and use it for LLM processing.

Setup

  • If you want to download videos, installing ffmpeg is highly recommended. In Conda, you can install it with conda install -c conda-forge ffmpeg.
  • If you want to scrape Perma.cc archive records or Facebook photos, install Playwright with pip install playwright and then run playwright install.
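
To see which of the optional tools from the setup notes are present before scraping, a quick check along these lines may help. This is a minimal sketch and not part of scrapeMM: the function name check_optional_dependencies is hypothetical, and the check only tells you whether the ffmpeg binary is on PATH and whether the playwright package is importable. It does not verify that playwright install has downloaded the browser binaries.

```python
import importlib.util
import shutil


def check_optional_dependencies() -> dict[str, bool]:
    """Report which optional tools from the setup notes appear to be available.

    Hypothetical helper for illustration; scrapeMM does not ship this function.
    """
    return {
        # ffmpeg is a command-line binary, so look it up on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # playwright is a Python package, so check whether it can be imported.
        "playwright": importlib.util.find_spec("playwright") is not None,
    }


if __name__ == "__main__":
    for name, available in check_optional_dependencies().items():
        print(f"{name}: {'found' if available else 'missing'}")
```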

Usage

from scrapemm import retrieve
import asyncio

if __name__ == "__main__":
    url = "https://www.snopes.com/fact-check/gauze-originate-from-gaza/"
    result = asyncio.run(retrieve(url))
    if result.errors:
        print(result.errors)
    else:
        print(result.content)
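
Because retrieve is a coroutine function, several URLs can be fetched concurrently with plain asyncio. The following is a minimal sketch under stated assumptions: retrieve_many and max_concurrent are hypothetical names introduced here, and the helper takes the retrieval coroutine as a parameter so that it makes no assumptions about scrapeMM beyond what the usage example above shows.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def retrieve_many(
    urls: list[str],
    retrieve_fn: Callable[[str], Awaitable[Any]],
    max_concurrent: int = 5,
) -> dict[str, Any]:
    """Call retrieve_fn on every URL, running at most max_concurrent at a time.

    Hypothetical helper for illustration; it is not part of scrapeMM.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(url: str) -> Any:
        # The semaphore caps how many retrievals are in flight at once.
        async with semaphore:
            return await retrieve_fn(url)

    # return_exceptions=True keeps one failing URL from aborting the others;
    # a failed URL maps to its exception object instead of a result.
    results = await asyncio.gather(
        *(fetch_one(url) for url in urls), return_exceptions=True
    )
    return dict(zip(urls, results))


# Usage with scrapeMM (assumes the package is installed and configured):
#
#   from scrapemm import retrieve
#   results = asyncio.run(retrieve_many([url_a, url_b], retrieve))
```

The cap on concurrent requests is a design choice to stay gentle on the scraped sites and on any rate-limited integrations; adjust it to your setup.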

scrapeMM will ask you for the API secrets needed for the integrations. You may skip them if you don't need them.

You will also be prompted to choose a password that is used to secure the secrets in an encrypted file.

How it works

Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence

The MultimodalSequence is a sequence of Markdown-formatted text and media provided by the ezMM library.

Web scraping is done with Firecrawl and Decodo.

Supported Platforms

Social Media

  • ✅ X/Twitter
  • ✅ Telegram
  • ✅ Bluesky
  • ✅ TikTok
  • ✅ YouTube
  • (✅️) Instagram: works for most content
  • ✅️ Facebook
  • ❌ Threads: TBD
  • ❌ Reddit: TBD

Archiving Services

  • ✅ Perma.cc
  • (✅) Archive.today: generally slow and sometimes fails with a TimeoutError
  • ✅ MediaVault (mvau.lt)
  • ❌ Wayback Machine, Internet Archive (web.archive.org)
  • ❌ AwesomeScreenshot.com
  • ❌ Ghost Archive (ghostarchive.org)
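
When processing a mixed batch of links, it can be handy to look up a URL's support status before calling the scraper. The sketch below encodes the two lists above as a lookup table. It is an illustration only: support_status and SUPPORT_BY_HOST are hypothetical names, the hostnames chosen for each platform are assumptions (only mvau.lt, web.archive.org, and ghostarchive.org appear in the lists), and scrapeMM's own URL matching may differ.

```python
from urllib.parse import urlparse

# Support status copied from the lists above. The hostnames are assumptions
# made for this sketch and are not taken from scrapeMM's source code.
SUPPORT_BY_HOST = {
    "x.com": "supported",
    "twitter.com": "supported",
    "t.me": "supported",
    "bsky.app": "supported",
    "tiktok.com": "supported",
    "youtube.com": "supported",
    "instagram.com": "partial",
    "facebook.com": "supported",
    "threads.net": "unsupported",
    "reddit.com": "unsupported",
    "perma.cc": "supported",
    "archive.today": "partial",
    "mvau.lt": "supported",
    "web.archive.org": "unsupported",
    "awesomescreenshot.com": "unsupported",
    "ghostarchive.org": "unsupported",
}


def support_status(url: str) -> str:
    """Return 'supported', 'partial', 'unsupported', or 'unlisted' for a URL."""
    host = (urlparse(url).hostname or "").lower()
    for known_host, status in SUPPORT_BY_HOST.items():
        # Match the listed host itself or any of its subdomains (e.g. www.).
        if host == known_host or host.endswith("." + known_host):
            return status
    # Hosts that appear in neither list, such as ordinary websites.
    return "unlisted"
```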
