
LLM-friendly scraper for media and text from social media and the open web.


scrapeMM: Multimodal Web Retrieval

A simple web scraper that asynchronously retrieves webpages and social media content, fetching text along with media (images and videos).

This library aims to help developers and researchers access multimodal data from the web and use it for LLM processing.

Usage

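import asyncio

from scrapemm import retrieve

url = "https://example.com"
# retrieve() is awaited, so run it to completion on an event loop.
# asyncio.run() creates and closes the loop, replacing the deprecated
# get_event_loop() / run_until_complete() pattern.
result = asyncio.run(retrieve(url))
result.render()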

scrapeMM will ask you for the API keys needed for the social media integrations. You may skip any keys you don't need. You will also be prompted to choose a password, which is used to encrypt the file that stores these secrets.
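Since retrieve() is awaited, several URLs can in principle be fetched concurrently. The following is a minimal sketch of one way to do that, not part of scrapeMM itself: `retrieve_many`, `retrieve_fn`, and `max_concurrency` are hypothetical names introduced here, and the coroutine function is passed in as a parameter so the helper can be tried without the library or any API keys.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def retrieve_many(
    urls: Iterable[str],
    retrieve_fn: Callable[[str], Awaitable[Any]],  # assumption: e.g. scrapemm.retrieve
    max_concurrency: int = 5,                      # hypothetical knob, not a scrapeMM setting
) -> dict:
    """Fetch several URLs concurrently and map each URL to its result.

    A failed retrieval is stored as the exception object instead of
    aborting the whole batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Any:
        # The semaphore caps how many retrievals are in flight at once.
        async with semaphore:
            return await retrieve_fn(url)

    url_list = list(urls)
    results = await asyncio.gather(
        *(fetch_one(u) for u in url_list), return_exceptions=True
    )
    return dict(zip(url_list, results))
```

With scrapeMM installed, `asyncio.run(retrieve_many(urls, retrieve))` would pass the library's `retrieve` as `retrieve_fn`. Whether the remote services tolerate a given concurrency level is outside the scope of this sketch.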

How it works

Input:                                  Output:
URL (string)   -->   retrieve()   -->   MultimodalSequence

The MultimodalSequence is a sequence of Markdown-formatted text and media provided by the ezMM library.
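To picture what "a sequence of Markdown-formatted text and media" means, the sketch below splits a Markdown string into alternating text and image items. It is only an illustration of the idea: `TextItem`, `MediaItem`, and `split_markdown` are hypothetical stand-ins defined here, not ezMM's classes or scrapeMM's actual parsing logic.

```python
import re
from dataclasses import dataclass


# Hypothetical stand-ins (NOT ezMM's API): they only illustrate interleaving.
@dataclass
class TextItem:
    markdown: str


@dataclass
class MediaItem:
    kind: str
    url: str


# Matches Markdown image references of the form ![alt](url), without titles.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def split_markdown(markdown: str) -> list:
    """Split Markdown into an ordered list of text and image items."""
    items = []
    position = 0
    for match in IMAGE_PATTERN.finditer(markdown):
        text = markdown[position:match.start()].strip()
        if text:
            items.append(TextItem(text))
        items.append(MediaItem("image", match.group(1)))
        position = match.end()
    tail = markdown[position:].strip()
    if tail:
        items.append(TextItem(tail))
    return items
```

The order of the items follows the order in the source text, which is the property that makes such a sequence convenient to hand to a multimodal LLM.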

Web scraping is done with Firecrawl.

Supported Proprietary APIs

  • ✅ X/Twitter
  • ✅ Telegram
  • ✅ Bluesky
  • ✅ TikTok
  • ⏳ Threads
  • ⏳ Reddit
  • ⏳ Facebook
  • ⏳ Instagram

Download files


Source Distribution

scrapemm-0.2.2.tar.gz (28.8 kB)

Uploaded Source

Built Distribution


scrapemm-0.2.2-py3-none-any.whl (34.9 kB)

Uploaded Python 3

File details

Details for the file scrapemm-0.2.2.tar.gz.

File metadata

  • Download URL: scrapemm-0.2.2.tar.gz
  • Size: 28.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.2.2.tar.gz
Algorithm Hash digest
SHA256 9baaf03db0d481ea64e9a369fe8245b089ee33bc5d1417fbde4d7f1c46517c17
MD5 32275a8b3ad12b10e0142638d0efd4ca
BLAKE2b-256 8c7c7b743bd6f39d2086a77861f3e2c56ce6e17ab35d3b88baebfeadcdb96a34
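A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the Python standard library; `file_sha256` and `matches_sha256` are hypothetical helper names, and the local file path in the usage comment is an assumption.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return file_sha256(path) == expected_hex.strip().lower()


# Usage (assumes the archive was saved to the current directory):
# matches_sha256(
#     "scrapemm-0.2.2.tar.gz",
#     "9baaf03db0d481ea64e9a369fe8245b089ee33bc5d1417fbde4d7f1c46517c17",
# )
```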


File details

Details for the file scrapemm-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: scrapemm-0.2.2-py3-none-any.whl
  • Size: 34.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for scrapemm-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 0a4d83d8e043e779d844d9597be4fe2804977505d7468f3025744686519c95a9
MD5 953c0049e47bb39289df1f5867a0f9dc
BLAKE2b-256 b347b3011fff186109350951d507a8f3afea62dd541eec5f7032311328eb07b0

