Skip to main content

Python SDK for MarkUDown — AI data extraction infrastructure

Project description

MarkUDown Python SDK

Python SDK for MarkUDown — AI data extraction infrastructure that turns any website into structured, ready-to-use data.

Installation

pip install markudown

Quick Start

from markudown import MarkUDown

md = MarkUDown(api_key="your-api-key")

# Scrape a page
result = md.scrape("https://example.com")
print(result["markdown"])

# Search the web
results = md.search("best web scraping tools 2025", engine="google", limit=5)

# Extract structured data
data = md.extract(
    "https://shop.example.com/products",
    schema=[
        {"name": "title",  "type": "string", "active": True},
        {"name": "price",  "type": "float",  "active": True},
        {"name": "rating", "type": "float",  "active": True},
    ],
    extract_query="Extract all product listings with title, price and rating",
)

Get your API key at scrapetechnology.com/markudown. New accounts include 500 free credits.


Async Client

import asyncio
from markudown import AsyncMarkUDown

async def main():
    async with AsyncMarkUDown(api_key="your-api-key") as md:
        result = await md.scrape("https://example.com")
        print(result["markdown"])

asyncio.run(main())

Configuration

md = MarkUDown(
    api_key="your-api-key",
    base_url="https://api.scrapetechnology.com",  # default
    timeout=120,          # HTTP request timeout (seconds)
    poll_interval=2.0,    # seconds between status checks for async jobs
    poll_timeout=300.0,   # max seconds to wait for a job to complete
)

All async endpoints accept wait=False to return the job ID immediately instead of polling:

job = md.crawl("https://example.com", wait=False)
job_id = job["id"]

# Check later
status = md.get_job_status("crawl", job_id)

API Reference

scrape(url, **opts) → dict

Scrape a single URL. Returns result immediately.

Parameter Type Default Description
url str required URL to scrape
timeout int 60 Request timeout (seconds)
exclude_tags list[str] ["header","nav","footer"] HTML tags to strip
main_content bool True Extract only main content
include_links bool False Include hyperlinks in output
include_html bool False Include raw HTML in output
result = md.scrape(
    "https://example.com",
    main_content=True,
    include_links=True,
)
print(result["markdown"])

screenshot(url, **opts) → dict

Take a screenshot of a URL.

result = md.screenshot("https://example.com")
# result["image"] contains base64-encoded PNG

map(url, **opts) → dict

Discover all URLs on a website.

Parameter Type Default Description
url str required Root URL
allowed_words list[str] [] Only include URLs with these keywords
blocked_words list[str] [] Exclude URLs with these keywords
max_urls int 1000 Maximum URLs to return
wait bool True Poll until complete
result = md.map("https://example.com", max_urls=500)
print(result["data"]["urls"])

crawl(url, **opts) → dict

Crawl a website recursively and extract content from every page.

Parameter Type Default Description
url str required Starting URL
max_depth int 2 Maximum crawl depth
limit int 10 Maximum pages to crawl
timeout int 60 Per-page timeout
include_links bool False Include links in output
include_html bool False Include raw HTML
blocked_words list[str] [] Skip URLs with these words
include_only list[str] [] Only crawl URLs matching these patterns
main_content bool True Extract only main content
callback_url str None Webhook URL on completion
wait bool True Poll until complete
result = md.crawl(
    "https://docs.example.com",
    max_depth=3,
    limit=50,
    include_only=["*guide*", "*tutorial*"],
)
for page in result["data"]["pages"]:
    print(page["url"], page["markdown"][:200])

extract(url, schema, **opts) → dict

Schema-based structured data extraction with AI.

Parameter Type Default Description
url str required URL to extract from
schema list[dict] required Field definitions
extract_query str "" Natural language description
extraction_scope str None "whole_site", "category", "single_page"
extraction_target str None Category or search term
allowed_words list[str] [] Keywords in relevant URLs
blocked_words list[str] [] Keywords in irrelevant URLs
allowed_patterns list[str] [] URL path fragments to include
blocked_patterns list[str] [] URL path fragments to exclude
callback_url str None Webhook URL on completion
wait bool True Poll until complete

Schema field format: {"name": "field_name", "type": "string|float|integer|date|url", "active": True}

result = md.extract(
    "https://shop.example.com",
    schema=[
        {"name": "title",  "type": "string", "active": True},
        {"name": "price",  "type": "float",  "active": True},
        {"name": "in_stock", "type": "string", "active": True},
    ],
    extract_query="Product listings with title, price and availability",
    extraction_scope="whole_site",
)

create_schema(query) → dict

Generate an extraction schema automatically from a natural language description.

schema_data = md.create_schema(
    "Extract all products from https://shop.example.com with title, price, rating and availability"
)
# Returns: {"url": "...", "schema": [...], "extraction_scope": "...", ...}

prompt_extract(url, prompt, **opts) → dict

Extract data with a natural language prompt — no schema required.

result = md.prompt_extract(
    "https://example.com/team",
    prompt="List all team members with their name, title and LinkedIn URL",
)

batch_scrape(urls, **opts) → dict

Scrape multiple URLs in parallel.

result = md.batch_scrape([
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
])
for page in result["data"]["pages"]:
    print(page["url"])

search(query, **opts) → dict

Web search with optional scraping of result pages.

Parameter Type Default Description
query str required Search query
limit int 5 Number of results
scrape_results bool True Scrape each result page
lang str "en" Language code
country str "us" Country code
engine str "google" "google", "bing", "duckduckgo", "all"
timeout int 60 Timeout in seconds
callback_url str None Webhook URL on completion
wait bool True Poll until complete
result = md.search(
    "open source web scraping frameworks",
    engine="all",
    limit=10,
    scrape_results=True,
)

rss(url, **opts) → dict

Generate an RSS feed from any web page.

result = md.rss("https://blog.example.com", max_items=20, title="My Blog Feed")
print(result["data"]["feed_url"])

change_detection(url, **opts) → dict

Detect and diff content changes on a page.

result = md.change_detection("https://example.com/pricing", include_diff=True)
print(result["data"]["changed"])  # True/False
print(result["data"]["diff"])

deep_research(query, urls, **opts) → dict

Scrape multiple pages and synthesize findings with an LLM.

result = md.deep_research(
    query="What are the main pricing differences between these competitors?",
    urls=[
        "https://competitor-a.com/pricing",
        "https://competitor-b.com/pricing",
    ],
    max_tokens=4096,
)
print(result["data"]["synthesis"])

agent(url, prompt, **opts) → dict

AI autonomous navigation agent that clicks, scrolls and fills forms.

Parameter Type Default Description
url str required Starting URL
prompt str required Task description
max_steps int 10 Max actions
max_pages int 5 Max pages to visit
allow_navigation bool True Allow following links
include_screenshots bool False Capture screenshots
timeout int 120 Timeout in seconds
wait bool True Poll until complete
result = md.agent(
    "https://example.com",
    prompt="Find the contact email on this website",
    max_steps=15,
)
print(result["data"]["answer"])

smart_extract(url, goal, **opts) → dict

Guided extraction: analyzes the site structure, plans the interaction sequence, and returns complete data.

result = md.smart_extract(
    "https://example.com/products",
    goal="Extract all products with name, price, and description",
    output_format="json",
    hints=["Click 'Load more' button to get all items"],
)

rank(keyword, domain, **opts) → dict

Check where a domain ranks for a keyword in search results.

result = md.rank(
    keyword="web scraping api",
    domain="scrapetechnology.com",
    engine="google",
    country="us",
    depth=100,
)
print(result["data"]["position"])  # e.g. 4

dataset(url, goal, **opts) → dict

Auto-paginate a listing page and extract all items as a structured dataset.

result = md.dataset(
    "https://books.toscrape.com",
    goal="Extract all books with title, price, rating and availability",
    max_pages=10,
    output_format="json",
)
print(len(result["data"]["items"]))

Monitor

Create a persistent URL monitor that calls a webhook when content changes.

# Create a monitor
subscription = md.monitor_create(
    "https://example.com/pricing",
    callback_url="https://yourapp.com/webhook",
    interval_minutes=60,
)
sub_id = subscription["subscription_id"]

# List active monitors
monitors = md.monitor_list()

# Delete a monitor
md.monitor_delete(sub_id)

instagram(resource, target, **opts) → dict

Extract public Instagram data.

resource target Description
"profile" "username" Profile info + recent posts
"post" "shortcode" Single post by shortcode
"hashtag" "travel" Recent posts for a hashtag
"search" "coffee shops" Search results
# Profile
profile = md.instagram("profile", "instagram", limit=20)

# Hashtag
posts = md.instagram("hashtag", "webdev", limit=30)

# Search
results = md.instagram("search", "coffee shops NYC", limit=15)

x(resource, target, **opts) → dict

Extract public X (Twitter) data.

resource target Description
"profile" "username" Profile info + recent tweets
"post" "tweet_id" Single tweet by ID
"search" "AI agents" Search results
# Profile
profile = md.x("profile", "elonmusk", limit=20)

# Search
results = md.x("search", "web scraping tools", limit=25)

Job Management

# Get status of any async job
status = md.get_job_status("crawl", "job-id-here")
# status["status"] → "pending" | "processing" | "completed" | "failed"

# Cancel a running job
md.cancel_job("agent", "job-id-here")

Error Handling

from markudown import MarkUDown, MarkUDownError, RateLimitError, InsufficientCreditsError

md = MarkUDown(api_key="your-key")

try:
    result = md.scrape("https://example.com")
except RateLimitError:
    print("Rate limit hit — wait a minute and retry")
except InsufficientCreditsError:
    print("Out of credits — upgrade at scrapetechnology.com")
except MarkUDownError as e:
    print(f"API error {e.status_code}: {e}")

Context Manager

with MarkUDown(api_key="your-key") as md:
    result = md.scrape("https://example.com")
# HTTP connection is automatically closed

Links

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

markudown-0.1.0.tar.gz (11.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

markudown-0.1.0-py3-none-any.whl (11.5 kB view details)

Uploaded Python 3

File details

Details for the file markudown-0.1.0.tar.gz.

File metadata

  • Download URL: markudown-0.1.0.tar.gz
  • Upload date:
  • Size: 11.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for markudown-0.1.0.tar.gz
Algorithm Hash digest
SHA256 24bfb620ef93589323d2a6c3f6b1040edb94b007ca1d9f2a56952406cd4aa9e6
MD5 3718ba526fae0ece7481378147b60224
BLAKE2b-256 1cb51a4a49086b9a0e07462d6bb47b0c4c29217aab600b63cf08a0dc78438c9c

See more details on using hashes here.

File details

Details for the file markudown-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: markudown-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 11.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for markudown-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3ba713c8b8e018b55f19d797df0e095ea5aaa734cfa9e6d92f6844c0dd4726a5
MD5 53ef11c6b8661c7f0f605e78ccfbf7a0
BLAKE2b-256 a2af977b95330ef1b6264ebb453ebf64e3bf6047c72318b9ae8892a948904b95

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page