Scrape learn.liferay.com/w/dxp into a local Markdown corpus (raw/{capability}/*.md) for the liferay-expert Claude Code skill.

liferay-docs-scraper

Scrapes learn.liferay.com/w/dxp/* into a local, clean Markdown corpus (raw/{capability}/*.md) and ships a Claude Code skill (liferay-expert) that answers Liferay DXP questions by searching and citing that corpus.

This repo does not ship Liferay's documentation. It ships the code that scrapes it, and a skill that reads whatever you scrape locally. Each user builds and refreshes their own copy directly from learn.liferay.com.

Quickstart

The recommended order for a first-time setup: scrape, then install the skill, then ask questions.

1. Build the corpus (one-time, ~30-40 min):

uvx --from crawl4ai crawl4ai-setup   # one-time, installs Playwright browsers
uvx --python 3.13 --from "git+https://github.com/mordonez/liferay-docs-scraper" liferay-docs-scraper

Run this from anywhere: it does not write into your current directory. See "Reference: the scraper in detail" below for exactly where the corpus goes.

2. Install the skill into whatever project you're working in:

npx skills add mordonez/liferay-docs-scraper --skill liferay-expert -a claude-code

You'll see:

◇  Installed 1 skill ───────────────────╮
│                                       │
│  ✓ liferay-expert (copied)            │
│    → ./.claude/skills/liferay-expert  │
│                                       │
├───────────────────────────────────────╯

3. Ask Claude Code a Liferay question, e.g. "how do I configure a synonym set in Liferay search?" The skill finds the corpus, greps the search capability, reads search-administration-and-tuning-synonym-sets.md, and answers grounded in that page -- citing https://learn.liferay.com/w/dxp/search/search-administration-and-tuning/synonym-sets as the source.
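The example in step 3 suggests how a page URL relates to its corpus file: the first path segment after /w/dxp/ is the capability, and the remaining segments are joined with hyphens. A minimal sketch of that mapping follows. The rule is inferred from that single example and is an assumption, not the scraper's confirmed naming logic; `corpus_path_for` is a hypothetical helper name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

DOCS_PREFIX = "/w/dxp/"


def corpus_path_for(url: str) -> PurePosixPath:
    """Guess the corpus-relative file for a learn.liferay.com/w/dxp page URL.

    Assumption: capability = first segment, file name = remaining
    segments joined with "-". Inferred from one example only.
    """
    path = urlparse(url).path
    if not path.startswith(DOCS_PREFIX):
        raise ValueError(f"not a /w/dxp page: {url}")
    segments = [s for s in path[len(DOCS_PREFIX):].split("/") if s]
    if len(segments) < 2:
        # Capability landing pages are not covered by this sketch.
        raise ValueError(f"expected capability plus page path: {url}")
    capability, rest = segments[0], segments[1:]
    return PurePosixPath("raw", capability, "-".join(rest) + ".md")


if __name__ == "__main__":
    print(corpus_path_for(
        "https://learn.liferay.com/w/dxp/search/"
        "search-administration-and-tuning/synonym-sets"
    ))
```

The reverse direction (file name back to URL) is ambiguous under this rule because page slugs themselves contain hyphens, so a citation presumably needs the URL recorded somewhere rather than derived from the file name.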

The corpus is shared across every project where you install the skill (see "OS default location" below), so step 1 is only ever needed once per machine -- rerun it later just to refresh, not per-project.

If you install the skill without doing step 1 first (or its corpus goes stale), it notices and tells you what to run rather than guessing or answering ungrounded -- it never launches the ~30-40 min scrape on its own mid-conversation. See "Step 1/2" in skills/liferay-expert/SKILL.md for that check.

Reference: the scraper in detail

Requires Python 3.10-3.13 (crawl4ai's Playwright dependency doesn't yet support 3.14) and uv.

# One-time: installs the Playwright/Chromium browser crawl4ai drives
uvx --from crawl4ai crawl4ai-setup

# From anywhere -- the corpus does NOT go in your current directory.
# Not on PyPI yet, so install straight from GitHub:
uvx --python 3.13 --from "git+https://github.com/mordonez/liferay-docs-scraper" liferay-docs-scraper

This takes roughly 30-40 minutes (BFS deep crawl of ~1900 pages across 14 capabilities) and writes to one shared, per-user location (so it's the same corpus no matter which project you're in when the skill looks for it):

OS       Default location
macOS    ~/Library/Application Support/liferay-docs/
Linux    ~/.local/share/liferay-docs/ (or $XDG_DATA_HOME/liferay-docs)
Windows  %LOCALAPPDATA%\liferay-docs\

Set LIFERAY_DOCS_DIR to override (e.g. to keep a project-local copy instead).
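The lookup order described above (override first, then the per-OS default) can be sketched as follows. `resolve_corpus_dir` is a hypothetical name, and the fallback used when LOCALAPPDATA is unset is an assumption; the project's own resolution code may differ.

```python
import os
import sys
from collections.abc import Mapping
from pathlib import Path


def resolve_corpus_dir(env: Mapping[str, str], platform: str,
                       home: Path) -> Path:
    """Pick the corpus directory: LIFERAY_DOCS_DIR wins, else OS default."""
    override = env.get("LIFERAY_DOCS_DIR")
    if override:
        return Path(override)
    if platform == "darwin":
        return home / "Library" / "Application Support" / "liferay-docs"
    if platform == "win32":
        local = env.get("LOCALAPPDATA")
        # Assumption: fall back to the conventional location if unset.
        base = Path(local) if local else home / "AppData" / "Local"
        return base / "liferay-docs"
    # Linux and other Unix-likes: honor XDG_DATA_HOME when it is set.
    xdg = env.get("XDG_DATA_HOME")
    base = Path(xdg) if xdg else home / ".local" / "share"
    return base / "liferay-docs"


if __name__ == "__main__":
    print(resolve_corpus_dir(os.environ, sys.platform, Path.home()))
```

Passing the environment, platform, and home directory as arguments keeps the function easy to test for each OS without changing the real environment.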

Inside that directory:

  • raw/{capability}/*.md — the corpus, one file per page
  • raw/_navigation/{capability}/*.md — pure TOC pages, kept but deprioritized
  • raw/_removed/{capability}/*.md — pages confirmed gone from the live site
  • reports/filtered/ — URL manifests, self-hosted prune log, run summary
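To sanity-check a finished run against this layout, a short script can count pages per capability. This is a sketch under the layout listed above; `count_pages` is a hypothetical helper, not part of the scraper.

```python
from pathlib import Path


def count_pages(corpus_dir: Path) -> dict[str, int]:
    """Count raw/{capability}/*.md files, one entry per capability."""
    counts: dict[str, int] = {}
    raw = corpus_dir / "raw"
    if not raw.is_dir():
        return counts
    for capability in sorted(raw.iterdir()):
        # _navigation and _removed are bookkeeping folders, not capabilities.
        if capability.is_dir() and not capability.name.startswith("_"):
            counts[capability.name] = sum(1 for _ in capability.glob("*.md"))
    return counts
```

A capability whose count is zero, or far below the previous run, is a hint that part of the crawl failed.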

Re-run it anytime (weekly recommended) to refresh. Every run crawls from scratch, so it picks up new pages, updates changed ones, and quarantines (never deletes) removed ones. If that directory is, or later becomes, a git repo, the scraper also runs check-regressions automatically after the crawl and flags any file that shrank by more than half or grew by more than 3x versus the last commit, both signals of a broken fetch. Making it a git repo is worth doing once, purely as a local diffing tool; nothing needs pushing anywhere. See docs/adr/0001-crawl4ai-based-corpus-pipeline.md for why that check exists.
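The size thresholds can be illustrated with a small sketch. It assumes sizes are compared in bytes and that the previous sizes have already been read from the last commit; the function names are hypothetical and this is not the check-regressions implementation.

```python
def is_suspicious(old_size: int, new_size: int) -> bool:
    """True if a file shrank by more than half or grew by more than 3x."""
    if old_size == 0:
        return False  # no meaningful baseline to compare against
    return new_size < old_size / 2 or new_size > old_size * 3


def flag_regressions(old: dict[str, int], new: dict[str, int]) -> list[str]:
    """Return paths present in both snapshots whose size change is suspicious."""
    shared = old.keys() & new.keys()
    return sorted(p for p in shared if is_suspicious(old[p], new[p]))
```

For example, a page that drops from 1000 bytes to 400 is flagged, while one that drops to exactly 500 is not, since it did not shrink by more than half.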

Reference: the skill in detail

npx skills add mordonez/liferay-docs-scraper --skill liferay-expert

Or just copy skills/liferay-expert/SKILL.md into .claude/skills/liferay-expert/ in any project. Claude Code picks it up automatically; the skill itself resolves $LIFERAY_DOCS_DIR (or the OS default above) to find the corpus, so it works the same regardless of which project you installed it into.

Why no bundled docs, no embeddings, no vector DB

See docs/adr/ for the full reasoning. Short version: the corpus is Liferay's copyrighted documentation text -- distributing the tool that scrapes public pages is a different, much lower-risk thing than a third party redistributing that text at scale. Plain grep + Read over ~1800 well-organized Markdown files is fast enough that no search index is needed; add one later if that stops being true.
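To give a sense of what "plain grep" over the corpus amounts to, here is a minimal Python equivalent that scans one capability folder for a pattern. It is only an illustration; the skill itself uses Claude Code's own grep and Read tools, and `grep_corpus` is a hypothetical name.

```python
import re
from pathlib import Path


def grep_corpus(raw_dir: Path, capability: str, pattern: str,
                max_hits: int = 20) -> list[tuple[str, int, str]]:
    """Return (file name, line number, line) for case-insensitive matches."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits: list[tuple[str, int, str]] = []
    for md in sorted((raw_dir / capability).glob("*.md")):
        text = md.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((md.name, lineno, line.strip()))
                if len(hits) >= max_hits:
                    return hits  # stop early once enough matches are found
    return hits
```

A linear scan like this reads every file in the folder on each query, which is acceptable at a couple of thousand small Markdown files and is the trade-off the paragraph above describes.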
