Scrape learn.liferay.com/w/dxp into local Markdown docs (raw/{capability}/*.md) for the liferay-expert Claude Code skill.
liferay-docs-scraper
Scrape learn.liferay.com/w/dxp/* into a local Markdown copy of the docs, then answer
Liferay DXP questions in Claude Code by grepping and citing it — no
bundled docs, no embeddings, no vector DB.
What makes this different
- No bundled copyrighted content. This repo and the PyPI package ship
only the scraping tool, never Liferay's documentation text. Each user
scrapes their own local copy directly from learn.liferay.com. See
docs/adr/0001-crawl4ai-based-corpus-pipeline.md for the full reasoning on why that's the safer distribution model.
- No embeddings, no vector DB. Plain grep + Read over ~1,800 well-organized Markdown files is fast enough; the liferay-expert skill just searches those docs directly.
- One shared docs folder, not per-project. The scraper writes to a single
OS-appropriate per-user directory (resolved by
resolve_docs_dir()), so every project that installs the skill reads the same docs instead of duplicating a ~30-40 minute scrape.
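To make the "plain grep over Markdown files" point concrete, here is a minimal sketch of a case-insensitive search over one capability folder. The function name `grep_docs` is hypothetical and is not part of the package; the skill itself relies on Claude Code's own grep and Read tools rather than on code like this.

```python
from pathlib import Path


def grep_docs(docs_dir: Path, capability: str, term: str) -> list[tuple[Path, int, str]]:
    """Return (file, line number, line) for each case-insensitive match.

    Hypothetical helper for illustration only: it scans
    docs_dir/raw/{capability}/*.md, the layout described above.
    """
    needle = term.lower()
    hits = []
    for path in sorted((docs_dir / "raw" / capability).glob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if needle in line.lower():
                hits.append((path, number, line.strip()))
    return hits
```

A linear scan like this touches every file in the capability folder, which is acceptable at the scale of a couple of thousand small Markdown pages.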
How it works
- Scrape: uvx liferay-docs-scraper runs a crawl4ai (free, self-hosted, Playwright-based) BFS crawl of learn.liferay.com/w/dxp/* and writes clean Markdown to raw/{capability}/*.md, one file per page, across 14 Liferay DXP capabilities.
- Install: npx skills add mordonez/liferay-docs-scraper --skill liferay-expert drops the liferay-expert skill into any project's .claude/skills/.
- Ask: Claude Code greps the docs for the relevant capability, reads the matching page(s), and answers, always citing the source URL from that file's frontmatter.
Quickstart
The recommended order for a first-time setup: scrape, then install the skill, then ask questions.
1. Scrape the docs (one-time, ~30-40 min):
uvx --from crawl4ai crawl4ai-setup # one-time, installs Playwright browsers
uvx liferay-docs-scraper
Run this from anywhere: it does not write into your current directory. See "Reference: the scraper in detail" below for exactly where the docs go.
2. Install the skill into whatever project you're working in:
npx skills add mordonez/liferay-docs-scraper --skill liferay-expert -a claude-code
You'll see:
◇ Installed 1 skill ───────────────────╮
│ │
│ ✓ liferay-expert (copied) │
│ → ./.claude/skills/liferay-expert │
│ │
├───────────────────────────────────────╯
3. Ask Claude Code a Liferay question, e.g. "how do I configure a
synonym set in Liferay search?" The skill finds the docs, greps the
search capability, reads search-administration-and-tuning-synonym-sets.md,
and answers grounded in that page -- citing
https://learn.liferay.com/w/dxp/search/search-administration-and-tuning/synonym-sets
as the source.
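The example in step 3 suggests how a page URL relates to its file: the first path segment after /w/dxp/ is the capability folder, and the remaining segments are joined with hyphens to form the file name. The sketch below encodes that reading. It is inferred from this single example rather than from the scraper's code, `url_to_relative_path` is a hypothetical name, and capability landing pages (with no further segments) are deliberately left unhandled.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_relative_path(url: str) -> PurePosixPath:
    """Map a learn.liferay.com/w/dxp/... URL to raw/{capability}/{page}.md.

    Naming rule inferred from one documented example; treat it as a guess.
    """
    parts = [segment for segment in urlparse(url).path.split("/") if segment]
    if parts[:2] != ["w", "dxp"] or len(parts) < 4:
        raise ValueError(f"not a DXP page URL covered by this sketch: {url}")
    capability, *rest = parts[2:]
    return PurePosixPath("raw", capability, "-".join(rest) + ".md")
```

For the synonym sets URL above, this yields raw/search/search-administration-and-tuning-synonym-sets.md, matching the file name quoted in step 3.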
The docs are shared across every project where you install the skill (see "Reference: the scraper in detail" below for the default location), so step 1 is only needed once per machine. Rerun it later just to refresh, not per project.
If you install the skill without doing step 1 first (or the docs go
stale), it notices and tells you what to run rather than guessing or
answering ungrounded -- it never launches the ~30-40 min scrape on its own
mid-conversation. See "Step 1/2" in skills/liferay-expert/SKILL.md for
that check.
Reference: the scraper in detail
Requires Python 3.10-3.13 (crawl4ai's Playwright dependency doesn't yet support 3.14) and uv.
# One-time: installs the Playwright/Chromium browser crawl4ai drives
uvx --from crawl4ai crawl4ai-setup
# From anywhere -- the docs do NOT go in your current directory:
uvx liferay-docs-scraper
This takes roughly 30-40 minutes (a BFS deep crawl of ~1900 pages across 14
capabilities) and writes to ~/liferay-docs: one shared location, the
same on macOS, Linux, and Windows, so the skill finds the same docs no
matter which project you're in. Set LIFERAY_DOCS_DIR to
override (e.g. to keep a project-local copy instead).
Inside that directory:
- raw/{capability}/*.md: the docs, one file per page
- raw/_navigation/{capability}/*.md: pure TOC pages, kept but deprioritized
- raw/_removed/{capability}/*.md: pages confirmed gone from the live site
- reports/filtered/: URL manifests, self-hosted prune log, run summary
Re-run it anytime (weekly recommended) to refresh: it starts from zero every time, so it naturally picks up new pages, updates changed ones, and quarantines (never deletes) removed ones.
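The "quarantine, never delete" behavior can be pictured as a move into raw/_removed/. The sketch below assumes the caller already knows which pages are still live, passed in as `live_pages`. How the scraper actually confirms that a page is gone from the live site is not shown here, and `quarantine_removed` is a hypothetical name.

```python
import shutil
from pathlib import Path


def quarantine_removed(raw_dir: Path, live_pages: set[str]) -> list[str]:
    """Move pages missing from `live_pages` into raw/_removed/{capability}/.

    `live_pages` holds 'capability/file.md' strings (assumed input format).
    Returns the relative paths that were moved.
    """
    moved = []
    # sorted() materializes the listing before any file is moved
    for path in sorted(raw_dir.glob("*/*.md")):
        capability = path.parent.name
        if capability.startswith("_"):
            continue  # leave _navigation and _removed alone
        relative = f"{capability}/{path.name}"
        if relative in live_pages:
            continue
        target = raw_dir / "_removed" / capability / path.name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target))
        moved.append(relative)
    return moved
```

Moving instead of deleting keeps the old text available in case a page only disappeared temporarily or was renamed.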
Optional, for extra safety: the scraper can occasionally fetch a page
successfully but get the wrong content (e.g. a different page's text, or
content cut off mid-render) -- rare, but it's happened. There's no way to
catch that from a single fetch alone; it can only be caught by comparing
against a previous known-good copy. If you git init the ~/liferay-docs
directory yourself (purely as a personal versioning tool -- nothing needs
pushing anywhere), each run automatically diffs against the last commit
and flags any page that shrank by more than half or grew more than 3x. Skip
this entirely if you don't want to bother with it: without git there, this
step silently does nothing. See
docs/adr/0001-crawl4ai-based-corpus-pipeline.md for the full story.
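The size rule above (flag a page that shrank by more than half or grew more than 3x) is simple enough to state as code. The sketch below applies it to two hypothetical size maps, one from the last known-good copy and one from the current run. Reading those sizes out of git is omitted, and the function names are illustrative rather than the scraper's own.

```python
def is_suspicious_size_change(old_size: int, new_size: int) -> bool:
    """True if the page shrank by more than half or grew more than 3x."""
    if old_size <= 0:
        return False  # no usable baseline to compare against
    return new_size < old_size / 2 or new_size > old_size * 3


def flag_pages(old_sizes: dict[str, int], new_sizes: dict[str, int]) -> list[str]:
    """Return pages present in both maps whose size changed suspiciously."""
    return sorted(
        page
        for page, new_size in new_sizes.items()
        if page in old_sizes and is_suspicious_size_change(old_sizes[page], new_size)
    )
```

Pages that exist on only one side are ignored here, since new and removed pages are expected between runs and are not evidence of a bad fetch.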
Reference: the skill in detail
npx skills add mordonez/liferay-docs-scraper --skill liferay-expert
Or just copy skills/liferay-expert/SKILL.md into .claude/skills/liferay-expert/
in any project. Claude Code picks it up automatically; the skill itself
resolves $LIFERAY_DOCS_DIR (or the OS default above) to find the docs,
so it works the same regardless of which project you installed it into.
License
MIT — applies to this tool and skill only, not to the Liferay documentation text it helps you fetch (that stays Liferay's, and each user scrapes their own local copy directly from learn.liferay.com).
File details
Details for the file liferay_docs_scraper-0.2.0.tar.gz.
File metadata
- Download URL: liferay_docs_scraper-0.2.0.tar.gz
- Upload date:
- Size: 31.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ffab43dbcc7b160905e82095b054742bc7e6a7aa736e7ee3876a1b8236b9db20 |
| MD5 | 231aa9bea39437374afb289c3d5e5f01 |
| BLAKE2b-256 | 5088b04e3b2acaca365b0d111bda69f8ad80b0d5239a0161593cc8961f0c14e3 |
File details
Details for the file liferay_docs_scraper-0.2.0-py3-none-any.whl.
File metadata
- Download URL: liferay_docs_scraper-0.2.0-py3-none-any.whl
- Upload date:
- Size: 17.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b87e04d10d9d4bb1d425547bc8d8d577d1c12d50efc9d974eb03350405fc049b |
| MD5 | 93eac0a2ba4dd29c134c0bc5591ff911 |
| BLAKE2b-256 | 2f0e9c95bc3829b62e3ab9e3c3a63e6630b1b6bd0606873a5ba86a5ca8107abf |