Scrape learn.liferay.com/w/dxp into local Markdown docs (raw/{capability}/*.md) for the liferay-expert Claude Code skill.
liferay-docs-scraper
Scrape learn.liferay.com/w/dxp/* into local Markdown, then let Claude Code
answer Liferay DXP questions by searching those files. No bundled Liferay
content, no embeddings, no vector DB.
Quickstart
From zero to asking Liferay questions in Claude Code:
# 1. One-time browser setup for crawl4ai/Playwright
uvx --from crawl4ai crawl4ai-setup
# 2. Scrape the official Liferay DXP docs (~30-40 min)
uvx liferay-docs-scraper
# 3. Install the Claude Code skill in your project
npx skills add mordonez/liferay-docs-scraper --skill liferay-expert -a claude-code
# 4. Check that docs and skill are ready
uvx --from liferay-docs-scraper liferay-docs-scraper-doctor
Then ask Claude Code something like:
How do I configure synonym sets in Liferay Search?
The skill searches the local Markdown, reads the matching page, and cites the source URL from that file's frontmatter.
Keep -a claude-code in the install command. It avoids interactive installer
edge cases where the skill can appear installed but not land in
.claude/skills/.
What this repo does not do
- It does not ship Liferay documentation text. The package contains the scraper
and skill only; each user fetches their own local copy from
learn.liferay.com.
- It does not use embeddings, RAG infrastructure, or a vector database. The skill uses normal file search and reads Markdown directly.
- It does not scrape automatically from the skill. If docs are missing, the skill tells you what command to run instead of starting a long crawl in the middle of a conversation.
Requirements
- Python 3.10-3.13
- uv
- Node/npm for npx skills add
crawl4ai drives Playwright, so run this once before the first scrape:
uvx --from crawl4ai crawl4ai-setup
Scraper Reference
Run the official-docs scraper:
uvx liferay-docs-scraper
It crawls https://learn.liferay.com/w/dxp/index with crawl4ai's BFS crawler,
keeps URLs under /w/dxp/*, extracts .learn-article-content, classifies each
page into one of 14 Liferay capabilities, and writes Markdown to one shared
docs directory.
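The "keeps URLs under /w/dxp/*" rule can be pictured as a small predicate. This is a minimal sketch, not the scraper's actual code: the names `in_scope`, `ALLOWED_HOST`, and `ALLOWED_PREFIX` are hypothetical, and the real crawl may apply its filter through crawl4ai rather than by hand.

```python
from urllib.parse import urlparse

# Hypothetical constants for illustration; the scraper's own names may differ.
ALLOWED_HOST = "learn.liferay.com"
ALLOWED_PREFIX = "/w/dxp/"


def in_scope(url: str) -> bool:
    """Return True when url points at a page under learn.liferay.com/w/dxp/."""
    parts = urlparse(url)
    return parts.netloc == ALLOWED_HOST and parts.path.startswith(ALLOWED_PREFIX)
```

Under these assumptions, `https://learn.liferay.com/w/dxp/index` is in scope, while a `kb-article` URL on the same host or a `/w/dxp/` path on another host is not.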
Default docs directory:
~/.liferay-docs
Override it when needed:
export LIFERAY_DOCS_DIR="$PWD/.liferay-docs"
uvx liferay-docs-scraper
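The lookup order (environment variable first, home directory otherwise) can be sketched in a few lines of Python. The function name `resolve_docs_dir` is hypothetical, and treating an empty `LIFERAY_DOCS_DIR` as unset is an assumption, not documented behaviour.

```python
import os
from pathlib import Path


def resolve_docs_dir(env=None) -> Path:
    """Pick the docs directory: $LIFERAY_DOCS_DIR if set, else ~/.liferay-docs."""
    env = os.environ if env is None else env
    override = env.get("LIFERAY_DOCS_DIR")
    if override:  # assumption: an empty value counts as "not set"
        return Path(override).expanduser()
    return Path.home() / ".liferay-docs"
```

Passing a plain dict as `env` makes the rule easy to check without touching the real environment.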
Directory layout:
~/.liferay-docs/
raw/{capability}/*.md
raw/_navigation/{capability}/*.md
raw/_removed/{capability}/*.md
reports/filtered/
Useful commands:
# Smaller smoke run
uvx liferay-docs-scraper --max-pages 200
# Check local docs and current-project skill installation
uvx --from liferay-docs-scraper liferay-docs-scraper-doctor
The scraper writes files atomically, retries page fetches through crawl4ai,
uses bounded concurrency, and exits non-zero if page fetches or the crawl stream
fail. If the crawl is interrupted, already written pages remain usable, but the
run is marked failed and orphan quarantine is skipped so a partial crawl cannot
move good pages to raw/_removed/.
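Atomic writes are what keep already written pages usable after an interruption. One common way to get that property, shown here as a sketch and not as the scraper's actual implementation, is to write to a temporary file in the same directory and then rename it over the target. The helper name `write_atomic` is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def write_atomic(path: Path, text: str) -> None:
    """Write text to path so readers see either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_name, path)  # replaces the target in a single step
    except BaseException:
        # On failure or interruption, drop the temp file and leave the target untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

With this pattern an interrupted run can leave a page missing or stale, but not half-written.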
Community Articles
Optional, larger, and lower-authority:
uvx --from liferay-docs-scraper liferay-docs-scraper-community
This fetches Liferay community How-To and Troubleshooting articles from
learn.liferay.com/kb-article/*. It writes them separately:
raw/community-howto/{capability}/*.md
raw/community-troubleshooting/{capability}/*.md
Many community articles are not tagged with a capability by the site; those go
to _uncategorized/. The liferay-expert skill treats community content as a
secondary source and says so in answers.
Useful options:
uvx --from liferay-docs-scraper liferay-docs-scraper-community --resource-type howto
uvx --from liferay-docs-scraper liferay-docs-scraper-community --limit 100
Skill Reference
Install into the current project:
npx skills add mordonez/liferay-docs-scraper --skill liferay-expert -a claude-code
Manual install also works: copy skills/liferay-expert/SKILL.md to:
.claude/skills/liferay-expert/SKILL.md
The skill resolves docs exactly like the scraper:
- $LIFERAY_DOCS_DIR, if set.
- ~/.liferay-docs, otherwise.
When answering, it searches raw/{capability}/*.md, reads the best matching
files, and cites their url: frontmatter. It skips raw/_navigation/ unless
there is no better source.
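The search-then-cite flow can be approximated outside Claude Code with plain file search. The sketch below is illustrative only: the skill itself is a SKILL.md prompt, not this code, and the helper names `read_source_url` and `search_docs` are hypothetical. It also assumes the frontmatter is a leading block delimited by `---` lines with a `url:` key.

```python
from pathlib import Path


def read_source_url(markdown: str):
    """Return the url: value from a leading '---' frontmatter block, or None."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "url":
            return value.strip().strip("'\"")
    return None


def search_docs(docs_dir: Path, query: str):
    """Return (path, source_url) for each raw/{capability}/*.md containing query."""
    needle = query.lower()
    hits = []
    # The glob is one directory deep, so files nested under raw/_navigation/ are not matched.
    for path in sorted((docs_dir / "raw").glob("*/*.md")):
        if path.parent.name.startswith("_"):  # extra guard for underscore directories
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            hits.append((path, read_source_url(text)))
    return hits
```

A case-insensitive substring match is a deliberately crude stand-in for "best matching"; it only shows that a local search plus the `url:` field is enough to produce a cited answer.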
Development
uv sync --group dev
uv run ruff check .
uv run pytest
uv build
CI runs lint, tests, and package build on Python 3.10, 3.11, 3.12, and 3.13.
It does not run a real scrape. Release publishing is documented in
docs/release.md.
License
MIT applies to this tool and skill only. Liferay documentation content remains Liferay's content and is fetched locally by each user.