| name | description |
|---|---|
| download-docs | Given a GitHub/GitLab repo, discover the documentation directory and recursively download all doc files into the artifacts directory, preserving the original directory structure and writing a .meta.json sidecar next to each file. |
download-docs Skill
Overview
This skill downloads an entire documentation tree from a GitHub repository
into the mcp-forge artifact directory. Each file gets a .meta.json sidecar with
provenance metadata. GitLab repositories are not supported yet.
Constraints:
- mcp-forge is air-gapped. All HTTP goes through injected MCP tools (fetch_raw).
- Inject tools with their bare name: fetch_raw (not searxng_fetch_raw).
- All injected tool calls use keyword-only arguments.
Process
Step 1 — Parse user request
Extract owner, repo, and optionally branch from the user's request.
- If branch is not given, try main then master (check which exists via the GitHub API).
- Canonical form: owner/repo or https://github.com/owner/repo.
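The parsing rule above can be sketched as a small helper. This is a minimal sketch, not part of the skill itself: the names `RepoRef` and `parse_repo` are hypothetical, and the handling of a `/tree/<branch>` suffix is an assumption based on how GitHub web URLs are usually shaped.

```python
import re
from typing import NamedTuple, Optional


class RepoRef(NamedTuple):
    owner: str
    repo: str
    branch: Optional[str]  # None means "try main, then master"


# Hypothetical pattern: optional scheme and github.com prefix, then owner/repo,
# then an optional /tree/<branch> suffix as seen in GitHub web URLs.
_REPO_RE = re.compile(
    r"^(?:https?://)?(?:www\.)?(?:github\.com/)?"
    r"(?P<owner>[\w.-]+)/(?P<repo>[\w.-]+?)(?:\.git)?"
    r"(?:/tree/(?P<branch>[^\s?#]+?))?/?$"
)


def parse_repo(text: str) -> RepoRef:
    """Extract owner, repo and optional branch from 'owner/repo' or a GitHub URL."""
    match = _REPO_RE.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse repository reference: {text!r}")
    return RepoRef(match["owner"], match["repo"], match["branch"])
```

For example, `parse_repo("tiangolo/fastapi")` leaves `branch` as `None`, which signals that the main/master fallback should run.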
Step 2 — Discover documentation directory
Try each candidate path against the GitHub Contents API until one returns a non-empty list of entries:
DOC_LOCATIONS = ["docs", "doc", "documentation", "guide", "guides",
"content", "pages", "site", "wiki"]
API endpoint:
https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={branch}
A 200 response with a JSON list (not a dict with message) means the
path exists and is a directory. Use the first match.
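The discovery loop can be sketched with the fetcher passed in as a parameter, which makes it easy to try against a stub. This is a sketch under assumptions: the response is assumed to be a dict with `status_code` and `content` keys (the shape the script in this document relies on), and `is_directory_listing` and `discover_doc_root` are hypothetical names.

```python
import json
from typing import Callable, Optional, Sequence

DOC_LOCATIONS = ["docs", "doc", "documentation", "guide", "guides",
                 "content", "pages", "site", "wiki"]


def is_directory_listing(response: dict) -> bool:
    """True when the response body is a non-empty JSON list (a directory)."""
    if response.get("status_code", 200) >= 400:
        return False
    try:
        data = json.loads(response["content"])
    except (KeyError, TypeError, ValueError):
        return False
    # A dict (for example {"message": "Not Found"} or a single file) is not a directory.
    return isinstance(data, list) and len(data) > 0


def discover_doc_root(fetch: Callable[..., dict], owner: str, repo: str,
                      branch: str,
                      candidates: Sequence[str] = DOC_LOCATIONS) -> Optional[str]:
    """Return the first candidate path that lists as a directory, else None."""
    for path in candidates:
        url = (f"https://api.github.com/repos/{owner}/{repo}"
               f"/contents/{path}?ref={branch}")
        if is_directory_listing(fetch(url=url)):  # keyword-only call
            return path
    return None
```

Inside mcp-forge the injected `fetch_raw` would be passed as `fetch`; outside it, any function accepting a `url` keyword works.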
Step 3 — CI pipeline branch hint (optional, best-effort)
Before Step 2 (or if Step 2 finds nothing), scan CI/config files for a branch override:
CI_FILES = [
".github/workflows/docs.yml",
".github/workflows/ci.yml",
".github/workflows/deploy.yml",
".gitlab-ci.yml",
"mkdocs.yml",
"readthedocs.yml",
".readthedocs.yaml",
]
Fetch each via raw URL:
https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{ci_file}
Scan content for keywords like ref:, branch:, gh-pages, checkout,
working-directory.
If a specific docs branch is found, update BRANCH and re-run Step 2.
If a working-directory: line is found (e.g. working-directory: ./www),
extract that path and prepend it to DOC_LOCATIONS so it is tried first.
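A minimal sketch of the working-directory rule described above. The helper names `find_working_directory` and `prepend_hint` are hypothetical, and the line-based matching is an assumption that keeps the example free of a YAML parser; it will miss values that span several lines.

```python
import re
from typing import List, Optional, Sequence

# Matches lines such as "working-directory: ./www" at any indentation.
_WORKDIR_RE = re.compile(r"^\s*(?:-\s*)?working-directory:\s*(?P<path>\S+)")


def find_working_directory(ci_text: str) -> Optional[str]:
    """Return the first literal working-directory path in a CI file, if any."""
    for line in ci_text.splitlines():
        match = _WORKDIR_RE.match(line)
        if match is None:
            continue
        path = match["path"].strip("'\"")
        if path.startswith("$"):
            continue  # templated value such as ${{ ... }}; not a literal path
        if path.startswith("./"):
            path = path[2:]
        path = path.strip("/")
        if path and path != ".":
            return path
    return None


def prepend_hint(locations: Sequence[str], hint: Optional[str]) -> List[str]:
    """Put the hinted path first so it is tried before the default candidates."""
    if hint is None:
        return list(locations)
    return [hint] + [loc for loc in locations if loc != hint]
```

With `working-directory: ./www` in a workflow file, `prepend_hint(DOC_LOCATIONS, "www")` yields a candidate list that starts with `www`.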
Step 4 — Recursive download
Walk the directory tree depth-first using the GitHub Contents API. For each entry:
- type == "dir": recurse (skip hidden dirs and known junk dirs).
- type == "file": download if extension matches the allowlist.
Extension allowlist:
DOC_EXTENSIONS = {".md", ".mdx", ".rst", ".txt", ".html", ".htm",
".ipynb", ".yaml", ".yml", ".toml"}
Skip dirs:
SKIP_DIRS = {"__pycache__", ".git", "node_modules", ".venv",
".tox", ".eggs", "dist", "build"}
Also skip any directory whose name starts with a dot (.).
Download raw content:
https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{file_path}
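Pacing: time.sleep(0.05) between API calls. This only spaces requests out; it does not keep a run under the unauthenticated limit of 60 requests per hour (see Known Limitations).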
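The two filters in this step can be expressed as small predicates. This is a sketch only; `should_descend` and `should_download` are hypothetical names, and the sets repeat the allowlist and skip list given above.

```python
from pathlib import PurePosixPath

DOC_EXTENSIONS = {".md", ".mdx", ".rst", ".txt", ".html", ".htm",
                  ".ipynb", ".yaml", ".yml", ".toml"}
SKIP_DIRS = {"__pycache__", ".git", "node_modules", ".venv",
             ".tox", ".eggs", "dist", "build"}


def should_descend(dir_name: str) -> bool:
    """Recurse into a directory unless it is hidden or a known junk directory."""
    return dir_name not in SKIP_DIRS and not dir_name.startswith(".")


def should_download(file_name: str) -> bool:
    """Download a file only when its extension (case-insensitive) is allowlisted."""
    return PurePosixPath(file_name).suffix.lower() in DOC_EXTENSIONS
```

Lower-casing the suffix means `README.MD` is treated the same as `README.md`, while files without a suffix (for example `Makefile`) are skipped.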
Step 5 — Write files + metadata sidecars
For each downloaded file:
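- Reconstruct the relative path under {ARTIFACT_DIR}/{repo}/{file_path}.
- Create parent directories with Path.mkdir(parents=True, exist_ok=True).
- Write file content (UTF-8, errors="replace").
- Write the .meta.json sidecar next to the file, at {out_path.with_name(out_path.name + '.meta.json')}, so docs/index.md gets docs/index.md.meta.json. Appending to the full file name, rather than replacing the suffix, keeps index.md and index.rst from sharing one sidecar.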
Metadata fields:
{
"source": "github",
"owner": "pydantic",
"repo": "pydantic-ai",
"branch": "main",
"path": "docs/index.md",
"raw_url": "https://raw.githubusercontent.com/...",
"html_url": "https://github.com/...",
"sha": "abc123",
"size_bytes": 4096,
"content_type": "text/plain",
"downloaded_at": "2026-04-21T10:00:00Z"
}
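The overview describes a .meta.json sidecar next to each file. A minimal sketch of one way to derive that path and a timestamp in the format shown above follows; `sidecar_path` and `utc_timestamp` are hypothetical helper names, not part of the skill.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def sidecar_path(out_path: Path) -> Path:
    """Append '.meta.json' to the full file name (docs/index.md -> docs/index.md.meta.json)."""
    return out_path.with_name(out_path.name + ".meta.json")


def utc_timestamp(now: Optional[datetime] = None) -> str:
    """Format a datetime as UTC with a trailing 'Z', e.g. 2026-04-21T10:00:00Z."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Appending to the name instead of calling `with_suffix(".json")` avoids two documents with the same stem, such as `index.md` and `index.rst`, overwriting each other's metadata.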
Complete mcp-forge Script
import json
import os
import time
from datetime import datetime, timezone
from pathlib import Path

# ── Configuration ────────────────────────────────────────────────────────────
OWNER = "pydantic"      # ← set from user request
REPO = "pydantic-ai"    # ← set from user request
BRANCH = "main"         # ← set from user request or discovered
DOC_LOCATIONS = ["docs", "doc", "documentation", "guide", "guides",
                 "content", "pages", "site", "wiki"]
CI_FILES = [".github/workflows/docs.yml", ".github/workflows/ci.yml",
            ".github/workflows/deploy.yml", ".gitlab-ci.yml",
            "mkdocs.yml", "readthedocs.yml", ".readthedocs.yaml"]
DOC_EXTENSIONS = {".md", ".mdx", ".rst", ".txt", ".html", ".htm",
                  ".ipynb", ".yaml", ".yml", ".toml"}
SKIP_DIRS = {"__pycache__", ".git", "node_modules", ".venv",
             ".tox", ".eggs", "dist", "build"}
ARTIFACT_DIR = Path(os.environ["MCP_ARTIFACT_DIR"])
# ── Helpers ───────────────────────────────────────────────────────────────────
def api_contents_url(path):
    return f"https://api.github.com/repos/{OWNER}/{REPO}/contents/{path}?ref={BRANCH}"

def raw_url(path):
    return f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{BRANCH}/{path}"

def html_url(path):
    return f"https://github.com/{OWNER}/{REPO}/blob/{BRANCH}/{path}"

def gh_contents(path):
    """Return parsed JSON from GitHub Contents API, or None on failure."""
    r = fetch_raw(url=api_contents_url(path))
    time.sleep(0.05)  # pace API calls
    if r.get("status_code", 200) >= 400:
        return None
    try:
        return json.loads(r["content"])
    except (KeyError, TypeError, ValueError):
        # Missing body or invalid JSON: treat as "path not usable"
        return None
# ── Step 1: confirm branch exists ────────────────────────────────────────────
for candidate in ([BRANCH] if BRANCH else ["main", "master"]):
    r = fetch_raw(url=f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{candidate}")
    time.sleep(0.05)  # pace API calls
    if r.get("status_code", 404) == 200:
        BRANCH = candidate
        print(f"Branch confirmed: {BRANCH}")
        break
else:
    # for/else: reached only when no candidate branch was confirmed
    print("ERROR: could not confirm branch, aborting")
    raise SystemExit(1)
# ── Step 3 (optional, runs before Step 2): CI pipeline branch hint ───────────
# Hints are only printed; act on them by editing BRANCH / DOC_LOCATIONS above.
for ci_file in CI_FILES:
    r = fetch_raw(url=raw_url(ci_file))
    time.sleep(0.05)
    if r.get("status_code", 404) == 200:
        content = r["content"]
        for line in content.splitlines():
            if any(kw in line for kw in ("ref:", "branch:", "gh-pages", "working-directory:")):
                print(f"CI hint in {ci_file}: {line.strip()}")
        break  # only need to find one CI file
# ── Step 2: discover docs directory ──────────────────────────────────────────
DOC_ROOT = None
for loc in DOC_LOCATIONS:
    data = gh_contents(loc)
    if isinstance(data, list) and len(data) > 0:
        DOC_ROOT = loc
        print(f"Found docs at: {DOC_ROOT}")
        break
if DOC_ROOT is None:
    print("ERROR: no docs directory found; tried:", DOC_LOCATIONS)
    raise SystemExit(1)
# ── Step 4 + 5: recursive download ───────────────────────────────────────────
downloaded = 0
errors = 0
# UTC timestamp with a trailing "Z", matching the metadata example
now_iso = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def process_dir(api_path):
    """Walk api_path depth-first, writing each doc file and its sidecar."""
    global downloaded, errors
    entries = gh_contents(api_path)
    if not isinstance(entries, list):
        return
    for entry in entries:
        name = entry.get("name", "")
        etype = entry.get("type")
        epath = entry.get("path", "")
        if etype == "dir":
            if name in SKIP_DIRS or name.startswith("."):
                continue
            process_dir(epath)
        elif etype == "file":
            ext = Path(name).suffix.lower()
            if ext not in DOC_EXTENSIONS:
                continue
            # Download raw content
            r = fetch_raw(url=raw_url(epath))
            time.sleep(0.05)
            if r.get("status_code", 200) >= 400:
                print(f"  ERROR {r.get('status_code')} {epath}")
                errors += 1
                continue
            # Write file
            out_path = ARTIFACT_DIR / REPO / epath
            out_path.parent.mkdir(parents=True, exist_ok=True)
            out_path.write_text(r["content"], encoding="utf-8", errors="replace")
            # Write .meta.json sidecar
            meta = {
                "source": "github",
                "owner": OWNER,
                "repo": REPO,
                "branch": BRANCH,
                "path": epath,
                "raw_url": raw_url(epath),
                "html_url": html_url(epath),
                "sha": entry.get("sha", ""),
                # Fall back to the encoded length (bytes, not characters)
                "size_bytes": entry.get("size", len(r["content"].encode("utf-8"))),
                "content_type": r.get("content_type", "text/plain"),
                "downloaded_at": now_iso,
            }
            # Append to the full name so index.md and index.rst get separate sidecars
            meta_path = out_path.with_name(out_path.name + ".meta.json")
            meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
            downloaded += 1
            if downloaded % 25 == 0:
                print(f"  {downloaded} files downloaded...")

process_dir(DOC_ROOT)
print(f"\nDone. Downloaded: {downloaded}, Errors: {errors}")
print(f"Output: {ARTIFACT_DIR / REPO / DOC_ROOT}")
Inject with:
mcp-forge_execute_python(
    code=<script above with OWNER/REPO/BRANCH filled in>,
    mcp_tools=["fetch_raw"]
)
Usage Examples
Basic (user provides full path):
"Download the docs from github.com/pydantic/pydantic-ai"
Set OWNER="pydantic", REPO="pydantic-ai", BRANCH="main" (or leave blank for
auto-detection) and run the script.
With explicit branch:
"Download docs from tiangolo/fastapi, branch master"
Set BRANCH="master"; the branch-confirmation loop then checks only that branch instead of trying main and master.
GitLab (future):
GitLab uses the same REST pattern but with https://gitlab.com/api/v4/projects/{id}/repository/tree.
Not yet implemented — treat GitLab repos as out of scope for now.
Known Limitations
- GitHub API unauthenticated rate limit: 60 req/hour. For large repos with many subdirectories, consider adding an Authorization: Bearer <token> header. fetch_raw does not currently support custom headers; a fetch_raw_auth variant would be needed.
- GitLab not supported yet.
- Only files with known doc extensions are downloaded. Binary assets (images, PDFs) are intentionally skipped.
- Re-running the script downloads every file again and silently overwrites existing files and sidecars; there is no skip-if-unchanged check.
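One possible way to soften the last limitation is to compare the `sha` stored in a file's sidecar with the `sha` reported by the Contents API before downloading again. This is a sketch under assumptions: `is_up_to_date` is a hypothetical helper, and sidecars are assumed to be named `<file>.meta.json` as the overview describes.

```python
import json
from pathlib import Path


def is_up_to_date(out_path: Path, remote_sha: str) -> bool:
    """True when the file and its sidecar exist and the recorded sha matches."""
    meta_path = out_path.with_name(out_path.name + ".meta.json")
    if not (out_path.is_file() and meta_path.is_file()):
        return False
    try:
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return False  # unreadable or corrupt sidecar: download again
    return (isinstance(meta, dict)
            and bool(remote_sha)
            and meta.get("sha") == remote_sha)
```

A download loop could call `is_up_to_date(out_path, entry.get("sha", ""))` and skip the raw fetch when it returns `True`. This saves raw downloads but not the directory listing calls, so it does not by itself avoid the API rate limit.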