Hans Aschauer 02931b70d5 fix: enhance download-docs skill to handle working-directory and update metadata file extension

2026-05-18 07:33:19 +02:00

11 KiB

name	description
download-docs	Given a GitHub/GitLab repo, discover the documentation directory and recursively download all doc files into the artifacts directory, preserving the original directory structure and writing a .json metadata sidecar next to each file.

download-docs Skill

Overview

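This skill downloads an entire documentation tree from a GitHub repository into the mcp-forge artifact directory (GitLab support is planned but not yet implemented). Each file gets a .json sidecar with provenance metadata, written next to the file with the file's extension replaced by .json.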

Constraints:

mcp-forge is air-gapped. All HTTP goes through injected MCP tools (fetch_raw).
Inject tools with their bare name: fetch_raw (not searxng_fetch_raw).
All injected tool calls use keyword-only arguments.
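For dry runs outside mcp-forge, a local stand-in for the injected tool can help. The sketch below is an assumption, not the real tool: the response keys (status_code, content, content_type) are inferred from how the script in this skill reads them, and FAKE_RESPONSES is a hypothetical table of canned replies.

```python
# Hypothetical stand-in for the injected fetch_raw tool, for offline dry runs.
# The response keys mirror what the script in this skill reads; the real
# tool's return value may carry more (or differently named) fields.
FAKE_RESPONSES = {
    "https://api.github.com/repos/o/r/branches/main": (200, '{"name": "main"}'),
}

def fetch_raw(*, url):
    """Keyword-only signature, matching the calling convention above."""
    status, body = FAKE_RESPONSES.get(url, (404, '{"message": "Not Found"}'))
    return {
        "status_code": status,
        "content": body,
        "content_type": "application/json",
    }

ok = fetch_raw(url="https://api.github.com/repos/o/r/branches/main")
missing = fetch_raw(url="https://api.github.com/repos/o/r/branches/dev")
print(ok["status_code"], missing["status_code"])
```

Because the parameter is keyword-only, a positional call such as fetch_raw("https://...") raises TypeError, which surfaces calling-convention mistakes early.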

Process

Step 1 — Parse user request

Extract owner, repo, and optionally branch from the user's request.

If branch is not given, try main then master (check which exists via the GitHub API).
Canonical form: owner/repo or https://github.com/owner/repo.
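The parsing step can be sketched as below. parse_repo_ref and its regular expression are hypothetical helpers, not part of the skill; they accept the two canonical forms above (plus a bare github.com/ prefix and a trailing .git) and return branch=None when no branch was given, which signals "try main, then master".

```python
import re

# Hypothetical pattern: optional scheme, optional github.com/ host,
# then owner/repo with an optional ".git" suffix or trailing slash.
_REPO_RE = re.compile(
    r"(?:https?://)?(?:github\.com/)?"
    r"(?P<owner>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+?)(?:\.git)?/?$"
)

def parse_repo_ref(ref, branch=None):
    """Return (owner, repo, branch) for 'owner/repo' or a github.com URL."""
    m = _REPO_RE.match(ref.strip())
    if m is None:
        raise ValueError(f"not an owner/repo reference: {ref!r}")
    return m["owner"], m["repo"], branch

print(parse_repo_ref("pydantic/pydantic-ai"))
print(parse_repo_ref("https://github.com/tiangolo/fastapi", branch="master"))
```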

Step 2 — Discover documentation directory

Try each candidate path against the GitHub Contents API until one returns a non-empty list of entries:

DOC_LOCATIONS = ["docs", "doc", "documentation", "guide", "guides",
                 "content", "pages", "site", "wiki"]

API endpoint:

https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={branch}

A 200 response with a JSON list (not a dict with message) means the path exists and is a directory. Use the first match.
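The check described above can be isolated into a small function. directory_entries is a hypothetical name; it takes the status code and body separately so it can be exercised without any network access.

```python
import json

def directory_entries(status_code, body):
    """Return the entry list for a non-empty directory listing, else None."""
    if status_code != 200:
        return None
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        return None
    # A directory yields a JSON list. A single file yields a dict, and an
    # error yields a dict carrying a "message" key; neither is a match.
    if isinstance(data, list) and data:
        return data
    return None

print(directory_entries(200, '[{"name": "index.md", "type": "file"}]'))
print(directory_entries(404, '{"message": "Not Found"}'))
```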

Step 3 — CI pipeline branch hint (optional, best-effort)

This step is optional and best-effort. Run it before Step 2, or as a fallback when Step 2 finds nothing, and scan CI/config files for a branch override:

CI_FILES = [
    ".github/workflows/docs.yml",
    ".github/workflows/ci.yml",
    ".github/workflows/deploy.yml",
    ".gitlab-ci.yml",
    "mkdocs.yml",
    "readthedocs.yml",
    ".readthedocs.yaml",
]

Fetch each via raw URL:

https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{ci_file}

Scan content for keywords like ref:, branch:, gh-pages, checkout, working-directory. If a specific docs branch is found, update BRANCH and re-run Step 2. If a working-directory: line is found (e.g. working-directory: ./www), extract that path and prepend it to DOC_LOCATIONS so it is tried first.
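The working-directory handling can be sketched as a plain line scan, without a YAML parser. Both function names are hypothetical, and the scan is deliberately naive: it takes the first working-directory: line it sees and ignores values built from ${{ ... }} expressions.

```python
def extract_working_directory(ci_text):
    """Return the first working-directory value as a repo-relative path, or None."""
    for line in ci_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):          # YAML list item
            stripped = stripped[2:].lstrip()
        if not stripped.startswith("working-directory:"):
            continue
        value = stripped.split(":", 1)[1].split(" #", 1)[0].strip().strip("'\"")
        if value.startswith("./"):
            value = value[2:]
        value = value.strip("/")
        if value and value != "." and "${{" not in value:
            return value
    return None

def prepend_hint(doc_locations, ci_text):
    """Return a new candidate list with the hinted directory tried first."""
    wd = extract_working_directory(ci_text)
    if wd is None:
        return list(doc_locations)
    return [wd] + [loc for loc in doc_locations if loc != wd]

workflow = "defaults:\n  run:\n    working-directory: ./www\n"
print(prepend_hint(["docs", "doc"], workflow))
```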

Step 4 — Recursive download

Walk the directory tree depth-first using the GitHub Contents API. For each entry:

type == "dir": recurse (skip hidden dirs and known junk dirs).
type == "file": download if extension matches the allowlist.

Extension allowlist:

DOC_EXTENSIONS = {".md", ".mdx", ".rst", ".txt", ".html", ".htm",
                  ".ipynb", ".yaml", ".yml", ".toml"}

Skip dirs:

SKIP_DIRS = {"__pycache__", ".git", "node_modules", ".venv",
             ".tox", ".eggs", "dist", "build"}

Also skip any directory whose name starts with a dot (.).
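The two filters combine into a single predicate over a Contents API entry. wanted is a hypothetical helper name; the two sets are copied from the lists above.

```python
from pathlib import PurePosixPath

DOC_EXTENSIONS = {".md", ".mdx", ".rst", ".txt", ".html", ".htm",
                  ".ipynb", ".yaml", ".yml", ".toml"}
SKIP_DIRS = {"__pycache__", ".git", "node_modules", ".venv",
             ".tox", ".eggs", "dist", "build"}

def wanted(entry):
    """True if a directory should be recursed into or a file downloaded."""
    name = entry.get("name", "")
    etype = entry.get("type")
    if etype == "dir":
        return name not in SKIP_DIRS and not name.startswith(".")
    if etype == "file":
        # Case-insensitive match, so INDEX.MD is treated like index.md.
        return PurePosixPath(name).suffix.lower() in DOC_EXTENSIONS
    return False  # other entry types (e.g. symlink, submodule) are ignored

entries = [
    {"name": "guide", "type": "dir"},
    {"name": ".github", "type": "dir"},
    {"name": "index.MD", "type": "file"},
    {"name": "logo.png", "type": "file"},
]
print([e["name"] for e in entries if wanted(e)])
```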

Download raw content:

https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{file_path}

Rate limit: time.sleep(0.05) between API calls.
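A 0.05 s pause allows roughly 20 calls per second, so for unauthenticated use the hourly quota listed under Known Limitations is the binding constraint, not the pause. The arithmetic below is a rough sketch under two assumptions: only api.github.com calls count against that quota (raw.githubusercontent.com downloads are assumed not to), and the script makes one listing call per walked directory. The function names are hypothetical.

```python
API_LIMIT_PER_HOUR = 60  # unauthenticated limit cited in this skill

def api_calls(num_dirs, probes=1, branch_checks=1):
    """Estimated api.github.com calls for one run.

    num_dirs      -- directories walked, counting the docs root itself
    probes        -- DOC_LOCATIONS candidates tried until the first hit
    branch_checks -- branch lookups (1 if the branch is given, up to 2)
    """
    return branch_checks + probes + num_dirs

def fits_in_budget(num_dirs, probes=1, branch_checks=1,
                   limit=API_LIMIT_PER_HOUR):
    return api_calls(num_dirs, probes, branch_checks) <= limit

print(api_calls(num_dirs=40), fits_in_budget(num_dirs=40))
print(api_calls(num_dirs=70), fits_in_budget(num_dirs=70))
```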

Step 5 — Write files + metadata sidecars

For each downloaded file:

Reconstruct the relative path under {ARTIFACT_DIR}/{repo}/{file_path}.
Create parent directories with Path.mkdir(parents=True, exist_ok=True).
Write file content (UTF-8, errors=replace).
Write the .json sidecar at out_path.with_suffix('.json'), i.e. the same path with the file's extension replaced by .json.

Metadata fields:

{
  "source": "github",
  "owner": "pydantic",
  "repo": "pydantic-ai",
  "branch": "main",
  "path": "docs/index.md",
  "raw_url": "https://raw.githubusercontent.com/...",
  "html_url": "https://github.com/...",
  "sha": "abc123",
  "size_bytes": 4096,
  "content_type": "text/plain",
  "downloaded_at": "2026-04-21T10:00:00Z"
}
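How the sidecar path is derived matters when two sources share a stem. The sketch below shows what with_suffix('.json') produces, using PurePosixPath so the output is the same on every platform. appended_sidecar_path is a hypothetical alternative shown only for comparison; it is not what the skill does.

```python
from pathlib import PurePosixPath

def sidecar_path(out_path):
    """Sidecar location as Step 5 describes it: extension replaced by .json."""
    return out_path.with_suffix(".json")

def appended_sidecar_path(out_path):
    """Hypothetical alternative: keep the original extension, append .json."""
    return out_path.with_name(out_path.name + ".json")

md = PurePosixPath("pydantic-ai/docs/index.md")
rst = PurePosixPath("pydantic-ai/docs/index.rst")

print(sidecar_path(md))
print(sidecar_path(md) == sidecar_path(rst))
print(appended_sidecar_path(md), appended_sidecar_path(rst))
```

With suffix replacement, index.md and index.rst in the same directory share one sidecar, so the later download overwrites the earlier file's metadata. Appending the suffix avoids that at the cost of longer names.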

Complete mcp-forge Script

import json, os, time
from pathlib import Path
from datetime import datetime, timezone

# ── Configuration ────────────────────────────────────────────────────────────
OWNER   = "pydantic"          # ← set from user request
REPO    = "pydantic-ai"       # ← set from user request
BRANCH  = "main"              # ← set from user request or discovered

DOC_LOCATIONS  = ["docs", "doc", "documentation", "guide", "guides",
                  "content", "pages", "site", "wiki"]
CI_FILES       = [".github/workflows/docs.yml", ".github/workflows/ci.yml",
                  ".github/workflows/deploy.yml", ".gitlab-ci.yml",
                  "mkdocs.yml", "readthedocs.yml", ".readthedocs.yaml"]
DOC_EXTENSIONS = {".md", ".mdx", ".rst", ".txt", ".html", ".htm",
                  ".ipynb", ".yaml", ".yml", ".toml"}
SKIP_DIRS      = {"__pycache__", ".git", "node_modules", ".venv",
                  ".tox", ".eggs", "dist", "build"}

ARTIFACT_DIR = Path(os.environ["MCP_ARTIFACT_DIR"])

# ── Helpers ───────────────────────────────────────────────────────────────────
def gh_contents(path):
    """Return parsed JSON from the GitHub Contents API, or None on failure."""
    r = fetch_raw(url=api_contents_url(path))
    time.sleep(0.05)  # brief pause between API calls
    if r.get("status_code", 200) >= 400:
        return None
    try:
        return json.loads(r["content"])
    except (KeyError, TypeError, ValueError):
        # Missing body or invalid JSON
        return None

def raw_url(path):
    return f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{BRANCH}/{path}"

def html_url(path):
    return f"https://github.com/{OWNER}/{REPO}/blob/{BRANCH}/{path}"

def api_contents_url(path):
    return f"https://api.github.com/repos/{OWNER}/{REPO}/contents/{path}?ref={BRANCH}"

# ── Step 1: confirm branch exists ────────────────────────────────────────────
for candidate in ([BRANCH] if BRANCH else ["main", "master"]):
    r = fetch_raw(url=f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{candidate}")
    if r.get("status_code", 404) == 200:
        BRANCH = candidate
        print(f"Branch confirmed: {BRANCH}")
        break
else:
    print("ERROR: could not confirm branch — aborting")
    raise SystemExit(1)

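# ── Step 3 (optional, runs before Step 2): CI pipeline hints ─────────────────
for ci_file in CI_FILES:
    r = fetch_raw(url=raw_url(ci_file))
    if r.get("status_code", 404) != 200:
        continue
    for line in r["content"].splitlines():
        stripped = line.strip()
        # Branch hints are only reported; set BRANCH by hand if one applies.
        if any(kw in stripped for kw in ("ref:", "branch:", "gh-pages")):
            print(f"CI hint in {ci_file}: {stripped}")
        # A working-directory (e.g. ./www) is tried first as the docs root.
        if stripped.startswith("working-directory:"):
            wd = stripped.split(":", 1)[1].strip().strip("'\"")
            wd = (wd[2:] if wd.startswith("./") else wd).strip("/")
            if wd and wd != "." and wd not in DOC_LOCATIONS:
                DOC_LOCATIONS.insert(0, wd)
    break  # only the first CI file found is scanned

# ── Step 2: discover docs directory ──────────────────────────────────────────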
DOC_ROOT = None
for loc in DOC_LOCATIONS:
    data = gh_contents(loc)
    if isinstance(data, list) and len(data) > 0:
        DOC_ROOT = loc
        print(f"Found docs at: {DOC_ROOT}")
        break

if DOC_ROOT is None:
    print("ERROR: no docs directory found — tried:", DOC_LOCATIONS)
    raise SystemExit(1)

# ── Step 4 + 5: recursive download ───────────────────────────────────────────
downloaded = 0
errors = 0
now_iso = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")  # e.g. 2026-04-21T10:00:00Z

def process_dir(api_path):
    global downloaded, errors
    entries = gh_contents(api_path)
    if not isinstance(entries, list):
        return
    for entry in entries:
        name = entry.get("name", "")
        etype = entry.get("type")
        epath = entry.get("path", "")

        if etype == "dir":
            if name in SKIP_DIRS or name.startswith("."):
                continue
            process_dir(epath)

        elif etype == "file":
            ext = Path(name).suffix.lower()
            if ext not in DOC_EXTENSIONS:
                continue

            # Download raw content
            r = fetch_raw(url=raw_url(epath))
            time.sleep(0.05)
            if r.get("status_code", 200) >= 400:
                print(f"  ERROR {r.get('status_code')} {epath}")
                errors += 1
                continue

            # Write file
            out_path = ARTIFACT_DIR / REPO / epath
            out_path.parent.mkdir(parents=True, exist_ok=True)
            out_path.write_text(r["content"], encoding="utf-8", errors="replace")

            # Write .json sidecar (same stem, extension replaced)
            meta = {
                "source":       "github",
                "owner":        OWNER,
                "repo":         REPO,
                "branch":       BRANCH,
                "path":         epath,
                "raw_url":      raw_url(epath),
                "html_url":     html_url(epath),
                "sha":          entry.get("sha", ""),
                "size_bytes":   entry.get("size", len(r["content"].encode("utf-8"))),
                "content_type": r.get("content_type", "text/plain"),
                "downloaded_at": now_iso,
            }
            meta_path = out_path.with_suffix(".json")
            meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")

            downloaded += 1
            if downloaded % 25 == 0:
                print(f"  {downloaded} files downloaded...")

process_dir(DOC_ROOT)

print(f"\nDone. Downloaded: {downloaded}, Errors: {errors}")
print(f"Output: {ARTIFACT_DIR / REPO / DOC_ROOT}")

Inject with:

mcp-forge_execute_python(
    code=<script above with OWNER/REPO/BRANCH filled in>,
    mcp_tools=["fetch_raw"]
)

Usage Examples

Basic (user provides full path):

"Download the docs from github.com/pydantic/pydantic-ai"

Set OWNER="pydantic", REPO="pydantic-ai", and BRANCH="main" (or BRANCH="" to let Step 1 try main, then master), then run the script.

With explicit branch:

"Download docs from tiangolo/fastapi, branch master"

Set BRANCH="master". The Step 1 loop then only confirms that this branch exists instead of trying main and master.

GitLab (future): GitLab uses the same REST pattern but with https://gitlab.com/api/v4/projects/{id}/repository/tree. Not yet implemented — treat GitLab repos as out of scope for now.

Known Limitations

GitHub API unauthenticated rate limit: 60 req/hour. Large repos with many subdirectories can exhaust this, because each directory costs one Contents API call. An Authorization: Bearer <token> header would raise the limit, but fetch_raw does not currently support custom headers; a fetch_raw_auth variant would be needed.
GitLab not supported yet.
Only files with known doc extensions are downloaded. Binary assets (images, PDFs) are intentionally skipped.
Re-running the script silently overwrites existing files and sidecars. There is no skip-if-unchanged check, so every run downloads the full tree again.
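One way to avoid re-downloading unchanged files would be to compare the sha reported by the Contents API with the sha already recorded in the sidecar. The skill does not do this today; needs_download below is a hypothetical helper, sketched on the assumption that the sidecar keeps the sha field shown in Step 5.

```python
import json
import tempfile
from pathlib import Path

def needs_download(meta_path, entry_sha):
    """True unless a readable sidecar records the same sha as the API entry."""
    try:
        recorded = json.loads(meta_path.read_text(encoding="utf-8")).get("sha")
    except (OSError, ValueError, AttributeError):
        # No sidecar yet, unreadable file, or unexpected JSON shape.
        return True
    return not entry_sha or recorded != entry_sha

with tempfile.TemporaryDirectory() as tmp:
    meta = Path(tmp) / "index.json"
    first = needs_download(meta, "abc123")      # no sidecar yet
    meta.write_text(json.dumps({"sha": "abc123"}), encoding="utf-8")
    unchanged = needs_download(meta, "abc123")  # same sha recorded
    changed = needs_download(meta, "def456")    # upstream file changed
    print(first, unchanged, changed)
```

In the script, such a check would sit just before the raw download, using the same sidecar path the script writes.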
