KMS Compilation Pipeline — Design Doc

Overview

One-click Compile processes raw .txt uploads into structured wiki pages using markitdown for conversion and LiteLLM for content analysis.

Data Flow

Upload (.txt)
    │
    ▼
raw/notes/{file}.txt          ◄── inbox (user uploads here)
    │
    ▼ (Compile button clicked)
    │
┌── FileScanner ──────────────────────────────┐
│  Scans raw/notes/*.txt (not .md, not in     │
│  subdirs — just the direct inbox)           │
└──────────────┬──────────────────────────────┘
               │
               ▼
┌── TextConverter (markitdown) ───────────────┐
│  .txt → .md  (strip excessive whitespace,   │
│  normalise line endings)                    │
└──────────────┬──────────────────────────────┘
               │
               ▼
┌── LLMProcessor (LiteLLM) ───────────────────┐
│  Batch: send all files in ONE call with     │
│  numbered sections. Returns JSON array.     │
│  Fallback: individual calls if content      │
│  exceeds context window.                    │
│                                             │
│  Per-file extraction:                       │
│  - title (from content, not filename)       │
│  - tags (auto-generated list)               │
│  - confidence (high/medium/low)             │
│  - summary (concise but nuance-preserving)  │
│  - source (URL, book ref, article, etc.)    │
│  - content (cleaned .md, no extra fluff)    │
└──────────────┬──────────────────────────────┘
               │
               ▼
┌── WikiWriter ───────────────────────────────┐
│  Write to wiki/topics/{slug}.md with YAML   │
│  frontmatter from extracted data +          │
│  source_path: notes/{file}.txt             │
└──────────────┬──────────────────────────────┘
               │
               ▼
┌── FileMover ────────────────────────────────┐
│  Move processed .txt to raw/processed/      │
│  (keeps source artifact, clean inbox)       │
└──────────────┬──────────────────────────────┘
               │
               ▼
┌── RebuildIndex ─────────────────────────────┐
│  Rebuild FTS5 index via existing rebuild    │
└──────────────┬──────────────────────────────┘
               │
               ▼
        Results shown in compile.html
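The FileScanner step above can be sketched as follows (function name is hypothetical, not the actual implementation):

```python
from pathlib import Path

def scan_inbox(notes_dir: Path) -> list[Path]:
    """Hypothetical FileScanner: direct .txt children of raw/notes/ only.

    glob (not rglob) deliberately ignores subdirectories, and the .txt
    suffix filter skips any .md files sitting in the inbox.
    """
    return sorted(p for p in notes_dir.glob("*.txt") if p.is_file())
```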

Data Shapes

@dataclass
class RawFile:
    path: Path
    name: str                    # filename with extension
    stem: str                    # filename without extension
    size_bytes: int

@dataclass
class ProcessedFile:
    raw_file: RawFile
    success: bool
    slug: str                    # wiki page slug
    error: str | None = None

@dataclass
class CompiledNote:
    """Output from LLM per file."""
    title: str
    tags: list[str]
    confidence: str              # "high" | "medium" | "low"
    summary: str
    source: str                  # user-provided context (URL, book, etc.)
    content: str                 # cleaned markdown body
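A minimal sketch of how WikiWriter might turn a CompiledNote into a frontmatter page (render_wiki_page is a hypothetical helper; real code should use a YAML emitter so titles containing ':' or quotes stay valid):

```python
from dataclasses import dataclass

@dataclass
class CompiledNote:
    title: str
    tags: list[str]
    confidence: str
    summary: str
    source: str
    content: str

def render_wiki_page(note: CompiledNote, source_path: str) -> str:
    # Naive frontmatter rendering for illustration only; values are
    # emitted unescaped, which a proper YAML library would handle.
    lines = [
        "---",
        f"title: {note.title}",
        f"tags: [{', '.join(note.tags)}]",
        f"confidence: {note.confidence}",
        f"summary: {note.summary}",
        f"source: {note.source}",
        f"source_path: {source_path}",
        "---",
        "",
        note.content,
    ]
    return "\n".join(lines) + "\n"
```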

Batching Strategy

Default: send all new .txt files in a single LLM call. Number the files [1], [2], [3]... in the prompt; the model returns a matching JSON array.

Fallback (content too large or >1M tokens):
- Process in batches of 5 files
- If a single file is enormous (>100K chars), process it alone

This means 1 LLM call (or a handful) per Compile run rather than N calls.
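The batching rules can be sketched as a pure planning function (plan_batches is a hypothetical name; it operates on raw note texts for simplicity):

```python
def plan_batches(notes: list[str], max_per_batch: int = 5,
                 huge_chars: int = 100_000) -> list[list[str]]:
    """Group notes into LLM batches per the fallback rules.

    At most max_per_batch notes per batch; any note longer than
    huge_chars is emitted as a batch of its own.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    for note in notes:
        if len(note) > huge_chars:
            if current:
                batches.append(current)
                current = []
            batches.append([note])  # enormous file processed alone
            continue
        current.append(note)
        if len(current) == max_per_batch:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches
```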

LLM Prompt

You are a knowledge management compiler. Given raw text notes,
extract structured information for a personal wiki.

For each note, return:
- title: descriptive title (from content, not filename)
- tags: 3-8 relevant tags (lowercase, no spaces)
- confidence: "high" if well-structured, "medium" if reasonable,
  "low" if fragmentary/unclear
- summary: 2-3 sentences preserving key nuance
- source: the source context if apparent, else "Personal note"
- content: the note cleaned up as concise markdown (remove
  excessive blank lines, normalise formatting, keep all meaning)

Return a JSON array. Each element corresponds to one note.
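The prompt assembly and response parsing around this template might look like the sketch below (the actual LLM call would go through litellm.completion with the configured model; only the pure helpers are shown, and their names are assumptions):

```python
import json

SYSTEM_PROMPT = "You are a knowledge management compiler. ..."  # full prompt above

def build_user_prompt(notes: list[str]) -> str:
    # Number each note [1], [2], ... so elements of the returned JSON
    # array can be matched back to input files by position.
    return "\n\n".join(f"[{i}]\n{text}" for i, text in enumerate(notes, start=1))

def parse_response(raw: str) -> list[dict]:
    # The model is asked for a JSON array; anything else raises so the
    # caller can skip the batch per the Error Handling rules.
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    return data
```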

Configuration (env vars, KMS_ prefix)

Add to kms/web/config.py:

# LLM
llm_api_key: str = ""          # KMS_LLM_API_KEY
llm_model: str = "gpt-4o-mini" # KMS_LLM_MODEL
llm_max_files_per_batch: int = 10  # KMS_LLM_MAX_FILES_PER_BATCH

# Paths
processed_dir: Path | None = None  # defaults to raw/processed/
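As a sketch of the env-var mapping only (the real config.py presumably has its own settings class; load_llm_settings is a hypothetical stand-in):

```python
import os

def load_llm_settings() -> dict:
    """Illustrate the KMS_ prefix convention for the LLM settings."""
    return {
        "llm_api_key": os.environ.get("KMS_LLM_API_KEY", ""),
        "llm_model": os.environ.get("KMS_LLM_MODEL", "gpt-4o-mini"),
        "llm_max_files_per_batch": int(
            os.environ.get("KMS_LLM_MAX_FILES_PER_BATCH", "10")),
    }
```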

Error Handling

Failure                               Action
--------------------------------------------------------------------
markitdown fails on a file            Skip file, log error, continue
LLM call fails (rate limit, timeout)  Retry once, then skip batch
LLM returns malformed JSON            Skip batch, log raw response
Slug collision (existing wiki page)   Append _v2, _v3 etc.
File already processed (in inbox)     Handled by moving to processed/
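The slug-collision rule can be sketched as (resolve_slug is a hypothetical helper):

```python
def resolve_slug(slug: str, existing: set[str]) -> str:
    """Append _v2, _v3, ... until the slug is free among existing pages."""
    if slug not in existing:
        return slug
    n = 2
    while f"{slug}_v{n}" in existing:
        n += 1
    return f"{slug}_v{n}"
```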

Files to Create / Modify

New files:
- kms/scripts/compile_pipeline.py — PipelineOrchestrator + all components
- kms/web/compile_pipeline.py — could be inlined in main, but a separate module is cleaner

Modified files:
- kms/web/main.py — update /compile POST to run pipeline
- kms/web/config.py — add LLM settings
- kms/web/templates/compile.html — show per-file results, error counts
- requirements.txt — add litellm, markitdown

Installation (needs pip install)

cd kms
.venv/bin/pip install litellm markitdown