User Guide

DocForge is a LAN-hosted pipeline that turns PDF documents into structured, queryable data using an LLM-driven, four-phase ETL workflow.

00 Getting Access — Sign In

Click Open App on the landing page. Enter your username and password, then click Sign In. The system auto-detects your account type and redirects you:

User

Regular user

Users authenticate with their group's shared password. If you belong to only one group, it is selected automatically. On success you are redirected to /app.

Admin

Admins authenticate with a personal password. On success you are redirected to /admin where you manage groups and users.

Account type	Min length	Uppercase	Digits	Special char
Regular user (group password)	8	≥ 1	≥ 1	≥ 1
Admin (personal password)	12	≥ 1	≥ 3	≥ 1
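
The password rules in the table can be checked before a password is submitted. The following is a minimal sketch, not DocForge's own validator: the names `PasswordPolicy` and `check_password` are hypothetical, and it assumes that "special char" means ASCII punctuation, which the guide does not specify.

```python
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class PasswordPolicy:
    min_length: int
    min_upper: int
    min_digits: int
    min_special: int


# Values copied from the password table in this guide.
POLICIES = {
    "user": PasswordPolicy(min_length=8, min_upper=1, min_digits=1, min_special=1),
    "admin": PasswordPolicy(min_length=12, min_upper=1, min_digits=3, min_special=1),
}


def check_password(password: str, account_type: str) -> list[str]:
    """Return the unmet requirements; an empty list means the password passes."""
    policy = POLICIES[account_type]
    upper = sum(c.isupper() for c in password)
    digits = sum(c.isdigit() for c in password)
    # Assumption: "special" is ASCII punctuation; the server may define it differently.
    special = sum(c in string.punctuation for c in password)

    problems = []
    if len(password) < policy.min_length:
        problems.append(f"needs at least {policy.min_length} characters")
    if upper < policy.min_upper:
        problems.append(f"needs at least {policy.min_upper} uppercase letter(s)")
    if digits < policy.min_digits:
        problems.append(f"needs at least {policy.min_digits} digit(s)")
    if special < policy.min_special:
        problems.append(f"needs at least {policy.min_special} special character(s)")
    return problems
```

A local check like this only saves a round trip. The server remains the authority on whether a password is accepted.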

Admin Panel `/admin`

Admins can create groups, set group passwords, add or remove users, and reset passwords. Each group shares a single password, so resetting it affects all members immediately.

Forgot password? Use the link in the sign-in modal to email fmonneret@gmail.com and request a reset. Admins can reset group passwords directly from the Admin Panel.

01 What It Does

DocForge automates the extraction of structured data from PDF reports (fund reports, financial statements, whitepapers). It converts unstructured PDF content into clean Markdown, infers document metadata, and extracts field-level values according to a user-defined JSON schema — all without manual copy-paste.

Input

PDF files uploaded via the browser UI

Output

JSON / Markdown / CSV with extracted fields, permanently archived by document dimensions

02 The Four-Phase Pipeline

Phase 1 — Ingest

PDF is uploaded to a temp folder and converted to Markdown. Charts and scanned images are processed by a vision LLM and converted to data tables inline.

Phase 2 — Archive

The LLM infers document dimensions (asset manager, fund, date, strategy) from the Markdown content, then moves the file to the permanent workspace tree and updates the index.

Phase 3 — Extract

A schema-driven extraction prompt is built per field group. The LLM reads the archived document and populates the schema, with optional TransformEngine post-processing.

Phase 4 — Query

Query the document index with date-range filters to get snapshots or time-series. Ask grounded questions via the chat interface.

03 Practical Workflow

Step 1

Select workspace & Upload PDF

Pick a workspace (group + doc type) from the sidebar dropdowns.
Drag a PDF into the upload zone on the Upload tab, or click to browse.
Click Ingest (Fast) to convert it to Markdown immediately, or Ingest (Standard) for background processing with a full DOCINDEX tree.
Re-ingest specific pages: Enter page numbers (e.g. 43 or 1-5,10) in the Pages field. Check Morph to merge the re-ingested pages into the global document output.
Monitor progress in the server log panel at the bottom. Partial vision output (pages where chart conversion failed) is shown inline.
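
The Pages field accepts specs such as 43 or 1-5,10. As an illustration of how such a spec expands, here is a small sketch. The function `parse_pages` is hypothetical, and it assumes pages are 1-based and ranges are inclusive; DocForge's own parser may behave differently on edge cases.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as "43" or "1-5,10" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1-5,,10"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # assumed inclusive on both ends
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)
```

Using a set means overlapping entries such as 1-5,3 yield each page only once.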

Step 2

Extract

Switch to the Extract tab.
Select the schema to use (from _extract-templates/ in your workspace).
Pick the archived document(s) to extract from. Documents in the same folder are automatically batched together in one LLM call.
Click Run Extract — the LLM populates each field in the schema.
Results are saved as extract-singles-{suffix}.md + .yaml and extract-tables-{suffix}.md + .yaml alongside the PDF. Fields marked narrative in the schema use a dedicated prose-summarisation pass (see Tips).

Step 3

Query & Chat

Switch to the Query tab.
Use the filter panel to select a date range and dimensions.
Click Snapshot for point-in-time data or Series for time-series across reports.
Use the Chat sub-tab to ask natural-language questions — answers are grounded in the archived documents.

04 Tabs Reference

Tab	Group	Purpose
Upload	Workflow	Upload PDFs and run Ingest (Phase 1)
Archive	Workflow	Preview dimension inference, confirm archive, clean up temp files (Phase 2)
Extract	Workflow	Schema-driven LLM field extraction on archived documents (Phase 3)
Query	Workflow	Date-range document selection, grounded chat, save query results (Phase 4)
Templates	Workflow	Save, manage, and run reusable query templates + extraction schema editor
Config	Settings	Pipeline settings: vision model, text model, concurrency, feature flags (merged cells, chart detection, table format…)
LLM Config	Settings	Model aliases, fallback chains, provider endpoints
Keys	Settings	API keys per workspace (stored encrypted in `_config/.keys.yaml`)
Schema	Settings	Edit extraction schemas (YAML) — field types, formats, hints, transforms
Dashboard	Settings	Workspace overview: archive status, extract coverage, and query output counts per reporting date

05 Dashboard

The Dashboard tab provides a live snapshot of your workspace's processing state. Select a reporting date from the shared date picker — all three status sections update together.

Global Stats

Date range, distinct entity counts (asset managers, fund names…), total files archived, pages processed, total file size.

Archived Stats

Per-entity archive status at T and T−1: full (JSON tree + MD), compact (headers-only JSON), or md (Markdown only).

Extract Stats

Which extraction schemas have been run per entity at the selected date. One column per schema found in _extract-templates/.

Query Stats

Count of saved Snapshot and Series query outputs at T and T−1, scanned from queries-output/.

06 Tips & Notes

Fast vs Standard ingest: Fast is synchronous and sufficient for most documents. Standard runs in the background and builds a full DOCINDEX tree. Use it when you need deep section-level structure for large documents.

API keys: Keys are stored per workspace in _config/.keys.yaml — not in the environment. Set them once via the Keys tab.

Archive is idempotent: Re-archiving the same file overwrites the existing entry without duplicating it in the index.

Dashboard for progress tracking: After each reporting period, open the Dashboard tab to confirm all entities are archived (full), extracted across all schemas, and query outputs saved.

AI Review: Right-click any editable text field (query input, refinement, template) and choose Review by AI to sharpen the phrasing. It also works on query outputs: click Edit, then right-click.

Narrative fields: Add narrative: 2-4 to any schema field to request a prose summary (2–4 sentences) instead of a bare value. The extractor runs a separate summarisation pass for these fields — ideal for investment thesis, strategy description, or risk commentary columns. Works on both singles and table columns.

YAML extraction output: Every extraction also writes a machine-readable YAML mirror: extract-singles-{suffix}.yaml and extract-tables-{suffix}.yaml. The tables YAML is a dict keyed by table name, and each row carries an auto-incremented id. Use these files as input for downstream scripts or spreadsheet imports without parsing Markdown.
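
As a rough illustration of the downstream use, the sketch below turns one table from the tables YAML into CSV text. It assumes the file has already been loaded into a Python dict (for example with PyYAML's `yaml.safe_load`) and that each table is a list of row dicts. That row shape, the function `table_to_csv`, and the sample data are assumptions made for the example, not a documented format.

```python
import csv
import io


def table_to_csv(tables: dict, table_name: str) -> str:
    """Render one table (a list of row dicts) as CSV text."""
    rows = tables[table_name]
    # Collect column names in first-seen order so sparse rows still line up.
    columns: list[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty cells
    return buffer.getvalue()


# Hypothetical data standing in for a parsed extract-tables-{suffix}.yaml.
sample = {
    "holdings": [
        {"id": 1, "name": "Alpha", "weight": 0.12},
        {"id": 2, "name": "Beta", "weight": 0.08},
    ]
}
csv_text = table_to_csv(sample, "holdings")
```

The resulting text can be written to a .csv file and opened directly in a spreadsheet.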

07 API Access (SDK / curl)

Every pipeline operation is available as a REST API. Requests are authenticated with a short-lived JWT. The full interactive spec is at /docs.

Auth

Get a JWT token

There are two auth paths. Use workspace auth for programmatic API access (curl/SDK). Use user login when calling from a browser context or testing user flows.

Workspace auth (master password)

TOKEN=$(curl -s -X POST http://localhost:8000/auth/login \
  -d "password=YOUR_MASTER_PASSWORD" \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

User login (group password)

TOKEN=$(curl -s -X POST http://localhost:8000/auth/user/login \
  -H "Content-Type: application/json" \
  -d '{"username":"reedcapital","password":"Reed123!","group_id":1}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

All subsequent requests: -H "Authorization: Bearer $TOKEN"
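
For scripts that prefer Python over curl, the workspace-auth call can be mirrored with the standard library. This is a sketch under the assumptions visible in the curl example: a form-encoded password field and a JSON response containing access_token. The helper names (`build_login_request`, `fetch_token`, `auth_headers`) are hypothetical and not part of any DocForge SDK.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # same host as the curl examples


def build_login_request(password: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Form-encoded POST to /auth/login, mirroring the workspace-auth curl call."""
    body = urllib.parse.urlencode({"password": password}).encode("utf-8")
    return urllib.request.Request(f"{base_url}/auth/login", data=body, method="POST")


def fetch_token(password: str, base_url: str = BASE_URL) -> str:
    """Send the login request and return the JWT from the JSON response."""
    with urllib.request.urlopen(build_login_request(password, base_url)) as response:
        return json.load(response)["access_token"]


def auth_headers(token: str) -> dict[str, str]:
    """Header to attach to every subsequent request."""
    return {"Authorization": f"Bearer {token}"}
```

Because the JWT is short-lived, a long-running script would likely need to call `fetch_token` again when a request is rejected as unauthorized.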

Phase 1

Ingest a PDF

curl -X POST "http://localhost:8000/api/fin-report/ingest-fast" \
  -H "Authorization: Bearer $TOKEN" \
  -F "files=@/path/to/report.pdf"

Use /ingest for background DOCINDEX processing. Use /ingest-fast for synchronous fast Markdown output.

Phase 2

Archive all temp files

Archives every ingested file in the temp folder. The LLM infers dimension values automatically.

curl -X POST "http://localhost:8000/api/fin-report/archive" \
  -H "Authorization: Bearer $TOKEN"

Phase 3

Extract fields from an archived document

# Single document
curl -X POST "http://localhost:8000/api/fin-report/extract" \
  -H "Authorization: Bearer $TOKEN" \
  -G \
  --data-urlencode "path_level2=amundi_initiative-impact" \
  --data-urlencode "path_level3=2024q4" \
  --data-urlencode "schema_suffix=prtf"

# Entire snapshot date (all entities)
curl -X POST "http://localhost:8000/api/fin-report/extract/batch" \
  -H "Authorization: Bearer $TOKEN" \
  -G \
  --data-urlencode "path_level3=2024q4" \
  --data-urlencode "schema_suffix=prtf"
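
The two extract calls differ only in the endpoint and in whether path_level2 is present. The sketch below builds the same URLs in Python. The function `build_extract_url` is a hypothetical helper, and the parameter names are taken from the curl examples rather than from the full API spec at /docs.

```python
import urllib.parse
from typing import Optional

BASE_URL = "http://localhost:8000"  # same host as the curl examples


def build_extract_url(
    schema_suffix: str,
    path_level3: str,
    path_level2: Optional[str] = None,
    base_url: str = BASE_URL,
) -> str:
    """Return the extract URL: single document if path_level2 is given, else batch."""
    params = {}
    if path_level2 is None:
        endpoint = "/api/fin-report/extract/batch"
    else:
        endpoint = "/api/fin-report/extract"
        params["path_level2"] = path_level2
    params["path_level3"] = path_level3
    params["schema_suffix"] = schema_suffix
    # urlencode percent-escapes values, like curl's --data-urlencode.
    return f"{base_url}{endpoint}?{urllib.parse.urlencode(params)}"
```

The URL is then sent as a POST with the Authorization header, exactly as in the curl calls.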

Phase 4

Query the index

# Snapshot — latest document per entity at a given date
curl -X POST "http://localhost:8000/api/fin-report/select/snapshot?date=2024q4" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'

# Series — all documents across a date range
curl -X POST "http://localhost:8000/api/fin-report/select/series?start_date=2024q1&end_date=2024q4" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'

# With dimension filters
curl -X POST "http://localhost:8000/api/fin-report/select/snapshot?date=2024q4" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"config_params": {"asset-manager": "amundi"}}'

Response: {"matched_files": ["reedcapital/fin-report/amundi.../2024q4/..."], "count": 3}
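
The snapshot and series queries can be prepared the same way from Python. The following sketch builds the request without sending it and reads matched_files from a response shaped like the example above. `build_select_request` and `read_matches` are hypothetical names, and the body layout ({} or a config_params object) is inferred from the curl calls only.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # same host as the curl examples


def build_select_request(
    token: str,
    mode: str,
    query: dict,
    filters: Optional[dict] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build a POST for /select/snapshot or /select/series without sending it."""
    if mode not in ("snapshot", "series"):
        raise ValueError(f"unknown mode: {mode!r}")
    url = f"{base_url}/api/fin-report/select/{mode}?{urllib.parse.urlencode(query)}"
    # An empty JSON object means no dimension filters, as in the curl calls.
    body = {"config_params": filters} if filters else {}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def read_matches(response_body: bytes) -> list[str]:
    """Pull matched_files out of a response shaped like the example above."""
    return json.loads(response_body)["matched_files"]
```

To run a query, pass the request to `urllib.request.urlopen` and hand the bytes from `response.read()` to `read_matches`.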