User Guide

DocForge is a LAN-hosted pipeline that turns PDF documents into structured, queryable data — using an LLM-driven four-phase ETL workflow.

00 Getting Access — Sign In

Click Open App on the landing page. Enter your username and password, then click Sign In. The system auto-detects your account type and redirects you:

User

Regular user

Users authenticate with their group's shared password. If you belong to only one group, it is selected automatically. On success you are redirected to /app.

Admin

Admins authenticate with a personal password. On success you are redirected to /admin where you manage groups and users.

| Account type | Min length | Uppercase | Digits | Special chars |
|---|---|---|---|---|
| Regular user (group password) | 8 | ≥ 1 | ≥ 1 | ≥ 1 |
| Admin (personal password) | 12 | ≥ 1 | ≥ 3 | ≥ 1 |
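
The rules in the table can be checked on the client side before submitting a new password. The following is a minimal sketch, not the server's actual validator: the names PasswordPolicy and unmet_requirements are hypothetical, and it assumes that "special character" means ASCII punctuation, which the server may define differently.

```python
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class PasswordPolicy:
    min_length: int
    min_upper: int
    min_digits: int
    min_special: int


# Values copied from the password table in this guide.
GROUP_POLICY = PasswordPolicy(min_length=8, min_upper=1, min_digits=1, min_special=1)
ADMIN_POLICY = PasswordPolicy(min_length=12, min_upper=1, min_digits=3, min_special=1)


def unmet_requirements(password: str, policy: PasswordPolicy) -> list[str]:
    """Return the requirements the password fails (empty list means it passes)."""
    counts = {
        "uppercase letter(s)": (sum(c.isupper() for c in password), policy.min_upper),
        "digit(s)": (sum(c.isdigit() for c in password), policy.min_digits),
        # Assumption: "special" is taken to mean ASCII punctuation.
        "special character(s)": (
            sum(c in string.punctuation for c in password),
            policy.min_special,
        ),
    }
    problems = []
    if len(password) < policy.min_length:
        problems.append(f"at least {policy.min_length} characters")
    for label, (found, required) in counts.items():
        if found < required:
            problems.append(f"at least {required} {label}")
    return problems
```

Calling unmet_requirements(candidate, ADMIN_POLICY) returns human-readable messages that a form could show next to the password field.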

Admin Panel /admin

Admins can create groups, set group passwords, add or remove users, and reset passwords. Each group shares a single password — resetting it affects all members immediately.

Forgot password? Use the link in the sign-in modal to email fmonneret@gmail.com and request a reset. Admins can reset group passwords directly from the Admin Panel.

01 What It Does

DocForge automates the extraction of structured data from PDF reports (fund reports, financial statements, whitepapers). It converts unstructured PDF content into clean Markdown, infers document metadata, and extracts field-level values according to a user-defined JSON schema — all without manual copy-paste.

Input

PDF files uploaded via the browser UI

Output

JSON / Markdown / CSV with extracted fields, permanently archived by document dimensions

02 The Four-Phase Pipeline

  • Phase 1 — Ingest

    PDF is uploaded to a temp folder and converted to Markdown. Charts and scanned images are processed by a vision LLM and converted to data tables inline.



  • Phase 2 — Archive

    The LLM infers document dimensions (asset manager, fund, date, strategy) from the Markdown content, then moves the file to the permanent workspace tree and updates the index.



  • Phase 3 — Extract

    A schema-driven extraction prompt is built per field group. The LLM reads the archived document and populates the schema — with optional TransformEngine post-processing.



  • Phase 4 — Query

    Query the document index with date-range filters to get snapshots or time-series. Ask grounded questions via the chat interface.

03 Practical Workflow

Step 1

Select workspace & Upload PDF

  1. Pick a workspace (group + doc type) from the sidebar dropdowns.
  2. Drag a PDF into the upload zone on the Upload tab, or click to browse.
  3. Click Ingest (Fast) to convert it to Markdown immediately, or Ingest (Standard) for background processing that also builds a full DOCINDEX tree.
  4. Monitor progress in the server log panel at the bottom.

Step 2

Archive

  1. Switch to the Archive tab — ingested files appear in the list.
  2. Click Preview to see what dimensions the LLM inferred (asset manager, fund name, reporting date, etc.).
  3. Confirm or edit the values, then click Archive to move the file to the permanent tree.
  4. The document index (_index/index.json) is updated automatically.

Step 3

Extract

  1. Switch to the Extract tab.
  2. Select the schema to use (from _extract-templates/ in your workspace).
  3. Pick the archived document(s) to extract from.
  4. Click Run Extract — the LLM populates each field in the schema.
  5. Results are saved as {stem}-singles.md and {stem}-tables.md alongside the PDF.
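
To locate the extraction results from a script, the output names can be derived from the PDF path. This is a small sketch under the assumption that {stem} is the PDF file name without its extension; the helper name extract_output_paths and the example path are illustrative only.

```python
from pathlib import PurePosixPath


def extract_output_paths(pdf_path: str) -> tuple[PurePosixPath, PurePosixPath]:
    """Return the expected (singles, tables) Markdown paths next to the PDF."""
    pdf = PurePosixPath(pdf_path)
    stem = pdf.stem  # assumption: {stem} is the file name without ".pdf"
    singles = pdf.with_name(f"{stem}-singles.md")
    tables = pdf.with_name(f"{stem}-tables.md")
    return singles, tables


# Hypothetical archived PDF; real paths depend on your workspace tree.
singles, tables = extract_output_paths("workspace/2024q4/report.pdf")
print(singles)
print(tables)
```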

Step 4

Query & Chat

  1. Switch to the Query tab.
  2. Use the filter panel to select a date range and dimensions.
  3. Click Snapshot for point-in-time data or Series for time-series across reports.
  4. Use the Chat sub-tab to ask natural-language questions — answers are grounded in the archived documents.

04 Tabs Reference

| Tab | Group | Purpose |
|---|---|---|
| Upload | Workflow | Upload PDFs and run Ingest (Phase 1) |
| Archive | Workflow | Preview dimension inference, confirm archive, clean up temp files (Phase 2) |
| Extract | Workflow | Schema-driven LLM field extraction on archived documents (Phase 3) |
| Query | Workflow | Date-range document selection, grounded chat, save query results (Phase 4) |
| Templates | Workflow | Save, manage, and run reusable query templates, plus the extraction schema editor |
| Config | Settings | Pipeline settings: vision model, text model, concurrency, feature flags (merged cells, chart detection, table format…) |
| LLM Config | Settings | Model aliases, fallback chains, provider endpoints |
| Keys | Settings | API keys per workspace (stored encrypted in _config/.keys.yaml) |
| Schema | Settings | Edit extraction schemas (YAML): field types, formats, hints, transforms |
| Dashboard | Settings | Workspace overview: archive status, extract coverage, and query output counts per reporting date |

05 Dashboard

The Dashboard tab provides a live snapshot of your workspace's processing state. Select a reporting date from the shared date picker — all three status sections update together.

Global Stats

Date range, distinct entity counts (asset managers, fund names…), total files archived, pages processed, total file size.

Archived Stats

Per-entity archive status at the selected reporting date (T) and the preceding one (T−1): full (JSON tree + MD), compact (headers-only JSON), or md (Markdown only).

Extract Stats

Which extraction schemas have been run per entity at the selected date. One column per schema found in _extract-templates/.

Query Stats

Count of saved Snapshot and Series query outputs at T and T−1, scanned from queries-output/.

06 Tips & Notes

Fast vs Standard ingest: Fast is synchronous and sufficient for most documents. Standard runs in background and builds a full DOCINDEX tree — use it when you need deep section-level structure for large documents.
API keys: Keys are stored per workspace in _config/.keys.yaml — not in the environment. Set them once via the Keys tab.
Archive is idempotent: Re-archiving the same file overwrites the existing entry without duplicating it in the index.
Dashboard for progress tracking: After each reporting period, open the Dashboard tab to confirm all entities are archived (full), extracted across all schemas, and query outputs saved.
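
The idempotency tip can be pictured as an upsert keyed by the archive path. The sketch below is only an illustration of that idea: the real layout of _index/index.json is not documented here, so the dict-based index, the upsert_entry name, and the sample path are all assumptions.

```python
def upsert_entry(index: dict, archive_path: str, dimensions: dict) -> dict:
    """Insert or overwrite the entry for archive_path without duplicating it."""
    # Keying by path means a second archive run replaces the first entry.
    index[archive_path] = dimensions
    return index


index = {}
# Hypothetical archive path and dimension values, for illustration only.
upsert_entry(index, "fund-a/2024q4/report", {"fund": "fund-a", "date": "2024q4"})
upsert_entry(index, "fund-a/2024q4/report", {"fund": "fund-a", "date": "2024q4", "strategy": "impact"})
print(len(index))  # one entry, holding the most recent values
```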

07 API Access (SDK / curl)

Every pipeline operation is available as a REST API. Requests are authenticated with a short-lived JWT. The full interactive spec is at /docs.

Auth

Get a JWT token

There are two auth paths. Use workspace auth for programmatic API access (curl/SDK). Use user login when calling from a browser context or testing user flows.

Workspace auth (master password)

TOKEN=$(curl -s -X POST http://localhost:8000/auth/login \
  -d "password=YOUR_MASTER_PASSWORD" \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

User login (group password)

TOKEN=$(curl -s -X POST http://localhost:8000/auth/user/login \
  -H "Content-Type: application/json" \
  -d '{"username":"reedcapital","password":"Reed123!","group_id":1}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

Send the token with every subsequent request: -H "Authorization: Bearer $TOKEN"
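
For scripts that prefer Python over curl, the same header can be attached with the standard library. This is a sketch rather than an official SDK: authed_request is a hypothetical helper, and BASE_URL simply mirrors the host and port used in the curl examples.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def authed_request(path: str, token: str, method: str = "POST") -> urllib.request.Request:
    """Build (but do not send) a request that carries the JWT as a Bearer token."""
    return urllib.request.Request(
        BASE_URL + path,
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )


req = authed_request("/api/fin-report/archive", token="YOUR_TOKEN")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen(req) would send it; the sketch stops before that step so it can run without a server.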

Phase 1

Ingest a PDF

curl -X POST "http://localhost:8000/api/fin-report/ingest-fast" \
  -H "Authorization: Bearer $TOKEN" \
  -F "files=@/path/to/report.pdf"

Use /ingest for background DOCINDEX processing. Use /ingest-fast for synchronous fast Markdown output.
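
A script that switches between the two ingest modes only needs to change the last URL segment. The helper below is a hedged sketch: ingest_url is a hypothetical name, and it assumes that /ingest lives under the same /api/{doc_type}/ prefix as /ingest-fast, where fin-report is the doc type used throughout this guide.

```python
def ingest_url(doc_type: str, fast: bool = True,
               base_url: str = "http://localhost:8000") -> str:
    """Return the ingest endpoint for a doc type (fast or standard mode)."""
    # Assumption: both endpoints share the /api/{doc_type}/ prefix.
    endpoint = "ingest-fast" if fast else "ingest"
    return f"{base_url}/api/{doc_type}/{endpoint}"


print(ingest_url("fin-report"))              # synchronous Markdown output
print(ingest_url("fin-report", fast=False))  # background DOCINDEX processing
```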

Phase 2

Archive all temp files

Archives every ingested file in the temp folder. The LLM infers dimension values automatically.

curl -X POST "http://localhost:8000/api/fin-report/archive" \
  -H "Authorization: Bearer $TOKEN"

Phase 3

Extract fields from an archived document

# Single document
curl -X POST "http://localhost:8000/api/fin-report/extract" \
  -H "Authorization: Bearer $TOKEN" \
  -G \
  --data-urlencode "path_level2=amundi_initiative-impact" \
  --data-urlencode "path_level3=2024q4" \
  --data-urlencode "schema_suffix=prtf"

# Entire snapshot date (all entities)
curl -X POST "http://localhost:8000/api/fin-report/extract/batch" \
  -H "Authorization: Bearer $TOKEN" \
  -G \
  --data-urlencode "path_level3=2024q4" \
  --data-urlencode "schema_suffix=prtf"
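
In the curl calls above, -G tells curl to append the --data-urlencode values to the URL as a query string, while -X POST keeps the method as POST. The sketch below builds the equivalent URL in Python; extract_url is a hypothetical helper, and choosing the batch endpoint whenever path_level2 is omitted is an assumption drawn from the two examples.

```python
from typing import Optional
from urllib.parse import urlencode


def extract_url(doc_type: str, schema_suffix: str, path_level3: str,
                path_level2: Optional[str] = None,
                base_url: str = "http://localhost:8000") -> str:
    """Build the extract URL with its parameters in the query string."""
    params = {}
    if path_level2 is not None:
        params["path_level2"] = path_level2  # single-document extract
    params["path_level3"] = path_level3
    params["schema_suffix"] = schema_suffix
    # Assumption: no path_level2 means "every entity at this date" (batch).
    endpoint = "extract" if path_level2 is not None else "extract/batch"
    return f"{base_url}/api/{doc_type}/{endpoint}?{urlencode(params)}"


print(extract_url("fin-report", "prtf", "2024q4", "amundi_initiative-impact"))
print(extract_url("fin-report", "prtf", "2024q4"))
```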

Phase 4

Query the index

# Snapshot — latest document per entity at a given date
curl -X POST "http://localhost:8000/api/fin-report/select/snapshot?date=2024q4" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'

# Series — all documents across a date range
curl -X POST "http://localhost:8000/api/fin-report/select/series?start_date=2024q1&end_date=2024q4" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'

# With dimension filters
curl -X POST "http://localhost:8000/api/fin-report/select/snapshot?date=2024q4" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"config_params": {"asset-manager": "amundi"}}'

Response: {"matched_files": ["reedcapital/fin-report/amundi.../2024q4/..."], "count": 3}
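
When looping over reporting periods (for example, to request one snapshot per quarter), it helps to enumerate the labels between start_date and end_date. The sketch below assumes dates always follow the year-plus-quarter form seen in the examples (such as 2024q4); the API may accept other formats, and both function names are hypothetical.

```python
import re


def parse_quarter(label: str) -> tuple[int, int]:
    """Split a label such as '2024q4' into (year, quarter)."""
    match = re.fullmatch(r"(\d{4})q([1-4])", label)
    if match is None:
        raise ValueError(f"not a quarter label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def quarters_between(start: str, end: str) -> list[str]:
    """List every quarter label from start to end, inclusive."""
    year, quarter = parse_quarter(start)
    end_key = parse_quarter(end)
    labels = []
    while (year, quarter) <= end_key:
        labels.append(f"{year}q{quarter}")
        quarter += 1
        if quarter > 4:  # roll over into the next year
            year, quarter = year + 1, 1
    return labels


print(quarters_between("2024q1", "2024q4"))
```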

Phase 4

Grounded chat

Pass matched_files from a select call to ground the LLM answer in real documents.

curl -X POST "http://localhost:8000/api/fin-report/chat" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "matched_files": [
      "reedcapital/fin-report/amundi_initiative-impact/2024q4/amundi_initiative-impact_2024q4"
    ],
    "prompt": "What is the total AUM as of Q4 2024?"
  }'
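
Chaining a select call into a chat call amounts to copying matched_files from one JSON body into another. The following sketch shows that step in isolation; build_chat_payload is a hypothetical helper, and refusing to build a payload when nothing matched is a design choice of this example, not documented server behavior.

```python
import json


def build_chat_payload(select_response: str, prompt: str) -> str:
    """Turn a select response body into the JSON body for the chat endpoint."""
    data = json.loads(select_response)
    files = data.get("matched_files", [])
    if not files:
        # Without files there is nothing to ground the answer in.
        raise ValueError("select call matched no files")
    return json.dumps({"matched_files": files, "prompt": prompt})


# Sample response shaped like the one shown for the select calls.
sample = json.dumps({
    "matched_files": [
        "reedcapital/fin-report/amundi_initiative-impact/2024q4/amundi_initiative-impact_2024q4"
    ],
    "count": 1,
})
print(build_chat_payload(sample, "What is the total AUM as of Q4 2024?"))
```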

Replace fin-report with your doc_type in every URL. The workspace (group) is embedded in the JWT, so you do not need to pass it per request. The full spec is available at GET /docs.