User Guide
DocForge is a LAN-hosted pipeline that turns PDF documents into structured, queryable data — using an LLM-driven four-phase ETL workflow.
00 Getting Access — Sign In
Click Open App on the landing page. Enter your username and password, then click Sign In. The system auto-detects your account type and redirects you:
Regular user
Users authenticate with their group's shared password. If you belong to only one group, it is selected automatically. On success you are redirected to /app.
Admin
Admins authenticate with a personal password. On success you are redirected to /admin where you manage groups and users.
| Account type | Min length | Uppercase | Digits | Special char |
|---|---|---|---|---|
| Regular user (group password) | 8 | ≥ 1 | ≥ 1 | ≥ 1 |
| Admin (personal password) | 12 | ≥ 1 | ≥ 3 | ≥ 1 |
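The table above can be read as a simple rule set. The following is a minimal sketch of a client-side pre-check, not DocForge's actual validator: the function name `password_problems`, the `"user"`/`"admin"` keys, and the choice to treat "special char" as ASCII punctuation are all assumptions made for illustration.

```python
import string

# Minimum requirements per account type, copied from the table above.
POLICY = {
    "user": {"min_length": 8, "uppercase": 1, "digits": 1, "special": 1},
    "admin": {"min_length": 12, "uppercase": 1, "digits": 3, "special": 1},
}


def password_problems(password: str, account_type: str) -> list[str]:
    """Return the unmet requirements; an empty list means the password passes."""
    rules = POLICY[account_type]
    counts = {
        "uppercase": sum(c.isupper() for c in password),
        "digits": sum(c.isdigit() for c in password),
        # Assumption: "special" means ASCII punctuation such as ! or #.
        "special": sum(c in string.punctuation for c in password),
    }
    problems = []
    if len(password) < rules["min_length"]:
        problems.append(f"needs at least {rules['min_length']} characters")
    for kind, have in counts.items():
        if have < rules[kind]:
            problems.append(f"needs at least {rules[kind]} {kind} character(s)")
    return problems
```

Under these assumptions, a password such as `Reed123!` satisfies the regular-user row but fails the admin row on length alone.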
Admin Panel /admin
Admins can create groups, set group passwords, add or remove users, and reset passwords. Each group shares a single password — resetting it affects all members immediately.
01 What It Does
DocForge automates the extraction of structured data from PDF reports (fund reports, financial statements, whitepapers). It converts unstructured PDF content into clean Markdown, infers document metadata, and extracts field-level values according to a user-defined JSON schema — all without manual copy-paste.
Input
PDF files uploaded via the browser UI
Output
JSON / Markdown / CSV with extracted fields, permanently archived by document dimensions
02 The Four-Phase Pipeline
- Phase 1 (Ingest): The PDF is uploaded to a temp folder and converted to Markdown. Charts and scanned images are processed by a vision LLM and converted to data tables inline.
- Phase 2 (Archive): The LLM infers document dimensions (asset manager, fund, date, strategy) from the Markdown content, then moves the file to the permanent workspace tree and updates the index.
- Phase 3 (Extract): A schema-driven extraction prompt is built per field group. The LLM reads the archived document and populates the schema, with optional TransformEngine post-processing.
- Phase 4 (Query): Query the document index with date-range filters to get snapshots or time-series. Ask grounded questions via the chat interface.
03 Practical Workflow
Select workspace & Upload PDF
- Pick a workspace (group + doc type) from the sidebar dropdowns.
- Drag a PDF into the upload zone on the Upload tab, or click to browse.
- Click Ingest (Fast) to convert it to Markdown immediately, or Ingest (Standard) for background processing with a full DOCINDEX tree.
- Monitor progress in the server log panel at the bottom.
Archive
- Switch to the Archive tab — ingested files appear in the list.
- Click Preview to see what dimensions the LLM inferred (asset manager, fund name, reporting date, etc.).
- Confirm or edit the values, then click Archive to move the file to the permanent tree.
- The document index (_index/index.json) is updated automatically.
Extract
- Switch to the Extract tab.
- Select the schema to use (from _extract-templates/ in your workspace).
- Pick the archived document(s) to extract from.
- Click Run Extract — the LLM populates each field in the schema.
- Results are saved as {stem}-singles.md and {stem}-tables.md alongside the PDF.
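To locate the extraction results from a script, the naming rule in the last step can be sketched as below. This assumes that {stem} is the PDF file name without its .pdf extension; the helper name `extract_output_paths` is hypothetical.

```python
from pathlib import Path


def extract_output_paths(pdf_path: str) -> tuple[Path, Path]:
    """Return the expected (singles, tables) Markdown paths next to the PDF."""
    pdf = Path(pdf_path)
    stem = pdf.stem  # assumption: {stem} is the file name without ".pdf"
    return (
        pdf.with_name(f"{stem}-singles.md"),
        pdf.with_name(f"{stem}-tables.md"),
    )
```

Both returned paths share the PDF's parent directory, which matches the "alongside the PDF" wording above.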
Query & Chat
- Switch to the Query tab.
- Use the filter panel to select a date range and dimensions.
- Click Snapshot for point-in-time data or Series for time-series across reports.
- Use the Chat sub-tab to ask natural-language questions — answers are grounded in the archived documents.
04 Tabs Reference
| Tab | Group | Purpose |
|---|---|---|
| Upload | Workflow | Upload PDFs and run Ingest (Phase 1) |
| Archive | Workflow | Preview dimension inference, confirm archive, clean up temp files (Phase 2) |
| Extract | Workflow | Schema-driven LLM field extraction on archived documents (Phase 3) |
| Query | Workflow | Date-range document selection, grounded chat, save query results (Phase 4) |
| Templates | Workflow | Save, manage, and run reusable query templates + extraction schema editor |
| Config | Settings | Pipeline settings: vision model, text model, concurrency, feature flags (merged cells, chart detection, table format…) |
| LLM Config | Settings | Model aliases, fallback chains, provider endpoints |
| Keys | Settings | API keys per workspace (stored encrypted in _config/.keys.yaml) |
| Schema | Settings | Edit extraction schemas (YAML) — field types, formats, hints, transforms |
| Dashboard | Settings | Workspace overview: archive status, extract coverage, and query output counts per reporting date |
05 Dashboard
The Dashboard tab provides a live snapshot of your workspace's processing state. Select a reporting date from the shared date picker — all three status sections update together.
Global Stats
Date range, distinct entity counts (asset managers, fund names…), total files archived, pages processed, total file size.
Archived Stats
Per-entity archive status at T and T−1: full (JSON tree + MD), compact (headers-only JSON), or md (Markdown only).
Extract Stats
Which extraction schemas have been run per entity at the selected date. One column per schema found in _extract-templates/.
Query Stats
Count of saved Snapshot and Series query outputs at T and T−1, scanned from queries-output/.
06 Tips & Notes
API keys are stored encrypted in _config/.keys.yaml, not in the environment. Set them once via the Keys tab.
07 API Access (SDK / curl)
Every pipeline operation is available as a REST API. Requests are authenticated with a short-lived JWT. The full interactive spec is at /docs.
Get a JWT token
There are two authentication paths. Use workspace auth for programmatic API access (curl/SDK), and user login when calling from a browser context or testing user flows.
Workspace auth (master password)
TOKEN=$(curl -s -X POST http://localhost:8000/auth/login \
-d "password=YOUR_MASTER_PASSWORD" \
| python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
User login (group password)
TOKEN=$(curl -s -X POST http://localhost:8000/auth/user/login \
-H "Content-Type: application/json" \
-d '{"username":"reedcapital","password":"Reed123!","group_id":1}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
All subsequent requests: -H "Authorization: Bearer $TOKEN"
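The same two login calls can be made from Python with the standard library. This is a sketch under stated assumptions rather than an official SDK: the helper names are invented for illustration, and the response is assumed to carry an `access_token` field, as the curl examples imply. The request builders are kept separate from the network call so they can be inspected without a running server.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # same host as the curl examples


def workspace_login_request(master_password: str) -> urllib.request.Request:
    """Form-encoded POST to /auth/login, like curl -d "password=..."."""
    body = urllib.parse.urlencode({"password": master_password}).encode()
    return urllib.request.Request(f"{BASE_URL}/auth/login", data=body, method="POST")


def user_login_request(username: str, password: str, group_id: int) -> urllib.request.Request:
    """JSON POST to /auth/user/login with the group's shared password."""
    body = json.dumps(
        {"username": username, "password": password, "group_id": group_id}
    ).encode()
    return urllib.request.Request(
        f"{BASE_URL}/auth/user/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(request: urllib.request.Request) -> str:
    """Send a login request and return the JWT (needs a reachable server)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["access_token"]


def auth_header(token: str) -> dict[str, str]:
    """Header to attach to every subsequent API request."""
    return {"Authorization": f"Bearer {token}"}
```

Unlike a plain `curl -d`, `urlencode` percent-encodes the password, which matters if it contains characters such as `&` or spaces.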
Ingest a PDF
curl -X POST "http://localhost:8000/api/fin-report/ingest-fast" \
-H "Authorization: Bearer $TOKEN" \
-F "files=@/path/to/report.pdf"
Use /ingest for background DOCINDEX processing. Use /ingest-fast for synchronous fast Markdown output.
Archive all temp files
Archives every ingested file in the temp folder. The LLM infers dimension values automatically.
curl -X POST "http://localhost:8000/api/fin-report/archive" \
-H "Authorization: Bearer $TOKEN"
Extract fields from an archived document
# Single document
curl -X POST "http://localhost:8000/api/fin-report/extract" \
-H "Authorization: Bearer $TOKEN" \
-G \
--data-urlencode "path_level2=amundi_initiative-impact" \
--data-urlencode "path_level3=2024q4" \
--data-urlencode "schema_suffix=prtf"
# Entire snapshot date (all entities)
curl -X POST "http://localhost:8000/api/fin-report/extract/batch" \
-H "Authorization: Bearer $TOKEN" \
-G \
--data-urlencode "path_level3=2024q4" \
--data-urlencode "schema_suffix=prtf"
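The two extract calls differ only in the endpoint and in whether `path_level2` is supplied. A small sketch of building either URL in Python follows; the function name `extract_url` and its defaults are assumptions for illustration, while the parameter names are taken from the curl examples.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # same host as the curl examples


def extract_url(
    schema_suffix: str,
    path_level3: str,
    path_level2: Optional[str] = None,
    doc_type: str = "fin-report",
) -> str:
    """Build the single-document URL, or the batch URL when path_level2 is omitted."""
    params = {}
    if path_level2 is not None:
        endpoint = "extract"
        params["path_level2"] = path_level2
    else:
        endpoint = "extract/batch"  # all entities at the given date
    params["path_level3"] = path_level3
    params["schema_suffix"] = schema_suffix
    return f"{BASE_URL}/api/{doc_type}/{endpoint}?{urlencode(params)}"
```

The resulting URL is meant to be sent as a POST with the Authorization header, mirroring the `-X POST ... -G` combination used above.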
Query the index
# Snapshot — latest document per entity at a given date
curl -X POST "http://localhost:8000/api/fin-report/select/snapshot?date=2024q4" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{}'
# Series — all documents across a date range
curl -X POST "http://localhost:8000/api/fin-report/select/series?start_date=2024q1&end_date=2024q4" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{}'
# With dimension filters
curl -X POST "http://localhost:8000/api/fin-report/select/snapshot?date=2024q4" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"config_params": {"asset-manager": "amundi"}}'
Response: {"matched_files": ["reedcapital/fin-report/amundi.../2024q4/..."], "count": 3}
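The three select calls above share one shape: a mode (snapshot or series), date parameters in the query string, and optional dimension filters in the JSON body. A hedged Python sketch of that shape is shown below; `select_request` is a hypothetical helper, and only the endpoint paths, parameter names, and the `config_params` key come from the examples above.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # same host as the curl examples


def select_request(
    token: str,
    mode: str,
    query: dict,
    filters: Optional[dict] = None,
    doc_type: str = "fin-report",
) -> urllib.request.Request:
    """Build a POST to /select/snapshot or /select/series without sending it."""
    if mode not in ("snapshot", "series"):
        raise ValueError("mode must be 'snapshot' or 'series'")
    # An empty JSON object means "no dimension filters", as in the curl examples.
    body = {"config_params": filters} if filters else {}
    return urllib.request.Request(
        f"{BASE_URL}/api/{doc_type}/select/{mode}?{urlencode(query)}",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the built request to `urllib.request.urlopen` and decoding the body with `json.load` should yield the `matched_files` and `count` fields shown in the response above, assuming the server is reachable.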
Grounded chat
Pass matched_files from a select call to ground the LLM answer in real documents.
curl -X POST "http://localhost:8000/api/fin-report/chat" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"matched_files": [
"reedcapital/fin-report/amundi_initiative-impact/2024q4/amundi_initiative-impact_2024q4"
],
"prompt": "What is the total AUM as of Q4 2024?"
}'
Replace fin-report with your doc_type in every URL. The workspace (group) is embedded in the JWT, so there is no need to pass it per request. The full spec is at GET /docs.
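Chaining a select call into a grounded chat call, with the doc_type kept as a parameter, might look like the sketch below. The helper name `chat_request` is hypothetical, and refusing to build a request when `matched_files` is empty is a design choice made here, not documented server behaviour.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host as the curl examples


def chat_request(
    token: str,
    select_response: dict,
    prompt: str,
    doc_type: str = "fin-report",
) -> urllib.request.Request:
    """Build the grounded chat POST from a decoded select response."""
    files = select_response.get("matched_files", [])
    if not files:
        # Without matched files there is nothing to ground the answer in.
        raise ValueError("select response contains no matched_files")
    body = json.dumps({"matched_files": files, "prompt": prompt}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/{doc_type}/chat",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Only the token and the doc_type vary between workspaces and document types; the group itself never appears in the request because it travels inside the JWT.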