Frequently Asked Questions
Common questions about DocForge.
General
DocForge accepts PDF files only. The ingest pipeline uses PyMuPDF for text extraction and an optional vision LLM for charts, scanned images, and complex tables.
The application itself runs locally on your LAN. You do need internet access to reach external LLM API providers (Z.AI, Anthropic, OpenRouter, HuggingFace). Self-hosted LLM endpoints work without internet.
Files go through two stages:
- Temp: workspaces/{group}/{doc_type}/temp/01-input/ (original PDF) and temp/02-output/ (converted Markdown + JSON)
- Permanent: workspaces/{group}/{doc_type}/{level1}_{level2}/{temporal}/ (after the Archive phase)
Temp files can be cleaned up via the Archive tab after archiving.
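As a rough illustration of the two-stage layout, the sketch below builds the temp and permanent folder paths from their parts. The helper names are hypothetical and not part of DocForge; the example values reuse the group, doc type, and path levels that appear elsewhere in this FAQ.

```python
from pathlib import PurePosixPath


def temp_dirs(group: str, doc_type: str) -> tuple[PurePosixPath, PurePosixPath]:
    """Return the (input, output) temp folders for one workspace."""
    base = PurePosixPath("workspaces") / group / doc_type / "temp"
    return base / "01-input", base / "02-output"


def permanent_dir(group: str, doc_type: str, level1: str, level2: str,
                  temporal: str) -> PurePosixPath:
    """Return the permanent folder a document lands in after Archive."""
    return (PurePosixPath("workspaces") / group / doc_type
            / f"{level1}_{level2}" / temporal)


inp, out = temp_dirs("reedcapital", "fin-report")
print(inp)  # workspaces/reedcapital/fin-report/temp/01-input
print(out)  # workspaces/reedcapital/fin-report/temp/02-output
print(permanent_dir("reedcapital", "fin-report",
                    "amundi", "initiative-impact", "2024q4"))
# workspaces/reedcapital/fin-report/amundi_initiative-impact/2024q4
```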
Yes, several files can be ingested in one go. Select multiple files in the upload zone and click Ingest. Standard ingest processes them sequentially in a background queue; Fast ingest processes them one by one, synchronously. Archive and Extract also support multi-file selection.
Sign In & Access
Click Open App on the landing page. Enter your username and password, then click Sign in. If you belong to multiple groups, select the right one from the dropdown.
Admins sign in with their personal password. Regular users sign in with the group's shared password.
Regular users land on /app. Admins land on /admin.
A group is a named workspace team (e.g. reedcapital). All members of a group share one password. This keeps access management simple on a trusted LAN — one password controls the whole team's access.
If the group password needs to change, only the admin needs to update it (from the Admin Panel). All members then use the new password.
If you belong to exactly one group, the dropdown is hidden and that group is auto-selected. The dropdown only appears when you belong to more than one group.
Admin accounts never see the group dropdown — admins authenticate with a personal password, not a group one.
| Account type | Min length | Uppercase | Digits | Special char |
|---|---|---|---|---|
| Regular user (group password) | 8 | ≥ 1 | ≥ 1 | ≥ 1 |
| Admin (personal password) | 12 | ≥ 1 | ≥ 3 | ≥ 1 |
Strength indicators are shown live in the sign-in modal as you type.
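The rules in the table can be expressed as a small check. This is only a sketch: the class and function names are hypothetical, and it assumes that "special char" means ASCII punctuation, which the table does not specify.

```python
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class PasswordPolicy:
    min_length: int
    min_upper: int
    min_digits: int
    min_special: int


# Values copied from the table above.
REGULAR = PasswordPolicy(min_length=8, min_upper=1, min_digits=1, min_special=1)
ADMIN = PasswordPolicy(min_length=12, min_upper=1, min_digits=3, min_special=1)


def meets_policy(password: str, policy: PasswordPolicy) -> bool:
    """Return True if the password satisfies every minimum in the policy."""
    upper = sum(c.isupper() for c in password)
    digits = sum(c.isdigit() for c in password)
    # Assumption: a special character is any ASCII punctuation character.
    special = sum(c in string.punctuation for c in password)
    return (len(password) >= policy.min_length
            and upper >= policy.min_upper
            and digits >= policy.min_digits
            and special >= policy.min_special)


print(meets_policy("Passw0rd!", REGULAR))        # True
print(meets_policy("Passw0rd!", ADMIN))          # False (too short, too few digits)
print(meets_policy("MyAdminPass123!!!", ADMIN))  # True
```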
Regular user: Contact your group admin — they can reset the group password from the Admin Panel. Once reset, all members of that group use the new password.
Admin: Contact the server operator to reset your personal password in data/auth.db.
If you are unexpectedly asked to sign in again, your session token has expired (tokens are valid for 8 hours). Sign in again to get a new token. The same happens if you clear the browser's localStorage or open the app in a private window.
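To see how long a token has left, you can read its payload locally. The sketch below assumes the token is a standard three-part JWT carrying an exp claim (seconds since the epoch); it does not verify the signature, so use it for inspection only. The demo token is built by hand purely for illustration.

```python
import base64
import json
import time


def jwt_payload(token: str) -> dict:
    """Decode the (unverified) payload segment of a JWT."""
    payload_b64 = token.split(".")[1]
    # JWT segments drop base64 padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def seconds_left(token: str, now=None):
    """Seconds until the exp claim is reached (negative once expired)."""
    now = time.time() if now is None else now
    return jwt_payload(token)["exp"] - now


def _segment(obj: dict) -> str:
    raw = json.dumps(obj).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Hand-built demo token that expires 8 hours after the reference time.
demo = ".".join([
    _segment({"alg": "HS256", "typ": "JWT"}),
    _segment({"sub": "alice", "exp": 1_700_028_800}),
    "signature-not-checked",
])
print(seconds_left(demo, now=1_700_000_000))  # 28800
```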
From /admin, admins can:
- Create and delete groups
- Reset a group's shared password
- Add or remove users within a group
- Manage API keys — unlock with admin password, view/edit, save encrypted
Admins also have access to the full app at /app — navigate there from the navbar after signing in.
Phase 1 — Ingest
Fast ingest runs synchronously. It uses PyMuPDF4LLM for text extraction plus a concurrent VisionInterceptor pass for charts. It is suitable for most documents.
Standard ingest runs in the background. It builds a full DOCINDEX tree with structural hierarchy (section nodes, summaries). Use it when you need deep section-level context for complex documents.
The VisionInterceptor detects raster and vector chart regions on each page. Each region is cropped, sent to the configured vision LLM (e.g. qwen3-30-hf), and the response table is injected inline into the Markdown output. The original chart area is replaced in the PDF so PyMuPDF doesn't garble it.
Garbled tables in the Markdown output usually mean that PyMuPDF interpreted vector chart lines as table separators. Enable Cleanup Garbled Rows and Reconstruct Garbled Tables in the Config panel. For merged-cell tables, enable Demerge Table Cells.
Phase 2 — Archive
Dimensions are the metadata axes that identify a document — for example: asset manager, fund name, reporting date, strategy. They are defined in _config/dimensions.json per workspace.
During Archive, the LLM reads the Markdown and infers each dimension's value. These values determine the permanent file path and appear in the search index.
Yes, inferred values can be reviewed and corrected before archiving. Click Preview in the Archive tab, edit any inferred value in the preview form, then click Archive to confirm.
Archiving the same document again overwrites the file in the permanent tree and updates its index entry. The operation is idempotent, so no duplicates are created.
Phase 3 — Extract
A schema is a YAML file that defines what fields to extract from documents — type, format, label, hint, and optional transform operations. Schemas live in _extract-templates/ inside your workspace (e.g. extract-schema-prtf.yaml).
You can create and edit schemas directly in the Schema config tab.
| Type | Description |
|---|---|
| string | Free-text value |
| number | Numeric value |
| monetary | Currency amount (with format: M, k) |
| boolean | True / False |
| dict[T] | Date-keyed dict of values (time series) |
Each schema field can have a transform key with comma-separated operations. Examples:
- skip_thousand_separator removes , from numbers like 1,234
- strip_currency_symbols removes €, $, £ etc.
- parse_date('DD/MM/YYYY','YYYY-MM-DD') reformats dates
- constant('N/A') always outputs a fixed value
- lambda x: float(x) * 100 applies an arbitrary Python expression
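To make the idea of chained transforms concrete, here is a minimal sketch of how such operations could behave when applied in order. The function names mirror the operation names above, but the bodies are guesses at their behaviour, not DocForge's implementation, and apply_transforms is a hypothetical helper.

```python
import re


def skip_thousand_separator(value: str) -> str:
    """Drop commas used as thousand separators: '1,234' -> '1234'."""
    return value.replace(",", "")


def strip_currency_symbols(value: str) -> str:
    """Remove a few common currency symbols and surrounding whitespace."""
    return re.sub(r"[€$£¥]", "", value).strip()


def apply_transforms(value, transforms):
    """Apply each transform in order, feeding each result into the next."""
    for transform in transforms:
        value = transform(value)
    return value


cleaned = apply_transforms(
    "€1,234.50",
    [strip_currency_symbols, skip_thousand_separator, float],
)
print(cleaned)  # 1234.5

# An arbitrary expression, as in the last example above.
to_percent = lambda x: float(x) * 100
print(to_percent("0.125"))  # 12.5
```

Order matters in a chain like this: converting to float before removing the separator would fail on "1,234.50".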
Common causes of empty or missing extracted values:
- The schema field labels or hints don't match the document's terminology. Try adding synonyms to the hint field.
- The document wasn't archived yet (Extract needs the permanent .json tree).
- The LLM timed out. Increase concurrency or switch to a faster model in Config.
- The field is marked rarely_present: true, so the LLM only extracts it when very confident.
Phase 4 — Query
Snapshot: Returns one document per matching entity at a single selected date. Filters are multi-select — you can compare multiple entities side by side.
Series: Returns all documents for a single entity across a date range (start → end). Filters are single-select — pick one entity to follow over time.
Trend: Compares documents at two specific dates (T and T‑1). Only entities that have a document at both dates are included — unpaired documents are discarded. Filters are multi-select.
Trend mode performs a strict entity intersection between the two selected dates. If an entity has a document at T but not at T‑1 (or vice versa), it is silently excluded from the result set.
The selection result shows the number of matched entity pairs and the total document count (always 2× the entity count). This guarantees the LLM always receives complete before/after pairs.
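The strict intersection can be pictured as a set operation on entity names. The sketch below is illustrative only: the function name and the file paths are made up, and it assumes one document per entity per date.

```python
def pair_for_trend(docs_t: dict[str, str],
                   docs_prev: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Keep only entities present at both dates, as (T-1 doc, T doc) pairs."""
    common = docs_t.keys() & docs_prev.keys()
    return {entity: (docs_prev[entity], docs_t[entity])
            for entity in sorted(common)}


at_t = {"fund-a": "fund-a/2024q4.json", "fund-b": "fund-b/2024q4.json"}
at_prev = {"fund-a": "fund-a/2024q3.json", "fund-c": "fund-c/2024q3.json"}

pairs = pair_for_trend(at_t, at_prev)
print(pairs)  # {'fund-a': ('fund-a/2024q3.json', 'fund-a/2024q4.json')}

# fund-b and fund-c are dropped; the document count is twice the pair count.
print(len(pairs), 2 * len(pairs))  # 1 2
```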
The Previous Date selector only shows dates earlier than or equal to the selected Date — it will never be empty because the current Date is always included as a fallback.
The Chat tab lets you ask natural-language questions. It retrieves relevant document passages from the archived Markdown files and passes them as context to the LLM. Answers are grounded in actual document content — not hallucinated from model memory.
Query templates let you save and reuse query configurations (filters + prompt). They are stored as Markdown files in _query-templates/ inside your workspace. Manage them from the Templates tab — create, edit, and run templates directly without re-entering parameters each time.
Yes, query results can be saved. After running a query, use the Save Output action to persist the result as a Markdown file in queries-output/{snapshot|series|trend}/{reporting-date}/. The Dashboard panel shows the number of saved outputs.
Transmuted mode handles queries over a large number of documents that would exceed the LLM's context window. Instead of sending all documents in one call, the system:
- Rewrites your cross-document question as a simpler single-document question (the transmuted question).
- Runs that simpler question in parallel against each document individually.
- Combines all per-document answers in a final reduce step to produce the answer to your original question.
The Query Panel shows a token budget indicator after document selection. When the total exceeds 100 000 tokens, Transmuted mode is recommended automatically.
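The three steps follow a map-reduce pattern, which can be sketched as below. Everything here is a stand-in: ask_one replaces the per-document LLM call, the reduce step is a plain sum, and the function names are hypothetical. Only the 100 000-token threshold comes from the text above.

```python
from concurrent.futures import ThreadPoolExecutor

TOKEN_BUDGET = 100_000  # threshold quoted above


def should_transmute(token_counts: list[int]) -> bool:
    """Recommend Transmuted mode when the selection exceeds the budget."""
    return sum(token_counts) > TOKEN_BUDGET


def run_transmuted(docs, ask_one, reduce_all, max_workers: int = 4):
    """Ask the single-document question per document, then reduce."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so answers line up with docs.
        answers = list(pool.map(ask_one, docs))
    return reduce_all(answers)


docs = [{"name": "fund-a", "aum": 120.0}, {"name": "fund-b", "aum": 80.5}]


def ask_one(doc: dict) -> float:
    # Stand-in for the LLM answering "What is this fund's AUM?"
    return doc["aum"]


print(should_transmute([60_000, 50_000]))   # True
print(run_transmuted(docs, ask_one, sum))   # 200.5
```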
| Format | Token cost | Best for |
|---|---|---|
| Minified JSON | Medium (default) | Specific metrics, allocations, tables — works for most questions |
| JSON (full) | Highest | Maximum accuracy on complex nested data when token budget is not an issue |
| Compact JSON | Lowest | Asking about document structure ("which section covers X?") |
| Markdown | Medium | Prose summaries, strategy descriptions, evidence finding, narrative questions |
Not sure? Minified JSON is the right default — it provides the same extraction accuracy as full JSON at ~40% fewer tokens.
After each document answers the single-doc question individually, the reduce step combines all per-document answers into a final cross-document answer.
Two reduce strategies are used depending on the query type:
- Deterministic: max, min, sort, delta (T–T-1), series assembly — no LLM call, computed directly from structured answers.
- LLM-based: filter, similarity, aggregate — a final QUERY_MODEL call assembles the per-document answers into a narrative conclusion.
The reduce operation is determined automatically from the transmuted question metadata and shown to you before execution.
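The deterministic strategies need no model call because they are ordinary computations over the per-document answers. A minimal sketch, assuming each answer has already been parsed into a number keyed by entity (the function names and figures are illustrative):

```python
def reduce_max(answers: dict[str, float]) -> tuple[str, float]:
    """Return the entity with the largest value."""
    entity = max(answers, key=answers.get)
    return entity, answers[entity]


def reduce_sort(answers: dict[str, float], descending: bool = True):
    """Return (entity, value) pairs ordered by value."""
    return sorted(answers.items(), key=lambda kv: kv[1], reverse=descending)


def reduce_delta(at_t: dict[str, float],
                 at_prev: dict[str, float]) -> dict[str, float]:
    """Value at T minus value at T-1, for entities present at both dates."""
    return {e: at_t[e] - at_prev[e] for e in at_t.keys() & at_prev.keys()}


q4 = {"fund-a": 120.0, "fund-b": 80.5}
q3 = {"fund-a": 110.0, "fund-b": 90.5}

print(reduce_max(q4))   # ('fund-a', 120.0)
print(reduce_sort(q4))  # [('fund-a', 120.0), ('fund-b', 80.5)]
print(reduce_delta(q4, q3) == {"fund-a": 10.0, "fund-b": -10.0})  # True
```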
Dashboard
The Dashboard tab gives a live workspace overview with four sections:
- Global Stats: Date range, distinct entity counts, total files archived, pages processed, total file size.
- Archived Stats: Per-entity archive status (full JSON / compact JSON / Markdown) at the selected date and the previous date.
- Extract Stats: Which extraction schemas have been run per entity at the selected date.
- Query Stats: Number of saved query outputs (Snapshot and Series) per reporting date.
A shared date selector at the top drives all three status sections.
All data is computed on-demand from _index/index.json (archive and extract status) and the queries-output/ folder (query counts). No separate database — click Refresh to reload.
LLM & Configuration
Edit the LLM Config tab to update model aliases and provider endpoints. Add or update API keys in the Keys Panel (Admin page) — unlock with your admin password, paste keys in YAML format, and save. The router supports automatic fallback chains — if one model fails, the next in the ranking is tried.
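A fallback chain amounts to trying each model in ranking order until one succeeds. The sketch below shows that pattern with stub callables in place of real provider clients; the function name and aliases are hypothetical and the router's actual behaviour may differ.

```python
from typing import Callable, Sequence


def call_with_fallback(
    prompt: str,
    models: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (alias, call) in order; return the first successful answer."""
    errors = []
    for alias, call in models:
        try:
            return alias, call(prompt)
        except Exception as exc:  # any failure moves on to the next model
            errors.append(f"{alias}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def flaky(prompt: str) -> str:
    raise TimeoutError("timed out")


def healthy(prompt: str) -> str:
    return f"echo: {prompt}"


print(call_with_fallback("ping", [("primary", flaky), ("backup", healthy)]))
# ('backup', 'echo: ping')
```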
API keys are stored in workspaces/{group}/{doc_type}/_config/.keys.yaml.enc, encrypted with Fernet (PBKDF2 + MASTER_PASSWORD). A plaintext .keys.yaml is accepted as a read-only fallback and is migrated to the encrypted file automatically on first save.
Admins manage keys from the Keys Panel in the Admin page:
- Select a workspace (group + doc type) — keys are shown masked.
- Click Unlock and enter your admin password to decrypt and edit.
- Paste keys in YAML format and click Save — encrypted automatically.
Keys are never stored in .env or committed to source control.
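For readers unfamiliar with the PBKDF2 plus Fernet combination, the sketch below shows how a password can be stretched into a key of the shape Fernet expects (32 bytes, URL-safe base64). The hash function, salt handling, and iteration count are assumptions chosen for illustration; DocForge's actual parameters are not documented here.

```python
import base64
import hashlib
import os


def derive_fernet_key(master_password: str, salt: bytes,
                      iterations: int = 480_000) -> bytes:
    """Stretch a password into a 32-byte key, URL-safe base64 encoded."""
    raw = hashlib.pbkdf2_hmac(
        "sha256",                        # assumed hash function
        master_password.encode("utf-8"),
        salt,
        iterations,                      # assumed work factor
        dklen=32,                        # Fernet keys are 32 bytes
    )
    return base64.urlsafe_b64encode(raw)


salt = os.urandom(16)  # must be stored next to the ciphertext to re-derive
key = derive_fernet_key("example-master-password", salt)
print(len(key))  # 44
```

With the third-party cryptography package installed, a key of this shape can be passed to cryptography.fernet.Fernet(key) to encrypt and decrypt the YAML bytes.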
Chart extraction is the main bottleneck — each chart image requires a separate vision LLM call. You can speed it up by:
- Increasing Vision Concurrency in Config (default: 3)
- Switching to a faster vision model
- Disabling Two-Pass Image Detection if your documents have few charts
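A concurrency limit such as Vision Concurrency is commonly implemented with a semaphore that caps the number of in-flight calls. The sketch below illustrates the idea with a fake vision call; the function names are hypothetical and the real pipeline may work differently.

```python
import asyncio


async def describe_charts(charts, describe, concurrency: int = 3):
    """Run describe(chart) for every chart, at most `concurrency` at a time."""
    gate = asyncio.Semaphore(concurrency)

    async def guarded(chart):
        async with gate:  # waits here while `concurrency` calls are running
            return await describe(chart)

    # gather returns results in the same order as the input charts.
    return await asyncio.gather(*(guarded(c) for c in charts))


async def fake_vision_call(chart: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for network latency
    return f"table for {chart}"


charts = ["p1-chart1", "p1-chart2", "p2-chart1", "p3-chart1"]
results = asyncio.run(describe_charts(charts, fake_vision_call))
print(results[0])  # table for p1-chart1
```

Raising the limit shortens total wall time only until the provider's rate limit or the model's own throughput becomes the constraint.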
API & SDK
All API calls require a JWT Bearer token. Get one by logging in:
User login (single-step, works for both admins and regular users):
# Look up username (optional — confirms groups for regular users)
curl -s "http://localhost:8000/auth/user/lookup?username=alice"
# → {"is_admin": true, "groups": []}
# Login
TOKEN=$(curl -s -X POST http://localhost:8000/auth/user/login \
-H "Content-Type: application/json" \
-d '{"username":"alice","password":"MyAdminPass123!!!"}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
Pass the token on every request: -H "Authorization: Bearer $TOKEN"
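The same login flow can be scripted in Python with only the standard library. This is a sketch rather than an official SDK: the helper names are hypothetical, and the endpoint, request body, and access_token field are taken from the curl example above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def login_request(username: str, password: str) -> urllib.request.Request:
    """Build the POST /auth/user/login request without sending it."""
    body = json.dumps({"username": username, "password": password})
    return urllib.request.Request(
        f"{BASE_URL}/auth/user/login",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(username: str, password: str) -> str:
    """Send the login request and return the access token."""
    with urllib.request.urlopen(login_request(username, password)) as resp:
        return json.load(resp)["access_token"]


def with_bearer(req: urllib.request.Request, token: str) -> urllib.request.Request:
    """Attach the Authorization header required by /api/* routes."""
    req.add_header("Authorization", f"Bearer {token}")
    return req


# Building requests needs no running server; only urlopen touches the network.
authed = with_bearer(
    urllib.request.Request(f"{BASE_URL}/api/fin-report/archive", method="POST"),
    "example-token",
)
print(authed.get_header("Authorization"))  # Bearer example-token
```

Against a running instance, fetch_token("alice", "...") would replace the placeholder token.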
The /docs page loads RapiDoc with the spec served from app/project-docs/openapi.json — a static pre-generated file, not the live FastAPI spec at /api/openapi.json.
If you add or change routes, re-generate it:
python app/project-scripts/export_openapi.py
Add --yaml to also export a YAML version.
# 1. Login
TOKEN=$(curl -s -X POST http://localhost:8000/auth/user/login \
-H "Content-Type: application/json" \
-d '{"username":"alice","password":"MyAdminPass123!!!"}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
# 2. Ingest (fast, synchronous)
curl -X POST "http://localhost:8000/api/fin-report/ingest-fast" \
-H "Authorization: Bearer $TOKEN" \
-F "files=@report.pdf"
# 3. Archive all temp files
curl -X POST "http://localhost:8000/api/fin-report/archive" \
-H "Authorization: Bearer $TOKEN"
# 4. Extract (single document)
curl -X POST "http://localhost:8000/api/fin-report/extract" \
-H "Authorization: Bearer $TOKEN" \
-G \
--data-urlencode "path_level2=amundi_initiative-impact" \
--data-urlencode "path_level3=2024q4" \
--data-urlencode "schema_suffix=prtf"
# 5. Query snapshot
curl -X POST "http://localhost:8000/api/fin-report/select/snapshot?date=2024q4" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{}'
Replace fin-report with your doc type. Full parameter reference: /docs.
Public (no token): Landing page, sign-in, user lookup (POST /auth/user/login, GET /auth/user/lookup), static doc pages (/guide, /faq, /docs).
JWT required: all /api/* routes, UI workspace routes, Admin panel, Keys Panel.
The token's sub claim identifies the user and their role (admin or regular). Admin tokens unlock keys management.
First select matching documents, then pass the paths as context to chat:
# Step 1: get matched files
RESULT=$(curl -s -X POST \
"http://localhost:8000/api/fin-report/select/snapshot?date=2024q4" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{}')
# Step 2: extract matched_files and send to chat
MATCHED=$(echo "$RESULT" | python3 -c "import sys,json; print(json.dumps(json.load(sys.stdin)['matched_files']))")
curl -X POST "http://localhost:8000/api/fin-report/chat" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d "{\"matched_files\": $MATCHED, \"prompt\": \"What is the total AUM?\"}"
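Building the chat body in Python avoids the shell quoting in step 2, since json.dumps escapes quotes inside the prompt. The sketch assumes, as the curl example does, that the snapshot response has a matched_files key and that chat accepts matched_files and prompt; the sample path is invented.

```python
import json


def chat_payload(snapshot_result: dict, prompt: str) -> str:
    """Serialize the chat request body from a snapshot selection result."""
    return json.dumps({
        "matched_files": snapshot_result["matched_files"],
        "prompt": prompt,
    })


# Assumed response shape: a list of matched document paths.
snapshot_result = {"matched_files": ["amundi_initiative-impact/2024q4"]}
body = chat_payload(snapshot_result, 'What is the "total" AUM?')
print(json.loads(body)["prompt"])  # What is the "total" AUM?
```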