FAQ — DocForge

General

What file types does DocForge support?

PDF files only. The ingest pipeline uses PyMuPDF for text extraction and an optional vision LLM for charts, scanned images, and complex tables.

Do I need an internet connection?

The application itself runs locally on your LAN. You do need internet access to reach external LLM API providers (Z.AI, Anthropic, OpenRouter, HuggingFace). Self-hosted LLM endpoints work without internet.

Where are uploaded files stored?

Files go through two stages:

Temp: workspaces/{group}/{doc_type}/temp/01-input/ (original PDF) and temp/02-output/ (converted Markdown + JSON)
Permanent: workspaces/{group}/{doc_type}/{level1}_{level2}/{temporal}/ (after Archive phase)

Temp files can be cleaned up via the Archive tab after archiving.
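As a rough illustration of the two-stage layout above, the directory paths can be assembled as follows. The function names are hypothetical and not part of DocForge; the example values are borrowed from elsewhere in this FAQ.

```python
from pathlib import PurePosixPath


def temp_dirs(group: str, doc_type: str) -> tuple[PurePosixPath, PurePosixPath]:
    """Return the (input, output) temp directories for a workspace."""
    base = PurePosixPath("workspaces") / group / doc_type / "temp"
    return base / "01-input", base / "02-output"


def permanent_dir(group: str, doc_type: str, level1: str, level2: str,
                  temporal: str) -> PurePosixPath:
    """Return the permanent directory a document lands in after Archive."""
    return (PurePosixPath("workspaces") / group / doc_type
            / f"{level1}_{level2}" / temporal)


# Hypothetical values, chosen to match the examples used in this FAQ.
print(permanent_dir("reedcapital", "fin-report",
                    "amundi", "initiative-impact", "2024q4"))
```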

Can I process multiple PDFs at once?

Yes. Select multiple files in the upload zone and click Ingest. Standard ingest places them on a background queue and processes them sequentially; Fast ingest processes them one by one, synchronously. Archive and Extract also support multi-file selection.

Sign In & Access

How do I sign in?

Click Open App on the landing page. Enter your username and password, then click Sign in. If you belong to multiple groups, select the right one from the dropdown.

Admins sign in with their personal password. Regular users sign in with the group's shared password.

Regular users land on /app. Admins land on /admin.

What is a group and why don't I have an individual password?

A group is a named workspace team (e.g. reedcapital). All members of a group share one password. This keeps access management simple on a trusted LAN — one password controls the whole team's access.

If the group password needs to change, only the admin needs to update it (from the Admin Panel). All members then use the new password.

Why is the group dropdown not shown on my login?

If you belong to exactly one group, the dropdown is hidden and that group is auto-selected. The dropdown only appears when you belong to more than one group.

Admin accounts never see the group dropdown — admins authenticate with a personal password, not a group one.

What are the password strength requirements?

Account type	Min length	Uppercase	Digits	Special char
Regular user (group password)	8	≥ 1	≥ 1	≥ 1
Admin (personal password)	12	≥ 1	≥ 3	≥ 1

Strength indicators are shown live in the sign-in modal as you type.
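The rules in the table can be checked with a few lines of Python. This is only a sketch: it assumes that "special char" means ASCII punctuation, which the table does not specify, and the names below are illustrative rather than DocForge's own.

```python
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class PasswordPolicy:
    min_length: int
    min_upper: int
    min_digits: int
    min_special: int


# Values copied from the table above.
GROUP_POLICY = PasswordPolicy(min_length=8, min_upper=1, min_digits=1, min_special=1)
ADMIN_POLICY = PasswordPolicy(min_length=12, min_upper=1, min_digits=3, min_special=1)


def meets_policy(password: str, policy: PasswordPolicy) -> bool:
    """Return True when the password satisfies every rule of the policy."""
    upper = sum(c.isupper() for c in password)
    digits = sum(c.isdigit() for c in password)
    # Assumption: a "special char" is any ASCII punctuation character.
    special = sum(c in string.punctuation for c in password)
    return (len(password) >= policy.min_length
            and upper >= policy.min_upper
            and digits >= policy.min_digits
            and special >= policy.min_special)
```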

I forgot my password — what do I do?

Regular user: Contact your group admin — they can reset the group password from the Admin Panel. Once reset, all members of that group use the new password.

Admin: Contact the server operator to reset your personal password in data/auth.db.

I'm redirected back to the landing page — why?

Your session token has expired (tokens are valid for 8 hours). Sign in again to get a new token. This also happens if you clear browser localStorage or open the app in a private window.

What can the admin do that regular users can't?

From /admin, admins can:

Create and delete groups
Reset a group's shared password
Add or remove users within a group
Manage API keys — unlock with admin password, view/edit, save encrypted

Admins also have access to the full app at /app — navigate there from the navbar after signing in.

Phase 1 — Ingest

What's the difference between Fast and Standard ingest?

Fast ingest runs synchronously. It uses PyMuPDF4LLM for text plus a concurrent VisionInterceptor pass for charts, and is suitable for most documents.

Standard ingest runs in the background. It builds a full DOCINDEX tree with structural hierarchy (section nodes, summaries). Use it when you need deep section-level context for complex documents.

How are charts handled?

The VisionInterceptor detects raster and vector chart regions on each page. Each region is cropped, sent to the configured vision LLM (e.g. qwen3-30-hf), and the response table is injected inline into the Markdown output. The original chart area is replaced in the PDF so PyMuPDF doesn't garble it.

Some pages show "Vision partial" — what does that mean?

The vision LLM failed to convert charts on certain pages. Text extraction still completed for the entire document. Failed page numbers are shown in the ingest progress. To fix them, enter those page numbers in the Pages input and enable Morph so the re-ingested pages are merged into the global output.

What is the Morph feature?

Morph lets you re-ingest specific pages and merge the result into the existing global document. Workflow:

Enter page numbers (e.g. 43 or 1-5) in the Pages field.
If a global {stem}.md exists, the Morph checkbox appears.
First partial ingest auto-bootstraps the global file. Subsequent ingests merge into it.

The page-specific output is always kept as {stem}_p{range}.md alongside the global file.
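The Pages syntax and the page-specific filename can be sketched as below. This is an assumption-laden illustration: it handles only the two forms shown above (a single page or a start-end range), and it assumes the {range} part of the filename is the text typed into the Pages field.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a Pages value such as "43" or "1-5" into page numbers."""
    spec = spec.strip()
    if "-" in spec:
        start, end = (int(part) for part in spec.split("-", 1))
        if start > end:
            raise ValueError(f"invalid page range: {spec!r}")
        return list(range(start, end + 1))
    return [int(spec)]


def partial_output_name(stem: str, spec: str) -> str:
    """Build the page-specific filename kept alongside the global file."""
    # Assumption: {range} is the Pages value as entered.
    return f"{stem}_p{spec.strip()}.md"
```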

Why is ingest slow for documents with many charts?

Each chart image requires a separate vision LLM call. DocForge throttles requests with a configurable inter-request delay (Vision Inter-Request Delay in Config, default 0.5s) to avoid rate limits. You can increase Vision Concurrency (default: 3) or switch to a faster vision model to speed things up.
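The interplay between concurrency and the inter-request delay can be pictured with a small asyncio sketch. How DocForge combines the two settings internally is not documented here, so treat this as one plausible arrangement; convert is a placeholder for the vision LLM call.

```python
import asyncio
from typing import Awaitable, Callable


async def convert_charts(charts: list[str],
                         convert: Callable[[str], Awaitable[str]],
                         concurrency: int = 3,
                         delay: float = 0.5) -> list[str]:
    """Convert charts with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(chart: str) -> str:
        async with semaphore:
            result = await convert(chart)
            # Hold the slot briefly so requests are spaced out (assumed policy).
            await asyncio.sleep(delay)
            return result

    # gather preserves the input order of the charts.
    return await asyncio.gather(*(worker(c) for c in charts))
```

With this arrangement, raising concurrency shortens total time roughly in proportion until the provider's rate limit becomes the constraint.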

The Markdown output has garbled table rows — what happened?

This usually means PyMuPDF detected vector chart lines as table separators. Enable Cleanup Garbled Rows and Reconstruct Garbled Tables in the Config panel. For merged-cell tables, enable Demerge Table Cells.

Phase 2 — Archive

What are "dimensions"?

Dimensions are the metadata axes that identify a document — for example: asset manager, fund name, reporting date, strategy. They are defined in _config/dimensions.json per workspace.

During Archive, the LLM reads the Markdown and infers each dimension's value. These values determine the permanent file path and appear in the search index.

The LLM inferred the wrong dimension value — can I fix it?

Yes. Click Preview in the Archive tab before confirming. You can manually edit any inferred value — including the suggested filename (auto-generated from the document description) — in the preview form before clicking Archive.

What happens if I archive the same document twice?

The file is overwritten in the permanent tree and the index entry is updated. It is idempotent — no duplicates are created.

Phase 3 — Extract

What is a schema and where do I put it?

A schema is a YAML file that defines what fields to extract from documents — type, format, label, hint, and optional transform operations. Schemas live in _extract-templates/ inside your workspace (e.g. extract-schema-prtf.yaml).

You can create and edit schemas directly in the Schema config tab.

What schema column types are supported?

Type	Description
string	Free-text value
number	Numeric value
monetary	Currency amount (with format: M, k)
boolean	True / False
dict[T]	Date-keyed dict of values (time series)

How do TransformEngine operations work?

Each schema field can have a transform key with comma-separated operations. Examples:

skip_thousand_separator — removes , in numbers like 1,234
strip_currency_symbols — removes €, $, £ etc.
parse_date('DD/MM/YYYY','YYYY-MM-DD') — reformats dates
constant('N/A') — always outputs a fixed value
lambda x: float(x) * 100 — arbitrary Python expression
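A chain of operations can be thought of as functions applied one after another. The sketch below is not the TransformEngine itself: it reimplements two of the operations listed above under the assumption that they are applied left to right, and the currency symbol set is illustrative.

```python
import re
from typing import Callable


def skip_thousand_separator(value: str) -> str:
    """Remove comma thousands separators, e.g. "1,234" becomes "1234"."""
    return value.replace(",", "")


def strip_currency_symbols(value: str) -> str:
    """Remove common currency symbols (illustrative set) and outer spaces."""
    return re.sub(r"[€$£¥]", "", value).strip()


OPERATIONS: dict[str, Callable[[str], str]] = {
    "skip_thousand_separator": skip_thousand_separator,
    "strip_currency_symbols": strip_currency_symbols,
}


def apply_transforms(value: str, names: list[str]) -> str:
    """Apply the named operations in order (assumed left to right)."""
    for name in names:
        value = OPERATIONS[name](value)
    return value


cleaned = apply_transforms("€1,234", ["strip_currency_symbols",
                                       "skip_thousand_separator"])
print(float(cleaned) * 100)  # same effect as the lambda example above
```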

Extract returned all nulls — what went wrong?

Common causes:

The schema field labels or hints don't match the document's terminology — try adding synonyms to the hint field.
The document wasn't archived yet (Extract needs the permanent .json tree).
The LLM timed out — increase concurrency or switch to a faster model in Config.
The field is marked rarely_present: true — the LLM only extracts it when very confident.

How does multi-document extraction work?

When multiple documents belong to the same folder (e.g. same fund and reporting period), they are automatically batched together. The system concatenates their document trees into one combined input for a single LLM call — no separate prompts needed.

Single-document folders are extracted individually.
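The batching rule amounts to grouping documents by their parent folder. A minimal sketch, assuming documents are identified by their path in the permanent tree (the file names below are made up):

```python
from collections import defaultdict
from pathlib import PurePosixPath


def batch_by_folder(doc_paths: list[str]) -> dict[str, list[str]]:
    """Group document paths by parent folder; each group is one LLM call."""
    batches: dict[str, list[str]] = defaultdict(list)
    for path in doc_paths:
        batches[str(PurePosixPath(path).parent)].append(path)
    return dict(batches)


# Hypothetical file names for illustration.
docs = [
    "amundi_initiative-impact/2024q4/report.json",
    "amundi_initiative-impact/2024q4/factsheet.json",
    "amundi_initiative-impact/2024q3/report.json",
]
batches = batch_by_folder(docs)
```

Here the two 2024q4 documents form one batch, while the 2024q3 document is extracted on its own.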

What are narrative fields and how do I use them?

A narrative field asks the LLM to write a prose summary instead of extracting a bare value. Add narrative: 2-4 to any field in the schema to request a 2–4 sentence summary. This works for both singles fields and table columns.

# In a singles section:
- key: investment_thesis
  type: string
  label: Investment thesis
  hint: Strategy overview section
  narrative: 2-4          # write 2 to 4 sentences

# In a table column:
- key: company_strategy
  type: string
  label: Portfolio company strategy
  narrative: 1-3          # write 1 to 3 sentences per row

Narrative fields are extracted in a separate LLM pass that uses a prose-summarisation prompt (not the standard value-extraction prompt). For table columns, the entity list from pass 1 is injected so the LLM writes one summary per row. The results appear in the same Markdown table columns as regular fields.

What output files does extraction produce?

Each extraction writes four files next to the archived PDF:

File	Format	Contents
extract-singles-{suffix}.md	Markdown pipe table	One row — all scalar fields
extract-singles-{suffix}.yaml	YAML	Same row(s), dict-keyed by primary key column
extract-tables-{suffix}.md	Markdown, one `## table_name` section per table	Multi-row tables
extract-tables-{suffix}.yaml	YAML	Dict keyed by table name; each row has an auto-incremented `id`

The YAML files are machine-readable mirrors — useful as input for downstream scripts or spreadsheet imports without parsing Markdown.

Phase 4 — Query

What's the difference between Snapshot, Series, and Trend queries?

Snapshot: Returns one document per matching entity at a single selected date. Filters are multi-select — you can compare multiple entities side by side.

Series: Returns all documents for a single entity across a date range (start → end). Filters are single-select — pick one entity to follow over time.

Trend: Compares documents at two specific dates (T and T‑1). Only entities that have a document at both dates are included — unpaired documents are discarded. Filters are multi-select.

How does Trend mode handle missing documents?

Trend mode performs a strict entity intersection between the two selected dates. If an entity has a document at T but not at T‑1 (or vice versa), it is silently excluded from the result set.

The selection result shows the number of matched entity pairs and the total document count (always 2× the entity count). This guarantees the LLM always receives complete before/after pairs.

The Previous Date selector only shows dates earlier than or equal to the selected Date — it will never be empty because the current Date is always included as a fallback.
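The strict intersection described above can be expressed with set operations. This sketch assumes one document per entity per date and uses invented entity and file names.

```python
def trend_pairs(docs_at_t: dict[str, str],
                docs_at_prev: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (entity, previous_doc, current_doc) for entities present at both dates."""
    shared = sorted(docs_at_t.keys() & docs_at_prev.keys())
    return [(entity, docs_at_prev[entity], docs_at_t[entity]) for entity in shared]


# Hypothetical selections at T and T-1.
at_t = {"fund_a": "fund_a_2024q4.json", "fund_b": "fund_b_2024q4.json"}
at_prev = {"fund_a": "fund_a_2024q3.json", "fund_c": "fund_c_2024q3.json"}

pairs = trend_pairs(at_t, at_prev)
document_count = 2 * len(pairs)  # always twice the number of matched entities
```

In this example only fund_a survives: fund_b has no document at T-1 and fund_c has none at T.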

How does the Chat feature work?

The Chat tab lets you ask natural-language questions. It retrieves relevant document passages from the archived Markdown files and passes them as context to the LLM. Answers are grounded in actual document content — not hallucinated from model memory.

What are Query Templates?

Query templates let you save and reuse query configurations (filters + prompt). They are stored as Markdown files in _query-templates/ inside your workspace. Manage them from the Templates tab — create, edit, and run templates directly without re-entering parameters each time.

Can I save query results?

Yes. After running a query, use the Save Output action to persist the result as a Markdown file in queries-output/{snapshot|series|trend}/{reporting-date}/. Saved outputs appear as counts in the Dashboard panel.

What is the "Transmuted" mode and when should I use it?

Transmuted mode handles queries over a large number of documents that would exceed the LLM's context window. Instead of sending all documents in one call, the system:

Rewrites your cross-document question as a simpler single-document question (the transmuted question).
Runs that simpler question in parallel against each document individually.
Combines all per-document answers in a final reduce step to produce the answer to your original question.

The Query Panel shows a token budget indicator after document selection. When the total exceeds 100 000 tokens, Transmuted mode is recommended automatically.
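The three steps form a map-reduce pattern. The sketch below shows only its shape: ask and reduce_answers stand in for the per-document LLM call and the reduce step, the transmuted question is passed in rather than generated, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")
TOKEN_BUDGET = 100_000  # threshold quoted above


def recommend_transmuted(token_counts: Sequence[int]) -> bool:
    """Recommend Transmuted mode when the selection exceeds the budget."""
    return sum(token_counts) > TOKEN_BUDGET


def run_transmuted(documents: Sequence[str],
                   transmuted_question: str,
                   ask: Callable[[str, str], A],
                   reduce_answers: Callable[[list[A]], A]) -> A:
    """Ask the single-document question per document in parallel, then reduce."""
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda doc: ask(transmuted_question, doc),
                                documents))
    return reduce_answers(answers)
```

Because each call sees a single document, the per-call context stays small regardless of how many documents are selected.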

Which document format should I choose?

Format	Token cost	Best for
Minified JSON	Medium (default)	Specific metrics, allocations, tables — works for most questions
JSON (full)	Highest	Maximum accuracy on complex nested data when token budget is not an issue
Compact JSON	Lowest	Asking about document structure ("which section covers X?")
Markdown	Medium	Prose summaries, strategy descriptions, evidence finding, narrative questions

Not sure? Minified JSON is the right default — it provides the same extraction accuracy as full JSON at ~40% fewer tokens.

How does AI Review work?

Right-click any editable text field (query input, refinement, template editor, output in Edit mode) and choose Review by AI. The LLM sharpens phrasing without inventing information. Prepend [Instruction: make this more formal] to guide the rewrite. Use Undo AI Review to revert.

What is the reduce step in transmuted mode?

After each document answers the single-doc question individually, the reduce step combines all per-document answers into a final cross-document answer.

Two reduce strategies are used depending on the query type:

Deterministic: max, min, sort, delta (T minus T‑1), series assembly. No LLM call is made; the result is computed directly from the structured answers.
LLM-based: filter, similarity, aggregate — a final QUERY_MODEL call assembles the per-document answers into a narrative conclusion.

The reduce operation is determined automatically from the transmuted question metadata and shown to you before execution.
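Deterministic reducers are plain computations over the structured per-document answers. Two of them might look like this, assuming each per-document answer is a single number keyed by entity (the answer format is an assumption, not documented here):

```python
def reduce_max(answers: dict[str, float]) -> tuple[str, float]:
    """Return the entity with the largest value, and that value."""
    entity = max(answers, key=lambda name: answers[name])
    return entity, answers[entity]


def reduce_delta(at_t: dict[str, float],
                 at_prev: dict[str, float]) -> dict[str, float]:
    """Return value(T) minus value(T-1) for entities present at both dates."""
    return {entity: at_t[entity] - at_prev[entity]
            for entity in at_t.keys() & at_prev.keys()}
```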

Dashboard

What does the Dashboard show?

The Dashboard tab gives a live workspace overview with four sections:

Global Stats: Date range, distinct entity counts, total files archived, pages processed, total file size.
Archived Stats: Per-entity archive status (full JSON / compact JSON / Markdown) at the selected date and the previous date.
Extract Stats: Which extraction schemas have been run per entity at the selected date.
Query Stats: Number of saved query outputs (Snapshot and Series) per reporting date.

A shared date selector at the top drives the three status sections (Archived, Extract, and Query Stats).

How is the Dashboard data computed?

All data is computed on-demand from _index/index.json (archive and extract status) and the queries-output/ folder (query counts). No separate database — click Refresh to reload.

LLM & Configuration

How do I switch LLM providers?

Edit the LLM Config tab to update model aliases and provider endpoints. Add or update API keys in the Keys Panel (Admin page) — unlock with your admin password, paste keys in YAML format, and save. The router supports automatic fallback chains — if one model fails, the next in the ranking is tried.
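A fallback chain boils down to trying models in ranked order until one succeeds. The following sketch illustrates the idea only; it is not the router's actual code, and call is a placeholder for whatever sends the request to a provider.

```python
from typing import Callable, Sequence


def call_with_fallback(ranked_models: Sequence[str],
                       call: Callable[[str], str]) -> tuple[str, str]:
    """Try each model in order; return (model, response) from the first success."""
    errors: list[tuple[str, Exception]] = []
    for model in ranked_models:
        try:
            return model, call(model)
        except Exception as exc:  # any failure moves on to the next model
            errors.append((model, exc))
    raise RuntimeError(f"all models failed: {errors}")
```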

Where are API keys stored?

In workspaces/{group}/{doc_type}/_config/.keys.yaml.enc, encrypted with Fernet using a key derived from MASTER_PASSWORD via PBKDF2. A plaintext .keys.yaml is supported as a read-only fallback and is auto-migrated to the encrypted file on first save.

Admins manage keys from the Keys Panel in the Admin page:

Select a workspace (group + doc type) — keys are shown masked.
Click Unlock and enter your admin password to decrypt and edit.
Paste keys in YAML format and click Save — encrypted automatically.

Keys are never stored in .env or committed to source control.

Why is ingest slow?

Chart extraction is the main bottleneck — each chart image requires a separate vision LLM call. You can speed it up by:

Increasing Vision Concurrency in Config (default: 3)
Switching to a faster vision model
Disabling Two-Pass Image Detection if your documents have few charts

API & SDK

How do I authenticate via the API?

All API calls require a JWT Bearer token. Get one by logging in:

User login (single-step, works for both admins and regular users):

# Look up username (optional — confirms groups for regular users)
curl -s "http://localhost:8000/auth/user/lookup?username=alice"
# → {"is_admin": true, "groups": []}

# Login
TOKEN=$(curl -s -X POST http://localhost:8000/auth/user/login \
  -H "Content-Type: application/json" \
  -d '{"username":"alice","password":"MyAdminPass123!!!"}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

Pass the token on every request: -H "Authorization: Bearer $TOKEN"
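The same login can be done from Python with only the standard library. This sketch mirrors the curl call above; the endpoint and the access_token field come from that example, while the function names are made up for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def login_request(username: str, password: str) -> urllib.request.Request:
    """Build the POST request for /auth/user/login."""
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/auth/user/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def auth_headers(token: str) -> dict[str, str]:
    """Headers to attach to every authenticated API call."""
    return {"Authorization": f"Bearer {token}"}


def fetch_token(username: str, password: str) -> str:
    """Send the login request and return the access token."""
    with urllib.request.urlopen(login_request(username, password)) as resp:
        return json.load(resp)["access_token"]
```

Call fetch_token once, then pass auth_headers(token) with each subsequent request until the token expires.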

Which file is read by the /docs API viewer?

The /docs page loads RapiDoc with the spec served from app/project-docs/openapi.json — a static pre-generated file, not the live FastAPI spec at /api/openapi.json.

If you add or change routes, re-generate it:

python app/project-scripts/export_openapi.py

Add --yaml to also export a YAML version.

How do I run the full pipeline via curl?

# 1. Login
TOKEN=$(curl -s -X POST http://localhost:8000/auth/user/login \
  -H "Content-Type: application/json" \
  -d '{"username":"alice","password":"MyAdminPass123!!!"}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

# 2. Ingest (fast, synchronous)
curl -X POST "http://localhost:8000/api/fin-report/ingest-fast" \
  -H "Authorization: Bearer $TOKEN" \
  -F "files=@report.pdf"

# 3. Archive all temp files
curl -X POST "http://localhost:8000/api/fin-report/archive" \
  -H "Authorization: Bearer $TOKEN"

# 4. Extract (single document)
curl -X POST "http://localhost:8000/api/fin-report/extract" \
  -H "Authorization: Bearer $TOKEN" \
  -G \
  --data-urlencode "path_level2=amundi_initiative-impact" \
  --data-urlencode "path_level3=2024q4" \
  --data-urlencode "schema_suffix=prtf"

# 5. Query snapshot
curl -X POST "http://localhost:8000/api/fin-report/select/snapshot?date=2024q4" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'

Replace fin-report with your doc type. Full parameter reference: /docs.

What routes require a JWT and what don't?

Public (no token): Landing page, sign-in, user lookup (POST /auth/user/login, GET /auth/user/lookup), static doc pages (/guide, /faq, /docs).

JWT required: all /api/* routes, UI workspace routes, Admin panel, Keys Panel.

The token's sub claim identifies the user and their role (admin or regular). Admin tokens unlock keys management.

How do I call the chat endpoint programmatically?

First select matching documents, then pass the paths as context to chat:

# Step 1: get matched files
RESULT=$(curl -s -X POST \
  "http://localhost:8000/api/fin-report/select/snapshot?date=2024q4" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}')

# Step 2: extract matched_files and send to chat
MATCHED=$(echo "$RESULT" | python3 -c "import sys,json; print(json.dumps(json.load(sys.stdin)['matched_files']))")

curl -X POST "http://localhost:8000/api/fin-report/chat" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"matched_files\": $MATCHED, \"prompt\": \"What is the total AUM?\"}"
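When the prompt contains quotes or other special characters, interpolating it into a shell string becomes fragile. A safer option is to build the request body with a JSON serializer, as in this sketch. It assumes the same matched_files and prompt fields used in the curl example; the function name is hypothetical.

```python
import json


def chat_body(selection_response: dict, prompt: str) -> bytes:
    """Build the JSON body for the chat endpoint from a selection response."""
    payload = {
        "matched_files": selection_response["matched_files"],
        "prompt": prompt,
    }
    # json.dumps escapes quotes and backslashes in the prompt correctly.
    return json.dumps(payload).encode()
```

The resulting bytes can be sent as the POST body with Content-Type: application/json and the Bearer token header.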
