NousResearch/hermes-agent

github.com/NousResearch/hermes-agent · audited 2026-06-03 · commit 6420220

44% ERI composite

Hermes Agent is a well-built single-user agent application — a desktop app (apps/desktop) over a Python CLI core (hermes_cli). Run through the Enterprise Readiness Index, it scores a 44% composite, and the shape of that number is more interesting than the number itself.

Where it’s strong

The execution-velocity tier holds up. Implementation & Customization (78%), AI / Data Foundation (72%), Reliability Primitives (72%), Deployability (71%), and Performance Primitives (66%) all land in healthy territory: configuration-driven variation instead of per-customer branches, sane data handling, real deploy signals, and an engineering org without an obvious single-owner cliff (Engineering Org Resilience 64%).

Where the thesis breaks

The exit-cleanliness and enterprise-control dimensions score low — and that’s the honest signal, not a defect. Tenancy Isolation (0%), Audit / Governance / Residency (14%), Reporting & Data Export (16%), API & Extensibility (18%), and Procurement Readiness (21%) all reflect the same root cause: this codebase was never architected as a multi-tenant B2B platform. There are stray tenant_id fields (e.g. TeamsMeetingRef.tenant_id) and a narrow dashboard-auth audit log, but no row-level isolation, no cross-tenant tests, no append-only audit spine, no customer-facing export surface.

The read

For its actual purpose — a fast, local, single-operator agent — Hermes is in good shape. For an upmarket B2B thesis that assumes multi-tenancy and enterprise procurement, the gap is structural and would be a re-platform, not a sprint. The dimension breakdown below shows exactly which sites are expected versus found, with remediation linked to the audited commit.

T1 Thesis Viability

AI / Data Foundation

Versioned data pipelines, pinned model versions, and a real vector or feature store — not scattered cron jobs and model="latest".

72% 16/19 scored

Declarative, tested transformations · 67% · 3/3 expected sites
Orchestrated pipelines · 89% · 3/3 expected sites
Data quality validation / contracts · 67% · 2/2 expected sites
Data + pipeline versioning · 0% · 0/3 expected sites
Data lineage / provenance · 89% · 3/3 expected sites
Feature management · 58% · 3/4 expected sites
Vector / embedding store · 75% · 4/4 expected sites
Model version pinning · 33% · 1/2 expected sites
Prompt / model-call management · 92% · 4/4 expected sites
Reproducibility / determinism · 8% · 1/4 expected sites
AI output validation · 100% · 2/2 expected sites
Grounding / wrongness check · 67% · 1/1 expected sites
Self-correction / feedback loop · 0% · 0/3 expected sites · not present
Actionable diagnostics · 100% · 2/2 expected sites
Positive confirmation · 200% · 2/1 expected sites
Machine-readable contracts · 100% · 4/4 expected sites

Declarative, tested transformations 67%

Declarative, tested transformations exist primarily through Hermes’ plugin system hooks (transform_llm_output / transform_tool_result / transform_terminal_output). The repo provides unit and integration tests that load real plugins from manifests and verify the transformation contract end-to-end (dispatch semantics, wiring of kwargs, replacement rules, truncation/redaction interactions, and exception fallback). The main gap is evidentiary: the audit did not capture the hook-invocation sites at the core application seams (e.g., the run_agent/model_tools/terminal_tool boundaries), although the tests strongly indicate that the transformation layer is governed and validated.

high
Add (or locate and reference) a direct code-evidence slice in run_agent.py/model_tools.py/terminal_tool.py showing exactly where each hook (transform_llm_output / transform_tool_result / transform_terminal_output) is invoked during production execution, so the primitive’s presence is proven at the critical seams (not just via tests).
- run_agent.py:1-260 — run_agent.py is the expected LLM-output seam, but the evidence provided covers only the module header range and does not yet show the hook call site.
- model_tools.py:1-260 — model_tools.py is the expected tool-result seam; the provided evidence here is only the module header range and does not yet show the hook call site.
- tools/terminal_tool.py:1-260 — tools/terminal_tool.py is the expected terminal-output seam; the provided evidence here is only the module header range and does not yet show the hook call site.
med
Ensure transformation assets are clearly versioned per plugin (e.g., plugin.yaml version + any compatibility constraints) and that each transformation hook has at least one dataset/boundary test case for empty outputs and malformed plugin return types (some exist, but consolidate per hook into a consistent suite).
- tests/test_transform_llm_output_hook.py:1-160 — Covers empty string pass-through, non-string returns, and exception behavior for transform_llm_output.
- tests/test_transform_tool_result_hook.py:1-192 — Covers None/invalid hook returns, first-valid-string replacement, and exception fallback for transform_tool_result.
- tests/tools/test_terminal_output_transform_hook.py:1-210 — Covers first-valid-string replacement, truncation, redaction behavior, and exception fallback for transform_terminal_output.
low
Document (in a short developer guide) the expected plugin contract for each transform hook (input kwargs, replacement semantics, return-type rules, truncation/redaction expectations) and point to the corresponding test files as the authoritative spec.
- hermes_cli/plugins.py:120-260 — VALID_HOOKS enumerates transform hooks and describes replacement semantics at a high level, but a dedicated “contract + tests” doc would reduce reliance on reading tests.
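
The replacement semantics listed above (first valid string wins, invalid return types are ignored, exceptions fall back to the original text) can be illustrated with a minimal sketch. The function name `apply_transform_hooks` and the hook signature are hypothetical; this is not Hermes' actual dispatcher, and it assumes any `str` return counts as valid.

```python
from typing import Any, Callable, Iterable

# Hypothetical hook signature: receives the text plus context kwargs.
Hook = Callable[..., Any]


def apply_transform_hooks(original: str, hooks: Iterable[Hook], **kwargs: Any) -> str:
    """Return the first string produced by a hook, else the original text."""
    for hook in hooks:
        try:
            result = hook(original, **kwargs)
        except Exception:
            # A failing plugin must not break the caller: skip it.
            continue
        if isinstance(result, str):
            # First valid string replaces the original; later hooks are not run.
            return result
        # None or any non-string return is treated as "no opinion".
    return original
```

A contract document for each hook could state these three rules explicitly and link the test file that asserts each one.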

Orchestrated pipelines 89%

This codebase contains an orchestrated, dependency-style pipeline implementation for the Teams meeting summary flow. It externalizes orchestration state using a durable store, persists step-by-step lifecycle statuses, classifies retryable vs terminal failures, and wires the pipeline into the gateway via an explicit scheduler callback.

high
Add a first-class, queryable DAG/asset definition for the pipeline steps (e.g., a versioned pipeline manifest describing step graph, retry policy, and step inputs/outputs) and surface it for observability tooling. Right now, the step graph is implicit in control flow within `run_job`.
- plugins/teams_pipeline/pipeline.py:260-560 — The orchestration steps and transitions are implemented directly in `run_job` control flow, but there is no separate external artifact describing the pipeline DAG/graph.
med
Strengthen the retry mechanism into an explicit scheduler-backed retry loop (with bounded attempts and backoff scheduling persisted per job), rather than relying only on `retry_scheduled` status updates and eventual invocation.
- plugins/teams_pipeline/pipeline.py:260-560 — `run_job` catches `TeamsPipelineRetryableError` and persists `retry_scheduled`, but the actual retry scheduling policy/worker loop is not shown in the orchestration hot path we inspected.
low
Expose an auditable summary of pipeline runs (job_id, event_id/dedupe_key, step timestamps, last error) via a small API/CLI command that reads the durable store. This would improve operational observability and reproducibility.
- plugins/teams_pipeline/store.py:1-194 — The store contains `jobs` and timestamps/receipts, but we did not confirm a dedicated CLI/API that outputs a structured run report from these records.
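
As an illustration of the bounded, persisted retry policy recommended above, the sketch below computes the fields a job record might store after a retryable failure. `RetryPolicy`, `schedule_retry`, the `failed` status, and the `next_run_at` field are hypothetical; only the `retry_scheduled` status name comes from the audited code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical policy: exponential backoff with a hard attempt cap."""
    max_attempts: int = 3
    base_delay_s: float = 30.0
    max_delay_s: float = 600.0


def schedule_retry(attempt: int, now: float, policy: RetryPolicy = RetryPolicy()) -> dict:
    """Return the job fields to persist after failed attempt `attempt` (1-based)."""
    if attempt >= policy.max_attempts:
        # Attempts exhausted: record a terminal state instead of rescheduling.
        return {"status": "failed", "attempts": attempt, "next_run_at": None}
    # Delay doubles with each attempt and is capped at max_delay_s.
    delay = min(policy.base_delay_s * 2 ** (attempt - 1), policy.max_delay_s)
    return {"status": "retry_scheduled", "attempts": attempt, "next_run_at": now + delay}
```

Persisting `next_run_at` alongside the status would let a scheduler worker pick up due jobs from the durable store rather than depending on a later incidental invocation.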

Data quality validation / contracts 67%

This codebase does have data-quality/contract-like validation layers, but they are not consistently applied as a single “data contracts” primitive across all ingestion boundaries. Strongest evidence is present for (1) tool JSON schema sanitization to prevent ingestion failures, and (2) file-tool input guards (size limits and blocked device paths) with unit tests that confirm rejection/quarantine behavior. Other ingestion-style boundaries (e.g., delivery routing inputs) appear to rely more on parsing and downstream logic than on a clearly governed contract gate.

high
Introduce (or standardize) an explicit data-quality contract gate for delivery routing inputs in gateway/delivery.py—e.g., strict schema/validation for DeliveryTarget.parse inputs (and any content/metadata shape), with a quarantine/error return type that prevents malformed targets from reaching platform adapters.
- gateway/delivery.py:1-120 — Delivery routing accepts and parses target strings; evidence available shows best-effort parsing without an explicit contract gate for malformed/unsafe routing inputs.
med
Extend file_tools validation coverage to include comprehensive handler-level shape validation for all tool entrypoints (keys/types/ranges) in addition to existing path/device/size guards, and ensure each validation rule has a unit test that asserts rejection behavior.
- tools/file_tools.py:1-520 — Existing guards cover some high-risk cases (blocked devices, size caps, path resolution); broader ingestion-boundary shape contracts appear partially covered but not comprehensively demonstrated in the sampled sections.
low
Document the schema/validation contract pattern (what constitutes the “contract”, where it runs, what error payload looks like, and how quarantine is expressed) and reuse it across modules that accept LLM/tool inputs.
- tools/schema_sanitizer.py:1-220 — Schema sanitization already follows a contract mindset; formalizing the pattern would help apply it consistently to other ingestion boundaries.
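
A contract gate of the kind suggested for delivery routing might look like the sketch below. The `<platform>:<chat_id>` grammar, the allow-list, and every name here are assumptions made for illustration; they do not describe the format that gateway/delivery.py actually accepts.

```python
import re
from dataclasses import dataclass
from typing import Union

# Hypothetical allow-list and grammar, for illustration only.
ALLOWED_PLATFORMS = frozenset({"telegram", "discord", "slack"})
_TARGET_RE = re.compile(r"(?P<platform>[a-z_]+):(?P<chat_id>[A-Za-z0-9_\-]{1,128})")


@dataclass(frozen=True)
class ValidTarget:
    platform: str
    chat_id: str


@dataclass(frozen=True)
class RejectedTarget:
    """Quarantine record: the raw input plus a machine-readable reason."""
    raw: str
    reason: str


def validate_target(raw: object) -> Union[ValidTarget, RejectedTarget]:
    """Accept only well-formed targets; everything else is returned as rejected."""
    if not isinstance(raw, str):
        return RejectedTarget(raw=repr(raw), reason="target must be a string")
    match = _TARGET_RE.fullmatch(raw.strip())
    if match is None:
        return RejectedTarget(raw=raw, reason="expected '<platform>:<chat_id>'")
    if match["platform"] not in ALLOWED_PLATFORMS:
        return RejectedTarget(raw=raw, reason=f"unknown platform {match['platform']!r}")
    return ValidTarget(platform=match["platform"], chat_id=match["chat_id"])
```

Returning a typed rejection rather than raising keeps malformed targets out of platform adapters while preserving the input for later inspection.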

Raw / immutable source layer N/A

I did not find an immutable “raw/source preserved unmodified” landing layer anywhere in the codebase. The only “raw/Raw…” hits are UI-level naming (e.g., `RawAnsi`) or data-fetch scripts that output processed/normalized artifacts directly, without a governed immutable raw layer for audit/reproducibility.

high
Introduce a dedicated immutable raw landing layer for external fetches (e.g., Wikipedia/Wikidata, YouTube transcript, other tool ingestions): persist the exact HTTP response payload (and request parameters/headers + timestamps + source identifier) to a versioned store before any parsing/normalization; write transforms downstream from this stored raw artifact.
- optional-skills/research/osint-investigation/scripts/fetch_wikipedia.py:200-267 — Currently writes enriched CSV rows directly from processed lookups; no observable immutable raw capture before transformation.
- skills/media/youtube-content/scripts/fetch_transcript.py:1-125 — Currently returns normalized structured JSON; no evidence of persisting unmodified raw inputs for later replay.
med
Add an ingestion manifest/spec for the raw layer (schema + required fields, retention policy, and a deterministic reprocessing command that reads raw artifacts without re-fetching external sources).
- skills/media/youtube-content/scripts/fetch_transcript.py:1-125 — Script-level JSON output exists, but there is no accompanying machine-readable contract/manifest for an immutable raw storage layer.
low
Rename or namespace UI-layer “raw” components (e.g., `RawAnsi`) to avoid confusion with the data-layer primitive name, and document clearly which “raw” refers to UI bytes vs. immutable source-data persistence.
- ui-tui/packages/hermes-ink/src/ink/components/RawAnsi.tsx:1-62 — Terminology collision risk: UI component uses `RawAnsi` name but does not represent the requested data primitive.
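
One way to realize the immutable raw landing layer described above is content-addressed storage: write the unmodified response bytes under their SHA-256 digest before any parsing. The sketch below assumes a local directory as the store; `persist_raw` and the metadata fields are hypothetical and not part of the audited scripts.

```python
import hashlib
import json
import time
from pathlib import Path


def persist_raw(raw_dir: Path, source: str, url: str, body: bytes) -> Path:
    """Store the unmodified payload under its content hash and return its path."""
    digest = hashlib.sha256(body).hexdigest()
    payload_path = raw_dir / source / f"{digest}.bin"
    payload_path.parent.mkdir(parents=True, exist_ok=True)
    if not payload_path.exists():
        # Write once; an existing file with this name already holds these bytes.
        payload_path.write_bytes(body)
    meta_path = payload_path.with_suffix(".json")
    if not meta_path.exists():
        # Sidecar metadata recorded at first capture only.
        meta = {"source": source, "url": url, "sha256": digest, "fetched_at": time.time()}
        meta_path.write_text(json.dumps(meta, sort_keys=True))
    return payload_path
```

Downstream transforms would then read from the returned path, so reprocessing never needs to re-fetch the external source.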

Data + pipeline versioning 0%

The codebase contains a strong reproducibility primitive via filesystem snapshotting (Checkpoint Manager using a shared shadow git store) and a durable pipeline state store for the Teams meeting pipeline. However, there is no clear evidence that data + pipeline versions are tightly coupled and recorded for each pipeline release/job (e.g., no explicit pipeline-version/data-provenance manifest tying code version to specific input/output versions).

high
Add explicit pipeline versioning metadata capture and persistence: when a checkpoint/snapshot is created for a pipeline run, store (and persist into the pipeline store) the pipeline code version (git commit hash of the pipeline module), configuration hash, and the input artifact/version identifiers that determine the produced outputs.
- tools/checkpoint_manager.py:1-60 — Snapshot mechanism exists, but the snippet indicates it snapshots filesystem state; additional required binding to pipeline/data provenance appears missing.
- plugins/teams_pipeline/store.py:1-194 — Durable state exists but does not show fields for pipeline/data version linkage.
med
Extend TeamsPipeline job/sink records to include deterministic identifiers: pipeline logic hash, input artifact keys (meeting artifact/transcript/audio versions), and output schema/version. Persist these at upsert_job/upsert_sink_record time so later runs can replay exactly.
- plugins/teams_pipeline/store.py:1-194 — Store currently persists jobs/sink_records but schema for version linkage is not present in the observed code.
low
Add tests that verify reproducibility: given the same pipeline version + captured input versions, outputs should be identical (or produce the same sink record identifiers).
- tests/test_yuanbao_pipeline.py:1-200 — There are pipeline unit tests, but the audit did not find version-coupling tests specific to data+pipeline reproducibility.
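
The version linkage recommended above can be sketched as a small manifest that binds code version, configuration, and input identifiers into one stable ID. `build_version_manifest` and its field names are hypothetical; the audited store has no such fields today.

```python
import hashlib
import json
from typing import Any, Mapping, Sequence


def _stable_hash(obj: Any) -> str:
    """Hash a JSON-serializable object with a canonical (sorted-key) encoding."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def build_version_manifest(
    code_commit: str, config: Mapping[str, Any], input_ids: Sequence[str]
) -> dict:
    """Bind code, config, and inputs so a run can be identified and replayed."""
    manifest = {
        "code_commit": code_commit,
        "config_hash": _stable_hash(dict(config)),
        "input_ids": sorted(input_ids),  # order-insensitive
    }
    # The manifest ID changes whenever any of the three components changes.
    manifest["manifest_id"] = _stable_hash(manifest)
    return manifest
```

A reproducibility test could then assert that two runs with equal `manifest_id` values produce the same sink record identifiers.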

Data lineage / provenance 89%

Data lineage/provenance exists in this codebase primarily as durable conversation/session provenance in the SQLite-backed Hermes state store (not as an external lineage standard like OpenLineage/DataHub). Sessions link derivations via `parent_session_id`, messages store tool-call identifiers and timestamps, and retrieval tooling (session_search_tool) reconstructs lineage roots and corrects anchor rebinding for consistency. An observability plugin (Langfuse) provides external trace emission, but the lineage primitive here is more “conversation/tool provenance” than “dataset lineage” for data pipelines.

high
Audit and document the end-to-end provenance emission path (session creation → message insertion → tool call linkage) to ensure lineage is always emitted for every transformation/derivation event, and that identifiers (session_id/task_id/tool_call_id) are consistently propagated across modules.
- hermes_state.py:230-320 — Provenance fields exist in schema (`parent_session_id`, tool call fields), but the wiring that guarantees they are always populated should be verified/standardized at the write points.
med
Add a machine-queryable provenance export (e.g., JSON export of a session lineage graph or an internal API endpoint) so lineage can be validated without requiring consumers to understand internal DB semantics.
- tools/session_search_tool.py:1-170 — Current lineage correctness is enforced during retrieval; providing a first-class export would externalize lineage for change-management review.
low
If external observability is the intended governed provenance system, extend/confirm that trace metadata includes all lineage-critical ids (session lineage root, parent/child relationships, tool_call_id/session_id mapping) consistently across all trace/span creation points.
- plugins/observability/langfuse/__init__.py:1-120 — The plugin defines trace state and activation gating; ensure trace payloads always include the same lineage anchors as Hermes state.
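
To show how `parent_session_id` links could back a machine-queryable lineage export, here is a minimal sketch that walks from a session up to its root. It assumes a simplified `sessions(id, parent_session_id)` table; the real Hermes schema has more columns and may differ in naming.

```python
import sqlite3
from typing import List


def lineage_chain(conn: sqlite3.Connection, session_id: str) -> List[str]:
    """Return [session, parent, ..., root]; empty if the session is unknown."""
    chain: List[str] = []
    seen = set()
    current = session_id
    while current is not None and current not in seen:
        row = conn.execute(
            "SELECT parent_session_id FROM sessions WHERE id = ?", (current,)
        ).fetchone()
        if row is None:
            break  # dangling reference or unknown session id
        seen.add(current)  # guards against accidental cycles
        chain.append(current)
        current = row[0]
    return chain
```

Serializing the returned chain as JSON would give reviewers a lineage view without requiring knowledge of internal database semantics.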

Feature management 58%

This codebase has a centralized feature-management layer for Tool Gateway entitlements: `hermes_cli/nous_subscription.py` computes governed, structured feature state (available/active/provider/managed) and is backed by unit tests. Other surfaces (CLI tool config + portal status) import and use this computation, reducing training/serving-style skew risk. However, the audit evidence shows only that consumers import this layer; it does not show the full wiring in which each consumer uses the computed states to gate runtime behavior, which is why the score is not higher.

high
Verify end-to-end wiring that every runtime decision/surface using features (especially those affecting which tools/skills are exposed to the agent) derives from `get_nous_subscription_features`/`apply_nous_managed_defaults`, rather than re-implementing entitlement logic. Add/extend tests that assert consistent feature gating across the main execution entrypoints.
- hermes_cli/nous_subscription.py:220-420 — Central feature-definition computation exists; ensure all downstream consumers use its outputs.
- hermes_cli/tools_config.py:1-140 — Consumer module imports the centralized feature layer; confirm it is actually used for gating in the relevant code paths.
- agent/prompt_builder.py:900-1040 — Prompt/skills assembly is a concrete serving-time surface; confirm it uses the same feature state when filtering/gating skills.
med
Add a single “contract test” that compares feature computation inputs/outputs across all entrypoints that call it (e.g., CLI tools selection, portal status, agent runtime). This prevents drift when new features/backends are added.
- tests/hermes_cli/test_nous_subscription.py:1-260 — Unit tests exist for the feature computation; expand to cross-entrypoint consistency checks.
- hermes_cli/portal_cli.py:1-120 — Portal status is one consumer; include other consumers in a shared contract test.

Vector / embedding store 75%

The codebase includes a persisted vector/embedding-like store for its memory system: HRR vectors are stored in SQLite tables (`facts.hrr_vector` and `memory_banks.vector`) and are recomputed on ingestion/rebuild. Retrieval code uses these persisted vectors for similarity scoring rather than keeping everything ephemeral in process memory. However, the implementation is not a managed, model+content-version governed embedding store; it primarily persists vectors locally without explicit linkage to a producing model version (beyond parameters like HRR dimension).

high
Add explicit version governance to the persisted embeddings: store an embedding-config version (e.g., vector type + embedding parameters + producing model identifier/version if any) alongside each vector/bank, and refuse or automatically rebuild vectors when the producing configuration changes.
- plugins/memory/holographic/store.py:1-120 — Schema defines `hrr_vector` and `memory_banks.vector` but does not record a producing model/config version to tie embeddings to their generator.
med
If the intent is to satisfy the “managed, queryable, versioned store” requirement, replace/augment the local SQLite vector persistence with a dedicated vector DB interface (or at least isolate it behind a vector-store abstraction that exposes upsert/query/delete and versioned namespaces).
- plugins/memory/holographic/store.py:1-120 — The store is implemented directly as SQLite tables and blobs; there is no external vector DB or managed indexing layer.
med
Add an automated freshness/consistency check at query time (or before retrieval): verify that vector banks exist and match the current embedding configuration; otherwise trigger `rebuild_all_vectors` (bounded, logged) and fall back safely.
- plugins/memory/holographic/store.py:450-579 — `rebuild_all_vectors` exists, but there is no evidence of automatic, governed “config mismatch → rebuild” gating before using stored vectors.
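
The "config mismatch → rebuild" gate suggested above could be as small as a fingerprint comparison. In this sketch, `EmbeddingConfig`, its fields, and `needs_rebuild` are hypothetical; the audited store records no such fingerprint today.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class EmbeddingConfig:
    """Hypothetical description of whatever produced the stored vectors."""
    vector_type: str
    dimension: int
    producer: str  # model or algorithm identifier

    def fingerprint(self) -> str:
        blob = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:16]


def needs_rebuild(stored_fingerprint: Optional[str], current: EmbeddingConfig) -> bool:
    """True when vectors were built under a different (or unknown) configuration."""
    return stored_fingerprint != current.fingerprint()
```

A retrieval path would check `needs_rebuild` before scoring and call `rebuild_all_vectors` (bounded and logged) when it returns true.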

Model version pinning 33%

Model identity pinning is partially supported: the agent’s public constructor takes an explicit `model` string and threads it into initialization, but direct enforcement that the string is not a floating alias (e.g., rejecting `latest`/`stable` as runtime model IDs) was not found in the inspected model-call wiring. The codebase includes careful model-string parsing/handling for tags like `latest`/`stable`, but pinning enforcement at invocation appears incomplete based on observed call-path slices.

high
Add an explicit guardrail at the model-identity ingress point (agent init / request kwargs build): detect `model` values that end with or equal floating aliases (e.g., `latest`, `stable`) and either (a) reject with a clear error or (b) resolve to a pinned concrete ID via a version resolver with persisted results for reproducibility.
- run_agent.py:1080-1125 — This is the choke point where model identity is first accepted and forwarded; pinning validation should occur here or immediately after.
med
Ensure the resolved/pinned model ID (post-alias-resolution) is the value stored in session state / logs (e.g., the session DB row) and used for the API request, not the user-provided alias string.
- run_agent.py:1080-1098 — Session/model fields are already tracked (model identity is present on the agent); align the stored value with the pinned/concrete ID after validation/resolution.
low
Add unit tests that explicitly cover `model='latest'` and `model='stable'` for each provider path (OpenAI-compatible, OpenRouter, Anthropic, local/Ollama), asserting deterministic behavior (reject or resolve-to-pinned).
- agent/model_metadata.py:35-55 — Model string parsing currently recognizes `latest`/`stable` tags; tests should verify how this impacts runtime model selection and reproducibility guarantees.
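
Option (a) of the guardrail above, rejecting floating aliases at ingress, can be sketched as follows. `require_pinned_model` and `FloatingModelAliasError` are hypothetical names, and the matching rules (bare alias, `:tag`, or `-suffix`) are an assumption about how aliases appear in model strings.

```python
FLOATING_ALIASES = frozenset({"latest", "stable"})


class FloatingModelAliasError(ValueError):
    """Raised when a model string does not identify a fixed version."""


def require_pinned_model(model: str) -> str:
    """Return the model string unchanged, or raise if it ends in a floating alias."""
    name = model.strip()
    # Inspect only the last path segment, e.g. "vendor/model:tag" -> "model:tag".
    tail = name.rsplit("/", 1)[-1].lower()
    tag = tail.rsplit(":", 1)[-1]      # "llama3:latest" -> "latest"
    suffix = tail.rsplit("-", 1)[-1]   # "model-stable"  -> "stable"
    if tail in FLOATING_ALIASES or tag in FLOATING_ALIASES or suffix in FLOATING_ALIASES:
        raise FloatingModelAliasError(
            f"model {model!r} uses a floating alias; supply a concrete version"
        )
    return name
```

Option (b), resolving the alias to a concrete ID, would replace the `raise` with a lookup whose result is persisted so that later runs reuse the same resolution.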

Prompt / model-call management 92%

The codebase has a clear prompt governance layer: prompt fragments live in agent/prompt_builder.py, the assembled system prompt is constructed in agent/system_prompt.py (with explicit caching described), and specialized prompts for background review are centralized in agent/background_review.py. The conversation loop further centralizes system-prompt cache restore/build via _restore_or_build_system_prompt(), helping prevent prompt literal drift near model call sites.

high
Audit remaining model-provider call paths (e.g., where chat/completions/responses are invoked) to ensure they always consume the governed system prompt built/restored by the system_prompt + conversation_loop layers, and do not inline literal prompt strings near the SDK calls.
- agent/system_prompt.py:1-240 — Central prompt assembly entrypoint; ensure all model calls source from here.
- agent/conversation_loop.py:1-260 — Central prompt cache restore/build; ensure all turn model calls use the cached/built prompt.
med
Where review or specialized prompts must evolve, add automated snapshot/equality tests to detect prompt drift across variants (e.g., memory vs combined review prompt) and confirm the background-review prompt strings remain the only source-of-truth.
- agent/background_review.py:1-220 — All background review prompt strings are centralized here; add/extend tests to lock them down against accidental duplication elsewhere.
low
For the largest prompt fragments (identity/guidance blocks), consider moving the biggest text literals into versioned markdown/template assets (if not already done elsewhere) and have prompt_builder load them, so changes remain even more diffable and reviewable than code-only constants.
- agent/prompt_builder.py:1-220 — Many guidance strings are currently embedded as Python constants; externalizing to versioned text assets would further strengthen diffability.

Reproducibility / determinism 8%

This codebase contains partial determinism/reproducibility primitives: deterministic tool-call IDs (to avoid UUID-driven cache invalidation) and a trajectory persistence utility for recording conversation outcomes. However, at run/turn boundaries there is no evidence of a fully captured “replay manifest” (pinned model/provider identifiers, generation parameters, preprocessing settings, and RNG seeds) sufficient to recreate runs exactly from pinned inputs. The trajectory evidence trail exists, but it appears incomplete for determinism.

high
Add a run-boundary “repro manifest” captured alongside each trajectory/run: exact model ID/provider/base_url, generation params (temperature/top_p/any seed), preprocessing/compression parameters, and the specific code/data versions used. Store it in the same output directory (or JSONL alongside trajectory entries) and include schema validation for required fields.
- agent/trajectory.py:1-57 — Trajectory entries currently record only `timestamp`, `model`, `completed`, and `conversations`—missing deterministic replay inputs (seed/generation params/preprocessing settings).
- agent/conversation_loop.py:1-220 — Core per-turn execution lives here; this is where determinism-affecting generation parameters should be captured/propagated into the trajectory manifest.
med
Version and document the determinism-critical hashing inputs for tool-call IDs (argument serialization normalization). Ensure the exact serialization function and its version are included in persisted artifacts so that cached/deterministic IDs remain stable across code changes.
- agent/codex_responses_adapter.py:128-164 — Deterministic call IDs depend on `arguments` (as a string) and `index`; without explicit serialization normalization/versioning captured, the same logical tool call might not hash identically across runs.
low
Reduce nondeterministic fields in replay artifacts where possible (e.g., keep `timestamp` for auditing, but ensure a separate `run_id` is derived deterministically from pinned inputs).
- agent/trajectory.py:1-57 — Uses `datetime.now().isoformat()` for trajectory entries, which is not deterministic; consider adding deterministic IDs derived from the repro manifest.
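
The serialization-normalization concern above can be made concrete with a sketch that canonicalizes the JSON arguments before hashing, so logically equal tool calls yield the same ID. This is not the scheme used in agent/codex_responses_adapter.py; `deterministic_call_id`, `SERIALIZER_VERSION`, and the `call_` prefix are hypothetical.

```python
import hashlib
import json

# Hypothetical version tag: bump it whenever the canonical form changes.
SERIALIZER_VERSION = "v1"


def canonical_arguments(arguments: str) -> str:
    """Re-serialize a JSON arguments string with sorted keys and no whitespace.

    Raises json.JSONDecodeError if `arguments` is not valid JSON.
    """
    return json.dumps(json.loads(arguments), sort_keys=True, separators=(",", ":"))


def deterministic_call_id(name: str, arguments: str, index: int) -> str:
    """Derive a stable ID from tool name, position, and normalized arguments."""
    payload = f"{SERIALIZER_VERSION}:{name}:{index}:{canonical_arguments(arguments)}"
    return "call_" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:24]
```

Recording `SERIALIZER_VERSION` in persisted artifacts would make it clear which canonical form produced a stored ID.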

AI output validation 100%

The codebase has a strong, centralized AI output validation primitive for LLM auxiliary calls: `_validate_llm_response` enforces the expected `.choices[0].message` shape and throws a clear `RuntimeError`. `call_llm` consistently applies this validator immediately after each model invocation and reuses it across retries/fallback paths, preventing raw/unvalidated model payloads from flowing downstream.

high
Create/extend tests that intentionally return malformed LLM payloads (e.g., `response=None`, `response.choices=[]`, or missing `.message`) and assert that each retry path also fails via `_validate_llm_response` with the same error wording, so that no retry or fallback branch bypasses the schema gate.
- agent/auxiliary_client.py:4865-4955 — Validation error message and failure conditions are defined here; tests should assert these exact behaviors across retries.
- agent/auxiliary_client.py:4955-5300 — `call_llm` wraps model calls with `_validate_llm_response(...)` across multiple retry/fallback branches; tests should cover at least one malformed-payload trigger per major branch.
med
Audit the main agent LLM call path(s) (non-auxiliary) to ensure the same (or equivalent) validator is applied right after model invocations, not only for auxiliary routing.
- agent/auxiliary_client.py:4955-5300 — This evidence covers auxiliary calls; the primitive should also be present at other model-call boundaries if they exist.
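
For reference, a shape check of the kind described (`.choices[0].message` present, `RuntimeError` otherwise) can be sketched as below. This is an illustrative stand-in named `validate_response_shape`, not the body of `_validate_llm_response`, and the error messages are invented.

```python
from types import SimpleNamespace
from typing import Any


def validate_response_shape(response: Any) -> Any:
    """Return response.choices[0].message or raise RuntimeError on any gap."""
    choices = getattr(response, "choices", None)
    if not choices:
        # Covers response=None, a missing attribute, and an empty list.
        raise RuntimeError("LLM response has no choices")
    message = getattr(choices[0], "message", None)
    if message is None:
        raise RuntimeError("LLM response choice has no message")
    return message


# Usage with a stand-in object shaped like an SDK response.
ok = SimpleNamespace(choices=[SimpleNamespace(message="hello")])
first_message = validate_response_shape(ok)
```

The three malformed payloads named in the high-priority item map directly onto the two failure branches, which makes them natural parametrized test cases.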

Grounding / wrongness check 67%

This codebase has an output-grounding/wrongness check for structured plugin LLM calls: `PluginLlm.complete_structured()` enforces JSON parsing and (when a schema is provided) validates the parsed output against `json_schema` using `jsonschema`, failing closed on mismatch. However, beyond this structured path, I did not find evidence of a broader “claim-by-claim context grounding/judge” loop for free-form assistant text responses.

high
Add a general grounding/wrongness check for free-form LLM outputs that are surfaced to users or used for actions: introduce a judge-based or context-based verification step (or retrieval-backed citation verification) and enforce a bounded re-check/retry policy on failure.
- agent/plugin_llm.py:604-746 — Current enforcement is only for structured outputs. This is evidence of the existing check pattern, which can be extended to cover non-structured/free-form claims.
med
For structured paths, ensure failure modes are explicit and observable (e.g., include the validation error details in the audit trail and/or add a deterministic fallback output schema on validation failure).
- agent/plugin_llm.py:604-746 — Validation failures raise `ValueError` with a message, but the surrounding retry/fallback behavior is not shown in the slice; improving audit + fallback would strengthen closed-loop safety.

Self-correction / feedback loop 0%

No closed self-correction/feedback loop was found. The code detects judge-output parse/validation failures, fails open, and pauses after repeated failures, but it does not feed the specific error back to the model for a re-check within the same validation path.

high
Implement a closed retry loop inside GoalManager/judge_goal for judge contract failures: when _parse_judge_response() reports parse_failed (empty/non-JSON), re-prompt the same judge model with an error-specific instruction (e.g., 'Output exactly one JSON object with keys done and reason; your previous output was <reason snippet>'). Bound attempts (e.g., 2-3) before falling back to the current auto-pause behavior.
- hermes_cli/goals.py:225-360 — parse failure is detected and returned as parse_failed, but the failure details are not used to create an error-fed-back next judge prompt.
- hermes_cli/goals.py:600-705 — GoalManager.evaluate_after_turn() currently only pauses/limits after repeated parse failures, rather than retrying with the error fed back to the judge model.
med
Add targeted tests that assert the loop is closed: when the judge output is non-JSON for the first attempt, the second attempt must include the specific failure and must re-run parsing. (For example, unit tests around judge_goal/_parse_judge_response with a mocked auxiliary client returning controlled invalid outputs.)
- hermes_cli/goals.py:225-360 — The parse_failed contract provides the data needed to craft error-specific feedback, so tests can verify the prompt augmentation and re-parse.
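
The closed loop recommended above can be sketched as a bounded retry that feeds the specific rejection reason into the next judge prompt. `judge_with_feedback` and the `ask_judge` callable are hypothetical stand-ins for the judge call in hermes_cli/goals.py; only the `done`/`reason` keys come from the recommendation.

```python
import json
from typing import Callable, Optional


def judge_with_feedback(
    ask_judge: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Optional[dict]:
    """Ask the judge up to max_attempts times, feeding each failure back in."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = ask_judge(current_prompt)
        try:
            verdict = json.loads(raw)
        except json.JSONDecodeError as exc:
            error = f"output was not valid JSON ({exc.msg})"
        else:
            if (
                isinstance(verdict, dict)
                and isinstance(verdict.get("done"), bool)
                and isinstance(verdict.get("reason"), str)
            ):
                return verdict
            error = "JSON must be an object with boolean 'done' and string 'reason'"
        # Close the loop: the next prompt carries the specific failure.
        current_prompt = (
            f"{prompt}\n\nYour previous output was rejected: {error}. "
            f"Previous output: {raw[:200]!r}. "
            "Output exactly one JSON object with keys done and reason."
        )
    return None  # caller falls back to the existing auto-pause behavior
```

Returning `None` after the bound is reached leaves the current pause logic in place as the final fallback.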

Evaluation harness + scoring N/A

I did not find an “Evaluation harness + scoring” primitive in this codebase. There are some benchmark/unit tests for a specific runtime evaluation path (browser CDP evaluation), but no offline golden set with automated scoring, no eval runner that logs inputs/outputs, and no evidence of recurring production eval/scoring distinct from the per-request execution loop.

high
Add a dedicated eval layer (e.g., an `evals/` package + CLI entrypoint) that runs an offline golden set, scores outputs with explicit metrics/rubrics, and logs results (inputs, model versions, prompts, outputs, scores, and pass/fail) to a persistent store.
- scripts/benchmark_browser_eval.py:1-139 — Current benchmarking is ad-hoc and does not provide the governance artifacts (golden set, rubric, logging, recurring scoring) required by this primitive.
med
Instrument the production request/agent loop to emit structured “eval candidates” (prompt/input + model version + tool context + ground-truth key/labels when available) into the logging store, but keep eval execution in the separate eval layer.
- tests/tools/test_browser_eval_supervisor_path.py:1-260 — Existing tests validate correctness of a specific eval dispatch path via mocks; this pattern should be extended into a broader, versioned, golden-set evaluation/score-and-log harness.
low
Optionally integrate an eval framework dependency (e.g., promptfoo/deepeval/langsmith/ragas) only after establishing the repo’s own artifact structure (golden set format, scorer definitions, and logging schema).
- scripts/benchmark_browser_eval.py:1-139 — There is no indication of a third-party eval framework or a structured scoring/logging pipeline; it’s currently just timing output.
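
A first cut of the eval layer described above might be a small runner that scores a golden set and returns a loggable report. Everything here is a sketch under assumptions: `run_golden_set`, the case format (`id`, `input`, `expected`), and exact-match scoring are hypothetical choices, not existing Hermes artifacts.

```python
from typing import Callable, Dict, Iterable


def run_golden_set(
    cases: Iterable[Dict[str, str]],
    system: Callable[[str], str],
    model_version: str,
) -> dict:
    """Run every golden case through `system` and score by exact match."""
    records = []
    for case in cases:
        output = system(case["input"])
        records.append(
            {
                "id": case["id"],
                "input": case["input"],
                "expected": case["expected"],
                "output": output,
                # Whitespace-insensitive exact match; swap in a rubric as needed.
                "passed": output.strip() == case["expected"].strip(),
            }
        )
    passed = sum(1 for r in records if r["passed"])
    return {
        "model_version": model_version,
        "total": len(records),
        "passed": passed,
        "score": passed / len(records) if records else 0.0,
        "records": records,
    }
```

Writing the returned report to a persistent store on each run would supply the inputs, outputs, scores, and model version that the high-priority item asks for.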

Runnable correctness checks N/A

I did not find any documented, one-command runnable pass/fail correctness-check entrypoint for this codebase (e.g., a CI workflow or root-level `test`/`check`/`build` command wired to return an unambiguous green/red status). While the repo contains Python test files (e.g., under `tests/` and `skills/.../tests/`), the required governance layer that makes correctness checks trivially runnable and externally verifiable from a single command was not located.

high
Add or document a single root command (e.g., `pytest` invocation or `make test`/`just test`) that runs the existing test suite and returns a clear pass/fail exit code; ensure it covers the agent-facing correctness scope (setup/config flows + any workflow logic with mocks).
- tests/hermes_cli/test_setup.py:1-200 — Unit tests exist, but the governance entrypoints checked in this audit show no single documented pass/fail command that orchestrates them.
- skills/creative/comfyui/tests/test_run_workflow.py:1-200 — Unit tests also cover workflow logic, confirming that correctness checks exist; what was not found is the runnable primitive itself (a one-command, externally visible pass/fail entrypoint).
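
One possible shape for such an entrypoint is a tiny wrapper whose exit code is the whole contract. This is a sketch under assumptions: `run_check` and `DEFAULT_CHECK` are hypothetical, and the `tests` path would need to match the real suite layout.

```python
import subprocess
import sys


def run_check(command: list) -> int:
    """Run a check command and return its exit code (0 means pass)."""
    completed = subprocess.run(command)
    return completed.returncode


# Hypothetical default for the repo root; not taken from the audited commit.
DEFAULT_CHECK = [sys.executable, "-m", "pytest", "tests", "-q"]

# Demonstration with trivial commands so the sketch runs anywhere.
passing = run_check([sys.executable, "-c", "raise SystemExit(0)"])
failing = run_check([sys.executable, "-c", "raise SystemExit(3)"])
```

A `make test` or `just test` target could call the same wrapper, so CI and local runs share one green/red signal.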

Actionable diagnostics 100%

The codebase includes a strong “actionable diagnostics” primitive via `hermes_cli/kanban_diagnostics.py`, which produces structured diagnostics with fix-oriented `actions`, and via `agent/lsp/reporter.py`, which externalizes LSP diagnostics with severity and exact line/column positioning. Both are runnable/consumable outputs rather than implicit logs or ad-hoc strings.

high
Audit other failure-producing surfaces (e.g., CLI “status” outputs, update/check commands, and any tool preflight errors) to ensure they consistently emit structured diagnostics with (1) a stable diagnostic code/kind, (2) precise location/context when applicable, and (3) explicit operator actions/hints—not just error strings or stack traces. Reuse the existing `Diagnostic`/`DiagnosticAction` patterns where possible.
- hermes_cli/kanban_diagnostics.py:1-80 — Provides the canonical actionable-diagnostics shape (kind/severity/title/detail/actions) that other surfaces should emulate for consistency.
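
To illustrate the kind/severity/title/detail/actions shape described above, here is a hedged sketch. The two dataclasses are stand-ins written for this example; the real `Diagnostic` and `DiagnosticAction` definitions in `hermes_cli/kanban_diagnostics.py` may have different fields, and `diagnose_missing_binary` is hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DiagnosticAction:
    label: str    # short operator-facing description of the fix
    command: str  # hypothetical field: a command the operator could run


@dataclass
class Diagnostic:
    kind: str      # stable machine-readable code
    severity: str  # e.g. "error" or "warning"
    title: str
    detail: str
    actions: list = field(default_factory=list)


def diagnose_missing_binary(name: str) -> Diagnostic:
    """Turn a bare 'not found' failure into a structured, fix-oriented record."""
    return Diagnostic(
        kind="missing_binary",
        severity="error",
        title=f"{name} is not installed",
        detail=f"The tool preflight could not find '{name}' on PATH.",
        actions=[DiagnosticAction(label=f"Install {name}",
                                  command=f"brew install {name}")],
    )


diag = diagnose_missing_binary("ffmpeg")
payload = json.dumps(asdict(diag), indent=2)  # serializable for any consumer
```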

Positive confirmation 200%

The codebase contains a clear instance of positive confirmation: `tools/terminal_tool.py::check_terminal_requirements()` returns an explicit boolean success/failure signal (not only log messages), and `tests/tools/test_terminal_requirements.py` asserts on that success signal with runnable pytest pass/fail conditions.

high
Search for other operational gates (e.g., startup readiness checks, backend selection checks, tool availability checks) and ensure they also expose an explicit positive success signal (boolean/structured status) and have corresponding tests asserting the success case.
- tools/terminal_tool.py:2369-2510 — This file demonstrates the target pattern (explicit success return + tests). Other gates should be brought to the same standard.
med
If CI/workflow files exist elsewhere in the repository but were not indexed by the code-graph query, add/verify a documented one-command test run that guarantees an unambiguous green/yellow/red outcome (positive confirmation at the repo level).
- tests/tools/test_terminal_requirements.py:1-188 — While tests provide positive confirmation locally, the repo-level CI positive confirmation signal (e.g., GitHub Actions) was not evidenced from workflow/config queries in this audit.
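
As a sketch of what an explicit success signal can look like for another operational gate, the example below returns a structured status rather than only logging. `GateResult` and `check_binary_available` are hypothetical names, and the injectable `which` parameter exists only to make the success case testable.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GateResult:
    ok: bool
    gate: str
    reason: str


def check_binary_available(
    name: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> GateResult:
    """Return an explicit pass/fail status for a tool-availability gate."""
    path = which(name)
    if path is None:
        return GateResult(False, f"binary:{name}", "not found on PATH")
    return GateResult(True, f"binary:{name}", f"found at {path}")


# Fake lookups stand in for the real PATH so both outcomes are deterministic.
found = check_binary_available("git", which=lambda _: "/usr/bin/git")
missing = check_binary_available("git", which=lambda _: None)
```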

Machine-readable contracts 100%

This codebase does have machine-readable contracts: it externalizes tool parameter expectations as JSON schema (e.g., computer_use), and it provides explicit schema sanitizer/translation modules for backend/provider compatibility (generic sanitizer + Gemini + Moonshot). The presence of focused schema modules plus targeted tests indicates the contracts are managed rather than implicit.

high
Add (or confirm) a single registry/manifest that enumerates all tool contracts (e.g., where schemas live, how to load them, and their versions), so an agent can query the available contracts without knowing file locations.
- tools/computer_use/schema.py:1-214 — Currently shows one concrete tool contract, but evidence here does not demonstrate a unified registry manifest across all tools.
med
Ensure the schema contract assets are consistently referenced/validated at the point where tool schemas are emitted to each provider (i.e., confirm call sites always use the sanitizers rather than duplicating shape assumptions).
- tools/schema_sanitizer.py:1-446 — Sanitizers exist; next step is to verify wiring at tool emission sites to ensure contracts stay source-of-truth.
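
A registry of the kind suggested above could be as small as the following sketch. `ToolContract` and `ContractRegistry` are hypothetical; `computer_use` is named in the audit, while the `browser` entry, the version strings, and the empty schemas are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    name: str
    version: str
    schema: dict  # JSON Schema for the tool's parameters


class ContractRegistry:
    def __init__(self) -> None:
        self._contracts = {}

    def register(self, contract: ToolContract) -> None:
        # Refuse silent overwrites so each tool name has one source of truth.
        if contract.name in self._contracts:
            raise ValueError(f"duplicate contract: {contract.name}")
        self._contracts[contract.name] = contract

    def get(self, name: str) -> ToolContract:
        return self._contracts[name]

    def manifest(self) -> list:
        """List every contract by name and version, sorted by name."""
        ordered = sorted(self._contracts.values(), key=lambda c: c.name)
        return [{"name": c.name, "version": c.version} for c in ordered]


registry = ContractRegistry()
registry.register(ToolContract("computer_use", "1", {"type": "object", "properties": {}}))
registry.register(ToolContract("browser", "2", {"type": "object", "properties": {}}))
manifest = registry.manifest()
```

An agent can then call `manifest()` to discover contracts without knowing where the schema modules live.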

Not applicable to this codebase: Raw / immutable source layer, Evaluation harness + scoring, Runnable correctness checks.

Tenancy Isolation

A tenant_id on every business table, row-level security in the database, and tests that prove a cross-tenant request returns 403.

0% 6/12 scored

Tenant key on every record 0%

0/1 expected sites
Cache key namespacing 0%

0/2 expected sites not present
Object/blob partitioning 0%

0/3 expected sites not present
Per-tenant resource limits 0%

0/2 expected sites not present
Tenant-scoped key management 0%

0/1 expected sites not present
Cross-tenant isolation tests 0%

0/4 expected sites not present

Tenant key on every record 0%

This codebase has tenant identifiers in some business-layer models (notably `TeamsMeetingRef.tenant_id`) and propagates them when normalizing Teams meeting data. However, the tenancy primitive does not appear to be applied consistently across related business records: for example, `MeetingArtifact` lacks a tenant key on the record itself.

high
Add `tenant_id` (or an appropriate tenant/org/workspace FK) to all other Teams pipeline record types derived from meetings (e.g., `MeetingArtifact`) and ensure normalization functions populate it from the source `TeamsMeetingRef` or Graph payload.
- plugins/teams_pipeline/models.py:151-225 — Shows `MeetingArtifact` has no tenant field, which is the core “tenant key on every record” gap.
med
Audit other domain models/dataclasses that represent persistable business records for missing tenant identifiers (not just Teams pipeline), and add a lightweight invariant/test that asserts `tenant_id` is present on all records in those model modules.
- plugins/teams_pipeline/models.py:108-150 — Demonstrates one record type already includes `tenant_id`; use this as a template for extending the pattern to other record types.
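
The lightweight invariant mentioned above can be sketched with `dataclasses.fields`. The two record classes here are simplified stand-ins, not the real `TeamsMeetingRef` or `MeetingArtifact`, and `records_missing_tenant_key` is a hypothetical helper.

```python
import dataclasses


@dataclasses.dataclass
class MeetingRefSketch:
    tenant_id: str
    meeting_id: str


@dataclasses.dataclass
class ArtifactSketch:
    artifact_id: str  # deliberately lacks tenant_id, mirroring the reported gap


def records_missing_tenant_key(record_types) -> list:
    """Return the names of dataclasses that do not declare a tenant_id field."""
    missing = []
    for cls in record_types:
        names = {f.name for f in dataclasses.fields(cls)}
        if "tenant_id" not in names:
            missing.append(cls.__name__)
    return missing


missing = records_missing_tenant_key([MeetingRefSketch, ArtifactSketch])
```

In a test module, asserting that this list is empty would fail as soon as a new record type ships without a tenant key.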

Database-enforced isolation N/A

No database-enforced (row-level security / FORCE RLS / tenant-scoped schema) isolation primitive is present. The codebase uses shared SQLite databases for state/session data and uses filesystem-per-board SQLite DBs for kanban separation, but there is no evidence of tenant/org row-level filtering or enforced DB policies that would prevent cross-tenant reads if application code forgot the filter.

high
If the system is intended to be multi-tenant, add a tenant identifier column (e.g., tenant_id/org_id) to each shared, writeable table in the DB layer (starting with sessions/messages and any other shared persistence). Then enforce access with database mechanisms. SQLite has no native row-level security, so this means either approximating it (e.g., with tenant-filtered views) or moving to a database that supports RLS, such as PostgreSQL. Ensure policies are FORCE/mandatory so table owners cannot bypass them.
- hermes_state.py:220-520 — Shared `sessions` and `messages` tables are defined without any tenant/org column, so there is no place for tenant-scoped DB policies to apply.
- hermes_cli/kanban_db.py:1-220 — Kanban isolation is achieved via per-board separate SQLite files; this does not provide the requested defense-in-depth against missing tenant filters within a shared DB table.
med
Add automated integration tests that attempt cross-tenant reads and list/export of resources by crafting a request with a different tenant identity (or by directly querying the DB without applying tenant filters) and assert the access is denied / rows are not returned.
- hermes_state.py:220-520 — Current schema and design imply no tenant boundary exists at the persistence layer for state/session data, so cross-tenant tests should be introduced to validate the new enforcement.

Default-scoped queries N/A

I did not find an implementation of “default-scoped queries” (i.e., a data-access base model/repository that automatically applies tenant scoping to every query when no tenant filter is provided). The codebase appears to rely on isolation via partitioning (e.g., different SQLite DB files per Kanban board) rather than default-scoped query enforcement in the data-access layer, so this primitive is not applicable/present in this repository.

med
If this system is expected to be multi-tenant at the database row level, introduce a tenant-aware data-access layer (base repository/model with an implicit tenant predicate) and add tests that attempt cross-tenant reads/lists/exports without specifying tenant filters.
- hermes_cli/kanban_db.py:1-60 — Current isolation approach is board/DB-path partitioning; there is no indication of query-level default scoping in a shared repository/base model.
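
A sketch of a default-scoped repository follows, using an in-memory SQLite database. The `sessions` schema here is invented for the example and is not the schema in `hermes_state.py`; `TenantScopedSessions` is a hypothetical class. Note that this is application-level scoping, so it complements rather than replaces database-enforced isolation.

```python
import sqlite3


class TenantScopedSessions:
    """Every query carries the tenant predicate; callers cannot omit it."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str) -> None:
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._conn = conn
        self._tenant_id = tenant_id

    def add(self, session_id: str, title: str) -> None:
        self._conn.execute(
            "INSERT INTO sessions (tenant_id, id, title) VALUES (?, ?, ?)",
            (self._tenant_id, session_id, title),
        )

    def list_titles(self) -> list:
        rows = self._conn.execute(
            "SELECT title FROM sessions WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        ).fetchall()
        return [row[0] for row in rows]

    def get_title(self, session_id: str):
        row = self._conn.execute(
            "SELECT title FROM sessions WHERE tenant_id = ? AND id = ?",
            (self._tenant_id, session_id),
        ).fetchone()
        return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (tenant_id TEXT NOT NULL, id TEXT NOT NULL, "
    "title TEXT, PRIMARY KEY (tenant_id, id))"
)
repo_a = TenantScopedSessions(conn, "tenant-a")
repo_b = TenantScopedSessions(conn, "tenant-b")
repo_a.add("s1", "A's session")
repo_b.add("s1", "B's session")
```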

Tenant context at the boundary N/A

Agent produced no parseable output for this item.

Cache key namespacing 0%

No evidence of a cache-key namespacing primitive (tenant-prefixed cache keys like `tenant:{id}:...`) was found. Cache implementations observed include (1) a shared on-disk sticker cache keyed only by `file_unique_id`, and (2) a shared in-process memoization cache keyed only by config path/mtime—neither includes any tenant component.

high
Introduce tenant-aware cache key namespacing (or per-tenant cache partitioning) in `gateway/sticker_cache.py` so cached sticker descriptions cannot be read across tenants when multiple tenants share the same Hermes home/process.
- gateway/sticker_cache.py:66-109 — Cache lookup/update uses `cache.get(file_unique_id)` and `cache[file_unique_id]=...` without any tenant prefix/partitioning.
high
Partition the module-level in-process memoization cache in `agent/skill_utils.py` by tenant (e.g., include tenant id in the cache key tuple, or create separate caches per tenant context).
- agent/skill_utils.py:290-383 — The cache key is `(str(config_path), stat.st_mtime_ns)` and the cache is a global dict `_EXTERNAL_DIRS_CACHE`, with no tenant component.
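
The `tenant:{id}:...` key pattern can be sketched as below. `tenant_cache_key` and `TenantCache` are hypothetical, and rejecting `:` in tenant ids is an assumption made here so that two different (tenant, key) pairs can never produce the same string.

```python
def tenant_cache_key(tenant_id: str, *parts: str) -> str:
    """Build a cache key of the form tenant:{id}:{part}:..."""
    if not tenant_id or ":" in tenant_id:
        raise ValueError("tenant_id must be non-empty and must not contain ':'")
    return ":".join(("tenant", tenant_id, *parts))


class TenantCache:
    """In-memory cache whose lookups always go through the namespaced key."""

    def __init__(self) -> None:
        self._store = {}

    def get(self, tenant_id: str, key: str):
        return self._store.get(tenant_cache_key(tenant_id, key))

    def set(self, tenant_id: str, key: str, value) -> None:
        self._store[tenant_cache_key(tenant_id, key)] = value


cache = TenantCache()
cache.set("tenant-a", "sticker-123", "a cat waving")
own_hit = cache.get("tenant-a", "sticker-123")
cross_hit = cache.get("tenant-b", "sticker-123")
```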

Object/blob partitioning 0%

No evidence was found that object/blob storage artifacts are partitioned by tenant. The clearest persistence mechanism (`tools/tool_result_storage.py`) writes oversized tool outputs into a shared sandbox directory using only `tool_use_id` in the filename, with no tenant/org/workspace component in the storage path.

high
Tenant-scope persisted tool-result storage paths by default. For example, change `remote_path` to include a tenant-derived prefix (e.g., `.../{tenant_id}/{tool_use_id}.txt`) and update any corresponding read/retrieval logic to require the same tenant prefix.
- tools/tool_result_storage.py:124-159 — Shows the current shared-path construction `remote_path = f"{storage_dir}/{tool_use_id}.txt"` without any tenant identifier.
med
Make `_resolve_storage_dir()` return a tenant-scoped base directory (or accept tenant_id and append it), so the entire persistence stack is isolated even when called from new tool code paths.
- tools/tool_result_storage.py:36-62 — Shows `_resolve_storage_dir()` always returning a common directory (default `STORAGE_DIR = "/tmp/hermes-results"`) with no tenant scoping.
med
Add integration tests that attempt cross-tenant retrieval of persisted artifacts (e.g., persist a large tool output under tenant A, then try to read it while authenticated as tenant B, asserting denial/not-found).
- tools/tool_result_storage.py:124-159 — The persistence behavior creates artifacts that would be vulnerable to cross-tenant retrieval if retrieval is path-based without tenant scoping.
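
A hedged sketch of tenant-prefixed result paths, written against a local temp directory. The helper names are hypothetical and do not mirror the real functions in `tools/tool_result_storage.py`; the identifier check is an added assumption to keep a crafted tenant id from escaping its directory.

```python
import re
import tempfile
from pathlib import Path

_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def tenant_result_path(storage_dir: Path, tenant_id: str, tool_use_id: str) -> Path:
    """Build {storage_dir}/{tenant_id}/{tool_use_id}.txt, rejecting unsafe ids."""
    for value in (tenant_id, tool_use_id):
        if not _SAFE_ID.fullmatch(value):
            raise ValueError(f"unsafe identifier: {value!r}")
    return storage_dir / tenant_id / f"{tool_use_id}.txt"


def persist_result(storage_dir: Path, tenant_id: str, tool_use_id: str, content: str) -> Path:
    path = tenant_result_path(storage_dir, tenant_id, tool_use_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return path


def read_result(storage_dir: Path, tenant_id: str, tool_use_id: str):
    """Reads require the same tenant prefix; a wrong tenant yields None."""
    path = tenant_result_path(storage_dir, tenant_id, tool_use_id)
    return path.read_text(encoding="utf-8") if path.exists() else None


base = Path(tempfile.mkdtemp())
persist_result(base, "tenant-a", "call-1", "large tool output")
own_read = read_result(base, "tenant-a", "call-1")
cross_read = read_result(base, "tenant-b", "call-1")
```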

Tenant context in async work N/A

This codebase does not appear to implement a “tenant context in async work” primitive. It uses `contextvars` to propagate per-message/per-session gateway context (`HERMES_SESSION_*`) safely across concurrent asyncio tasks, but there is no corresponding tenant/org/workspace context type that is set into async workers/handlers and then enforced as mandatory before data access.

high
If the product requires multi-tenant isolation, introduce a dedicated tenant context primitive (e.g., `HERMES_TENANT_ID` as a `contextvars.ContextVar`) and ensure every async entry point (queue/message/event handlers, background tasks, tool execution, SSE/run events) re-establishes tenant context before touching any tenant-scoped storage.
- gateway/session_context.py:1-195 — Current async context propagation is session-scoped only; this is the layer where tenant-scoped context should be added if multi-tenancy exists.
med
Add integration tests that attempt cross-tenant async operations (list/read/write/export, and also async job/event paths) and assert they are denied using uniform not-found/forbidden semantics.
- tests/gateway/test_session_env.py:1-220 — There are concurrency/isolation tests for session contextvars, but none for tenant isolation; extend the pattern to tenant context once the tenant primitive exists.
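
The tenant context primitive could follow the same `contextvars` pattern the gateway already uses for session data. In this sketch `HERMES_TENANT_ID` takes its name from the recommendation above, while `require_tenant` and `handle_event` are hypothetical. The variable has no default, so data access without a tenant fails loudly.

```python
import asyncio
import contextvars

# No default: reading it before it is set raises LookupError.
HERMES_TENANT_ID = contextvars.ContextVar("HERMES_TENANT_ID")


def require_tenant() -> str:
    """Return the current tenant, or fail if the entry point never set one."""
    try:
        return HERMES_TENANT_ID.get()
    except LookupError:
        raise RuntimeError("tenant context is not set") from None


async def handle_event(tenant_id: str, delay: float) -> str:
    # Each async entry point re-establishes tenant context first.
    HERMES_TENANT_ID.set(tenant_id)
    await asyncio.sleep(delay)  # other tasks run here
    return require_tenant()     # still sees its own tenant


async def main() -> list:
    # gather wraps each coroutine in a task with its own copy of the context.
    return await asyncio.gather(
        handle_event("tenant-a", 0.02),
        handle_event("tenant-b", 0.01),
    )


results = asyncio.run(main())
```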

Per-tenant resource limits 0%

The codebase includes multiple rate-limit/quota-related mechanisms (e.g., a shared Nous Portal “rate-limited” breaker file and a process-wide Signal attachment token bucket), but they are not applied per tenant. The mechanisms appear global/shared rather than keyed by tenant, so noisy-neighbor isolation for this primitive is missing.

high
Change Nous Portal rate-limit breaker state to be tenant-scoped: derive tenant from the authenticated request/session context and include it in the persisted key/path (and in any in-memory representations). This ensures one tenant’s 429/cooldown cannot block other tenants’ Nous usage.
- agent/nous_rate_guard.py:1-120 — State is written to a single shared file path returned by _state_path() using a fixed subdir/filename; no tenant component exists.
high
Key the Signal attachment scheduler/bucket on tenant: ensure acquire/refill/feedback accounting happens per tenant (e.g., SignalAttachmentScheduler per tenant, or a dict of buckets keyed by tenant id).
- gateway/platforms/signal_rate_limit.py:1-140 — The scheduler is a token-bucket simulator with a single bucket state (capacity/tokens/refill) and an asyncio lock; there is no tenant parameter/keying in the design shown.
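
The "dict of buckets keyed by tenant id" option can be sketched as follows. Both classes are written for this example and are not the real `SignalAttachmentScheduler`; the injectable `clock` is an assumption that makes the behavior deterministic in tests, and the sketch omits the asyncio lock a concurrent version would need.

```python
import time


class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, capped at capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class PerTenantLimiter:
    """One bucket per tenant, created lazily, so tenants cannot starve each other."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic) -> None:
        self._capacity = capacity
        self._refill = refill_per_sec
        self._clock = clock
        self._buckets = {}

    def try_acquire(self, tenant_id: str, cost: float = 1.0) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._refill, self._clock)
            self._buckets[tenant_id] = bucket
        return bucket.try_acquire(cost)


# Frozen clock: no refill happens, so the third request from tenant-a is refused.
limiter = PerTenantLimiter(capacity=2, refill_per_sec=1.0, clock=lambda: 0.0)
tenant_a = [limiter.try_acquire("tenant-a") for _ in range(3)]
tenant_b = limiter.try_acquire("tenant-b")
```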

Tenant-scoped key management 0%

No evidence of tenant-scoped encryption key management (per-tenant KMS/envelope keys, per-tenant crypto-erase, or explicit tenant-key references) exists in the codebase. Crypto functionality present (e.g., WeCom callback AES-CBC) uses key material passed in and reused without tenant scoping.

high
Introduce tenant-scoped key management at the lowest crypto layer: implement a key-provider that resolves the correct per-tenant key (or envelope data key) using a tenant identifier derived from the trusted session/context, not from request parameters. Update crypto call sites (e.g., `WXBizMsgCrypt`) to take a tenant context and fetch/decrypt the correct key per tenant before encrypt/decrypt.
- gateway/platforms/wecom_crypto.py:43-120 — Current design uses a single `encoding_aes_key` stored as `self.key` with no tenant key resolution step.
med
Add integration tests that attempt cross-tenant encryption/decryption with tenant A vs tenant B credentials to ensure keys/envelopes cannot be mixed and that the wrong tenant cannot decrypt data.
- tests/acp/test_approval_isolation.py:1-244 — There are strong isolation/concurrency tests in the codebase, but none were observed specifically for tenant-scoped key management.
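
The key-provider idea can be illustrated with an in-memory stand-in. `TenantKeyProvider` is hypothetical, and a real deployment would more likely resolve keys through a KMS with per-tenant envelope keys; the sketch only shows the lookup and crypto-erase contract that call sites would depend on.

```python
import secrets


class TenantKeyProvider:
    """Resolves one key per tenant; erasing the key makes its data unreadable."""

    def __init__(self) -> None:
        self._keys = {}

    def create_key(self, tenant_id: str) -> None:
        self._keys[tenant_id] = secrets.token_bytes(32)

    def key_for(self, tenant_id: str) -> bytes:
        # tenant_id should come from trusted session context, not request input.
        try:
            return self._keys[tenant_id]
        except KeyError:
            raise LookupError(f"no key for tenant {tenant_id!r}") from None

    def erase(self, tenant_id: str) -> None:
        self._keys.pop(tenant_id, None)


provider = TenantKeyProvider()
provider.create_key("tenant-a")
provider.create_key("tenant-b")
key_a = provider.key_for("tenant-a")
key_b = provider.key_for("tenant-b")
```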

Admin / role scoping N/A

This codebase does not implement the “Admin / role scoping” tenancy isolation primitive. There is no evidence of a tenant membership–scoped elevated role model with an `isAdmin`-style boolean that is tied to a tenant/membership, nor an explicit separate audited cross-tenant admin capability. Where “admin”/roles appear, it is either non-tenant (e.g., diagnostics) or used for other purposes (e.g., generic ACP permission bridging).

high
Introduce a tenant membership–scoped elevated role model in the authz layer (e.g., roles bound to a membership/tenant_id FK) and ensure cross-tenant elevated access is handled via a separate, explicitly audited capability. Confirm enforcement is default at the lowest layer (DB/RLS or a centralized repository scope), not scattered app-level checks.
- scripts/discord-voice-doctor.py:1-200 — Demonstrates that `is_admin` appears but is not part of tenant-scoped, audited authz enforcement.
med
Add integration tests that attempt cross-tenant admin actions (read/list/export/approval) and assert denial with uniform error behavior. This should be done at the boundary where admin checks are performed (authz middleware/handler and the data-access layer).
- tests/gateway/test_teams.py:1-220 — Existing gateway/platform tests mock SDKs and cover platform behavior, but none (from the audited searches/evidence) cover tenant-scoped admin isolation.

Uniform not-found vs. forbidden N/A

I did not find any implementation of the “uniform not-found vs forbidden” tenancy isolation primitive (i.e., returning the same not-found response for both missing and access-denied, to avoid leaking resource existence across tenants). The codebase’s visible access-control layer for the dashboard uses 401/redirect semantics rather than a tenant-scoped 403-vs-404 pattern.

high
Identify the tenant-scoped data model and the specific HTTP/API endpoints that fetch tenant/org-scoped resources by ID (where cross-tenant reads could return either 403 or 404). Then implement a single shared error/exception mapping at the data-access boundary so that access-denied is converted to the same not-found response as missing records for any tenant-scoped fetch.
- hermes_cli/dashboard_auth/middleware.py:1-260 — Current gate behavior focuses on authentication (401/redirect) rather than uniformizing not-found vs forbidden for tenant-scoped authorization failures.
med
Add an integration test that creates/uses two different tenant/org identities and attempts a cross-tenant read (and list/export if applicable). Assert the response for (a) non-existent resource id and (b) an existing resource in another tenant are identical (same status code and response body/error).
- tests/hermes_cli/test_dashboard_auth_audit.py:1-82 — Existing tests cover audit logging/redaction, not cross-tenant authorization semantics (403 vs 404 uniformity).
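
A minimal sketch of the shared mapping described above: a missing record and a record owned by another tenant take the same code path and produce the same response. `RESOURCES`, `fetch_resource`, and `to_response` are hypothetical, and the in-memory dict stands in for a real data-access layer.

```python
class NotFound(Exception):
    pass


# Hypothetical store: one resource owned by tenant-a.
RESOURCES = {"r1": {"tenant_id": "tenant-a", "body": "hello"}}


def fetch_resource(resource_id: str, caller_tenant: str) -> str:
    record = RESOURCES.get(resource_id)
    # Missing and wrong-tenant are deliberately indistinguishable to the caller.
    if record is None or record["tenant_id"] != caller_tenant:
        raise NotFound("resource not found")
    return record["body"]


def to_response(resource_id: str, caller_tenant: str) -> tuple:
    try:
        return 200, fetch_resource(resource_id, caller_tenant)
    except NotFound as exc:
        return 404, str(exc)


missing = to_response("does-not-exist", "tenant-b")
cross_tenant = to_response("r1", "tenant-b")
owner = to_response("r1", "tenant-a")
```

The integration test proposed above then reduces to asserting that `missing` and `cross_tenant` are equal.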

Cross-tenant isolation tests 0%

No cross-tenant isolation test suite was found. Existing “isolation” tests focus on concurrency/session context leakage, cache aliasing, and platform-based state namespacing; none attempt cross-tenant read/write/list/export/async operations and assert denial.

high
Add a dedicated cross-tenant isolation integration test module that creates two tenants (or two identities bound to different tenants), then attempts cross-tenant read, write, list, and export for each tenant-scoped resource and asserts failure (prefer uniform not-found/forbidden behavior depending on your API contract). Include async/enqueued paths as well.
- tests/gateway/test_config_driven_access_policy.py:1-240 — Gateway authorization is already tested at a policy-contract level, but it does not cover cross-tenant access denial.
high
Extend the test strategy to cover any cross-request/shared storage and background workflows (caches, tool registries, event/queue handlers) with explicit tenant separation assertions (tenant A cannot observe tenant B outputs).
- tests/test_get_tool_definitions_cache_isolation.py:1-95 — Currently checks object aliasing, not tenant-bound data leakage.
med
Create a test harness/fixture for “tenant context” setup (two tenants + tenant-bound identity/session) so every resource test can reuse the same cross-tenant deny assertions across read/write/list/export/async.
- tests/gateway/test_voice_mode_platform_isolation.py:1-218 — Current harnesses use platform/chat_id namespacing; cross-tenant needs a tenant-bound identity/session fixture instead.

Not applicable to this codebase: Database-enforced isolation, Default-scoped queries, Tenant context at the boundary, Tenant context in async work, Admin / role scoping, Uniform not-found vs. forbidden.

Identity & Access

SAML/OIDC libraries, SCIM provisioning endpoints, and a real roles/permissions schema — not a hard-coded isAdmin boolean.

57% 9/11 scored

Federated SSO (SAML/OIDC) 100%

5/5 expected sites
RBAC modeled as data 0%

0/3 expected sites not present
Centralized authorization 67%

2/2 expected sites
No hardcoded privilege shortcuts 0%

0/1 expected sites not present
Deny-by-default 100%

2/2 expected sites
MFA / step-up auth 0%

0/3 expected sites not present
Session & token hygiene 94%

6/6 expected sites
Scoped machine credentials 0%

0/2 expected sites not present
IP allowlists / network constraints 150%

3/2 expected sites

Federated SSO (SAML/OIDC) 100%

Federated SSO exists in the dashboard authentication layer using a standardized, provider-pluggable OAuth/OIDC-like flow. The code wires a federated login start endpoint, a security-critical callback that validates CSRF state before completing login, a centralized middleware that verifies/refreshes sessions on each protected request, and logout/session invalidation plus guarded WS ticket minting.

high
Review and document per-provider verify_session semantics (e.g., whether tokens are cryptographically validated server-side vs. introspected) and ensure all registered providers conform to the expected “returns None on expiry/invalid” contract without bypasses.
- hermes_cli/dashboard_auth/base.py:1-159 — Defines the provider protocol/semantics (verify_session returns Optional[Session], refresh_session behavior, revoke_session best-effort).
- hermes_cli/dashboard_auth/middleware.py:80-220 — Middleware relies on verify_session returning None vs raising ProviderError; correctness and cryptographic validation depend on provider implementations.
med
Add explicit automated tests asserting that the callback rejects attacker-controlled next/state/cookie tampering across all registered providers (covering the state mismatch branch and missing_pkce_cookie branch).
- hermes_cli/dashboard_auth/routes.py:220-320 — Callback implements missing_pkce_cookie (400) and state mismatch (400) defenses; tests should ensure these remain provider-agnostic.
low
Ensure the public /api/auth/providers endpoint is rate-limited or protected against abuse if it can expose provider metadata in a sensitive deployment context.
- hermes_cli/dashboard_auth/routes.py:115-138 — Exposes provider list for login bootstrap and returns 503 when none registered; may be acceptable but should be checked for deployment threat model.

Directory provisioning (SCIM) N/A

There is no Directory provisioning (SCIM) primitive implemented in this codebase. The only “directory” concept found is a channel directory cache (messaging reachability), and authentication is implemented for the CLI/agent via OAuth/API keys rather than any SCIM 2.0 Users/Groups lifecycle (including deprovisioning). Because this repository does not appear to expose any SCIM-compatible identity API surface, there are no concrete SCIM lifecycle sites to validate for correctness.

high
If SCIM provisioning is a requirement for this product, add a dedicated identity/provisioning API module that implements the SCIM 2.0 surface (/scim/v2/Users and /scim/v2/Groups) including PATCH and a full lifecycle with deactivation that actually revokes access (e.g., sets user inactive, invalidates sessions/tokens, and prevents future authorization).
- hermes_cli/auth.py:1-60 — Current auth is CLI/agent authentication (OAuth/API keys), indicating no existing identity provisioning API wiring to extend.
med
Add integration tests that cover the SCIM lifecycle end-to-end (create user → verify access → deactivate/suspend → verify access is revoked; and optionally delete → verify access is removed).
- tests/gateway/test_channel_directory.py:1-40 — Existing tests are for channel directory behavior, demonstrating the repo currently lacks SCIM lifecycle test coverage.

RBAC modeled as data 0%

RBAC modeled as data (roles/permissions/memberships with centralized, permission-first authorization checks) is not implemented as a distinct primitive in this codebase. The gateway implements authorization for slash commands via config-derived allowlists (admin_user_ids and per-command allow sets) in gateway/slash_access.py, but this does not reflect an RBAC roles/permissions data model checked through one policy layer.

high
Introduce (or integrate) a data-driven RBAC model: define role, permission, role_permissions, and memberships (whether persisted or loaded from config), and build an authorization service/engine that takes (principal, action/resource) → allow/deny based on permissions derived from role memberships.
- gateway/slash_access.py:90-140 — policy_from_extra() currently builds authorization from allowlist keys; replace this with RBAC membership/role assignment resolution and permission aggregation.
high
Centralize authorization decisions so every slash-command dispatch consults the same RBAC engine (deny-by-default), instead of evaluating admin/user_allowed_commands in multiple places (or inlined policy objects).
- gateway/slash_access.py:30-70 — SlashAccessPolicy.can_run() is the current enforcement point; refactor it to perform permission checks from the RBAC engine rather than admin/user_allowed_commands logic.
med
If you must keep backward compatibility with existing config formats, add a translation layer that maps legacy allowlist fields (allow_admin_from, user_allowed_commands) into synthetic RBAC roles/permissions assignments, so authorization decisions remain uniformly data-driven.
- gateway/slash_access.py:90-140 — DM/group scope-specific keys are currently parsed here; keep parsing for compatibility but convert into RBAC membership/permission structures for enforcement.
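
The role, role_permissions, and memberships model can be sketched with plain data structures. Every name and value below is illustrative; in particular, the permission strings and user ids are invented and do not correspond to real Hermes slash commands or config keys.

```python
# Roles map to permissions; principals map to roles. Both are data, not code.
ROLE_PERMISSIONS = {
    "operator": {"slash:status", "slash:help"},
    "admin": {"slash:status", "slash:help", "slash:restart"},
}
MEMBERSHIPS = {
    "user-1": {"operator"},
    "user-2": {"admin"},
}


def permissions_for(principal: str) -> set:
    """Aggregate permissions across all roles the principal holds."""
    granted = set()
    for role in MEMBERSHIPS.get(principal, ()):
        granted |= ROLE_PERMISSIONS.get(role, set())
    return granted


def is_allowed(principal: str, permission: str) -> bool:
    """Single decision point: deny unless a held role grants the permission."""
    return permission in permissions_for(principal)


operator_restart = is_allowed("user-1", "slash:restart")
admin_restart = is_allowed("user-2", "slash:restart")
unknown_status = is_allowed("user-9", "slash:status")
```

Because the check is on a permission rather than on an identity string, granting or revoking access becomes a data change instead of a code change.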

Centralized authorization 67%

The codebase implements centralized authorization mainly in the messaging gateway: a single `_is_user_authorized(...)` decision point is used to gate user-originated events before dispatch, with a default-deny posture and consistent logging on unauthorized attempts. For the Hermes dashboard, there is also a single auth-gate middleware (`gated_auth_middleware`) with allowlisted public paths and enforced session verification; however, the code slice reviewed was only the middleware header/initial portion, so only the gateway decision point is fully evidenced as a correct centralized authz chokepoint.

high
Audit and document the dashboard authorization flow end-to-end: confirm that `gated_auth_middleware` (when `auth_required=True`) is the single place that makes allow/deny decisions for all non-public dashboard routes, and ensure every authorized/denied outcome is logged via the existing `audit_log` events.
- hermes_cli/dashboard_auth/middleware.py:1-200 — Middleware is explicitly described as a centralized auth gate that enforces verified sessions for all routes except the configured public allowlist; verify the remainder of the file confirms a single decision boundary plus decision logging.
med
Reduce drift risk in gateway authz semantics by ensuring adapter-owned access policy (`enforces_own_access_policy` / `dm_policy` etc.) is the only alternative path, and add explicit tests that assert `_is_user_authorized` is still the single decision chokepoint even when plugins run (confirm intended bypasses).
- gateway/run.py:7300-7450 — Plugin hook runs before auth and can return `skip`; ensure these bypasses are intended and cannot accidentally create implicit authorization gaps.

No hardcoded privilege shortcuts 0%

The codebase does not correctly apply the primitive 'No hardcoded privilege shortcuts'. In the slash-command authorization flow, privileged access is determined by checking whether the caller’s identity (`user_id`) is present in an operator-configured admin list, rather than by deriving privilege from a roles/permissions model.

high
Remove the identity-string-based privilege shortcut in `SlashAccessPolicy.is_admin` / `can_run`. Replace it with role/permission evaluation sourced from the canonical role model (e.g., memberships → role_permissions → permissions) and enforce the decision via a centralized policy module.
- gateway/slash_access.py:67-103 — The privilege gate is implemented by `return str(user_id) in self.admin_user_ids` and then used by `can_run` to allow all commands for admins. This is the exact anti-pattern the primitive forbids.

Deny-by-default 100%

The codebase implements deny-by-default at the dashboard auth boundary. Non-loopback mode uses a centralized allowlist (_path_is_public / PUBLIC_API_PATHS + public path prefixes) and otherwise returns 401/redirect unless a verified session is attached; legacy loopback mode similarly requires the ephemeral session token for all /api/* routes except explicitly listed public endpoints. This prevents silently public endpoints when new routes are added under /api/.

high
Add a new /api/* endpoint to PUBLIC_API_PATHS (or the explicit public prefix list) only after a threat review confirms it is truly non-sensitive; otherwise leave it to the default-deny middleware behavior.
- hermes_cli/dashboard_auth/public_paths.py:1-50 — PUBLIC_API_PATHS is the explicit allowlist controlling which /api/* endpoints bypass the OAuth gate.
- hermes_cli/dashboard_auth/middleware.py:106-196 — Gated behavior is driven by _path_is_public(); everything else is rejected by default.
med
Keep the legacy and OAuth gates synchronized by routing both through the same PUBLIC_API_PATHS allowlist (already done), and add a regression test that catches future drift, so that newly added paths are not exposed unintentionally.
- hermes_cli/dashboard_auth/public_paths.py:1-24 — Documents that drift between independent allowlists caused /api/status to be accidentally public in one mode; centralization is intended to prevent recurrence.
- hermes_cli/web_server.py:301-322 — Legacy auth_middleware references _PUBLIC_API_PATHS imported from dashboard_auth.public_paths, ensuring shared allowlisting.

AuthN before AuthZ at the boundary N/A

Agent produced no parseable output for this item.

MFA / step-up auth 0%

I found dashboard authentication enforced via OAuth session cookies with token verification/refresh, but no MFA/step-up (no second-factor libraries or enrollment/verify challenges, and no step-up enforcement logic on high-risk operations). Therefore, this primitive appears absent in this codebase.

high
Introduce an MFA/step-up mechanism integrated into the centralized dashboard auth boundary (the auth gate middleware). Specifically: add a step-up-required decision point for sensitive actions (admin endpoints, credential/config changes), trigger a second-factor verification challenge, and ensure the step-up result is time-bound and auditable (e.g., stored as a claim in the session / separate step-up cookie, with event logs).
- hermes_cli/dashboard_auth/middleware.py:200-345 — The current gate verifies/refreshes the session cookie but has no second-factor or step-up enforcement path.
high
Add step-up enrollment/verification routes and wiring for the chosen second factor (TOTP or WebAuthn or an enterprise IdP step-up). Ensure the callback/challenge flow results in a server-validated step-up status that high-risk endpoints check before executing.
- hermes_cli/dashboard_auth/routes.py:120-230 — Current login/callback completes OAuth and sets session cookies; no MFA challenge/enrollment routes exist in the observed auth route layer.
med
Create/centralize a policy list of “step-up required” operations and ensure it cannot drift from the route allowlist. Use one shared source of truth for (a) unauth bypass paths and (b) auth-only vs. auth+step-up paths.
- hermes_cli/dashboard_auth/public_paths.py:1-50 — This file is already a shared allowlist for auth bypass; it should be extended/paired with step-up-required policy so sensitive operations are never implicitly allowed without second factor.
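
A sketch of a time-bound step-up decision, assuming the session is a plain dict carrying a hypothetical `step_up_verified_at` timestamp. The path set, the TTL, and `needs_step_up` are all illustrative and not taken from the dashboard auth code.

```python
import time

# Hypothetical policy list: operations that need a recent second factor.
STEP_UP_REQUIRED = {"/api/config", "/api/credentials"}
STEP_UP_TTL_SECONDS = 300


def needs_step_up(path: str, session: dict, now=None) -> bool:
    """True when the path is sensitive and the step-up is absent or stale."""
    if path not in STEP_UP_REQUIRED:
        return False
    current = time.time() if now is None else now
    verified_at = session.get("step_up_verified_at")
    return verified_at is None or current - verified_at > STEP_UP_TTL_SECONDS


fresh = needs_step_up("/api/config", {"step_up_verified_at": 1000.0}, now=1100.0)
stale = needs_step_up("/api/config", {"step_up_verified_at": 1000.0}, now=1400.0)
never = needs_step_up("/api/credentials", {}, now=1100.0)
ordinary = needs_step_up("/api/status", {}, now=1100.0)
```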

Session & token hygiene 94%

This codebase has solid session/token hygiene for the dashboard: access tokens are verified per request, expired access tokens trigger refresh rotation, logout revokes refresh tokens (best-effort) and clears cookies, and WebSocket access uses short-lived, single-use tickets with TTL and server-side consume/delete behavior. However, an additional internal WS credential is explicitly non-expiring and multi-use, which is a hygiene gap relative to 'short-lived, rotated, revocable' tokens.

high
Bring the internal WS credential in line with the same hygiene: introduce rotation (per interval or per spawn), expiry, and server-side revocation. Today internal_ws_credential() is explicitly 'process-lifetime' and 'never expires'.
- hermes_cli/dashboard_auth/ws_tickets.py:86-132 — internal_ws_credential() is minted once per process, is multi-use, and the docstring states it 'never expires'. This conflicts with the primitive requirement for short-lived, revocable tokens.
med
Confirm provider implementations fully enforce 'refresh dead/reuse-detected' semantics such that refresh_session can reliably raise RefreshExpiredError, ensuring refresh token reuse is handled as a revocation/forced re-login event.
- hermes_cli/dashboard_auth/middleware.py:150-345 — Refresh path depends on providers raising RefreshExpiredError to clear cookies and force re-auth; middleware otherwise returns a refreshed session and re-sets rotated cookies.
- hermes_cli/dashboard_auth/base.py:1-159 — Provider protocol documents RefreshExpiredError as the 'dead/reuse-detected RT → force re-login' behavior contract.
low
Add automated tests that assert logout actually stops replay: after /auth/logout, a previously issued access cookie should be rejected by verify_session (or access token TTL should be minimized) and a refresh cookie should either be revoked or rejected on the next refresh attempt.
- hermes_cli/dashboard_auth/routes.py:320-420 — Logout clears cookies and attempts provider.revoke_session(refresh_token=rt); tests should validate end-to-end that subsequent verify_session/refresh_session fails.
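
One way to give the internal WS credential an expiry, rotation, and revocation is sketched below. `RotatingCredential` and its TTL are hypothetical; the sketch does not reproduce how `internal_ws_credential()` is actually minted or distributed to child processes.

```python
import hmac
import secrets
import time
from typing import Optional


class RotatingCredential:
    """Hypothetical replacement for a process-lifetime credential.

    The value expires after ttl_seconds, is re-minted on demand, and can be
    revoked server-side at any time.
    """

    def __init__(self, ttl_seconds: float = 900.0):
        self._ttl = ttl_seconds
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def current(self, now: Optional[float] = None) -> str:
        # Mint a fresh value when none exists or the old one has expired.
        now = time.monotonic() if now is None else now
        if self._value is None or now >= self._expires_at:
            self._value = secrets.token_urlsafe(32)
            self._expires_at = now + self._ttl
        return self._value

    def verify(self, presented: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self._value is None or now >= self._expires_at:
            return False
        # Constant-time comparison avoids leaking the value through timing.
        return hmac.compare_digest(presented.encode(), self._value.encode())

    def revoke(self) -> None:
        self._value = None
        self._expires_at = 0.0
```

Callers that hold the credential for a long time would need to re-fetch it after rotation, which is the main integration cost of this design.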

Scoped machine credentials 0%

The codebase does not implement the 'Scoped machine credentials' primitive. The API server uses a single shared bearer secret (API_SERVER_KEY) validated via _check_auth, with no per-client scoping, revocability, or service-account model. While the repo has credential pooling for upstream LLM providers, that is not a scoped machine-credential scheme for inbound programmatic access to Hermes itself.

high
Replace the single global API_SERVER_KEY with a service-account / api_keys model that stores (at minimum) scopes/permissions and status for each client credential. Issue short-lived scoped tokens (or store hashed API keys) per service, and validate presented credentials by looking up the credential record (scope enforcement), not by comparing against one shared secret.
- gateway/platforms/api_server.py:666-703 — Currently loads one shared secret into self._api_key from API_SERVER_KEY / config.extra.key.
- gateway/platforms/api_server.py:843-867 — _check_auth compares Bearer token to the single shared secret; no scope or revocation model is present.
high
Add revocation/rotation support for inbound credentials: token expiry (for tokens) and server-side invalidation (e.g., credential status/blacklist) and ensure logout/revocation paths invalidate credentials immediately.
- gateway/platforms/api_server.py:843-867 — No expiry, rotation, or revocation is consulted during auth; the only check is equality to the configured shared secret.
med
Implement least-privilege scope checks at the auth boundary: parse/resolve the authenticated machine credential into allowed operations for that client, and enforce deny-by-default for API routes/handlers.
- gateway/platforms/api_server.py:843-867 — This is the chokepoint for auth enforcement; it should be extended from 'valid key?' to 'which scoped client and which permissions?' to prevent over-privilege.
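
The three remediations above (per-client records, revocation, scope checks) can be illustrated together. This is a sketch under stated assumptions: `ApiKeyStore`, `ApiKeyRecord`, the `client_id.secret` token layout, and the scope strings are hypothetical and are not part of the existing `_check_auth` code.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional


def _hash(secret: str) -> str:
    # Keys are high-entropy random strings, so a plain SHA-256 is adequate.
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


@dataclass
class ApiKeyRecord:
    client_id: str
    key_hash: str            # only the hash is stored, never the key
    scopes: FrozenSet[str]
    active: bool = True


class ApiKeyStore:
    """Hypothetical per-client credential store (in-memory for the sketch)."""

    def __init__(self) -> None:
        self._by_id: Dict[str, ApiKeyRecord] = {}

    def issue(self, client_id: str, scopes: Iterable[str]) -> str:
        # client_id must not contain "." because it is used as the separator.
        secret = secrets.token_urlsafe(32)
        self._by_id[client_id] = ApiKeyRecord(client_id, _hash(secret), frozenset(scopes))
        return f"{client_id}.{secret}"

    def revoke(self, client_id: str) -> None:
        record = self._by_id.get(client_id)
        if record is not None:
            record.active = False

    def authorize(self, bearer: str, required_scope: str) -> Optional[str]:
        """Return the client id if the key is valid, active, and in scope."""
        client_id, sep, secret = bearer.partition(".")
        record = self._by_id.get(client_id)
        if not sep or record is None or not record.active:
            return None
        if not hmac.compare_digest(_hash(secret), record.key_hash):
            return None
        # Deny by default: the scope must be explicitly granted.
        return client_id if required_scope in record.scopes else None
```

Revocation takes effect on the next request because every call consults the record rather than comparing against one shared secret.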

IP allowlists / network constraints 100%

An IP allowlist / CIDR-based network constraint exists for the Microsoft Graph webhook adapter. The implementation parses `extra.allowed_source_cidrs`, fails closed at startup when the bind is network-accessible but no CIDRs are configured, and enforces the source-IP allowlist before processing in the health, validation, and notification handlers (returning 403 on mismatch). No comparable per-tenant IP allowlist middleware/guard was found elsewhere in the codebase.

med
If the gateway also exposes other external HTTP entrypoints beyond the MS Graph webhook, identify them and apply the same pattern (per-endpoint CIDR allowlist, fail-closed startup when exposed, and checks at the top of each handler) to keep network constraints consistent.
- gateway/platforms/msgraph_webhook.py:78-168 — Demonstrates the expected pattern (CIDR parsing + fail-closed startup + handler-level checks) that can be replicated for other entrypoints if applicable.
low
Audit whether reverse-proxy deployments are used for the MS Graph webhook and, if so, ensure `request.remote` reflects the true source IP (e.g., forwarded headers / aiohttp trust settings).
- gateway/platforms/msgraph_webhook.py:303-356 — `_source_ip_allowed` uses `request.remote` directly; correctness depends on how the server/proxy surfaces the originating IP.
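
The pattern above (CIDR parsing, fail-closed behavior, and the reverse-proxy caveat) can be sketched with the standard `ipaddress` module. The function names and the single-trusted-proxy assumption are hypothetical; this is not the code in `msgraph_webhook.py`.

```python
import ipaddress
from typing import Iterable, List, Optional, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
Address = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def parse_cidrs(values: Iterable[str]) -> List[Network]:
    return [ipaddress.ip_network(v.strip(), strict=False) for v in values if v.strip()]


def _in_any(ip: Address, networks: Iterable[Network]) -> bool:
    return any(ip in net for net in networks)


def effective_client_ip(peer_ip: str, forwarded_for: Optional[str],
                        trusted_proxies: List[Network]) -> Address:
    """Use X-Forwarded-For only when the direct peer is a trusted proxy."""
    peer = ipaddress.ip_address(peer_ip)
    if forwarded_for and _in_any(peer, trusted_proxies):
        # Assumes one trusted proxy hop: the right-most entry is the address
        # that proxy actually saw; earlier entries are client-controlled.
        candidate = forwarded_for.split(",")[-1].strip()
        try:
            return ipaddress.ip_address(candidate)
        except ValueError:
            return peer
    return peer


def source_allowed(peer_ip: str, forwarded_for: Optional[str],
                   allowed: List[Network], trusted_proxies: List[Network]) -> bool:
    if not allowed:
        return False  # fail closed when no allowlist is configured
    try:
        client = effective_client_ip(peer_ip, forwarded_for, trusted_proxies)
    except ValueError:
        return False
    return _in_any(client, allowed)
```

Ignoring the forwarded header unless the peer is a known proxy is what prevents a direct caller from spoofing an allowed address.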

Not applicable to this codebase: Directory provisioning (SCIM), AuthN before AuthZ at the boundary.

Compliance Code Patterns

Envelope encryption, enforced TLS, validated inputs, and zero secrets anywhere in the full git history.

35% 11/11 scored

Encryption in transit 0%

0/3 expected sites not present
Encryption at rest 0%

0/2 expected sites
Centralized key management 0%

0/1 expected sites not present
Secrets management 89%

3/3 expected sites
No secrets in git history 0%

0/1 expected sites not present
Input validation at boundaries 0%

0/1 expected sites
Injection-safe data access 100%

1/1 expected sites
Data classification & PII handling 83%

2/2 expected sites
Access logging on protected routes 0%

0/2 expected sites
Retention & secure deletion 11%

1/3 expected sites
Secure defaults / hardening 100%

4/4 expected sites

Encryption in transit 0%

There is no evidence that encryption in transit is enforced across every hop. The dashboard’s internal WebSocket URLs are hardcoded to `ws://` (plaintext) and the cross-container health probe is HTTP-based, so plaintext transport is possible and nothing in the reviewed code paths redirects to or requires TLS/HSTS.

high
Stop hardcoding `ws://` in internal WebSocket URL construction: emit `wss://` in production, deriving the scheme from trusted request/proxy metadata (e.g., `X-Forwarded-Proto`) or a server config flag, so that both `/api/ws` and `/api/pub` use TLS.
- hermes_cli/web_server.py:6500-7200 — Returns `ws://{netloc}/api/ws?...` and `ws://{netloc}/api/pub?...` (plaintext WebSocket).
high
Enforce TLS/HTTPS for inter-service HTTP calls. For `_probe_gateway_health`, require `https://` (or add a config that defaults to HTTPS and rejects plain HTTP in production), and optionally validate certificates.
- hermes_cli/web_server.py:620-720 — Health probe is documented/implemented as an HTTP fetch (`urllib.request.urlopen`) with `http://` examples; no TLS enforcement is shown.
med
Add edge transport hardening in the FastAPI server: redirect HTTP→HTTPS and set transport security headers (HSTS, secure redirects) for all dashboard routes (including websocket upgrade handling via correct proxy configuration).
- hermes_cli/web_server.py:1-260 — Review of server setup shows CORS/auth/middleware, but no TLS forcing/redirect/HSTS enforcement was found in the inspected boot/middleware portions.
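
Deriving the WebSocket scheme instead of hardcoding `ws://` could look like the sketch below. `websocket_url` and its parameters are hypothetical, and the decision about when a forwarded header is trustworthy is left to the caller.

```python
from typing import Optional


def websocket_url(netloc: str, path: str, request_scheme: str, *,
                  forwarded_proto: Optional[str] = None,
                  trust_forwarded: bool = False,
                  require_tls: bool = True) -> str:
    """Build a ws:// or wss:// URL from the scheme the client actually used.

    forwarded_proto is honored only when trust_forwarded is True, i.e. when
    the caller has established that the request came through a trusted proxy.
    """
    scheme_in = forwarded_proto if (trust_forwarded and forwarded_proto) else request_scheme
    secure = scheme_in.strip().lower() in ("https", "wss")
    if require_tls and not secure:
        raise ValueError("refusing to hand out a plaintext WebSocket URL")
    return f"{'wss' if secure else 'ws'}://{netloc}{path}"
```

A local-only deployment could pass `require_tls=False` explicitly, which keeps plaintext an opt-in rather than the default.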

Encryption at rest 0%

Encryption at rest is implemented at least for Matrix E2EE crypto-state persistence: when E2EE is enabled, the adapter uses a SQLite-backed mautrix crypto store (`PgCryptoStore`) located at `.../matrix/store/crypto.db`. However, other sensitive local persistence in the Weixin adapter (account `token` and `context_token` caches) is written to disk as JSON without encryption at rest, so the primitive is not consistently applied across all sensitive-at-rest data surfaces.

high
Encrypt sensitive Weixin on-disk data (`save_weixin_account` token JSON and `ContextTokenStore` context-token JSON) using field-level encryption (or an encrypted storage layer) and ensure encryption covers backups/snapshots as well.
- gateway/platforms/weixin.py:243-274 — Account `token` is persisted to disk via `atomic_json_write` (no encryption step shown); backups/snapshots would contain plaintext token.
- gateway/platforms/weixin.py:280-344 — `context_token` values are serialized into a JSON payload and written to disk using `atomic_json_write` (no encryption step shown).
med
Audit other local persistence paths for sensitive data (tokens, session keys, private keys, encrypted media parameters stored locally) and ensure they either use the same encrypted-at-rest mechanism as Matrix crypto.db or adopt an equivalent encrypted storage pattern.
- gateway/platforms/matrix.py:676-744 — Provides the reference implementation for at-rest encryption of sensitive crypto-state in this codebase (use this as the baseline pattern when extending to other sensitive stores).

Centralized key management 0%

No centralized key-management implementation (KMS/Vault/KeyVault/Secrets Manager-style) with rotation and revocation logic was found. Instead, the codebase appears to fetch keys ad hoc from environment variables (e.g., LINEAR_API_KEY) rather than from a managed, centrally administered key store.

high
Introduce a centralized managed key store for any encryption/auth keys the system uses (e.g., cloud KMS/Vault/KeyVault/Secrets Manager) and replace direct env-based key retrieval with runtime fetches from the managed store; ensure rotation policies and an emergency revocation mechanism are implemented and exercised.
- skills/productivity/linear/scripts/linear_api.py:50-63 — Current behavior reads LINEAR_API_KEY directly from process environment in _get_key; this should be replaced by managed key-store retrieval with rotation/revocation.

Secrets management 89%

This codebase implements runtime secrets management by loading credentials from Bitwarden into environment variables. `hermes_cli.env_loader.load_hermes_dotenv()` invokes `_apply_external_secret_sources()` which reads a `secrets:` section from `~/.hermes/config.yaml` and calls `agent.secret_sources.bitwarden.apply_bitwarden_secrets(...)` to fetch secrets via the `bws` CLI. The implementation is centralized and is applied before credential-dependent runtime logic reads `os.environ`. Disk caching exists and is permission-restricted (0600) but is plaintext-equivalent.

high
Consider encrypting the disk cache (`<hermes_home>/cache/bws_cache.json`) or avoiding persistence of secret values altogether; current design stores fetched secret values in a plaintext-equivalent JSON file (even though it uses `chmod 0600`).
- agent/secret_sources/bitwarden.py:1-120 — Documents disk cache storing secret values: `bws_cache.json` “holds only the secret VALUES, never the access token… kept out of the .env file” and is written with atomic rename.
- agent/secret_sources/bitwarden.py:410-520 — Disk cache is written with `os.chmod(tmp, 0o600)` but the payload contains the fetched secret values.
med
Run/maintain a stricter policy around committed secret artifacts. A full-history gitleaks scan produced many “REDACTED” hits; confirm none are real credentials and keep test/fixture values clearly marked and rotated/invalidated.
- hermes_cli/auth.py:60-110 — Example gitleaks hit area includes OAuth client IDs/related constants; ensure none are true credentials and prefer loading client secrets/tokens only via the env_loader/secret source path.
low
Expand secret-source support beyond Bitwarden (Vault/Secrets Manager), but keep the same enforcement pattern (runtime injection into `os.environ` and no plaintext literals).
- hermes_cli/env_loader.py:1-120 — The env_loader includes `_SECRET_SOURCES` and labels for multiple secret sources (future-proofing), indicating an intended pattern extension.

No secrets in git history 0%

This primitive is NOT satisfied. A full-history gitleaks scan returned many matches for committed secret/credential material, and the current codebase contains hardcoded OAuth/credential-like values in files such as `hermes_cli/auth.py` and `agent/anthropic_adapter.py` (plus test literals). Therefore, the codebase does not meet “No secrets in git history”.

high
Rotate/replace any credentials/keys/client secrets that were committed (treat all matches as compromised), then remove them from history (e.g., git filter-repo/BFG) and force-push. Ensure CI includes a full-history secret scan so future commits are blocked.
- hermes_cli/auth.py:60-110 — Hardcoded OAuth credential-like identifiers present in the repo; consistent with gitleaks committed-secret matches.
- agent/anthropic_adapter.py:1100-1205 — Hardcoded OAuth client ID and related OAuth configuration present; indicates committed credential-like material.
med
Update tests and documentation to use non-secret placeholders (explicitly marked as fake) and/or generate ephemeral test secrets at runtime. Avoid committing even realistic-looking tokens; if one is absolutely needed, ensure it is guaranteed non-functional and covered by a documented secret-scan exception rule.
- tests/hermes_cli/test_web_oauth_dispatch.py:130-190 — Test contains a literal `access_token` value; replace with safe dummy values or runtime generation.

Input validation at boundaries 0%

The codebase does apply input validation at boundaries in meaningful places (FastAPI/Pydantic validation for request bodies and explicit middleware validation of Host headers). However, at least one sensitive boundary (DELETE /api/webhooks/{name}) accepts a path parameter as a raw string with only normalization (strip/lower) and without a schema/constraints layer, leaving it as an unmatched should-be site.

high
Add explicit schema/constraint validation for webhook path parameter `name` on DELETE /api/webhooks/{name} (e.g., restrict length and allowed characters, and reject empty/invalid values with 400). Consider using a Pydantic model or FastAPI parameter constraints instead of only `(name or '').strip().lower()`.
- hermes_cli/web_server.py:5200-5300 — DELETE /api/webhooks/{name} handler normalizes the string but does not enforce constraints via a schema.
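
A constraint check for the `name` path parameter might look like the following. The allowed character set and the 64-character limit are assumptions made for the sketch; the real naming rules for webhooks are not stated in the audited slice. The same pattern could also be expressed as a framework-level path-parameter constraint.

```python
import re
from typing import Optional

# Assumed rule: lowercase letters, digits, "_" and "-", starting with a
# letter or digit, at most 64 characters in total.
_WEBHOOK_NAME_RE = re.compile(r"[a-z0-9][a-z0-9_-]{0,63}")


def validate_webhook_name(raw: Optional[str]) -> str:
    """Normalize as the handler does today, then enforce the constraint."""
    name = (raw or "").strip().lower()
    if not _WEBHOOK_NAME_RE.fullmatch(name):
        # The HTTP layer would translate this into a 400 response.
        raise ValueError("invalid webhook name")
    return name
```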

Injection-safe data access 100%

Injection-safe data access (parameterized/bound queries) is present and correctly applied in the Kanban dashboard API layer. The DB operations observed in request-driven handlers use `?` placeholders with bound parameters, not string concatenation or f-string interpolation of untrusted input into SQL.

high
Do a targeted sweep of all DB access functions that consume HTTP inputs (e.g., anything taking `task_id`, board slugs, user/session identifiers) to confirm they always use placeholders. Pay special attention to any SQL assembled via f-strings for dynamic identifiers; ensure only trusted/static identifiers are interpolated and never untrusted values.
- plugins/kanban/dashboard/plugin_api.py:900-1020 — This slice demonstrates the desired pattern for safety in this layer; other handler paths should be checked similarly for placeholder usage.
med
If any layer uses dynamic SQL construction for identifiers (table/column names), ensure it is restricted to whitelisted/validated values and not derived directly from request parameters.
- plugins/kanban/dashboard/plugin_api.py:900-1020 — Current observed usage is placeholder-based; this remediation is to generalize the guarantee across all dynamic-SQL cases.
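
The distinction between bound values and interpolated identifiers can be shown with `sqlite3`. The `tasks` table, its columns, and `list_tasks` are hypothetical and only illustrate the pattern; they are not taken from the Kanban plugin.

```python
import sqlite3
from typing import List, Tuple

# Identifiers cannot be bound as parameters, so they are checked against a
# fixed allowlist before being interpolated.
ALLOWED_SORT_COLUMNS = {"created_at", "title", "status"}


def list_tasks(conn: sqlite3.Connection, board_slug: str,
               sort_by: str = "created_at") -> List[Tuple[int, str]]:
    if sort_by not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort_by!r}")
    # board_slug is untrusted input and is always passed as a bound parameter.
    sql = f"SELECT id, title FROM tasks WHERE board_slug = ? ORDER BY {sort_by}, id"
    return conn.execute(sql, (board_slug,)).fetchall()
```

With this shape, an injection attempt in `board_slug` is treated as a literal string that matches no rows, and an unexpected `sort_by` is rejected before any SQL is built.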

Data classification & PII handling 83%

This codebase does have data classification/PII-handling controls: it includes targeted redaction logic for (1) public debug log uploads and (2) Telegram-bound gateway responses/status messages. However, the implementation appears specialized (pattern-based, best-effort) rather than a comprehensive, centrally-enforced sensitivity taxonomy across all logging/export paths.

high
Audit and enforce PII/sensitive-field masking centrally: identify every place that logs/serializes user content or credentials (especially JSON dumps via stringify/toJSON, debug snapshots, and any “dump/export” utilities) and ensure they route through a single sensitivity-aware redaction layer with a field/tag taxonomy (PII vs secrets vs free-form content).
- gateway/run.py:220-360 — Current redaction enforcement is clearly applied only on Telegram-bound user-facing text for provider failures/status; this suggests other logging/export paths may not uniformly apply the same control.
med
Extend classification from pattern-based masking to structured tagging: introduce a small schema for sensitive fields (e.g., email/phone/token/access-token/query-token/media identifiers) and mask those by key across serialization and log sinks.
- gateway/run.py:220-360 — _redact_gateway_user_facing_secrets is pattern-based; it’s effective for known secret shapes but not a general “PII field” guarantee.
low
Add/strengthen tests that assert “no PII in logs” for each sink: include property-based or fixture-based tests ensuring redact/mask functions are invoked on every relevant path (not just the Telegram error/status path and debug-share upload).
- hermes_cli/debug.py:1-220 — The debug-share behavior is documented and presumably tested, but additional sink coverage is needed beyond this boundary.
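
Key-based masking, as opposed to pattern-based masking, can be sketched as a recursive walk over a payload. `SENSITIVE_KEYS` and `redact` are hypothetical; the key list is an example taxonomy, not the project's.

```python
from typing import Any

# Example taxonomy of keys whose values are always masked.
SENSITIVE_KEYS = {
    "email", "phone", "token", "access_token",
    "refresh_token", "context_token", "password",
}
MASK = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive keys masked at any depth."""
    if isinstance(value, dict):
        return {
            k: MASK if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact(v) for v in value]
    return value
```

Routing every log and export sink through one function like this complements the existing pattern-based redaction, which still catches secrets embedded in free-form text.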

Access logging on protected routes 0%

The codebase includes a dedicated dashboard-auth audit logger (`audit_log` writing `~/.hermes/logs/dashboard-auth.log`) and it records several authentication lifecycle events (login/refresh/verification failures). However, it does not appear to implement “access logging on protected routes” for every authenticated/sensitive action: the central auth-gate middleware allows verified sessions to proceed (`return await call_next(request)`) without emitting a per-request access log that includes a unique actor identifier for the action.

high
In `hermes_cli/dashboard_auth/middleware.py` (inside `gated_auth_middleware`), add a per-request audit/access log emitted after authentication succeeds (including after refresh succeeds) and before `call_next(request)`. The log entry should include a unique actor identifier (e.g., the authenticated `user_id` from `request.state.session`) and enough request metadata to support attributable auditing.
- hermes_cli/dashboard_auth/middleware.py:300-339 — After `request.state.session = session`, the middleware immediately returns `await call_next(request)` with no access/audit log emitted for the protected request/action.
med
Ensure the per-request access log is applied uniformly across all authenticated/sensitive endpoints (not just refresh/verify failures). If there are additional auth-gates elsewhere, confirm they all delegate to the same middleware logging path.
- hermes_cli/dashboard_auth/middleware.py:1-239 — The middleware is the primary enforcement location for protected dashboard routes; it contains audit logging for auth lifecycle events, but not for every passed-through authenticated request.
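
A per-request access record of the kind described could be built as below and emitted just before the request is passed on. The field names and helper functions are hypothetical; how the record is written (for example through the existing audit writer) is left open.

```python
import json
import time
import uuid
from datetime import datetime, timezone
from typing import Optional


def build_access_record(user_id: str, method: str, path: str, client_ip: str,
                        now: Optional[float] = None,
                        request_id: Optional[str] = None) -> dict:
    if not user_id:
        raise ValueError("access log requires an authenticated actor")
    now = time.time() if now is None else now
    return {
        "ts": datetime.fromtimestamp(now, tz=timezone.utc).isoformat(),
        "event": "access",
        "actor": user_id,
        "method": method.upper(),
        # Drop the query string so tickets or tokens in it are never logged.
        "path": path.split("?", 1)[0],
        "client_ip": client_ip,
        "request_id": request_id or uuid.uuid4().hex,
    }


def format_access_line(record: dict) -> str:
    # One compact JSON object per line, matching a JSON-lines log.
    return json.dumps(record, sort_keys=True, separators=(",", ":"))
```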

Retention & secure deletion 11%

The codebase does include retention-style enforcement for a few credential/auxiliary data types (WS ticket TTL + immediate removal on consume; debug-share paste auto-expiry with a sweep that deletes remote pastes) and a request-time deletion endpoint for stored responses. However, the audited evidence does not show a comprehensive, system-wide retention window + secure deletion (including cascade to derived data/backups and cryptographic wipe) for persisted conversation/PII-like content. Therefore the primitive is only partially implemented.

high
Add and verify an enforced retention policy for persisted chat/session/response content (not just deletion endpoints): implement scheduled purge/TTL jobs for response/session storage, ensure DELETE cascades to all related/derived records, and ensure purges reach backups/exports. The audited slices show only DELETE /v1/responses/{response_id}, with no retention window and no secure disposal across backups.
- gateway/platforms/api_server.py:3184-3224 — DELETE /v1/responses/{response_id} deletes from _response_store, but no retention window/purge/secure disposal across derived data/backups is evidenced in the audited slices.
med
For debug-share/log upload flows, document and enforce local secure deletion (if any local intermediate files are written) and verify that any derived/local copies are removed or cryptographically wiped; currently only remote paste auto-deletion is evidenced.
- hermes_cli/debug.py:33-45 — Auto-delete retention window is defined for remote pastes.
- hermes_cli/debug.py:122-192 — Deletion sweep attempts remote delete of pastes; secure local disposal is not evidenced.
low
Extend test coverage for retention/purge correctness: add integration tests proving that expired items are removed by background sweeps and that delete-on-request removes all related artifacts (not only the primary record).
- hermes_cli/debug.py:122-192 — Best-effort delete/retry and pending.json update logic should be covered by deterministic tests for expiry.
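
A retention window with cascading deletes can be sketched with an in-memory store. `RetentionStore` is hypothetical and stands in for whatever backs `_response_store`; backups and cryptographic wiping are out of scope for the sketch.

```python
import time
from typing import Any, Dict, Optional, Set, Tuple


class RetentionStore:
    """Hypothetical store with a retention window and cascading delete."""

    def __init__(self, retention_seconds: float):
        self._retention = retention_seconds
        self._records: Dict[str, Tuple[float, Any]] = {}  # id -> (stored_at, value)
        self._children: Dict[str, Set[str]] = {}          # parent id -> derived ids

    def put(self, record_id: str, value: Any, parent_id: Optional[str] = None,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._records[record_id] = (now, value)
        if parent_id is not None:
            self._children.setdefault(parent_id, set()).add(record_id)

    def delete(self, record_id: str) -> int:
        """Delete a record and everything derived from it; return the count."""
        removed = 1 if self._records.pop(record_id, None) is not None else 0
        for child in self._children.pop(record_id, set()):
            removed += self.delete(child)
        return removed

    def purge_expired(self, now: Optional[float] = None) -> int:
        now = time.time() if now is None else now
        expired = [rid for rid, (ts, _) in self._records.items()
                   if now - ts >= self._retention]
        return sum(self.delete(rid) for rid in expired)

    def __contains__(self, record_id: str) -> bool:
        return record_id in self._records
```

A scheduled job calling `purge_expired` is what turns a deletion endpoint into an enforced retention policy, and the injectable `now` makes expiry deterministic to test.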

Secure defaults / hardening 100%

This codebase applies Secure defaults/hardening primarily in its FastAPI dashboard auth layer. Session cookies are hardened (HttpOnly, SameSite=Lax, HTTPS-only Secure, constrained Path, and __Host-/__Secure- naming). The dashboard auth middleware enforces authentication for all non-public routes and clears cookies on invalid/expired sessions to force clean re-auth. However, there is no evidence in the inspected server bootstrap/web server file of security-header middleware (e.g., CSP/X-Frame-Options) being wired as a hardened default.

high
Add a production security-headers middleware for the FastAPI dashboard (e.g., CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy) and ensure it is enabled in all non-local/prod paths.
- hermes_cli/web_server.py:1-120 — Central dashboard server wiring is present (FastAPI app, CORSMiddleware). Security-header hardening middleware wiring (CSP/other headers) was not observed in this inspected region, so this is a likely un-enforced should-be site.
med
Confirm production debug/verbose error behavior is disabled for the web server (e.g., no stack traces returned to clients, and any debug endpoints are gated off in non-local deployments). If it exists elsewhere, document/ensure consistent gating on every web entrypoint.
- hermes_cli/dashboard_auth/middleware.py:1-112 — The middleware focuses on auth gating and structured 401/redirect responses; additional production-safe error behavior (stack-trace suppression) was not verified via the inspected hardening surfaces.
low
If the compliance target requires inactivity-based session termination (not just token expiry), add/confirm an inactivity watchdog server-side for the dashboard auth session cookies/tokens and test it across refresh flows.
- hermes_cli/dashboard_auth/middleware.py:240-320 — Token expiry/invalid-state handling is enforced via cookie clearing and refresh; inactivity-based termination was not confirmed in the inspected code.

Audit, Governance, Residency

An append-only audit_events table, a queryable audit API, and per-region infrastructure keyed on each tenant’s region.

14% 7/10 scored

Dedicated audit event store 89%

3/3 expected sites
Append-only / tamper-evidence 0%

0/3 expected sites
Comprehensive event coverage 11%

1/3 expected sites
Queryable, provable audit access 0%

0/2 expected sites not present
Audit retention & separation of duties 0%

0/1 expected sites not present
Data-subject rights (export & erase) 0%

0/4 expected sites
Sub-processor / data-flow transparency 0%

0/4 expected sites not present

Dedicated audit event store 89%

A dedicated audit event store exists for dashboard-auth events (`hermes_cli/dashboard_auth/audit.py`) and is written as JSON lines to `~/.hermes/logs/dashboard-auth.log` (or `$HERMES_HOME/logs/dashboard-auth.log`). The code emits structured audit events for sensitive auth decisions (login start/failure; session verify failures; refresh success). However, the writer guarantees only `ts` and `event` plus whatever fields the caller passes, and nothing in this code enforces a standardized shape (actor, tenant, resource type and id) across all events, so consistency with the full expected audit schema is only partially demonstrated.

high
Define and enforce a single canonical audit event schema (actor, tenant, action, resource type+id, context, timestamp) and update all audit call sites to populate it consistently. Evidence: audit store accepts arbitrary `fields` but does not force required keys.
- hermes_cli/dashboard_auth/audit.py:1-88 — Central audit writer takes arbitrary `**fields` and only guarantees `ts` + `event`; there is no compile-time/runtime enforcement that required keys (actor/tenant/resource) are always present.
med
Add tenant scoping to the emitted audit records (e.g., include `tenant_id`/`org_id` from session/provider context where applicable) and add automated tests that assert the presence of required schema fields for each event type.
- hermes_cli/dashboard_auth/audit.py:1-88 — The writer redacts certain token-like fields and serializes arbitrary fields; missing tenant scoping would be silent without schema-level checks.
low
Harden integrity/tamper-evidence guarantees for the audit store (e.g., hash-chaining and verification on read/export) if required by your governance standard.
- hermes_cli/dashboard_auth/audit.py:1-88 — The implementation is append-only but does not show integrity validation (e.g., hash chain/signatures).
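
Runtime enforcement of a canonical schema could be as small as the sketch below. `REQUIRED_FIELDS` and `build_audit_event` are hypothetical and would sit in front of the existing writer, which currently accepts arbitrary `**fields`.

```python
from datetime import datetime, timezone
from typing import Any, Optional

# Assumed canonical schema, following the fields named in the remediation.
REQUIRED_FIELDS = ("actor", "tenant", "action", "resource_type", "resource_id")


def build_audit_event(event: str, now: Optional[datetime] = None,
                      **fields: Any) -> dict:
    """Reject events that omit any required key instead of writing them."""
    missing = [k for k in REQUIRED_FIELDS if fields.get(k) in (None, "")]
    if missing:
        raise ValueError(
            f"audit event {event!r} missing required fields: {', '.join(missing)}"
        )
    ts = (now or datetime.now(timezone.utc)).isoformat()
    # ts and event are set last so callers cannot override them.
    return {**fields, "ts": ts, "event": event}
```

Failing loudly at the call site makes a missing tenant or actor a test failure rather than a silent gap in the log.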

Append-only / tamper-evidence 0%

The codebase has an audit log writer for dashboard-auth events that appends JSON lines to `$HERMES_HOME/logs/dashboard-auth.log`. This satisfies a basic “append” behavior, but it does not implement tamper-evidence (no hash chain/signing/integrity verification or immutability enforcement). As a result, while an audit trail exists, it is not provable as tamper-evident evidence.

high
Upgrade `hermes_cli/dashboard_auth/audit.py:audit_log()` to produce tamper-evident records: implement a hash-chain (store prev-hash with each line, compute next hash from canonicalized entry), optionally sign records (HMAC/Ed25519) with a key whose write access is restricted, and add a verification function used on startup or on demand.
- hermes_cli/dashboard_auth/audit.py:1-88 — Audit is currently written as plain append-only JSON lines without any integrity linkage/validation.
high
Restrict audit log mutation outside the writer: ensure log file permissions/ownership prevent arbitrary modification and ensure any rotation/deletion is either disallowed or itself audited and integrity-preserving (e.g., seal segments with a final signed root hash).
- hermes_cli/dashboard_auth/audit.py:1-88 — Writer tolerates failures and does not implement deletion/rotation protections or audit-log sealing.
med
Add automated tests that verify tamper-evidence: after writing N events, mutate one line on disk and assert verification fails for the chain from that point onward (and/or detect signature mismatch).
- tests/hermes_cli/test_dashboard_auth_audit.py:1-82 — Current tests validate JSON writing and redaction, but not integrity/tamper detection.
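
The hash chain and the tamper test described above can be sketched over a list of JSON lines. The record layout (`body`, `prev`, `hash`) and the function names are hypothetical, and the chain here is unkeyed: it detects edits only if the latest hash is also anchored somewhere an attacker cannot rewrite, which is why the remediation suggests signing as well.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # prev-hash used for the first record


def _entry_hash(prev_hash: str, body: dict) -> str:
    # Canonical serialization so the same body always hashes the same way.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_event(lines: List[str], body: dict) -> None:
    prev = json.loads(lines[-1])["hash"] if lines else GENESIS
    record = {"body": body, "prev": prev, "hash": _entry_hash(prev, body)}
    lines.append(json.dumps(record, sort_keys=True))


def verify_chain(lines: List[str]) -> int:
    """Return the index of the first invalid line, or -1 if the chain holds."""
    prev = GENESIS
    for i, line in enumerate(lines):
        try:
            record = json.loads(line)
            body, claimed_prev, claimed_hash = record["body"], record["prev"], record["hash"]
        except (ValueError, KeyError, TypeError):
            return i
        if claimed_prev != prev or claimed_hash != _entry_hash(prev, body):
            return i
        prev = claimed_hash
    return -1
```

In a file-backed version, `lines` would be the contents of the audit log and `verify_chain` would run at startup or on demand.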

Comprehensive event coverage 11%

Event coverage is partial. Dashboard-auth security events (login/refresh/session verification) are written to a dedicated structured JSON-lines audit log with UTC timestamps and redaction of token-like fields, and the API server emits structured per-run/tool/message SSE events with timestamps and correlation IDs. However, user data export is not audit-covered in a provable, structured way (the session export path only raises client-side notifications), and there is no evidence that permission/role changes or sensitive CRUD/export operations across the API layer write to the dedicated audit store.

high
Add server-side audit emission for data exports (at the API handler that authorizes and returns export/download content, or at the backend export endpoint), producing a structured audit record (actor/session/tenant, action=export, resource, timestamp) alongside the existing dashboard-auth audit log / evidence store.
- apps/desktop/src/lib/session-export.ts:1-57 — Export occurs client-side via download blob creation and only calls notify/notifyError—no structured audit event is emitted.
high
Extend comprehensive audit coverage from dashboard-auth only to all security-relevant sensitive operations (permission/role changes, CRUD on persisted records like sessions/responses, and approvals) by wiring a single authoritative evidence emitter into those backend handlers instead of relying on SSE-only lifecycles.
- hermes_cli/dashboard_auth/audit.py:1-88 — Dedicated structured audit log exists, but it is specifically scoped to dashboard-auth events.
- gateway/platforms/api_server.py:3450-3800 — Run/tool lifecycle is structured via SSE events, but this is distinct from the dedicated audit log writer and is not shown to cover permission/role changes or exports in persistent audit evidence.
med
Ensure the sensitive-action timeline is queryable for auditors/customers by providing a tenant-scoped (or operator-scoped) audit-read/export interface over the structured audit store, rather than only writing to a local file without an auditable read/export endpoint.
- hermes_cli/dashboard_auth/audit.py:1-88 — Audit events are written to a local log file path ($HERMES_HOME/logs/dashboard-auth.log) but this file-level evidence is not shown to have a read/export API in the audited code paths.

Queryable, provable audit access 0%

The codebase writes some audit-like events to a local JSONL file (dashboard-auth audit), and builds request metadata for security/audit warnings. However, there is no tenant-scoped, paginated audit-read API and no exportable, independently-verifiable audit evidence trail accessible to customers/auditors—so this primitive is absent.

high
Implement a dedicated structured audit event store (separate from application logs) that persists tenant-scoped events with actor, resource identifiers, timestamps, policy/state fields, and cryptographic/tamper-evidence metadata (e.g., hash chaining or signed records).
- hermes_cli/dashboard_auth/audit.py:1-88 — Current behavior is file append of JSON-lines; it is not a structured, tenant-scoped evidence store with independent verifiability and export.
high
Add an external audit-read API surface with tenant scope + pagination (e.g., GET /api/tenants/{tenant_id}/audit-events?page=...&page_size=...) and an export endpoint that outputs verifiable evidence (e.g., signed JSON bundle or CSV + integrity proof).
- gateway/platforms/api_server.py:730-820 — Only request audit context/sanitization is present; there is no tenant-scoped audit listing/export handler exposed for customers/auditors.
med
Wire audit emission so that all security-relevant actions (auth/login outcomes, session verification, revocations, permission/policy changes, exports) produce structured audit events persisted to the evidence store, not only warnings/log lines.
- hermes_cli/dashboard_auth/audit.py:1-88 — The module defines event types and a file-write function; verify which sensitive actions are covered and ensure everything required for provable audit access is persisted to the new evidence store with immutable semantics.
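
The core of a tenant-scoped, paginated audit read can be sketched independently of any HTTP framework. `query_audit_events`, the `tenant` field, and the offset-style cursor are hypothetical; a real endpoint would query a store instead of filtering a list.

```python
from typing import Any, Dict, List, Optional

MAX_PAGE_SIZE = 500  # assumed upper bound


def query_audit_events(events: List[Dict[str, Any]], tenant_id: str,
                       cursor: int = 0, page_size: int = 50) -> Dict[str, Any]:
    """Return one page of events for a single tenant plus the next cursor."""
    if cursor < 0 or not (1 <= page_size <= MAX_PAGE_SIZE):
        raise ValueError("invalid pagination parameters")
    # Tenant scoping happens before pagination so no other tenant's events
    # can appear on any page.
    scoped = [e for e in events if e.get("tenant") == tenant_id]
    page = scoped[cursor:cursor + page_size]
    end = cursor + len(page)
    return {"items": page, "next_cursor": end if end < len(scoped) else None}
```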

Audit retention & separation of duties 0%

The codebase writes a structured dashboard-auth audit log to a local JSON-lines file, but there is no implemented retention policy enforcement (no TTL/retention config or purge job) and no verifiable separation-of-duties controls around changing retention. As a result, the primitive (enforced audit retention + insider-proof separation of duties) is not provably satisfied.

high
Add an explicit, enforced retention policy for the dashboard-auth audit log (e.g., configurable TTL window meeting compliance requirements) plus an automated purge/archive job that runs on a schedule and is covered by tests verifying it cannot be bypassed by normal admin operations.
- hermes_cli/dashboard_auth/audit.py:38-88 — Current audit writer has no retention enforcement or purge integration; retention must be added and wired.
high
Implement separation of duties for retention changes: ensure only the logging/audit subsystem account (not general admins) can modify retention configuration, and that any retention-change operation itself emits an audit event to the same immutable evidence trail.
- hermes_cli/dashboard_auth/audit.py:38-88 — Audit logging exists but there is no governance layer described/implemented for who can change retention or verify changes are auditable.
med
If regulatory needs require tamper evidence, make the audit log rotation/purge pipeline append-only and consider integrity verification (e.g., hash-chain per entry or periodically signed log segments) and audit the deletion/archival actions.
- hermes_cli/dashboard_auth/audit.py:56-82 — Audit entries are appended via open(..., 'a'), but there is no integrity/tamper-evidence or audited log-deletion/archival pathway.
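
The hash-chain option mentioned above can be pictured with the sketch below. It is not the project's audit writer: `append_event`, `verify_chain`, the `GENESIS` constant, and the `prev_hash`/`hash` field names are all hypothetical, and it assumes each entry is a JSON object written on its own line.

```python
import hashlib
import json
from pathlib import Path

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Hash the previous entry's hash together with a canonical encoding of the event.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(path: Path, event: dict) -> str:
    """Append one event, chaining it to the last entry already in the file."""
    prev_hash = GENESIS
    if path.exists():
        lines = path.read_text(encoding="utf-8").splitlines()
        if lines:
            prev_hash = json.loads(lines[-1])["hash"]
    entry = {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry["hash"]


def verify_chain(path: Path) -> bool:
    """Return True only if every entry links to its predecessor and matches its own hash."""
    prev_hash = GENESIS
    for line in path.read_text(encoding="utf-8").splitlines():
        entry = json.loads(line)
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Editing or removing any stored entry breaks verification from that point onward. Truncating entries from the tail is not detected unless the latest hash is also recorded somewhere outside the file, and a real writer would likely keep the last hash in memory rather than re-reading the log on every append.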

Data residency / region pinning N/A

The codebase contains logic for selecting a cloud provider region (e.g., AWS Bedrock client creation/caching), but it does not implement data residency / region pinning as a tenant-governed residency control. There is no provable mechanism tying a tenant’s region attribute to where tenant data and compute run (including region-keyed routing across per-region infrastructure).

high
If the product is intended to provide residency guarantees, introduce an explicit tenant-region field in the tenant/org data model and enforce it end-to-end: (1) region-keyed placement of persistent tenant data stores, (2) region-keyed compute/provider routing for all tenant work, and (3) region pinning for every secondary sink (backups, exports, analytics, and any outbound syncs).
- agent/bedrock_adapter.py:60-150 — Current `region` usage is limited to SDK client construction/caching for inference, which demonstrates provider region selection but not tenant residency pinning enforcement.

No cross-region leakage N/A

No cross-region leakage / data-sink residency enforcement is evident in this codebase. While the project has some generic “sync” and “export” functionality and does process/forward content (e.g., session exports, image routing), there is no implemented, auditable mechanism that pins all data sinks to a tenant region (or blocks sync/export/backup/analytics to out-of-region destinations).

high
Identify all data sinks involved in the product’s end-to-end data lifecycle (primary store, backups/snapshots, analytics/telemetry, derived stores, and any third-party sync/export). For each sink, implement region-keyed routing and add enforcement that blocks out-of-region destinations for tenant-scoped data.
- apps/desktop/src/lib/session-export.ts:1-57 — Current export is a local client download with no region-aware sink placement; if exports are also produced server-side/externally elsewhere, they must be region pinned and blocked.
med
Add a tenant region attribute to the authoritative data model (if it does not exist yet) and wire it into every sink configuration (backup/replication/analytics/export). Verify that derived/snapshot/analytics pipelines also reference the tenant region and cannot bypass it.
- agent/image_routing.py:1-220 — Representative example of application routing logic exists, but there is no residency/region enforcement in routing helpers—indicating the need to add region-aware configuration to the data pipeline layer.

Data-subject rights (export & erase) 0%

The codebase contains partial building blocks resembling export/erase: (1) a session export that downloads session messages as JSON, and (2) memory “forget”/delete operations for two memory providers (Supermemory and RetainDB). However, there is no evidence of a complete GDPR/CCPA-style data-subject rights (export & erase) primitive with (a) an identity-verified DSR export handler that returns all subject data, (b) a DSR erase handler that cascades to backups and derived/indexed stores, and (c) an auditable DSR job/event trail that an external auditor can independently verify.

high
Add explicit DSR backend handlers (export and erase) on the server (likely near gateway/platform endpoints) that: verify the requesting subject, enumerate all relevant data across every store used by the product (sessions, message history, memory providers, file stores, any caches/indices), and return a structured response tied to a DSR job id.
- gateway/platforms/api_server.py:1-260 — API layer exists for session operations, but DSR-specific export/erase endpoints and audit linkage are not evidenced in the reviewed code slices.
high
Implement an auditable DSR erase job that guarantees cascade behavior: delete/forget in each backing store plus retractions from derived/indexed representations and any local write-behind/queueing mechanisms; record immutable DSR audit evidence for start/end status and the exact resources targeted.
- plugins/memory/supermemory/__init__.py:600-707 — Current erase (“forget”) calls the external client but does not show DSR job audit evidence or cascade to backups/derived stores.
- plugins/memory/retaindb/__init__.py:240-349 — Current erase is a direct DELETE against RetainDB memory id, with no DSR audit/job tracking or cascade guarantee shown.
med
Extend/replace client-side session export with a backend-driven subject export that guarantees completeness (“all of a subject's data”), includes data provenance, and provides a stable export artifact (e.g., generated server file) tied to a DSR job record.
- apps/desktop/src/lib/session-export.ts:1-57 — Client-side export downloads session messages only; it is not shown as a subject-global DSR export with audit evidence.

Customer-controlled keys N/A

No implementation of “customer-controlled keys” (BYOK/per-tenant customer-managed encryption keys with customer-supplied import, scheduled rotation, and revocation/crypto-shred) was found in this codebase. The code contains credential configuration UI (global env var handling) and platform-specific crypto utilities, but not per-tenant key management suitable for customer-controlled encryption key governance.

high
Add a crypto-governance key-management module that supports per-tenant encryption key references, customer-supplied key import, scheduled rotation, and explicit revocation semantics (crypto-shred). Document the tenant key lifecycle and enforce it in all data-at-rest encryption/decryption paths.
- apps/desktop/src/app/settings/keys-settings.tsx:200-432 — Current “keys” editing is limited to provider credential env vars (set/reveal/clear), not encryption keys with rotation/revocation.
med
Expose customer-facing APIs or UI endpoints for key import, rotation, and revocation that are scoped to a tenant identifier, and ensure rotation is auditable and actually re-encrypts or schedules re-encryption of tenant data as required.
- apps/desktop/src/app/settings/keys-settings.tsx:200-432 — There is no tenant-scoped cryptographic key management surface here; only env-var credential management.
med
Integrate the per-tenant key reference into the encryption layer used for persisted data (and ensure all ciphertext-producing components use the tenant key, not a single global key).
- gateway/platforms/qqbot/crypto.py:1-46 — Crypto helpers exist but are integration-scoped; they do not demonstrate tenant-key selection or encryption-key governance for stored data.

Sub-processor / data-flow transparency 0%

No in-repo, versioned sub-processor / data-flow transparency inventory (or equivalent auditable mechanism) was found. While the codebase contains concrete outbound integrations to third parties (notably OpenAI and Anthropic), those sinks are not backed by a declared, auditable, versioned inventory that would let an auditor reconcile “data flows” with a DPA sub-processor list.

high
Add a versioned, in-repo sub-processor/data-flow inventory artifact (e.g., SUBPROCESSORS.md or dataflow-inventory.json) that explicitly lists each outbound provider, what data types are sent, and where it is used (module/function references). Ensure it is current and matches runtime sinks.
- plugins/google_meet/realtime/openai_client.py:1-120 — Direct OpenAI Realtime sink exists, but no matching declared inventory artifact is apparent near the sink.
high
Create cross-references from each outbound integration module to the inventory (e.g., comments or code-level constants that include an inventory entry id/version), so auditors can verify mapping from code → documented third party/data-flow.
- plugins/image_gen/openai/__init__.py:1-160 — OpenAI image generation provider integration should reference the inventory entry for OpenAI.
- agent/anthropic_adapter.py:1-120 — Anthropic adapter is a provider integration and should reference the inventory entry for Anthropic.
med
Add a lightweight consistency check in CI (static scan or unit test) that flags new third-party SDK imports / outbound endpoints without a corresponding inventory entry update.
- agent/anthropic_adapter.py:1-120 — Anthropic integration is code-driven; without CI enforcement, inventory drift is likely.
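
A CI consistency check of the kind described above could start as a static import scan. The sketch below is only an illustration: the `INVENTORY` mapping, its entry ids, and the `WATCHED_SDKS` list are hypothetical stand-ins for whatever inventory artifact the project adopts, and scanning imports will not catch outbound endpoints reached through a generic HTTP client.

```python
import ast

# Hypothetical inventory: top-level package name -> inventory entry id.
INVENTORY = {"openai": "SP-001", "anthropic": "SP-002"}
# Hypothetical watch list of third-party SDKs that count as outbound sinks.
WATCHED_SDKS = {"openai", "anthropic", "boto3"}


def imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by a module's source."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x".
            found.add(node.module.split(".")[0])
    return found


def undeclared_sinks(source: str) -> set[str]:
    """Watched SDKs that the module imports but the inventory does not list."""
    return (imported_packages(source) & WATCHED_SDKS) - set(INVENTORY)
```

In a unit test, each integration module's source would be passed to `undeclared_sinks` and the test would fail on any non-empty result.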

Not applicable to this codebase: Data residency / region pinning, No cross-region leakage, Customer-controlled keys.

T2 Execution Velocity

Performance Primitives

A caching layer, an async job runtime, connection pooling, and indexes on the columns that actually need them.

66% 11/11 scored

Redundant work in loops 0%

0/3 expected sites
Bounded interfaces 0%

0/8 expected sites not present
Memoization / caching 122%

4/3 expected sites
Resource reuse / pooling 133%

4/3 expected sites
Off-critical-path execution 100%

2/2 expected sites
Lookup data structures 0%

0/2 expected sites
Batching round-trips 100%

2/2 expected sites
Shared-state synchronization 167%

5/3 expected sites
Bounded concurrency / backpressure 0%

0/2 expected sites
Lazy / minimal computation 100%

2/2 expected sites
Streaming over buffering 0%

0/3 expected sites not present

Redundant work in loops 0%

The redundant-work anti-pattern appears in several places: (1) curator skill classification uses deeply nested loops with repeated regex/path matching work, (2) Excel DCF validation performs per-cell nested scanning over multiple error patterns, and (3) Telegram DM topic lookup runs nested scans over cached topics and config on each lookup. I did not find any site where the redundant work is hoisted, batched, or memoized so that the per-iteration cost stays roughly constant.

high
In `agent/curator.py`, reduce the multiplicative nested work in `_classify_removed_skills`: precompile regexes for `needles`, pre-normalize needle variants once, and build indexes from `parsed_calls` (e.g., a mapping from target skill name → referenced removed skill names based on fields). Then replace repeated inner-loop `re.search(...)` and repeated haystack scanning with O(1)/amortized lookups.
- agent/curator.py:630-709 — Shows repeated regex/path-component matching inside multiple nested loops (`for name in removed` → `for args in parsed_calls` → `for key in (...)` → `for needle in needles`).
high
In `gateway/platforms/telegram.py`, change `_get_dm_topic_info` to avoid nested scanning of `_dm_topics_config` for each cache hit. Build a single lookup table keyed by `(chat_id, thread_id, topic_name)` (or `(chat_id:topic_name, thread_id)` mapping) during `_reload_dm_topics_from_config`, and have `_get_dm_topic_info` return directly from that map without iterating `for chat_entry ... for t in chat_entry.get('topics', [])`.
- gateway/platforms/telegram.py:5750-5860 — Demonstrates that `_get_dm_topic_info` iterates over cached items and then iterates again over `_dm_topics_config` and `topics` to construct the full return object.
med
In `optional-skills/finance/dcf-model/scripts/validate_dcf.py`, reduce per-cell × per-error-type scanning: replace the inner `for err in excel_errors` loop with a single classification strategy (e.g., check using a compiled pattern that matches any known error token and extract which one matched), and avoid repeated coordinate lookups where possible (iterate formulas and values together or cache `ws_formulas` row-wise).
- optional-skills/finance/dcf-model/scripts/validate_dcf.py:70-105 — Shows `for cell in row` then `formula_cell = ws_formulas[cell.coordinate]` and then `for err in excel_errors` substring checks for each string cell.
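
The "single compiled pattern" idea from the last recommendation can be sketched as follows. The token list is an assumption (the audit does not show the real `excel_errors` contents), and `first_error` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed token list; the actual `excel_errors` values are not shown in the audit.
EXCEL_ERRORS = ["#REF!", "#DIV/0!", "#VALUE!", "#NAME?", "#N/A"]

# Compile once, outside any per-cell loop. re.escape keeps '?', '/' and '!' literal.
ERROR_PATTERN = re.compile("|".join(re.escape(tok) for tok in EXCEL_ERRORS))


def first_error(cell_text: str) -> Optional[str]:
    """Return the first known error token found in the cell text, or None."""
    match = ERROR_PATTERN.search(cell_text)
    return match.group(0) if match else None
```

One pass over the cell text replaces one substring check per error type. If the validator needs every error in a cell rather than the first, `ERROR_PATTERN.findall(cell_text)` returns them all in the same single pass.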

Bounded interfaces 0%

The bounded-interfaces primitive is not applied: multiple public surfaces return complete/unbounded collections (e.g., `list_providers()` across registries, `list_oauth_providers()` in the web server, and skill listing helpers). For these collection-returning APIs/handlers, there are no `limit`/cursor/iterator parameters or other bounding mechanism visible at the interface boundary.

high
Add bounding controls to public collection-returning APIs: introduce `limit` (and ideally `cursor`/`offset` or an iterator/streaming interface) to `list_providers()` functions and ensure callers can request partial results.
- agent/image_gen_registry.py:55-130 — Unbounded `list_providers()` currently returns the full provider registry.
- hermes_cli/dashboard_auth/registry.py:1-59 — Unbounded `list_providers()` currently returns the full auth provider registry.
- agent/transcription_registry.py:80-123 — Unbounded `list_providers()` currently returns the full transcription provider registry.
- providers/__init__.py:1-140 — Unbounded `list_providers()` returns all discovered provider profiles.
high
Paginate collection responses at HTTP boundaries. For `GET /api/providers/oauth`, add query params like `limit` and `cursor` (or at least `limit`) and slice `providers` accordingly.
- hermes_cli/web_server.py:3030-3120 — Endpoint builds and returns the full providers list without any pagination/limit parameters.
med
Bound skill-listing helpers to avoid returning complete directories/manifests in one response; add `limit` (and optionally cursor) and propagate it to CLI/API consumers.
- tools/skill_usage.py:280-390 — `list_agent_created_skill_names()` and `list_archived_skill_names()` return complete sorted lists with no interface-level bound.
med
Expose bounding parameters on Teams artifact listing functions (`list_transcript_artifacts`, `list_recording_artifacts`) rather than collecting the entire paginated result into an in-memory list unconditionally.
- plugins/teams_pipeline/meetings.py:130-280 — Both list_*_artifacts functions return full lists produced via `collect_paginated(...)` with no limit/cursor exposed.
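
A bounded listing interface along the lines recommended above might look like this sketch. `Page` and `list_page` are hypothetical names, the registry is assumed to be a dict keyed by provider name, and the cursor is assumed to be the last name returned on the previous page.

```python
import bisect
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    items: list
    next_cursor: Optional[str]  # None when there are no further results


def list_page(registry: dict, limit: int = 50, cursor: Optional[str] = None,
              max_limit: int = 200) -> Page:
    """Return at most `limit` names in sorted order, starting after `cursor`."""
    if limit < 1:
        raise ValueError("limit must be >= 1")
    limit = min(limit, max_limit)  # the server-side cap wins over the caller's request
    names = sorted(registry)
    # bisect_right resumes after the cursor even if that name has since been removed.
    start = bisect.bisect_right(names, cursor) if cursor is not None else 0
    window = names[start:start + limit]
    has_more = start + limit < len(names)
    return Page(items=window, next_cursor=window[-1] if has_more else None)
```

This bounds the response size, not the work: the sketch still sorts the full registry on each call, which is acceptable for small in-memory registries but would need an ordered store for large ones.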

Memoization / caching 122%

Memoization/caching is clearly present. The strongest, correctness-focused implementation is in katex-memo.ts: it uses a bounded LRU cache keyed by equation inputs and safely clones cached subtrees to prevent mutation bugs. Additional caching is used in the desktop UI for external link title fetching (including in-flight request deduping) and for derived pane-state atoms (to keep subscriptions stable).

high
Add bounding/eviction for the external-link title cache (titleCache/titleInflight/titleSubs) to prevent unbounded growth over long sessions; consider an LRU with a reasonable max entries and clearing related inflight/subscriber state on eviction.
- apps/desktop/src/lib/external-link.tsx:10-72 — Caches in titleCache/titleInflight/titleSubs are Maps without any size limit/eviction strategy, so memory can grow without bound.
med
Add explicit invalidation semantics for cached link titles if the bridge can return time-varying results (e.g., rotating “attention required” pages). If titles are effectively stable, document that assumption next to the cache implementation.
- apps/desktop/src/lib/external-link.tsx:78-125 — fetchLinkTitle stores results permanently in titleCache and deletes inflight tracking, but provides no TTL/invalidation pathway.
low
For katex-memo, consider exposing CACHE_LIMIT as a configuration constant (or adding lightweight telemetry) so performance tuning can be done without editing source.
- apps/desktop/src/lib/katex-memo.ts:44-78 — CACHE_LIMIT is currently hard-coded to 512; tuning may be beneficial for different device classes or memory constraints.

Resource reuse / pooling 133%

Resource reuse / pooling is present and correctly applied in the main expensive-client hotspots: Bedrock boto3 clients are cached per region; managed FAL clients for both image generation and video generation are cached and reused across requests (with locking and config identity checks). This avoids rebuilding clients/HTTP connection pools on each call.

high
If there are other call paths that construct per-request API clients (e.g., auxiliary/provider client resolution in agent/auxiliary_client.py for OpenAI/httpx-based SDKs), confirm they use a shared bounded cache with eviction/invalidation and that cache keys include all parameters that affect transports (base_url, auth mode, model/endpoint, etc.). Add/extend tests to assert reuse across consecutive calls similarly to the managed FAL tests.
- agent/auxiliary_client.py:1-75 — Auxiliary client is explicitly positioned as shared; ensure the concrete client construction sites are also pooled and not re-created per call.

Off-critical-path execution 100%

This codebase does apply off-critical-path execution: after each turn, it defers a slow self-improvement review (forked agent + tool/memory/skill writes) into a separate daemon thread. The background worker is isolated (stdout/stderr suppression, tool whitelist) and has robust exception handling and cleanup to prevent foreground latency or crashes.

med
Confirm there are no other per-message/per-turn hot paths in the gateway or agent loop that perform slow network/file operations inline (e.g., media caching, large tool payload processing, provider calls) without deferral to a worker/queue. If found, route them through the existing background-thread/worker patterns or a centralized worker pool with retry/idempotency.
- gateway/platforms/base.py:560-760 — Contains async media/audio caching with retries (network I/O + file writes). Verify call sites don’t invoke these inline on the foreground critical path.
low
For any background jobs added beyond background_review, standardize: (1) tool/file/network work fully in worker, (2) bounded retry policy with idempotency keys for memory/skill writes, (3) explicit cleanup/shutdown in finally blocks, mirroring background_review’s pattern.
- agent/background_review.py:241-598 — Provides a good template: worker isolation + tool whitelist + exception capture + shutdown/close cleanup.

Lookup data structures 0%

This codebase does use lookup data structures effectively (notably in the LSP service manager via dict/set caches and membership checks). However, the skill lookup helpers in tools/skill_manager_tool.py appear to rely on repeated recursive filesystem scans for name-based lookup, where an indexed cache/map would better match the intended O(1)/O(log n) lookup primitive.

high
Add a cached index for skill-name → skill directory (and optionally per-profile) built once (or incrementally updated) and used by _find_skill and _find_skill_in_other_profiles instead of rglob scans on every call.
- tools/skill_manager_tool.py:330-405 — _find_skill performs nested loops over directories and rglob("SKILL.md") with a string comparison; this is a linear scan pattern for a repeated lookup key (skill name).
- tools/skill_manager_tool.py:405-488 — _find_skill_in_other_profiles repeats the same recursive search pattern across profile roots; should reuse a prebuilt lookup map.
med
If filesystem indexing is hard, at least memoize the results for the active roots (with invalidation on skill directory mtime changes or when write actions occur) so repeated calls in a session avoid re-walking the tree.
- tools/skill_manager_tool.py:330-405 — A straightforward memoization layer can wrap _find_skill(name) because the function’s result depends on filesystem state under get_all_skills_dirs().
low
Confirm whether _find_skill / _find_skill_in_other_profiles are on any hot path by checking call frequency from tools/skill_manager_tool entrypoints and CLI flows; if rare, prioritize index work elsewhere first.
- tools/skill_manager_tool.py:330-520 — These helpers are used for collision detection and error messaging; their exact frequency will determine whether an index is worth the complexity.
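
The cached index from the first recommendation can be sketched as a one-time walk that builds a name-to-directory map. `build_skill_index` is a hypothetical helper, and the sketch assumes a skill's name is the name of the directory holding its `SKILL.md`; the real lookup may compare against something else.

```python
from pathlib import Path


def build_skill_index(roots: list[Path]) -> dict[str, Path]:
    """Walk each root once and map skill name -> skill directory."""
    index: dict[str, Path] = {}
    for root in roots:
        for manifest in sorted(root.rglob("SKILL.md")):
            # Assumption: the skill name is the manifest's parent directory name.
            # setdefault keeps the first match, so earlier roots take precedence.
            index.setdefault(manifest.parent.name, manifest.parent)
    return index
```

After the index is built, each lookup is a plain `index.get(name)` instead of a fresh recursive scan. The index has to be rebuilt or updated whenever a skill is created, renamed, or removed.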

Batching round-trips 100%

Batching round-trips exists and is applied correctly at key I/O boundaries: SSE text deltas are buffered and emitted as combined updates on a short timer, and nested delegate tool progress events are summarized in batches rather than relayed one-by-one.

high
Search for remaining patterns where the code performs per-item network/database writes inside loops (e.g., per-message/per-tool/per-record persistence or per-item HTTP calls) and replace them with bulk/batched variants (multi-write, executemany, bulk insert, batched API calls).
- gateway/platforms/api_server.py:2440-2555 — Shows the desired batching approach at the streaming boundary; use as a reference when refactoring other per-item emissions.
med
Add/extend tests that assert batching behavior indirectly (e.g., number of SSE/emitted deltas or number of parent progress relays stays sub-linear with N deltas/tools).
- gateway/platforms/api_server.py:2440-2555 — Batch size/timer is explicit (0.05s); this makes it testable by driving multiple deltas and counting emitted calls.

Shared-state synchronization 167%

This codebase does implement the shared-state synchronization primitive. Critical shared mutable state is protected with minimal-scope locking: in-process shared dictionaries for dashboard auth providers and WS tickets are guarded by `threading.Lock`, and cross-process gateway runtime ownership is synchronized via OS file locks. No broad or unguarded shared writes were observed in the inspected synchronization hotspots.

med
Extend the audit to other shared caches/singletons (e.g., any shared dict/registry used by concurrent request handlers) and confirm each read-modify-write path is consistently guarded or uses atomic/lock-free structures appropriate to the language runtime.
- hermes_cli/dashboard_auth/registry.py:1-59 — Demonstrates the desired locking pattern; use it as the reference standard when checking additional shared registries/caches elsewhere.
low
For cross-process locks in `gateway/status.py`, ensure all call sites always pair acquire/release correctly (especially around exceptions) so the lock handle lifecycle remains consistent.
- gateway/status.py:360-470 — Lock ownership is tracked via `_gateway_lock_handle` and released by `release_gateway_runtime_lock`; call-site pairing should be verified.

Bounded concurrency / backpressure 0%

The codebase has bounded concurrency/backpressure in at least one hot fan-out: trajectory_compressor.py caps concurrent in-flight summarization/API calls with an asyncio.Semaphore. In the gateway message dispatcher, however, new message handling intentionally spawns background tasks for interruption support with no corresponding global concurrency/in-flight cap, so bursts can create unbounded fan-out (an expected site that is currently unmatched). During shutdown, cancel_background_tasks uses bounded drain rounds and timeouts to avoid waiting indefinitely.

high
Add a global (or per-adapter/per-platform) in-flight cap for _start_session_processing / _process_message_background so inbound message bursts can’t spawn unbounded concurrent tasks. For example, wrap background task creation or the body of _process_message_background with an asyncio.Semaphore (or a task queue with max workers) and ensure tasks beyond the cap apply backpressure (delay/queue/merge) instead of immediately spawning.
- gateway/platforms/base.py:3772-3962 — handle_message spawns background tasks to process messages while allowing interruption; this is the primary message fan-out point that should be bounded.
- gateway/platforms/base.py:3560-4620 — _process_message_background creates per-task additional concurrent work (e.g., typing_task via asyncio.create_task), compounding unbounded fan-out when message-processing tasks are unbounded.
med
Consider bounding the per-message typing-task concurrency as well (e.g., reuse a shared rate-limited typing indicator scheduler, or guard asyncio.create_task(self._keep_typing(...)) behind a semaphore) to reduce multiplicative load under bursts.
- gateway/platforms/base.py:3560-4620 — Typing indicator is started as a new asyncio task inside _process_message_background for each message-processing task.
low
Extend the same semaphore/worker-pool pattern used in trajectory_compressor.py to other high-cardinality fan-out utilities (if present elsewhere) to keep concurrency behavior consistent across CLI tooling and the live gateway.
- trajectory_compressor.py:980-1240 — Demonstrates the desired pattern: create many tasks but gate entry into the heavy I/O section with an asyncio.Semaphore.
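
The semaphore gate recommended above can be illustrated with a small stand-alone sketch. `BoundedDispatcher` is a hypothetical helper, not code from the gateway, and the `in_flight`/`peak` counters exist only to make the cap observable.

```python
import asyncio


class BoundedDispatcher:
    """Caps how many message handlers run at once (hypothetical helper)."""

    def __init__(self, max_in_flight: int) -> None:
        self._sem = asyncio.Semaphore(max_in_flight)
        self.in_flight = 0
        self.peak = 0

    async def run(self, handler, message):
        # Callers beyond the cap wait here instead of starting work immediately.
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await handler(message)
            finally:
                self.in_flight -= 1
```

As in the pattern the audit describes for trajectory_compressor.py, tasks may still be created freely; the semaphore only gates entry into the expensive section. Limiting task creation itself would need a bounded queue in front of the dispatcher.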

Lazy / minimal computation 100%

The primitive is present and applied cleanly in two main places: (1) KaTeX rendering during streaming is memoized so only newly-arrived/changed math expressions are recomputed (minimal computation at the UI work boundary), and (2) tool-gateway readiness/token logic avoids synchronous OAuth refresh by using a cheap probe by default and only refreshing when the caller actually needs a valid token for a request.

high
Add a small explicit comment or test assertion demonstrating that readiness checks (is_managed_tool_gateway_ready) do not trigger refresh when cached tokens are present, to lock in the intended lazy boundary.
- tools/managed_tool_gateway.py:176-192 — Shows the default wiring to peek_nous_access_token for readiness; documenting/locking this via a test prevents regressions back to refresh-on-scan.
med
For katex-memo.ts, consider adding a targeted benchmark/test that validates cache hits avoid calling katex.renderToString (e.g., by stubbing katex in a unit test) to ensure the minimal-computation boundary stays effective.
- apps/desktop/src/lib/katex-memo.ts:126-205 — Cache lookup and miss-only rendering are the core minimal-computation mechanism; a regression test would best protect this behavior.

Streaming over buffering 0%

I did not find a code path that applies the “streaming over buffering” primitive (bounded-memory streaming/chunk iteration over arbitrarily large inputs). Instead, the main anti-pattern appears in `read_file_raw()` (full `cat` into a single string) and in patch validation, which calls `read_file_raw()` for UPDATE operations, buffering whole files before processing.

high
Replace `read_file_raw()` usage in patch validation (`_validate_operations` / apply flow) with an incremental/streaming algorithm that does not require full-file buffering (e.g., process line-by-line with bounded windows matching the patch hunks, or apply hunks using streaming search/replace over iterators).
- tools/patch_parser.py:239-260 — UPDATE validation calls `file_ops.read_file_raw(op.file_path)` which buffers the entire file content before patch simulation.
high
Redesign `read_file_raw()` (and/or its backends) to be bounded-memory: return an iterator/stream of lines/chunks (or accept a callback/consumer) rather than a full in-memory string, when the input size is untrusted or can be large.
- tools/file_operations.py:950-1023 — `read_file_raw()` reads the whole file using `cat` and returns `content=raw_content` (single full string), which breaks the constant-memory requirement for large inputs.
med
Add guardrails/tests that assert memory boundedness for patch/update flows on large files (e.g., generate a large temporary file and verify the code does not call `read_file_raw()` or load full contents when offset/limit or patch hunks would suffice).
- tests/tools/test_patch_parser.py:1-260 — There are patch parser tests, but none assert bounded-memory behavior or specifically target large-file patch/validation paths.
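
A bounded-memory alternative to whole-file buffering can be sketched as a line iterator plus a sliding window sized to the hunk being matched. This assumes a locally readable file opened with Python's `open()`, which may not hold for every backend behind `read_file_raw()`; `iter_lines` and `find_hunk_start` are hypothetical names.

```python
from collections import deque
from typing import Iterator, Optional


def iter_lines(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield one line at a time; the file is never held in memory as a whole."""
    with open(path, "r", encoding=encoding) as fh:
        for line in fh:
            yield line.rstrip("\n")


def find_hunk_start(path: str, context: list[str]) -> Optional[int]:
    """Return the 1-based line number where `context` first appears consecutively."""
    if not context:
        raise ValueError("context must contain at least one line")
    window: deque = deque(maxlen=len(context))  # keeps only the last len(context) lines
    for lineno, line in enumerate(iter_lines(path), start=1):
        window.append(line)
        if len(window) == len(context) and list(window) == context:
            return lineno - len(context) + 1
    return None
```

Memory use is bounded by the hunk size and the longest single line rather than by the file size.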

Reliability Primitives

Retries, circuit breakers, idempotency keys, health checks, and a runbook for each service.

72% 10/11 scored

Timeouts 100%

4/4 expected sites
Retry with backoff + jitter 67%

1/1 expected sites
Idempotency 0%

0/1 expected sites
Graceful degradation / fallback 100%

3/3 expected sites
Error handling & propagation 0%

0/1 expected sites
Deterministic resource cleanup 100%

1/1 expected sites
Atomicity / all-or-nothing 67%

1/1 expected sites
Input / boundary validation 100%

4/4 expected sites
Failure isolation / bulkheading 100%

1/1 expected sites
Graceful shutdown 83%

2/2 expected sites

Timeouts 100%

Timeouts are implemented and wired through the core auxiliary LLM client path. Provider/model timeout configuration is centralized in hermes_cli/timeouts.py, propagated through agent/auxiliary_client.py into provider request kwargs via _build_call_kwargs, and the Codex streaming adapter additionally enforces a hard monotonic deadline with client close/eviction on timeout.

high
Audit other I/O boundaries for missing/optional timeout propagation (e.g., non-LLM streaming tools, subprocess readers, websocket/SSE loops) by locating each external/blocking call site and verifying it always receives a deadline/timeout or is wrapped by an equivalent bounded watchdog.
- agent/auxiliary_client.py:586-780 — Codex streaming is handled correctly (deadline + close/evict), so the main risk is other adapters/clients that may not have the same level of enforcement.
med
Standardize timeout semantics across call chains (confirm consistent meaning of `timeout` across providers/tasks: connect timeout vs total request timeout vs stream idle timeout) so that timeouts behave predictably under retries and streaming.
- hermes_cli/timeouts.py:1-83 — Current helpers distinguish request timeout vs stale timeout; ensure all call sites interpret these consistently when wiring into streaming vs non-streaming code.

Retry with backoff + jitter 67%

The codebase contains a dedicated jittered exponential backoff helper (`agent.retry_utils.jittered_backoff`) with a capped budget and bounded jitter. However, the main production retry path for transient message send failures (`gateway/platforms/base.py::_send_with_retry`) implements exponential backoff with only a small fixed-range jitter, and does not use the shared jittered-backoff utility. It is still applied in the right general retry location, but the implementation is only partially aligned with the 'backoff + jitter' primitive quality expectations.

high
Update `gateway/platforms/base.py::_send_with_retry()` to use the shared `agent.retry_utils.jittered_backoff()` (or match its behavior): make jitter proportional to the computed delay, use a configurable `max_delay` cap (in addition to `max_retries`), and ensure delays are decorrelated across concurrent sessions.
- gateway/platforms/base.py:3269-3344 — Current implementation uses `delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)` inside the retry loop.
- agent/retry_utils.py:1-58 — Provides the correct primitive behavior: exponential backoff + `max_delay` cap + bounded proportional jitter via `jitter_ratio`.
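
The difference between a fixed `random.uniform(0, 1)` add-on and proportional jitter can be seen in the sketch below. It is not the signature of `agent.retry_utils.jittered_backoff`, which the audit does not show; `backoff_delay` and its defaults are assumptions, with the `max_delay` and `jitter_ratio` names borrowed from the description above.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 30.0,
                  jitter_ratio: float = 0.5,
                  rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (1-based): capped, then proportionally jittered."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    source = rng if rng is not None else random
    exponent = min(attempt - 1, 32)  # keep 2**n small even for very large attempt counts
    capped = min(base_delay * (2 ** exponent), max_delay)
    # Jitter scales with the delay, so late retries spread out as much as early ones.
    return capped + source.uniform(0, jitter_ratio * capped)
```

With the defaults, the third retry waits between 4.0 and 6.0 seconds, and every retry from the sixth onward waits between 30.0 and 45.0 seconds.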

Idempotency 0%

Idempotency is implemented for the API server’s non-streaming chat-completions path via an _IdempotencyCache that deduplicates concurrent requests using (Idempotency-Key + a request fingerprint). However, the generic send retry logic in gateway/platforms/base.py appears to retry by re-sending without a visible idempotency/dedup guard at the retry boundary, which is the highest-risk gap for duplicate side effects on transient failures.

high
Add an idempotency/dedup mechanism to the send retry boundary (gateway/platforms/base.py:_send_with_retry). For example: generate a per-message idempotency token that stays constant across retry attempts, and/or consult a per-chat/outbound-message dedup cache so repeated self.send() calls on transient errors don’t create duplicates.
- gateway/platforms/base.py:3269-3349 — Retry loop re-calls self.send() on transient/network errors; without a dedup/idempotency guard in this boundary, duplicates are possible on unhappy-path retries.
med
For outbound adapters, ensure the dedup strategy used for inbound events (MessageDeduplicator) is also applied (or complemented) for outbound retries at the exact send boundary, not only for inbound message handling.
- gateway/platforms/base.py:3269-3349 — The retry boundary is centralized here; platform-specific inbound dedup doesn’t automatically prevent duplicate outbound messages.
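
One way to place a dedup guard at the retry boundary is sketched below. It assumes a `send` callable that accepts an `idempotency_token` keyword; that parameter, `OutboundDedupCache`, and `send_with_retry` are hypothetical names for illustration and do not describe the existing `_send_with_retry` signature.

```python
import time
import uuid


class OutboundDedupCache:
    """In-memory TTL record of message tokens already delivered (hypothetical)."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._delivered = {}  # token -> expiry timestamp

    def already_delivered(self, token):
        now = self._clock()
        # Drop expired entries before answering.
        for stale in [t for t, exp in self._delivered.items() if exp <= now]:
            del self._delivered[stale]
        return token in self._delivered

    def mark_delivered(self, token):
        self._delivered[token] = self._clock() + self._ttl


def send_with_retry(send, payload, cache, max_retries=3, token=None):
    """Retry send() while keeping ONE token for the whole message."""
    if max_retries < 1:
        raise ValueError("max_retries must be >= 1")
    token = token or uuid.uuid4().hex  # generated once, reused per attempt
    last_error = None
    for _attempt in range(max_retries):
        if cache.already_delivered(token):
            return token  # an earlier call already delivered this message
        try:
            send(payload, idempotency_token=token)
        except ConnectionError as exc:  # stand-in for "transient" errors
            last_error = exc
            continue
        cache.mark_delivered(token)
        return token
    raise last_error
```

The token is stable across attempts, so a receiving platform that honors idempotency tokens can discard a duplicate even when the first attempt succeeded but its acknowledgement was lost. The local cache only protects against a caller re-invoking the retry loop for the same message; it does not survive restarts.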

Circuit breaking / fail-fast N/A

Agent produced no parseable output for this item.

Graceful degradation / fallback 100%

The codebase contains solid graceful degradation/fallback patterns. Desktop runtime-readiness probes catch gateway failures and return a fallback/unknown readiness result with an explicit reason. The agent chat helpers switch to a configured fallback model/provider chain when the primary backend fails. The gateway streaming consumer degrades from edit-based streaming to chunked final sends when edits stop working, preserving delivery of the core response.

high
Audit remaining non-critical dependency calls in the same end-to-end flows (desktop onboarding readiness/model option fetching, agent provider/model routing, and streaming edits→fallback) to ensure all error branches either (a) return real fallbacks with explicit staleness/uncertainty or (b) preserve core output instead of aborting; focus specifically on catch blocks that currently return null/empty without an explicit staleness reason.
- apps/desktop/src/lib/runtime-readiness.ts:39-69 — Demonstrates the intended fallback contract (return error+null value) for gateway failures; other nearby probes should follow this contract.

Error handling & propagation 0%

The primitive is implemented in at least one critical area: MCP server lifecycle handling explicitly preserves cancellation semantics (`CancelledError` re-raise), with time-bounded shutdown. However, there is at least one localized catch that swallows failures (`_write_stderr_log_header` uses `except Exception: pass`), which violates the 'never silently drop failures' expectation.

high
Replace the silent swallow in `_write_stderr_log_header` with context-rich logging (and/or re-raising if this logging is considered important). At minimum, log the exception with `logger.debug/warning` including `server_name` to avoid silent failures.
- tools/mcp_tool.py:150-167 — Contains `except Exception: pass`, which drops errors without surfacing context.
med
Audit other broad `except` blocks in `tools/mcp_tool.py` (and similar lifecycle/transport modules) for 'silent fallback' patterns. Where fallback is acceptable, ensure errors are logged with enough context (server name / operation / timeout) and that the fallback cannot mask partial corruption or deadlocks.
- tools/mcp_tool.py:150-190 — Demonstrates both acceptable (debug fallback to devnull) and unacceptable (pass) error-handling styles within nearby stderr-log helpers.
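
A context-rich replacement for a silent `pass` might look like the sketch below. The function body is an assumption (the real `_write_stderr_log_header` was not reproduced in this report); only the pattern of logging with `server_name` and the traceback is the point.

```python
import logging

logger = logging.getLogger("mcp_tool_sketch")  # hypothetical logger name


def write_stderr_log_header(log_file, server_name):
    """Write a header line; return True on success, False on failure.

    Failure is still non-fatal, but it is logged with the server name
    and traceback instead of being dropped.
    """
    try:
        log_file.write(f"=== MCP server {server_name} stderr ===\n")
        log_file.flush()
        return True
    except Exception:
        logger.warning(
            "could not write stderr log header for MCP server %r",
            server_name,
            exc_info=True,
        )
        return False
```

On Python 3.8 and later, `asyncio.CancelledError` derives from `BaseException`, so `except Exception` here does not intercept cancellation.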

Deterministic resource cleanup 100%

Deterministic resource cleanup is present and correctly applied at the primary handle acquisition site observed: `gateway/status.py` acquires a lock file descriptor with `os.open` and guarantees release by scoping it inside a `with os.fdopen(...)` block, covering the exception path during JSON writing.

med
Apply the same pattern (scope-bound `with`/`finally`/RAII) at any other raw acquisition sites (e.g., other direct `os.open`/lock/socket acquisitions) where release is not obviously guaranteed on the throw path.
- gateway/status.py:578-636 — This is the validated reference pattern in the codebase: low-level acquisition is immediately wrapped in scope-bound cleanup (`with os.fdopen(...)`).

Atomicity / all-or-nothing 67%

Atomicity/all-or-nothing behavior is present primarily in `agent/curator_backup.py` where the rollback of the skills directory is implemented with staging and best-effort restoration if extraction fails. Broader atomicity guarantees for DB multi-step updates (transactions) were not exhaustively audited here; the strongest concrete all-or-nothing pattern observed is the file-tree rollback recovery.

high
Audit DB mutation sequences for multi-step consistency: identify functions that perform multiple related writes (e.g., inserting/updating several tables/rows that must stay consistent) and ensure they use explicit transactions (BEGIN/COMMIT/ROLLBACK or equivalent in the DB layer).
- hermes_state.py:260-520 — SessionDB has careful WAL setup and retry strategy, but an atomicity audit requires reading specific multi-step write methods to confirm they wrap related mutations in transactions on failure paths.
med
For file-tree atomicity, strengthen rollback recovery semantics: rather than mutating the live directory and relying on a best-effort move-back when staging, moving, or extraction fails, consider writing the snapshot into a fully isolated tempdir and then using an atomic rename/swap for the final step (where the filesystem supports it).
- agent/curator_backup.py:529-667 — Current approach mutates the live skills directory via tar extraction and then restores from staged contents on failure; this is good best-effort atomicity, but it isn’t as robust as an atomic rename/swap finalization step.
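
The stage-then-swap approach can be sketched as follows. `replace_dir_atomically` and its `populate` callback are hypothetical; the callback stands in for whatever extracts the snapshot, and nothing here reflects the actual structure of `agent/curator_backup.py`.

```python
import os
import shutil
import tempfile


def replace_dir_atomically(live_dir, populate):
    """Build the new tree in a sibling temp dir, then swap it into place.

    populate(staging_dir) fills the staging directory and may raise;
    live_dir is not touched until populate() has fully succeeded.
    """
    live_dir = os.path.abspath(live_dir)
    parent = os.path.dirname(live_dir)
    # Same parent directory, so the renames stay on one filesystem.
    # Note: mkdtemp creates the directory with mode 0o700.
    staging = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    old = None
    try:
        populate(staging)
        if os.path.exists(live_dir):
            old = staging + ".old"
            os.rename(live_dir, old)        # step 1: move live tree aside
        try:
            os.rename(staging, live_dir)    # step 2: promote staged tree
        except OSError:
            if old is not None:
                os.rename(old, live_dir)    # undo step 1
                old = None
            raise
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    if old is not None:
        shutil.rmtree(old, ignore_errors=True)
```

Each rename is atomic on POSIX filesystems, but the two-step swap still leaves a brief window in which `live_dir` does not exist. Closing that window needs a platform-specific exchange primitive or a symlink that is repointed in one step.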

Input / boundary validation 100%

The codebase applies input/boundary validation strongly in the dashboard-auth OAuth routes: it validates the `next` redirect target and enforces PKCE/state/provider checks on callback inputs before allowing redirects or login completion. CLI argv handling appears to rely on argparse for boundary constraints, but the strongest, most explicit validation is in `hermes_cli/dashboard_auth/routes.py`.

med
Audit other public entry points for the same level of explicit validation (e.g., any remaining FastAPI/HTTP routes that consume query/path/body params), ensuring invalid inputs are rejected at the boundary and not only during downstream processing.
- hermes_cli/dashboard_auth/routes.py:130-245 — This file shows strong validation for `next`, `provider`, and callback state; other routes should be checked for equivalent treatment.

Failure isolation / bulkheading 100%

The codebase contains a strong bulkheading implementation for shared LLM/HTTP client resources in `agent/auxiliary_client.py`. It bounds the shared client cache and isolates async clients by validating the current open event loop, force-closing stale transports and evicting old entries to prevent shared-resource exhaustion from one failing workload.

low
Add/confirm targeted tests that simulate (a) event-loop switching across gateway worker threads and (b) repeated aux calls that would otherwise expand the cache beyond the max size, asserting that stale cached clients are force-closed and that unrelated aux calls still succeed while the cache is being evicted.
- agent/auxiliary_client.py:4180-4900 — These are the key mechanisms (loop_ok validation, FIFO eviction, and _force_close_async_httpx) that should be covered by unhappy-path tests.

Graceful shutdown 83%

Graceful shutdown support is present and well-implemented in the Node/TS TUI via a reusable setupGracefulExit helper that runs async cleanups (including killing the gateway and resetting terminal modes) before exiting. The Python tui_gateway entrypoint also registers signal handlers and uses a bounded grace period with a hard-failsafe exit, though it does not explicitly show draining of in-flight/queued work beyond exiting the stdin dispatch loop.

high
In tui_gateway/entry.py, add explicit shutdown coordination so the signal handler stops accepting new stdin/dispatch work and waits for any in-flight/worker operations to complete (or to reach a safe cancellation point) before sys.exit/os._exit.
- tui_gateway/entry.py:1-220 — Signal handler logs then starts a grace timer and immediately proceeds to sys.exit(0); no explicit drain/stop-accepting-work mechanism is shown in the handler.
- tui_gateway/entry.py:220-299 — Main loop continuously reads sys.stdin and dispatches requests; process exit will abort this loop, but there is no visible coordination to finish work already dispatched before exiting.
med
If gw.kill/cancellation in ui-tui may leave buffered writes in the gateway, ensure gw.kill has bounded completion semantics (e.g., awaiting a cancellation acknowledgement with a timeout) rather than relying solely on the outer failsafe.
- ui-tui/src/entry.tsx:1-115 — Cleanup awaits gw.kill('graceful-exit-cleanup') via the helper’s Promise.allSettled, but the gateway-side kill semantics/timeout are not shown here.

Not applicable to this codebase: Circuit breaking / fail-fast.

API & Extensibility

A checked-in OpenAPI spec, versioned routes, a webhook system with retries and signing, and tenant-scoped rate limits.

18% 7/10 scored

Machine-readable API contract 0%

0/2 expected sites not present
Programmatic auth with scopes 0%

0/2 expected sites not present
Idempotent writes 0%

0/7 expected sites
Consistent pagination & filtering 0%

0/4 expected sites
Consistent errors & status codes 33%

5/5 expected sites
Sandbox / test mode 0%

0/2 expected sites not present
Extension points / plugins 94%

6/6 expected sites

Machine-readable API contract 0%

No checked-in, machine-readable API contract spec (OpenAPI/Swagger/AsyncAPI/proto/GraphQL SDL) was found anywhere in the repository. The codebase does include a public, external-facing OpenAI-compatible API adapter (gateway/platforms/api_server.py) with many documented routes, but there is no corresponding checked-in spec artifact that third parties can rely on to integrate without contacting the maintainers.

high
Create and check in an OpenAPI (or equivalent) spec that covers the full public route inventory exposed by gateway/platforms/api_server.py, including all /v1/* and /api/* session endpoints and /health endpoints. Ensure the spec includes request/response schemas, example payloads, and a common error format with status codes and error codes.
- gateway/platforms/api_server.py:1-70 — This module is the public API adapter and enumerates the external route set and intended clients; an API contract spec should exist alongside it to cover those routes.
high
Add an automated sync/coverage mechanism: generate the spec from route definitions (or contract-test that the spec path list matches the registered route inventory) in CI, failing the build if the spec drifts or covers only a fraction of endpoints.
- gateway/platforms/api_server.py:1-70 — The module declares many public endpoints; without automated coverage checks, a spec will quickly become stale relative to implementation.
med
If you want to keep using the existing /v1/capabilities endpoint, link it to the spec and document how clients can obtain the spec version (e.g., spec URL or embedded version hash), so it remains a stable, discoverable contract.
- gateway/platforms/api_server.py:1-70 — /v1/capabilities is already described as machine-readable, but the primitive requires a checked-in spec file that drives docs/sample payloads and covers all public routes.

Versioning & backward compatibility N/A

I could not confirm a consumer-facing Versioning & backward compatibility strategy (no checked-in API contract/spec and no explicit versioning/deprecation/sunset policy for public HTTP routes). While there are internal “version” fields and some protocol/version compliance in non-HTTP contexts (e.g., dashboard-auth provider contract testing and runtime config/versioning regression guards), this does not amount to a stable, third-party discoverable versioning strategy for the codebase’s public API surface.

high
Inventory the public HTTP API surface and add a checked-in machine-readable contract (OpenAPI/Swagger) covering *all* endpoints. Ensure the spec is tied to route registration (generated or contract-tested) so it can’t drift.
- gateway/platforms/api_server.py:1-80 — This file documents the public HTTP endpoints (including /v1/* and /api/*), but the repo appears to have no checked-in OpenAPI/Swagger/AsyncAPI spec artifact discoverable via filename search.
high
Define and implement a versioning policy for the public API: either (a) explicit versioned routes (/v1, /v2, …) with deprecation+sunset headers and migration docs, or (b) an unversioned route strategy that is strictly backward compatible with explicit deprecation markers for any breaking behavior.
- gateway/platforms/api_server.py:1-80 — Public endpoints include both versioned (/v1/*) and unversioned (/api/sessions, /health) paths; no in-file deprecation/sunset policy is evident from the public surface documentation.
med
Add contract tests for backward compatibility: run CI checks that compare current response/error schemas and pagination/filter conventions against the previous release (or golden contract) to detect breaking changes early.
- tests/gateway/test_whatsapp_reply_prefix.py:1-120 — There are regression-style tests for config-version coverage, showing the project uses version-guard patterns—but those are for internal config bridging, not for HTTP API schema compatibility.

Programmatic auth with scopes 0%

No implementation of programmatic auth with per-credential scopes (scoped, revocable API credentials distinct from the user session) was found. The OpenAI-compatible API server adapter authenticates callers using a single shared `API_SERVER_KEY` bearer token; related sensitive behavior like `X-Hermes-Session-Key` is only gated by whether that global key is configured, with no evidence of per-credential scopes/rotation/revocation/last-used.

high
Replace the single shared `API_SERVER_KEY` bearer auth in `_check_auth` with a scoped credential model: issue per-credential tokens/keys that carry scopes; validate scopes on each endpoint/request (e.g., chat/run/read responses vs session/memory management). Add credential identifiers to logs for auditing and enforce revocation/rotation and last-used tracking.
- gateway/platforms/api_server.py:806-858 — Current auth compares the provided bearer token directly to one configured secret (`self._api_key`) and returns a generic 401 on mismatch; no scope parsing/enforcement exists here.
high
Scope-protect long-term memory/session scoping (`X-Hermes-Session-Key`) rather than gating solely on the global key being configured. Require specific scopes for allowing callers to supply/alter session keys and ensure the enforcement is tied to the credential used for auth (not only server configuration).
- gateway/platforms/api_server.py:873-939 — `_parse_session_key_header` only checks whether `self._api_key` exists; if absent it returns 403, but it does not check any scopes associated with the presented credential.
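
A scoped credential model of the kind recommended above might be sketched as below. The token format (`<key_id>.<secret>`), the scope names, and the `CredentialStore` class are all assumptions for illustration; none of them exist in `api_server.py`.

```python
import hashlib
import hmac
import time
from dataclasses import dataclass
from typing import Optional


def _hash(secret):
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


@dataclass
class Credential:
    key_id: str
    secret_hash: str          # only the hash is stored, never the secret
    scopes: frozenset
    revoked: bool = False
    last_used: Optional[float] = None


class CredentialStore:
    """Hypothetical in-memory store of scoped, revocable API credentials."""

    def __init__(self):
        self._by_id = {}

    def issue(self, key_id, secret, scopes):
        self._by_id[key_id] = Credential(key_id, _hash(secret), frozenset(scopes))

    def revoke(self, key_id):
        self._by_id[key_id].revoked = True

    def authorize(self, bearer_token, required_scope):
        """Return (http_status, credential_or_None) for one request."""
        key_id, _, secret = bearer_token.partition(".")
        cred = self._by_id.get(key_id)
        if cred is None or cred.revoked:
            return 401, None
        # Constant-time comparison of the hashed secret.
        if not hmac.compare_digest(cred.secret_hash, _hash(secret)):
            return 401, None
        if required_scope not in cred.scopes:
            return 403, cred  # authenticated, but not allowed to do this
        cred.last_used = time.time()
        return 200, cred
```

Under this sketch, a route that accepts `X-Hermes-Session-Key` would require a dedicated scope (for example a hypothetical `sessions:write`) on the presented credential, instead of checking only whether a global key is configured.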

Per-tenant rate limiting N/A

Rate limiting logic exists for internal platform behaviors (e.g., Signal attachment scheduling and pairing-code flow-control), but there is no evidence of a per-tenant (per consumer) API-edge rate limiter with a stable third-party-facing HTTP contract (standard limit/remaining headers, 429, and retry guidance).

high
If this project exposes any HTTP API surface for third-party integration, add an API-edge per-tenant/per-consumer rate limiter that keys buckets by tenant/credential, and emits standard headers (e.g., X-RateLimit-Limit / X-RateLimit-Remaining / Retry-After) plus a 429 response with clear retry guidance.
- gateway/platforms/signal_rate_limit.py:1-20 — Shows current rate limiting is process-wide platform scheduling, not per-tenant API-edge enforcement with HTTP signaling.
med
Document the consumer/tenant identifier used for rate limiting (e.g., API key subject, org id, or workspace id) and ensure consistent application across all public entrypoints (not just some platform adapters).
- gateway/pairing.py:1-30 — Pairing rate limiting is present but scoped to “per user” within a pairing flow, not a consistent per-tenant contract across a public API surface.
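
A per-consumer limiter that emits the headers named above could be sketched like this. It uses a simple fixed window keyed by a consumer identifier; the class name and the choice of a fixed window (rather than a token bucket or sliding window) are assumptions, not a description of existing code.

```python
import math
import time


class PerConsumerRateLimiter:
    """Fixed-window limiter keyed by consumer id (hypothetical sketch)."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._windows = {}  # consumer_id -> (window_start, request_count)

    def check(self, consumer_id):
        """Return (http_status, headers) for one request."""
        now = self._clock()
        start, count = self._windows.get(consumer_id, (now, 0))
        if now - start >= self._window:
            start, count = now, 0  # window elapsed, start a new one
        headers = {"X-RateLimit-Limit": str(self._limit)}
        if count >= self._limit:
            retry_after = max(1, math.ceil(start + self._window - now))
            headers["X-RateLimit-Remaining"] = "0"
            headers["Retry-After"] = str(retry_after)
            return 429, headers
        count += 1
        self._windows[consumer_id] = (start, count)
        headers["X-RateLimit-Remaining"] = str(self._limit - count)
        return 200, headers
```

Keying the bucket by the authenticated credential (rather than by process or platform) is what makes the limit per-consumer: one noisy integrator exhausts only its own budget.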

Idempotent writes 0%

This codebase includes an idempotency mechanism for OpenAI-compatible API writes, implemented as an in-memory `_IdempotencyCache` with TTL, fingerprinting, and in-flight deduplication. However, the idempotency-key handling is only applied to the non-streaming portions of `POST /v1/chat/completions` and `POST /v1/responses`. Streaming branches and several dashboard session mutation endpoints (`/api/sessions`, `/api/sessions/{id}/fork`, PATCH updates, and session chat endpoints) do not handle `Idempotency-Key`, leaving retry/double-execution gaps for integrators.

high
Add `Idempotency-Key` handling to the streaming branches of `POST /v1/chat/completions` and `POST /v1/responses` (so retries after timeouts/disconnect don’t double-run the agent).
- gateway/platforms/api_server.py:1700-1860 — Streaming branch for `chat/completions` executes/streams without wrapping in `_idem_cache`.
- gateway/platforms/api_server.py:2500-3100 — Streaming branch for `responses` calls `_write_sse_responses(...)` before any idempotency-key logic is applied.
high
Implement idempotent replay for dashboard/session mutations: `POST /api/sessions`, `PATCH /api/sessions/{session_id}`, `POST /api/sessions/{session_id}/fork`, and `POST /api/sessions/{session_id}/chat` (and `.../chat/stream`) by requiring/handling `Idempotency-Key` for safe retries.
- gateway/platforms/api_server.py:1320-1480 — `_handle_create_session` creates sessions but does not read `Idempotency-Key`.
- gateway/platforms/api_server.py:1360-1480 — `_handle_patch_session` mutates session metadata but does not read `Idempotency-Key`.
- gateway/platforms/api_server.py:1400-1480 — `_handle_fork_session` branches/copies messages but does not read `Idempotency-Key`.
- gateway/platforms/api_server.py:1480-1580 — `_handle_session_chat` runs an agent turn but does not read `Idempotency-Key`.
- gateway/platforms/api_server.py:1580-1700 — `_handle_session_chat_stream` streams agent output without `Idempotency-Key` handling.
med
Improve the correctness contract of idempotency replay beyond in-memory TTL: persist idempotency outcomes for a longer window and make replay behavior explicit (e.g., include a stable replay response identifier and return a distinct error/conflict shape when fingerprints differ for the same key).
- gateway/platforms/api_server.py:620-820 — `_IdempotencyCache` is in-memory with TTL and supports fingerprint match, but does not demonstrate distinct conflict surfacing semantics for mismatched fingerprints or persistence across restarts.
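
Persisted replay with explicit conflict surfacing might be sketched as follows. This is not `_IdempotencyCache`: the SQLite table, `IdempotencyStore`, and `IdempotencyConflict` are hypothetical, and the sketch omits the in-flight deduplication that the existing cache provides.

```python
import hashlib
import json
import sqlite3
import time


class IdempotencyConflict(Exception):
    """Same Idempotency-Key reused with a different request body."""


def fingerprint(body):
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyStore:
    def __init__(self, db_path, ttl_seconds=86400.0, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS idempotency ("
            "key TEXT PRIMARY KEY, fingerprint TEXT NOT NULL, "
            "response TEXT NOT NULL, expires_at REAL NOT NULL)"
        )
        self._db.commit()

    def run(self, key, body, handler):
        """Return (response, replayed). handler(body) runs at most once per key."""
        fp = fingerprint(body)
        now = self._clock()
        row = self._db.execute(
            "SELECT fingerprint, response FROM idempotency "
            "WHERE key = ? AND expires_at > ?",
            (key, now),
        ).fetchone()
        if row is not None:
            if row[0] != fp:
                raise IdempotencyConflict(key)  # caller maps this to HTTP 409
            return json.loads(row[1]), True
        response = handler(body)
        self._db.execute(
            "INSERT OR REPLACE INTO idempotency VALUES (?, ?, ?, ?)",
            (key, fp, json.dumps(response), now + self._ttl),
        )
        self._db.commit()
        return response, False
```

Passing a file path instead of `":memory:"` lets recorded outcomes survive a restart. The `replayed` flag gives the HTTP layer something to expose, for example as a response header, so integrators can tell a replay from a fresh execution.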

Consistent pagination & filtering 0%

Pagination/filtering is implemented only partially: GET /api/sessions uses bounded limit+offset and a `source` filter, but list endpoints like GET /v1/models, GET /v1/skills, and GET /v1/toolsets return their full collections unpaginated and do not share a cursor-based pagination/filtering convention. As a result, third-party integrators cannot rely on a consistent list contract across collections.

high
Introduce a consistent cursor-based pagination contract across all list endpoints in gateway/platforms/api_server.py (e.g., standard `limit` + `cursor`/`next_cursor` query params) and ensure every list endpoint shares the same bounded page size behavior.
- gateway/platforms/api_server.py:1256-1343 — Shows pagination is implemented, but uses offset-based pagination (`limit`/`offset`) rather than a cursor convention.
- gateway/platforms/api_server.py:1098-1238 — Shows other list endpoints (/v1/models, /v1/skills) return the entire collection without shared pagination params.
high
Establish and apply a common filter convention for list endpoints (same parameter names and semantics across collections), including mapping/aliasing where necessary (e.g., how `source` corresponds to other possible dimensions).
- gateway/platforms/api_server.py:1256-1343 — Demonstrates filtering via `source` on /api/sessions, but other list endpoints do not expose comparable filter parameters.
med
Update the capability discovery payload (/v1/capabilities) to document the pagination/filter query params consistently (so integrators can implement once).
- gateway/platforms/api_server.py:1129-1169 — capabilities exposes the endpoint inventory but does not describe pagination/filtering parameters for the list routes.
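
A shared `limit` + `cursor` / `next_cursor` contract could be sketched as below. The helper names, the opaque base64 cursor format, and the `{"data": ..., "next_cursor": ...}` response shape are assumptions; the sketch paginates an in-memory list keyed by `id` purely to show the convention.

```python
import base64
import json


def encode_cursor(last_key):
    raw = json.dumps({"after": last_key}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor):
    raw = base64.urlsafe_b64decode(cursor.encode("ascii"))
    return json.loads(raw)["after"]


def paginate(items, limit=50, cursor=None, max_limit=200,
             key=lambda item: item["id"]):
    """Return one page plus an opaque cursor for the next page (or None)."""
    limit = max(1, min(int(limit), max_limit))  # bounded page size
    ordered = sorted(items, key=key)            # stable order is required
    if cursor is not None:
        after = decode_cursor(cursor)
        ordered = [item for item in ordered if key(item) > after]
    page = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = encode_cursor(key(page[-1])) if has_more else None
    return {"data": page, "next_cursor": next_cursor}
```

Unlike `offset`, the cursor names the last item seen, so rows inserted or deleted between requests do not cause items to be skipped or repeated. Every list route can call the same helper, which is what gives integrators one contract to implement.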

Outbound events / webhooks N/A

The codebase implements inbound webhooks (a webhook receiver that accepts POSTs, validates HMAC signatures, rate-limits, deduplicates deliveries, and then triggers internal agent work). However, it does not implement the outbound-event/webhook primitive described in the rubric: there is no subscription store and delivery worker that pushes versioned, HMAC-signed event payloads to integrator-provided callback URLs with retry/backoff and idempotent redelivery.

high
Introduce a true outbound webhook/event subscription model (storage for subscriber callback URLs + signing secrets + event filters), plus a delivery worker that emits events with versioned payloads, HMAC signing, exponential-backoff retries (bounded), idempotent delivery keys, and a redelivery workflow suitable for integrators.
- gateway/platforms/webhook.py:260-420 — Current behavior is `POST /webhooks/{route_name}` (inbound). It returns 202 Accepted after queuing internal handling, rather than performing outbound event callbacks.
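
The signing half of an outbound delivery can be sketched as below. The header names (`X-Webhook-Id`, `X-Webhook-Timestamp`, `X-Webhook-Signature`), the payload shape, and the function names are hypothetical; the sketch covers only versioned payload construction and HMAC signing, not the subscription store, retries, or the delivery worker.

```python
import hashlib
import hmac
import json
import time


def sign_payload(secret, timestamp, body):
    """HMAC-SHA256 over '<timestamp>.<body>' so the timestamp is covered too."""
    message = f"{timestamp}.".encode("utf-8") + body
    return hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def build_delivery(secret, event_type, data, event_id, now=None,
                   schema_version="1"):
    """Return (headers, body_bytes) for one outbound event."""
    timestamp = int(now if now is not None else time.time())
    body = json.dumps(
        {"id": event_id, "type": event_type,
         "version": schema_version, "data": data},
        sort_keys=True, separators=(",", ":"),
    ).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Webhook-Id": event_id,  # stable across redeliveries
        "X-Webhook-Timestamp": str(timestamp),
        "X-Webhook-Signature": "sha256=" + sign_payload(secret, timestamp, body),
    }
    return headers, body


def verify_delivery(secret, headers, body, tolerance_seconds=300, now=None):
    """What an integrator would run on receipt; rejects stale or altered bodies."""
    timestamp = int(headers["X-Webhook-Timestamp"])
    current = now if now is not None else time.time()
    if abs(current - timestamp) > tolerance_seconds:
        return False
    expected = "sha256=" + sign_payload(secret, timestamp, body)
    return hmac.compare_digest(expected, headers["X-Webhook-Signature"])
```

Keeping `event_id` constant across redeliveries gives subscribers an idempotent delivery key, and including the timestamp in the signed message limits replay of captured requests.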

Consistent errors & status codes 33%

A shared, machine-parseable OpenAI-style error envelope exists in gateway/platforms/api_server.py and is reused by several public HTTP error paths (auth failure, request-body size checks, session-key validation, multimodal validation). However, it does not include a correlation/request id in the error response body, and there is no evidence that this centralized contract enforces the primitive’s full status-code semantics (including 409/422/429, and 5xx reserved for true faults).

high
Extend the centralized error envelope (_openai_error) to always include a correlation/request id (e.g., from an incoming header or generated per request) and ensure every error-returning path uses it (including explicit inline error dicts like the invalid API key response).
- gateway/platforms/api_server.py:502-510 — Shared envelope definition: add correlation id here so all callers inherit it.
- gateway/platforms/api_server.py:830-847 — Auth error currently returns an error envelope without correlation id.
high
Audit and standardize status-code mapping across all public API-server error paths to meet the primitive’s required semantics (400 malformed, 401/403 auth, 409 idempotency conflicts, 422 semantic errors, 429 throttling, 5xx only for true faults).
- gateway/platforms/api_server.py:568-574 — Currently uses 400/413; add evidence/coverage for the other required codes (409/422/429) and ensure they are used consistently.
- gateway/platforms/api_server.py:467-475 — Validation mapped to 400; ensure semantic validation vs malformed request is consistently distinguished (potentially 422 vs 400).
med
Add/extend automated tests asserting that every error response includes the correlation id and that error.code/type/status mapping is consistent for each required category (401/403/409/422/429 + representative 5xx fault).
- tests/gateway/test_api_server.py:1-35 — There is already test coverage for error handling; extend it to assert correlation id presence and required status-code categories.
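
An envelope that always carries a request id could be sketched as below. The function `error_response`, the `request_id` field, the `X-Request-Id` header, and the status-to-type table are assumptions for illustration; they are not the current shape of `_openai_error`.

```python
import uuid

# Hypothetical mapping from status code to a stable error type string.
STATUS_TO_TYPE = {
    400: "invalid_request_error",
    401: "authentication_error",
    403: "permission_error",
    409: "conflict_error",
    422: "validation_error",
    429: "rate_limit_error",
}


def error_response(status, message, code=None, request_id=None):
    """Return (status, headers, body) with the request id in both places."""
    request_id = request_id or uuid.uuid4().hex
    if status in STATUS_TO_TYPE:
        error_type = STATUS_TO_TYPE[status]
    elif status >= 500:
        error_type = "server_error"
    else:
        error_type = "invalid_request_error"
    body = {
        "error": {
            "message": message,
            "type": error_type,
            "code": code,
            "request_id": request_id,
        }
    }
    headers = {"X-Request-Id": request_id}
    return status, headers, body
```

Routing every error path through one function like this makes the status-to-type mapping testable in a single place, which is what the recommended test extension would assert.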

Sandbox / test mode 0%

No consumer-facing Sandbox / test mode primitive was found. The codebase documents an OpenAI-compatible API server that requires `API_SERVER_KEY` and points at `http://localhost:8642/v1`, but there is no documented sandbox base URL + test credentials + isolated test data for third parties to integrate safely without using production.

high
Add a documented sandbox/test-mode contract for the API server: publish a dedicated sandbox base URL (not localhost), specify an authentication mechanism for test keys (clearly labeled as test-mode), and describe isolation guarantees for sandbox data (e.g., separate response/run/session storage). Update the `api_server.py` docstring or link to a checked-in doc/spec.
- gateway/platforms/api_server.py:1-40 — Current contract only references `http://localhost:8642/v1` and `API_SERVER_KEY`, with no sandbox/test-mode base URL or test credentials.
high
Introduce and document a sandbox configuration surface in the CLI config layer (or an accompanying config doc): e.g., `HERMES_SANDBOX_BASE_URL`, `HERMES_SANDBOX_API_SERVER_KEY`, and any sandbox-only storage namespaces/DB selection, including how test data is cleaned up.
- hermes_cli/config.py:1-20 — Current documentation only covers `~/.hermes/config.yaml` and `~/.hermes/.env` without any consumer sandbox/test-mode keys or isolated test-mode instructions.

Extension points / plugins 94%

This codebase contains a well-defined, documented plugin/extension system (hermes_cli/plugins.py) with a stable PluginContext interface (tools, hooks, platform adapters, skills, auxiliary tasks), a PluginManager that discovers/loads plugins from multiple sources (bundled, user, project, and pip entry points), and a concrete gateway platform registry (gateway/platform_registry.py). A host-owned Plugin LLM facade (agent/plugin_llm.py) further supports safe third-party plugin integration.

high
Audit and document a versioning policy for the extension contracts (PluginContext methods and hook names) and how breaking changes are avoided (e.g., VALID_HOOKS evolution rules, manifest schema versioning in plugin.yaml).
- hermes_cli/plugins.py:1-220 — The extension contract is extensively documented, but this slice does not show an explicit semantic-versioning / compatibility policy for third-party plugin authors.
med
Ensure the gateway platform registration surface is consistently discoverable in docs by linking PluginContext.register_platform and PlatformEntry fields to the referenced developer guide mentioned in the registry comments.
- gateway/platform_registry.py:1-220 — The registry doc references a developer guide contract, but the code slice provided does not include the end-to-end documentation chain for third-party integrators.

Not applicable to this codebase: Versioning & backward compatibility, Per-tenant rate limiting, Outbound events / webhooks.

Integration Depth

Per-system adapters behind one shared interface with bi-directional sync — not per-customer scripts held together with spreadsheets.

43% 8/10 scored

Shared integration abstraction 100%

3/3 expected sites
Per-integration reliability 0%

0/1 expected sites
Sync state & reconciliation 0%

0/2 expected sites not present
Inbound validation & normalization 100%

2/2 expected sites
Per-tenant integration credentials 0%

0/3 expected sites not present
Per-integration observability 0%

0/5 expected sites not present
Connector breadth for the category 100%

2/2 expected sites
Build-vs-buy posture 44%

2/3 expected sites

Shared integration abstraction 100%

This codebase does implement the “Shared integration abstraction” primitive: gateway platform integrations are structured around a shared BasePlatformAdapter interface, with concrete integrations like WebhookAdapter and MSGraphWebhookAdapter inheriting from it. This indicates an architected integration layer rather than N separate bespoke integrations without a common contract.

med
Extend the verification by sampling additional platform adapters (e.g., TelegramAdapter, SlackAdapter, WhatsAppAdapter) and confirm they consistently implement the shared BasePlatformAdapter contract without bespoke direct coupling to gateway internals.
- gateway/platforms/base.py:1-20 — Stated design intent that all platform adapters inherit from BasePlatformAdapter; additional adapter reads would confirm consistency across integrations.

Bidirectional sync N/A

No true “Bidirectional sync” primitive (read + write back to external systems with sync state/cursors and reconciliation semantics) was found. This codebase contains message/webhook receivers (ingress) and message senders (egress) but they do not implement an adapter-level sync workflow that continuously reconciles and writes changes back to the external system as part of a bidirectional data sync.

Metadata-driven mappings N/A

I did not find a clear “metadata-driven mappings” integration/config layer in this codebase—i.e., a runtime service that interprets versioned per-tenant field/entity mapping + transform + validation configuration for external-system integrations. The “schema/transform” parts I found (e.g., Gemini schema sanitization and generic tool-schema sanitizers) are backend-compatibility helpers for LLM tool JSON schemas, not per-tenant, metadata-interpreted mappings from external entities to a canonical internal model.

Per-integration reliability 0%

The codebase implements retry-with-backoff for transient delivery failures (notably in `gateway/platforms/base.py::_send_with_retry`). However, the primitive’s required companion mechanisms—per-integration dead-letter/quarantine parking for events that still fail after all retries, plus alerting/observability for those DLQ’d failures—do not appear to be implemented. Therefore, this primitive is only partially present (retries exist, but undeliverable record handling via DLQ is missing).

high
Add a per-integration dead-letter/quarantine mechanism to `_send_with_retry`: when retries are exhausted, write the undeliverable payload/event (with adapter identity, error, attempt count, and correlation/session metadata) into a DLQ (or persistent store) instead of only notifying the user. Include alerting/metrics for DLQ growth and rate of exhausted retries.
- gateway/platforms/base.py:3180-3345 — Retries are performed in `_send_with_retry`, but the exhaustion path sends a user notice and returns; no DLQ/quarantine write or ops alert is present here.
med
Implement a shared DLQ interface that every platform adapter uses, so that Telegram/Discord/Feishu/etc. all park failures into the same canonical “failed event” model with per-integration labels, enabling consistent ops dashboards and reprocessing workflows.
- gateway/platforms/base.py:3180-3345 — All platform adapters route delivery through the base adapter retry path, making this the appropriate choke point for adding a DLQ abstraction that can still be labeled per integration.
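
A shared dead-letter store at the retry choke point might be sketched as follows. `DeadLetterQueue`, `deliver_or_park`, and the SQLite schema are hypothetical, and the retry loop omits backoff sleeps to keep the example short.

```python
import json
import sqlite3
import time


class DeadLetterQueue:
    """Persistent parking lot for payloads that exhausted their retries."""

    def __init__(self, db_path):
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS dead_letters ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, integration TEXT NOT NULL, "
            "payload TEXT NOT NULL, error TEXT NOT NULL, "
            "attempts INTEGER NOT NULL, parked_at REAL NOT NULL)"
        )
        self._db.commit()

    def park(self, integration, payload, error, attempts):
        self._db.execute(
            "INSERT INTO dead_letters "
            "(integration, payload, error, attempts, parked_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (integration, json.dumps(payload), repr(error), attempts, time.time()),
        )
        self._db.commit()

    def depth_by_integration(self):
        """Per-integration counts, suitable for a metric or alert threshold."""
        rows = self._db.execute(
            "SELECT integration, COUNT(*) FROM dead_letters GROUP BY integration"
        ).fetchall()
        return dict(rows)


def deliver_or_park(send, payload, integration, dlq, max_retries=3):
    """Try send() up to max_retries times; park the payload if all fail."""
    last_error = None
    for _attempt in range(max_retries):
        try:
            send(payload)
            return True
        except ConnectionError as exc:  # stand-in for "transient" errors
            last_error = exc
    dlq.park(integration, payload, last_error, max_retries)
    return False
```

Because each row carries the integration label, one table can feed both a per-adapter alert on `depth_by_integration()` and a later reprocessing job.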

Sync state & reconciliation 0%

No integration “sync state & reconciliation” primitive was found. The codebase uses inbound webhook delivery idempotency via in-memory TTL caches (and duplicate suppression in MS Graph), but it does not persist cursors/watermarks nor perform any drift detection/reconciliation between external systems and internal state (especially across restarts or missed events).

high
Introduce durable per-integration checkpointing (cursor/watermark) and idempotent upserts for inbound event streams where drift repair matters. Persist progress per route/event type and reconcile by re-fetching or compensating for gaps when the checkpoint is stale or missing.
- gateway/platforms/webhook.py:290-344 — Current idempotency is in-memory TTL only; replace/add a persisted cursor/checkpoint for reconciliation.
high
For MS Graph notifications, persist processed receipts/checkpoints and add drift repair logic (e.g., compare expected state from Graph with internal state, then upsert/repair discrepancies). Ensure correctness across restarts.
- gateway/platforms/msgraph_webhook.py:232-292 — Current dedup is in-memory receipt tracking only; add durable checkpoint + reconciliation.
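
Durable checkpointing with idempotent upserts can be sketched as below. The sketch assumes each inbound event carries a monotonically increasing integer sequence number per stream, which is an illustrative simplification (real providers may expose delta tokens or timestamps instead); `CheckpointStore` and its schema are hypothetical.

```python
import sqlite3


class CheckpointStore:
    """Persist processed events and a per-stream high-water mark."""

    def __init__(self, db_path):
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "stream TEXT PRIMARY KEY, last_seq INTEGER NOT NULL)"
        )
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "stream TEXT NOT NULL, seq INTEGER NOT NULL, payload TEXT NOT NULL, "
            "PRIMARY KEY (stream, seq))"
        )
        self._db.commit()

    def last_seq(self, stream):
        row = self._db.execute(
            "SELECT last_seq FROM checkpoints WHERE stream = ?", (stream,)
        ).fetchone()
        return row[0] if row else 0

    def apply(self, stream, seq, payload):
        """Record one event; return True if new, False if a duplicate."""
        with self._db:  # event insert and checkpoint move commit together
            cur = self._db.execute(
                "INSERT OR IGNORE INTO events VALUES (?, ?, ?)",
                (stream, seq, payload),
            )
            is_new = cur.rowcount == 1
            high = max(self.last_seq(stream), seq)
            self._db.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
                (stream, high),
            )
        return is_new

    def missing(self, stream):
        """Sequence numbers below the checkpoint that were never seen."""
        seen = {
            row[0]
            for row in self._db.execute(
                "SELECT seq FROM events WHERE stream = ?", (stream,)
            )
        }
        return [s for s in range(1, self.last_seq(stream) + 1) if s not in seen]
```

After a restart, `missing()` lists the gaps a reconciliation job would need to re-fetch from the external system, and the primary key on `(stream, seq)` makes redelivered events harmless.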

Inbound validation & normalization 100%

The primitive is present and implemented well at the external ingestion boundaries for webhook integrations. Both the generic webhook adapter and the Microsoft Graph webhook adapter perform fail-closed authentication/validation, parse and normalize incoming payloads into internal canonical MessageEvent objects, and deduplicate repeated deliveries/receipts before dispatching agent work.

high
Add/confirm a quarantine mechanism for malformed or invalid inbound records (e.g., store invalid payloads + error reason in a dedicated holding area), rather than only returning HTTP errors or incrementing counters.
- gateway/platforms/webhook.py:650-725 — On invalid signature / parse failure, the handler returns 4xx/5xx responses and skips processing; there is no visible quarantine/holding persistence for bad records.
- gateway/platforms/msgraph_webhook.py:245-360 — On per-item failures (non-dict, resource not accepted, bad clientState), the adapter increments rejection/authRejected counters and continues; there is no visible quarantine persistence for later inspection.
med
Ensure dedup/idempotency behavior is consistent across deployments by validating whether in-memory dedup caches (_seen_deliveries / _seen_receipts) meet your expected persistence requirements (e.g., restarts, horizontal scaling). If not, back dedup with a shared store.
- gateway/platforms/webhook.py:165-190 — _seen_deliveries is an in-memory TTL cache; dedup state may not survive process restarts or scale-out.
- gateway/platforms/msgraph_webhook.py:98-118 — MS Graph receipt dedup uses in-memory sets/deques; same scale/restart caveat applies.
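The quarantine recommendation above can be illustrated with a small sketch. It assumes a JSON Lines file as the holding area; `parse_or_quarantine` and the record fields are hypothetical, and a real adapter would call something like this alongside its existing 4xx response rather than instead of it.

```python
import json
import time
from pathlib import Path
from typing import Any, Optional


def parse_or_quarantine(
    raw: bytes, source: str, quarantine_file: Path
) -> Optional[dict[str, Any]]:
    """Return the parsed JSON object, or persist the bad payload and return None."""
    try:
        payload = json.loads(raw.decode("utf-8"))
        if not isinstance(payload, dict):
            raise ValueError("payload is not a JSON object")
        return payload
    except ValueError as exc:
        # UnicodeDecodeError and json.JSONDecodeError are both ValueError subclasses.
        record = {
            "source": source,
            "received_at": time.time(),
            "reason": str(exc),
            # Hex-encode so undecodable bytes survive for later inspection.
            "raw_hex": raw.hex(),
        }
        with quarantine_file.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return None
```

Keeping the raw bytes and the rejection reason together lets an operator replay or diagnose rejected deliveries instead of relying on counters alone.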

Per-tenant integration credentials 0%

I did not find an implementation of per-tenant integration credentials with tenant-isolated secret-manager storage and token refresh/revocation. The code appears to use user/process-local credential stores (e.g., ~/.hermes/auth.json and ~/.hermes/auth/google_oauth.json) and env/credential-pool resolution rather than tenant-scoped secret isolation.

high
Confirm the product model: if “tenant” exists for external integrations, refactor the credential resolution/refresh layer (hermes_cli/auth.py and provider-specific OAuth modules like agent/google_oauth.py) to use tenant-scoped secret-manager entries (e.g., secret per tenant+provider), including refresh-token rotation and per-tenant revocation.
- hermes_cli/auth.py:1-24 — Auth is persisted to a single local auth.json; this is the main integration credential refresh boundary.
- agent/google_oauth.py:1-20 — Google tokens are stored in a single local JSON file; provider-specific OAuth refresh is not tenant-scoped.
med
Remove/limit shared-secret sources for integration OAuth (shared env vars and shared local credential pool) on the runtime connector path; instead, plumb tenant_id through to the credential resolver so it always selects tenant-scoped credentials.
- hermes_cli/auth.py:560-616 — API-key secrets are resolved from env or a shared credential pool fallback—this is not tenant-isolated secret selection.
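If a tenant model is adopted, tenant-scoped resolution might be sketched as below. Everything here is an assumption for illustration: `InMemorySecretBackend` stands in for a real secret manager, and `secret_key`, `resolve_credential`, and the key layout are hypothetical rather than anything in `hermes_cli/auth.py`.

```python
from typing import Optional


class CredentialNotFound(KeyError):
    """Raised when no credential exists for the given tenant and provider."""


class InMemorySecretBackend:
    """Stand-in for a real secret manager (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._secrets: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._secrets[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._secrets.get(key)

    def delete(self, key: str) -> None:
        self._secrets.pop(key, None)


def secret_key(tenant_id: str, provider: str) -> str:
    """One secret per tenant+provider pair."""
    if not tenant_id or not provider:
        raise ValueError("tenant_id and provider are both required")
    return f"tenants/{tenant_id}/integrations/{provider}"


def resolve_credential(backend: InMemorySecretBackend, tenant_id: str, provider: str) -> str:
    value = backend.get(secret_key(tenant_id, provider))
    if value is None:
        # Fail closed: no fallback to env vars or a shared credential pool.
        raise CredentialNotFound(f"no {provider} credential for tenant {tenant_id}")
    return value


def revoke_credential(backend: InMemorySecretBackend, tenant_id: str, provider: str) -> None:
    """Per-tenant revocation removes only that tenant's entry."""
    backend.delete(secret_key(tenant_id, provider))
```

The important property is the absence of a shared fallback: a missing tenant entry is an error, never a reason to reach for another tenant's or the process's credentials.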

Per-integration observability 0%

I did not find an implementation of per-integration observability (per-connector health, throughput, and failure visibility, with success/failure rates, latency, and last-sync/last-processed timestamps surfaced to ops). There are some basic health counters and health endpoints (e.g., the MS Graph webhook health response returns accepted/duplicates), but they do not provide the per-integration reliability, latency, and last-status telemetry needed to catch broken connectors before customers report them.

high
Add a per-integration metrics/status contract (route_name/platform/provider as the integration key) and implement it across adapters (at minimum: webhook platform routes and MS Graph webhook). Include: total received, success, failure (with error codes), processing latency (p50/p95), and last processed/last success timestamp.
- gateway/platforms/msgraph_webhook.py:150-182 — Current health response is limited to accepted/duplicates and does not include per-integration success/failure rates, latency, or last-processed status.
high
Instrument the webhook hot path to record outcome + latency for each configured route/integration (including idempotent duplicates as a separate outcome). Ensure failures in downstream delivery/agent execution are captured and counted (not just logged).
- gateway/platforms/webhook.py:420-520 — Main webhook receipt path includes idempotency and prompt formatting; it should be extended to record per-route processing outcome and latency.
med
Instrument auxiliary-provider call lifecycle (the point where provider HTTP requests are executed) to emit per-provider success/failure rates and latency and expose it via the existing /health/detailed or an internal status endpoint.
- agent/auxiliary_client.py:3300-3380 — This centralizes provider wrapping/routing; it is a natural place to ensure consistent observability across providers once the actual HTTP call sites are identified/instrumented.
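The metrics contract described above can be sketched in a few lines. This is an illustrative in-process recorder, not the project's API: `IntegrationMetrics`, its outcome names, and the nearest-rank percentile are assumptions, and a production version would more likely use a bounded histogram or an existing metrics library than an unbounded list.

```python
import math
import time
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

OUTCOMES = ("success", "failure", "duplicate")


@dataclass
class IntegrationStats:
    received: int = 0
    success: int = 0
    failure: int = 0
    duplicate: int = 0
    latencies_ms: list[float] = field(default_factory=list)
    last_success_at: Optional[float] = None


class IntegrationMetrics:
    """Per-integration counters keyed by route/platform/provider name (hypothetical)."""

    def __init__(self) -> None:
        self._stats: dict[str, IntegrationStats] = defaultdict(IntegrationStats)

    def record(self, integration: str, outcome: str, latency_ms: float,
               now: Optional[float] = None) -> None:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        stats = self._stats[integration]
        stats.received += 1
        setattr(stats, outcome, getattr(stats, outcome) + 1)
        stats.latencies_ms.append(latency_ms)
        if outcome == "success":
            stats.last_success_at = time.time() if now is None else now

    def snapshot(self, integration: str) -> dict:
        stats = self._stats[integration]
        ordered = sorted(stats.latencies_ms)

        def percentile(p: int) -> Optional[float]:
            if not ordered:
                return None
            # Nearest-rank method: rank = ceil(p/100 * N), 1-based.
            return ordered[max(0, math.ceil(p / 100 * len(ordered)) - 1)]

        attempts = stats.success + stats.failure
        return {
            "received": stats.received,
            "success": stats.success,
            "failure": stats.failure,
            "duplicate": stats.duplicate,
            "success_rate": stats.success / attempts if attempts else None,
            "p50_ms": percentile(50),
            "p95_ms": percentile(95),
            "last_success_at": stats.last_success_at,
        }
```

Counting duplicates as their own outcome keeps them out of the success rate, which matches the recommendation to treat idempotent duplicates separately.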

Connector breadth for the category 100%

Connector breadth for this codebase exists in the form of an explicit connector catalog/discovery mechanism: a central `PlatformRegistry` for platform adapters plus an explicit built-in outbound-delivery platform list in the generic webhook adapter. This supports auditing which target systems are covered (and where gaps may exist) without relying on spaghetti per-connector wiring.

high
Add/confirm a single “connector coverage” report surface (CLI/UI endpoint or structured output) that enumerates all registered platforms (from `PlatformRegistry`) and highlights missing vertical table-stakes targets for the intended market (e.g., identity/CRM/data warehouse if applicable).
- gateway/platform_registry.py:1-220 — The registry already contains the metadata needed for an inventory/coverage report; the remaining gap would be a standardized consumer of that inventory for “breadth” auditing.
med
Ensure webhook delivery breadth is consolidated/automated: reduce drift between `_BUILTIN_DELIVER_PLATFORMS` and plugin registrations by deriving the list (where feasible) from `platform_registry` rather than maintaining a static set.
- gateway/platforms/webhook.py:1-120 — A static built-in delivery-platform set is maintained here; automation would improve breadth correctness/consistency over time.

Build-vs-buy posture 44%

This codebase appears to implement its own first-party integration abstractions for external-system connectors (gateway platform adapters and CLI proxy upstream adapters). There is evidence of deliberate “build” posture via shared adapter interfaces, rather than embedding an external iPaaS-style integration platform. I did not find evidence of embedded third-party integration platforms (e.g., n8n/Zapier/Workato/Nango) from the limited platform-name scan (only a Tailwind merge library showed up).

high
Confirm build-vs-buy at the connector level: enumerate the distinct gateway platform adapters (e.g., slack/telegram/etc.) and verify that they all implement the same BasePlatformAdapter contract rather than drifting into bespoke per-platform code paths without a shared canonical model.
- gateway/platforms/base.py:1-25 — Base adapter is the shared interface; next step is to inspect multiple concrete adapters to ensure consistency and bounded divergence.
med
Check for any runtime reliance on a third-party embedded integration platform by searching the repo for common iPaaS/vendor libraries and SDKs beyond the initial narrow scan (e.g., workato/zapier/n8n/nango/tray/app integrations libraries). If none, document that connector coverage is owned and bounded by adapter contracts.
- gateway/platforms/webhook.py:1-70 — Webhook adapter architecture is a first-party integration boundary; verifying no external iPaaS SDK usage here would strengthen the posture claim.

Not applicable to this codebase: Bidirectional sync, Metadata-driven mappings.

Deployability

CI/CD as code, infrastructure as code, per-environment isolation, and a one-command local boot.

71% 11/11 scored

Reproducible one-command build 0%

0/3 expected sites not present
Automated CI pipeline 80%

4/5 expected sites
Automated deployment (CD) 267%

3/1 expected sites
Infrastructure as code 0%

0/1 expected sites not present
Environment isolation 0%

0/3 expected sites not present
Local/production parity 89%

3/3 expected sites
Config & secrets externalized per env 83%

2/2 expected sites
Decouple deploy from release 0%

0/2 expected sites not present
Reversibility / rollback 67%

2/3 expected sites
Delivery cadence (DORA proxy) 100%

2/2 expected sites
Deploy-tooling ownership 100%

2/2 expected sites

Reproducible one-command build 0%

No clear implementation of a deterministic “one-command build” primitive was found. While the repo provides a one-command curl|bash installer and a contributor setup script, the contributor/bootstrap dependency path includes a non-deterministic fallback (lockfile sync failure/missing lockfile triggers non-hash-verified resolution). The README’s developer instructions also show a multi-step manual ritual, so the specific acquisition gate (clean clone + one command + determinism via pinned dependencies) is not met in a way that scores as correctly applied.

high
Make the clean-clone build+boot path explicitly one command in the root docs (e.g., `./setup-hermes.sh --no-wizard --boot` or similar) and ensure it performs no non-deterministic dependency fallback; fail hard if `uv.lock` cannot be honored.
- README.md:164-178 — README documents one-command install and a separate multi-step contributor path; the primitive needs one command that covers deterministic build+boot from a clean clone.
- setup-hermes.sh:200-260 — The script explicitly falls back to non-hash-verified installs when lockfile syncing fails or the lockfile is absent, breaking determinism.
med
Add a lockfile integrity requirement to the bootstrap: detect `uv.lock` and use `uv sync --locked` only, exiting non-zero if it can’t be applied rather than falling back.
- setup-hermes.sh:200-260 — Fallback behavior explicitly allows unlocked/transitive re-resolution; for reproducibility, replace fallback with a hard failure.
low
Align the “one-command” story by documenting the exact command that a user should run after cloning (not only curl|bash from main), including what “booted” means (e.g., runs `hermes` or starts the gateway).
- README.md:164-178 — Contributor instructions describe multiple steps and a manual test script; document a single local command that results in a running instance.
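The fail-hard behavior recommended above is simple to express. The sketch below uses Python to show the logic only; the real bootstrap is a shell script, and `locked_sync_command` and `bootstrap` are hypothetical names. The only external command it relies on is `uv sync --locked`, which the recommendation itself names.

```python
import subprocess
from pathlib import Path


def locked_sync_command(repo_root: Path) -> list[str]:
    """Build the dependency-sync command, refusing to proceed without a lockfile."""
    if not (repo_root / "uv.lock").is_file():
        raise FileNotFoundError(
            "uv.lock is missing; refusing to resolve dependencies without it"
        )
    # --locked makes uv error out instead of updating a stale lockfile.
    return ["uv", "sync", "--locked"]


def bootstrap(repo_root: Path) -> int:
    """Run the locked sync and return its exit code; there is no unlocked fallback."""
    result = subprocess.run(locked_sync_command(repo_root), cwd=repo_root)
    return result.returncode
```

A wrapper would pass the return value of `bootstrap` straight to the process exit code so that CI and contributors both see a non-zero exit when the lockfile cannot be honored.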

Automated CI pipeline 80%

This codebase has a real automated CI pipeline. A dedicated `.github/workflows/tests.yml` workflow runs on every push to main and every PR targeting main, executing both unit tests and e2e pytest suites automatically. Additionally, `.github/workflows/lint.yml` includes a blocking `ruff check .` job intended to enforce code quality and gate merges.

med
Ensure branch protection / required status checks include the blocking test and lint jobs (e.g., require `Tests/test` and `Lint/ruff-blocking`) so merges are fully gated by CI outcomes.
- .github/workflows/lint.yml:95-144 — The blocking gate exists in CI code (`ruff check .`), but merge-gating correctness ultimately depends on repository branch protection requiring these checks.

Automated deployment (CD) 267%

Automated deployment (CD) exists via `.github/workflows/deploy-site.yml`: it deploys the website/docs to GitHub Pages and triggers a Vercel deploy on published releases. The CD implementation is solid for the site production path, but this audit only found CD wiring for the site/docs deployment (not a broader application-to-production deploy pipeline).

high
If the intent is “full app CD to production”, add (or audit for) a separate, versioned deploy workflow that rolls the running service forward (e.g., to Kubernetes/VM/PaaS) and wire it to the same release published event, including rollback/redeploy steps. Current CD evidence strongly targets the website/docs production path.
- .github/workflows/deploy-site.yml:1-106 — Observed CD pipeline automation for site/docs (GitHub Pages) and a Vercel deploy hook; does not by itself prove application deployment CD.
med
Ensure deploy workflow inputs/conditions match the desired release governance (e.g., require protected environments, enforce concurrency, and document the release-to-prod mapping). The workflow already uses `release: published` and `environment: github-pages`, which is good; extend the same rigor to any additional production targets.
- .github/workflows/deploy-site.yml:20-38 — Shows concurrency grouping and an explicit environment for GitHub Pages deployment.

Infrastructure as code 0%

No Infrastructure-as-Code definitions (IaC/PaaS descriptors) were found in the repository tree (no Terraform/CloudFormation/Pulumi/Helm/k8s/serverless/etc.). The production deployment path appears to be handled via GitHub Actions calling console-integrations (deploy-pages for GitHub Pages and a Vercel deploy hook via a secret) without accompanying versioned IaC that would make the infra reproducible end-to-end.

high
Add versioned IaC for the production deployment targets used here (at minimum: GitHub Pages + Vercel project/webhook configuration), so the same environments can be recreated from a clean checkout without relying on pre-existing console setup. Concretely: introduce Terraform/Pulumi (or equivalent) that declares the Pages site configuration, domains if applicable, and the Vercel linkage needed for releases.
- .github/workflows/deploy-site.yml:1-106 — Production deploys are executed from GitHub Actions using a Vercel deploy hook and deploy-pages actions, but there is no IaC in-repo to reproduce those targets.
med
Connect the IaC to CI/CD by adding a pipeline job that runs `plan` on PRs and `apply` on protected releases/tags, ensuring infra changes are reviewable and drift is detectable.
- .github/workflows/deploy-site.yml:1-106 — Current workflow deploys directly on release publish; infra change management/diffing is not represented as code.

Environment isolation 0%

No evidence of true environment isolation (dev/staging/prod with isolated data/credentials/accounts) was found in code. The codebase appears to support only a single active environment configuration via '~/.hermes/.env' and an optional project env path, plus Bitwarden secret injection that is not clearly stage-scoped.

high
Implement explicit stage selection and stage-scoped env loading in hermes_cli/env_loader.py (e.g. HERMES_ENV={dev|staging|prod}) and support separate env files and/or directories per stage ('.env.dev'/'config.dev.yaml', etc.). Ensure only the selected stage’s values are loaded.
- hermes_cli/env_loader.py:220-260 — Current logic loads '~/.hermes/.env' (and optionally a single project env file) rather than distinct dev/staging/prod configurations.
high
Scope external secrets by stage: require separate Bitwarden project IDs/access tokens (or separate secret namespaces) for dev vs staging vs prod, and wire the stage selection into the Bitwarden config lookup.
- hermes_cli/env_loader.py:260-330 — Secret application is driven by a single 'config.yaml' secrets section and applies credentials without any stage discriminator.
med
Replace/extend .env.example with stage-specific templates (or a documented mechanism to generate them) so production credentials are not accidentally reused in non-prod.
- .env.example:1-6 — The template instructs copying to a single '.env' and does not provide stage-specific examples.
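Stage-scoped loading along the lines suggested above might look like this. It is a sketch under stated assumptions: the `HERMES_ENV` variable and `.env.<stage>` file names follow the recommendation's examples, while the function names, the `dev` default, and the minimal `KEY=VALUE` parser are hypothetical and not taken from `hermes_cli/env_loader.py`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

VALID_STAGES = ("dev", "staging", "prod")


def resolve_stage(environ: Optional[Mapping[str, str]] = None) -> str:
    """Read the stage from HERMES_ENV; defaulting to 'dev' is an assumption."""
    env = os.environ if environ is None else environ
    stage = env.get("HERMES_ENV", "dev")
    if stage not in VALID_STAGES:
        raise ValueError(f"HERMES_ENV must be one of {VALID_STAGES}, got {stage!r}")
    return stage


def parse_env_file(path: Path) -> dict[str, str]:
    """Minimal KEY=VALUE parser; skips blanks, comments, and malformed lines."""
    values: dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_stage_env(hermes_home: Path,
                   environ: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Load only the selected stage's file; other stages' files are never read."""
    stage = resolve_stage(environ)
    path = hermes_home / f".env.{stage}"
    if not path.is_file():
        raise FileNotFoundError(f"no env file for stage {stage!r}: {path}")
    return parse_env_file(path)
```

Failing when the stage file is missing, rather than falling back to a shared `.env`, is what keeps production credentials from leaking into non-production runs.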

Local/production parity 89%

This codebase implements local/production parity via a container-first approach: docker-compose runs the same image built from the repo’s Dockerfile, and the Dockerfile bakes a deterministic runtime (pinned bases + frozen dependency installs + built assets). Additionally, a Nix flake/devShell provides a reproducible local developer environment using lockfile-driven tooling (uv), further reducing drift.

high
Ensure the documentation explicitly recommends one primary parity path for contributors (preferably `docker compose up` using the repo’s Dockerfile) and provides a short ‘local matches prod’ checklist (ports, required env vars, and how to persist ~/.hermes).
- docker-compose.yml:1-77 — The compose file already documents intended usage/security and runs production-style commands; adding explicit onboarding guidance would strengthen adoption of the parity mechanism.
med
Align the Nix devShell Python dependency strategy even more directly with the Dockerfile (e.g., ensure the devShell uses the same lockfiles/uv flags that the Docker build uses, not just uv + hooks).
- nix/devShell.nix:1-42 — Dev shell uses uv and hooks, but parity strength would increase if it is confirmed/extended to mirror the Dockerfile’s exact uv sync behavior and extras.

Config & secrets externalized per env 83%

This codebase clearly externalizes configuration and secrets per environment through a dedicated config layer: it loads secrets/config values from os.environ and/or ~/.hermes/.env (get_env_value / reload_env / save_env_value) and provides an .env.example template for operators. I did not find evidence of env-specific production endpoints/keys being hardcoded in the audited config/bootstrap layer.

high
Audit provider/runtime modules for hardcoded environment-specific endpoints/base URLs or credentials. Specifically, verify that any provider client construction (e.g., OpenAI/Anthropic/OpenRouter/Gemini/etc.) pulls API keys/base URLs from get_env_value()/os.environ or config.yaml rather than embedding production URLs.
- hermes_cli/config.py:5500-5595 — The presence of get_env_value() suggests intended usage; remaining risk is whether other modules bypass it and hardcode endpoints/keys.
med
Confirm there is no separate “fallback” path that hardcodes production defaults for secrets/endpoints when env variables are missing (e.g., base_url defaults should be provider-safe, while secret-like values must never be literals).
- .env.example:1-200 — .env.example documents base URLs/overrides as operator configuration; ensure production-like values are not duplicated as literals elsewhere.

Decouple deploy from release 0%

No implementation of a decouple-deploy-from-release primitive (feature-flag/rollout gating that separates deployment from activation) was found in the audited on-graph code. The only flag-like logic observed is runtime UI mode switching (embedded dashboard chat), not a progressive release/activation mechanism.

high
Introduce a production-grade rollout/feature-flag system (server-controlled) and wire it into branching logic for production-visible behavior changes (routes, new UI modules, and agent/tool execution paths). Ensure flags support percentage/canary rollout and are not permanent; keep them governed/configured in code.
- web/src/App.tsx:1-120 — Top-level production UI composition without evidence of rollout-controlled activation gating in the inspected slice.
- web/src/lib/dashboard-flags.ts:1-16 — Client-injected embedded-chat mode toggle; not a release/rollout governance mechanism.
med
Replace or complement ad-hoc runtime toggles (like `window.__HERMES_DASHBOARD_*`) with a shared rollout configuration layer that can be switched server-side (and supports canary/percentage), so deployments don’t automatically activate new functionality for all users.
- web/src/lib/dashboard-flags.ts:1-16 — Shows the current toggle approach is client-injected and not tied to progressive rollout.
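Percentage and canary rollout, as recommended above, usually rests on deterministic bucketing. The following is a minimal server-side sketch of that technique; `rollout_bucket` and `is_enabled` are hypothetical names, and where the percentage is stored and served from is left open.

```python
import hashlib


def rollout_bucket(flag: str, subject_id: str) -> int:
    """Map a (flag, subject) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{subject_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, subject_id: str, percentage: int) -> bool:
    """Enable the flag for roughly `percentage` percent of subjects."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    # A subject's bucket is fixed, so raising the percentage only ever adds
    # subjects; nobody who already has the feature loses it mid-rollout.
    return rollout_bucket(flag, subject_id) < percentage
```

Including the flag name in the hash keeps different flags from always selecting the same early cohort.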

Reversibility / rollback 67%

Reversibility/rollback is implemented for curator-driven changes via an explicit snapshot/restore system in `agent/curator_backup.py`, exposed through the `hermes curator rollback` CLI in `hermes_cli/curator.py`. Rollback is designed to be safe and undoable by taking a pre-rollback snapshot, performing defensive extraction, and reconciling cron job skill references while preserving unrelated live cron scheduling state. Codex runtime plugin migration writes managed sections but does not show a corresponding rollback/undo mechanism in the audited code slices.

high
Add a rollback/undo path for `hermes codex-runtime migrate` that preserves user config safety (e.g., snapshot `~/.codex/config.toml` managed section before write, and provide `hermes codex-runtime rollback` to restore the previous managed block).
- hermes_cli/codex_runtime_plugin_migration.py:520-700 — The migration code handles idempotent regeneration and TOML managed-block manipulation, but the audited slice does not provide any persistent undo/rollback mechanism for the config write.
med
For Codex migration rollback readiness, add automated tests that validate: (1) rollback restores a previously existing user-managed section, (2) rollback does not delete unrelated user TOML content outside the managed block, and (3) rollback works after partial/corrupted writes (defensive behavior).
- hermes_cli/codex_runtime_plugin_migration.py:520-700 — Managed-block rendering/insertion is sophisticated; adding rollback tests would align this feature with the strong rollback guarantees already present for curator snapshots.
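The snapshot-and-restore idea for a managed config block can be sketched as plain string handling. The marker strings and function names below are hypothetical and are not the markers the migration code actually writes; the sketch only shows how a rollback can restore the managed block while leaving surrounding user content alone.

```python
from typing import Optional

# Hypothetical markers, chosen for illustration only.
BEGIN = "# >>> hermes managed (hypothetical marker) >>>"
END = "# <<< hermes managed (hypothetical marker) <<<"


def split_managed(text: str) -> tuple[str, Optional[str], str]:
    """Split text into (before, managed_body, after); body is None if no block exists."""
    start = text.find(BEGIN)
    if start == -1:
        return text, None, ""
    end = text.find(END, start + len(BEGIN))
    if end == -1:
        # Defensive: refuse to guess where a truncated block ends.
        raise ValueError("managed block is not terminated")
    body = text[start + len(BEGIN):end]
    return text[:start], body, text[end + len(END):]


def write_managed(text: str, body: Optional[str]) -> str:
    """Replace the managed block with `body`, or remove it when body is None."""
    before, _, after = split_managed(text)
    if body is None:
        return before + after
    return f"{before}{BEGIN}{body}{END}{after}"


def snapshot_managed(text: str) -> Optional[str]:
    """Capture the current managed body before a migration overwrites it."""
    return split_managed(text)[1]
```

A migration would persist the result of `snapshot_managed` before writing, and a rollback command would pass it back to `write_managed` against the file's current contents, so edits made outside the block after migration are preserved.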

Delivery cadence (DORA proxy) 100%

Delivery cadence appears present: git history shows frequent main commits/merges and regular release tagging. On-graph, the repo also automates delivery artifacts and site updates via GitHub Actions (docker image builds/publishes on main pushes for relevant paths and on releases; site deploys on release publish and on main pushes affecting website/skills). No single click-op deploy wiring was found in the audited workflows.

high
Ensure these delivery workflows are also triggered for broader main changes where appropriate (review whether the path filters are too restrictive), so that cadence remains small-batch across more PR types.
- .github/workflows/docker-publish.yml:1-35 — The docker publish pipeline is gated by `paths` filters under the main-branch push trigger; validate this coverage matches the team’s definition of “production-delivered” changes.
- .github/workflows/deploy-site.yml:1-36 — The site deploy trigger is gated to `website/**` and `skills/**` paths; confirm that other production-affecting content changes also flow into this or another CD workflow.

Deploy-tooling ownership 100%

Deploy and infrastructure tooling lives in versioned CI workflows rather than click-ops: production-adjacent delivery is implemented in .github/workflows/deploy-site.yml, and the release-gate CI is in .github/workflows/tests.yml. Off-graph git authorship evidence indicates that no single author dominates the deploy/infra tooling (42 authors; top author share ~0.178), so the single-engineer CI/CD time-bomb risk is mitigated.

low
Maintain shared ownership by continuing to require reviews on workflow changes (branch protection) and encouraging multiple contributors to touch CI/deploy workflows, especially deploy-site.yml.
- .github/workflows/deploy-site.yml:1-106 — Primary delivery workflow surface; ensuring review/ownership distribution here preserves the current good ownership signal.

T3 Exit Cleanliness

Engineering Org Resilience

No single-author critical paths: git-blame concentration, CODEOWNERS coverage, and reviewer diversity across the codebase.

64% 7/10 scored

Critical-path bus factor 100%

3/3 expected sites
Ownership clarity 0%

0/1 expected sites not present
Documentation density ("why") 89%

3/3 expected sites
Operational runbooks 0%

0/4 expected sites not present
Onboarding reproducibility 100%

4/4 expected sites
Tests as executable knowledge 92%

4/4 expected sites
Decision history legibility 67%

4/4 expected sites

Critical-path bus factor 100%

For the critical-path areas (apps/agent/gateway/plugins/skills/tools/hermes_cli), git-history bus-factor signals show knowledge is distributed: each critical directory has many distinct authors and the top author does not dominate (i.e., no bus-factor-1 gravity well). I verified key critical modules (Kanban DB coordination, Kanban CLI surface, and gateway runtime status helpers) as concrete representative sites, and they align with the distributed bus-factor picture.

med
Add/strengthen co-ownership durability artifacts for the most operationally critical modules (at minimum: an ownership manifest and a short gateway-status/Kanban-DB runbook covering failure modes and recovery steps). This reduces reliance on implicit knowledge even if history remains healthy.
- gateway/status.py:1-220 — Gateway operational gating helpers live here (PID detection, lock/runtime status). A runbook/ownership record would make the failure-mode recovery process explicit.
- hermes_cli/kanban_db.py:1-80 — Kanban DB implements concurrency strategy (WAL/BEGIN IMMEDIATE/CAS) and shared coordination semantics. A short recovery-oriented guide would complement distributed code authorship.

Single-author hotspots N/A

The repository does not exhibit the single-author hotspot anti-pattern: the 12-month high-churn hotspots analysis returned no danger-zone files (no file with both high commit frequency and only one/two lifetime authors). Therefore, there are no concrete hotspot sites to record as found (and no should-be sites because nothing currently matches the primitive’s threat condition).

low
No immediate action required for this primitive. Continue monitoring hotspots periodically (e.g., quarterly) to catch future ownership drift in high-churn areas.
- GIT_HISTORY:N/A — hotspots mode returned `danger_files: []`.

Review diversity N/A

This repo shows evidence of a PR-based integration process (non-zero PR-referenced share) and multiple human integrators (distinct_mergers_human=15), so review context is present and not strictly centralized. However, pr_referenced_share is relatively low (0.295), suggesting a substantial portion of work still lands outside the PR review path; additionally, reviewed-by trailer signals are very low (1), and there is significant bot involvement in merges (top_merger GitHub bot). Overall: review diversity exists, but is not consistently strong.

high
Increase the share of core-behavior changes that land through PRs: enforce branch protection / required PRs for mainline (or for specific paths) so that pr_referenced_share rises toward, ideally, a majority of merges. This is the most direct lever for spreading review context.
- GIT_HISTORY:N/A — pr_referenced_share=0.295 indicates PRs are not the dominant merge path.
high
Reduce single-path “gravity well” risk by ensuring merges are not effectively mediated by bots/single integrators for critical areas: require a human maintainer review before merge and ensure merges are performed by a rotating set of maintainers for core modules.
- GIT_HISTORY:N/A — Top mergers show GitHub bot accounts for 604 merges; effective human review-diversity can be reduced even when distinct_mergers_human is high.
med
Standardize PR collaboration signals: encourage use of reviewed-by/co-authored-by trailers (or equivalent team conventions) so the collaboration/review process is reflected in commit/merge metadata and is easier to audit over time.
- GIT_HISTORY:N/A — reviewed_by_trailers=1 is very low, suggesting review attribution/tracking may not be consistently captured.

Ownership clarity 0%

Ownership clarity (an explicit ownership manifest covering critical paths) is absent. The repository’s org-knowledge artifacts include docs/onboarding but no ownership/CODEOWNERS/OWNERS manifest category, and the on-repo development guide does not define a “who owns what” mapping for the critical subsystems it enumerates.

high
Add an ownership manifest for the enumerated critical subsystems in AGENTS.md (at minimum: agent/, hermes_cli/, tools/, gateway/, plugins/, ui-tui/, cron/, and tests/). Include 2+ human owners per area (not bots), and keep it current.
- AGENTS.md:1-120 — AGENTS.md enumerates the critical subsystem boundaries that should have explicit owners. This is the natural anchor point for creating an ownership manifest aligned to real change areas.
med
Cross-check the named owners against git history for each critical subsystem and adjust owner lists until no single person is a near-single-author gravity well for that area.
- AGENTS.md:1-120 — The critical areas listed here should be used as the scope for the history cross-check; the guide defines the canonical subsystem set.
low
Add a short “How to use owners” section to the onboarding/docs so contributors know where to look and how to request ownership changes (without blaming individuals).
- AGENTS.md:1-120 — Onboarding guidance is already present; adding a small owners-lookup instruction would make the manifest operational rather than ceremonial.

Retained vs. departed knowledge N/A

This primitive is not implemented/represented as an explicit retention/transfer mechanism anywhere in the codebase. While git-history signals indicate very low overall “departed authorship share” (0.002) and only one clearly departed email in recency mode, there is no in-repo ownership/operational documentation structure to confirm that critical-path context is retained by current staff (e.g., no ownership manifest / no runbooks / no ADRs).

high
Add an ownership manifest (e.g., CODEOWNERS or an OWNERS file) for critical areas (agent runner, gateway, CLI, core adapters). Ensure owners are the same people who actively maintain these modules (not just historical authors).
- CONTRIBUTING.md:1-40 — Current docs focus on contribution priorities and setup, not on durable ownership/retention for critical components—so it would need to be complemented by an explicit ownership manifest.
high
Create operational runbooks for each critical runtime component (gateway runner, CLI entry points, cron/schedulers, and provider integrations). These should include restart/diagnosis steps and “what not to change” guidance to prevent reliance on an individual’s memory.
- CONTRIBUTING.md:1-120 — No operational runbook structure is described in contribution/onboarding materials, which is where knowledge-retention expectations typically get anchored.
med
Add decision records (ADRs) for major architectural choices that affect critical workflows (tool calling loop, gateway lifecycle, memory provider architecture boundaries).
- CONTRIBUTING.md:1-120 — Contribution guidance does not reference ADRs/decision history as a required artifact for architecture-critical changes.

Documentation density ("why") 89%

This codebase has durable architecture “why” documentation in the docs and website docs: the multi-gateway kanban deployment guide explains the operational/concurrency rationale; the Docker network egress guide explains the threat model and architectural boundary; and the ACP internals guide documents lifecycle and key bridging/intent-heavy decisions. ADR/runbook/ownership artifacts are absent per org artifacts scan, but this audit is limited to the presence/quality of “why” documentation where it is required.

high
Add missing decision/operational durability artifacts: introduce an ADR process (e.g., docs/adr) for architecture decisions with explicit rationale, and add runbooks for critical services so the operational “why” doesn’t live only in heads.
- N/A (git_org_signals artifacts scan):N/A — The artifacts scan reports three absent categories: adr (0), runbook (0), and ownership (0). These counts cannot be verified through code_read line citations, but they indicate that durable rationale artifacts are missing for key maintenance workflows.
med
For each critical integration surface (e.g., adapter/event bridge layers like ACP), ensure the docs consistently include: (1) threat/intent boundary, (2) why design constraints exist, and (3) what invariants must not be broken. Use ACP Internals as the model and replicate its structure.
- website/docs/developer-guide/acp-internals.md:1-85 — ACP Internals already contains architecture/rationale elements (component responsibilities, bridge concerns, lifecycle). This is a template to standardize on for other adapters/surfaces.

Operational runbooks 0%

Operational runbooks are not present anywhere in the repository’s tracked org-doc artifacts: the `runbook` category is absent. While the codebase has multiple operationally critical entrypoints (gateway runner, cron scheduler, agent runner, and operator diagnostics), there are no corresponding written procedures for deploy/incident/recovery that would mitigate knowledge concentration and reduce reliance on a single “just knows” operator.

high
Create runbooks for each critical operational service/entrypoint: (1) gateway/run.py, (2) cron/scheduler.py, (3) run_agent.py. Each runbook should include: deploy checklist, common incident playbooks (symptom → checks → mitigation), and recovery steps (restart/rollback guidance, what state is affected, verification steps).
- gateway/run.py:1-25 — Gateway runner is a long-running operational daemon; requires deploy/incident/recovery runbook coverage.
- cron/scheduler.py:1-15 — Cron scheduler is a continuously running job executor with locking; requires operational playbooks.
- run_agent.py:1-25 — Agent runner is the core execution engine; needs operational procedures for failure recovery.
med
Add an operator runbook for hermes_cli/doctor.py that documents when to run it during incidents/setup failures, how to interpret the diagnostic output, and the exact follow-up actions (including configuration/environment checks).
- hermes_cli/doctor.py:1-15 — Doctor command is an operational diagnostic entrypoint; should be documented as part of recovery workflows.
high
Add an ownership manifest (CODEOWNERS/ownership doc) for these operational entrypoints and ensure the runbooks list at least 2 co-owners per critical service to reduce the gravity-well risk.
- N/A (org-doc artifacts index):N/A — Artifacts scan also shows `ownership` and `adr` categories are absent; adding ownership complements runbooks to mitigate knowledge concentration.
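
The co-owner recommendation above can be made machine-checkable. The following is a minimal sketch, assuming a GitHub-style CODEOWNERS file and exact path entries; the helper names and the two-owner threshold are illustrative and not part of the repository. It does not expand glob patterns, so a real check would need proper pattern matching.

```python
from typing import Dict, List

# Paths taken from the audit's evidence list above.
CRITICAL_PATHS = [
    "gateway/run.py",
    "cron/scheduler.py",
    "run_agent.py",
    "hermes_cli/doctor.py",
]


def parse_codeowners(text: str) -> Dict[str, List[str]]:
    """Map each CODEOWNERS pattern to its list of owners, skipping comments."""
    owners: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        pattern, *names = line.split()
        owners[pattern.lstrip("/")] = names
    return owners


def under_owned(text: str, paths=CRITICAL_PATHS, minimum: int = 2) -> List[str]:
    """Return critical paths with fewer than `minimum` owners (exact match only)."""
    owners = parse_codeowners(text)
    return [p for p in paths if len(owners.get(p, [])) < minimum]
```

A CI job could fail whenever `under_owned(...)` returns a non-empty list.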

Onboarding reproducibility 100%

Onboarding reproducibility is implemented well: the repo has multiple written paths (public Quickstart/Installation docs plus CONTRIBUTING) that describe clean-clone-to-productive steps, including a one-command install bootstrap. These docs are reinforced by the actual one-command installer script. No evidence was found of onboarding being purely tribal for the core “get running and verify chat” workflow.

high
Audit whether the docs’ “one-line installer” and the subsequent verification steps are fully deterministic for all supported targets (Linux/macOS/WSL/Termux/Windows) by running the exact documented commands on a clean machine profile; capture any missing flags/required env vars into the docs as a numbered checklist.
- website/docs/getting-started/installation.md:1-120 — This doc is the canonical onboarding bootstrap contract; it should be validated end-to-end from a clean environment.
- website/docs/getting-started/quickstart.md:1-85 — This doc specifies the shortest path and verification expectations; ensure each step is complete and executable as written.
med
Add a single “Developer fast path” section to the onboarding docs (separate from CONTRIBUTING) that references `scripts/run_tests.sh` and `hermes doctor` as the two primary verification gates after setup, to reduce fragmentation across doc locations.
- CONTRIBUTING.md:1-120 — CONTRIBUTING already contains the needed commands (`hermes doctor`, `hermes chat`, and `scripts/run_tests.sh`), but consolidating into a primary onboarding entry could improve discoverability and reduce time-to-productivity.

Tests as executable knowledge 92%

The primitive is clearly present: the repository contains extensive, behavior-focused test suites (especially under tests/run_agent, tests/gateway, and tests/agent for the error-classifier logic). Tests document intended behavior with concrete assertions, mock external dependencies, and cover critical correctness properties such as recovery-classification mechanics, API/store semantics, and TUI gateway context, serialization, and privacy behavior.

high
Add/expand executable tests that directly pin AIAgent’s end-to-end conversation loop invariants (e.g., session transitions and retry/rotation outcomes) beyond helper/heuristic checks, so refactors of the orchestration surface are protected by intent captured in tests.
- run_agent.py:250-520 — AIAgent is a central orchestration surface; ensure tests cover the core loop/session transition behaviors, not only helper heuristics.
med
Strengthen classifier tests to cover each disambiguation/branching path (especially 402 billing vs rate-limit and other ambiguous pattern families) as executable examples, not just extraction helpers.
- agent/error_classifier.py:1-260 — The classifier drives recovery strategy; it includes ambiguous-pattern logic that should be pinned with explicit tests.
- tests/agent/test_error_classifier.py:1-220 — Current slice covers extraction/invariants; expand toward the actual classification pipeline branches.
low
For each gateway platform test that validates normalization/connectivity (e.g., Feishu), add one or two assertions that link normalized output to downstream adapter expectations (how Hermes consumes the normalized payload), to make integration tests more behavior-end-to-end.
- tests/gateway/test_feishu.py:1-220 — Current Feishu tests validate normalization/config and some adapter setup behavior; add downstream-consumption assertions for stronger executable knowledge.

Decision history legibility 67%

Overall commit history appears decision-legible (high explanatory body share; no evidence of a WIP/fix no-body wall). However, there are no ADRs, runbooks, or ownership manifests in the repo’s tracked org-document artifacts, so durable decision records are missing. The code compensates with strong inline rationale (docstrings/comments) around key ACP integration decisions, so the “why” is largely recoverable from source, but not via separate decision artifacts.

high
Add ADRs for the main cross-protocol/integration decisions (at minimum: Hermes todo→ACP plan mapping, tool-progress callback concurrency/id tracking, ACP approval semantics mapping, and logging/noise suppression policy). Ensure ADRs reference the exact functions/modules that implement the decisions.
- acp_adapter/events.py:32-62 — Architectural choice: mapping todo state into ACP plan updates with explicit rationale; ideal ADR anchor.
med
Create a lightweight “decision record” checklist for PRs touching integration surfaces (ACP adapter, permissions, tool progress callbacks) requiring either (a) a referenced ADR ID or (b) a commit body paragraph summarizing the decision and tradeoffs.
- acp_adapter/events.py:64-157 — Tool-progress callback behavior is non-obvious (concurrency/id tracking); easy to regress without a decision summary.
low
For ops/logging decisions (like benign probe traceback suppression), add a short ADR or “operational decision note” explaining acceptance criteria (what to suppress vs not) to reduce future guesswork.
- acp_adapter/entry.py:14-55 — Benign-probe filter rationale is present inline; externalizing it improves durability across refactors/rewrites.

Not applicable to this codebase: Single-author hotspots, Review diversity, Retained vs. departed knowledge.

IP & OSS License Hygiene

An SBOM in CI, no AGPL/GPLv3 in the dependency tree, CVEs triaged by severity, and no outside-contributor commits without IP assignment.

33% 9/12 scored

License compliance 0%

0/3 expected sites not present
Known-vulnerability scan 0%

0/5 expected sites not present
Known-exploited CVEs 0%

0/4 expected sites
Dependency freshness 67%

3/4 expected sites
Upstream maintenance 0%

0/1 expected sites
Remediation velocity 100%

1/1 expected sites
Supply-chain integrity 67%

2/3 expected sites
Dependency-confusion resistance 67%

3/4 expected sites
IP ownership / provenance 0%

0/1 expected sites not present

Software bill of materials N/A

SBOM hygiene is not implemented as a concrete, verifiable primitive in this codebase (no SBOM-generation tooling or CI/release wiring referencing syft/cyclonedx/SPDX-style outputs was found). The repository does have committed dependency lockfiles (package-lock.json, uv.lock) and dependabot configuration for GitHub Actions, but there is no evidence of an automated SBOM generation step being produced and kept current as part of release/CI.

high
Add an SBOM generation job to the main CI/release workflow(s) that runs after dependencies are installed and before packaging/release. Use a concrete tool (e.g., Syft for Linux/CLI, or CycloneDX/SPDX generators) and emit an artifact (e.g., sbom.spdx.json or sbom.cdx.json). Ensure it includes transitive dependencies and is consistent across the repo’s ecosystems (npm + uv/PyPI).
- package-lock.json:1-20 — Pinned npm dependencies exist, so SBOM generation in CI is feasible and should be wired to the lockfile.
- uv.lock:1-20 — Pinned Python dependencies exist, so SBOM generation in CI is feasible and should be wired to the lockfile.
high
Publish the SBOM as a build/release artifact and (optionally) attach it to releases. Also add a CI check that fails if the SBOM generation step cannot run (or produces an empty output).
- package-lock.json:1-20 — Lockfile presence indicates the expected source of truth for SBOM content; the missing piece is the release/CI emission + verification.
med
Add a periodic SBOM comparison/accuracy check: generate SBOM in CI and ensure it matches (or is a strict subset/superset of) the dependency inventory derived from the lockfiles (ground truth).
- uv.lock:1-20 — Because uv.lock is committed, the repo can compute an authoritative resolved dependency inventory; SBOM accuracy checks should be tied to it.
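
The accuracy check described above can be sketched as a set comparison between the lockfile and the SBOM. This example assumes an npm lockfile (lockfileVersion 2 or 3, with a top-level `packages` map) and a CycloneDX JSON document with a `components` array; the function names are hypothetical. The same idea would apply to uv.lock with a TOML parser.

```python
from typing import Set, Tuple

Inventory = Set[Tuple[str, str]]


def lockfile_inventory(lock: dict) -> Inventory:
    """(name, version) pairs from an npm lockfile's `packages` map."""
    found: Inventory = set()
    for path, meta in lock.get("packages", {}).items():
        # Skip the root project ("") and workspace entries outside node_modules.
        if "node_modules/" not in path or "version" not in meta:
            continue
        name = path.rsplit("node_modules/", 1)[-1]
        found.add((name, meta["version"]))
    return found


def sbom_inventory(sbom: dict) -> Inventory:
    """(name, version) pairs from a CycloneDX JSON document."""
    found: Inventory = set()
    for comp in sbom.get("components", []):
        group = comp.get("group")
        name = f"{group}/{comp['name']}" if group else comp["name"]
        found.add((name, comp.get("version", "")))
    return found


def sbom_drift(lock: dict, sbom: dict) -> dict:
    """Packages present on only one side; two empty lists mean they agree."""
    locked, listed = lockfile_inventory(lock), sbom_inventory(sbom)
    return {
        "missing_from_sbom": sorted(locked - listed),
        "unknown_to_lockfile": sorted(listed - locked),
    }
```

Treating any non-empty list as a CI failure keeps the SBOM tied to the lockfile as ground truth.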

License compliance 0%

I did not find an explicit, code-enforced “license compliance” primitive for transitive dependency licensing/NOTICE obligations (e.g., SBOM + license scanning + gating on strong/network-copyleft or unknown-tier licenses; and collection/packaging of dependency NOTICE/licenses). The repo does have a dependency-update mechanism (Dependabot for GitHub Actions only) and pinned dependency lockfiles (uv.lock), but those do not constitute license compliance enforcement by themselves.

high
Add CI automation that generates an SBOM for *all* included lockfiles (at least uv.lock + package-lock.json variants) and runs a transitive license scan; fail the build (or require manual legal approval) if any strong-copyleft or network-copyleft (AGPL/SSPL) dependency is present, or if licenses are unresolved/unknown.
- .github/dependabot.yml:1-45 — Current automation is scoped to GitHub Actions only and explicitly not for source dependencies; no license-compliance gating is evidenced.
high
Ensure NOTICE/LICENSE attribution obligations are satisfied: collect dependency license texts/NOTICE files (from scan output or package metadata) into a standardized third-party notices location and verify it is kept current with lockfile updates.
- package.json:1-35 — Project license declaration exists, but there is no evidence of automated attribution/NOTICE handling for transitive dependencies.
med
Document the re-pricing rule for network-copyleft and add an explicit exception process (legal sign-off) if any copyleft licenses are introduced—do not rely on usage/reachability to mitigate license risk.
- .github/dependabot.yml:1-45 — Repo already has a security/pinning posture description; extend it to cover legal licensing obligations and the failure/exception workflow.
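
As an illustration of the gating rule above, the sketch below takes a map of package names to SPDX identifiers (as a license scanner might report them) and returns the entries that should block a build. The blocked-prefix list is an assumed policy, not one found in the repository.

```python
from typing import Dict, List, Optional

# Assumed policy: SPDX identifiers starting with these prefixes are blocked.
BLOCKED_PREFIXES = ("AGPL-", "SSPL-", "GPL-")


def license_violations(licenses: Dict[str, Optional[str]]) -> List[str]:
    """Return human-readable violations for a {package: SPDX id or None} map."""
    problems = []
    for package, spdx in sorted(licenses.items()):
        if not spdx or spdx.upper() in {"UNKNOWN", "NOASSERTION"}:
            problems.append(f"{package}: license unresolved")
        elif spdx.upper().startswith(BLOCKED_PREFIXES):
            problems.append(f"{package}: {spdx} is blocked by policy")
    return problems
```

Unresolved licenses are reported alongside blocked ones, so an unknown license cannot pass silently.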

Known-vulnerability scan 0%

A known-vulnerability scan over lockfiles (OSV/CVE findings with HIGH/CRITICAL triage) is not found as a primitive in this codebase. While there is an OSV-based malware check (MAL-* advisories) for MCP extension packages, it is not the required lockfile vulnerability scan practice (and it is fail-open on network errors). The repo does contain multiple committed lockfiles (npm and uv), which are the natural sites where the known-vulnerability scan should be wired into CI, but no such implementation was located.

high
Add a CI job that runs osv-scanner (or equivalent) in *vulns* mode over every committed lockfile in the repo (root package-lock.json, uv.lock, and each subproject package-lock.json). Configure it to fail the build if there are any untriaged HIGH/CRITICAL findings, and require per-finding remediation or explicit documented exceptions.
- package-lock.json:1-120 — Root npm lockfile exists; should be scanned in CI for pinned dependency vulnerabilities.
- uv.lock:1-120 — Python uv lockfile exists; should be scanned in CI for pinned dependency vulnerabilities.
- scripts/whatsapp-bridge/package-lock.json:1-80 — Subproject npm lockfile exists; should be included in CI scanning.
- website/package-lock.json:1-80 — Website npm lockfile exists; should be included in CI scanning.
med
Ensure results are triaged and prioritized using reachability/context (where possible): for each HIGH/CRITICAL finding, record whether the vulnerable code path is actually used (or at least whether the vulnerable package is required in the runtime bundle).
- tools/osv_check.py:1-156 — Current OSV usage is for MAL-* malware blocking only; it does not implement the required HIGH/CRITICAL triage over dependency vulnerabilities.
med
Avoid fail-open behavior for the known-vulnerability scan primitive: unlike the current MAL-* check (network errors allow proceeding), the dependency vulnerability scan should have a deterministic outcome (e.g., retries with caching, or fail closed with degraded-mode reporting).
- tools/osv_check.py:1-60 — Explicitly documented fail-open behavior on OSV network errors; this pattern should not be reused for required lockfile vulnerability scans.
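
A fail-closed wrapper with retries might look like the sketch below. The `query` callable stands in for whatever performs the vulnerability lookup; it and `ScanUnavailable` are hypothetical names, and this is not how tools/osv_check.py is implemented.

```python
import time
from typing import Callable, List


class ScanUnavailable(RuntimeError):
    """Raised when the scanner could not produce a result after all retries."""


def scan_fail_closed(
    query: Callable[[], List[dict]],
    attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Call `query` up to `attempts` times; raise instead of reporting 'clean'."""
    last_error = None
    for attempt in range(attempts):
        try:
            return query()
        except OSError as exc:  # connection and timeout errors subclass OSError
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))  # exponential backoff
    raise ScanUnavailable(f"scan failed after {attempts} attempts") from last_error
```

The caller must handle `ScanUnavailable` explicitly, which makes "scan did not run" distinguishable from "scan found nothing".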

Known-exploited CVEs 0%

The repo includes OSV-based security tooling (tools/osv_check.py blocks MAL-* advisories; hermes_cli/security_audit.py queries OSV for vulnerabilities), but no evidence was found that it specifically checks for the OSV 'known-exploited CVEs' set. Additionally, .github/workflows entries were not found via code search, so an automated gating primitive for known-exploited CVEs could be missing or located outside the searched workflow paths.

high
Add a dedicated 'known-exploited CVEs' scan step that fails CI/release if any pinned dependency matches the OSV known-exploited set (not just MAL-*). Implement it in a script (or extend hermes_cli/security_audit.py) to explicitly detect those IDs/aliases from OSV results.
- tools/osv_check.py:1-156 — Currently queries OSV and filters only MAL-*; this is the wrong target set for this primitive.
- hermes_cli/security_audit.py:1-520 — Currently returns OSV findings but lacks explicit known-exploited identification/filtering in the inspected code.
high
Ensure the repository actually runs the primitive in automation. Confirm/introduce a CI workflow (or equivalent) under .github/workflows or another CI entrypoint that invokes the known-exploited check on every PR and blocks merges.
- N/A (virgil_query search):N/A — virgil_query search for .github/workflows returned no matches; this suggests either missing workflows or paths not captured by the code-graph search.
med
If OSV network failures are possible, avoid fail-open for this primitive. For known-exploited CVEs, treat scan errors as 'unknown' and fail-closed (or require a verified offline SBOM/CVE DB snapshot) to prevent silent exposure.
- tools/osv_check.py:1-52 — Documentation and exception handling explicitly 'Fail-open: network errors allow the package to proceed.'
low
Unify the two OSV utilities: make tools/osv_check.py a thin wrapper over the richer hermes_cli/security_audit.py (or vice versa) so the repo has one consistent definition of security gating categories (including known-exploited CVEs).
- tools/osv_check.py:1-156 — Separate OSV querying path dedicated to MAL-*.
- hermes_cli/security_audit.py:1-520 — Separate OSV audit path for OSV batch querying and rendering.
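
The known-exploited filter recommended above could be sketched as follows. It assumes findings shaped like OSV vulnerability records (an `id` plus an optional `aliases` list) and a set of exploited identifiers supplied by the caller, for example loaded from a catalog such as CISA KEV. The function name and the sample identifiers are illustrative.

```python
from typing import Iterable, List, Set


def known_exploited(findings: Iterable[dict], exploited_ids: Set[str]) -> List[str]:
    """Return IDs of findings whose `id` or any alias is in `exploited_ids`."""
    hits = []
    for vuln in findings:
        names = {vuln.get("id", ""), *vuln.get("aliases", [])}
        if names & exploited_ids:
            hits.append(vuln.get("id", "<unknown>"))
    return hits
```

Matching on aliases matters because an advisory is often keyed by a GHSA or ecosystem ID while the exploited list is keyed by CVE.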

Dependency usage & reachability N/A

N/A for this audit run: the required virgil_query primitive surface (template/row/table) for `dependency_usage_reachability` is not present in the tool-backed code graph in this environment, so I cannot enumerate declared-but-never-imported vs phantom deps or confirm vulnerable API reachability via the prescribed reachability queries.

high
Re-run with a virgil-cli configuration/version that supports the `dependency_usage_reachability` template/row (or confirm the correct template/table name for this deployment). As a fallback, provide the exact virgil_query SQL schema/expected tables that implement: (1) declared-but-never-imported, (2) imported-but-undeclared, (3) call_site reachability by receiver.
- N/A:N/A — Tooling errors observed: `unknown template 'dependency_usage_reachability'` and `Table with name dependency_usage_reachability does not exist`. These indicate the primitive’s on-graph machinery is missing/unavailable.
med
Once the on-graph primitive is available, I will: (a) cross-check npm/Python manifests vs `raw_import` for unused/phantom dependency issues, and (b) join `raw_import` to `call_site` to determine whether CVE-flagged vulnerable APIs are actually reached, then anchor every site to the exact manifest/lockfile lines and code call sites.
- package.json:1-35 — This repo uses npm workspaces and declares dependencies (e.g., `@streamdown/math`, `agent-browser`). This is the manifest input that would be compared against on-graph imports/call sites once the reachability queries are available.

Dependency freshness 67%

Dependency freshness controls appear implemented via committed lockfiles for both Node (package-lock.json) and Python (uv.lock), providing deterministic pinning of resolved transitive dependencies. Additionally, Dependabot is set up to keep GitHub Actions fresh (weekly) and to rely on security-update triggers for action SHA patching. Overall, this is a strong baseline for avoiding dangerously stale dependencies, though Dependabot explicitly does not auto-update source dependencies across npm/PyPI (by design).

high
Extend automated freshness checks beyond “pinning exists”: add/verify a CI step that flags dependencies that are many versions behind their upstreams (or simply runs `npm outdated` / `uv pip list --outdated` and fails/warns). Evidence of pinning is present, but the repo should also prove it actively detects staleness.
- package-lock.json:1-35 — Lockfile pinning is present, but freshness detection/alerts are not evidenced by this lockfile alone.
med
Confirm that npm/uv dependency update PRs for source dependencies are actually flowing through the configured process (not just the dependabot.yml scope). Verify via CI logs or a sample merged PR that updates package-lock.json and uv.lock on a cadence.
- .github/dependabot.yml:1-45 — This file shows Dependabot is scoped to GitHub Actions only; source dependency freshness likely depends on another workflow/mechanism.
low
For the Python constraints file (requirements.txt), ensure it is either generated from or kept consistent with the locked uv set, so it doesn’t become an outdated and confusing second source of constraints.
- optional-skills/finance/dcf-model/requirements.txt:1-8 — requirements.txt specifies minimum versions for openpyxl/requests; ensure the locked versions used in builds correspond to these minima and are regularly refreshed.

Upstream maintenance 0%

Upstream maintenance is partially implemented via Dependabot for GitHub Actions (scheduled weekly). However, the repo explicitly excludes Dependabot automation for source dependency ecosystems (pip/npm), so upstream maintenance for the actual application/runtime dependencies is not clearly covered by an automated upstream-patching mechanism in the audited config. No deprecation/abandoned-upstream signals were provided by the inventory output, so the main hygiene gap here is the lack of an actively maintained upstream update workflow for source dependencies.

high
Enable upstream-maintenance automation for source dependencies (at minimum, Dependabot security updates) for the ecosystems the repo actually uses (PyPI and npm), or add a documented CI mechanism that reliably bumps and locks patched versions when upstream releases security/bugfix updates.
- .github/dependabot.yml:1-18 — Config explicitly states Dependabot is NOT enabled for pip/npm/source dependencies, which leaves upstream maintenance for those dependencies without an automated upstream-update mechanism.
med
Add/confirm CI evidence that pinned source-dependency updates occur when upstream publishes patches (e.g., Dependabot security update runs for pip/npm, or a scheduled/manual update workflow that includes security-triggered PRs).
- .github/dependabot.yml:19-45 — Comments describe that source-dependency security updates are intended to be enabled separately, but the checked-in config does not show pip/npm ecosystems. Verification should ensure the intended setting/workflow actually exists and runs.

Remediation velocity 100%

Remediation velocity is present. The repo has an active Dependabot configuration for GitHub Actions with a weekly update cadence. Off-graph provenance indicates that dependency-update PRs are not just configured but have merged recently (39 merges in the last 90 days; 47 in the last 365 days), supporting that the mechanism is working rather than stalled.

high
Verify that Dependabot security updates for pinned source dependencies are actually enabled and resulting in merged CVE-only PRs (not just action-bumps). If they are disabled or slow, adjust the Dependabot security update settings or add targeted automation for the pinned ecosystems (uv.lock / package-lock.json).
- .github/dependabot.yml:1-45 — The config explicitly scopes scheduled updates to github-actions and states source dependency CVE updates are enabled separately via repository security settings. Confirm that those security-update PRs are flowing/being merged (not only the scheduled actions bumps).
med
Add an explicit metric/check (e.g., CI badge or automated report) for “dependency-update PRs merged in last 90 days” to prevent the velocity mechanism from silently degrading over time.
- .github/dependabot.yml:1-45 — A mechanism exists, but there is no in-repo enforcement visible here that ensures continued merge velocity; implementing a monitoring check prevents regression.
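
The 90-day metric above reduces to counting merge timestamps in a trailing window. A minimal sketch follows, assuming the merge times have already been fetched from the hosting platform; the function names and the threshold are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable


def merged_within(merged_at: Iterable[datetime], days: int, now: datetime) -> int:
    """Count merge timestamps that fall inside the trailing `days` window."""
    cutoff = now - timedelta(days=days)
    return sum(1 for ts in merged_at if cutoff <= ts <= now)


def velocity_ok(
    merged_at: Iterable[datetime], now: datetime, days: int = 90, minimum: int = 1
) -> bool:
    """True when at least `minimum` dependency-update PRs merged in the window."""
    return merged_within(merged_at, days, now) >= minimum
```

Passing `now` explicitly keeps the check deterministic and easy to test.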

Supply-chain integrity 67%

Supply-chain integrity is implemented via committed dependency lockfiles with integrity hashes: `package-lock.json` for npm packages (sha512 integrity fields) and `uv.lock` for Python packages (sha256 hashes for sdists/wheels). A plain `requirements.txt` exists for an optional Python submodule, but the integrity mechanism is correctly provided by the presence of `uv.lock` rather than relying on the un-hashed requirements file.

high
Ensure CI/CD uses the lockfiles in a non-floating way (e.g., `npm ci` / `npm ci --ignore-scripts` as appropriate; `uv sync --locked`) so integrity verification is actually enforced during builds, not only during local installs.
- package-lock.json:1200-1245 — The lockfile has integrity hashes, but build-time enforcement depends on CI using the lockfile install commands.
- uv.lock:1-80 — The lockfile has integrity hashes, but build-time enforcement depends on CI using the locked sync/install workflow.
med
Add/verify CI steps that fail the build if lockfiles are out-of-date (e.g., ensure no dependency installs are allowed without `--locked`/lockfile mode). This prevents accidental drift that would weaken integrity guarantees.
- optional-skills/finance/dcf-model/requirements.txt:1-8 — This manifest lacks hashes; CI should avoid installing it without a corresponding locked/hashes-based mechanism.
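
Alongside installing with `npm ci`, a CI step could confirm that every registry-resolved entry in the npm lockfile carries an integrity hash. The sketch below assumes the lockfileVersion 2/3 layout (`packages` map with `resolved`, `integrity`, and `link` fields); the function name is hypothetical.

```python
from typing import List


def packages_missing_integrity(lock: dict) -> List[str]:
    """List resolved npm lockfile entries that carry no `integrity` hash."""
    missing = []
    for path, meta in lock.get("packages", {}).items():
        if not path or meta.get("link"):  # skip root project and workspace links
            continue
        if "resolved" in meta and "integrity" not in meta:
            missing.append(path)
    return sorted(missing)
```

An empty result means each resolved entry can be verified against its hash at install time.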

Dependency-confusion resistance 67%

Dependency-confusion resistance appears implemented via committed lockfiles for both npm (package-lock.json) and Python (uv.lock). The root package manifest does not show obvious unscoped/private or typo-similar package names in the lines reviewed, and the lockfiles should prevent registry-name swapping from taking effect during installation.

high
Verify there are no unscoped private packages declared anywhere else in the repo (e.g., other package.json files under workspaces, app-specific manifests) and confirm any @org private scopes are the intended namespaces.
- package.json:1-35 — This is the primary surface for dependency names; only the root manifest slice was reviewed.
med
For npm, ensure the install process always uses the committed lockfile in CI/release (e.g., npm ci) and does not fall back to package.json-only resolution.
- package-lock.json:1-80 — Lockfile pinning exists, but the enforcement depends on the CI/install command.
low
Align any loose Python requirements (requirements.txt) with the pinned uv.lock output; avoid relying on ranges for sensitive/private packages.
- optional-skills/finance/dcf-model/requirements.txt:1-8 — requirements.txt uses version ranges (e.g., openpyxl>=3.0.0, requests>=2.28.0), so correctness depends on using uv.lock for installations.

IP ownership / provenance 0%

For the primitive “IP ownership / provenance”, I did not find evidence of a durable IP-assignment mechanism (e.g., CLA/contributor agreement documentation and coverage workflow) that would let an acquirer verify that every meaningful contributor’s IP is assigned to the company. The repo contains a contributor audit script that tracks attributions, but it is not itself an IP-ownership/legal assignment artifact.

high
Add or point to the durable legal mechanism that assigns contributor IP to the company (CLA/contributor agreement), and document it in-repo (e.g., LICENSE/CLA/CONTRIBUTING reference) including how signatures are obtained/recorded for all contributors.
- scripts/contributor_audit.py:1-120 — Contributor audit tooling exists; however, no legal assignment mechanism is evidenced here. This should be connected to the CLA/IP assignment artifact used to cover contributors.
high
Create a searchable provenance record that maps contributors (git author emails/handles) to IP assignment status (e.g., CLA signature IDs/timestamps, or an escrowed/archived record), and ensure the audit script (or a companion CI check) validates coverage against that record.
- scripts/contributor_audit.py:200-360 — The script already enumerates contributors across git history and PR attribution signals; it should be extended to validate against an IP-assignment/CLA coverage dataset rather than only checking release-note mentions.
med
Include/maintain a current “IP roster” (employee/contractor emails) and explicitly document the policy for external contributors (require CLA/assignment before merging). This allows the unassigned-IP cloud to be resolved deterministically during diligence.
- scripts/contributor_audit.py:1-120 — The script performs contributor resolution/exclusion, which is the right starting point, but the roster/legal coverage policy that would resolve unassigned IP is not evidenced.
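
The coverage validation described above amounts to a set difference between commit authors and the people with a recorded assignment. This sketch assumes author emails are collected separately (for example with `git log --format=%ae`) and that the CLA record and employee roster are available as sets of emails. It is not an extension of scripts/contributor_audit.py, and all names are hypothetical.

```python
from typing import Iterable, List, Set


def uncovered_contributors(
    author_emails: Iterable[str], signed: Set[str], roster: Set[str]
) -> List[str]:
    """Authors who are neither on the employee roster nor in the CLA record."""
    covered = {email.lower() for email in signed | roster}
    return sorted({email.lower() for email in author_emails} - covered)
```

Emails are compared case-insensitively; a production check would also need to merge aliases that belong to the same person.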

AI-coding-tool provenance N/A

No AI-coding-tool provenance tracking convention (generated-code markers/headers, commit trailer patterns, or an AI-usage/provenance policy doc) was identified in the code artifacts inspected. Therefore this primitive is treated as absent for this codebase in its current form.

high
Add an explicit AI-provenance policy/document (e.g., in CONTRIBUTING/SECURITY/OSS-HYGIENE docs) defining: what counts as AI-generated code, required attribution/traceability format, and how license/IP review is triggered for AI-generated snippets.
- agent/agent_runtime_helpers.py:1-30 — Example of a core agent runtime module that lacks any AI-provenance metadata convention.
high
Introduce a repo-wide provenance marker convention for AI-generated code (e.g., standardized header comment like `# AI-GENERATED: <tool> <date> <prompt/ref> <license-check-status>` and/or filenames placed under an `ai-generated/` directory).
- optional-skills/mlops/training/trl-fine-tuning/templates/basic_grpo_training.py:1-30 — Template/header area that could carry an AI-provenance marker, but currently does not.
med
Add CI checks to enforce provenance requirements (e.g., scan for missing/incorrect AI-provenance headers in files labeled as AI-generated; require a link/reference to a review or approval record).
- agent/agent_runtime_helpers.py:1-30 — Demonstrates absence of any machine-checkable provenance signals today.
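
A header convention like the one suggested above is only useful if it can be validated mechanically. The sketch below parses a marker of the form `# AI-GENERATED: <tool> <date> <ref> <status>`. The ISO date format and the `reviewed`/`pending` status vocabulary are assumptions made for illustration.

```python
import re
from typing import Optional

# Matches the suggested header shape: tool, ISO date, reference, status.
HEADER = re.compile(
    r"^# AI-GENERATED: (?P<tool>\S+) (?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<ref>\S+) (?P<status>reviewed|pending)$"
)


def provenance_header(source: str, max_lines: int = 5) -> Optional[dict]:
    """Return the parsed marker if one appears in the first `max_lines` lines."""
    for line in source.splitlines()[:max_lines]:
        match = HEADER.match(line.strip())
        if match:
            return match.groupdict()
    return None
```

A CI check could require a valid header on every file under a designated directory and reject markers still in `pending` status at release time.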

Not applicable to this codebase: Software bill of materials, Dependency usage & reachability, AI-coding-tool provenance.

Implementation & Customization

Configuration over per-customer branches: no "if customer_id == 12345", no pricing literals scattered outside the billing module.

78% 7/10 scored

Configuration over code branches 56%

2/3 expected sites
Centralized pricing/plan logic 67%

2/3 expected sites
Metering decoupled from pricing model 67%

3/3 expected sites
Feature gating via flags, not forks 80%

5/5 expected sites
Documented extension interface 100%

9/8 expected sites
Customization isolation & upgrade safety 100%

7/7 expected sites
Theming / white-label as config 78%

3/3 expected sites

Configuration over code branches 56%

This codebase uses configuration/data to drive meaningful variation (notably tool enablement/provider setup persisted to user config, and custom provider request overrides merged via config-like inputs). However, pricing/rate tables are embedded as large code literals in agent/usage_pricing.py, which is a divergence risk versus a pure config/data approach for evolving pricing rules.

high
Move the pricing/rate tables out of agent/usage_pricing.py into a versioned config/data source (e.g., YAML/JSON fetched locally or packaged with the app), and make the lookup layer load from that data so updates/onboarding new models require config changes not code edits.
- agent/usage_pricing.py:1-220 — The core pricing matrix is hardcoded in _OFFICIAL_DOCS_PRICING with many per-model/per-provider Decimal literals, indicating code-based customization instead of data-driven configuration.
med
Ensure all future per-user/toolset variations (especially plugin-provided toolsets) flow through the same config persistence mechanism and do not require new branching code in hermes_cli/tools_config.py for each new variation.
- hermes_cli/tools_config.py:1-220 — The module already provides a config registry and plugin discovery hooks; keep new variation sources wired through these data structures.
low
Where custom provider behavior uses request_overrides.extra_body, document the schema for custom_providers entries so integrators can add behaviors via config confidently without touching init logic.
- agent/agent_init.py:1-140 — The selection and merge logic for custom_providers/extra_body is present; formalizing its expected config shape reduces the need for code changes.
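
To illustrate the data-driven pricing recommendation above, the sketch below loads rates from a JSON document into `Decimal` values. The `Rate` dataclass, the JSON field names, and the provider/model values are hypothetical; they do not reflect the actual shape of `PricingEntry` or `_OFFICIAL_DOCS_PRICING`.

```python
import json
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Tuple


@dataclass(frozen=True)
class Rate:
    """Hypothetical per-million-token rates; not the real PricingEntry shape."""
    input_per_million: Decimal
    output_per_million: Decimal


def load_rates(document: str) -> Dict[Tuple[str, str], Rate]:
    """Parse a JSON pricing document into a {(provider, model): Rate} table."""
    raw = json.loads(document, parse_float=Decimal)  # avoid binary float rounding
    table = {}
    for row in raw["rates"]:
        table[(row["provider"], row["model"])] = Rate(
            Decimal(str(row["input_per_million"])),
            Decimal(str(row["output_per_million"])),
        )
    return table
```

With this layout, adding a model becomes a change to the data file rather than to the lookup code.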

No hardcoded customer branching N/A

No hardcoded customer/tenant/org/account identity branching was found in the audited code paths. Observed uses of `account_id` / `tenant_id` are configuration or data-scoping inputs (e.g., headers, request URLs, metadata fields), not `if`/switch-style special-casing based on literal identity values.

med
If customer-specific behavior is expected in this repo, add/update tests that fail when business logic branches on literal `customer_id`/`tenant_id`/`org_id`/`account_id` values (e.g., snapshot tests with multiple tenants).
- agent/account_usage.py:92-121 — Example of the desired pattern: identity used to scope data (header), not to change control flow.
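
One way to keep this property from regressing is a static check that flags comparisons between an identity field and a literal. The sketch below uses Python's `ast` module. The field-name list comes from the recommendation above, while the function names are hypothetical. It only catches direct comparisons such as `customer_id == 12345`, not membership tests against tuples or dictionary lookups.

```python
import ast
from typing import List

IDENTITY_NAMES = {"customer_id", "tenant_id", "org_id", "account_id"}


def _identity_name(node: ast.AST) -> str:
    """Return the identity field a node refers to, or '' if it is not one."""
    if isinstance(node, ast.Name) and node.id in IDENTITY_NAMES:
        return node.id
    if isinstance(node, ast.Attribute) and node.attr in IDENTITY_NAMES:
        return node.attr
    return ""


def literal_identity_comparisons(source: str) -> List[int]:
    """Line numbers where an identity field is compared against a literal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Compare):
            continue
        operands = [node.left, *node.comparators]
        has_identity = any(_identity_name(op) for op in operands)
        has_literal = any(
            isinstance(op, ast.Constant) and op.value is not None for op in operands
        )
        if has_identity and has_literal:
            hits.append(node.lineno)
    return sorted(hits)
```

Comparisons against `None` are deliberately ignored, since presence checks are not customer-specific branching.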

Centralized pricing/plan logic 67%

A centralized pricing/plan-cost module exists at agent/usage_pricing.py, with PricingEntry and an _OFFICIAL_DOCS_PRICING snapshot plus routing to official snapshots or provider metadata. Other surfaces (notably the CLI and the agent conversation loop) reuse that module’s pricing helpers rather than duplicating pricing constants or cost calculation logic.

high
Search for any remaining price/tier/discount literals outside agent/usage_pricing.py and the official metadata providers, and refactor them to call get_pricing_entry/estimate_usage_cost/resolve_billing_route.
- agent/usage_pricing.py:1-118 — This file should become the single source of pricing constants/rules; any other literal-based pricing should be eliminated.
med
For the desktop model picker, confirm the UI is only consuming backend-provided price/tier fields (not computing any pricing). If any computations exist in apps/desktop, move them behind the backend pricing endpoints.
- apps/desktop/src/components/model-picker.tsx:1-220 — Model picker includes UI pricing/tier concepts; verify it renders data rather than implementing pricing math.

Metering decoupled from pricing model 67%

The codebase includes a clear separation between usage metering (token counters persisted in the session DB) and pricing model logic (agent/usage_pricing.py maps normalized usage to cost estimates). However, billing/cost fields also exist alongside usage in the session schema (estimated/actual), so the decoupling is strong but not perfectly 'events only + later mapping' throughout the stack.

high
Audit the session write path (where input_tokens/output_tokens/cache_*_tokens and billing_provider/billing_mode are populated) to ensure no pricing literals/rules are embedded in the metering/capture code. If billing_mode/cost_source are set during capture, refactor to store only metering inputs and defer cost-source/pricing-version selection to usage_pricing.
- hermes_state.py:260-320 — Shows usage counters and pricing-related fields coexist in the same persisted record; capture-path coupling may exist and should be checked.
med
Ensure all cost displays (CLI/status/any API) derive from estimate_usage_cost(normalized metering) rather than re-implementing token->price math elsewhere (search for token multipliers or per-million constants outside agent/usage_pricing.py).
- agent/usage_pricing.py:1-260 — Central pricing constants/rates live in one place; verify other surfaces only call into this module.
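
To illustrate the "events only + later mapping" shape this primitive looks for, here is a minimal sketch. `UsageEvent`, `Rate`, `PRICING_V1`, and `estimate_cost` are hypothetical names for illustration and do not reflect the actual signatures in agent/usage_pricing.py; the rates are made up.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Optional


@dataclass(frozen=True)
class UsageEvent:
    """Metering only: what was consumed, with no prices attached."""
    model: str
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class Rate:
    """Price per one million tokens, held in a versioned pricing table."""
    input_per_million: Decimal
    output_per_million: Decimal


# Hypothetical pricing snapshot with invented rates.
PRICING_V1: Dict[str, Rate] = {
    "example-model": Rate(Decimal("3.00"), Decimal("15.00")),
}


def estimate_cost(event: UsageEvent, pricing: Dict[str, Rate]) -> Optional[Decimal]:
    """Map a stored usage event to a cost at read time; None if the model is unpriced."""
    rate = pricing.get(event.model)
    if rate is None:
        return None
    total = (
        event.input_tokens * rate.input_per_million
        + event.output_tokens * rate.output_per_million
    )
    return total / Decimal(1_000_000)
```

Because the event carries no price, the same stored record can be re-costed under a later pricing table without touching the capture path.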

Feature gating via flags, not forks 80%

This codebase uses a centralized entitlements/feature-state mechanism (NousFeatureState / NousSubscriptionFeatures) to gate tiered capabilities (web/image/video/tts/browser/modal) via flags derived from account entitlements and configuration, rather than introducing divergent per-plan code paths.

high
Verify (and if needed, tighten) that every tool dispatch path consumes the already-computed enabled_toolsets/disabled_toolsets (from tools_config/nous_subscription) and does not re-check plan/tier in individual tool handlers.
- agent/agent_runtime_helpers.py:1600-2362 — invoke_tool passes enabled_toolsets/disabled_toolsets down to the shared handler; audit downstream tool registry/handlers for any re-introduced tier-specific branching.
med
Ensure gateway/managed-vs-direct selection remains strictly config/flag driven across all UI and runtime surfaces (CLI, TUI, gateway). Add a single contract test that compares feature-state outputs to tool registration results.
- hermes_cli/nous_subscription.py:260-520 — Managed/direct availability is computed here; correctness depends on all callers using this state consistently.
low
If additional flags are introduced in the future, deprecate/retire old gating variables (e.g., legacy tier/message flags) and keep gating logic concentrated in hermes_cli/nous_subscription.py.
- hermes_cli/nous_subscription.py:520-760 — This module is currently the central place where tiered capability states are composed; keep new flags aligned to this model.
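
The contract test suggested in the medium-severity item could take roughly the following shape. This is a sketch under stated assumptions: `FeatureState`, `TOOL_REGISTRY`, and the tool names are hypothetical stand-ins, and in a real test `registered` would come from the actual tool registration path rather than from `register_tools`.

```python
from dataclasses import dataclass, fields
from typing import Dict, Set


@dataclass(frozen=True)
class FeatureState:
    """Hypothetical stand-in for the computed entitlement flags."""
    web: bool = False
    image: bool = False
    tts: bool = False
    browser: bool = False


# Hypothetical registry: tool name -> toolset that gates it.
TOOL_REGISTRY: Dict[str, str] = {
    "web_search": "web",
    "image_generate": "image",
    "text_to_speech": "tts",
    "browser_navigate": "browser",
}


def enabled_toolsets(state: FeatureState) -> Set[str]:
    """Names of all flags that are switched on."""
    return {f.name for f in fields(state) if getattr(state, f.name)}


def register_tools(state: FeatureState) -> Set[str]:
    """Register only tools whose gating toolset is enabled."""
    enabled = enabled_toolsets(state)
    return {tool for tool, toolset in TOOL_REGISTRY.items() if toolset in enabled}


def contract_violations(state: FeatureState, registered: Set[str]) -> Set[str]:
    """Report tools registered without entitlement and enabled toolsets with no tool."""
    enabled = enabled_toolsets(state)
    leaked = {f"leaked:{t}" for t in registered if TOOL_REGISTRY.get(t) not in enabled}
    covered = {TOOL_REGISTRY.get(t) for t in registered}
    missing = {f"missing:{ts}" for ts in enabled - covered}
    return leaked | missing
```

An empty result means the feature state and the registration agree; any `leaked:` entry indicates a tool that bypassed the flags.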

Documented extension interface 100%

This codebase has a strong, documented extension interface centered on a plugin system (`hermes_cli/plugins.py`) with stable hook definitions and a `PluginContext` facade for third-party customization. Separately, it uses documented ABC/profile contracts for provider customization (`ProviderProfile` and browser `BrowserProvider`) with registry-driven discovery/dispatch and (for some interfaces) compliance tests to keep the contract stable across upgrades.

high
Add explicit documentation for the customer-facing extension points most likely to be used externally (e.g., browser provider lifecycle, `ProviderProfile` hooks, and dashboard-auth provider protocol), including versioning/compat rules and an example “hello world” plugin for each. Current code has strong docstrings, but external/partner docs often need a dedicated compatibility section.
- hermes_cli/plugins.py:1-110 — Plugin contract exists, but there’s no explicit, single “public contract doc + versioning rules” artifact shown in the evidence slices.
- providers/base.py:1-199 — ProviderProfile is documented in-code; partner adoption typically benefits from a versioned interface spec and compatibility policy.
med
Ensure all extension categories (standalone/backend/platform/exclusive) have equivalent compliance tests (like `test_plugin_platform_interface.py`) so the stability guarantee covers every documented contract, not only gateway platform plugins.
- tests/gateway/test_plugin_platform_interface.py:1-220 — Demonstrates compliance testing for gateway platform plugins, but we only observed this one test suite in evidence.
low
Add at least one upgrade/compat regression test for provider-profile overrides (e.g., ensuring `build_api_kwargs_extras` and alias resolution remain stable), mirroring the approach used for platform interface compliance tests.
- providers/__init__.py:1-120 — Discovery and override semantics are critical for extension safety; a focused regression test would reduce risk during future refactors.

Customization isolation & upgrade safety 100%

This codebase implements customization with explicit, stable extension boundaries (plugin LLM access with config trust gating, shell hooks injected via an existing hook manager with allowlist/consent and idempotent config registration, and image-generation provider backends behind an ABC selected by config). It also centralizes customization-prone credential removal via a registry to avoid per-source bespoke core branching that would require re-validation on upgrades.

high
Add a short, versioned extension contract doc/test for each customization boundary (plugin LLM ctx.llm, shell hooks event matcher+callback shape, ImageGenProvider.generate response schema). Ensure CI has “contract tests” that validate core+plugin integration across upgrades.
- agent/plugin_llm.py:1-70 — Plugin LLM surface and trust-gating contract is defined here; it should be locked down with explicit contract tests.
- agent/shell_hooks.py:1-55 — Shell hook configuration/dispatch contract is defined here; regression safety depends on preserving the wire protocol and callback behavior.
- agent/image_gen_provider.py:1-35 — Image provider contract is defined here; adding contract tests will prevent accidental coupling during core evolution.
med
For plugin trust-gated overrides, ensure there is a single authoritative place that documents the default-deny behavior for missing config and how allowed_* lists interact (including wildcard semantics), and add tests for each override dimension.
- agent/plugin_llm.py:230-320 — Override trust gating logic lives here; expanding test coverage around all override flags reduces upgrade risk.
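
The default-deny and wildcard behavior described in the medium-severity item can be pinned down with a small reference check. The sketch below assumes shell-style glob patterns and an `allowed_<dimension>` key layout; both are assumptions for illustration, and the actual semantics in agent/plugin_llm.py may differ.

```python
from fnmatch import fnmatchcase
from typing import Mapping, Optional, Sequence


def is_override_allowed(
    plugin_cfg: Optional[Mapping[str, Sequence[str]]],
    dimension: str,
    value: str,
) -> bool:
    """Default-deny check for one override dimension (e.g. "models")."""
    if not plugin_cfg:
        # Missing or empty config: nothing is trusted.
        return False
    patterns = plugin_cfg.get(f"allowed_{dimension}")
    if not patterns:
        # A dimension with no allowlist is denied rather than open.
        return False
    # Case-sensitive glob match; "*" alone allows every value.
    return any(fnmatchcase(value, pattern) for pattern in patterns)
```

One test per override dimension, each covering the missing-config, empty-list, exact-match, and wildcard cases, would document the behavior better than prose alone.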

Theming / white-label as config 78%

The codebase does support theming/white-label as configuration. The clearest, explicitly white-label-oriented implementation is the CLI `skin_engine` (YAML-driven skins with `branding` and `colors`, requiring no code changes to add a new skin). Additionally, both the desktop and web frontends use centralized theming contexts that apply brand/skin differences by building/injecting theme tokens (CSS variables) rather than forking UI code per brand.

high
Verify end-to-end partner onboarding expectations for white-label: confirm there is a supported/configurable mechanism to select or activate a skin/theme per deployment (and/or per environment) without manual code edits or per-customer branches. If selection is currently only user-local (e.g., localStorage), document the intended operational workflow for multi-tenant/partner setups.
- apps/desktop/src/themes/context.tsx:220-335 — Desktop skin/mode are resolved via localStorage and preset lists; confirm whether there’s an operator-level config/branding assignment path for partners beyond per-user local selection.
- hermes_cli/skin_engine.py:1-260 — The skin engine is data-driven via YAML, but this slice is primarily schema/docs and built-in skins; confirm the actual runtime selection/activation mechanism (e.g., config key or CLI flags) is operator-driven and not code-driven.
med
Align naming and theme selection semantics across surfaces (CLI skins vs desktop skins vs web themes) so that ‘brand/theme id’ maps cleanly to all clients. This reduces divergence where different clients use different identifiers/selection rules.
- apps/desktop/src/themes/context.tsx:1-335 — Desktop has its own skin selection keys and preset system.
- web/src/themes/context.tsx:1-220 — Web themes use a separate theme context and theme schema (layers/assets/overrides).

Tenant-configurable behavior surface N/A

After inspecting the code areas most likely to implement or reference a tenant/customer-configurable behavior surface (gateway config, plugin trust/policy knobs, and tenant-named tests), I did not find a clear settings/rules model where customer/tenant-specific behavior variations are expressed as data (per-tenant config rows) rather than code branching. A number of config-driven switches exist (e.g., plugin LLM trust gate, gateway/session policies), but they are not established as a per-tenant behavior surface in the sense required by this primitive.

high
Locate (or confirm absence of) the intended multi-tenant model: search for the authoritative tenant identifier (e.g., `tenant_id`, `org_id`, `customer_id`) used to select tenant-scoped settings/rules, and for a central config/rules loader that feeds behavior (not just infra config). If none exists, define a tenant-scoped settings schema (YAML/DB) and refactor the behavior gates to read from it.
- tests/gateway/test_teams.py:1-120 — Indicates tenant is considered in config validation at least at the adapter level, but no corresponding tenant-configurable behavior surface wiring was confirmed in the inspected code.
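
If a tenant-scoped settings schema is introduced, it might look roughly like the sketch below: defaults in one place, per-tenant variation stored as data, and unknown keys rejected. `TenantSettings`, its fields, and `TENANT_OVERRIDES` are hypothetical; no such model exists in the audited code.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict, Mapping, Tuple


@dataclass(frozen=True)
class TenantSettings:
    """Hypothetical behavior knobs with safe defaults."""
    max_session_minutes: int = 60
    plugin_llm_trusted: bool = False
    allowed_toolsets: Tuple[str, ...] = ("web",)


# Hypothetical per-tenant rows; in practice loaded from YAML or a database.
TENANT_OVERRIDES: Dict[str, Dict[str, Any]] = {
    "tenant-a": {"max_session_minutes": 240},
    "tenant-b": {"plugin_llm_trusted": True, "allowed_toolsets": ("web", "browser")},
}


def load_settings(
    tenant_id: str,
    overrides: Mapping[str, Mapping[str, Any]] = TENANT_OVERRIDES,
) -> TenantSettings:
    """Merge a tenant's overrides onto the defaults, rejecting unknown keys."""
    raw = dict(overrides.get(tenant_id, {}))
    known = {f.name for f in fields(TenantSettings)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown settings for {tenant_id!r}: {sorted(unknown)}")
    return replace(TenantSettings(), **raw)
```

Behavior gates would then read from the returned object instead of branching on the tenant identifier, which keeps variation in data rather than in control flow.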

Onboarding-by-configuration cost N/A

I did not find an implementation specifically targeting “onboarding a new customer cheaply via configuration/data (self-serve provisioning)”. The codebase has an “onboarding” concept for first-run UX hints (tracked in local config), and also config-driven behaviors (e.g., shell hooks registered from config), but these do not constitute a documented, customer/tenant onboarding-by-configuration pathway.

high
Add/confirm a tenant/customer onboarding-by-configuration contract: a single documented provisioning entrypoint (ideally self-serve) that turns a new tenant’s input data/config into the required runtime setup without per-customer code edits or bespoke deploys. Include example configs and a validation script/command that verifies everything is wired correctly.
- agent/onboarding.py:1-22 — Current onboarding is scoped to first-touch hint flags in local config, not customer/tenant provisioning; this highlights the gap to address.

Not applicable to this codebase: No hardcoded customer branching, Tenant-configurable behavior surface, Onboarding-by-configuration cost.

Procurement Code Readiness

Data-export and data-subject erase/export endpoints, region pinning, and DPA-mapped controls that survive enterprise procurement.

21% 6/10 scored

Self-serve trust documentation 0% · 0/2 expected sites
Data export mechanism 0% · 0/2 expected sites, not present
Deletion / erase-on-request 42% · 3/4 expected sites
Data residency commitment 0% · 0/1 expected sites, not present
Enterprise access controls 83% · 2/2 expected sites
Sub-processor transparency 0% · 0/2 expected sites, not present

Self-serve trust documentation 0%

The repo contains committed security/trust documentation (SECURITY.md and a published website security page), but it is not packaged as a self-serve procurement trust set (it does not present deal-closing artifacts like DPAs/certifications/sub-processor lists/pen-test summaries and maintained control status as prospect-ready materials).

high
Create/maintain a single, prospect-facing trust center (or “Trust & Security” landing page) that self-serve purchasers can use for diligence: include links to the current DPA/contract terms, a versioned sub-processor list, pen-test/audit summaries (with dates and report references), and any maintained control-status overview. Ensure links resolve to committed artifacts that are updated on a schedule.
- SECURITY.md:1-35 — Current trust content is a policy/vulnerability disclosure document; it does not provide the packaged deal-closing self-serve procurement materials expected of this primitive.
- website/docs/user-guide/security.md:1-40 — Current public page documents security layers, but it is not an entry point for procurement/self-serve certifications and contracting artifacts.
med
Augment SECURITY.md (or the trust landing page it links to) with an explicit “Procurement packet” section: what artifacts exist, where they live in-repo (or on a published secure location), and their last-updated dates/versions.
- SECURITY.md:1-35 — SECURITY.md is the natural place to anchor trust commitments, but today it focuses on boundaries and vulnerability reporting rather than procurement-ready packaging.
low
If the project intends to keep these artifacts off-code (e.g., compliance reports in a data room), still add stable, self-serve references and an index page in the repo so buyers don’t have to re-derive everything from scratch.
- website/docs/user-guide/security.md:1-40 — The published security page should act as the prospect-facing index, not only as a general security-model explanation.

Questionnaire response library N/A

No questionnaire response library (e.g., CAIQ/SIG/VSA reusable question-to-answer bank) is present in the repository. This primitive is a DATA-ROOM follow-up artifact, and its absence is expected; procurement should transition to requesting the current, framework-mapped questionnaire response package from the seller.

high
Request the seller’s current security questionnaire response library (CAIQ/SIG/VSA as applicable), including version/date and the mapping to the dominant frameworks/domains used for procurement (and any supporting evidence bundle).
- N/A (data-room artifact; tool-based discovery):N/A — git_artifact_scan reports the 'questionnaire' category as absent (no tracked questionnaire response library files present in the repo).
med
Ask for a control-evidence index accompanying the questionnaire responses (i.e., what documents/reports substantiate each answer) to avoid re-deriving controls during diligence.
- N/A (data-room artifact; tool-based discovery):N/A — Expected companion artifact for procurement packaging is not discoverable in-repo; questionnaire category is absent.

Controls-to-contract mapping N/A

I did not find any controls-to-contract mapping artifact (DPA/MSA -> controls -> audit evidence) that would let procurement verify commitments like encryption, retention, breach notice, and residency against implemented mechanisms and packaged evidence. The repo has security policy and security guidance/deployment documentation, but nothing that reads like the required hybrid controls-mapping doc for contract close readiness.

high
Ask the seller/GC/R&W underwriter for the current DPA/MSA + the specific controls-to-contract mapping (or equivalent schedule) that enumerates each DPA commitment (encryption, retention, breach notice, residency) and attaches/points to the audit evidence used for attestation.
- SECURITY.md:1-31 — Confirms what the repo provides today (security policy) but does not provide contract-commitment traceability.
med
Request/produce a repo-adjacent mapping document under the expected location (e.g., docs/security) that explicitly links each named DPA/MSA commitment to (a) a code-visible control/mechanism and (b) the audit evidence package/version used for that control.
- docs/security/network-egress-isolation.md:1-60 — Example of an existing security artifact that could be referenced by a future controls-to-contract mapping, but currently is not structured as contract traceability.

Data export mechanism 0%

The codebase supports exporting an individual session (web endpoint `/api/sessions/{session_id}/export` and a desktop helper `exportSession`) and the data store includes `export_all()` to export all sessions. However, there is no tenant-scoped “export ALL data out on request” handler/job wired into the web/API layer (the available export endpoint is per-session), so the procurement “data export mechanism” primitive is not fully implemented as required.

high
Add a tenant-scoped “export all data” HTTP endpoint (or async job + polling/download endpoint) that invokes `SessionDB.export_all(...)` and returns data in a portable format (e.g., JSONL/JSON archive). Ensure the export is scoped to the authenticated tenant/user context (not global) and streamed/queued if large.
- hermes_cli/web_server.py:4440-4487 — Current export handler is only per-session (`/api/sessions/{session_id}/export`). A tenant-wide route should be added alongside this pattern.
- hermes_state.py:3068-3097 — `export_all()` is the likely correct underlying implementation for the required tenant-wide export; wire it to a request/tenant-scoped mechanism.
med
Extend the desktop/web UX to request a full export (not just a single session), and plumb it to the new tenant-wide backend export endpoint.
- apps/desktop/src/lib/session-export.ts:14-57 — Desktop export currently downloads only a single session’s JSON (not all tenant data).
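
A tenant-scoped export would need the scoping applied in the query itself and the output streamed. The sketch below shows that shape against an in-memory SQLite database. The `sessions` table layout and its `tenant_id` column are assumptions for illustration; the audited session store has no such column, and this does not call `SessionDB.export_all()`.

```python
import json
import sqlite3
from typing import Iterator


def export_tenant_sessions(conn: sqlite3.Connection, tenant_id: str) -> Iterator[str]:
    """Yield one JSON line per session owned by tenant_id, without buffering the full set."""
    cursor = conn.execute(
        "SELECT id, title, started_at FROM sessions WHERE tenant_id = ? ORDER BY id",
        (tenant_id,),
    )
    for session_id, title, started_at in cursor:
        record = {"id": session_id, "title": title, "started_at": started_at}
        yield json.dumps(record) + "\n"


def make_demo_db() -> sqlite3.Connection:
    """Build a small in-memory database with rows for two tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sessions ("
        "id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL, title TEXT, started_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO sessions VALUES (?, ?, ?, ?)",
        [
            ("s1", "tenant-a", "first", "2026-01-01T00:00:00Z"),
            ("s2", "tenant-b", "other", "2026-01-02T00:00:00Z"),
            ("s3", "tenant-a", "second", "2026-01-03T00:00:00Z"),
        ],
    )
    return conn
```

An HTTP handler could pass this generator to a streaming response, with the tenant taken from the authenticated context rather than from a request parameter.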

Deletion / erase-on-request 42%

The codebase contains code-visible deletion/erase operations for stored memory items (Supermemory forget tool + RetainDB delete endpoints). However, the implementation is primarily id/query-scoped deletes; the code evidence available here does not demonstrate the procurement primitive’s required verifiable, tenant/subject-scoped cascade deletion reaching backups/derived stores with auditable linkage to an erase-on-request contract.

high
Add/locate a data-subject/tenant erase endpoint or job that is explicitly invoked by a deletion request, and ensure it cascades to all data stores (primary + derived + backups) for that subject/tenant; include audit evidence and request identifiers in logs.
- plugins/memory/supermemory/__init__.py:320-350 — Current erase is per-memory id/query; needs subject/tenant-scoped cascade-delete evidence.
- plugins/memory/retaindb/__init__.py:220-310 — Current erase is HTTP DELETE by id for memories/files; needs proof it cascades through backups/derived stores and is auditable.
med
Implement (or expose) deletion-by-subject/session for each memory backend (not only by memory_id), so an erase request can delete all items belonging to a subject across containers/scopes in one verifiable workflow.
- plugins/memory/supermemory/__init__.py:320-350 — forget_by_query deletes one best-match; erase-on-request typically requires full subject/tenant coverage.
- plugins/memory/retaindb/__init__.py:220-310 — delete_memory/delete_file are id-scoped; erase-on-request needs subject/tenant coverage.
low
Ensure UI/admin delete actions (if any) map to the same server-side subject/tenant erase workflow and return a verifiable status/result (including which datasets were purged).
- gateway/platforms/api_server.py:1-260 — API header indicates DELETE /v1/responses/{response_id}, but erase-on-request should be validated at the data-subject layer, not only response deletion.
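
The subject-scoped cascade with an auditable receipt could be sketched as follows. Table names, the `subject_id` column, and the `erase_audit` table are hypothetical. The sketch covers primary stores in one SQLite database only; backups and external memory backends would need their own steps.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Dict

# Hypothetical tables that hold subject data in this sketch.
SUBJECT_TABLES = ("memories", "files")


def erase_subject(conn: sqlite3.Connection, subject_id: str, request_id: str) -> Dict[str, int]:
    """Delete every row for subject_id and record what was purged under request_id."""
    purged: Dict[str, int] = {}
    with conn:  # one transaction: the deletes and the audit row commit together
        for table in SUBJECT_TABLES:
            cursor = conn.execute(
                f"DELETE FROM {table} WHERE subject_id = ?", (subject_id,)
            )
            purged[table] = cursor.rowcount
        conn.execute(
            "INSERT INTO erase_audit (request_id, subject_id, purged, completed_at) "
            "VALUES (?, ?, ?, ?)",
            (request_id, subject_id, json.dumps(purged),
             datetime.now(timezone.utc).isoformat()),
        )
    return purged


def make_demo_store() -> sqlite3.Connection:
    """Build an in-memory store with data for two subjects."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, subject_id TEXT, body TEXT)")
    conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, subject_id TEXT, name TEXT)")
    conn.execute(
        "CREATE TABLE erase_audit ("
        "request_id TEXT PRIMARY KEY, subject_id TEXT, purged TEXT, completed_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO memories (subject_id, body) VALUES (?, ?)",
        [("user-1", "a"), ("user-1", "b"), ("user-2", "c")],
    )
    conn.execute("INSERT INTO files (subject_id, name) VALUES ('user-1', 'notes.txt')")
    conn.commit()
    return conn
```

The returned per-table counts are what an admin UI could show as the verifiable result of the request.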

Data residency commitment 0%

Region concepts exist, but in the audited code they are used to route provider/API calls (e.g., the AWS Bedrock client `region_name`). There is no evidence of a tenant-region attribute that is enforced end-to-end for data/compute placement as a residency commitment; the observed region usage appears to be provider routing, not residency enforcement.

high
Provide (or implement) an explicit tenant-scoped data residency control: a tenant `data_region`/region attribute in the data model, with routing that enforces compute + all persistence/derived/backups in that region. Ensure the enforcement is applied at every persistence boundary (memory/session storage, logs, vector stores, caches, media/object storage, queues).
- agent/bedrock_adapter.py:53-84 — Current `region` usage is only to construct a provider client (`bedrock-runtime`, `region_name=region`), which is insufficient as residency enforcement for tenant data.
med
Add end-to-end tests that assert tenant A’s data never touches tenant B’s region (including any async background jobs, retries, and fallback routing). This should verify both data storage location and any region-dependent derived artifacts.
- agent/bedrock_adapter.py:53-84 — The region routing currently tested/used is provider client creation; add tests specifically covering tenant-scoped persistence boundaries and derived/backup flows.
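
A residency check at a persistence boundary might look like the sketch below. `TENANT_REGIONS`, `region_for`, and `check_placement` are hypothetical; the audited code has no tenant region attribute. The sketch fails closed when a tenant has no pinned region.

```python
from typing import Dict


class ResidencyError(RuntimeError):
    """Raised when a write would place tenant data outside its pinned region."""


# Hypothetical tenant -> region assignments; in practice a column on the tenant record.
TENANT_REGIONS: Dict[str, str] = {
    "tenant-a": "eu-west-1",
    "tenant-b": "us-east-1",
}


def region_for(tenant_id: str) -> str:
    """Return the tenant's pinned region, refusing to guess a default."""
    try:
        return TENANT_REGIONS[tenant_id]
    except KeyError:
        raise ResidencyError(f"no data_region pinned for {tenant_id!r}") from None


def check_placement(tenant_id: str, store_region: str) -> str:
    """Call before every write; raises if the target store is in the wrong region."""
    pinned = region_for(tenant_id)
    if store_region != pinned:
        raise ResidencyError(
            f"{tenant_id!r} is pinned to {pinned}, refusing write to {store_region}"
        )
    return pinned
```

The end-to-end tests recommended above would exercise this check on each boundary (session storage, logs, caches, queues), including retry and fallback paths where a region could silently change.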

Enterprise access controls 83%

Enterprise access controls are only partially implemented. The codebase clearly enforces CIDR allowlisting at the boundary for the MS Graph webhook adapter (including fail-closed behavior when the host is network-accessible and CIDRs are missing). However, the code-visible evidence does not demonstrate that this network restriction is managed through an admin UI or applied as a tenant-scoped enterprise control for the admin/dashboard boundary; the dashboard gate is identity/session-based rather than IP-allowlist-based.

high
Confirm and wire an admin UI + persisted configuration that lets enterprise/security teams manage IP allowlists (CIDRs) for the relevant boundaries (at minimum: dashboard/admin endpoints, and ideally per-tenant). Provide code-visible enforcement that reads the admin-managed allowlist values.
- hermes_cli/dashboard_auth/middleware.py:1-75 — Dashboard/admin access gating exists, but it is OAuth/session-based; no IP allowlist enforcement is shown here.
- gateway/platforms/msgraph_webhook.py:166-210 — CIDR allowlisting is enforced for MS Graph webhook endpoints, but there is no demonstrated admin-managed, tenant-scoped control for it.
med
If tenant-scoped enterprise controls are required, refactor the CIDR allowlist configuration model so it is stored per tenant (and then ensure each boundary checks the tenant’s configured CIDRs). Add tests proving tenant A and tenant B have different CIDR behavior.
- gateway/platforms/msgraph_webhook.py:105-142 — CIDRs are parsed from adapter `extra` configuration; tenant scoping is not evidenced in the current implementation snippet.
low
Add/extend procurement-ready documentation in-repo that explicitly lists: (a) which endpoints enforce IP allowlisting, (b) how to configure CIDRs, (c) whether restrictions are global vs tenant-scoped, and (d) how the admin UI manages it.
- gateway/platforms/msgraph_webhook.py:59-142 — Inline docstring exists for the webhook allowlist, but broader procurement-ready packaging (admin UI + tenant scoping) is not evidenced.
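
The fail-closed allowlist behavior described in the finding can be expressed with the standard library `ipaddress` module. This is a minimal sketch of the technique, not the code in gateway/platforms/msgraph_webhook.py; the function names and the `network_accessible` flag are assumptions.

```python
import ipaddress
from typing import Iterable, List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def parse_cidrs(raw: Iterable[str]) -> List[Network]:
    """Parse CIDR strings; raises ValueError on malformed input."""
    return [ipaddress.ip_network(item.strip(), strict=False) for item in raw]


def is_request_allowed(client_ip: str, cidrs: List[Network], network_accessible: bool) -> bool:
    """Allow a request only if its source address falls inside a configured CIDR."""
    if not cidrs:
        # No allowlist configured: reject everything when the host is reachable
        # from the network, and permit only when it is bound locally.
        return not network_accessible
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in cidrs)
```

For a per-tenant control, `cidrs` would be looked up from the tenant's stored configuration before this check, and the same function could guard the dashboard boundary.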

Sub-processor transparency 0%

No versioned, in-repo sub-processor list was found in this codebase. The repo contains security/trust-model documentation, but it does not provide the current sub-processor inventory needed to back a DPA sub-processor clause.

high
Add a dedicated, committed, versioned sub-processor inventory (e.g., docs/subprocessors/SUBPROCESSORS.md or similar) and ensure it is referenced from SECURITY.md and the website trust/security page used by customers and procurement.
- SECURITY.md:1-200 — Shows the current trust/security documentation entrypoint that should link to the sub-processor inventory.
- website/docs/user-guide/security.md:1-200 — Public trust page that should link to the maintained sub-processor inventory backing the DPA clause.
high
Implement a change workflow: when adding a new sub-processor (new third-party data sink), update the inventory with a version/date and provide a documented notification flow to customers under the DPA (e.g., release notes + email template + escalation path).
- SECURITY.md:1-200 — This repo already defines a trust/security posture; it should be extended with an explicit, procurement-grade sub-processor notification/update process.
med
Cross-check the declared sub-processor inventory against actual third-party SDK imports used in the runtime (e.g., OpenAI SDK usage is present) and ensure the inventory entries match the code paths that send data externally.
- docs/security/network-egress-isolation.md:1-196 — Shows the project’s explicit attention to external connectivity and allowlisting; use the same principle to validate that the declared sub-processor list matches actual external calls.
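
The cross-check between a declared inventory and actual SDK imports could be automated along these lines. The `SDK_TO_SUBPROCESSOR` mapping is a hypothetical example and would need to be maintained by hand; imports are only a proxy for data egress, so results need review.

```python
import ast
from typing import Dict, Set

# Hypothetical mapping from top-level import name to sub-processor name.
SDK_TO_SUBPROCESSOR: Dict[str, str] = {
    "openai": "OpenAI",
    "boto3": "Amazon Web Services",
}


def imported_top_level_modules(source: str) -> Set[str]:
    """Collect the top-level package of every absolute import in the source."""
    modules: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules


def undeclared_subprocessors(source: str, declared: Set[str]) -> Set[str]:
    """Return sub-processors implied by imports but absent from the declared inventory."""
    used = {
        SDK_TO_SUBPROCESSOR[module]
        for module in imported_top_level_modules(source)
        if module in SDK_TO_SUBPROCESSOR
    }
    return used - declared
```

Run over the runtime packages in CI, a non-empty result would prompt an inventory update before the change ships.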

Compliance attestation readiness N/A

This primitive is a DATA-ROOM follow-up artifact (a current SOC 2 Type II / ISO / pen-test attestation readiness package plus control-to-code traceability). In this repo, there is no retrievable in-repo source evidence for a current compliance attestation readiness package. The tooling indicates the corresponding evidence category is not present/packaged in the repository (as expected for data-room artifacts).

high
Request the current compliance attestation package from the seller (e.g., SOC 2 Type II report and pen-test/ISO as applicable) plus control-to-code traceability evidence showing the implemented mechanisms map to the attested controls for the Dim 5 audit evidence set.
med
Ask the seller/GC to provide the report period coverage (start/end dates), scope statement (services, regions, sub-processors), and the specific mapping/traceability artifact (e.g., a controls-to-evidence spreadsheet or appendix) used during audits.
low
If the seller maintains these in another location (e.g., secure customer portal), request a shareable link and confirm the artifacts are current (not expired) and match the deployed production configuration.

Reliability / SLA evidence N/A

No packaged Reliability / SLA evidence (status page/config, SLA terms, incident postmortems) was found in this repository. The only similarly named material is runtime readiness logic used to decide whether the app should proceed, which does not constitute SLA/reliability evidence.

high
Request the buyer-ready Reliability/SLA evidence set from the seller (e.g., current status page URL + uptime reporting methodology, any formal SLA/credits terms, and a runbook + recent incident postmortems). Provide versioned artifacts suitable for procurement diligence.
- apps/desktop/src/lib/runtime-readiness.ts:1-148 — Shows the repository has runtime readiness checks, but nothing here packages SLA/status/incident evidence.

Not applicable to this codebase: Questionnaire response library, Controls-to-contract mapping, Compliance attestation readiness, Reliability / SLA evidence.

Reporting & Data Export

Customer-accessible export endpoints (CSV, Parquet, JSON), scheduled exports, and a documented map of emitted events.

16% 6/10 scored

On-demand data export 0% · 0/3 expected sites, not present
Export completeness & fidelity 0% · 0/2 expected sites
In-product reporting / analytics 75% · 4/4 expected sites
Documented export / event schema 0% · 0/3 expected sites, not present
Export access control & audit 0% · 0/1 expected sites, not present
Exit portability / no lock-in 22% · 2/3 expected sites

On-demand data export 0%

No tenant-scoped, permission-gated, audited on-demand data export handler for exporting an entire customer tenant/account’s data in a portable format was found. The codebase has (a) a desktop-only single-session JSON export and (b) a local CLI backup zip of the user’s Hermes home, but neither satisfies the primitive’s requirement for complete tenant-scoped data egress via a backend handler.

high
Implement a backend on-demand export endpoint in the authenticated API server layer that (1) is tenant-scoped, (2) enforces authorization, (3) exports the full set of tenant/account data categories in a portable format (tabular/columnar or structured JSON/NDJSON), and (4) writes an audit log entry tied to the export request.
- gateway/platforms/api_server.py:1-240 — API server is the correct placement for an export handler that can enforce auth/tenant scoping.
- gateway/platforms/api_server.py:3000-3220 — Shows established patterns for authenticated HTTP handlers; export should follow these patterns.
med
Rework the desktop session export to call the backend tenant export flow (or clearly mark the capability as a limited “session export” that is not the full account/takeout export).
- apps/desktop/src/lib/session-export.ts:1-57 — Current export is limited to a single session’s messages and happens entirely on the client.
med
If you keep the CLI backup, document it as a local device backup tool (not a customer data export primitive), and ensure the actual customer-facing export primitive exists in the API layer.
- hermes_cli/backup.py:1-220 — CLI zip backup targets local ~/.hermes/ content rather than tenant-scoped customer data export.

Export completeness & fidelity 0%

This codebase has export-like mechanisms, but none provides a complete, faithful, portable dump of customer data covering all critical categories. The desktop `exportSession` exports only one session’s messages. The CLI `hermes backup` bulk-zips the local `~/.hermes/` directory but deliberately skips excluded directories and “secret” files, so it is not a complete, full-fidelity customer data export in the required sense (and it is not demonstrated to be tenant-scoped or permission-audited).

high
Implement a tenant/account-scoped, permission-gated “full customer data export” endpoint/handler that covers all customer-critical entities (customer/profile, financial, operational, config, permissions/accounts, integration specs, and historical analytics) and verifies round-trip fidelity (types + relationships) without silent truncation.
- apps/desktop/src/lib/session-export.ts:21-57 — Current export scope is single-session only (`getSessionMessages(sessionId)`), not an account-wide completeness export.
- hermes_cli/backup.py:92-160 — Current bulk export is a filesystem zip with hard-coded exclusions (`_EXCLUDED_DIRS`, `_EXCLUDED_NAMES`, `_SECRET_FILE_NAMES`), which is a direct fidelity/completeness gap for customer-data portability.
high
Harden the bulk export path with explicit authorization checks and an audit-log write on export initiation/completion, and ensure export scope cannot cross tenants/accounts.
- hermes_cli/backup.py:92-260 — The CLI backup reads from local `HERMES_HOME` and writes a zip; there is no tenant scoping, explicit permission gating, or audit-log trail shown in this implementation.
med
Replace/extend the current “session export” with a structured export contract that can be composed into an account export (shared schema + inclusion rules), rather than ad-hoc per-feature JSON downloads.
- apps/desktop/src/lib/session-export.ts:21-42 — The export payload is a single-session JSON with `messages` only; there is no evidence it participates in a wider account export schema.

Large / async export handling N/A

No code-visible primitive for “Large / async export handling” was found. Searches for export-job/worker/task/queue/async-export/stream-export symbols returned no results. The only export behavior found is small, in-request data export (e.g., exporting a single session to a JSON blob) and a general backup/import mechanism. Neither implements an async, tenant-scoped, streaming large-dataset export pipeline with progress/notifications.

high
Add a dedicated large-data export pipeline: (1) tenant-scoped authorization, (2) enqueue an async export job, (3) stream results to a blob/object store (not buffering in memory), (4) provide progress/notification and resumability, and (5) implement a secure download endpoint tied to the job and tenant.

Scheduled / recurring exports N/A

No scheduled/recurring data-export primitive is implemented in this codebase. “Export” behavior found is manual (browser download of a session JSON), and the cron-related server code appears to support scheduled message/delivery plumbing rather than a tenant-scoped, retryable scheduled export that delivers portable customer data to a destination.

high
If the product intends scheduled data exports, introduce a backend schedule store (tenant-scoped), a scheduled runner/worker that batches export jobs, and an export execution pipeline that writes results to a configurable destination with retries + a dead-letter queue (DLQ).
- apps/desktop/src/lib/session-export.ts:1-57 — Current export is manual and download-based; it does not provide a scheduled runner or portable delivery mechanism.
high
Connect scheduling to the actual data-export handler so scheduled exports use the same tenant scoping, permission checks, and audit logging as any on-demand export endpoint would.
- gateway/platforms/api_server.py:620-910 — Cron/job support exists, but it is not wired to a customer data export handler/destination pipeline.
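
The "retries + a dead-letter queue" part of the scheduled-export recommendation can be sketched as below. This is an assumption-laden illustration: `run_with_retries`, the `handler` callable, and the list used as a DLQ are hypothetical and do not correspond to anything in `gateway/platforms/api_server.py`.

```python
import time


def run_with_retries(job, handler, dead_letters,
                     max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Run handler(job); retry with exponential backoff, then dead-letter."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(job)
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # Delay doubles each attempt: base, 2*base, 4*base, ...
                sleep(base_delay * 2 ** (attempt - 1))
    # All attempts failed: park the job for inspection instead of dropping it.
    dead_letters.append({"job": job, "error": repr(last_error),
                         "attempts": max_attempts})
    return None
```

Passing the on-demand export function as `handler` is one way to make scheduled runs share the same tenant scoping, permission checks, and audit logging as manual exports.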

Warehouse sync / reverse-ETL N/A

No Warehouse sync / reverse-ETL primitive is implemented in this codebase. Repo evidence discovery found no maintained warehouse connector configurations (dbt/airbyte/fivetran/singer/meltano-style) and no code path that performs incremental warehouse syncing (the codebase “sync” occurrences appear to be internal state/media sync, not reverse-ETL to a customer warehouse).

high

If warehouse sync is a product requirement for Hermes Agent, add a dedicated reverse-ETL layer: (1) connector configurations (dbt/airbyte/fivetran/singer/meltano-style) committed under a dedicated folder such as warehouse_sync_config, and (2) code that provisions and runs tenant-scoped incremental sync jobs with authorization checks and audit logging.
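
For orientation, "incremental sync" usually means shipping only rows changed since a stored high-water mark. The sketch below assumes rows carry `tenant_id` and a UTC ISO-8601 `updated_at` string (which sorts lexicographically); the function name and row shape are hypothetical.

```python
def incremental_batch(rows, tenant_id, watermark):
    """Return this tenant's rows changed after `watermark`, plus the new watermark."""
    fresh = [r for r in rows
             if r["tenant_id"] == tenant_id and r["updated_at"] > watermark]
    fresh.sort(key=lambda r: r["updated_at"])
    # Advance the watermark only when something was actually selected.
    new_watermark = fresh[-1]["updated_at"] if fresh else watermark
    return fresh, new_watermark
```

The caller would persist `new_watermark` only after the warehouse write succeeds, so a failed run is retried from the old position.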

In-product reporting / analytics 75%

This codebase includes a real in-product analytics/reporting module: the FastAPI server exposes /api/analytics/usage and /api/analytics/models, and the React dashboard page (AnalyticsPage) renders charts from those endpoints. However, the analytics appear geared toward a local, single-deployment context; nothing found implements tenant-scoped reporting or a portable export of the same data for customer analytics, warehouse, or exit use.

high
Define and implement a customer-data portability/export path for the analytics reports (e.g., export the same aggregates returned by /api/analytics/* as CSV/JSON), ensuring permission checks and audit logging on the export endpoint.
- hermes_cli/web_server.py:2450-2500 — Analytics are computed and returned for the UI, but this shows only the dashboard data-return mechanism (no export handler evidenced here).
med
Verify authorization + data scoping guarantees for analytics endpoints in the running auth middleware/gate (ensure they cannot leak another customer’s data if multi-tenant is expected).
- hermes_cli/web_server.py:2450-2500 — Endpoint reads directly from hermes_state SessionDB and returns aggregated results; scoping depends on how SessionDB is partitioned.
- hermes_cli/web_server.py:2500-2700 — Second analytics endpoint also reads from the same sessions data and returns per-model aggregates.
low
Add end-to-end tests that assert the analytics UI renders correctly from API responses (including response shape stability for daily/by_model/totals and models capabilities enrichment).
- web/src/pages/AnalyticsPage.tsx:1-220 — The UI visualization logic relies on AnalyticsDailyEntry fields (input_tokens/output_tokens/day); tests should cover these contracts.
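
For the first recommendation in this section (exporting the same aggregates the dashboard already receives), a CSV serializer can be very small. The sketch assumes daily entries are dictionaries with the `day`, `input_tokens`, and `output_tokens` fields named above; `daily_to_csv` is a hypothetical helper, and permission checks and audit logging would sit in the endpoint that calls it.

```python
import csv
import io

FIELDS = ["day", "input_tokens", "output_tokens"]


def daily_to_csv(daily):
    """Serialize daily usage entries to CSV, ignoring fields outside FIELDS."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for entry in daily:
        writer.writerow(entry)
    return buf.getvalue()
```

Pinning the column list in one place also gives the end-to-end tests a stable contract to assert against.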

Event stream completeness N/A

No auditable “event stream completeness” primitive is present for reporting or data-export events. The codebase clearly emits and dispatches events internally (e.g., the Ink UI event dispatcher/emitter), but there is no maintained, documented event catalog (event-name → schema/expectation) in-repo that could be diffed against the set of events the code emits. The off-graph doc scan surfaced only unrelated schema/XSD files in the export_event_schema bucket, not an event documentation map relevant to a reporting/export event stream.

high
Add/restore a real documented event catalog (asyncapi/openapi/events.md or equivalent) that enumerates the externally promised reporting/export event names and payload schemas, then wire the reporting/event emission layer to a single source of truth so you can diff “documented vs emitted” in CI.
- ui-tui/packages/hermes-ink/src/ink/events/emitter.ts:1-41 — Current emission is internal UI event handling; the repo needs an external reporting/export event system with a documented catalog to meet this primitive.
med
If the intended scope is actually the Ink terminal/UI event system, explicitly document the terminal/UI event types and their payloads, and define completeness expectations. Then compare emitted event types (call sites like emit/dispatch/track) against that documented set.
- ui-tui/packages/hermes-ink/src/ink/events/dispatcher.ts:1-243 — Ink’s dispatcher is the core event dispatch mechanism; documentation is needed to make “completeness vs docs” measurable.
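
The "documented vs emitted" diff suggested in both recommendations reduces to a set comparison once event names can be extracted. The sketch below uses a deliberately naive regular expression over source text; the pattern, the function names, and the assumption that events are emitted with string-literal names are all illustrative, not a description of how the Ink emitter is actually called.

```python
import re

# Matches emit('name'...), dispatch("name"...), track('name'...) call sites.
EMIT_CALL = re.compile(r"""\b(?:emit|dispatch|track)\(\s*['"]([\w.:-]+)['"]""")


def emitted_events(source_text):
    """Collect string-literal event names from emit/dispatch/track calls."""
    return set(EMIT_CALL.findall(source_text))


def catalog_drift(documented, emitted):
    """Report events missing from the docs and documented events never emitted."""
    return {
        "undocumented": sorted(emitted - documented),
        "unemitted": sorted(documented - emitted),
    }
```

A CI step could fail the build whenever either list is non-empty.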

Documented export / event schema 0%

No consumer-facing, documented export/event schema (e.g., asyncapi/openapi/event catalog with versioned payload definitions) was found. The codebase does implement data export (e.g., exportSession builds a JSON payload and downloads it), but the exported format is not accompanied by a maintained schema document that external consumers can depend on.

high
Add a versioned, documented JSON schema for the session export payload (exportSession), including the exact structure of `messages` returned by getSessionMessages, plus metadata fields like exported_at, session_id, title, message_count.
- apps/desktop/src/lib/session-export.ts:19-49 — Exports a concrete JSON object shape; document it as a stable schema for portability.
high
Publish (and keep in sync) a documented event catalog/schema for any externally consumable event stream. If events are internal-only, explicitly document that and avoid implying stable external payload contracts.
- ui-tui/packages/hermes-ink/src/ink/events/emitter.ts:14-39 — EventEmitter.emit forwards arbitrary args and has custom Event propagation semantics; a documented payload contract is required for external portability.
med
Add CI checks to prevent schema drift: generate schemas from the TypeScript types (where possible) or validate exported payload instances against the published schema during tests.
- apps/desktop/src/lib/session-export.ts:19-49 — Payload is constructed in code; schema validation/generation should be anchored to this source.
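
A drift check along the lines of the last recommendation could validate sample payloads against a pinned field map during tests. The field names below come from the recommendation text (`exported_at`, `session_id`, `title`, `message_count`, `messages`); the types, the added `schema_version` field, and the `validate_export` helper are assumptions, and a real setup would more likely use a published JSON Schema.

```python
# Assumed field -> type map; schema_version is a proposed addition, not current behavior.
EXPORT_SCHEMA_V1 = {
    "schema_version": str,
    "exported_at": str,
    "session_id": str,
    "title": str,
    "message_count": int,
    "messages": list,
}


def validate_export(payload):
    """Return a list of human-readable schema violations (empty when valid)."""
    errors = []
    for key, expected in EXPORT_SCHEMA_V1.items():
        if key not in payload:
            errors.append(f"missing field: {key}")
        elif not isinstance(payload[key], expected):
            errors.append(f"{key}: expected {expected.__name__}")
    extra = sorted(set(payload) - set(EXPORT_SCHEMA_V1))
    errors.extend(f"unexpected field: {k}" for k in extra)
    # Cross-field check only makes sense once the shape itself is valid.
    if not errors and payload["message_count"] != len(payload["messages"]):
        errors.append("message_count does not match len(messages)")
    return errors
```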

Export access control & audit 0%

No code-visible implementation of “Export access control & audit” was found. The only clearly identified export mechanism is the desktop-side `exportSession()` which fetches messages and downloads a JSON file, but it does not perform authorization/tenant scoping checks or write an audit log entry on the export path.

high
Move export authorization + audit logging to a server-side export handler (or server API call invoked by the desktop client): enforce the caller’s permission to export the specific tenant/session, ensure tenant isolation in the data query, and write an audit event that includes who exported what (session id / tenant id), from where, and when.
- apps/desktop/src/lib/session-export.ts:13-57 — Current export implementation directly calls `getSessionMessages(sessionId)` and triggers a browser download; there is no permission check, tenant scoping validation, or audit-log write in this export path.
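
The shape of a server-side export handler that checks permission, enforces tenant match, and records an audit event might look like the following. Everything here is hypothetical: the `actor` and `session` dictionaries, the `"export"` permission string, and the list used as an audit sink stand in for whatever auth and storage layers the product would actually use.

```python
import json
from datetime import datetime, timezone


class ExportDenied(Exception):
    """Raised when the caller may not export the requested session."""


def export_session(actor, session, audit_log, source_ip="unknown"):
    """Authorize, audit, then serialize one session for download."""
    allowed = (actor["tenant_id"] == session["tenant_id"]
               and "export" in actor["permissions"])
    # Record the attempt either way, so denied exports are also visible.
    audit_log.append({
        "event": "session.export",
        "actor_id": actor["id"],
        "tenant_id": session["tenant_id"],
        "session_id": session["id"],
        "source_ip": source_ip,
        "at": datetime.now(timezone.utc).isoformat(),
        "outcome": "allowed" if allowed else "denied",
    })
    if not allowed:
        raise ExportDenied(f"actor {actor['id']} may not export {session['id']}")
    return json.dumps({"session_id": session["id"],
                       "messages": session["messages"]})
```

The desktop client would then call this endpoint instead of assembling the payload itself, which keeps the audit record out of the user's control.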

Exit portability / no lock-in 22%

There is a real exit-portability mechanism: `hermes_cli/backup.py` provides `hermes backup`, which scans the entire local Hermes home directory and packages it into a zip for download/transfer. For SQLite-backed state, it takes a WAL-safe snapshot (`sqlite3.Connection.backup()`) before writing into the archive. However, no tracked contractual no-lock-in / data-portability terms were found in-repo (the contract clause is a hand-off item to the buyer’s GC). The desktop UI also supports exporting individual sessions to JSON, but the codebase’s “full account” exit export is primarily the CLI backup.

high
Locate and confirm the contractual no-lock-in/data-portability clause in tracked legal/terms artifacts (MSA, termination, and offboarding terms). Hand off to the buyer’s GC to verify that it explicitly permits full-account export before termination takes effect and that revocation or termination does not block retrieval.
- __external_contract/exit_terms:N/A — git_doc_scan reported `exit_terms` category as absent (0 files), so there is no in-repo contract evidence to cite here.
med
Add a clear, user-facing reference in the product/CLI help (or docs) stating that `hermes backup` is the official full-account export for exit portability, including what is included/excluded and how to restore (`hermes import`), so that users do not unknowingly extract an incomplete dataset.
- hermes_cli/backup.py:1-20 — The backup/import behavior is described in the module docstring, but this should be surfaced as authoritative product guidance for exit portability.
low
Consider adding an integrity/manifest step to the exported zip (e.g., checksum list + schema/version metadata) so users can validate portability and completeness after exit/transfer.
- hermes_cli/backup.py:78-176 — The zip is created and populated, but there is no manifest/integrity listing in the shown implementation segment.
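
The manifest idea in the last recommendation can be sketched with the standard library alone. This is not the existing `hermes backup` code: `write_archive`, `verify_archive`, the `MANIFEST.json` name, and the `format_version` field are hypothetical, and file contents are passed as bytes for brevity.

```python
import hashlib
import json
import zipfile

MANIFEST_NAME = "MANIFEST.json"  # hypothetical file name


def write_archive(files, fileobj, format_version="1"):
    """Write {name: bytes} into a zip plus a SHA-256 manifest of every entry."""
    manifest = {"format_version": format_version, "files": {}}
    with zipfile.ZipFile(fileobj, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
            manifest["files"][name] = hashlib.sha256(data).hexdigest()
        zf.writestr(MANIFEST_NAME, json.dumps(manifest, indent=2))


def verify_archive(fileobj):
    """Return a list of problems: missing, altered, or unlisted entries."""
    problems = []
    with zipfile.ZipFile(fileobj) as zf:
        manifest = json.loads(zf.read(MANIFEST_NAME))
        names = set(zf.namelist()) - {MANIFEST_NAME}
        for name, digest in manifest["files"].items():
            if name not in names:
                problems.append(f"missing: {name}")
            elif hashlib.sha256(zf.read(name)).hexdigest() != digest:
                problems.append(f"checksum mismatch: {name}")
        problems += [f"unlisted: {n}"
                     for n in sorted(names - set(manifest["files"]))]
    return problems
```

An empty list from `verify_archive` after transfer tells the user the archive is complete and unmodified relative to what was written.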

Not applicable to this codebase: Large / async export handling, Scheduled / recurring exports, Warehouse sync / reverse-ETL, Event stream completeness.