NousResearch/hermes-agent

github.com/NousResearch/hermes-agent · audited 2026-06-03 · commit 6420220

44% ERI composite

Hermes Agent is a well-built single-user agent application — a desktop app (apps/desktop) over a Python CLI core (hermes_cli). Run through the Enterprise Readiness Index, it scores a 44% composite, and the shape of that number is more interesting than the number itself.

Where it’s strong

The execution-velocity tier holds up. Implementation & Customization (78%), AI / Data Foundation (72%), Reliability Primitives (72%), Deployability (71%), and Performance Primitives (66%) all land in healthy territory: configuration-driven variation instead of per-customer branches, sane data handling, real deploy signals, and an engineering org without an obvious single-owner cliff (Engineering Org Resilience 64%).

Where the thesis breaks

The exit-cleanliness and enterprise-control dimensions score low — and that’s the honest signal, not a defect. Tenancy Isolation (0%), Audit / Governance / Residency (14%), Reporting & Data Export (16%), API & Extensibility (18%), and Procurement Readiness (21%) all reflect the same root cause: this codebase was never architected as a multi-tenant B2B platform. There are stray tenant_id fields (e.g. TeamsMeetingRef.tenant_id) and a narrow dashboard-auth audit log, but no row-level isolation, no cross-tenant tests, no append-only audit spine, no customer-facing export surface.

The read

For its actual purpose — a fast, local, single-operator agent — Hermes is in good shape. For an upmarket B2B thesis that assumes multi-tenancy and enterprise procurement, the gap is structural and would be a re-platform, not a sprint. The dimension breakdown below shows exactly which sites are expected versus found, with remediation linked to the audited commit.

T1 Thesis Viability

AI / Data Foundation

Versioned data pipelines, pinned model versions, and a real vector or feature store — not scattered cron jobs and model="latest".

72% 16/19 scored
  • Declarative, tested transformations 67%
    3/3 expected sites
  • Orchestrated pipelines 89%
    3/3 expected sites
  • Data quality validation / contracts 67%
    2/2 expected sites
  • Data + pipeline versioning 0%
    0/3 expected sites
  • Data lineage / provenance 89%
    3/3 expected sites
  • Feature management 58%
    3/4 expected sites
  • Vector / embedding store 75%
    4/4 expected sites
  • Model version pinning 33%
    1/2 expected sites
  • Prompt / model-call management 92%
    4/4 expected sites
  • Reproducibility / determinism 8%
    1/4 expected sites
  • AI output validation 100%
    2/2 expected sites
  • Grounding / wrongness check 67%
    1/1 expected sites
  • Self-correction / feedback loop 0%
    0/3 expected sites not present
  • Actionable diagnostics 100%
    2/2 expected sites
  • Positive confirmation 200%
    2/1 expected sites
  • Machine-readable contracts 100%
    4/4 expected sites
Declarative, tested transformations 67%

Declarative, tested transformations exist primarily through Hermes’ plugin system hooks (transform_llm_output / transform_tool_result / transform_terminal_output). The repo provides unit + integration tests that load real plugins from manifests and verify the transformation contract end-to-end (dispatch semantics, wiring of kwargs, replacement rules, truncation/redaction interactions, and exception fallback). The main potential gap is that the core application seam locations (e.g., run_agent/model_tools/terminal_tool boundaries) are not directly evidenced here as hook-invocation sites—though the tests strongly indicate the transformation layer is governed and validated.

  • high

    Add (or locate and reference) a direct code-evidence slice in run_agent.py/model_tools.py/terminal_tool.py showing exactly where each hook (transform_llm_output / transform_tool_result / transform_terminal_output) is invoked during production execution, so the primitive’s presence is proven at the critical seams (not just via tests).

    • run_agent.py:1-260 — run_agent.py is the expected LLM-output seam, but the provided evidence here is only module header range; hook invocation wiring lines need confirmation.
    • model_tools.py:1-260 — model_tools.py is the expected tool-result seam; the provided evidence here is only the module header range and does not yet show the hook call site.
    • tools/terminal_tool.py:1-260 — tools/terminal_tool.py is the expected terminal-output seam; the provided evidence here is only the module header range and does not yet show the hook call site.
  • med

    Ensure transformation assets are clearly versioned per plugin (e.g., plugin.yaml version + any compatibility constraints) and that each transformation hook has at least one dataset/boundary test case for empty outputs and malformed plugin return types (some exist, but consolidate per hook into a consistent suite).

  • low

    Document (in a short developer guide) the expected plugin contract for each transform hook (input kwargs, replacement semantics, return-type rules, truncation/redaction expectations) and point to the corresponding test files as the authoritative spec.

    • hermes_cli/plugins.py:120-260 — VALID_HOOKS enumerates transform hooks and describes replacement semantics at a high level, but a dedicated “contract + tests” doc would reduce reliance on reading tests.
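
The contract those tests exercise (a valid return replaces the text, while an exception or wrong return type falls back to the previous text) can be sketched as follows. This is a minimal illustration and not Hermes' dispatcher: `apply_transform_hooks`, the callback signature, and the three sample hooks are hypothetical.

```python
from typing import Callable, Iterable, Optional

# Hypothetical callback shape: receives the current text plus context kwargs
# and returns a replacement string, or None to leave the text unchanged.
Hook = Callable[..., Optional[str]]


def apply_transform_hooks(text: str, hooks: Iterable[Hook], **context) -> str:
    """Run each hook in order; keep the previous text if a hook misbehaves."""
    for hook in hooks:
        try:
            result = hook(text, **context)
        except Exception:
            # Exception fallback: a broken plugin must not break the caller.
            continue
        if isinstance(result, str):
            text = result  # replacement rule: a str return replaces the text
        # None or any non-str return value is ignored.
    return text


def redact(text, **_):
    return text.replace("secret", "[redacted]")


def broken(text, **_):
    raise RuntimeError("plugin bug")


def wrong_type(text, **_):
    return 42
```

A boundary test per hook would then assert that `broken` and `wrong_type` leave the text untouched while `redact` still applies.
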
Orchestrated pipelines 89%

This codebase contains an orchestrated, dependency-style pipeline implementation for the Teams meeting summary flow. It externalizes orchestration state using a durable store, persists step-by-step lifecycle statuses, classifies retryable vs terminal failures, and wires the pipeline into the gateway via an explicit scheduler callback.

  • high

    Add a first-class, queryable DAG/asset definition for the pipeline steps (e.g., a versioned pipeline manifest describing step graph, retry policy, and step inputs/outputs) and surface it for observability tooling. Right now, the step graph is implicit in control flow within `run_job`.

  • med

    Strengthen the retry mechanism into an explicit scheduler-backed retry loop (with bounded attempts and backoff scheduling persisted per job), rather than relying only on `retry_scheduled` status updates and eventual invocation.

    • plugins/teams_pipeline/pipeline.py:260-560 — `run_job` catches `TeamsPipelineRetryableError` and persists `retry_scheduled`, but the actual retry scheduling policy/worker loop is not shown in the orchestration hot path we inspected.
  • low

    Expose an auditable summary of pipeline runs (job_id, event_id/dedupe_key, step timestamps, last error) via a small API/CLI command that reads the durable store. This would improve operational observability and reproducibility.
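
One way to read the scheduler-backed retry recommendation is to persist the attempt count and next run time with the job, so a worker can poll for due retries. The sketch below assumes a hypothetical `jobs` table, a bound of three attempts, and a 30-second exponential backoff base; none of these come from the audited code, apart from the `retry_scheduled` status name.

```python
import sqlite3

MAX_ATTEMPTS = 3          # assumed bound on attempts
BASE_DELAY_SECONDS = 30   # assumed base for exponential backoff


def schedule_retry(db: sqlite3.Connection, job_id: str, now: float) -> str:
    """Persist the next retry time, or mark the job failed once attempts run out."""
    row = db.execute("SELECT attempts FROM jobs WHERE job_id = ?", (job_id,)).fetchone()
    attempts = row[0] + 1
    if attempts >= MAX_ATTEMPTS:
        status, next_run = "failed", None
    else:
        status = "retry_scheduled"
        next_run = now + BASE_DELAY_SECONDS * 2 ** (attempts - 1)
    db.execute(
        "UPDATE jobs SET attempts = ?, status = ?, next_run_at = ? WHERE job_id = ?",
        (attempts, status, next_run, job_id),
    )
    db.commit()
    return status


def due_jobs(db: sqlite3.Connection, now: float) -> list:
    """Return job IDs whose persisted retry time has arrived."""
    rows = db.execute(
        "SELECT job_id FROM jobs WHERE status = 'retry_scheduled' AND next_run_at <= ?",
        (now,),
    ).fetchall()
    return [r[0] for r in rows]


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, attempts INTEGER, status TEXT, next_run_at REAL)"
)
db.execute("INSERT INTO jobs VALUES ('job-1', 0, 'running', NULL)")
```

Because the schedule lives in the store rather than in process memory, a restarted worker can resume pending retries by calling `due_jobs`.
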

Data quality validation / contracts 67%

This codebase does have data-quality/contract-like validation layers, but they are not consistently applied as a single “data contracts” primitive across all ingestion boundaries. Strongest evidence is present for (1) tool JSON schema sanitization to prevent ingestion failures, and (2) file-tool input guards (size limits and blocked device paths) with unit tests that confirm rejection/quarantine behavior. Other ingestion-style boundaries (e.g., delivery routing inputs) appear to rely more on parsing and downstream logic than on a clearly governed contract gate.

  • high

    Introduce (or standardize) an explicit data-quality contract gate for delivery routing inputs in gateway/delivery.py—e.g., strict schema/validation for DeliveryTarget.parse inputs (and any content/metadata shape), with a quarantine/error return type that prevents malformed targets from reaching platform adapters.

    • gateway/delivery.py:1-120 — Delivery routing accepts and parses target strings; evidence available shows best-effort parsing without an explicit contract gate for malformed/unsafe routing inputs.
  • med

    Extend file_tools validation coverage to include comprehensive handler-level shape validation for all tool entrypoints (keys/types/ranges) in addition to existing path/device/size guards, and ensure each validation rule has a unit test that asserts rejection behavior.

    • tools/file_tools.py:1-520 — Existing guards cover some high-risk cases (blocked devices, size caps, path resolution); broader ingestion-boundary shape contracts appear partially covered but not comprehensively demonstrated in the sampled sections.
  • low

    Document the schema/validation contract pattern (what constitutes the “contract”, where it runs, what error payload looks like, and how quarantine is expressed) and reuse it across modules that accept LLM/tool inputs.

    • tools/schema_sanitizer.py:1-220 — Schema sanitization already follows a contract mindset; formalizing the pattern would help apply it consistently to other ingestion boundaries.
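
A contract gate for routing inputs could look roughly like the sketch below. The `platform:chat_id` target format, the allow-list, and the names `ParseResult` and `parse_target_strict` are assumptions made for illustration; the real `DeliveryTarget.parse` format was not part of the evidence.

```python
import re
from dataclasses import dataclass
from typing import Optional

ALLOWED_PLATFORMS = {"telegram", "discord", "slack"}  # assumed allow-list
_TARGET_RE = re.compile(r"^(?P<platform>[a-z]+):(?P<chat_id>[A-Za-z0-9_\-]{1,64})$")


@dataclass(frozen=True)
class ParseResult:
    ok: bool
    platform: Optional[str] = None
    chat_id: Optional[str] = None
    error: Optional[str] = None  # populated when the input is rejected


def parse_target_strict(raw: object) -> ParseResult:
    """Accept only well-formed targets; everything else comes back as an error."""
    if not isinstance(raw, str):
        return ParseResult(ok=False, error="target must be a string")
    match = _TARGET_RE.match(raw.strip())
    if match is None:
        return ParseResult(ok=False, error=f"malformed target: {raw!r}")
    if match["platform"] not in ALLOWED_PLATFORMS:
        return ParseResult(ok=False, error=f"unknown platform: {match['platform']}")
    return ParseResult(ok=True, platform=match["platform"], chat_id=match["chat_id"])
```

Returning a result object instead of raising keeps the rejection path explicit, so callers have to check `ok` before a target reaches a platform adapter.
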
Raw / immutable source layer N/A

I did not find an immutable “raw/source preserved unmodified” landing layer anywhere in the codebase. The only “raw/Raw…” hits are UI-level naming (e.g., `RawAnsi`) or data-fetch scripts that output processed/normalized artifacts directly, without a governed immutable raw layer for audit/reproducibility.

  • high

    Introduce a dedicated immutable raw landing layer for external fetches (e.g., Wikipedia/Wikidata, YouTube transcript, other tool ingestions): persist the exact HTTP response payload (and request parameters/headers + timestamps + source identifier) to a versioned store before any parsing/normalization; write transforms downstream from this stored raw artifact.

  • med

    Add an ingestion manifest/spec for the raw layer (schema + required fields, retention policy, and a deterministic reprocessing command that reads raw artifacts without re-fetching external sources).

  • low

    Rename or namespace UI-layer “raw” components (e.g., `RawAnsi`) to avoid confusion with the data-layer primitive name, and document clearly which “raw” refers to UI bytes vs. immutable source-data persistence.
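
A raw landing layer can be as small as a content-addressed directory that stores the fetched bytes unmodified next to a metadata sidecar. The following is a sketch under stated assumptions: `store_raw`, the directory layout, and the metadata fields are hypothetical, and the URL is a placeholder.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def store_raw(root: Path, source: str, url: str, payload: bytes, fetched_at: str) -> Path:
    """Write the payload bytes unmodified under their SHA-256; never overwrite."""
    digest = hashlib.sha256(payload).hexdigest()
    target = root / source / f"{digest}.bin"
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        target.write_bytes(payload)
        meta = {"source": source, "url": url, "sha256": digest, "fetched_at": fetched_at}
        target.with_suffix(".json").write_text(json.dumps(meta, sort_keys=True))
    return target


# Usage: parsing and normalization read from the stored artifact, not the network.
raw_root = Path(tempfile.mkdtemp())
body = b"<html>raw page</html>"
stored = store_raw(raw_root, "wikipedia", "https://example.org/page", body, "2026-06-03T00:00:00Z")
```

Keying files by content hash makes re-fetching identical content a no-op and lets a reprocessing command verify that an artifact has not changed since capture.
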

Data + pipeline versioning 0%

The codebase contains a strong reproducibility primitive via filesystem snapshotting (Checkpoint Manager using a shared shadow git store) and a durable pipeline state store for the Teams meeting pipeline. However, there is no clear evidence that data + pipeline versions are tightly coupled and recorded for each pipeline release/job (e.g., no explicit pipeline-version/data-provenance manifest tying code version to specific input/output versions).

  • high

    Add explicit pipeline versioning metadata capture and persistence: when a checkpoint/snapshot is created for a pipeline run, store (and persist into the pipeline store) the pipeline code version (git commit hash of the pipeline module), configuration hash, and the input artifact/version identifiers that determine the produced outputs.

  • med

    Extend TeamsPipeline job/sink records to include deterministic identifiers: pipeline logic hash, input artifact keys (meeting artifact/transcript/audio versions), and output schema/version. Persist these at upsert_job/upsert_sink_record time so later runs can replay exactly.

  • low

    Add tests that verify reproducibility: given the same pipeline version + captured input versions, outputs should be identical (or produce the same sink record identifiers).
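
The manifest described in these recommendations might be assembled as in the sketch below. The function names and manifest fields are hypothetical; the only design point being illustrated is hashing a canonical encoding of the configuration so that key order does not change the hash.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a config dict using a canonical JSON encoding (sorted keys)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_run_manifest(code_version: str, config: dict, input_versions: dict) -> dict:
    """Tie code version, configuration, and input versions into one record."""
    return {
        "code_version": code_version,
        "config_hash": config_hash(config),
        "input_versions": dict(sorted(input_versions.items())),
    }


# Two runs with the same pinned inputs should produce identical manifests.
first = build_run_manifest("6420220", {"lang": "en", "max_len": 500}, {"transcript": "v3"})
second = build_run_manifest("6420220", {"max_len": 500, "lang": "en"}, {"transcript": "v3"})
```

Persisting such a record alongside each job would give the reproducibility tests something concrete to compare.
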

Data lineage / provenance 89%

Data lineage/provenance exists in this codebase primarily as durable conversation/session provenance in the SQLite-backed Hermes state store (not as an external lineage standard like OpenLineage/DataHub). Sessions link derivations via `parent_session_id`, messages store tool-call identifiers and timestamps, and retrieval tooling (session_search_tool) reconstructs lineage roots and corrects anchor rebinding for consistency. An observability plugin (Langfuse) provides external trace emission, but the lineage primitive here is more “conversation/tool provenance” than “dataset lineage” for data pipelines.

  • high

    Audit and document the end-to-end provenance emission path (session creation → message insertion → tool call linkage) to ensure lineage is always emitted for every transformation/derivation event, and that identifiers (session_id/task_id/tool_call_id) are consistently propagated across modules.

    • hermes_state.py:230-320 — Provenance fields exist in schema (`parent_session_id`, tool call fields), but the wiring that guarantees they are always populated should be verified/standardized at the write points.
  • med

    Add a machine-queryable provenance export (e.g., JSON export of a session lineage graph or an internal API endpoint) so lineage can be validated without requiring consumers to understand internal DB semantics.

    • tools/session_search_tool.py:1-170 — Current lineage correctness is enforced during retrieval; providing a first-class export would externalize lineage for change-management review.
  • low

    If external observability is the intended governed provenance system, extend/confirm that trace metadata includes all lineage-critical ids (session lineage root, parent/child relationships, tool_call_id/session_id mapping) consistently across all trace/span creation points.
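
Reconstructing a lineage root from `parent_session_id` links amounts to walking the parent chain. The sketch below uses a deliberately reduced, hypothetical `sessions` table with only an `id` and a `parent_session_id` column; the real schema in hermes_state.py has more fields, and `lineage_chain` is an illustrative name.

```python
import sqlite3


def lineage_chain(db: sqlite3.Connection, session_id: str) -> list:
    """Return [session_id, parent, ..., root], stopping if a cycle is detected."""
    chain, seen = [], set()
    current = session_id
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        row = db.execute(
            "SELECT parent_session_id FROM sessions WHERE id = ?", (current,)
        ).fetchone()
        current = row[0] if row else None
    return chain


lineage_db = sqlite3.connect(":memory:")
lineage_db.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, parent_session_id TEXT)")
lineage_db.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [("a", None), ("b", "a"), ("c", "b")],
)
```

A JSON export of this chain per session would be one way to make lineage checkable without knowledge of the internal tables.
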

Feature management 58%

This codebase has a centralized feature-management layer for Tool Gateway entitlements: `hermes_cli/nous_subscription.py` computes governed, structured feature state (available/active/provider/managed) and is backed by unit tests. Other surfaces (CLI tool config + portal status) import and use this computation, reducing training/serving-style skew risk. However, the available audit evidence shows only that consumers import this computation, not the full wiring where each consumer uses the computed state to gate runtime behavior, which is why the score falls short of full marks.

  • high

    Verify end-to-end wiring that every runtime decision/surface using features (especially those affecting which tools/skills are exposed to the agent) derives from `get_nous_subscription_features`/`apply_nous_managed_defaults`, rather than re-implementing entitlement logic. Add/extend tests that assert consistent feature gating across the main execution entrypoints.

  • med

    Add a single “contract test” that compares feature computation inputs/outputs across all entrypoints that call it (e.g., CLI tools selection, portal status, agent runtime). This prevents drift when new features/backends are added.

Vector / embedding store 75%

The codebase includes a persisted vector/embedding-like store for its memory system: HRR vectors are stored in SQLite tables (`facts.hrr_vector` and `memory_banks.vector`) and are recomputed on ingestion/rebuild. Retrieval code uses these persisted vectors for similarity scoring rather than keeping everything ephemeral in process memory. However, the implementation is not a managed embedding store governed by model and content version; it persists vectors locally without explicit linkage to a producing model version (beyond parameters like HRR dimension).

  • high

    Add explicit version governance to the persisted embeddings: store an embedding-config version (e.g., vector type + embedding parameters + producing model identifier/version if any) alongside each vector/bank, and refuse or automatically rebuild vectors when the producing configuration changes.

  • med

    If the intent is to satisfy the “managed, queryable, versioned store” requirement, replace/augment the local SQLite vector persistence with a dedicated vector DB interface (or at least isolate it behind a vector-store abstraction that exposes upsert/query/delete and versioned namespaces).

  • med

    Add an automated freshness/consistency check at query time (or before retrieval): verify that vector banks exist and match the current embedding configuration; otherwise trigger `rebuild_all_vectors` (bounded, logged) and fall back safely.

Model version pinning 33%

Model identity pinning is partially supported: the agent’s public constructor takes an explicit `model` string and threads it into initialization, but direct enforcement that the string is not a floating alias (e.g., rejecting `latest`/`stable` as runtime model IDs) was not found in the inspected model-call wiring. The codebase includes careful model-string parsing/handling for tags like `latest`/`stable`, but pinning enforcement at invocation appears incomplete based on observed call-path slices.

  • high

    Add an explicit guardrail at the model-identity ingress point (agent init / request kwargs build): detect `model` values that end with or equal floating aliases (e.g., `latest`, `stable`) and either (a) reject with a clear error or (b) resolve to a pinned concrete ID via a version resolver with persisted results for reproducibility.

    • run_agent.py:1080-1125 — This is the choke point where model identity is first accepted and forwarded; pinning validation should occur here or immediately after.
  • med

    Ensure the resolved/pinned model ID (post-alias-resolution) is the value stored in session state / logs (e.g., the session DB row) and used for the API request, not the user-provided alias string.

    • run_agent.py:1080-1098 — Session/model fields are already tracked (model identity is present on the agent); align the stored value with the pinned/concrete ID after validation/resolution.
  • low

    Add unit tests that explicitly cover `model='latest'` and `model='stable'` for each provider path (OpenAI-compatible, OpenRouter, Anthropic, local/Ollama), asserting deterministic behavior (reject or resolve-to-pinned).

    • agent/model_metadata.py:35-55 — Model string parsing currently recognizes `latest`/`stable` tags; tests should verify how this impacts runtime model selection and reproducibility guarantees.
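
The reject branch of the guardrail could be as simple as the sketch below. `assert_pinned_model` is a hypothetical helper, and treating `:`, `/`, `@`, and `-` as tag separators is an assumption about model ID formats across providers.

```python
import re

FLOATING_ALIASES = {"latest", "stable"}


def assert_pinned_model(model: str) -> str:
    """Return the model ID unchanged, or raise if it equals or ends with a floating alias."""
    normalized = model.strip().lower()
    # Assumption: the final segment after ':', '/', '@' or '-' is the version tag.
    tail = re.split(r"[:/@\-]", normalized)[-1]
    if tail in FLOATING_ALIASES:
        raise ValueError(
            f"model {model!r} uses floating alias {tail!r}; pin a concrete version"
        )
    return model
```

Under these assumptions, `assert_pinned_model("gpt-4o-2024-08-06")` passes through unchanged, while `"llama3:latest"` and a bare `"stable"` raise `ValueError`. The resolve-to-pinned alternative would replace the `raise` with a lookup whose result is persisted.
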
Prompt / model-call management 92%

The codebase has a clear prompt governance layer: prompt fragments live in agent/prompt_builder.py, the assembled system prompt is constructed in agent/system_prompt.py (with explicit caching described), and specialized prompts for background review are centralized in agent/background_review.py. The conversation loop further centralizes system-prompt cache restore/build via _restore_or_build_system_prompt(), helping prevent prompt literal drift near model call sites.

  • high

    Audit remaining model-provider call paths (e.g., where chat/completions/responses are invoked) to ensure they always consume the governed system prompt built/restored by the system_prompt + conversation_loop layers, and do not inline literal prompt strings near the SDK calls.

  • med

    Where review or specialized prompts must evolve, add automated snapshot/equality tests to detect prompt drift across variants (e.g., memory vs combined review prompt) and confirm the background-review prompt strings remain the only source-of-truth.

    • agent/background_review.py:1-220 — All background review prompt strings are centralized here; add/extend tests to lock them down against accidental duplication elsewhere.
  • low

    For the largest prompt fragments (identity/guidance blocks), consider moving the biggest text literals into versioned markdown/template assets (if not already done elsewhere) and have prompt_builder load them, so changes remain even more diffable and reviewable than code-only constants.

    • agent/prompt_builder.py:1-220 — Many guidance strings are currently embedded as Python constants; externalizing to versioned text assets would further strengthen diffability.
Reproducibility / determinism 8%

This codebase contains partial determinism/reproducibility primitives: deterministic tool-call IDs (to avoid UUID-driven cache invalidation) and a trajectory persistence utility for recording conversation outcomes. However, at run/turn boundaries there is no evidence of a fully captured “replay manifest” (pinned model/provider identifiers, generation parameters, preprocessing settings, and RNG seeds) sufficient to recreate runs exactly from pinned inputs. The trajectory evidence trail exists, but it appears incomplete for determinism.

  • high

    Add a run-boundary “repro manifest” captured alongside each trajectory/run: exact model ID/provider/base_url, generation params (temperature/top_p/any seed), preprocessing/compression parameters, and the specific code/data versions used. Store it in the same output directory (or JSONL alongside trajectory entries) and include schema validation for required fields.

    • agent/trajectory.py:1-57 — Trajectory entries currently record only `timestamp`, `model`, `completed`, and `conversations`—missing deterministic replay inputs (seed/generation params/preprocessing settings).
    • agent/conversation_loop.py:1-220 — Core per-turn execution lives here; this is where determinism-affecting generation parameters should be captured/propagated into the trajectory manifest.
  • med

    Version and document the determinism-critical hashing inputs for tool-call IDs (argument serialization normalization). Ensure the exact serialization function and its version are included in persisted artifacts so that cached/deterministic IDs remain stable across code changes.

    • agent/codex_responses_adapter.py:128-164 — Deterministic call IDs depend on `arguments` (as a string) and `index`; without explicit serialization normalization/versioning captured, the same logical tool call might not hash identically across runs.
  • low

    Reduce nondeterministic fields in replay artifacts where possible (e.g., keep `timestamp` for auditing, but ensure a separate `run_id` is derived deterministically from pinned inputs).

    • agent/trajectory.py:1-57 — Uses `datetime.now().isoformat()` for trajectory entries, which is not deterministic; consider adding deterministic IDs derived from the repro manifest.
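
The serialization concern behind deterministic tool-call IDs can be made concrete: two argument strings that encode the same JSON object should hash to the same ID. The sketch below is not the adapter's actual algorithm; `SERIALIZER_VERSION`, the `call_` prefix, and both function names are hypothetical.

```python
import hashlib
import json

SERIALIZER_VERSION = "v1"  # hypothetical; bump when the canonical form changes


def canonical_arguments(arguments: str) -> str:
    """Re-encode a JSON argument string with sorted keys and no extra whitespace."""
    try:
        return json.dumps(json.loads(arguments), sort_keys=True, separators=(",", ":"))
    except ValueError:
        # Not valid JSON: fall back to the raw string so an ID can still be derived.
        return arguments


def deterministic_call_id(name: str, arguments: str, index: int) -> str:
    """Derive a stable ID from the tool name, canonical arguments, and call index."""
    payload = f"{SERIALIZER_VERSION}|{name}|{canonical_arguments(arguments)}|{index}"
    return "call_" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:24]
```

Including the serializer version in the hashed payload means a change to the canonical form yields visibly different IDs instead of silent cache misses.
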
AI output validation 100%

The codebase has a strong, centralized AI output validation primitive for LLM auxiliary calls: `_validate_llm_response` enforces the expected `.choices[0].message` shape and throws a clear `RuntimeError`. `call_llm` consistently applies this validator immediately after each model invocation and reuses it across retries/fallback paths, preventing raw/unvalidated model payloads from flowing downstream.

  • high

    Create/extend tests that intentionally return malformed LLM payloads (e.g., `response=None`, `response.choices=[]`, or missing `.message`) and assert that each retry path also fails via `_validate_llm_response` with the same error wording (no open-loop: retries must reuse the same schema gate).

    • agent/auxiliary_client.py:4865-4955 — Validation error message and failure conditions are defined here; tests should assert these exact behaviors across retries.
    • agent/auxiliary_client.py:4955-5300 — `call_llm` wraps model calls with `_validate_llm_response(...)` across multiple retry/fallback branches; tests should cover at least one malformed-payload trigger per major branch.
  • med

    Audit the main agent LLM call path(s) (non-auxiliary) to ensure the same (or equivalent) validator is applied right after model invocations, not only for auxiliary routing.
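
The malformed-payload tests suggested above can be prototyped against a stand-in validator. The sketch below mirrors the described behavior (require `.choices[0].message`, raise `RuntimeError` otherwise) but is not the code in agent/auxiliary_client.py; `validate_response_shape` and the error wording are hypothetical.

```python
from types import SimpleNamespace


def validate_response_shape(response):
    """Return the first choice's message, or raise RuntimeError on a malformed payload."""
    choices = getattr(response, "choices", None)
    if not choices:
        raise RuntimeError("LLM response has no choices")
    message = getattr(choices[0], "message", None)
    if message is None:
        raise RuntimeError("LLM response choice has no message")
    return message


# Payload shapes named in the recommendation: None, empty choices, missing message.
MALFORMED = [
    None,
    SimpleNamespace(choices=[]),
    SimpleNamespace(choices=[SimpleNamespace()]),
]
WELL_FORMED = SimpleNamespace(choices=[SimpleNamespace(message="hello")])
```

Each retry or fallback branch would be driven with one entry from `MALFORMED` and asserted to fail through the same gate.
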

Grounding / wrongness check 67%

This codebase has an output-grounding/wrongness check for structured plugin LLM calls: `PluginLlm.complete_structured()` enforces JSON parsing and (when a schema is provided) validates the parsed output against `json_schema` using `jsonschema`, failing closed on mismatch. However, beyond this structured path, I did not find evidence of a broader “claim-by-claim context grounding/judge” loop for free-form assistant text responses.

  • high

    Add a general grounding/wrongness check for free-form LLM outputs that are surfaced to users or used for actions: introduce a judge-based or context-based verification step (or retrieval-backed citation verification) and enforce a bounded re-check/retry policy on failure.

    • agent/plugin_llm.py:604-746 — Current enforcement is only for structured outputs. This is evidence of the existing check pattern, which can be extended to cover non-structured/free-form claims.
  • med

    For structured paths, ensure failure modes are explicit and observable (e.g., include the validation error details in the audit trail and/or add a deterministic fallback output schema on validation failure).

    • agent/plugin_llm.py:604-746 — Validation failures raise `ValueError` with a message, but the surrounding retry/fallback behavior is not shown in the slice; improving audit + fallback would strengthen closed-loop safety.
Self-correction / feedback loop 0%

No closed self-correction/feedback loop was found. The code detects judge-output parse/validation failures and uses fail-open + bounded pausing, but it does not feed the specific error back to the model for a re-check within the same validation path.

  • high

    Implement a closed retry loop inside GoalManager/judge_goal for judge contract failures: when _parse_judge_response() reports parse_failed (empty/non-JSON), re-prompt the same judge model with an error-specific instruction (e.g., 'Output exactly one JSON object with keys done and reason; your previous output was <reason snippet>'). Bound attempts (e.g., 2-3) before falling back to the current auto-pause behavior.

    • hermes_cli/goals.py:225-360 — parse failure is detected and returned as parse_failed, but the failure details are not used to create an error-fed-back next judge prompt.
    • hermes_cli/goals.py:600-705 — GoalManager.evaluate_after_turn() currently only pauses/limits after repeated parse failures, rather than retrying with the error fed back to the judge model.
  • med

    Add targeted tests that assert the loop is closed: when the judge output is non-JSON for the first attempt, the second attempt must include the specific failure and must re-run parsing. (For example, unit tests around judge_goal/_parse_judge_response with a mocked auxiliary client returning controlled invalid outputs.)

    • hermes_cli/goals.py:225-360 — The parse_failed contract provides the data needed to craft error-specific feedback, so tests can verify the prompt augmentation and re-parse.
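
A closed loop of the kind recommended here feeds the parse error back into the next judge prompt and bounds the number of attempts. The sketch below is illustrative only: `judge_with_feedback`, the `ask_judge` callable, and the three-attempt bound are assumptions, and the real `judge_goal` and `_parse_judge_response` signatures were not inspected.

```python
import json
from typing import Callable

MAX_JUDGE_ATTEMPTS = 3  # assumed bound before falling back to auto-pause


def judge_with_feedback(ask_judge: Callable[[str], str], prompt: str) -> dict:
    """Re-prompt the judge with the specific parse error until it returns valid JSON."""
    current_prompt = prompt
    for attempt in range(1, MAX_JUDGE_ATTEMPTS + 1):
        raw = ask_judge(current_prompt)
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict) and {"done", "reason"} <= set(parsed):
                return {"parse_failed": False, "attempts": attempt, **parsed}
            error = "JSON must be an object with keys 'done' and 'reason'"
        except json.JSONDecodeError as exc:
            error = f"output was not valid JSON ({exc.msg})"
        # Close the loop: the next prompt carries the reason for the rejection.
        current_prompt = (
            f"{prompt}\n\nYour previous output was rejected: {error}. "
            "Output exactly one JSON object with keys done and reason."
        )
    return {"parse_failed": True, "attempts": MAX_JUDGE_ATTEMPTS}
```

With a mocked `ask_judge` that returns invalid output first and valid JSON second, a test can assert both that the second prompt contains the rejection reason and that parsing was re-run.
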
Evaluation harness + scoring N/A

I did not find an “Evaluation harness + scoring” primitive in this codebase. There are some benchmark/unit tests for a specific runtime evaluation path (browser CDP evaluation), but no offline golden set with automated scoring, no eval runner that logs inputs/outputs, and no evidence of recurring production eval/scoring distinct from the per-request execution loop.

  • high

    Add a dedicated eval layer (e.g., an `evals/` package + CLI entrypoint) that runs an offline golden set, scores outputs with explicit metrics/rubrics, and logs results (inputs, model versions, prompts, outputs, scores, and pass/fail) to a persistent store.

  • med

    Instrument the production request/agent loop to emit structured “eval candidates” (prompt/input + model version + tool context + ground-truth key/labels when available) into the logging store, but keep eval execution in the separate eval layer.

  • low

    Optionally integrate an eval framework dependency (e.g., promptfoo/deepeval/langsmith/ragas) only after establishing the repo’s own artifact structure (golden set format, scorer definitions, and logging schema).

Runnable correctness checks N/A

I did not find any documented, one-command runnable pass/fail correctness-check entrypoint for this codebase (e.g., a CI workflow or root-level `test`/`check`/`build` command wired to return an unambiguous green/red status). While the repo contains Python test files (e.g., under `tests/` and `skills/.../tests/`), the required governance layer that makes correctness checks trivially runnable and externally verifiable from a single command was not located.

  • high

    Add or document a single root command (e.g., `pytest` invocation or `make test`/`just test`) that runs the existing test suite and returns a clear pass/fail exit code; ensure it covers the agent-facing correctness scope (setup/config flows + any workflow logic with mocks).

Actionable diagnostics 100%

The codebase includes a strong “actionable diagnostics” primitive via `hermes_cli/kanban_diagnostics.py`, which produces structured diagnostics with fix-oriented `actions`, and via `agent/lsp/reporter.py`, which externalizes LSP diagnostics with severity and exact line/column positioning. Both are runnable/consumable outputs rather than implicit logs or ad-hoc strings.

  • high

    Audit other failure-producing surfaces (e.g., CLI “status” outputs, update/check commands, and any tool preflight errors) to ensure they consistently emit structured diagnostics with (1) a stable diagnostic code/kind, (2) precise location/context when applicable, and (3) explicit operator actions/hints—not just error strings or stack traces. Reuse the existing `Diagnostic`/`DiagnosticAction` patterns where possible.

Positive confirmation 200%

The codebase contains a clear instance of positive confirmation: `tools/terminal_tool.py::check_terminal_requirements()` returns an explicit boolean success/failure signal (not only log messages), and `tests/tools/test_terminal_requirements.py` asserts on that success signal with runnable pytest pass/fail conditions.

  • high

    Search for other operational gates (e.g., startup readiness checks, backend selection checks, tool availability checks) and ensure they also expose an explicit positive success signal (boolean/structured status) and have corresponding tests asserting the success case.

  • med

    If CI/workflow files exist elsewhere in the repository but were not indexed by the code-graph query, add/verify a documented one-command test run that guarantees an unambiguous pass/fail outcome (positive confirmation at the repo level).

Machine-readable contracts 100%

This codebase does have machine-readable contracts: it externalizes tool parameter expectations as JSON schema (e.g., computer_use), and it provides explicit schema sanitizer/translation modules for backend/provider compatibility (generic sanitizer + Gemini + Moonshot). The presence of focused schema modules plus targeted tests indicates the contracts are managed rather than implicit.

  • high

    Add (or confirm) a single registry/manifest that enumerates all tool contracts (e.g., where schemas live, how to load them, and their versions), so an agent can query the available contracts without knowing file locations.

  • med

    Ensure the schema contract assets are consistently referenced/validated at the point where tool schemas are emitted to each provider (i.e., confirm call sites always use the sanitizers rather than duplicating shape assumptions).

Not applicable to this codebase: Raw / immutable source layer, Evaluation harness + scoring, Runnable correctness checks.

Tenancy Isolation

A tenant_id on every business table, row-level security in the database, and tests that prove a cross-tenant request returns 403.

0% 6/12 scored
  • Tenant key on every record 0%
    0/1 expected sites
  • Cache key namespacing 0%
    0/2 expected sites not present
  • Object/blob partitioning 0%
    0/3 expected sites not present
  • Per-tenant resource limits 0%
    0/2 expected sites not present
  • Tenant-scoped key management 0%
    0/1 expected sites not present
  • Cross-tenant isolation tests 0%
    0/4 expected sites not present
Tenant key on every record 0%

This codebase has tenant identifiers in some business-layer models (notably `TeamsMeetingRef.tenant_id`) and propagates them when normalizing Teams meeting data. However, the tenancy primitive does not appear to be applied consistently across related business records: for example, `MeetingArtifact` lacks a tenant key on the record itself.

  • high

    Add `tenant_id` (or an appropriate tenant/org/workspace FK) to all other Teams pipeline record types derived from meetings (e.g., `MeetingArtifact`) and ensure normalization functions populate it from the source `TeamsMeetingRef` or Graph payload.

  • med

    Audit other domain models/dataclasses that represent persistable business records for missing tenant identifiers (not just Teams pipeline), and add a lightweight invariant/test that asserts `tenant_id` is present on all records in those model modules.
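
The lightweight invariant suggested above could be sketched as follows. This is a minimal illustration, not the project's code: the two dataclasses are simplified stand-ins for `TeamsMeetingRef` and `MeetingArtifact`, and `missing_tenant_key` is a hypothetical helper name.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class TeamsMeetingRef:  # simplified stand-in, not the real model
    meeting_id: str
    tenant_id: str


@dataclass
class MeetingArtifact:  # simplified stand-in, shown with tenant_id added
    artifact_id: str
    tenant_id: str


def missing_tenant_key(record_types):
    """Return the names of dataclasses that lack a tenant_id field."""
    missing = []
    for cls in record_types:
        if not is_dataclass(cls):
            continue  # only dataclass record types are checked in this sketch
        if "tenant_id" not in {f.name for f in fields(cls)}:
            missing.append(cls.__name__)
    return missing
```

A test module could call `missing_tenant_key` on every record type in a model module and assert that the result is empty, so a new record type without a tenant key fails CI.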

Database-enforced isolation N/A

No database-enforced (row-level security / FORCE RLS / tenant-scoped schema) isolation primitive is present. The codebase uses shared SQLite databases for state/session data and uses filesystem-per-board SQLite DBs for kanban separation, but there is no evidence of tenant/org row-level filtering or enforced DB policies that would prevent cross-tenant reads if application code forgot the filter.

  • high

    If the system is intended to be multi-tenant, add a tenant identifier column (e.g., tenant_id/org_id) to each shared, writable table in the DB layer (starting with sessions/messages and any other shared persistence). Then enforce access with database mechanisms. SQLite has no native row-level security, so this likely means moving to a database that supports RLS (e.g., PostgreSQL). Ensure policies are FORCE/mandatory so table owners cannot bypass them.

    • hermes_state.py:220-520 — Shared `sessions` and `messages` tables are defined without any tenant/org column, so there is no place for tenant-scoped DB policies to apply.
    • hermes_cli/kanban_db.py:1-220 — Kanban isolation is achieved via per-board separate SQLite files; this does not provide the requested defense-in-depth against missing tenant filters within a shared DB table.
  • med

    Add automated integration tests that attempt cross-tenant reads and list/export of resources by crafting a request with a different tenant identity (or by directly querying the DB without applying tenant filters) and assert the access is denied / rows are not returned.

    • hermes_state.py:220-520 — Current schema and design imply no tenant boundary exists at the persistence layer for state/session data, so cross-tenant tests should be introduced to validate the new enforcement.
Default-scoped queries N/A

I did not find an implementation of “default-scoped queries” (i.e., a data-access base model/repository that automatically applies tenant scoping to every query when no tenant filter is provided). The codebase appears to rely on isolation via partitioning (e.g., different SQLite DB files per Kanban board) rather than default-scoped query enforcement in the data-access layer, so this primitive is not applicable/present in this repository.

  • med

    If this system is expected to be multi-tenant at the database row level, introduce a tenant-aware data-access layer (base repository/model with an implicit tenant predicate) and add tests that attempt cross-tenant reads/lists/exports without specifying tenant filters.

    • hermes_cli/kanban_db.py:1-60 — Current isolation approach is board/DB-path partitioning; there is no indication of query-level default scoping in a shared repository/base model.
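
A tenant-aware data-access layer of the kind described above might look like the sketch below. It assumes a hypothetical `sessions` table that already has a `tenant_id` column (the audited schema does not), and `TenantScopedSessions` and `make_demo_db` are illustrative names only.

```python
import sqlite3


class TenantScopedSessions:
    """Hypothetical repository: every query carries the tenant predicate."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._conn = conn
        self._tenant_id = tenant_id

    def add(self, session_id: str, title: str) -> None:
        self._conn.execute(
            "INSERT INTO sessions (id, tenant_id, title) VALUES (?, ?, ?)",
            (session_id, self._tenant_id, title),
        )

    def get(self, session_id: str):
        # Returns None both for missing rows and for other tenants' rows.
        return self._conn.execute(
            "SELECT id, title FROM sessions WHERE tenant_id = ? AND id = ?",
            (self._tenant_id, session_id),
        ).fetchone()

    def list_ids(self):
        rows = self._conn.execute(
            "SELECT id FROM sessions WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        ).fetchall()
        return [row[0] for row in rows]


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sessions ("
        "id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL, title TEXT)"
    )
    return conn
```

Because callers never write the tenant filter themselves, a forgotten `WHERE tenant_id = ?` cannot happen at a call site. This is application-level scoping only, and it does not replace database-enforced policies.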
Tenant context at the boundary N/A

Agent produced no parseable output for this item.

Cache key namespacing 0%

No evidence of a cache-key namespacing primitive (tenant-prefixed cache keys like `tenant:{id}:...`) was found. Cache implementations observed include (1) a shared on-disk sticker cache keyed only by `file_unique_id`, and (2) a shared in-process memoization cache keyed only by config path/mtime—neither includes any tenant component.

  • high

    Introduce tenant-aware cache key namespacing (or per-tenant cache partitioning) in `gateway/sticker_cache.py` so cached sticker descriptions cannot be read across tenants when multiple tenants share the same Hermes home/process.

  • high

    Partition the module-level in-process memoization cache in `agent/skill_utils.py` by tenant (e.g., include tenant id in the cache key tuple, or create separate caches per tenant context).

    • agent/skill_utils.py:290-383 — The cache key is `(str(config_path), stat.st_mtime_ns)` and the cache is a global dict `_EXTERNAL_DIRS_CACHE`, with no tenant component.
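
One way to add a tenant component to a memoization key is sketched below. The cache dict and helper names are hypothetical. Only the `(config_path, mtime_ns)` shape comes from the finding above.

```python
from typing import Dict, List, Optional, Tuple

CacheKey = Tuple[str, str, int]

# Hypothetical replacement for a global cache keyed only by path and mtime.
_TENANT_CACHE: Dict[CacheKey, List[str]] = {}


def cache_key(tenant_id: str, config_path: str, mtime_ns: int) -> CacheKey:
    """Build a key whose first element is always the tenant."""
    if not tenant_id:
        raise ValueError("tenant_id is required for cache access")
    return (tenant_id, config_path, mtime_ns)


def put_cached(tenant_id: str, config_path: str, mtime_ns: int,
               value: List[str]) -> None:
    _TENANT_CACHE[cache_key(tenant_id, config_path, mtime_ns)] = value


def get_cached(tenant_id: str, config_path: str,
               mtime_ns: int) -> Optional[List[str]]:
    return _TENANT_CACHE.get(cache_key(tenant_id, config_path, mtime_ns))
```

Refusing an empty tenant id at key-construction time means a code path that has lost its tenant context fails loudly instead of silently sharing an entry.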
Object/blob partitioning 0%

No evidence was found that object/blob storage artifacts are partitioned by tenant. The clearest persistence mechanism (`tools/tool_result_storage.py`) writes oversized tool outputs into a shared sandbox directory using only `tool_use_id` in the filename, with no tenant/org/workspace component in the storage path.

  • high

    Tenant-scope persisted tool-result storage paths by default. For example, change `remote_path` to include a tenant-derived prefix (e.g., `.../{tenant_id}/{tool_use_id}.txt`) and update any corresponding read/retrieval logic to require the same tenant prefix.

  • med

    Make `_resolve_storage_dir()` return a tenant-scoped base directory (or accept tenant_id and append it), so the entire persistence stack is isolated even when called from new tool code paths.

  • med

    Add integration tests that attempt cross-tenant retrieval of persisted artifacts (e.g., persist a large tool output under tenant A, then try to read it while authenticated as tenant B, asserting denial/not-found).
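
The tenant-prefixed storage path suggested above could be sketched like this. The function names and the allowed identifier pattern are assumptions, not the behavior of `tools/tool_result_storage.py`.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed identifier format; rejects separators and ".." path components.
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def tenant_result_path(base_dir: Path, tenant_id: str, tool_use_id: str) -> Path:
    """Build <base_dir>/<tenant_id>/<tool_use_id>.txt from validated parts."""
    for value in (tenant_id, tool_use_id):
        if not _SAFE_ID.fullmatch(value):
            raise ValueError(f"unsafe path component: {value!r}")
    return base_dir / tenant_id / f"{tool_use_id}.txt"


def write_result(base_dir: Path, tenant_id: str, tool_use_id: str,
                 text: str) -> Path:
    path = tenant_result_path(base_dir, tenant_id, tool_use_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


def read_result(base_dir: Path, tenant_id: str,
                tool_use_id: str) -> Optional[str]:
    """Return the stored text, or None if this tenant has no such artifact."""
    path = tenant_result_path(base_dir, tenant_id, tool_use_id)
    return path.read_text(encoding="utf-8") if path.exists() else None
```

Reads require the same tenant prefix as writes, so an artifact persisted under tenant A is simply not found when tenant B asks for the same `tool_use_id`.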

Tenant context in async work N/A

This codebase does not appear to implement a “tenant context in async work” primitive. It uses `contextvars` to propagate per-message/per-session gateway context (`HERMES_SESSION_*`) safely across concurrent asyncio tasks, but there is no corresponding tenant/org/workspace context type that is set into async workers/handlers and then enforced as mandatory before data access.

  • high

    If the product requires multi-tenant isolation, introduce a dedicated tenant context primitive (e.g., `HERMES_TENANT_ID` as a `contextvars.ContextVar`) and ensure every async entry point (queue/message/event handlers, background tasks, tool execution, SSE/run events) re-establishes tenant context before touching any tenant-scoped storage.

    • gateway/session_context.py:1-195 — Current async context propagation is session-scoped only; this is the layer where tenant-scoped context should be added if multi-tenancy exists.
  • med

    Add integration tests that attempt cross-tenant async operations (list/read/write/export, and also async job/event paths) and assert they are denied using uniform not-found/forbidden semantics.

    • tests/gateway/test_session_env.py:1-220 — There are concurrency/isolation tests for session contextvars, but none for tenant isolation; extend the pattern to tenant context once the tenant primitive exists.
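
A tenant context primitive built on `contextvars` might be sketched as below. `HERMES_TENANT_ID` is the hypothetical name from the recommendation above, and `require_tenant`, `handle_event`, and `run_demo` are illustrative helpers.

```python
import asyncio
import contextvars

# Hypothetical context variable; deliberately has no default value.
HERMES_TENANT_ID: contextvars.ContextVar = contextvars.ContextVar(
    "HERMES_TENANT_ID"
)


def require_tenant() -> str:
    """Return the current tenant id, or fail if no context was established."""
    try:
        return HERMES_TENANT_ID.get()
    except LookupError:
        raise RuntimeError("tenant context not set") from None


async def handle_event(tenant_id: str) -> str:
    # Each async entry point re-establishes tenant context first.
    HERMES_TENANT_ID.set(tenant_id)
    await asyncio.sleep(0)  # yield so concurrent tasks interleave
    return require_tenant()


async def run_demo():
    # gather() wraps each coroutine in its own task with its own context copy.
    return await asyncio.gather(
        handle_event("tenant-a"), handle_event("tenant-b")
    )
```

Leaving the variable without a default is the design choice that matters here: data-access code that calls `require_tenant()` outside an established context raises instead of falling back to a shared scope.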
Per-tenant resource limits 0%

The codebase includes multiple rate-limit/quota-related mechanisms (e.g., a shared Nous Portal “rate-limited” breaker file and a process-wide Signal attachment token bucket), but they are not applied per tenant. The mechanisms appear global/shared rather than keyed by tenant, so noisy-neighbor isolation for this primitive is missing.

  • high

    Change Nous Portal rate-limit breaker state to be tenant-scoped: derive tenant from the authenticated request/session context and include it in the persisted key/path (and in any in-memory representations). This ensures one tenant’s 429/cooldown cannot block other tenants’ Nous usage.

    • agent/nous_rate_guard.py:1-120 — State is written to a single shared file path returned by _state_path() using a fixed subdir/filename; no tenant component exists.
  • high

    Key the Signal attachment scheduler/bucket on tenant: ensure acquire/refill/feedback accounting happens per tenant (e.g., SignalAttachmentScheduler per tenant, or a dict of buckets keyed by tenant id).
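
The "dict of buckets keyed by tenant id" option mentioned above can be sketched as follows. This is a generic token bucket under assumed parameters, not the existing Signal attachment scheduler, and both class names are hypothetical.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Simple token bucket; `now` is injectable so tests can control time."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 now=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._now = now
        self._tokens = capacity
        self._last = now()

    def try_acquire(self, cost: float = 1.0) -> bool:
        current = self._now()
        elapsed = current - self._last
        self._last = current
        # Refill for the elapsed time, capped at capacity.
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_sec)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class PerTenantLimiter:
    """Lazily creates one independent bucket per tenant id."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 now=time.monotonic):
        self._buckets = defaultdict(
            lambda: TokenBucket(capacity, refill_per_sec, now)
        )

    def try_acquire(self, tenant_id: str, cost: float = 1.0) -> bool:
        return self._buckets[tenant_id].try_acquire(cost)
```

With separate buckets, one tenant exhausting its budget has no effect on another tenant's ability to acquire tokens.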

Tenant-scoped key management 0%

No evidence of tenant-scoped encryption key management (per-tenant KMS/envelope keys, per-tenant crypto-erase, or explicit tenant-key references) exists in the codebase. Crypto functionality present (e.g., WeCom callback AES-CBC) uses key material passed in and reused without tenant scoping.

  • high

    Introduce tenant-scoped key management at the lowest crypto layer: implement a key-provider that resolves the correct per-tenant key (or envelope data key) using a tenant identifier derived from the trusted session/context, not from request parameters. Update crypto call sites (e.g., `WXBizMsgCrypt`) to take a tenant context and fetch/decrypt the correct key per tenant before encrypt/decrypt.

  • med

    Add integration tests that attempt cross-tenant encryption/decryption with tenant A vs tenant B credentials to ensure keys/envelopes cannot be mixed and that the wrong tenant cannot decrypt data.

Admin / role scoping N/A

This codebase does not implement the “Admin / role scoping” tenancy isolation primitive. There is no evidence of a tenant membership–scoped elevated role model with an `isAdmin`-style boolean that is tied to a tenant/membership, nor an explicit separate audited cross-tenant admin capability. Where “admin”/roles appear, it is either non-tenant (e.g., diagnostics) or used for other purposes (e.g., generic ACP permission bridging).

  • high

    Introduce a tenant membership–scoped elevated role model in the authz layer (e.g., roles bound to a membership/tenant_id FK) and ensure cross-tenant elevated access is handled via a separate, explicitly audited capability. Confirm enforcement is default at the lowest layer (DB/RLS or a centralized repository scope), not scattered app-level checks.

  • med

    Add integration tests that attempt cross-tenant admin actions (read/list/export/approval) and assert denial with uniform error behavior. This should be done at the boundary where admin checks are performed (authz middleware/handler and the data-access layer).

    • tests/gateway/test_teams.py:1-220 — Existing gateway/platform tests mock SDKs and cover platform behavior, but none (from the audited searches/evidence) cover tenant-scoped admin isolation.
Uniform not-found vs. forbidden N/A

I did not find any implementation of the “uniform not-found vs forbidden” tenancy isolation primitive (i.e., returning the same not-found response for both missing and access-denied, to avoid leaking resource existence across tenants). The codebase’s visible access-control layer for the dashboard uses 401/redirect semantics rather than a tenant-scoped 403-vs-404 pattern.

  • high

    Identify the tenant-scoped data model and the specific HTTP/API endpoints that fetch tenant/org-scoped resources by ID (where cross-tenant reads could return either 403 or 404). Then implement a single shared error/exception mapping at the data-access boundary so that access-denied is converted to the same not-found response as missing records for any tenant-scoped fetch.

  • med

    Add an integration test that creates/uses two different tenant/org identities and attempts a cross-tenant read (and list/export if applicable). Assert the response for (a) non-existent resource id and (b) an existing resource in another tenant are identical (same status code and response body/error).
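
The shared mapping described above, where access-denied collapses into the same response as a missing record, could be sketched like this. `Resource`, `NOT_FOUND`, and `fetch_response` are hypothetical names, and the in-memory dict stands in for a real data-access layer.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Resource:
    resource_id: str
    tenant_id: str


Response = Tuple[int, Dict[str, str]]

# One shared response object for "missing" and "belongs to another tenant".
NOT_FOUND: Response = (404, {"error": "not_found"})


def fetch_response(store: Dict[str, Resource], caller_tenant: str,
                   resource_id: str) -> Response:
    resource: Optional[Resource] = store.get(resource_id)
    if resource is None or resource.tenant_id != caller_tenant:
        return NOT_FOUND  # identical status and body in both cases
    return (200, {"id": resource.resource_id})
```

The integration test recommended above then reduces to asserting that the response for an unknown id equals the response for another tenant's id.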

Cross-tenant isolation tests 0%

No cross-tenant isolation test suite was found. Existing “isolation” tests focus on concurrency/session context leakage, cache aliasing, and platform-based state namespacing. None of them attempt cross-tenant read/write/list/export/async operations and assert denial.

  • high

    Add a dedicated cross-tenant isolation integration test module that creates two tenants (or two identities bound to different tenants), then attempts cross-tenant read, write, list, and export for each tenant-scoped resource and asserts failure (prefer uniform not-found/forbidden behavior depending on your API contract). Include async/enqueued paths as well.

  • high

    Extend the test strategy to cover any cross-request/shared storage and background workflows (caches, tool registries, event/queue handlers) with explicit tenant separation assertions (tenant A cannot observe tenant B outputs).

  • med

    Create a test harness/fixture for “tenant context” setup (two tenants + tenant-bound identity/session) so every resource test can reuse the same cross-tenant deny assertions across read/write/list/export/async.

Not applicable to this codebase: Database-enforced isolation, Default-scoped queries, Tenant context at the boundary, Tenant context in async work, Admin / role scoping, Uniform not-found vs. forbidden.

Identity & Access

SAML/OIDC libraries, SCIM provisioning endpoints, and a real roles/permissions schema — not a hard-coded isAdmin boolean.

57% 9/11 scored
  • Federated SSO (SAML/OIDC) 100%
    5/5 expected sites
  • RBAC modeled as data 0%
    0/3 expected sites not present
  • Centralized authorization 67%
    2/2 expected sites
  • No hardcoded privilege shortcuts 0%
    0/1 expected sites not present
  • Deny-by-default 100%
    2/2 expected sites
  • MFA / step-up auth 0%
    0/3 expected sites not present
  • Session & token hygiene 94%
    6/6 expected sites
  • Scoped machine credentials 0%
    0/2 expected sites not present
  • IP allowlists / network constraints 150%
    3/2 expected sites
Federated SSO (SAML/OIDC) 100%

Federated SSO exists in the dashboard authentication layer using a standardized, provider-pluggable OAuth/OIDC-like flow. The code wires a federated login start endpoint, a security-critical callback that validates CSRF state before completing login, a centralized middleware that verifies/refreshes sessions on each protected request, and logout/session invalidation plus guarded WS ticket minting.

  • high

    Review and document per-provider verify_session semantics (e.g., whether tokens are cryptographically validated server-side vs. introspected) and ensure all registered providers conform to the expected “returns None on expiry/invalid” contract without bypasses.

  • med

    Add explicit automated tests asserting that the callback rejects attacker-controlled next/state/cookie tampering across all registered providers (covering the state mismatch branch and missing_pkce_cookie branch).

  • low

    Ensure the public /api/auth/providers endpoint is rate-limited or protected against abuse if it can expose provider metadata in a sensitive deployment context.

Directory provisioning (SCIM) N/A

There is no Directory provisioning (SCIM) primitive implemented in this codebase. The only “directory” concept found is a channel directory cache (messaging reachability), and authentication is implemented for the CLI/agent via OAuth/API keys rather than any SCIM 2.0 Users/Groups lifecycle (including deprovisioning). Because this repository does not appear to expose any SCIM-compatible identity API surface, there are no concrete SCIM lifecycle sites to validate for correctness.

  • high

    If SCIM provisioning is a requirement for this product, add a dedicated identity/provisioning API module that implements the SCIM 2.0 surface (/scim/v2/Users and /scim/v2/Groups) including PATCH and a full lifecycle with deactivation that actually revokes access (e.g., sets user inactive, invalidates sessions/tokens, and prevents future authorization).

    • hermes_cli/auth.py:1-60 — Current auth is CLI/agent authentication (OAuth/API keys), indicating no existing identity provisioning API wiring to extend.
  • med

    Add integration tests that cover the SCIM lifecycle end-to-end (create user → verify access → deactivate/suspend → verify access is revoked; and optionally delete → verify access is removed).

RBAC modeled as data 0%

RBAC modeled as data (roles/permissions/memberships with centralized, permission-first authorization checks) is not implemented as a distinct primitive in this codebase. The gateway implements authorization for slash commands via config-derived allowlists (admin_user_ids and per-command allow sets) in gateway/slash_access.py, but this does not reflect an RBAC roles/permissions data model checked through one policy layer.

  • high

    Introduce (or integrate) a data-driven RBAC model: define role, permission, role_permissions, and memberships (whether persisted or loaded from config), and build an authorization service/engine that takes (principal, action/resource) → allow/deny based on permissions derived from role memberships.

    • gateway/slash_access.py:90-140 — policy_from_extra() currently builds authorization from allowlist keys; replace this with RBAC membership/role assignment resolution and permission aggregation.
  • high

    Centralize authorization decisions so every slash-command dispatch consults the same RBAC engine (deny-by-default), instead of evaluating admin/user_allowed_commands in multiple places (or inlined policy objects).

    • gateway/slash_access.py:30-70 — SlashAccessPolicy.can_run() is the current enforcement point; refactor it to perform permission checks from the RBAC engine rather than admin/user_allowed_commands logic.
  • med

    If you must keep backward compatibility with existing config formats, add a translation layer that maps legacy allowlist fields (allow_admin_from, user_allowed_commands) into synthetic RBAC roles/permissions assignments, so authorization decisions remain uniformly data-driven.

    • gateway/slash_access.py:90-140 — DM/group scope-specific keys are currently parsed here; keep parsing for compatibility but convert into RBAC membership/permission structures for enforcement.
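
A data-driven RBAC check of the shape recommended above (memberships to roles to permissions) might be sketched as follows. The role names, permission strings, and user ids are invented for illustration and do not come from `gateway/slash_access.py`.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical data; in practice this would be loaded from config or storage.
ROLE_PERMISSIONS: Dict[str, FrozenSet[str]] = {
    "admin": frozenset({"slash:restart", "slash:status", "slash:help"}),
    "member": frozenset({"slash:status", "slash:help"}),
}
MEMBERSHIPS: Dict[str, Set[str]] = {
    "u-1": {"admin"},
    "u-2": {"member"},
}


def permissions_for(user_id: str) -> FrozenSet[str]:
    """Aggregate permissions across every role the user holds."""
    granted: Set[str] = set()
    for role in MEMBERSHIPS.get(user_id, set()):
        granted |= ROLE_PERMISSIONS.get(role, frozenset())
    return frozenset(granted)


def is_permitted(user_id: str, permission: str) -> bool:
    # Deny by default: unknown users and unknown roles grant nothing.
    return permission in permissions_for(user_id)
```

The check asks "does this principal hold this permission?" and never "is this principal an admin?", which is what lets a legacy allowlist be translated into synthetic roles without changing the enforcement code.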
Centralized authorization 67%

The codebase implements centralized authorization mainly in the messaging gateway: a single `_is_user_authorized(...)` decision point is used to gate user-originated events before dispatch, with a default-deny posture and consistent logging on unauthorized attempts. For the Hermes dashboard, there is also a single auth-gate middleware (`gated_auth_middleware`) with allowlisted public paths and enforced session verification; however, the code slice reviewed was only the middleware header/initial portion, so only the gateway decision point is fully evidenced as a correct centralized authz chokepoint.

  • high

    Audit and document the dashboard authorization flow end-to-end: confirm that `gated_auth_middleware` (when `auth_required=True`) is the single place that makes allow/deny decisions for all non-public dashboard routes, and ensure every authorized/denied outcome is logged via the existing `audit_log` events.

    • hermes_cli/dashboard_auth/middleware.py:1-200 — Middleware is explicitly described as a centralized auth gate that enforces verified sessions for all routes except the configured public allowlist; verify the remainder of the file confirms a single decision boundary plus decision logging.
  • med

    Reduce drift risk in gateway authz semantics by ensuring adapter-owned access policy (`enforces_own_access_policy` / `dm_policy` etc.) is the only alternative path, and add explicit tests that assert `_is_user_authorized` is still the single decision chokepoint even when plugins run (confirm intended bypasses).

    • gateway/run.py:7300-7450 — Plugin hook runs before auth and can return `skip`; ensure these bypasses are intended and cannot accidentally create implicit authorization gaps.
No hardcoded privilege shortcuts 0%

The codebase does not correctly apply the primitive 'No hardcoded privilege shortcuts'. In the slash-command authorization flow, privileged access is determined by checking whether the caller’s identity (`user_id`) is present in an operator-configured admin list, rather than by deriving privilege from a roles/permissions model.

  • high

    Remove the identity-string-based privilege shortcut in `SlashAccessPolicy.is_admin` / `can_run`. Replace it with role/permission evaluation sourced from the canonical role model (e.g., memberships → role_permissions → permissions) and enforce the decision via a centralized policy module.

    • gateway/slash_access.py:67-103 — The privilege gate is implemented by `return str(user_id) in self.admin_user_ids` and then used by `can_run` to allow all commands for admins. This is the exact anti-pattern the primitive forbids.
Deny-by-default 100%

The codebase implements deny-by-default at the dashboard auth boundary. Non-loopback mode uses a centralized allowlist (_path_is_public / PUBLIC_API_PATHS + public path prefixes) and otherwise returns 401/redirect unless a verified session is attached; legacy loopback mode similarly requires the ephemeral session token for all /api/* routes except explicitly listed public endpoints. This prevents silently public endpoints when new routes are added under /api/.

  • high

    Add a new /api/* endpoint to PUBLIC_API_PATHS (or the explicit public prefix list) only after a threat review confirms it is truly non-sensitive; otherwise rely on the default-deny middleware behavior.

  • med

    Keep the legacy and OAuth gates synchronized by routing both through the same PUBLIC_API_PATHS allowlist (already done) and add a regression test for any future drift to confirm newly added public paths do not get exposed unintentionally.

AuthN before AuthZ at the boundary N/A

Agent produced no parseable output for this item.

MFA / step-up auth 0%

I found dashboard authentication enforced via OAuth session cookies with token verification/refresh, but no MFA/step-up (no second-factor libraries or enrollment/verify challenges, and no step-up enforcement logic on high-risk operations). Therefore, this primitive appears absent in this codebase.

  • high

    Introduce an MFA/step-up mechanism integrated into the centralized dashboard auth boundary (the auth gate middleware). Specifically: add a step-up-required decision point for sensitive actions (admin endpoints, credential/config changes), trigger a second-factor verification challenge, and ensure the step-up result is time-bound and auditable (e.g., stored as a claim in the session / separate step-up cookie, with event logs).

  • high

    Add step-up enrollment/verification routes and wiring for the chosen second factor (TOTP or WebAuthn or an enterprise IdP step-up). Ensure the callback/challenge flow results in a server-validated step-up status that high-risk endpoints check before executing.

  • med

    Create/centralize a policy list of “step-up required” operations and ensure it cannot drift from the route allowlist. Use one shared source of truth for (a) unauth bypass paths and (b) auth-only vs. auth+step-up paths.

    • hermes_cli/dashboard_auth/public_paths.py:1-50 — This file is already a shared allowlist for auth bypass; it should be extended/paired with step-up-required policy so sensitive operations are never implicitly allowed without second factor.
Session & token hygiene 94%

This codebase has solid session/token hygiene for the dashboard: access tokens are verified per request, expired access tokens trigger refresh rotation, logout revokes refresh tokens (best-effort) and clears cookies, and WebSocket access uses short-lived, single-use tickets with TTL and server-side consume/delete behavior. However, an additional internal WS credential is explicitly non-expiring and multi-use, which is a hygiene gap relative to 'short-lived, rotated, revocable' tokens.

  • high

    Align the internal WS credential with the hygiene bar: introduce rotation (per interval or per spawn), expiry, and a server-side revocation/invalidation mechanism. Today internal_ws_credential() is explicitly 'process-lifetime' and 'never expires'.

    • hermes_cli/dashboard_auth/ws_tickets.py:86-132 — internal_ws_credential() is minted once per process, is multi-use, and the docstring states it 'never expires'. This conflicts with the primitive requirement for short-lived, revocable tokens.
  • med

    Confirm provider implementations fully enforce 'refresh dead/reuse-detected' semantics such that refresh_session can reliably raise RefreshExpiredError, ensuring refresh token reuse is handled as a revocation/forced re-login event.

  • low

    Add automated tests that assert logout actually stops replay: after /auth/logout, a previously issued access cookie should be rejected by verify_session (or access token TTL should be minimized) and a refresh cookie should either be revoked or rejected on the next refresh attempt.
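
A credential with expiry, rotation, and revocation, as recommended for the internal WS credential, could be sketched as below. `ExpiringCredential` is a hypothetical class and the TTL is an assumed parameter. It does not reflect how `internal_ws_credential()` is implemented.

```python
import hmac
import secrets
import time


class ExpiringCredential:
    """Hypothetical rotating secret; `now` is injectable for testing."""

    def __init__(self, ttl_seconds: float, now=time.monotonic):
        self._ttl = ttl_seconds
        self._now = now
        self._rotate()

    def _rotate(self) -> None:
        self._value = secrets.token_urlsafe(32)
        self._expires_at = self._now() + self._ttl

    def current(self) -> str:
        """Return a valid credential, minting a new one after expiry."""
        if self._now() >= self._expires_at:
            self._rotate()
        return self._value

    def verify(self, presented: str) -> bool:
        if self._now() >= self._expires_at:
            return False  # expired or revoked
        # Constant-time comparison on bytes.
        return hmac.compare_digest(presented.encode(), self._value.encode())

    def revoke(self) -> None:
        """Invalidate the current value; the next current() call rotates."""
        self._expires_at = self._now()
```

A leaked value is then useful only until the TTL elapses or `revoke()` is called, instead of for the lifetime of the process.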

Scoped machine credentials 0%

The codebase does not implement the 'Scoped machine credentials' primitive. The API server uses a single shared bearer secret (API_SERVER_KEY) validated via _check_auth, with no per-client scoping, revocability, or service-account model. While the repo has credential pooling for upstream LLM providers, that is not a scoped machine-credential scheme for inbound programmatic access to Hermes itself.

  • high

    Replace the single global API_SERVER_KEY with a service-account / api_keys model that stores (at minimum) scopes/permissions and status for each client credential. Issue short-lived scoped tokens (or store hashed API keys) per service, and validate presented credentials by looking up the credential record (scope enforcement), not by comparing against one shared secret.

  • high

    Add revocation/rotation support for inbound credentials: token expiry (for tokens) and server-side invalidation (e.g., credential status/blacklist) and ensure logout/revocation paths invalidate credentials immediately.

  • med

    Implement least-privilege scope checks at the auth boundary: parse/resolve the authenticated machine credential into allowed operations for that client, and enforce deny-by-default for API routes/handlers.

    • gateway/platforms/api_server.py:843-867 — This is the chokepoint for auth enforcement; it should be extended from 'valid key?' to 'which scoped client and which permissions?' to prevent over-privilege.
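
The move from "valid key?" to "which client, and which permissions?" could be sketched like this. `ApiKeyStore`, the `client_id.secret` key format, and the scope strings are all assumptions for illustration. They are not the API server's `_check_auth` logic.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass
class ApiKeyRecord:
    client_id: str
    key_hash: str  # only the hash is stored, never the raw key
    scopes: FrozenSet[str]
    active: bool = True


def hash_key(raw_key: str) -> str:
    return hashlib.sha256(raw_key.encode()).hexdigest()


class ApiKeyStore:
    """Hypothetical in-memory store; client ids must not contain '.'."""

    def __init__(self) -> None:
        self._records: Dict[str, ApiKeyRecord] = {}

    def issue(self, client_id: str, scopes: Iterable[str]) -> str:
        raw = secrets.token_urlsafe(32)
        self._records[client_id] = ApiKeyRecord(
            client_id, hash_key(raw), frozenset(scopes)
        )
        return f"{client_id}.{raw}"  # shown to the client once

    def revoke(self, client_id: str) -> None:
        record = self._records.get(client_id)
        if record is not None:
            record.active = False

    def authorize(self, presented: str, scope: str) -> bool:
        client_id, _, raw = presented.partition(".")
        record = self._records.get(client_id)
        if record is None or not record.active:
            return False
        if not hmac.compare_digest(record.key_hash, hash_key(raw)):
            return False
        return scope in record.scopes  # deny anything not explicitly granted
```

Each client gets its own revocable credential, so disabling one integration does not require rotating a secret shared by every caller.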
IP allowlists / network constraints 150%

An IP allowlist / CIDR-based network constraint exists for the Microsoft Graph webhook adapter. The implementation parses `extra.allowed_source_cidrs`, fails closed at startup when the bind is network-accessible but no CIDRs are configured, and enforces the source-IP allowlist before processing in the health, validation, and notification handlers (returning 403 on mismatch). No comparable per-tenant IP allowlist middleware/guard was found elsewhere in the codebase.

  • med

    If the gateway also exposes other external HTTP entrypoints beyond the MS Graph webhook, identify them and apply the same pattern (per-endpoint CIDR allowlist, fail-closed startup when exposed, and checks at the top of each handler) to keep network constraints consistent.

  • low

    Audit whether reverse-proxy deployments are used for the MS Graph webhook and, if so, ensure `request.remote` reflects the true source IP (e.g., forwarded headers / aiohttp trust settings).
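
The pattern described above (parse CIDRs, fail closed at startup, check the source IP at the top of each handler) can be sketched with the standard `ipaddress` module. The function names are hypothetical and this is not the Microsoft Graph webhook adapter's code.

```python
import ipaddress
from typing import Iterable, List


def parse_cidrs(raw: Iterable[str]) -> List:
    """Parse CIDR strings into IPv4/IPv6 network objects."""
    return [ipaddress.ip_network(cidr, strict=False) for cidr in raw]


def source_ip_allowed(remote_ip: str, networks: List) -> bool:
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # unparseable source address is rejected
    return any(addr in net for net in networks)


def is_loopback_bind(bind_host: str) -> bool:
    if bind_host == "localhost":
        return True
    try:
        return ipaddress.ip_address(bind_host).is_loopback
    except ValueError:
        return False


def check_startup(bind_host: str, networks: List) -> None:
    """Fail closed: refuse a network-accessible bind with no allowlist."""
    if not is_loopback_bind(bind_host) and not networks:
        raise RuntimeError(
            "network-accessible bind requires allowed_source_cidrs"
        )
```

Behind a reverse proxy, the value passed as `remote_ip` has to be the real client address. Otherwise every request appears to come from the proxy and the allowlist is meaningless.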

Not applicable to this codebase: Directory provisioning (SCIM), AuthN before AuthZ at the boundary.

Compliance Code Patterns

Envelope encryption, enforced TLS, validated inputs, and zero secrets anywhere in the full git history.

35% 11/11 scored
  • Encryption in transit 0%
    0/3 expected sites not present
  • Encryption at rest 0%
    0/2 expected sites
  • Centralized key management 0%
    0/1 expected sites not present
  • Secrets management 89%
    3/3 expected sites
  • No secrets in git history 0%
    0/1 expected sites not present
  • Input validation at boundaries 0%
    0/1 expected sites
  • Injection-safe data access 100%
    1/1 expected sites
  • Data classification & PII handling 83%
    2/2 expected sites
  • Access logging on protected routes 0%
    0/2 expected sites
  • Retention & secure deletion 11%
    1/3 expected sites
  • Secure defaults / hardening 100%
    4/4 expected sites
Encryption in transit 0%

No evidence was found that encryption in transit is enforced across every hop. The dashboard’s internal WebSocket URLs are hardcoded to `ws://` (plaintext) and the cross-container health probe is HTTP-based, so plaintext transport is possible and is not redirected to or secured by TLS/HSTS in the code paths reviewed.

  • high

    Change all internal WebSocket URL constructions from `ws://` to `wss://`, and derive the scheme from trusted request/proxy metadata (e.g., `X-Forwarded-Proto`) or a server config flag; ensure both `/api/ws` and `/api/pub` use TLS in production.

  • high

    Enforce TLS/HTTPS for inter-service HTTP calls. For `_probe_gateway_health`, require `https://` (or add a config that defaults to HTTPS and rejects plain HTTP in production), and optionally validate certificates.

    • hermes_cli/web_server.py:620-720 — Health probe is documented/implemented as an HTTP fetch (`urllib.request.urlopen`) with `http://` examples; no TLS enforcement is shown.
  • med

    Add edge transport hardening in the FastAPI server: redirect HTTP→HTTPS and set transport security headers (HSTS, secure redirects) for all dashboard routes (including websocket upgrade handling via correct proxy configuration).

    • hermes_cli/web_server.py:1-260 — Review of server setup shows CORS/auth/middleware, but no TLS forcing/redirect/HSTS enforcement was found in the inspected boot/middleware portions.
Encryption at rest 0%

Encryption at rest is implemented at least for Matrix E2EE crypto-state persistence: when E2EE is enabled, the adapter uses a SQLite-backed mautrix crypto store (`PgCryptoStore`) located at `.../matrix/store/crypto.db`. However, other sensitive local persistence in the Weixin adapter (account `token` and `context_token` caches) is written to disk as JSON without encryption at rest, so the primitive is not consistently applied across all sensitive-at-rest data surfaces.

  • high

    Encrypt sensitive Weixin on-disk data (`save_weixin_account` token JSON and `ContextTokenStore` context-token JSON) using field-level encryption (or an encrypted storage layer) and ensure encryption covers backups/snapshots as well.

  • med

    Audit other local persistence paths for sensitive data (tokens, session keys, private keys, encrypted media parameters stored locally) and ensure they either use the same encrypted-at-rest mechanism as Matrix crypto.db or adopt an equivalent encrypted storage pattern.

    • gateway/platforms/matrix.py:676-744 — Provides the reference implementation for at-rest encryption of sensitive crypto-state in this codebase (use this as the baseline pattern when extending to other sensitive stores).
Centralized key management 0%

I did not find any centralized key-management implementation (KMS/Vault/KeyVault/Secrets Manager-style) with rotation and revocation logic. The codebase instead appears to fetch keys ad-hoc from environment variables (e.g., LINEAR_API_KEY) rather than from a managed, centrally administered key store.

  • high

    Introduce a centralized managed key store for any encryption/auth keys the system uses (e.g., cloud KMS/Vault/KeyVault/Secrets Manager) and replace direct env-based key retrieval with runtime fetches from the managed store; ensure rotation policies and an emergency revocation mechanism are implemented and exercised.

Secrets management 89%

This codebase implements runtime secrets management by loading credentials from Bitwarden into environment variables. `hermes_cli.env_loader.load_hermes_dotenv()` invokes `_apply_external_secret_sources()` which reads a `secrets:` section from `~/.hermes/config.yaml` and calls `agent.secret_sources.bitwarden.apply_bitwarden_secrets(...)` to fetch secrets via the `bws` CLI. The implementation is centralized and is applied before credential-dependent runtime logic reads `os.environ`. Disk caching exists and is permission-restricted (0600) but is plaintext-equivalent.

  • high

    Consider encrypting the disk cache (`<hermes_home>/cache/bws_cache.json`) or avoiding persistence of secret values altogether; current design stores fetched secret values in a plaintext-equivalent JSON file (even though it uses `chmod 0600`).

  • med

    Maintain a stricter policy around committed secret artifacts. A full-history gitleaks scan produced many “REDACTED” hits; confirm none are real credentials and keep test/fixture values clearly marked and rotated/invalidated.

    • hermes_cli/auth.py:60-110 — Example gitleaks hit area includes OAuth client IDs/related constants; ensure none are true credentials and prefer loading client secrets/tokens only via the env_loader/secret source path.
  • low

    Expand secret-source support beyond Bitwarden (Vault/Secrets Manager), but keep the same enforcement pattern (runtime injection into `os.environ` and no plaintext literals).

    • hermes_cli/env_loader.py:1-120 — The env_loader includes `_SECRET_SOURCES` and labels for multiple secret sources (future-proofing), indicating an intended pattern extension.
No secrets in git history 0%

This primitive is NOT satisfied. A full-history gitleaks scan returned many matches for committed secret/credential material, and the current codebase contains hardcoded OAuth/credential-like values in files such as `hermes_cli/auth.py` and `agent/anthropic_adapter.py` (plus test literals). Therefore, the codebase does not meet “No secrets in git history”.

  • high

    Rotate/replace any credentials/keys/client secrets that were committed (treat all matches as compromised), then remove them from history (e.g., git filter-repo/BFG) and force-push. Ensure CI includes a full-history secret scan so future commits are blocked.

  • med

    Update tests and documentation to use non-secret placeholders (explicitly marked as fake) and/or generate ephemeral test secrets at runtime. Avoid committing even realistic-looking tokens; if absolutely needed, ensure they are guaranteed non-functional and covered by an explicit secret-scan exception policy.

Input validation at boundaries 0%

The codebase applies input validation at boundaries in meaningful places (FastAPI/Pydantic validation for request bodies and explicit middleware validation of Host headers). However, at least one sensitive boundary (DELETE /api/webhooks/{name}) accepts a path parameter as a raw string with only normalization (strip/lower) and no schema or constraints layer, so it counts as a site where validation is expected but not found.

  • high

    Add explicit schema/constraint validation for webhook path parameter `name` on DELETE /api/webhooks/{name} (e.g., restrict length and allowed characters, and reject empty/invalid values with 400). Consider using a Pydantic model or FastAPI parameter constraints instead of only `(name or '').strip().lower()`.
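
The constraint described in this recommendation can be sketched with the standard library alone. The allowed alphabet, the 64-character limit, and the names `validate_webhook_name` and `InvalidWebhookName` are assumptions made for illustration, not values taken from the audited code; a real handler would likely express the same rule through FastAPI parameter constraints and return 400 on failure.

```python
import re

# Hypothetical rule: 1-64 characters, lowercase letters, digits, '-' and '_',
# starting with a letter or digit. The real allowed alphabet is an assumption.
_WEBHOOK_NAME_RE = re.compile(r"[a-z0-9][a-z0-9_-]{0,63}")


class InvalidWebhookName(ValueError):
    """Raised when a webhook name fails validation; map this to HTTP 400."""


def validate_webhook_name(raw: object) -> str:
    """Normalize and validate a webhook name taken from the URL path."""
    if not isinstance(raw, str):
        raise InvalidWebhookName("name must be a string")
    # Keep the existing normalization, then enforce shape and length.
    name = raw.strip().lower()
    if not _WEBHOOK_NAME_RE.fullmatch(name):
        raise InvalidWebhookName("name must match [a-z0-9][a-z0-9_-]{0,63}")
    return name
```

Keeping normalization and validation in one function means every caller receives either a canonical name or an exception, never a partially checked string.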

Injection-safe data access 100%

Injection-safe data access (parameterized/bound queries) is present and correctly applied in the Kanban dashboard API layer. The DB operations observed in request-driven handlers use `?` placeholders with bound parameters, not string concatenation or f-string interpolation of untrusted input into SQL.

  • high

    Do a targeted sweep of all DB access functions that consume HTTP inputs (e.g., anything taking `task_id`, board slugs, user/session identifiers) to confirm they always use placeholders. Pay special attention to any SQL assembled via f-strings for dynamic identifiers; ensure only trusted/static identifiers are interpolated and never untrusted values.

  • med

    If any layer uses dynamic SQL construction for identifiers (table/column names), ensure it is restricted to whitelisted/validated values and not derived directly from request parameters.
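
A minimal sketch of the two rules in these recommendations, using `sqlite3`: untrusted values go through `?` placeholders, and a dynamic identifier is interpolated only after an allow-list check. The `tasks` table, its columns, and `list_tasks` are hypothetical and do not reflect the Kanban schema in the audited code.

```python
import sqlite3

# Hypothetical allow-list; the real sortable columns are an assumption.
_SORTABLE_COLUMNS = {"created_at", "title", "status"}


def list_tasks(conn: sqlite3.Connection, board: str, sort_by: str = "created_at"):
    """Return (id, title) rows for one board, ordered by an allow-listed column."""
    if sort_by not in _SORTABLE_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort_by!r}")
    # The identifier is interpolated only after the allow-list check;
    # the untrusted value is always passed as a bound parameter.
    sql = f"SELECT id, title FROM tasks WHERE board = ? ORDER BY {sort_by}, id"
    return conn.execute(sql, (board,)).fetchall()


# Small in-memory fixture so the sketch can be run as-is.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, board TEXT, title TEXT, "
    "status TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO tasks (board, title, status, created_at) VALUES (?, ?, ?, ?)",
    [
        ("main", "b-task", "open", "2026-01-02"),
        ("main", "a-task", "done", "2026-01-01"),
        ("other", "x", "open", "2026-01-03"),
    ],
)
```

An injection attempt in `board` is treated as a literal string and matches nothing, while an unexpected `sort_by` is rejected before any SQL is built.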

Data classification & PII handling 83%

This codebase does have data classification/PII-handling controls: it includes targeted redaction logic for (1) public debug log uploads and (2) Telegram-bound gateway responses/status messages. However, the implementation appears specialized (pattern-based, best-effort) rather than a comprehensive, centrally-enforced sensitivity taxonomy across all logging/export paths.

  • high

    Audit and enforce PII/sensitive-field masking centrally: identify every place that logs/serializes user content or credentials (especially JSON dumps via stringify/toJSON, debug snapshots, and any “dump/export” utilities) and ensure they route through a single sensitivity-aware redaction layer with a field/tag taxonomy (PII vs secrets vs free-form content).

    • gateway/run.py:220-360 — Current redaction enforcement is clearly applied only on Telegram-bound user-facing text for provider failures/status; this suggests other logging/export paths may not uniformly apply the same control.
  • med

    Extend classification from pattern-based masking to structured tagging: introduce a small schema for sensitive fields (e.g., email/phone/token/access-token/query-token/media identifiers) and mask those by key across serialization and log sinks.

    • gateway/run.py:220-360 — _redact_gateway_user_facing_secrets is pattern-based; it’s effective for known secret shapes but not a general “PII field” guarantee.
  • low

    Add/strengthen tests that assert “no PII in logs” for each sink: include property-based or fixture-based tests ensuring redact/mask functions are invoked on every relevant path (not just the Telegram error/status path and debug-share upload).

    • hermes_cli/debug.py:1-220 — The debug-share behavior is documented and presumably tested, but additional sink coverage is needed beyond this boundary.
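
The move from pattern-based masking to key-based tagging could look roughly like the following. The key list, the mask string, and `mask_sensitive` are illustrative assumptions; the existing `_redact_gateway_user_facing_secrets` logic is not reproduced here.

```python
from typing import Any

# Hypothetical taxonomy: key names treated as sensitive. Illustrative only.
SENSITIVE_KEYS = {"email", "phone", "token", "access_token", "api_key", "password"}
MASK = "[REDACTED]"


def mask_sensitive(value: Any) -> Any:
    """Return a copy of `value` with sensitive keys masked at any nesting depth."""
    if isinstance(value, dict):
        return {
            key: MASK if str(key).lower() in SENSITIVE_KEYS else mask_sensitive(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        # Sequences are returned as lists, which is sufficient for JSON sinks.
        return [mask_sensitive(item) for item in value]
    return value
```

Masking by key complements pattern matching: it catches values with no recognizable shape (a phone number, a name field) as long as the field is tagged, while pattern-based redaction still handles secrets embedded in free-form text.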
Access logging on protected routes 0%

The codebase includes a dedicated dashboard-auth audit logger (`audit_log` writing `~/.hermes/logs/dashboard-auth.log`) and it records several authentication lifecycle events (login/refresh/verification failures). However, it does not appear to implement “access logging on protected routes” for every authenticated/sensitive action: the central auth-gate middleware allows verified sessions to proceed (`return await call_next(request)`) without emitting a per-request access log that includes a unique actor identifier for the action.

  • high

    In `hermes_cli/dashboard_auth/middleware.py` (inside `gated_auth_middleware`), add a per-request audit/access log emitted after authentication succeeds (including after refresh succeeds) and before `call_next(request)`. The log entry should include a unique actor identifier (e.g., the authenticated `user_id` from `request.state.session`) and enough request metadata to support attributable auditing.

  • med

    Ensure the per-request access log is applied uniformly across all authenticated/sensitive endpoints (not just refresh/verify failures). If there are additional auth-gates elsewhere, confirm they all delegate to the same middleware logging path.

    • hermes_cli/dashboard_auth/middleware.py:1-239 — The middleware is the primary enforcement location for protected dashboard routes; it contains audit logging for auth lifecycle events, but not for every passed-through authenticated request.
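
As a sketch of what a per-request access entry might contain, the function below builds one JSON line per authenticated request. The field names and `build_access_record` are assumptions; in the audited code the call would sit inside `gated_auth_middleware`, after verification succeeds and before `call_next(request)`, with the actor taken from `request.state.session`.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_access_record(user_id: str, method: str, path: str,
                        request_id: Optional[str] = None) -> str:
    """Serialize one per-request access entry as a JSON line."""
    if not user_id:
        # Refuse to write an unattributable entry.
        raise ValueError("access log requires an authenticated actor id")
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": "access",
        "actor": user_id,
        "method": method.upper(),
        "path": path,
    }
    if request_id is not None:
        record["request_id"] = request_id
    return json.dumps(record, sort_keys=True)
```

Rejecting an empty actor at write time keeps the log attributable by construction, instead of relying on each call site to remember the field.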
Retention & secure deletion 11%

The codebase does include retention-style enforcement for a few credential/auxiliary data types (WS ticket TTL + immediate removal on consume; debug-share paste auto-expiry with a sweep that deletes remote pastes) and a request-time deletion endpoint for stored responses. However, the audited evidence does not show a comprehensive, system-wide retention window + secure deletion (including cascade to derived data/backups and cryptographic wipe) for persisted conversation/PII-like content. Therefore the primitive is only partially implemented.

  • high

    Add or verify an enforced retention policy for persisted chat/session/response content (not just deletion endpoints): implement scheduled purge/TTL jobs for response/session storage, ensure DELETE cascades to all related/derived records, and ensure the purge reaches backups/exports. The evidence currently shows only DELETE /v1/responses/{response_id}, with no retention windows or secure disposal across backups.

    • gateway/platforms/api_server.py:3184-3224 — DELETE /v1/responses/{response_id} deletes from _response_store, but no retention window/purge/secure disposal across derived data/backups is evidenced in the audited slices.
  • med

    For debug-share/log upload flows, document and enforce local secure deletion (if any local intermediate files are written) and verify that any derived/local copies are removed or cryptographically wiped; currently only remote paste auto-deletion is evidenced.

  • low

    Extend test coverage for retention/purge correctness: add integration tests proving that expired items are removed by background sweeps and that delete-on-request removes all related artifacts (not only the primary record).
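
A retention window with a background sweep can be sketched as follows. `ExpiringStore` is a hypothetical in-memory stand-in, not the `_response_store` from the audited code, and it covers only the primary record: cascade to derived data, backups, and cryptographic wipe would need separate handling. The injectable `now` argument exists so that expiry can be tested deterministically.

```python
import time
from typing import Dict, List, Optional, Tuple


class ExpiringStore:
    """In-memory store whose entries expire after a fixed retention window."""

    def __init__(self, retention_seconds: float) -> None:
        self._retention = retention_seconds
        self._items: Dict[str, Tuple[float, object]] = {}

    def put(self, key: str, value: object, now: Optional[float] = None) -> None:
        self._items[key] = (time.time() if now is None else now, value)

    def get(self, key: str, now: Optional[float] = None) -> Optional[object]:
        """Return the value, or None if it is missing or past retention."""
        now = time.time() if now is None else now
        entry = self._items.get(key)
        if entry is None or now - entry[0] >= self._retention:
            return None
        return entry[1]

    def purge_expired(self, now: Optional[float] = None) -> List[str]:
        """Delete every entry older than the window; return the removed keys."""
        now = time.time() if now is None else now
        expired = [k for k, (ts, _) in self._items.items()
                   if now - ts >= self._retention]
        for key in expired:
            del self._items[key]
        return expired
```

Reads already hide expired entries, so the sweep is about disposal and not about correctness of reads; a scheduled job would call `purge_expired()` periodically.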

Secure defaults / hardening 100%

This codebase applies Secure defaults/hardening primarily in its FastAPI dashboard auth layer. Session cookies are hardened (HttpOnly, SameSite=Lax, HTTPS-only Secure, constrained Path, and __Host-/__Secure- naming). The dashboard auth middleware enforces authentication for all non-public routes and clears cookies on invalid/expired sessions to force clean re-auth. However, there is no evidence in the inspected server bootstrap/web server file of security-header middleware (e.g., CSP/X-Frame-Options) being wired as a hardened default.

  • high

    Add a production security-headers middleware for the FastAPI dashboard (e.g., CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy) and ensure it is enabled in all non-local/prod paths.

    • hermes_cli/web_server.py:1-120 — Central dashboard server wiring is present (FastAPI app, CORSMiddleware). Security-header hardening middleware wiring (CSP/other headers) was not observed in this inspected region, so this is a likely un-enforced should-be site.
  • med

    Confirm production debug/verbose error behavior is disabled for the web server (e.g., no stack traces returned to clients, and any debug endpoints are gated off in non-local deployments). If it exists elsewhere, document/ensure consistent gating on every web entrypoint.

    • hermes_cli/dashboard_auth/middleware.py:1-112 — The middleware focuses on auth gating and structured 401/redirect responses; additional production-safe error behavior (stack-trace suppression) was not verified via the inspected hardening surfaces.
  • low

    If the compliance target requires inactivity-based session termination (not just token expiry), add/confirm an inactivity watchdog server-side for the dashboard auth session cookies/tokens and test it across refresh flows.
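
One way to wire default security headers is a small pure-ASGI middleware, sketched here without any framework dependency. The header values are placeholders (a real Content-Security-Policy has to be tuned to the dashboard's assets), and `SecurityHeadersMiddleware`, `_demo_app`, and `run_demo` are hypothetical names; nothing here is taken from `hermes_cli/web_server.py`.

```python
import asyncio

# Hypothetical default header set; tune the CSP per deployment.
SECURITY_HEADERS = [
    (b"x-content-type-options", b"nosniff"),
    (b"x-frame-options", b"DENY"),
    (b"referrer-policy", b"no-referrer"),
    (b"content-security-policy", b"default-src 'self'"),
]


class SecurityHeadersMiddleware:
    """Pure ASGI middleware that adds security headers to HTTP responses."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                present = {name.lower() for name, _ in headers}
                # Do not override a header the application set explicitly.
                headers += [(k, v) for k, v in SECURITY_HEADERS if k not in present]
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_headers)


async def _demo_app(scope, receive, send):
    """Tiny ASGI app used only to exercise the middleware."""
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"x-frame-options", b"SAMEORIGIN")]})
    await send({"type": "http.response.body", "body": b"ok"})


def run_demo():
    """Drive the wrapped demo app once and return the messages it sent."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    asyncio.run(SecurityHeadersMiddleware(_demo_app)({"type": "http"}, receive, send))
    return sent
```

Because the class follows the plain ASGI callable shape, it should be attachable to a Starlette or FastAPI app through the framework's usual middleware registration, though that wiring is not shown or verified here.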

Audit, Governance, Residency

An append-only audit_events table, a queryable audit API, and per-region infrastructure keyed on each tenant’s region.

14% 7/10 scored
  • Dedicated audit event store 89%
    3/3 expected sites
  • Append-only / tamper-evidence 0%
    0/3 expected sites
  • Comprehensive event coverage 11%
    1/3 expected sites
  • Queryable, provable audit access 0%
    0/2 expected sites not present
  • Audit retention & separation of duties 0%
    0/1 expected sites not present
  • Data-subject rights (export & erase) 0%
    0/4 expected sites
  • Sub-processor / data-flow transparency 0%
    0/4 expected sites not present
Dedicated audit event store 89%

A dedicated audit event store exists for dashboard-auth events (`hermes_cli/dashboard_auth/audit.py`) and is written as JSON-lines to `~/.hermes/logs/dashboard-auth.log` (or `$HERMES_HOME`). The code emits structured audit events for sensitive auth decisions (login start/failure; session verify failures; refresh success). However, the event schema currently emphasizes `ts`, `event`, and provided fields, and there is no strong evidence (in this code) of a standardized, comprehensive shape including actor+tenant+resource type+id across all events—so completeness/consistency vs the full expected audit schema is only partially demonstrated.

  • high

    Define and enforce a single canonical audit event schema (actor, tenant, action, resource type+id, context, timestamp) and update all audit call sites to populate it consistently. Evidence: audit store accepts arbitrary `fields` but does not force required keys.

    • hermes_cli/dashboard_auth/audit.py:1-88 — Central audit writer takes arbitrary `**fields` and only guarantees `ts` + `event`; there is no compile-time/runtime enforcement that required keys (actor/tenant/resource) are always present.
  • med

    Add tenant scoping to the emitted audit records (e.g., include `tenant_id`/`org_id` from session/provider context where applicable) and add automated tests that assert the presence of required schema fields for each event type.

  • low

    Harden integrity/tamper-evidence guarantees for the audit store (e.g., hash-chaining and verification on read/export) if required by your governance standard.
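
Runtime enforcement of a canonical schema could be as small as the sketch below. The required field names and `make_audit_event` are assumptions chosen to mirror the schema listed in the first recommendation; the real `audit_log()` signature in `hermes_cli/dashboard_auth/audit.py` may differ. In a single-operator deployment, `tenant` could be a fixed value.

```python
import json
from datetime import datetime, timezone

# Hypothetical canonical schema: every event must carry these keys.
REQUIRED_FIELDS = ("actor", "tenant", "action", "resource_type", "resource_id")


def make_audit_event(event: str, **fields) -> str:
    """Build one JSON-lines audit record, rejecting incomplete events."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"audit event {event!r} missing required fields: {missing}")
    # `ts` and `event` are written last so callers cannot override them.
    record = {**fields, "ts": datetime.now(timezone.utc).isoformat(), "event": event}
    return json.dumps(record, sort_keys=True)
```

Failing loudly at the writer turns a missing field into a test failure at the call site, which is usually cheaper than discovering gaps during an audit.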

Append-only / tamper-evidence 0%

The codebase has an audit log writer for dashboard-auth events that appends JSON lines to `$HERMES_HOME/logs/dashboard-auth.log`. This satisfies a basic “append” behavior, but it does not implement tamper-evidence (no hash chain/signing/integrity verification or immutability enforcement). As a result, while an audit trail exists, it is not provable as tamper-evident evidence.

  • high

    Upgrade `hermes_cli/dashboard_auth/audit.py:audit_log()` to produce tamper-evident records: implement a hash-chain (store prev-hash with each line, compute next hash from canonicalized entry), optionally sign records (HMAC/Ed25519) with a key whose write access is restricted, and add a verification function used on startup or on demand.

  • high

    Restrict audit log mutation outside the writer: ensure log file permissions/ownership prevent arbitrary modification and ensure any rotation/deletion is either disallowed or itself audited and integrity-preserving (e.g., seal segments with a final signed root hash).

  • med

    Add automated tests that verify tamper-evidence: after writing N events, mutate one line on disk and assert verification fails for the chain from that point onward (and/or detect signature mismatch).
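
The hash-chain idea in these recommendations can be illustrated with a short, self-contained sketch. The record layout (`entry`, `prev`, `hash`) and the function names are assumptions, and the log is kept in a list instead of a file for brevity. An unkeyed chain only detects edits by someone who does not recompute every later hash; the HMAC or signature step mentioned above is what defends against a writer with full file access.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # Placeholder previous-hash for the first record.


def _digest(prev_hash: str, entry: dict) -> str:
    """Hash the previous hash together with a canonical form of the entry."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_event(log: List[dict], entry: dict) -> dict:
    """Append `entry`, linking it to the hash of the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {"entry": entry, "prev": prev_hash, "hash": _digest(prev_hash, entry)}
    log.append(record)
    return record


def verify_chain(log: List[dict]) -> int:
    """Return the index of the first invalid record, or -1 if the chain is intact."""
    prev_hash = GENESIS
    for index, record in enumerate(log):
        expected = _digest(prev_hash, record["entry"])
        if record["prev"] != prev_hash or record["hash"] != expected:
            return index
        prev_hash = record["hash"]
    return -1
```

Returning the first failing index, not just a boolean, gives the tamper test something precise to assert on and tells an operator where the trusted prefix of the log ends.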

Comprehensive event coverage 11%

Event coverage in this codebase is only partial: dashboard-auth security events (login/refresh/session verification) are written to a dedicated structured JSON-lines audit log with UTC timestamps and redaction of token-like fields, and the API server emits structured per-run/tool/message SSE events with timestamps and correlation IDs. However, user data export is not audit-covered in a provable/structured way (the session export path only logs client-side notifications), and broader “comprehensive” coverage for permission/role changes and sensitive CRUD/export operations across the API layer is not evidenced by audit writes to the dedicated audit store.

  • high

    Add server-side audit emission for data exports (at the API handler that authorizes and returns export/download content, or at the backend export endpoint), producing a structured audit record (actor/session/tenant, action=export, resource, timestamp) alongside the existing dashboard-auth audit log / evidence store.

  • high

    Extend comprehensive audit coverage from dashboard-auth only to all security-relevant sensitive operations (permission/role changes, CRUD on persisted records like sessions/responses, and approvals) by wiring a single authoritative evidence emitter into those backend handlers instead of relying on SSE-only lifecycles.

  • med

    Ensure the sensitive-action timeline is queryable for auditors/customers by providing a tenant-scoped (or operator-scoped) audit-read/export interface over the structured audit store, rather than only writing to a local file without an auditable read/export endpoint.

    • hermes_cli/dashboard_auth/audit.py:1-88 — Audit events are written to a local log file path ($HERMES_HOME/logs/dashboard-auth.log) but this file-level evidence is not shown to have a read/export API in the audited code paths.
Queryable, provable audit access 0%

The codebase writes some audit-like events to a local JSONL file (dashboard-auth audit), and builds request metadata for security/audit warnings. However, there is no tenant-scoped, paginated audit-read API and no exportable, independently-verifiable audit evidence trail accessible to customers/auditors—so this primitive is absent.

  • high

    Implement a dedicated structured audit event store (separate from application logs) that persists tenant-scoped events with actor, resource identifiers, timestamps, policy/state fields, and cryptographic/tamper-evidence metadata (e.g., hash chaining or signed records).

  • high

    Add an external audit-read API surface with tenant scope + pagination (e.g., GET /api/tenants/{tenant_id}/audit-events?page=...&page_size=...) and an export endpoint that outputs verifiable evidence (e.g., signed JSON bundle or CSV + integrity proof).

  • med

    Wire audit emission so that all security-relevant actions (auth/login outcomes, session verification, revocations, permission/policy changes, exports) produce structured audit events persisted to the evidence store, not only warnings/log lines.

    • hermes_cli/dashboard_auth/audit.py:1-88 — The module defines event types and a file-write function; verify which sensitive actions are covered and ensure everything required for provable audit access is persisted to the new evidence store with immutable semantics.
Audit retention & separation of duties 0%

The codebase writes a structured dashboard-auth audit log to a local JSON-lines file, but there is no implemented retention policy enforcement (no TTL/retention config or purge job) and no verifiable separation-of-duties controls around changing retention. As a result, the primitive (enforced audit retention + insider-proof separation of duties) is not provably satisfied.

  • high

    Add an explicit, enforced retention policy for the dashboard-auth audit log (e.g., configurable TTL window meeting compliance requirements) plus an automated purge/archive job that runs on a schedule and is covered by tests verifying it cannot be bypassed by normal admin operations.

  • high

    Implement separation of duties for retention changes: ensure only the logging/audit subsystem account (not general admins) can modify retention configuration, and that any retention-change operation itself emits an audit event to the same immutable evidence trail.

  • med

    If regulatory needs require tamper evidence, make the audit log rotation/purge pipeline append-only and consider integrity verification (e.g., hash-chain per entry or periodically signed log segments) and audit the deletion/archival actions.

Data residency / region pinning N/A

The codebase contains logic for selecting a cloud provider region (e.g., AWS Bedrock client creation/caching), but it does not implement data residency / region pinning as a tenant-governed residency control. There is no provable mechanism tying a tenant’s region attribute to where tenant data and compute run (including region-keyed routing across per-region infrastructure).

  • high

    If the product is intended to provide residency guarantees, introduce an explicit tenant-region field in the tenant/org data model and enforce it end-to-end: (1) region-keyed placement of persistent tenant data stores, (2) region-keyed compute/provider routing for all tenant work, and (3) region pinning for every secondary sink (backups, exports, analytics, and any outbound syncs).

    • agent/bedrock_adapter.py:60-150 — Current `region` usage is limited to SDK client construction/caching for inference, which demonstrates provider region selection but not tenant residency pinning enforcement.
No cross-region leakage N/A

No cross-region leakage / data-sink residency enforcement is evident in this codebase. While the project has some generic “sync” and “export” functionality and does process/forward content (e.g., session exports, image routing), there is no implemented, auditable mechanism that pins all data sinks to a tenant region (or blocks sync/export/backup/analytics to out-of-region destinations).

  • high

    Identify all data sinks involved in the product’s end-to-end data lifecycle (primary store, backups/snapshots, analytics/telemetry, derived stores, and any third-party sync/export). For each sink, implement region-keyed routing and add enforcement that blocks out-of-region destinations for tenant-scoped data.

    • apps/desktop/src/lib/session-export.ts:1-57 — Current export is a local client download with no region-aware sink placement; if exports are also produced server-side/externally elsewhere, they must be region pinned and blocked.
  • med

    Add a tenant region attribute to the authoritative data model (if it does not exist yet) and wire it into every sink configuration (backup/replication/analytics/export). Verify that derived/snapshot/analytics pipelines also reference the tenant region and cannot bypass it.

    • agent/image_routing.py:1-220 — Representative example of application routing logic exists, but there is no residency/region enforcement in routing helpers—indicating the need to add region-aware configuration to the data pipeline layer.
Data-subject rights (export & erase) 0%

The codebase contains partial building blocks resembling export/erase: (1) a session export that downloads session messages as JSON, and (2) memory “forget”/delete operations for two memory providers (Supermemory and RetainDB). However, there is no evidence of a complete GDPR/CCPA-style data-subject rights (export & erase) primitive with (a) an identity-verified DSR export handler that returns all subject data, (b) a DSR erase handler that cascades to backups and derived/indexed stores, and (c) an auditable DSR job/event trail that an external auditor can independently verify.

  • high

    Add explicit DSR backend handlers (export and erase) on the server (likely near gateway/platform endpoints) that: verify the requesting subject, enumerate all relevant data across every store used by the product (sessions, message history, memory providers, file stores, any caches/indices), and return a structured response tied to a DSR job id.

  • high

    Implement an auditable DSR erase job that guarantees cascade behavior: delete/forget in each backing store plus retractions from derived/indexed representations and any local write-behind/queueing mechanisms; record immutable DSR audit evidence for start/end status and the exact resources targeted.

  • med

    Extend/replace client-side session export with a backend-driven subject export that guarantees completeness (“all of a subject's data”), includes data provenance, and provides a stable export artifact (e.g., generated server file) tied to a DSR job record.

Customer-controlled keys N/A

No implementation of “customer-controlled keys” (BYOK/per-tenant customer-managed encryption keys with customer-supplied import, scheduled rotation, and revocation/crypto-shred) was found in this codebase. The code contains credential configuration UI (global env var handling) and platform-specific crypto utilities, but not per-tenant key management suitable for customer-controlled encryption key governance.

  • high

    Add a crypto-governance key-management module that supports per-tenant encryption key references, customer-supplied key import, scheduled rotation, and explicit revocation semantics (crypto-shred). Document the tenant key lifecycle and enforce it in all data-at-rest encryption/decryption paths.

  • med

    Expose customer-facing APIs or UI endpoints for key import, rotation, and revocation that are scoped to a tenant identifier, and ensure rotation is auditable and actually re-encrypts or schedules re-encryption of tenant data as required.

  • med

    Integrate the per-tenant key reference into the encryption layer used for persisted data (and ensure all ciphertext-producing components use the tenant key, not a single global key).

Sub-processor / data-flow transparency 0%

No in-repo, versioned sub-processor / data-flow transparency inventory (or equivalent auditable mechanism) was found. While the codebase contains concrete outbound integrations to third parties (notably OpenAI and Anthropic), those sinks are not backed by a declared, auditable, versioned inventory that would let an auditor reconcile “data flows” with a DPA sub-processor list.

  • high

    Add a versioned, in-repo sub-processor/data-flow inventory artifact (e.g., SUBPROCESSORS.md or dataflow-inventory.json) that explicitly lists each outbound provider, what data types are sent, and where it is used (module/function references). Ensure it is current and matches runtime sinks.

  • high

    Create cross-references from each outbound integration module to the inventory (e.g., comments or code-level constants that include an inventory entry id/version), so auditors can verify mapping from code → documented third party/data-flow.

  • med

    Add a lightweight consistency check in CI (static scan or unit test) that flags new third-party SDK imports / outbound endpoints without a corresponding inventory entry update.
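
The CI consistency check could start as a static scan of imports, as sketched below. `INVENTORY`, `WATCHED`, and `undeclared_imports` are hypothetical; the package names are examples only, and a real check would load the inventory from the versioned artifact and walk every source file in the repository.

```python
import ast
from typing import Dict, Set

# Hypothetical inventory: top-level package name -> declared sub-processor.
INVENTORY: Dict[str, str] = {"openai": "OpenAI", "anthropic": "Anthropic"}
# Hypothetical watch-list of packages that imply an outbound third-party flow.
WATCHED: Set[str] = {"openai", "anthropic", "boto3", "fal_client"}


def undeclared_imports(source: str) -> Set[str]:
    """Return watched top-level packages imported by `source` but not inventoried."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which are always first-party.
            found.add(node.module.split(".")[0])
    return (found & WATCHED) - set(INVENTORY)
```

A unit test can then assert that `undeclared_imports` returns an empty set for each module, so adding a new provider SDK fails CI until the inventory is updated. Outbound calls made through a generic HTTP client would not be caught by an import scan and need a separate rule.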

Not applicable to this codebase: Data residency / region pinning, No cross-region leakage, Customer-controlled keys.

T2 Execution Velocity

Performance Primitives

A caching layer, an async job runtime, connection pooling, and indexes on the columns that actually need them.

66% 11/11 scored
  • Redundant work in loops 0%
    0/3 expected sites
  • Bounded interfaces 0%
    0/8 expected sites not present
  • Memoization / caching 122%
    4/3 expected sites
  • Resource reuse / pooling 133%
    4/3 expected sites
  • Off-critical-path execution 100%
    2/2 expected sites
  • Lookup data structures 0%
    0/2 expected sites
  • Batching round-trips 100%
    2/2 expected sites
  • Shared-state synchronization 167%
    5/3 expected sites
  • Bounded concurrency / backpressure 0%
    0/2 expected sites
  • Lazy / minimal computation 100%
    2/2 expected sites
  • Streaming over buffering 0%
    0/3 expected sites not present
Redundant work in loops 0%

Redundant work in loops is present in multiple places: (1) curator skill classification uses deeply nested loops with repeated regex/path matching work, (2) Excel DCF validation performs per-cell nested scanning over multiple error patterns, and (3) Telegram DM topic lookup uses nested scans over cached topics and config on each lookup. I did not find any spot where the redundant work was hoisted, batched, or memoized to make the per-iteration cost roughly constant.

  • high

    In `agent/curator.py`, reduce the multiplicative nested work in `_classify_removed_skills`: precompile regexes for `needles`, pre-normalize needle variants once, and build indexes from `parsed_calls` (e.g., a mapping from target skill name → referenced removed skill names based on fields). Then replace repeated inner-loop `re.search(...)` and repeated haystack scanning with O(1)/amortized lookups.

    • agent/curator.py:630-709 — Shows repeated regex/path-component matching inside multiple nested loops (`for name in removed` → `for args in parsed_calls` → `for key in (...)` → `for needle in needles`).
  • high

    In `gateway/platforms/telegram.py`, change `_get_dm_topic_info` to avoid nested scanning of `_dm_topics_config` for each cache hit. Build a single lookup table keyed by `(chat_id, thread_id, topic_name)` (or `(chat_id:topic_name, thread_id)` mapping) during `_reload_dm_topics_from_config`, and have `_get_dm_topic_info` return directly from that map without iterating `for chat_entry ... for t in chat_entry.get('topics', [])`.

    • gateway/platforms/telegram.py:5750-5860 — Demonstrates that `_get_dm_topic_info` iterates over cached items and then iterates again over `_dm_topics_config` and `topics` to construct the full return object.
  • med

    In `optional-skills/finance/dcf-model/scripts/validate_dcf.py`, reduce per-cell × per-error-type scanning: replace the inner `for err in excel_errors` loop with a single classification strategy (e.g., check using a compiled pattern that matches any known error token and extract which one matched), and avoid repeated coordinate lookups where possible (iterate formulas and values together or cache `ws_formulas` row-wise).
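
The "single classification strategy" for Excel errors can be sketched as one precompiled alternation that is built once and applied once per cell. The token list is the set of common Excel error values and is an assumption about what `validate_dcf.py` checks; `count_errors` is a hypothetical helper. It records only the first token found in a cell, which differs from a per-token loop if a cell can contain several.

```python
import re
from collections import Counter
from typing import Iterable

# Common Excel error tokens; the list used by validate_dcf.py is an assumption.
EXCEL_ERRORS = ["#REF!", "#DIV/0!", "#VALUE!", "#NAME?", "#N/A", "#NUM!", "#NULL!"]

# Compiled once, outside any loop: one scan per cell instead of one per token.
_ERROR_RE = re.compile("|".join(re.escape(token) for token in EXCEL_ERRORS))


def count_errors(cell_values: Iterable[object]) -> Counter:
    """Count the first error token found in each string cell."""
    counts: Counter = Counter()
    for value in cell_values:
        if isinstance(value, str):
            match = _ERROR_RE.search(value)
            if match:
                counts[match.group(0)] += 1
    return counts
```

The same hoisting pattern applies to the curator case: compile each needle's pattern before entering the nested loops so that the innermost iteration performs a match and nothing else.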

Bounded interfaces 0%

The bounded-interfaces primitive is not applied: multiple public surfaces return complete/unbounded collections (e.g., `list_providers()` across registries, `list_oauth_providers()` in the web server, and skill listing helpers). For these collection-returning APIs/handlers, there are no `limit`/cursor/iterator parameters or other bounding mechanism visible at the interface boundary.

  • high

    Add bounding controls to public collection-returning APIs: introduce `limit` (and ideally `cursor`/`offset` or an iterator/streaming interface) to `list_providers()` functions and ensure callers can request partial results.

  • high

    Paginate collection responses at HTTP boundaries. For `GET /api/providers/oauth`, add query params like `limit` and `cursor` (or at least `limit`) and slice `providers` accordingly.

  • med

    Bound skill-listing helpers to avoid returning complete directories/manifests in one response; add `limit` (and optionally cursor) and propagate it to CLI/API consumers.

    • tools/skill_usage.py:280-390 — `list_agent_created_skill_names()` and `list_archived_skill_names()` return complete sorted lists with no interface-level bound.
  • med

    Expose bounding parameters on Teams artifact listing functions (`list_transcript_artifacts`, `list_recording_artifacts`) rather than collecting the entire paginated result into an in-memory list unconditionally.
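
A bounded listing interface with `limit` and an opaque-looking cursor might be sketched like this. `paginate` and `MAX_LIMIT` are hypothetical, the cursor is simply the last name returned, and names are assumed to be unique strings. The sketch bounds the response, not the underlying scan: the full collection is still sorted on each call.

```python
import bisect
from typing import List, Optional, Sequence, Tuple

MAX_LIMIT = 100  # Hypothetical server-side ceiling on page size.


def paginate(items: Sequence[str], limit: int = 20,
             cursor: Optional[str] = None) -> Tuple[List[str], Optional[str]]:
    """Return up to `limit` items after `cursor`, plus the next cursor or None."""
    if limit < 1:
        raise ValueError("limit must be >= 1")
    limit = min(limit, MAX_LIMIT)
    ordered = sorted(items)
    # Resume strictly after the cursor value, so pages never overlap.
    start = 0 if cursor is None else bisect.bisect_right(ordered, cursor)
    page = ordered[start:start + limit]
    next_cursor = page[-1] if start + limit < len(ordered) else None
    return page, next_cursor
```

A value-based cursor stays valid when items are added or removed between calls, which an integer offset does not guarantee.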

Memoization / caching 122%

Memoization/caching is clearly present. The strongest, correctness-focused implementation is in katex-memo.ts: it uses a bounded LRU cache keyed by equation inputs and safely clones cached subtrees to prevent mutation bugs. Additional caching is used in the desktop UI for external link title fetching (including in-flight request deduping) and for derived pane-state atoms (to keep subscriptions stable).

  • high

    Add bounding/eviction for the external-link title cache (titleCache/titleInflight/titleSubs) to prevent unbounded growth over long sessions; consider an LRU with a reasonable max entries and clearing related inflight/subscriber state on eviction.

  • med

    Add explicit invalidation semantics for cached link titles if the bridge can return time-varying results (e.g., rotating “attention required” pages). If titles are effectively stable, document that assumption next to the cache implementation.

  • low

    For katex-memo, consider exposing CACHE_LIMIT as a configuration constant (or adding lightweight telemetry) so performance tuning can be done without editing source.

Resource reuse / pooling 133%

Resource reuse / pooling is present and correctly applied in the main expensive-client hotspots: Bedrock boto3 clients are cached per region; managed FAL clients for both image generation and video generation are cached and reused across requests (with locking and config identity checks). This avoids rebuilding clients/HTTP connection pools on each call.

  • high

    If there are other call paths that construct per-request API clients (e.g., auxiliary/provider client resolution in agent/auxiliary_client.py for OpenAI/httpx-based SDKs), confirm they use a shared bounded cache with eviction/invalidation and that cache keys include all parameters that affect transports (base_url, auth mode, model/endpoint, etc.). Add/extend tests to assert reuse across consecutive calls similarly to the managed FAL tests.

    • agent/auxiliary_client.py:1-75 — Auxiliary client is explicitly positioned as shared; ensure the concrete client construction sites are also pooled and not re-created per call.
Off-critical-path execution 100%

This codebase does apply off-critical-path execution: after each turn, it defers a slow self-improvement review (forked agent + tool/memory/skill writes) into a separate daemon thread. The background worker is isolated (stdout/stderr suppression, tool whitelist) and has robust exception handling and cleanup to prevent foreground latency or crashes.

  • med

    Confirm there are no other per-message/per-turn hot paths in the gateway or agent loop that perform slow network/file operations inline (e.g., media caching, large tool payload processing, provider calls) without deferral to a worker/queue. If found, route them through the existing background-thread/worker patterns or a centralized worker pool with retry/idempotency.

    • gateway/platforms/base.py:560-760 — Contains async media/audio caching with retries (network I/O + file writes). Verify call sites don’t invoke these inline on the foreground critical path.
  • low

    For any background jobs added beyond background_review, standardize: (1) tool/file/network work fully in worker, (2) bounded retry policy with idempotency keys for memory/skill writes, (3) explicit cleanup/shutdown in finally blocks, mirroring background_review’s pattern.

Lookup data structures 0%

This codebase does use lookup data structures effectively (notably in the LSP service manager via dict/set caches and membership checks). However, the skill lookup helpers in tools/skill_manager_tool.py appear to rely on repeated recursive filesystem scans for name-based lookup, where an indexed cache/map would better match the intended O(1)/O(log n) lookup primitive.

  • high

    Add a cached index for skill-name → skill directory (and optionally per-profile) built once (or incrementally updated) and used by _find_skill and _find_skill_in_other_profiles instead of rglob scans on every call.

    • tools/skill_manager_tool.py:330-405 — _find_skill performs nested loops over directories and rglob("SKILL.md") with a string comparison; this is a linear scan pattern for a repeated lookup key (skill name).
    • tools/skill_manager_tool.py:405-488 — _find_skill_in_other_profiles repeats the same recursive search pattern across profile roots; should reuse a prebuilt lookup map.
  • med

    If filesystem indexing is hard, at least memoize the results for the active roots (with invalidation on skill directory mtime changes or when write actions occur) so repeated calls in a session avoid re-walking the tree.

    • tools/skill_manager_tool.py:330-405 — A straightforward memoization layer can wrap _find_skill(name) because the function’s result depends on filesystem state under get_all_skills_dirs().
  • low

    Confirm whether _find_skill / _find_skill_in_other_profiles are on any hot path by checking call frequency from tools/skill_manager_tool entrypoints and CLI flows; if rare, prioritize index work elsewhere first.

    • tools/skill_manager_tool.py:330-520 — These helpers are used for collision detection and error messaging; their exact frequency will determine whether an index is worth the complexity.
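
A rough sketch of the index recommended above, under the assumption that a skill's name equals the name of the directory containing its SKILL.md (the real `_find_skill` may compare names differently). `SkillIndex` and its methods are hypothetical names, not part of tools/skill_manager_tool.py.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


class SkillIndex:
    """Name -> directory map built by one tree walk, then reused (hypothetical)."""

    def __init__(self, roots: Iterable[Path]) -> None:
        self._roots = list(roots)
        self._by_name: Optional[Dict[str, Path]] = None

    def _build(self) -> Dict[str, Path]:
        index: Dict[str, Path] = {}
        for root in self._roots:
            if not root.is_dir():
                continue
            for skill_md in root.rglob("SKILL.md"):
                # Assumption: directory name is the skill name; first root wins.
                index.setdefault(skill_md.parent.name, skill_md.parent)
        return index

    def find(self, name: str) -> Optional[Path]:
        if self._by_name is None:  # walk the filesystem only on first use
            self._by_name = self._build()
        return self._by_name.get(name)

    def invalidate(self) -> None:
        """Call after any write action that adds, renames, or removes a skill."""
        self._by_name = None
```

After the first call, lookups are dictionary reads. The trade-off is that writers must call `invalidate()`, which is why the low-priority item about measuring call frequency first is reasonable.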
Batching round-trips 100%

Batching round-trips exists and is applied correctly at key I/O boundaries: SSE text deltas are buffered and emitted as combined updates on a short timer, and nested delegate tool progress events are summarized in batches rather than relayed one-by-one.

  • high

    Search for remaining patterns where the code performs per-item network/database writes inside loops (e.g., per-message/per-tool/per-record persistence or per-item HTTP calls) and replace them with bulk/batched variants (multi-write, executemany, bulk insert, batched API calls).

  • med

    Add/extend tests that assert batching behavior indirectly (e.g., number of SSE/emitted deltas or number of parent progress relays stays sub-linear with N deltas/tools).

Shared-state synchronization 167%

This codebase does implement the shared-state synchronization primitive. Critical shared mutable state is protected with minimal-scope locking: in-process shared dictionaries for dashboard auth providers and WS tickets are guarded by `threading.Lock`, and cross-process gateway runtime ownership is synchronized via OS file locks. No broad or unguarded shared writes were observed in the inspected synchronization hotspots.

  • med

    Extend the audit to other shared caches/singletons (e.g., any shared dict/registry used by concurrent request handlers) and confirm each read-modify-write path is consistently guarded or uses atomic/lock-free structures appropriate to the language runtime.

  • low

    For cross-process locks in `gateway/status.py`, ensure all call sites always pair acquire/release correctly (especially around exceptions) so the lock handle lifecycle remains consistent.

    • gateway/status.py:360-470 — Lock ownership is tracked via `_gateway_lock_handle` and released by `release_gateway_runtime_lock`; call-site pairing should be verified.
Bounded concurrency / backpressure 0%

The codebase has bounded concurrency/backpressure in at least one hot fan-out: trajectory_compressor.py caps concurrent in-flight summarization/API calls using an asyncio.Semaphore. In the gateway message dispatcher, new message handling intentionally spawns background tasks for interruption support, but there is no corresponding global concurrency/in-flight cap, so bursts can create unbounded fan-out (an expected site that is currently unmatched). During shutdown, cancel_background_tasks uses bounded drain rounds and timeouts to avoid unbounded shutdown waits.

  • high

    Add a global (or per-adapter/per-platform) in-flight cap for _start_session_processing / _process_message_background so inbound message bursts can’t spawn unbounded concurrent tasks. For example, wrap background task creation or the body of _process_message_background with an asyncio.Semaphore (or a task queue with max workers) and ensure tasks beyond the cap apply backpressure (delay/queue/merge) instead of immediately spawning.

    • gateway/platforms/base.py:3772-3962 — handle_message spawns background tasks to process messages while allowing interruption; this is the primary message fan-out point that should be bounded.
    • gateway/platforms/base.py:3560-4620 — _process_message_background creates per-task additional concurrent work (e.g., typing_task via asyncio.create_task), compounding unbounded fan-out when message-processing tasks are unbounded.
  • med

    Consider bounding the per-message typing-task concurrency as well (e.g., reuse a shared rate-limited typing indicator scheduler, or guard asyncio.create_task(self._keep_typing(...)) behind a semaphore) to reduce multiplicative load under bursts.

  • low

    Extend the same semaphore/worker-pool pattern used in trajectory_compressor.py to other high-cardinality fan-out utilities (if present elsewhere) to keep concurrency behavior consistent across CLI tooling and the live gateway.
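
One way the in-flight cap described above could look, as a sketch only. `BoundedDispatcher` is a hypothetical wrapper; it does not reproduce `_start_session_processing` or the interruption logic in gateway/platforms/base.py.

```python
import asyncio
from typing import Awaitable, Callable


class BoundedDispatcher:
    """Caps concurrently running handlers; extra submissions wait for a slot."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = asyncio.Semaphore(max_in_flight)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, useful in tests

    async def submit(self, handler: Callable[[], Awaitable[None]]) -> "asyncio.Task[None]":
        # Backpressure: the caller is suspended here once the cap is reached,
        # instead of spawning another task immediately.
        await self._slots.acquire()
        return asyncio.create_task(self._run(handler))

    async def _run(self, handler: Callable[[], Awaitable[None]]) -> None:
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        try:
            await handler()
        finally:
            self.in_flight -= 1
            self._slots.release()  # always free the slot, even on error or cancel
```

Acquiring the slot before `create_task` is the design choice that matters: acquiring inside the task body would still bound running work, but not the number of pending tasks created during a burst.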

Lazy / minimal computation 100%

The primitive is present and applied cleanly in two main places: (1) KaTeX rendering during streaming is memoized so only newly-arrived/changed math expressions are recomputed (minimal computation at the UI work boundary), and (2) tool-gateway readiness/token logic avoids synchronous OAuth refresh by using a cheap probe by default and only refreshing when the caller actually needs a valid token for a request.

  • high

    Add a small explicit comment or test assertion demonstrating that readiness checks (is_managed_tool_gateway_ready) do not trigger refresh when cached tokens are present, to lock in the intended lazy boundary.

  • med

    For katex-memo.ts, consider adding a targeted benchmark/test that validates cache hits avoid calling katex.renderToString (e.g., by stubbing katex in a unit test) to ensure the minimal-computation boundary stays effective.

Streaming over buffering 0%

I did not find a code path that applies the “streaming over buffering” primitive (bounded-memory streaming/chunk iteration over arbitrarily large inputs). Instead, the main anti-pattern appears in `read_file_raw()` (full `cat` into a single string) and in patch validation, which calls `read_file_raw()` for UPDATE operations, buffering whole files before processing.

  • high

    Replace `read_file_raw()` usage in patch validation (`_validate_operations` / apply flow) with an incremental/streaming algorithm that does not require full-file buffering (e.g., process line-by-line with bounded windows matching the patch hunks, or apply hunks using streaming search/replace over iterators).

    • tools/patch_parser.py:239-260 — UPDATE validation calls `file_ops.read_file_raw(op.file_path)` which buffers the entire file content before patch simulation.
  • high

    Redesign `read_file_raw()` (and/or its backends) to be bounded-memory: return an iterator/stream of lines/chunks (or accept a callback/consumer) rather than a full in-memory string, when the input size is untrusted or can be large.

    • tools/file_operations.py:950-1023 — `read_file_raw()` reads the whole file using `cat` and returns `content=raw_content` (single full string), which breaks the constant-memory requirement for large inputs.
  • med

    Add guardrails/tests that assert memory boundedness for patch/update flows on large files (e.g., generate a large temporary file and verify the code does not call `read_file_raw()` or load full contents when offset/limit or patch hunks would suffice).
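
To illustrate the "bounded windows matching the patch hunks" idea above, here is a small sketch that locates a hunk's context lines while holding only as many lines in memory as the hunk itself contains. `find_hunk` is a hypothetical helper and is not how tools/patch_parser.py currently works.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional


def find_hunk(lines: Iterable[str], context: List[str]) -> Optional[int]:
    """Return the 0-based index of the first line where `context` matches, else None.

    `lines` may be any iterator (for example an open file object), so memory use
    is bounded by len(context) rather than by the file size.
    """
    if not context:
        return 0
    window: Deque[str] = deque(maxlen=len(context))
    for i, line in enumerate(lines):
        window.append(line.rstrip("\n"))
        if len(window) == len(context) and list(window) == context:
            return i - len(context) + 1
    return None
```

A caller would pass an open file handle directly, e.g. `with open(path, encoding="utf-8") as fh: find_hunk(fh, hunk_lines)`, so the file is consumed line by line.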

Reliability Primitives

Retries, circuit breakers, idempotency keys, health checks, and a runbook for each service.

72% 10/11 scored
  • Timeouts 100%
    4/4 expected sites
  • Retry with backoff + jitter 67%
    1/1 expected sites
  • Idempotency 0%
    0/1 expected sites
  • Graceful degradation / fallback 100%
    3/3 expected sites
  • Error handling & propagation 0%
    0/1 expected sites
  • Deterministic resource cleanup 100%
    1/1 expected sites
  • Atomicity / all-or-nothing 67%
    1/1 expected sites
  • Input / boundary validation 100%
    4/4 expected sites
  • Failure isolation / bulkheading 100%
    1/1 expected sites
  • Graceful shutdown 83%
    2/2 expected sites
Timeouts 100%

Timeouts are implemented and wired through the core auxiliary LLM client path. Provider/model timeout configuration is centralized in hermes_cli/timeouts.py, propagated through agent/auxiliary_client.py into provider request kwargs via _build_call_kwargs, and the Codex streaming adapter additionally enforces a hard monotonic deadline with client close/eviction on timeout.

  • high

    Audit other I/O boundaries for missing/optional timeout propagation (e.g., non-LLM streaming tools, subprocess readers, websocket/SSE loops) by locating each external/blocking call site and verifying it always receives a deadline/timeout or is wrapped by an equivalent bounded watchdog.

    • agent/auxiliary_client.py:586-780 — Codex streaming is handled correctly (deadline + close/evict), so the main risk is other adapters/clients that may not have the same level of enforcement.
  • med

    Standardize timeout semantics across call chains (confirm consistent meaning of `timeout` across providers/tasks: connect timeout vs total request timeout vs stream idle timeout) so that timeouts behave predictably under retries and streaming.

    • hermes_cli/timeouts.py:1-83 — Current helpers distinguish request timeout vs stale timeout; ensure all call sites interpret these consistently when wiring into streaming vs non-streaming code.
Retry with backoff + jitter 67%

The codebase contains a dedicated jittered exponential backoff helper (`agent.retry_utils.jittered_backoff`) with a capped budget and bounded jitter. However, the main production retry path for transient message send failures (`gateway/platforms/base.py::_send_with_retry`) implements exponential backoff with only a small fixed-range jitter, and does not use the shared jittered-backoff utility. The retry is applied at the right location, but the implementation only partially meets the 'backoff + jitter' primitive.

  • high

    Update `gateway/platforms/base.py::_send_with_retry()` to use the shared `agent.retry_utils.jittered_backoff()` (or match its behavior): make jitter proportional to the computed delay, use a configurable `max_delay` cap (in addition to `max_retries`), and ensure delays are decorrelated across concurrent sessions.
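
For reference, jitter that is proportional to the computed delay and capped by `max_delay` can be sketched with the well-known "full jitter" scheme. This is an illustration only: `backoff_delay` is a hypothetical function and does not claim to match the signature or behavior of `agent.retry_utils.jittered_backoff`.

```python
import random
from typing import Optional

_DEFAULT_RNG = random.Random()


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    max_delay: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(max_delay, base * 2**attempt)]."""
    generator = rng if rng is not None else _DEFAULT_RNG
    ceiling = min(max_delay, base * (2 ** attempt))
    # Drawing from the whole interval scales the jitter with the delay and
    # spreads concurrent sessions apart instead of retrying in lockstep.
    return generator.uniform(0.0, ceiling)
```

Passing a seeded `random.Random` makes the delays reproducible in tests, while production callers can rely on the default generator.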

Idempotency 0%

Idempotency is implemented for the API server’s non-streaming chat-completions path via an _IdempotencyCache that deduplicates concurrent requests using (Idempotency-Key + a request fingerprint). However, the generic send retry logic in gateway/platforms/base.py appears to retry by re-sending without a visible idempotency/dedup guard at the retry boundary, which is the highest-risk gap for duplicate side effects on transient failures.

  • high

    Add an idempotency/dedup mechanism to the send retry boundary (gateway/platforms/base.py:_send_with_retry). For example: generate one idempotency token per logical message, reuse it on every retry attempt, and/or consult a per-chat outbound-message dedup cache so repeated self.send() calls on transient errors don’t create duplicates.

    • gateway/platforms/base.py:3269-3349 — Retry loop re-calls self.send() on transient/network errors; without a dedup/idempotency guard in this boundary, duplicates are possible on unhappy-path retries.
  • med

    For outbound adapters, ensure the dedup strategy used for inbound events (MessageDeduplicator) is also applied (or complemented) for outbound retries at the exact send boundary, not only for inbound message handling.
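
The sketch below shows why the token has to be stable across attempts: it is generated once per logical message, so a receiver that honours it can drop the duplicate when only the acknowledgement was lost. `FlakyTransport` is a stand-in written for this example, and whether a given platform API accepts such a token is an assumption.

```python
import uuid
from typing import List, Set


class FlakyTransport:
    """Hypothetical platform client that delivers, then loses the first ack."""

    def __init__(self) -> None:
        self.delivered: List[str] = []
        self._seen: Set[str] = set()
        self._fail_next_ack = True

    def send(self, text: str, idempotency_token: str) -> None:
        if idempotency_token not in self._seen:  # receiver-side dedup
            self._seen.add(idempotency_token)
            self.delivered.append(text)
        if self._fail_next_ack:
            self._fail_next_ack = False
            raise ConnectionError("ack lost after delivery")


def send_with_retry(transport: FlakyTransport, text: str, max_retries: int = 3) -> str:
    token = uuid.uuid4().hex  # one token per message, reused on every attempt
    for attempt in range(max_retries + 1):
        try:
            transport.send(text, idempotency_token=token)
            return token
        except ConnectionError:
            if attempt == max_retries:
                raise
    raise AssertionError("unreachable")
```

With a fresh token per attempt, the same scenario would deliver the message twice.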

Circuit breaking / fail-fast N/A

Agent produced no parseable output for this item.

Graceful degradation / fallback 100%

The codebase contains solid graceful degradation/fallback patterns. Desktop runtime-readiness probes catch gateway failures and return a fallback/unknown readiness result with an explicit reason. The agent chat helpers switch to a configured fallback model/provider chain when the primary backend fails. The gateway streaming consumer degrades from edit-based streaming to chunked final sends when edits stop working, preserving delivery of the core response.

  • high

    Audit remaining non-critical dependency calls in the same end-to-end flows (desktop onboarding readiness/model option fetching, agent provider/model routing, and streaming edits→fallback) to ensure all error branches either (a) return real fallbacks with explicit staleness/uncertainty or (b) preserve core output instead of aborting; focus specifically on catch blocks that currently return null/empty without an explicit staleness reason.

Error handling & propagation 0%

The primitive is implemented in at least one critical area: MCP server lifecycle handling explicitly preserves cancellation semantics (CancelledError re-raise), with time-bounded shutdown. However, there is at least one localized catch that swallows failures (`_write_stderr_log_header` uses `except Exception: pass`), which violates the 'never silently drop failures' expectation.

  • high

    Replace the silent swallow in `_write_stderr_log_header` with context-rich logging (and/or re-raising if this logging is considered important). At minimum, log the exception with `logger.debug/warning` including `server_name` to avoid silent failures.

  • med

    Audit other broad `except` blocks in `tools/mcp_tool.py` (and similar lifecycle/transport modules) for 'silent fallback' patterns. Where fallback is acceptable, ensure errors are logged with enough context (server name / operation / timeout) and that the fallback cannot mask partial corruption or deadlocks.

    • tools/mcp_tool.py:150-190 — Demonstrates both acceptable (debug fallback to devnull) and unacceptable (pass) error-handling styles within nearby stderr-log helpers.
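
A minimal sketch of the logged-fallback style recommended above. The function name `write_log_header`, its parameters, and the header format are assumptions for illustration; the real `_write_stderr_log_header` in tools/mcp_tool.py may take different arguments.

```python
import logging
from typing import TextIO

logger = logging.getLogger("mcp_stderr_example")


def write_log_header(stream: TextIO, server_name: str) -> bool:
    """Write a header line; on failure, log with context instead of passing silently."""
    try:
        stream.write(f"===== stderr for {server_name} =====\n")
        stream.flush()
        return True
    except Exception:
        # The header is non-critical, so do not re-raise, but record which
        # server was affected and keep the traceback for diagnosis.
        logger.warning(
            "failed to write stderr log header for server %r", server_name, exc_info=True
        )
        return False
```

Returning a boolean lets callers decide whether a missing header matters, without changing the best-effort behavior.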
Deterministic resource cleanup 100%

Deterministic resource cleanup is present and correctly applied at the primary handle acquisition site observed: `gateway/status.py` acquires a lock file descriptor with `os.open` and guarantees release by scoping it inside a `with os.fdopen(...)` block, covering the exception path during JSON writing.

  • med

    Apply the same pattern (scope-bound `with`/`finally`/RAII) at any other raw acquisition sites (e.g., other direct `os.open`/lock/socket acquisitions) where release is not obviously guaranteed on the throw path.

    • gateway/status.py:578-636 — This is the validated reference pattern in the codebase: low-level acquisition is immediately wrapped in scope-bound cleanup (`with os.fdopen(...)`).
Atomicity / all-or-nothing 67%

Atomicity/all-or-nothing behavior is present primarily in `agent/curator_backup.py` where the rollback of the skills directory is implemented with staging and best-effort restoration if extraction fails. Broader atomicity guarantees for DB multi-step updates (transactions) were not exhaustively audited here; the strongest concrete all-or-nothing pattern observed is the file-tree rollback recovery.

  • high

    Audit DB mutation sequences for multi-step consistency: identify functions that perform multiple related writes (e.g., inserting/updating several tables/rows that must stay consistent) and ensure they use explicit transactions (BEGIN/COMMIT/ROLLBACK or equivalent in the DB layer).

    • hermes_state.py:260-520 — SessionDB has careful WAL setup and retry strategy, but an atomicity audit requires reading specific multi-step write methods to confirm they wrap related mutations in transactions on failure paths.
  • med

    For file-tree atomicity, strengthen rollback recovery semantics: when staging, moving, or extraction fails, consider writing the snapshot into a fully isolated tempdir and then using an atomic rename/swap for the final step (where the filesystem supports it), instead of mutating the live directory and relying on best-effort move-back.

    • agent/curator_backup.py:529-667 — Current approach mutates the live skills directory via tar extraction and then restores from staged contents on failure; this is good best-effort atomicity, but it isn’t as robust as an atomic rename/swap finalization step.
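
The stage-then-swap approach suggested above might look like the following sketch. `replace_tree` and the `populate` callback are hypothetical names. The swap uses two renames, each atomic on a single POSIX filesystem, so there is still a brief moment where `live` does not exist, and the example assumes `live` already exists.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def replace_tree(live: Path, populate: Callable[[Path], object]) -> None:
    """Build the new tree in a sibling temp dir, then swap it in with renames."""
    parent = live.parent
    staged = Path(tempfile.mkdtemp(prefix=live.name + ".staged.", dir=parent))
    backup = parent / (live.name + ".bak")
    try:
        populate(staged)  # e.g. tar extraction; a failure here never touches `live`
    except BaseException:
        shutil.rmtree(staged, ignore_errors=True)
        raise
    if backup.exists():
        shutil.rmtree(backup)
    os.rename(live, backup)  # move the old tree aside
    try:
        os.rename(staged, live)  # move the new tree into place
    except BaseException:
        os.rename(backup, live)  # restore the old tree untouched
        shutil.rmtree(staged, ignore_errors=True)
        raise
    shutil.rmtree(backup, ignore_errors=True)
```

Because extraction happens entirely inside the staged directory, a failed extract needs no move-back of partially written files.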
Input / boundary validation 100%

The codebase applies input/boundary validation strongly in the dashboard-auth OAuth routes: it validates the `next` redirect target and enforces PKCE/state/provider checks on callback inputs before allowing redirects or login completion. CLI argv handling appears to rely on argparse for boundary constraints, but the strongest, most explicit validation is in `hermes_cli/dashboard_auth/routes.py`.

  • med

    Audit other public entry points for the same level of explicit validation (e.g., any remaining FastAPI/HTTP routes that consume query/path/body params), ensuring invalid inputs are rejected at the boundary and not only during downstream processing.

Failure isolation / bulkheading 100%

The codebase contains a strong bulkheading implementation for shared LLM/HTTP client resources in `agent/auxiliary_client.py`. It bounds the shared client cache and isolates async clients by validating the current open event loop, force-closing stale transports and evicting old entries to prevent shared-resource exhaustion from one failing workload.

  • low

    Add/confirm targeted tests that simulate (a) event-loop switching across gateway worker threads and (b) repeated aux calls that would otherwise expand the cache beyond the max size, asserting that stale cached clients are force-closed and that unrelated aux calls still succeed while the cache is being evicted.

Graceful shutdown 83%

Graceful shutdown support is present and well-implemented in the Node/TS TUI via a reusable setupGracefulExit helper that runs async cleanups (including killing the gateway and resetting terminal modes) before exiting. The Python tui_gateway entrypoint also registers signal handlers and uses a bounded grace period with a hard-failsafe exit, though it does not explicitly show draining of in-flight/queued work beyond exiting the stdin dispatch loop.

  • high

    In tui_gateway/entry.py, add explicit shutdown coordination so the signal handler stops accepting new stdin/dispatch work and waits for any in-flight/worker operations to complete (or to reach a safe cancellation point) before sys.exit/os._exit.

    • tui_gateway/entry.py:1-220 — Signal handler logs then starts a grace timer and immediately proceeds to sys.exit(0); no explicit drain/stop-accepting-work mechanism is shown in the handler.
    • tui_gateway/entry.py:220-299 — Main loop continuously reads sys.stdin and dispatches requests; process exit will abort this loop, but there is no visible coordination to finish work already dispatched before exiting.
  • med

    If gw.kill/cancellation in ui-tui may leave buffered writes in the gateway, ensure gw.kill has bounded completion semantics (e.g., awaiting a cancellation acknowledgement with a timeout) rather than relying solely on the outer failsafe.

    • ui-tui/src/entry.tsx:1-115 — Cleanup awaits gw.kill('graceful-exit-cleanup') via the helper’s Promise.allSettled, but the gateway-side kill semantics/timeout are not shown here.
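
On the Python side, the "stop accepting, then drain with a bounded wait" coordination recommended for tui_gateway/entry.py could be sketched as below. `DrainGate` is a hypothetical helper and does not describe the current entrypoint.

```python
import threading


class DrainGate:
    """Tracks in-flight work so shutdown can refuse new work and wait for the rest."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._accepting = True
        self._in_flight = 0

    def try_begin(self) -> bool:
        """Called by the dispatch loop before starting a request."""
        with self._cond:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Called in a finally block when a request finishes."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self, grace_seconds: float) -> bool:
        """Stop accepting work; return True if everything drained within the grace period."""
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=grace_seconds
            )
```

Since CPython runs Python-level signal handlers in the main thread, the handler itself should only trigger the shutdown (for example by starting a helper thread that calls `shutdown()` and then exits the process) rather than block on the condition directly.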

Not applicable to this codebase: Circuit breaking / fail-fast.

API & Extensibility

A checked-in OpenAPI spec, versioned routes, a webhook system with retries and signing, and tenant-scoped rate limits.

18% 7/10 scored
  • Machine-readable API contract 0%
    0/2 expected sites not present
  • Programmatic auth with scopes 0%
    0/2 expected sites not present
  • Idempotent writes 0%
    0/7 expected sites
  • Consistent pagination & filtering 0%
    0/4 expected sites
  • Consistent errors & status codes 33%
    5/5 expected sites
  • Sandbox / test mode 0%
    0/2 expected sites not present
  • Extension points / plugins 94%
    6/6 expected sites
Machine-readable API contract 0%

No checked-in, machine-readable API contract spec (OpenAPI/Swagger/AsyncAPI/proto/GraphQL SDL) was found anywhere in the repository. The codebase does include a public, external-facing OpenAI-compatible API adapter (gateway/platforms/api_server.py) with many documented routes, but there is no corresponding checked-in spec artifact that third parties can rely on to integrate without contacting the maintainers.

  • high

    Create and check in an OpenAPI (or equivalent) spec that covers the full public route inventory exposed by gateway/platforms/api_server.py, including all /v1/* and /api/* session endpoints and /health endpoints. Ensure the spec includes request/response schemas, example payloads, and a common error format with status codes and error codes.

    • gateway/platforms/api_server.py:1-70 — This module is the public API adapter and enumerates the external route set and intended clients; an API contract spec should exist alongside it to cover those routes.
  • high

    Add an automated sync/coverage mechanism: generate the spec from route definitions (or contract-test that the spec path list matches the registered route inventory) in CI, failing the build if the spec drifts or covers only a fraction of endpoints.

  • med

    If you want to keep using the existing /v1/capabilities endpoint, link it to the spec and document how clients can obtain the spec version (e.g., spec URL or embedded version hash), so it remains a stable, discoverable contract.

    • gateway/platforms/api_server.py:1-70 — /v1/capabilities is already described as machine-readable, but the primitive requires a checked-in spec file that drives docs/sample payloads and covers all public routes.
Versioning & backward compatibility N/A

I could not confirm a consumer-facing Versioning & backward compatibility strategy (no checked-in API contract/spec and no explicit versioning/deprecation/sunset policy for public HTTP routes). While there are internal “version” fields and some protocol/version compliance in non-HTTP contexts (e.g., dashboard-auth provider contract testing and runtime config/versioning regression guards), this does not amount to a stable, third-party discoverable versioning strategy for the codebase’s public API surface.

  • high

    Inventory the public HTTP API surface and add a checked-in machine-readable contract (OpenAPI/Swagger) covering *all* endpoints. Ensure the spec is tied to route registration (generated or contract-tested) so it can’t drift.

    • gateway/platforms/api_server.py:1-80 — This file documents the public HTTP endpoints (including /v1/* and /api/*), but the repo appears to have no checked-in OpenAPI/Swagger/AsyncAPI spec artifact discoverable via filename search.
  • high

    Define and implement a versioning policy for the public API: either (a) explicit versioned routes (/v1, /v2, …) with deprecation+sunset headers and migration docs, or (b) an unversioned route strategy that is strictly backward compatible with explicit deprecation markers for any breaking behavior.

    • gateway/platforms/api_server.py:1-80 — Public endpoints include both versioned (/v1/*) and unversioned (/api/sessions, /health) paths; no in-file deprecation/sunset policy is evident from the public surface documentation.
  • med

    Add contract tests for backward compatibility: run CI checks that compare current response/error schemas and pagination/filter conventions against the previous release (or golden contract) to detect breaking changes early.

Programmatic auth with scopes 0%

No implementation of programmatic auth with per-credential scopes (scoped, revocable API credentials distinct from the user session) was found. The OpenAI-compatible API server adapter authenticates callers using a single shared `API_SERVER_KEY` bearer token; related sensitive behavior like `X-Hermes-Session-Key` is only gated by whether that global key is configured, with no evidence of per-credential scopes/rotation/revocation/last-used.

  • high

    Replace the single shared `API_SERVER_KEY` bearer auth in `_check_auth` with a scoped credential model: issue per-credential tokens/keys that carry scopes; validate scopes on each endpoint/request (e.g., chat/run/read responses vs session/memory management). Add credential identifiers to logs for auditing and enforce revocation/rotation and last-used tracking.

    • gateway/platforms/api_server.py:806-858 — Current auth compares the provided bearer token directly to one configured secret (`self._api_key`) and returns a generic 401 on mismatch; no scope parsing/enforcement exists here.
  • high

    Scope-protect long-term memory/session scoping (`X-Hermes-Session-Key`) rather than gating solely on the global key being configured. Require specific scopes for allowing callers to supply/alter session keys and ensure the enforcement is tied to the credential used for auth (not only server configuration).

    • gateway/platforms/api_server.py:873-939 — `_parse_session_key_header` only checks whether `self._api_key` exists; if absent it returns 403, but it does not check any scopes associated with the presented credential.
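
As a sketch of the scoped credential model recommended above (not a description of `_check_auth`), the example below issues per-credential tokens, stores only a digest of the secret, and checks scope, revocation, and last-used time on each request. The token format `<id>.<secret>` and the scope names are assumptions.

```python
import hashlib
import hmac
import secrets
import time
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional


def _digest(secret: str) -> str:
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


@dataclass
class Credential:
    credential_id: str
    secret_digest: str
    scopes: FrozenSet[str]
    revoked: bool = False
    last_used: Optional[float] = None


class CredentialStore:
    def __init__(self) -> None:
        self._by_id: Dict[str, Credential] = {}

    def issue(self, scopes: Iterable[str]) -> str:
        credential_id = secrets.token_hex(4)
        secret = secrets.token_urlsafe(24)
        self._by_id[credential_id] = Credential(
            credential_id, _digest(secret), frozenset(scopes)
        )
        return f"{credential_id}.{secret}"  # shown to the caller once

    def revoke(self, credential_id: str) -> None:
        self._by_id[credential_id].revoked = True

    def authorize(self, token: str, required_scope: str) -> int:
        """Return an HTTP-style status: 200 allowed, 401 bad credential, 403 missing scope."""
        credential_id, _, secret = token.partition(".")
        cred = self._by_id.get(credential_id)
        if cred is None or cred.revoked:
            return 401
        if not hmac.compare_digest(cred.secret_digest, _digest(secret)):
            return 401
        if required_scope not in cred.scopes:
            return 403
        cred.last_used = time.time()
        return 200
```

Under this model, accepting `X-Hermes-Session-Key` would require a dedicated scope on the presented credential instead of depending only on whether a global key is configured.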
Per-tenant rate limiting N/A

Rate limiting logic exists for internal platform behaviors (e.g., Signal attachment scheduling and pairing-code flow-control), but there is no evidence of a per-tenant (per consumer) API-edge rate limiter with a stable third-party-facing HTTP contract (standard limit/remaining headers, 429, and retry guidance).

  • high

    If this project exposes any HTTP API surface for third-party integration, add an API-edge per-tenant/per-consumer rate limiter that keys buckets by tenant/credential, and emits standard headers (e.g., X-RateLimit-Limit / X-RateLimit-Remaining / Retry-After) plus a 429 response with clear retry guidance.

  • med

    Document the consumer/tenant identifier used for rate limiting (e.g., API key subject, org id, or workspace id) and ensure consistent application across all public entrypoints (not just some platform adapters).

    • gateway/pairing.py:1-30 — Pairing rate limiting is present but scoped to “per user” within a pairing flow, not a consistent per-tenant contract across a public API surface.
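
A per-consumer limiter that emits the headers named above can be sketched with a simple fixed-window counter. `FixedWindowLimiter` is hypothetical, and the fixed-window algorithm is a simplification chosen for brevity over a token bucket or sliding window.

```python
import math
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Counts requests per consumer in fixed windows and reports standard headers."""

    def __init__(
        self,
        limit: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._state: Dict[str, Tuple[float, int]] = {}  # consumer -> (window_start, count)

    def check(self, consumer_id: str) -> Tuple[int, Dict[str, str]]:
        now = self._clock()
        start, count = self._state.get(consumer_id, (now, 0))
        if now - start >= self._window:
            start, count = now, 0  # a new window begins
        headers = {"X-RateLimit-Limit": str(self._limit)}
        if count >= self._limit:
            self._state[consumer_id] = (start, count)
            headers["X-RateLimit-Remaining"] = "0"
            headers["Retry-After"] = str(math.ceil(start + self._window - now))
            return 429, headers
        count += 1
        self._state[consumer_id] = (start, count)
        headers["X-RateLimit-Remaining"] = str(self._limit - count)
        return 200, headers
```

The `consumer_id` would be whatever identifier the medium-priority item settles on, such as an API key subject or workspace id.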
Idempotent writes 0%

This codebase includes an idempotency mechanism for OpenAI-compatible API writes, implemented as an in-memory `_IdempotencyCache` with TTL, fingerprinting, and in-flight deduplication. However, the idempotency-key handling is only applied to the non-streaming portions of `POST /v1/chat/completions` and `POST /v1/responses`. Streaming branches and several dashboard session mutation endpoints (`/api/sessions`, `/api/sessions/{id}/fork`, PATCH updates, and session chat endpoints) do not handle `Idempotency-Key`, leaving retry/double-execution gaps for integrators.

  • high

    Add `Idempotency-Key` handling to the streaming branches of `POST /v1/chat/completions` and `POST /v1/responses` (so retries after timeouts/disconnect don’t double-run the agent).

  • high

    Implement idempotent replay for dashboard/session mutations: `POST /api/sessions`, `PATCH /api/sessions/{session_id}`, `POST /api/sessions/{session_id}/fork`, and `POST /api/sessions/{session_id}/chat` (and `.../chat/stream`) by requiring/handling `Idempotency-Key` for safe retries.

  • med

    Improve the correctness contract of idempotency replay beyond in-memory TTL: persist idempotency outcomes for a longer window and make replay behavior explicit (e.g., include a stable replay response identifier and return a distinct error/conflict shape when fingerprints differ for the same key).

    • gateway/platforms/api_server.py:620-820 — `_IdempotencyCache` is in-memory with TTL and supports fingerprint match, but does not demonstrate distinct conflict surfacing semantics for mismatched fingerprints or persistence across restarts.
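
The replay and conflict semantics described above can be illustrated with a small store that returns the cached response for a matching fingerprint and a distinct 409 shape for a mismatch. This sketch is not the repository's `_IdempotencyCache`: it is in-memory, omits in-flight deduplication, and the error body is an assumed shape.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Tuple


class IdempotencyStore:
    def __init__(
        self, ttl_seconds: float = 300.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, Any, float]] = {}  # key -> (fp, response, expiry)

    @staticmethod
    def fingerprint(payload: Any) -> str:
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def execute(
        self, key: str, payload: Any, handler: Callable[[Any], Any]
    ) -> Tuple[int, Any]:
        now = self._clock()
        fp = self.fingerprint(payload)
        entry = self._entries.get(key)
        if entry is not None and entry[2] > now:
            stored_fp, stored_response, _ = entry
            if stored_fp != fp:
                # Same key, different request body: surface a conflict, never re-run.
                return 409, {"error": {"code": "idempotency_conflict",
                                       "message": "key reused with a different payload"}}
            return 200, stored_response  # replay without invoking the handler
        response = handler(payload)
        self._entries[key] = (fp, response, now + self._ttl)
        return 200, response
```

Persisting `_entries` in a durable store instead of a dict is what would extend the guarantee across restarts.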
Consistent pagination & filtering 0%

Pagination/filtering is implemented only partially: GET /api/sessions uses bounded limit+offset and a `source` filter, but list endpoints like GET /v1/models, GET /v1/skills, and GET /v1/toolsets are returned unpaginated and do not share a cursor-based pagination/filtering convention. As a result, third-party integration cannot rely on a consistent list contract across collections.

  • high

    Introduce a consistent cursor-based pagination contract across all list endpoints in gateway/platforms/api_server.py (e.g., standard `limit` + `cursor`/`next_cursor` query params) and ensure every list endpoint shares the same bounded page size behavior.

  • high

    Establish and apply a common filter convention for list endpoints (same parameter names and semantics across collections), including mapping/aliasing where necessary (e.g., how `source` corresponds to other possible dimensions).

  • med

    Update the capability discovery payload (/v1/capabilities) to document the pagination/filter query params consistently (so integrators can implement once).
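
A shared `limit` + `cursor` / `next_cursor` contract of the kind recommended above could be sketched as follows. The cursor encoding, the `MAX_LIMIT` value, and the response field names are assumptions, and the example expects items that are already sorted by a unique `id`.

```python
import base64
import json
from typing import Any, Dict, Optional, Sequence

MAX_LIMIT = 100  # hypothetical upper bound shared by every list endpoint


def encode_cursor(last_id: str) -> str:
    raw = json.dumps({"after": last_id}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> str:
    return json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))["after"]


def paginate(
    items: Sequence[Dict[str, Any]], limit: int = 20, cursor: Optional[str] = None
) -> Dict[str, Any]:
    """Return one page plus an opaque cursor for the next page (None on the last page)."""
    limit = max(1, min(limit, MAX_LIMIT))  # clamp to the shared page-size bounds
    after = decode_cursor(cursor) if cursor else None
    remaining = [item for item in items if after is None or item["id"] > after]
    page = remaining[:limit]
    has_more = len(remaining) > limit
    next_cursor = encode_cursor(page[-1]["id"]) if has_more else None
    return {"data": page, "next_cursor": next_cursor}
```

Keeping the cursor opaque lets the server change the underlying ordering key later without breaking clients, which an exposed `offset` does not allow.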

Outbound events / webhooks N/A

The codebase implements inbound webhooks (a webhook receiver that accepts POSTs, validates HMAC signatures, rate-limits, deduplicates deliveries, and then triggers internal agent work). However, it does not implement the outbound-event/webhook primitive described in the rubric: there is no subscription store and delivery worker that pushes versioned, HMAC-signed event payloads to integrator-provided callback URLs with retry/backoff and idempotent redelivery.

  • high

    Introduce a true outbound webhook/event subscription model (storage for subscriber callback URLs + signing secrets + event filters), plus a delivery worker that emits events with versioned payloads, HMAC signing, exponential-backoff retries (bounded), idempotent delivery keys, and a redelivery workflow suitable for integrators.

    • gateway/platforms/webhook.py:260-420 — Current behavior is `POST /webhooks/{route_name}` (inbound). It returns 202 Accepted after queuing internal handling, rather than performing outbound event callbacks.
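
The HMAC-signing part of an outbound delivery worker can be sketched as below, together with the check an integrator would run on receipt. The header names `X-Webhook-Timestamp` and `X-Webhook-Signature` and the `timestamp.body` signing layout are assumptions for illustration, not an existing Hermes contract.

```python
import hashlib
import hmac
import json
from typing import Dict, Tuple


def _mac(secret: bytes, timestamp: int, body: bytes) -> str:
    signed = str(timestamp).encode("ascii") + b"." + body
    return "sha256=" + hmac.new(secret, signed, hashlib.sha256).hexdigest()


def sign_event(secret: bytes, event: dict, timestamp: int) -> Tuple[bytes, Dict[str, str]]:
    """Serialize the event and produce the headers a delivery worker would send."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    headers = {
        "X-Webhook-Timestamp": str(timestamp),
        "X-Webhook-Signature": _mac(secret, timestamp, body),
    }
    return body, headers


def verify_event(
    secret: bytes, body: bytes, headers: Dict[str, str], now: int, tolerance: int = 300
) -> bool:
    """Receiver-side check: valid signature and a timestamp within `tolerance` seconds."""
    try:
        timestamp = int(headers["X-Webhook-Timestamp"])
        presented = headers["X-Webhook-Signature"]
    except (KeyError, ValueError):
        return False
    if abs(now - timestamp) > tolerance:
        return False  # stale or replayed delivery
    expected = _mac(secret, timestamp, body)
    return hmac.compare_digest(expected.encode("utf-8"), presented.encode("utf-8"))
```

Including the timestamp in the signed bytes is what allows the receiver to reject replays; retries, subscription storage, and redelivery would sit around this function in a real worker.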
Consistent errors & status codes 33%

A shared, machine-parseable OpenAI-style error envelope exists in gateway/platforms/api_server.py and is reused by several public HTTP error paths (auth failure, request-body size checks, session-key validation, multimodal validation). However, it does not include a correlation/request id in the error response body, and there is no evidence that the centralized contract enforces the primitive’s full status-code semantics (including 409/422/429, with 5xx reserved for true faults).

  • high

    Extend the centralized error envelope (_openai_error) to always include a correlation/request id (e.g., from an incoming header or generated per request) and ensure every error-returning path uses it (including explicit inline error dicts like the invalid API key response).

  • high

    Audit and standardize status-code mapping across all public API-server error paths to meet the primitive’s required semantics (400 malformed, 401/403 auth, 409 idempotency conflicts, 422 semantic errors, 429 throttling, 5xx only for true faults).

  • med

    Add/extend automated tests asserting that every error response includes the correlation id and that error.code/type/status mapping is consistent for each required category (401/403/409/422/429 + representative 5xx fault).
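
One possible shape for the extended envelope is sketched below. It assumes a single helper owns both the status-code mapping and the correlation id; `error_envelope`, `STATUS_BY_CATEGORY`, the category names, and the `request_id` field are illustrative and do not reflect the actual signature of `_openai_error`.

```python
import uuid
from typing import Optional, Tuple

# Assumed category-to-status mapping, following the semantics listed above.
STATUS_BY_CATEGORY = {
    "malformed_request": 400,
    "unauthenticated": 401,
    "forbidden": 403,
    "idempotency_conflict": 409,
    "semantic_error": 422,
    "rate_limited": 429,
    "internal_error": 500,
}


def error_envelope(category: str, message: str,
                   request_id: Optional[str] = None) -> Tuple[int, dict]:
    """Return (status, body) for an OpenAI-style error carrying a correlation id."""
    # Unknown categories are treated as internal faults rather than guessed at.
    code = category if category in STATUS_BY_CATEGORY else "internal_error"
    status = STATUS_BY_CATEGORY[code]
    # Prefer an id propagated from the incoming request; otherwise generate one.
    rid = request_id or f"req_{uuid.uuid4().hex}"
    body = {
        "error": {
            "message": message,
            # Illustrative type values; the real envelope may use others.
            "type": "server_error" if status >= 500 else "invalid_request_error",
            "code": code,
            "request_id": rid,
        }
    }
    return status, body
```

Routing every error path through one function like this makes the recommended tests straightforward: each category can be asserted against its expected status, and the presence of `request_id` can be checked once.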

Sandbox / test mode 0%

No consumer-facing Sandbox / test mode primitive was found. The codebase documents an OpenAI-compatible API server that requires `API_SERVER_KEY` and points at `http://localhost:8642/v1`, but there is no documented sandbox base URL + test credentials + isolated test data for third parties to integrate safely without using production.

  • high

    Add a documented sandbox/test-mode contract for the API server: publish a dedicated sandbox base URL (not localhost), specify an authentication mechanism for test keys (clearly labeled as test-mode), and describe isolation guarantees for sandbox data (e.g., separate response/run/session storage). Update the `api_server.py` docstring or link to a checked-in doc/spec.

  • high

    Introduce and document a sandbox configuration surface in the CLI config layer (or an accompanying config doc): e.g., `HERMES_SANDBOX_BASE_URL`, `HERMES_SANDBOX_API_SERVER_KEY`, and any sandbox-only storage namespaces/DB selection, including how test data is cleaned up.

    • hermes_cli/config.py:1-20 — Current documentation only covers `~/.hermes/config.yaml` and `~/.hermes/.env` without any consumer sandbox/test-mode keys or isolated test-mode instructions.
Extension points / plugins 94%

This codebase contains a well-defined, documented plugin/extension system (hermes_cli/plugins.py) with a stable PluginContext interface (tools, hooks, platform adapters, skills, auxiliary tasks), a PluginManager that discovers/loads plugins from multiple sources (bundled, user, project, and pip entry points), and a concrete gateway platform registry (gateway/platform_registry.py). A host-owned Plugin LLM facade (agent/plugin_llm.py) further supports safe third-party plugin integration.

  • high

    Audit and document a versioning policy for the extension contracts (PluginContext methods and hook names) and how breaking changes are avoided (e.g., VALID_HOOKS evolution rules, manifest schema versioning in plugin.yaml).

    • hermes_cli/plugins.py:1-220 — The extension contract is extensively documented, but this slice does not show an explicit semantic-versioning / compatibility policy for third-party plugin authors.
  • med

    Ensure the gateway platform registration surface is consistently discoverable in docs by linking PluginContext.register_platform and PlatformEntry fields to the referenced developer guide mentioned in the registry comments.

    • gateway/platform_registry.py:1-220 — The registry doc references a developer guide contract, but the code slice provided does not include the end-to-end documentation chain for third-party integrators.

Not applicable to this codebase: Versioning & backward compatibility, Per-tenant rate limiting, Outbound events / webhooks.

Integration Depth

Per-system adapters behind one shared interface with bi-directional sync — not per-customer scripts held together with spreadsheets.

43% 8/10 scored
  • Shared integration abstraction 100%
    3/3 expected sites
  • Per-integration reliability 0%
    0/1 expected sites
  • Sync state & reconciliation 0%
    0/2 expected sites not present
  • Inbound validation & normalization 100%
    2/2 expected sites
  • Per-tenant integration credentials 0%
    0/3 expected sites not present
  • Per-integration observability 0%
    0/5 expected sites not present
  • Connector breadth for the category 100%
    2/2 expected sites
  • Build-vs-buy posture 44%
    2/3 expected sites
Shared integration abstraction 100%

This codebase does implement the “Shared integration abstraction” primitive: gateway platform integrations are structured around a shared BasePlatformAdapter interface, with concrete integrations like WebhookAdapter and MSGraphWebhookAdapter inheriting from it. This indicates an architected integration layer rather than N separate bespoke integrations without a common contract.

  • med

    Extend the verification by sampling additional platform adapters (e.g., TelegramAdapter, SlackAdapter, WhatsAppAdapter) and confirm they consistently implement the shared BasePlatformAdapter contract without bespoke direct coupling to gateway internals.

    • gateway/platforms/base.py:1-20 — Stated design intent that all platform adapters inherit from BasePlatformAdapter; additional adapter reads would confirm consistency across integrations.
Bidirectional sync N/A

No true “Bidirectional sync” primitive (read + write back to external systems with sync state/cursors and reconciliation semantics) was found. This codebase contains message/webhook receivers (ingress) and message senders (egress) but they do not implement an adapter-level sync workflow that continuously reconciles and writes changes back to the external system as part of a bidirectional data sync.

Metadata-driven mappings N/A

I did not find a clear “metadata-driven mappings” integration/config layer in this codebase—i.e., a runtime service that interprets versioned per-tenant field/entity mapping + transform + validation configuration for external-system integrations. The “schema/transform” parts I found (e.g., Gemini schema sanitization and generic tool-schema sanitizers) are backend-compatibility helpers for LLM tool JSON schemas, not per-tenant, metadata-interpreted mappings from external entities to a canonical internal model.

Per-integration reliability 0%

The codebase implements retry-with-backoff for transient delivery failures (notably in `gateway/platforms/base.py::_send_with_retry`). However, the primitive’s required companion mechanisms—per-integration dead-letter/quarantine parking for events that still fail after all retries, plus alerting/observability for those DLQ’d failures—do not appear to be implemented. Therefore, this primitive is only partially present (retries exist, but undeliverable record handling via DLQ is missing).

  • high

    Add a per-integration dead-letter/quarantine mechanism to `_send_with_retry`: when retries are exhausted, write the undeliverable payload/event (with adapter identity, error, attempt count, and correlation/session metadata) into a DLQ (or persistent store) instead of only notifying the user. Include alerting/metrics for DLQ growth and rate of exhausted retries.

    • gateway/platforms/base.py:3180-3345 — Retries are performed in `_send_with_retry`, but the exhaustion path sends a user notice and returns; no DLQ/quarantine write or ops alert is present here.
  • med

    Implement a shared DLQ interface, used by every platform adapter, so that Telegram/Discord/Feishu/etc. all park failures into the same canonical “failed event” model with per-integration labels, enabling consistent ops dashboards and reprocessing workflows.

    • gateway/platforms/base.py:3180-3345 — All platform adapters route delivery through the base adapter retry path, making this the appropriate choke point for adding a DLQ abstraction that can still be labeled per integration.
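
The exhaustion path described above might look like the following. This is a simplified synchronous sketch, and the real `_send_with_retry` may well be asynchronous and structured differently. `FailedEvent`, `InMemoryDLQ`, and this `send_with_retry` signature are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailedEvent:
    """Canonical record for a delivery that exhausted its retries."""
    integration: str
    payload: dict
    error: str
    attempts: int
    failed_at: float = field(default_factory=time.time)


class InMemoryDLQ:
    """Stand-in for a persistent store; a real DLQ would be durable."""

    def __init__(self) -> None:
        self.events: List[FailedEvent] = []

    def park(self, event: FailedEvent) -> None:
        self.events.append(event)

    def depth(self, integration: str) -> int:
        """Per-integration queue depth, suitable for an alert threshold."""
        return sum(1 for e in self.events if e.integration == integration)


def send_with_retry(send: Callable[[dict], None], payload: dict, *,
                    integration: str, dlq: InMemoryDLQ,
                    max_attempts: int = 3, base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try delivery up to max_attempts times; park the payload on exhaustion."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            send(payload)
            return True
        except Exception as exc:  # a real adapter would catch only transient errors
            last_error = repr(exc)
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    # Retries exhausted: record the failure instead of dropping it.
    dlq.park(FailedEvent(integration, payload, last_error, max_attempts))
    return False
```

Because the `integration` label travels with each parked event, one shared store can still answer per-adapter questions such as "how many Telegram deliveries are stuck".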
Sync state & reconciliation 0%

No integration “sync state & reconciliation” primitive was found. The codebase uses inbound webhook delivery idempotency via in-memory TTL caches (and duplicate suppression in MS Graph), but it does not persist cursors/watermarks nor perform any drift detection/reconciliation between external systems and internal state (especially across restarts or missed events).

  • high

    Introduce durable per-integration checkpointing (cursor/watermark) and idempotent upserts for inbound event streams where drift repair matters. Persist progress per route/event type and reconcile by re-fetching or compensating for gaps when the checkpoint is stale or missing.

  • high

    For MS Graph notifications, persist processed receipts/checkpoints and add drift repair logic (e.g., compare expected state from Graph with internal state, then upsert/repair discrepancies). Ensure correctness across restarts.
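
A durable checkpoint could be as small as the sketch below, which assumes SQLite as the backing store. The `CheckpointStore` class, the table layout, and the `watermark` column are hypothetical, and the drift-repair step itself (re-fetching from the external system) is not shown.

```python
import sqlite3
from typing import Optional


class CheckpointStore:
    """Persist one watermark per (integration, stream) so progress survives restarts."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            " integration TEXT NOT NULL,"
            " stream TEXT NOT NULL,"
            " watermark TEXT NOT NULL,"
            " PRIMARY KEY (integration, stream))"
        )
        self.conn.commit()

    def get(self, integration: str, stream: str) -> Optional[str]:
        """Return the last saved watermark, or None if nothing was processed yet."""
        row = self.conn.execute(
            "SELECT watermark FROM checkpoints WHERE integration = ? AND stream = ?",
            (integration, stream),
        ).fetchone()
        return row[0] if row else None

    def advance(self, integration: str, stream: str, watermark: str) -> None:
        """Idempotent write: replaying the same watermark leaves state unchanged."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO checkpoints (integration, stream, watermark)"
                " VALUES (?, ?, ?)",
                (integration, stream, watermark),
            )
```

On startup, a `None` result from `get` signals that the checkpoint is missing and a full reconciliation pass is needed rather than an incremental one.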

Inbound validation & normalization 100%

The primitive is present and implemented well at the external ingestion boundaries for webhook integrations. Both the generic webhook adapter and the Microsoft Graph webhook adapter perform fail-closed authentication/validation, parse and normalize incoming payloads into internal canonical MessageEvent objects, and deduplicate repeated deliveries/receipts before dispatching agent work.

  • high

    Add/confirm a quarantine mechanism for malformed or invalid inbound records (e.g., store invalid payloads + error reason in a dedicated holding area), rather than only returning HTTP errors or incrementing counters.

    • gateway/platforms/webhook.py:650-725 — On invalid signature / parse failure, the handler returns 4xx/5xx responses and skips processing; there is no visible quarantine/holding persistence for bad records.
    • gateway/platforms/msgraph_webhook.py:245-360 — On per-item failures (non-dict, resource not accepted, bad clientState), the adapter increments rejection/authRejected counters and continues; there is no visible quarantine persistence for later inspection.
  • med

    Ensure dedup/idempotency behavior is consistent across deployments by validating whether in-memory dedup caches (_seen_deliveries / _seen_receipts) meet your expected persistence requirements (e.g., restarts, horizontal scaling). If not, back dedup with a shared store.
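
The quarantine recommendation can be illustrated with a minimal sketch. `Quarantine`, `QuarantinedRecord`, `normalize_inbound`, and the required `text` field are all assumptions for the example; the real adapters normalize into `MessageEvent` objects whose shape is not reproduced here.

```python
import json
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuarantinedRecord:
    source: str
    raw: bytes
    reason: str
    received_at: float


class Quarantine:
    """In-memory holding area; a real one would persist and cap retention."""

    def __init__(self) -> None:
        self.records: List[QuarantinedRecord] = []

    def hold(self, source: str, raw: bytes, reason: str) -> None:
        self.records.append(QuarantinedRecord(source, raw, reason, time.time()))


def normalize_inbound(source: str, raw: bytes, quarantine: Quarantine) -> Optional[dict]:
    """Return a normalized event, or None after quarantining a bad payload."""
    try:
        payload = json.loads(raw)
    except ValueError as exc:  # covers JSONDecodeError and UnicodeDecodeError
        quarantine.hold(source, raw, f"invalid JSON: {exc}")
        return None
    if not isinstance(payload, dict) or "text" not in payload:
        quarantine.hold(source, raw, "missing required field 'text'")
        return None
    return {"source": source, "text": str(payload["text"])}
```

The handler can still return its 4xx response to the sender; the difference is that the rejected bytes and the reason remain available for later inspection. Payloads that fail signature checks are untrusted input, so a production holding area would likely need size limits and restricted access.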

Per-tenant integration credentials 0%

I did not find an implementation of per-tenant integration credentials with tenant-isolated secret-manager storage and token refresh/revocation. The code appears to use user/process-local credential stores (e.g., ~/.hermes/auth.json and ~/.hermes/auth/google_oauth.json) and env/credential-pool resolution rather than tenant-scoped secret isolation.

  • high

    Confirm the product model: if “tenant” exists for external integrations, refactor the credential resolution/refresh layer (hermes_cli/auth.py and provider-specific OAuth modules like agent/google_oauth.py) to use tenant-scoped secret-manager entries (e.g., secret per tenant+provider), including refresh-token rotation and per-tenant revocation.

    • hermes_cli/auth.py:1-24 — Auth is persisted to a single local auth.json; this is the main integration credential refresh boundary.
    • agent/google_oauth.py:1-20 — Google tokens are stored in a single local JSON file; provider-specific OAuth refresh is not tenant-scoped.
  • med

    Remove/limit shared-secret sources for integration OAuth (shared env vars and shared local credential pool) on the runtime connector path; instead, plumb tenant_id through to the credential resolver so it always selects tenant-scoped credentials.

    • hermes_cli/auth.py:560-616 — API-key secrets are resolved from env or a shared credential pool fallback—this is not tenant-isolated secret selection.
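
If a tenant concept is introduced, the resolver change could take roughly this form. The sketch assumes a generic secret-manager interface; `SecretBackend`, `DictBackend`, `resolve_credential`, and the `tenants/<tenant>/integrations/<provider>` naming scheme are hypothetical and not part of `hermes_cli/auth.py`.

```python
from typing import Dict, Protocol


class SecretBackend(Protocol):
    """Minimal interface a secret manager client would need to satisfy."""

    def read(self, name: str) -> str: ...


class DictBackend:
    """Test double standing in for a real secret manager."""

    def __init__(self, secrets: Dict[str, str]) -> None:
        self._secrets = secrets

    def read(self, name: str) -> str:
        return self._secrets[name]  # raises KeyError if absent


class MissingTenantCredential(LookupError):
    pass


def resolve_credential(backend: SecretBackend, tenant_id: str, provider: str) -> str:
    """Select the secret for exactly one tenant+provider pair."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    name = f"tenants/{tenant_id}/integrations/{provider}"
    try:
        return backend.read(name)
    except KeyError:
        # Deliberately no fallback to env vars or a shared credential pool.
        raise MissingTenantCredential(name) from None
```

The important property is the missing fallback: a tenant without its own credential gets an error, never another tenant's or the process-wide secret.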
Per-integration observability 0%

I did not find an implementation of Per-integration observability (per connector health/throughput/failure visibility with success/failure rates, latency, and last-sync/last-processed surfaced to ops). There are some basic health counters and health endpoints (e.g., MS Graph webhook health returns accepted/duplicates), but they do not provide the per-integration reliability/latency/last-status telemetry needed to catch broken connectors without customer reports.

  • high

    Add a per-integration metrics/status contract (route_name/platform/provider as the integration key) and implement it across adapters (at minimum: webhook platform routes and MS Graph webhook). Include: total received, success, failure (with error codes), processing latency (p50/p95), and last processed/last success timestamp.

  • high

    Instrument the webhook hot path to record outcome + latency for each configured route/integration (including idempotent duplicates as a separate outcome). Ensure failures in downstream delivery/agent execution are captured and counted (not just logged).

  • med

    Instrument auxiliary-provider call lifecycle (the point where provider HTTP requests are executed) to emit per-provider success/failure rates and latency and expose it via the existing /health/detailed or an internal status endpoint.

    • agent/auxiliary_client.py:3300-3380 — This centralizes provider wrapping/routing; it is a natural place to ensure consistent observability across providers once the actual HTTP call sites are identified/instrumented.
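
A per-integration metrics contract along the lines recommended above might be sketched as follows. `IntegrationMetrics`, `IntegrationStats`, and the outcome labels are hypothetical; a production version would more likely export to an existing metrics system and bound the latency sample instead of keeping every value in memory.

```python
import math
import time
from collections import defaultdict
from typing import Dict, List, Optional


class IntegrationStats:
    def __init__(self) -> None:
        self.outcomes: Dict[str, int] = defaultdict(int)
        self.latencies_ms: List[float] = []  # unbounded here; cap it in real use
        self.last_success_at: Optional[float] = None

    def percentile(self, p: float) -> Optional[float]:
        """Nearest-rank percentile over recorded latencies."""
        if not self.latencies_ms:
            return None
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


class IntegrationMetrics:
    """Keyed by integration, e.g. 'webhook:<route_name>' or 'msgraph'."""

    def __init__(self) -> None:
        self._stats: Dict[str, IntegrationStats] = defaultdict(IntegrationStats)

    def record(self, key: str, outcome: str, latency_ms: float) -> None:
        stats = self._stats[key]
        stats.outcomes[outcome] += 1
        stats.latencies_ms.append(latency_ms)
        if outcome == "success":
            stats.last_success_at = time.time()

    def snapshot(self, key: str) -> dict:
        stats = self._stats[key]
        return {
            "received": sum(stats.outcomes.values()),
            "outcomes": dict(stats.outcomes),
            "p50_ms": stats.percentile(50),
            "p95_ms": stats.percentile(95),
            "last_success_at": stats.last_success_at,
        }
```

Counting duplicates as their own outcome keeps them from inflating either the success or the failure rate, and `last_success_at` is what lets ops notice a connector that has gone quiet.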
Connector breadth for the category 100%

Connector breadth for this codebase exists in the form of an explicit connector catalog/discovery mechanism: a central `PlatformRegistry` for platform adapters plus an explicit built-in outbound-delivery platform list in the generic webhook adapter. This supports auditing which target systems are covered (and where gaps may exist) without relying on spaghetti per-connector wiring.

  • high

    Add/confirm a single “connector coverage” report surface (CLI/UI endpoint or structured output) that enumerates all registered platforms (from `PlatformRegistry`) and highlights missing vertical table-stakes targets for the intended market (e.g., identity/CRM/data warehouse if applicable).

    • gateway/platform_registry.py:1-220 — The registry already contains the metadata needed for an inventory/coverage report; the remaining gap would be a standardized consumer of that inventory for “breadth” auditing.
  • med

    Ensure webhook delivery breadth is consolidated/automated: reduce drift between `_BUILTIN_DELIVER_PLATFORMS` and plugin registrations by deriving the list (where feasible) from `platform_registry` rather than maintaining a static set.

Build-vs-buy posture 44%

This codebase appears to implement its own first-party integration abstractions for external-system connectors (gateway platform adapters and CLI proxy upstream adapters). There is evidence of deliberate “build” posture via shared adapter interfaces, rather than embedding an external iPaaS-style integration platform. I did not find evidence of embedded third-party integration platforms (e.g., n8n/Zapier/Workato/Nango) from the limited platform-name scan (only a Tailwind merge library showed up).

  • high

    Confirm build-vs-buy at the connector level: enumerate distinct gateway platform adapters (e.g., slack/telegram/etc.) and verify they all implement the same BasePlatformAdapter contract rather than drifting into bespoke per-platform code paths without a shared canonical model.

    • gateway/platforms/base.py:1-25 — Base adapter is the shared interface; next step is to inspect multiple concrete adapters to ensure consistency and bounded divergence.
  • med

    Check for any runtime reliance on a third-party embedded integration platform by searching the repo for common iPaaS/vendor libraries and SDKs beyond the initial narrow scan (e.g., workato/zapier/n8n/nango/tray/app integrations libraries). If none, document that connector coverage is owned and bounded by adapter contracts.

    • gateway/platforms/webhook.py:1-70 — Webhook adapter architecture is a first-party integration boundary; verifying no external iPaaS SDK usage here would strengthen the posture claim.

Not applicable to this codebase: Bidirectional sync, Metadata-driven mappings.

Deployability

CI/CD as code, infrastructure as code, per-environment isolation, and a one-command local boot.

71% 11/11 scored
  • Reproducible one-command build 0%
    0/3 expected sites not present
  • Automated CI pipeline 80%
    4/5 expected sites
  • Automated deployment (CD) 267%
    3/1 expected sites
  • Infrastructure as code 0%
    0/1 expected sites not present
  • Environment isolation 0%
    0/3 expected sites not present
  • Local/production parity 89%
    3/3 expected sites
  • Config & secrets externalized per env 83%
    2/2 expected sites
  • Decouple deploy from release 0%
    0/2 expected sites not present
  • Reversibility / rollback 67%
    2/3 expected sites
  • Delivery cadence (DORA proxy) 100%
    2/2 expected sites
  • Deploy-tooling ownership 100%
    2/2 expected sites
Reproducible one-command build 0%

No clear implementation of a deterministic “one-command build” primitive was found. While the repo provides a one-command curl|bash installer and a contributor setup script, the contributor/bootstrap dependency path includes a non-deterministic fallback (a failed lockfile sync or a missing lockfile triggers non-hash-verified resolution). The README’s developer instructions also describe a multi-step manual process, so the specific acquisition gate (clean clone + one command + determinism via pinned dependencies) is not met.

  • high

    Make the clean-clone build+boot path explicitly one command in the root docs (e.g., `./setup-hermes.sh --no-wizard --boot` or similar) and ensure it performs no non-deterministic dependency fallback; fail hard if `uv.lock` cannot be honored.

    • README.md:164-178 — README documents one-command install and a separate multi-step contributor path; the primitive needs one command that covers deterministic build+boot from a clean clone.
    • setup-hermes.sh:200-260 — The script explicitly falls back to non-hash-verified installs when lockfile syncing fails or the lockfile is absent, breaking determinism.
  • med

    Add a lockfile integrity requirement to the bootstrap: detect `uv.lock` and use `uv sync --locked` only, exiting non-zero if it can’t be applied rather than falling back.

    • setup-hermes.sh:200-260 — Fallback behavior explicitly allows unlocked/transitive re-resolution; for reproducibility, replace fallback with a hard failure.
  • low

    Align the “one-command” story by documenting the exact command that a user should run after cloning (not only curl|bash from main), including what “booted” means (e.g., runs `hermes` or starts the gateway).

    • README.md:164-178 — Contributor instructions describe multiple steps and a manual test script; document a single local command that results in a running instance.
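
The fail-hard behavior recommended above is shown here in Python for consistency with the rest of the codebase, although the actual bootstrap is a shell script. `locked_sync_command`, `bootstrap`, and `LockfileError` are hypothetical helpers; the only real CLI call is `uv sync --locked`, which exits with an error when the lockfile is missing or out of date.

```python
import subprocess
from pathlib import Path
from typing import List


class LockfileError(RuntimeError):
    pass


def locked_sync_command(repo_root: Path) -> List[str]:
    """Return the only permitted install command, or fail if uv.lock is absent."""
    if not (repo_root / "uv.lock").is_file():
        raise LockfileError(
            f"uv.lock not found in {repo_root}; refusing to resolve unlocked"
        )
    # --locked makes uv fail instead of re-resolving dependencies.
    return ["uv", "sync", "--locked"]


def bootstrap(repo_root: Path) -> None:
    """Install dependencies; any failure propagates as a non-zero exit."""
    # check=True raises CalledProcessError, so there is no silent fallback path.
    subprocess.run(locked_sync_command(repo_root), cwd=repo_root, check=True)
```

The design point is the absence of an `except` branch around the install: a stale or missing lockfile stops the bootstrap rather than degrading to an unpinned resolution.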
Automated CI pipeline 80%

This codebase has a real automated CI pipeline. A dedicated `.github/workflows/tests.yml` workflow runs on every push to main and every PR targeting main, executing both unit tests and e2e pytest suites automatically. Additionally, `.github/workflows/lint.yml` includes a blocking `ruff check .` job intended to enforce code quality and gate merges.

  • med

    Ensure branch protection / required status checks include the blocking test and lint jobs (e.g., require `Tests/test` and `Lint/ruff-blocking`) so merges are fully gated by CI outcomes.

    • .github/workflows/lint.yml:95-144 — The blocking gate exists in CI code (`ruff check .`), but merge-gating correctness ultimately depends on repository branch protection requiring these checks.
Automated deployment (CD) 267%

Automated deployment (CD) exists via `.github/workflows/deploy-site.yml`: it deploys the website/docs to GitHub Pages and triggers a Vercel deploy on published releases. The CD implementation is solid for the site production path, but this audit only found CD wiring for the site/docs deployment (not a broader application-to-production deploy pipeline).

  • high

    If the intent is “full app CD to production”, add (or audit for) a separate, versioned deploy workflow that rolls the running service forward (e.g., to Kubernetes/VM/PaaS) and wire it to the same release published event, including rollback/redeploy steps. Current CD evidence strongly targets the website/docs production path.

  • med

    Ensure deploy workflow inputs/conditions match the desired release governance (e.g., require protected environments, enforce concurrency, and document the release-to-prod mapping). The workflow already uses `release: published` and `environment: github-pages`, which is good; extend the same rigor to any additional production targets.

Infrastructure as code 0%

No Infrastructure-as-Code definitions (IaC/PaaS descriptors) were found in the repository tree (no Terraform/CloudFormation/Pulumi/Helm/k8s/serverless/etc.). The production deployment path appears to be handled via GitHub Actions calling console-configured integrations (deploy-pages for GitHub Pages and a Vercel deploy hook via a secret) without accompanying versioned IaC that would make the infra reproducible end-to-end.

  • high

    Add versioned IaC for the production deployment targets used here (at minimum: GitHub Pages + Vercel project/webhook configuration), so the same environments can be recreated from a clean checkout without relying on pre-existing console setup. Concretely: introduce Terraform/Pulumi (or equivalent) that declares the Pages site configuration, domains if applicable, and the Vercel linkage needed for releases.

    • .github/workflows/deploy-site.yml:1-106 — Production deploys are executed from GitHub Actions using a Vercel deploy hook and deploy-pages actions, but there is no IaC in-repo to reproduce those targets.
  • med

    Connect the IaC to CI/CD by adding a pipeline job that runs `plan` on PRs and `apply` on protected releases/tags, ensuring infra changes are reviewable and drift is detectable.

Environment isolation 0%

No evidence of true environment isolation (dev/staging/prod with isolated data/credentials/accounts) was found in code. The codebase appears to support only a single active environment configuration via '~/.hermes/.env' and an optional project env path, plus Bitwarden secret injection that is not clearly stage-scoped.

  • high

    Implement explicit stage selection and stage-scoped env loading in hermes_cli/env_loader.py (e.g. HERMES_ENV={dev|staging|prod}) and support separate env files and/or directories per stage ('.env.dev'/'config.dev.yaml', etc.). Ensure only the selected stage’s values are loaded.

    • hermes_cli/env_loader.py:220-260 — Current logic loads '~/.hermes/.env' (and optionally a single project env file) rather than distinct dev/staging/prod configurations.
  • high

    Scope external secrets by stage: require separate Bitwarden project IDs/access tokens (or separate secret namespaces) for dev vs staging vs prod, and wire the stage selection into the Bitwarden config lookup.

  • med

    Replace/extend .env.example with stage-specific templates (or a documented mechanism to generate them) so production credentials are not accidentally reused in non-prod.

    • .env.example:1-6 — The template instructs copying to a single '.env' and does not provide stage-specific examples.
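
Stage selection could be sketched as below. `HERMES_ENV` and the `.env.<stage>` naming come from the recommendation above; `resolve_stage`, `env_file_for`, and the choice of `dev` as the default when the variable is unset are assumptions, not existing behavior in `hermes_cli/env_loader.py`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

VALID_STAGES = ("dev", "staging", "prod")


def resolve_stage(environ: Optional[Mapping[str, str]] = None) -> str:
    """Read HERMES_ENV and reject unknown stages instead of guessing."""
    environ = os.environ if environ is None else environ
    # Assumed default: an unset variable means local development.
    stage = environ.get("HERMES_ENV", "dev")
    if stage not in VALID_STAGES:
        raise ValueError(f"HERMES_ENV must be one of {VALID_STAGES}, got {stage!r}")
    return stage


def env_file_for(stage: str, hermes_home: Path) -> Path:
    """Map a stage to its own env file, e.g. <hermes_home>/.env.staging."""
    if stage not in VALID_STAGES:
        raise ValueError(f"unknown stage {stage!r}")
    return hermes_home / f".env.{stage}"
```

Only the file returned by `env_file_for` would be loaded, so a staging process never sees production values even when both files exist on the same machine.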
Local/production parity 89%

This codebase implements local/production parity via a container-first approach: docker-compose runs the same image built from the repo’s Dockerfile, and the Dockerfile bakes a deterministic runtime (pinned bases + frozen dependency installs + built assets). Additionally, a Nix flake/devShell provides a reproducible local developer environment using lockfile-driven tooling (uv), further reducing drift.

  • high

    Ensure the documentation explicitly recommends one primary parity path for contributors (preferably `docker compose up` using the repo’s Dockerfile) and provides a short ‘local matches prod’ checklist (ports, required env vars, and how to persist ~/.hermes).

    • docker-compose.yml:1-77 — The compose file already documents intended usage/security and runs production-style commands; adding explicit onboarding guidance would strengthen adoption of the parity mechanism.
  • med

    Align the Nix devShell Python dependency strategy even more directly with the Dockerfile (e.g., ensure the devShell uses the same lockfiles/uv flags that the Docker build uses, not just uv + hooks).

    • nix/devShell.nix:1-42 — Dev shell uses uv and hooks, but parity strength would increase if it is confirmed/extended to mirror the Dockerfile’s exact uv sync behavior and extras.
Config & secrets externalized per env 83%

This codebase clearly externalizes configuration and secrets per environment through a dedicated config layer: it loads secrets/config values from os.environ and/or ~/.hermes/.env (get_env_value / reload_env / save_env_value) and provides an .env.example template for operators. I did not find evidence of env-specific production endpoints/keys being hardcoded in the audited config/bootstrap layer.

  • high

    Audit provider/runtime modules for hardcoded environment-specific endpoints/base URLs or credentials. Specifically, verify that any provider client construction (e.g., OpenAI/Anthropic/OpenRouter/Gemini/etc.) pulls API keys/base URLs from get_env_value()/os.environ or config.yaml rather than embedding production URLs.

    • hermes_cli/config.py:5500-5595 — The presence of get_env_value() suggests intended usage; remaining risk is whether other modules bypass it and hardcode endpoints/keys.
  • med

    Confirm there is no separate “fallback” path that hardcodes production defaults for secrets/endpoints when env variables are missing (e.g., base_url defaults should be provider-safe, while secret-like values must never be literals).

    • .env.example:1-200 — .env.example documents base URLs/overrides as operator configuration; ensure production-like values are not duplicated as literals elsewhere.
Decouple deploy from release 0%

No implementation of a decouple-deploy-from-release primitive (feature-flag/rollout gating that separates deployment from activation) was found in the audited on-graph code. The only flag-like logic observed is runtime UI mode switching (embedded dashboard chat), not a progressive release/activation mechanism.

  • high

    Introduce a production-grade rollout/feature-flag system (server-controlled) and wire it into branching logic for production-visible behavior changes (routes, new UI modules, and agent/tool execution paths). Ensure flags support percentage/canary rollout and are not permanent; keep them governed/configured in code.

  • med

    Replace or complement ad-hoc runtime toggles (like `window.__HERMES_DASHBOARD_*`) with a shared rollout configuration layer that can be switched server-side (and supports canary/percentage), so deployments don’t automatically activate new functionality for all users.
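
Percentage rollout is commonly implemented with deterministic hashing, as in this minimal sketch. `rollout_bucket` and `is_enabled` are hypothetical names, and a server-controlled system would fetch the percentage from configuration rather than take it as an argument.

```python
import hashlib


def rollout_bucket(flag: str, subject_id: str) -> int:
    """Map (flag, subject) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{subject_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, subject_id: str, percentage: int) -> bool:
    """True when the subject falls inside the flag's rollout percentage."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return rollout_bucket(flag, subject_id) < percentage
```

Because the bucket depends only on the flag name and the subject id, a user who is enabled at 10% stays enabled as the percentage grows, and including the flag name in the hash prevents the same users from always landing in every early cohort.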

Reversibility / rollback 67%

Reversibility/rollback is implemented for curator-driven changes via an explicit snapshot/restore system in `agent/curator_backup.py`, exposed through the `hermes curator rollback` CLI in `hermes_cli/curator.py`. Rollback is designed to be safe and undoable by taking a pre-rollback snapshot, performing defensive extraction, and reconciling cron job skill references while preserving unrelated live cron scheduling state. Codex runtime plugin migration writes managed sections but does not show a corresponding rollback/undo mechanism in the audited code slices.

  • high

    Add a rollback/undo path for `hermes codex-runtime migrate` that preserves user config safety (e.g., snapshot `~/.codex/config.toml` managed section before write, and provide `hermes codex-runtime rollback` to restore the previous managed block).

  • med

    For Codex migration rollback readiness, add automated tests that validate: (1) rollback restores a previously existing user-managed section, (2) rollback does not delete unrelated user TOML content outside the managed block, and (3) rollback works after partial/corrupted writes (defensive behavior).

Delivery cadence (DORA proxy) 100%

Delivery cadence appears present: git history shows frequent main commits/merges and regular release tagging. On-graph, the repo also automates delivery artifacts and site updates via GitHub Actions (docker image builds/publishes on main pushes for relevant paths and on releases; site deploys on release publish and on main pushes affecting website/skills). No click-ops deploy wiring was found in the audited workflows.

  • high

    Ensure these delivery workflows are also triggered for broader main changes where appropriate (review whether the path filters are too restrictive), so that cadence remains small-batch across more PR types.

    • .github/workflows/docker-publish.yml:1-35 — The docker publish pipeline is gated by `paths` filters under the main-branch push trigger; validate this coverage matches the team’s definition of “production-delivered” changes.
    • .github/workflows/deploy-site.yml:1-36 — The site deploy trigger is gated to `website/**` and `skills/**` paths; confirm that other production-affecting content changes also flow into this or another CD workflow.
Deploy-tooling ownership 100%

Deploy/infra tooling does exist in versioned CI workflows (not click-ops): production-adjacent delivery is implemented in .github/workflows/deploy-site.yml and the release gate CI is in .github/workflows/tests.yml. Off-graph git authorship evidence indicates the deploy/infra tooling is not dominated by a single author (42 authors; top author ~0.178), so the single-engineer CI/CD time-bomb risk is mitigated.

  • low

    Maintain shared ownership by continuing to require reviews on workflow changes (branch protection) and encouraging multiple contributors to touch CI/deploy workflows, especially deploy-site.yml.

T3 Exit Cleanliness

Engineering Org Resilience

No single-author critical paths: git-blame concentration, CODEOWNERS coverage, and reviewer diversity across the codebase.

64% 7/10 scored
  • Critical-path bus factor 100%
    3/3 expected sites
  • Ownership clarity 0%
    0/1 expected sites not present
  • Documentation density ("why") 89%
    3/3 expected sites
  • Operational runbooks 0%
    0/4 expected sites not present
  • Onboarding reproducibility 100%
    4/4 expected sites
  • Tests as executable knowledge 92%
    4/4 expected sites
  • Decision history legibility 67%
    4/4 expected sites
Critical-path bus factor 100%

For the critical-path areas (apps/agent/gateway/plugins/skills/tools/hermes_cli), git-history bus-factor signals show knowledge is distributed: each critical directory has many distinct authors and the top author does not dominate (i.e., no bus-factor-1 gravity well). I verified key critical modules (Kanban DB coordination, Kanban CLI surface, and gateway runtime status helpers) as concrete representative sites, and they align with the distributed bus-factor picture.

  • med

    Add/strengthen co-ownership durability artifacts for the most operationally critical modules (at minimum: an ownership manifest and a short gateway-status/Kanban-DB runbook covering failure modes and recovery steps). This reduces reliance on implicit knowledge even if history remains healthy.

    • gateway/status.py:1-220 — Gateway operational gating helpers live here (PID detection, lock/runtime status). A runbook/ownership record would make the failure-mode recovery process explicit.
    • hermes_cli/kanban_db.py:1-80 — Kanban DB implements concurrency strategy (WAL/BEGIN IMMEDIATE/CAS) and shared coordination semantics. A short recovery-oriented guide would complement distributed code authorship.
Single-author hotspots N/A

The repository does not exhibit the single-author hotspot anti-pattern: the 12-month high-churn hotspots analysis returned no danger-zone files (no file with both high commit frequency and only one/two lifetime authors). Therefore, there are no concrete hotspot sites to record as found (and no should-be sites because nothing currently matches the primitive’s threat condition).

  • low

    No immediate action required for this primitive. Continue monitoring hotspots periodically (e.g., quarterly) to catch future ownership drift in high-churn areas.

    • hotspots mode returned `danger_files: []`.
Review diversity N/A

This repo shows evidence of a PR-based integration process (non-zero PR-referenced share) and multiple human integrators (distinct_mergers_human=15), so review context is present and not strictly centralized. However, pr_referenced_share is relatively low (0.295), suggesting a substantial portion of work still lands outside the PR review path; additionally, reviewed-by trailer signals are very low (1), and there is significant bot involvement in merges (top_merger GitHub bot). Overall: review diversity exists, but is not consistently strong.

  • high

    Increase the share of core-behavior changes that land through PRs: enforce branch protection and required PRs for mainline (or for specific paths) so that pr_referenced_share rises, ideally until PRs account for the majority of merges. This is the most direct lever for spreading review context.

    • GIT_HISTORY:N/A — pr_referenced_share=0.295 indicates PRs are not the dominant merge path.
  • high

    Reduce single-path “gravity well” risk by ensuring merges are not effectively mediated by bots/single integrators for critical areas: require a human maintainer review before merge and ensure merges are performed by a rotating set of maintainers for core modules.

    • GIT_HISTORY:N/A — Top mergers show GitHub bot accounts for 604 merges; effective human review-diversity can be reduced even when distinct_mergers_human is high.
  • med

    Standardize PR collaboration signals: encourage use of reviewed-by/co-authored-by trailers (or equivalent team conventions) so the collaboration/review process is reflected in commit/merge metadata and is easier to audit over time.

    • GIT_HISTORY:N/A — reviewed_by_trailers=1 is very low, suggesting review attribution/tracking may not be consistently captured.
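
As a rough illustration of how a PR-referenced share like the 0.295 figure above could be tracked over time, the sketch below counts commit subjects that carry a pull-request reference. The detection rule (a `(#123)` suffix or a `Merge pull request #123` prefix) and the function name `pr_referenced_share` are assumptions made for illustration; the audit's git-history tooling may define the metric differently.

```python
import re

# Assumed convention: squash merges end with "(#123)" and merge commits
# start with "Merge pull request #123". Adjust to the repo's real practice.
PR_REF = re.compile(r"\(#\d+\)\s*$|^Merge pull request #\d+")


def pr_referenced_share(subjects):
    """Return the fraction of commit subjects that reference a pull request."""
    subjects = list(subjects)
    if not subjects:
        return 0.0
    referenced = sum(1 for s in subjects if PR_REF.search(s))
    return referenced / len(subjects)


# Made-up commit subjects, for illustration only.
sample = [
    "fix(gateway): handle stale PID file (#412)",
    "Merge pull request #398 from example/branch",
    "tweak logging",
    "wip",
]
share = pr_referenced_share(sample)
print(f"{share:.2f}")
```

The subjects could be collected with `git log --format=%s` on the mainline branch and the result reported per period, so that a drop in the share is visible before it becomes the norm.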
Ownership clarity 0%

Ownership clarity (an explicit ownership manifest covering critical paths) is absent. The repository’s org-knowledge artifacts include docs/onboarding but no ownership/CODEOWNERS/OWNERS manifest category, and the on-repo development guide does not define a “who owns what” mapping for the critical subsystems it enumerates.

  • high

    Add an ownership manifest for the enumerated critical subsystems in AGENTS.md (at minimum: agent/, hermes_cli/, tools/, gateway/, plugins/, ui-tui/, cron/, and tests/). Include 2+ human owners per area (not bots), and keep it current.

    • AGENTS.md:1-120 — AGENTS.md enumerates the critical subsystem boundaries that should have explicit owners. This is the natural anchor point for creating an ownership manifest aligned to real change areas.
  • med

    Cross-check the named owners against git history for each critical subsystem and adjust owner lists until no single person is a near-single-author gravity well for that area.

    • AGENTS.md:1-120 — The critical areas listed here should be used as the scope for the history cross-check; the guide defines the canonical subsystem set.
  • low

    Add a short “How to use owners” section to the onboarding/docs so contributors know where to look and how to request ownership changes (without blaming individuals).

    • AGENTS.md:1-120 — Onboarding guidance is already present; adding a small owners-lookup instruction would make the manifest operational rather than ceremonial.
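
The history cross-check recommended above can be sketched as a small function that compares a proposed owner list with per-area commit authorship. The input shapes, the function name `ownership_gaps`, and the 0.6 dominance threshold are all hypothetical; the audit does not prescribe them.

```python
from collections import Counter


def ownership_gaps(owners_by_area, commits_by_area, max_top_share=0.6):
    """Flag areas with fewer than two owners or a dominant single author.

    owners_by_area:  {"gateway/": ["alice", "bob"], ...}  (hypothetical manifest)
    commits_by_area: {"gateway/": ["alice", "alice", "carol"], ...}
                     with one author entry per commit touching the area
    """
    gaps = {}
    for area, owners in owners_by_area.items():
        problems = []
        if len(set(owners)) < 2:
            problems.append("fewer than 2 owners")
        authors = Counter(commits_by_area.get(area, []))
        total = sum(authors.values())
        if total:
            top_author, top_count = authors.most_common(1)[0]
            if top_count / total > max_top_share:
                problems.append(f"{top_author} authored {top_count}/{total} commits")
        if problems:
            gaps[area] = problems
    return gaps
```

An empty result means every listed area has at least two owners and no author above the chosen share. The threshold is a judgment call and should be tuned to the team's size.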
Retained vs. departed knowledge N/A

This primitive is not implemented/represented as an explicit retention/transfer mechanism anywhere in the codebase. While git-history signals indicate very low overall “departed authorship share” (0.002) and only one clearly departed email in recency mode, there is no in-repo ownership/operational documentation structure to confirm that critical-path context is retained by current staff (e.g., no ownership manifest / no runbooks / no ADRs).

  • high

    Add an ownership manifest (e.g., CODEOWNERS or an OWNERS file) for critical areas (agent runner, gateway, CLI, core adapters). Ensure owners are the same people who actively maintain these modules (not just historical authors).

    • CONTRIBUTING.md:1-40 — Current docs focus on contribution priorities and setup, not on durable ownership/retention for critical components—so it would need to be complemented by an explicit ownership manifest.
  • high

    Create operational runbooks for each critical runtime component (gateway runner, CLI entry points, cron/schedulers, and provider integrations). These should include restart/diagnosis steps and “what not to change” guidance to prevent reliance on an individual’s memory.

    • CONTRIBUTING.md:1-120 — No operational runbook structure is described in contribution/onboarding materials, which is where knowledge-retention expectations typically get anchored.
  • med

    Add decision records (ADRs) for major architectural choices that affect critical workflows (tool calling loop, gateway lifecycle, memory provider architecture boundaries).

    • CONTRIBUTING.md:1-120 — Contribution guidance does not reference ADRs/decision history as a required artifact for architecture-critical changes.
Documentation density ("why") 89%

This codebase has durable architecture “why” documentation in the docs and website docs: the multi-gateway kanban deployment guide explains the operational/concurrency rationale; the Docker network egress guide explains the threat model and architectural boundary; and the ACP internals guide documents lifecycle and key bridging/intent-heavy decisions. ADR/runbook/ownership artifacts are absent per org artifacts scan, but this audit is limited to the presence/quality of “why” documentation where it is required.

  • high

    Add missing decision/operational durability artifacts: introduce an ADR process (e.g., docs/adr) for architecture decisions with explicit rationale, and add runbooks for critical services so the operational “why” doesn’t live only in heads.

    • N/A (git_org_signals artifacts scan):N/A — git_org_signals artifacts mode reports three absent categories: adr (0), runbook (0), and ownership (0). Although this cannot be verified through code_read line citations, it indicates that durable rationale artifacts are missing for key maintenance workflows.
  • med

    For each critical integration surface (e.g., adapter/event bridge layers like ACP), ensure the docs consistently include: (1) threat/intent boundary, (2) why design constraints exist, and (3) what invariants must not be broken. Use ACP Internals as the model and replicate its structure.

Operational runbooks 0%

Operational runbooks are not present anywhere in the repository’s tracked org-doc artifacts: the `runbook` category is absent. While the codebase has multiple operationally critical entrypoints (gateway runner, cron scheduler, agent runner, and operator diagnostics), there are no corresponding written procedures for deploy/incident/recovery that would mitigate knowledge concentration and reduce reliance on a single “just knows” operator.

  • high

    Create runbooks for each critical operational service/entrypoint: (1) gateway/run.py, (2) cron/scheduler.py, (3) run_agent.py. Each runbook should include: deploy checklist, common incident playbooks (symptom → checks → mitigation), and recovery steps (restart/rollback guidance, what state is affected, verification steps).

    • gateway/run.py:1-25 — Gateway runner is a long-running operational daemon; requires deploy/incident/recovery runbook coverage.
    • cron/scheduler.py:1-15 — Cron scheduler is a continuously running job executor with locking; requires operational playbooks.
    • run_agent.py:1-25 — Agent runner is the core execution engine; needs operational procedures for failure recovery.
  • med

    Add an operator runbook for hermes_cli/doctor.py that documents when to run it during incidents/setup failures, how to interpret the diagnostic output, and the exact follow-up actions (including configuration/environment checks).

    • hermes_cli/doctor.py:1-15 — Doctor command is an operational diagnostic entrypoint; should be documented as part of recovery workflows.
  • high

    Add an ownership manifest (CODEOWNERS/ownership doc) for these operational entrypoints and ensure the runbooks list at least 2 co-owners per critical service to reduce the gravity-well risk.

    • N/A (org-doc artifacts index):N/A — Artifacts scan also shows `ownership` and `adr` categories are absent; adding ownership complements runbooks to mitigate knowledge concentration.
Onboarding reproducibility 100%

Onboarding reproducibility is implemented well: the repo has multiple written paths (public Quickstart/Installation docs plus CONTRIBUTING) that describe clean-clone-to-productive steps, including a one-command install bootstrap. These docs are reinforced by the actual one-command installer script. No evidence was found of onboarding being purely tribal for the core “get running and verify chat” workflow.

  • high

    Audit whether the docs’ “one-line installer” and the subsequent verification steps are fully deterministic for all supported targets (Linux/macOS/WSL/Termux/Windows) by running the exact documented commands on a clean machine profile; capture any missing flags/required env vars into the docs as a numbered checklist.

  • med

    Add a single “Developer fast path” section to the onboarding docs (separate from CONTRIBUTING) that references `scripts/run_tests.sh` and `hermes doctor` as the two primary verification gates after setup, to reduce fragmentation across doc locations.

    • CONTRIBUTING.md:1-120 — CONTRIBUTING already contains the needed commands (`hermes doctor`, `hermes chat`, and `scripts/run_tests.sh`), but consolidating into a primary onboarding entry could improve discoverability and reduce time-to-productivity.
Tests as executable knowledge 92%

The primitive is clearly present: the repository contains extensive, behavior-focused test suites (especially under tests/run_agent, tests/gateway, and tests/ for agent classifier logic). Tests document intended behavior with concrete assertions, mock external dependencies, and cover critical correctness properties like recovery classification mechanics, API/store semantics, and TUI gateway context/serialization/privacy behavior.

  • high

    Add/expand executable tests that directly pin AIAgent’s end-to-end conversation loop invariants (e.g., session transitions and retry/rotation outcomes) beyond helper/heuristic checks, so refactors of the orchestration surface are protected by intent captured in tests.

    • run_agent.py:250-520 — AIAgent is a central orchestration surface; ensure tests cover the core loop/session transition behaviors, not only helper heuristics.
  • med

    Strengthen classifier tests to cover each disambiguation/branching path (especially 402 billing vs rate-limit and other ambiguous pattern families) as executable examples, not just extraction helpers.

  • low

    For each gateway platform test that validates normalization/connectivity (e.g., Feishu), add one or two assertions that link normalized output to downstream adapter expectations (how Hermes consumes the normalized payload), to make integration tests more behavior-end-to-end.

    • tests/gateway/test_feishu.py:1-220 — Current Feishu tests validate normalization/config and some adapter setup behavior; add downstream-consumption assertions for stronger executable knowledge.
Decision history legibility 67%

Overall, commit history appears decision-legible (high explanatory body share; no evidence of a wall of WIP/fix commits without bodies). However, there are no ADRs, runbooks, or ownership manifests in the repo’s tracked org-document artifacts, so durable decision records are missing. The code compensates with strong inline rationale (docstrings/comments) around key ACP integration decisions, so the “why” is largely recoverable from source, but not from separate decision artifacts.

  • high

    Add ADRs for the main cross-protocol/integration decisions (at minimum: Hermes todo→ACP plan mapping, tool-progress callback concurrency/id tracking, ACP approval semantics mapping, and logging/noise suppression policy). Ensure ADRs reference the exact functions/modules that implement the decisions.

  • med

    Create a lightweight “decision record” checklist for PRs touching integration surfaces (ACP adapter, permissions, tool progress callbacks) requiring either (a) a referenced ADR ID or (b) a commit body paragraph summarizing the decision and tradeoffs.

    • acp_adapter/events.py:64-157 — Tool-progress callback behavior is non-obvious (concurrency/id tracking); easy to regress without a decision summary.
  • low

    For ops/logging decisions (like benign probe traceback suppression), add a short ADR or “operational decision note” explaining acceptance criteria (what to suppress vs not) to reduce future guesswork.

    • acp_adapter/entry.py:14-55 — Benign-probe filter rationale is present inline; externalizing it improves durability across refactors/rewrites.

Not applicable to this codebase: Single-author hotspots, Review diversity, Retained vs. departed knowledge.

IP & OSS License Hygiene

An SBOM in CI, no AGPL/GPLv3 in the dependency tree, CVEs triaged by severity, and no outside-contributor commits without IP assignment.

33% 9/12 scored
  • License compliance 0%
    0/3 expected sites not present
  • Known-vulnerability scan 0%
    0/5 expected sites not present
  • Known-exploited CVEs 0%
    0/4 expected sites
  • Dependency freshness 67%
    3/4 expected sites
  • Upstream maintenance 0%
    0/1 expected sites
  • Remediation velocity 100%
    1/1 expected sites
  • Supply-chain integrity 67%
    2/3 expected sites
  • Dependency-confusion resistance 67%
    3/4 expected sites
  • IP ownership / provenance 0%
    0/1 expected sites not present
Software bill of materials N/A

SBOM hygiene is not implemented as a concrete, verifiable primitive in this codebase: no SBOM-generation tooling or CI/release wiring that references syft, CycloneDX, or SPDX-style outputs was found. The repository does have committed dependency lockfiles (package-lock.json, uv.lock) and a Dependabot configuration for GitHub Actions, but there is no evidence that an SBOM is generated automatically and kept current as part of release or CI.

  • high

    Add an SBOM generation job to the main CI/release workflow(s) that runs after dependencies are installed and before packaging/release. Use a concrete tool (e.g., Syft for Linux/CLI, or CycloneDX/SPDX generators) and emit an artifact (e.g., sbom.spdx.json or sbom.cdx.json). Ensure it includes transitive dependencies and is consistent across the repo’s ecosystems (npm + uv/PyPI).

    • package-lock.json:1-20 — Pinned npm dependencies exist, so SBOM generation in CI is feasible and should be wired to the lockfile.
    • uv.lock:1-20 — Pinned Python dependencies exist, so SBOM generation in CI is feasible and should be wired to the lockfile.
  • high

    Publish the SBOM as a build/release artifact and (optionally) attach it to releases. Also add a CI check that fails if the SBOM generation step cannot run (or produces an empty output).

    • package-lock.json:1-20 — Lockfile presence indicates the expected source of truth for SBOM content; the missing piece is the release/CI emission + verification.
  • med

    Add a periodic SBOM comparison/accuracy check: generate SBOM in CI and ensure it matches (or is a strict subset/superset of) the dependency inventory derived from the lockfiles (ground truth).

    • uv.lock:1-20 — Because uv.lock is committed, the repo can compute an authoritative resolved dependency inventory; SBOM accuracy checks should be tied to it.
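
The SBOM accuracy check described above could look roughly like the following for the npm side. This is a sketch under stated assumptions: the lockfile is npm's v2/v3 format (a top-level `packages` map), the SBOM is CycloneDX JSON (a `components` array), and scoped package names are split into `group` and `name`, which not every generator does. The function names are hypothetical.

```python
def lockfile_inventory(lock):
    """Collect (name, version) pairs from a parsed npm package-lock.json (v2/v3)."""
    inventory = set()
    for path, meta in lock.get("packages", {}).items():
        if "node_modules/" not in path or meta.get("link"):
            continue  # skip the root project, workspace members, and symlinks
        # Nested installs look like "node_modules/a/node_modules/b".
        name = meta.get("name") or path.rsplit("node_modules/", 1)[-1]
        inventory.add((name, meta.get("version", "")))
    return inventory


def sbom_inventory(sbom):
    """Collect (name, version) pairs from a parsed CycloneDX JSON document."""
    inventory = set()
    for comp in sbom.get("components", []):
        name = comp["name"]
        if comp.get("group"):  # assumed: npm scope stored in "group"
            name = f"{comp['group']}/{name}"
        inventory.add((name, comp.get("version", "")))
    return inventory


def sbom_drift(lock, sbom):
    """Return (missing_from_sbom, extra_in_sbom) as sorted lists."""
    locked, listed = lockfile_inventory(lock), sbom_inventory(sbom)
    return sorted(locked - listed), sorted(listed - locked)
```

A CI job could load both files with `json.load` and fail when either list is non-empty. The same comparison would need a separate reader for uv.lock, which is not shown here.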
License compliance 0%

I did not find an explicit, code-enforced “license compliance” primitive for transitive dependency licensing/NOTICE obligations (e.g., SBOM + license scanning + gating on strong/network-copyleft or unknown-tier licenses; and collection/packaging of dependency NOTICE/licenses). The repo does have a dependency-update mechanism (Dependabot for GitHub Actions only) and pinned dependency lockfiles (uv.lock), but those do not constitute license compliance enforcement by themselves.

  • high

    Add CI automation that generates an SBOM for *all* included lockfiles (at least uv.lock + package-lock.json variants) and runs a transitive license scan; fail the build (or require manual legal approval) if any strong-copyleft or network-copyleft (AGPL/SSPL) dependency is present, or if licenses are unresolved/unknown.

    • .github/dependabot.yml:1-45 — Current automation is scoped to GitHub Actions only and explicitly not for source dependencies; no license-compliance gating is evidenced.
  • high

    Ensure NOTICE/LICENSE attribution obligations are satisfied: collect dependency license texts/NOTICE files (from scan output or package metadata) into a standardized third-party notices location and verify it is kept current with lockfile updates.

    • package.json:1-35 — Project license declaration exists, but there is no evidence of automated attribution/NOTICE handling for transitive dependencies.
  • med

    Document the re-pricing rule for network-copyleft and add an explicit exception process (legal sign-off) if any copyleft licenses are introduced—do not rely on usage/reachability to mitigate license risk.

    • .github/dependabot.yml:1-45 — Repo already has a security/pinning posture description; extend it to cover legal licensing obligations and the failure/exception workflow.
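
A minimal sketch of the license gate recommended above, assuming a scanner has already produced a mapping from package name to SPDX license identifier. The deny-list contents and the function name `license_violations` are illustrative assumptions, not a legal policy, and compound SPDX expressions such as `MIT OR GPL-3.0-only` are not parsed here.

```python
# Illustrative deny-list of SPDX identifiers; a real policy needs legal review.
DENY = {
    "AGPL-3.0-only", "AGPL-3.0-or-later", "SSPL-1.0",
    "GPL-3.0-only", "GPL-3.0-or-later",
}


def license_violations(packages):
    """Return {package: reason} for denied or unresolved licenses.

    packages maps a package name to its SPDX license string, or None
    when the scanner could not resolve one.
    """
    violations = {}
    for name, spdx in sorted(packages.items()):
        if not spdx or spdx.upper() in {"UNKNOWN", "NOASSERTION"}:
            violations[name] = "unresolved license"
        elif spdx in DENY:
            violations[name] = f"denied license {spdx}"
    return violations
```

A CI step would fail the build, or route the change to legal approval, whenever the returned mapping is non-empty.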
Known-vulnerability scan 0%

A known-vulnerability scan over lockfiles (OSV/CVE findings with HIGH/CRITICAL triage) is not found as a primitive in this codebase. While there is an OSV-based malware check (MAL-* advisories) for MCP extension packages, it is not the required lockfile vulnerability scan practice (and it is fail-open on network errors). The repo does contain multiple committed lockfiles (npm and uv), which are the natural sites where the known-vulnerability scan should be wired into CI, but no such implementation was located.

  • high

    Add a CI job that runs osv-scanner (or equivalent) in *vulns* mode over every committed lockfile in the repo (root package-lock.json, uv.lock, and each subproject package-lock.json). Configure it to fail the build if there are any untriaged HIGH/CRITICAL findings, and require per-finding remediation or explicit documented exceptions.

  • med

    Ensure results are triaged and prioritized using reachability/context (where possible): for each HIGH/CRITICAL finding, record whether the vulnerable code path is actually used (or at least whether the vulnerable package is required in the runtime bundle).

    • tools/osv_check.py:1-156 — Current OSV usage is for MAL-* malware blocking only; it does not implement the required HIGH/CRITICAL triage over dependency vulnerabilities.
  • med

    Avoid fail-open behavior for the known-vulnerability scan primitive: unlike the current MAL-* check (network errors allow proceeding), the dependency vulnerability scan should have a deterministic outcome (e.g., retries with caching, or fail closed with degraded-mode reporting).

    • tools/osv_check.py:1-60 — Explicitly documented fail-open behavior on OSV network errors; this pattern should not be reused for required lockfile vulnerability scans.
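
To contrast with the fail-open pattern noted above, the sketch below shows one way a dependency scan could retry and then fail closed. The `query` callable is a hypothetical stand-in for whatever performs the advisory lookup (it is not the API of tools/osv_check.py), and the retry count and delay are arbitrary defaults.

```python
import time


class ScanUnavailable(Exception):
    """Raised when the advisory source cannot be reached after all retries."""


def scan_fail_closed(query, packages, retries=3, delay=0.0):
    """Query advisories for each package; raise instead of passing on errors.

    query(name, version) is a caller-supplied function (hypothetical here)
    that returns a list of advisory IDs or raises OSError on network failure.
    """
    findings = {}
    for name, version in packages:
        for attempt in range(1, retries + 1):
            try:
                ids = query(name, version)
                break
            except OSError as exc:
                if attempt == retries:
                    # Fail closed: an unreachable source is not a clean result.
                    raise ScanUnavailable(f"{name}@{version}: {exc}") from exc
                time.sleep(delay)
        if ids:
            findings[(name, version)] = ids
    return findings
```

The key design choice is that a network failure surfaces as an exception the CI job must handle, so "scan could not run" is never confused with "no findings".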
Known-exploited CVEs 0%

The repo includes OSV-based security tooling (tools/osv_check.py blocks MAL-* advisories; hermes_cli/security_audit.py queries OSV for vulnerabilities), but no evidence was found that it specifically checks for the OSV 'known-exploited CVEs' set. Additionally, .github/workflows entries were not found via code search, so an automated gating primitive for known-exploited CVEs could be missing or located outside the searched workflow paths.

  • high

    Add a dedicated 'known-exploited CVEs' scan step that fails CI/release if any pinned dependency matches the OSV known-exploited set (not just MAL-*). Implement it in a script (or extend hermes_cli/security_audit.py) to explicitly detect those IDs/aliases from OSV results.

  • high

    Ensure the repository actually runs the primitive in automation. Confirm/introduce a CI workflow (or equivalent) under .github/workflows or another CI entrypoint that invokes the known-exploited check on every PR and blocks merges.

  • med

    If OSV network failures are possible, avoid fail-open for this primitive. For known-exploited CVEs, treat scan errors as 'unknown' and fail-closed (or require a verified offline SBOM/CVE DB snapshot) to prevent silent exposure.

    • tools/osv_check.py:1-52 — Documentation and exception handling explicitly 'Fail-open: network errors allow the package to proceed.'
  • low

    Unify the two OSV utilities: make tools/osv_check.py a thin wrapper over the richer hermes_cli/security_audit.py (or vice versa) so the repo has one consistent definition of security gating categories (including known-exploited CVEs).

Dependency usage & reachability N/A

N/A for this audit run: the required virgil_query primitive surface (template/row/table) for `dependency_usage_reachability` is not present in the tool-backed code graph in this environment, so I cannot enumerate declared-but-never-imported vs phantom deps or confirm vulnerable API reachability via the prescribed reachability queries.

  • high

    Re-run with a virgil-cli configuration/version that supports the `dependency_usage_reachability` template/row (or confirm the correct template/table name for this deployment). As a fallback, provide the exact virgil_query SQL schema/expected tables that implement: (1) declared-but-never-imported, (2) imported-but-undeclared, (3) call_site reachability by receiver.

    • N/A:N/A — Tooling errors observed: `unknown template 'dependency_usage_reachability'` and `Table with name dependency_usage_reachability does not exist`. These indicate the primitive’s on-graph machinery is missing/unavailable.
  • med

    Once the on-graph primitive is available, I will: (a) cross-check npm/Python manifests vs `raw_import` for unused/phantom dependency issues, and (b) join `raw_import` to `call_site` to determine whether CVE-flagged vulnerable APIs are actually reached, then anchor every site to the exact manifest/lockfile lines and code call sites.

    • package.json:1-35 — This repo uses npm workspaces and declares dependencies (e.g., `@streamdown/math`, `agent-browser`). This is the manifest input that would be compared against on-graph imports/call sites once the reachability queries are available.
Dependency freshness 67%

Dependency freshness controls appear implemented via committed lockfiles for both Node (package-lock.json) and Python (uv.lock), providing deterministic pinning of resolved transitive dependencies. Additionally, Dependabot is set up to keep GitHub Actions fresh (weekly) and to rely on security-update triggers for action SHA patching. Overall, this is a strong baseline for avoiding dangerously stale dependencies, though Dependabot explicitly does not auto-update source dependencies across npm/PyPI (by design).

  • high

    Extend automated freshness checks beyond “pinning exists”: add/verify a CI step that flags dependencies that are many versions behind their upstreams (or simply runs `npm outdated` / `uv pip list --outdated` and fails/warns). Evidence of pinning is present, but the repo should also prove it actively detects staleness.

    • package-lock.json:1-35 — Lockfile pinning is present, but freshness detection/alerts are not evidenced by this lockfile alone.
  • med

    Confirm that npm/uv dependency update PRs for source dependencies are actually flowing through the configured process (not just the dependabot.yml scope). Verify via CI logs or a sample merged PR that updates package-lock.json and uv.lock on a cadence.

    • .github/dependabot.yml:1-45 — This file shows Dependabot is scoped to GitHub Actions only; source dependency freshness likely depends on another workflow/mechanism.
  • low

    For the Python constraints file (requirements.txt), ensure it is either generated from the locked uv set or kept consistent with it, so that it doesn’t become an “outdated constraint” source of confusion.

Upstream maintenance 0%

Upstream maintenance is partially implemented via Dependabot for GitHub Actions (scheduled weekly). However, the repo explicitly excludes Dependabot automation for source dependency ecosystems (pip/npm), so upstream maintenance for the actual application/runtime dependencies is not clearly covered by an automated upstream-patching mechanism in the audited config. No deprecation/abandoned-upstream signals were provided by the inventory output, so the main hygiene gap here is the lack of an actively maintained upstream update workflow for source dependencies.

  • high

    Enable upstream-maintenance automation for source dependencies (at minimum, Dependabot security updates) for the ecosystems the repo actually uses (PyPI and npm), or add a documented CI mechanism that reliably bumps and locks patched versions when upstream releases security/bugfix updates.

    • .github/dependabot.yml:1-18 — Config explicitly states Dependabot is NOT enabled for pip/npm/source dependencies, which leaves upstream maintenance for those dependencies without an automated upstream-update mechanism.
  • med

    Add/confirm CI evidence that pinned source-dependency updates occur when upstream publishes patches (e.g., Dependabot security update runs for pip/npm, or a scheduled/manual update workflow that includes security-triggered PRs).

    • .github/dependabot.yml:19-45 — Comments describe that source-dependency security updates are intended to be enabled separately, but the checked-in config does not show pip/npm ecosystems. Verification should ensure the intended setting/workflow actually exists and runs.
Remediation velocity 100%

Remediation velocity is present. The repo has an active Dependabot configuration for GitHub Actions with a weekly update cadence. Off-graph provenance indicates that dependency-update PRs are not just configured but have merged recently (39 merges in the last 90 days; 47 in the last 365 days), supporting that the mechanism is working rather than stalled.

  • high

    Verify that Dependabot security updates for pinned source dependencies are actually enabled and resulting in merged CVE-only PRs (not just action-bumps). If they are disabled or slow, adjust the Dependabot security update settings or add targeted automation for the pinned ecosystems (uv.lock / package-lock.json).

    • .github/dependabot.yml:1-45 — The config explicitly scopes scheduled updates to github-actions and states source dependency CVE updates are enabled separately via repository security settings. Confirm that those security-update PRs are flowing/being merged (not only the scheduled actions bumps).
  • med

    Add an explicit metric/check (e.g., CI badge or automated report) for “dependency-update PRs merged in last 90 days” to prevent the velocity mechanism from silently degrading over time.

    • .github/dependabot.yml:1-45 — A mechanism exists, but there is no in-repo enforcement visible here that ensures continued merge velocity; implementing a monitoring check prevents regression.
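
The "merged in last 90 days" metric suggested above reduces to a date-window count. The sketch below assumes the merge dates of dependency-update PRs have already been extracted from git history or the hosting API; the function names and the default minimum of one merge are hypothetical.

```python
from datetime import date, timedelta


def merges_in_window(merge_dates, today, days=90):
    """Count merge dates that fall within the last `days` days, inclusive."""
    cutoff = today - timedelta(days=days)
    return sum(1 for d in merge_dates if cutoff <= d <= today)


def velocity_ok(merge_dates, today, minimum=1, days=90):
    """True when at least `minimum` dependency-update merges landed in the window."""
    return merges_in_window(merge_dates, today, days) >= minimum
```

A scheduled job could call `velocity_ok` and open an issue or fail a status check when it returns False, which turns a silent slowdown into a visible signal.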
Supply-chain integrity 67%

Supply-chain integrity is implemented via committed dependency lockfiles with integrity hashes: `package-lock.json` for npm packages (sha512 integrity fields) and `uv.lock` for Python packages (sha256 hashes for sdists/wheels). A plain `requirements.txt` exists for an optional Python submodule, but the integrity mechanism is correctly provided by the presence of `uv.lock` rather than relying on the un-hashed requirements file.

  • high

    Ensure CI/CD uses the lockfiles in a non-floating way (e.g., `npm ci` / `npm ci --ignore-scripts` as appropriate; `uv sync --locked`) so integrity verification is actually enforced during builds, not only during local installs.

    • package-lock.json:1200-1245 — The lockfile has integrity hashes, but build-time enforcement depends on CI using the lockfile install commands.
    • uv.lock:1-80 — The lockfile has integrity hashes, but build-time enforcement depends on CI using the locked sync/install workflow.
  • med

    Add/verify CI steps that fail the build if lockfiles are out-of-date (e.g., ensure no dependency installs are allowed without `--locked`/lockfile mode). This prevents accidental drift that would weaken integrity guarantees.
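
As a complement to installing with `npm ci`, a small pre-install check can confirm that the lockfile entries actually carry integrity hashes. This sketch assumes npm's v2/v3 lockfile layout; the function name is hypothetical, and entries resolved from git URLs may legitimately lack a hash and would need an explicit allowance.

```python
def entries_without_integrity(lock):
    """List package-lock.json (v2/v3) entries that carry no integrity hash."""
    missing = []
    for path, meta in lock.get("packages", {}).items():
        if "node_modules/" not in path:
            continue  # root project or workspace member, not a registry install
        if meta.get("link") or meta.get("inBundle"):
            continue  # symlinks and bundled deps have no hash of their own
        if not meta.get("integrity"):
            missing.append(path)
    return sorted(missing)
```

Failing the build on a non-empty result keeps the integrity guarantee from eroding quietly when the lockfile is regenerated.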

Dependency-confusion resistance 67%

Dependency-confusion resistance appears implemented via committed lockfiles for both npm (package-lock.json) and Python (uv.lock). The root package manifest does not show obvious unscoped/private or typo-similar package names in the lines reviewed, and the lockfiles should prevent registry-name swapping from taking effect during installation.

  • high

    Verify there are no unscoped private packages declared anywhere else in the repo (e.g., other package.json files under workspaces, app-specific manifests) and confirm any @org private scopes are the intended namespaces.

    • package.json:1-35 — This is the primary surface for dependency names; only the root manifest slice was reviewed.
  • med

    For npm, ensure the install process always uses the committed lockfile in CI/release (e.g., npm ci) and does not fall back to package.json-only resolution.

  • low

    Align any loose Python requirements (requirements.txt) with the pinned uv.lock output; avoid relying on ranges for sensitive/private packages.

IP ownership / provenance 0%

For the primitive “IP ownership / provenance”, I did not find evidence of a durable IP-assignment mechanism (e.g., CLA/contributor agreement documentation and coverage workflow) that would let an acquirer verify that every meaningful contributor’s IP is assigned to the company. The repo contains a contributor audit script that tracks attributions, but it is not itself an IP-ownership/legal assignment artifact.

  • high

    Add or point to the durable legal mechanism that assigns contributor IP to the company (CLA/contributor agreement), and document it in-repo (e.g., LICENSE/CLA/CONTRIBUTING reference) including how signatures are obtained/recorded for all contributors.

    • scripts/contributor_audit.py:1-120 — Contributor audit tooling exists; however, no legal assignment mechanism is evidenced here. This should be connected to the CLA/IP assignment artifact used to cover contributors.
  • high

    Create a searchable provenance record that maps contributors (git author emails/handles) to IP assignment status (e.g., CLA signature IDs/timestamps, or an escrowed/archived record), and ensure the audit script (or a companion CI check) validates coverage against that record.

    • scripts/contributor_audit.py:200-360 — The script already enumerates contributors across git history and PR attribution signals; it should be extended to validate against an IP-assignment/CLA coverage dataset rather than only checking release-note mentions.
  • med

    Include/maintain a current “IP roster” (employee/contractor emails) and explicitly document the policy for external contributors (require CLA/assignment before merging). This allows the unassigned-IP cloud to be resolved deterministically during diligence.

    • scripts/contributor_audit.py:1-120 — The script performs contributor resolution/exclusion, which is the right starting point, but the roster/legal coverage policy that would resolve unassigned IP is not evidenced.
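
The coverage check described above amounts to a set difference between contributors found in git history and people covered by a roster or a signed agreement. The sketch below is independent of scripts/contributor_audit.py; the record shapes and the function name are hypothetical.

```python
def uncovered_contributors(author_emails, cla_records, roster=()):
    """Return contributor emails with neither a roster entry nor a CLA record.

    author_emails: emails taken from git history
    cla_records:   {email: signature reference}  (hypothetical record store)
    roster:        employee/contractor emails already covered by agreement
    """
    covered = {e.lower() for e in roster} | {e.lower() for e in cla_records}
    return sorted({e.lower() for e in author_emails} - covered)
```

Author emails can be listed with `git log --format=%ae`. Any address in the result is a contributor whose IP assignment status still has to be resolved.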
AI-coding-tool provenance N/A

No AI-coding-tool provenance tracking convention (generated-code markers/headers, commit trailer patterns, or an AI-usage/provenance policy doc) was identified in the code artifacts inspected. Therefore this primitive is treated as absent for this codebase in its current form.

  • high

    Add an explicit AI-provenance policy/document (e.g., in CONTRIBUTING/SECURITY/OSS-HYGIENE docs) defining: what counts as AI-generated code, required attribution/traceability format, and how license/IP review is triggered for AI-generated snippets.

  • high

    Introduce a repo-wide provenance marker convention for AI-generated code (e.g., a standardized header comment like `# AI-GENERATED: <tool> <date> <prompt/ref> <license-check-status>` and/or placing such files under an `ai-generated/` directory).

  • med

    Add CI checks to enforce provenance requirements (e.g., scan for missing/incorrect AI-provenance headers in files labeled as AI-generated; require a link/reference to a review or approval record).
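
A CI check of the kind recommended above could validate the first line of each file under an `ai-generated/` directory. The exact header grammar here (ISO date, whitespace-free reference, and a status of `pending`, `approved`, or `rejected`) is an assumption layered on the header shape proposed in the recommendation; the repo defines no such convention today.

```python
import re

# Assumed header shape: "# AI-GENERATED: <tool> <YYYY-MM-DD> <ref> <status>"
HEADER = re.compile(
    r"^# AI-GENERATED: (?P<tool>\S+) (?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<ref>\S+) (?P<status>pending|approved|rejected)$"
)


def missing_provenance(files):
    """Return paths under ai-generated/ whose first line is not a valid header.

    files maps a repo-relative path to the file's text.
    """
    bad = []
    for path, text in sorted(files.items()):
        if not path.startswith("ai-generated/"):
            continue
        first_line = text.splitlines()[0] if text else ""
        if not HEADER.match(first_line):
            bad.append(path)
    return bad
```

The check only covers files that are already labeled by location; detecting unlabeled AI-generated code is a policy and review problem that a header scan cannot solve.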

Not applicable to this codebase: Software bill of materials, Dependency usage & reachability, AI-coding-tool provenance.

Implementation & Customization

Configuration over per-customer branches: no "if customer_id == 12345", no pricing literals scattered outside the billing module.

78% 7/10 scored
  • Configuration over code branches 56%
    2/3 expected sites
  • Centralized pricing/plan logic 67%
    2/3 expected sites
  • Metering decoupled from pricing model 67%
    3/3 expected sites
  • Feature gating via flags, not forks 80%
    5/5 expected sites
  • Documented extension interface 100%
    9/8 expected sites
  • Customization isolation & upgrade safety 100%
    7/7 expected sites
  • Theming / white-label as config 78%
    3/3 expected sites
Configuration over code branches 56%

This codebase uses configuration/data to drive meaningful variation (notably tool enablement/provider setup persisted to user config, and custom provider request overrides merged via config-like inputs). However, pricing/rate tables are embedded as large code literals in agent/usage_pricing.py, which is a divergence risk versus a pure config/data approach for evolving pricing rules.

  • high

    Move the pricing/rate tables out of agent/usage_pricing.py into a versioned config/data source (e.g., a YAML/JSON file packaged with the app or loaded from local disk), and make the lookup layer load from that data so that updating prices or onboarding new models requires config changes, not code edits.

    • agent/usage_pricing.py:1-220 — The core pricing matrix is hardcoded in _OFFICIAL_DOCS_PRICING with many per-model/per-provider Decimal literals, indicating code-based customization instead of data-driven configuration.
  • med

    Ensure all future per-user/toolset variations (especially plugin-provided toolsets) flow through the same config persistence mechanism and do not require new branching code in hermes_cli/tools_config.py for each new variation.

    • hermes_cli/tools_config.py:1-220 — The module already provides a config registry and plugin discovery hooks; keep new variation sources wired through these data structures.
  • low

    Where custom provider behavior uses request_overrides.extra_body, document the schema for custom_providers entries so integrators can add behaviors via config confidently without touching init logic.

    • agent/agent_init.py:1-140 — The selection and merge logic for custom_providers/extra_body is present; formalizing its expected config shape reduces the need for code changes.
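To make the first recommendation concrete, here is a sketch of a pricing table loaded from data instead of code literals. The JSON shape, the `PriceRow` type, and both function names are hypothetical; they do not mirror the actual PricingEntry or _OFFICIAL_DOCS_PRICING structures, whose fields were not inspected for this example.

```python
import json
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PriceRow:
    input_per_million: Decimal
    output_per_million: Decimal


def load_pricing(raw_json):
    """Build a (provider, model) -> PriceRow table from a JSON document.

    Prices are stored as strings in the (assumed) file so they convert to
    Decimal without passing through binary floats.
    """
    data = json.loads(raw_json)
    return {
        (row["provider"], row["model"]): PriceRow(
            Decimal(row["input_per_million"]),
            Decimal(row["output_per_million"]),
        )
        for row in data["prices"]
    }


def estimate_cost(table, provider, model, input_tokens, output_tokens):
    """Map token counts to a cost using per-million-token prices."""
    row = table[(provider, model)]
    total = row.input_per_million * input_tokens + row.output_per_million * output_tokens
    return total / Decimal(1_000_000)
```

With this layout, adding a model is a one-row change to the data file, and the file can carry its own version field for auditability.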
No hardcoded customer branching N/A

No hardcoded customer/tenant/org/account identity branching was found in the audited code paths. Observed uses of `account_id` / `tenant_id` are configuration or data-scoping inputs (e.g., headers, request URLs, metadata fields), not `if`/switch-style special-casing based on literal identity values.

  • med

    If customer-specific behavior is expected in this repo, add/update tests that fail when business logic branches on literal `customer_id`/`tenant_id`/`org_id`/`account_id` values (e.g., snapshot tests with multiple tenants).
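One way to make such a test mechanical is a static scan for comparisons between an identity field and a literal. The sketch below uses Python's `ast` module; the set of field names comes from the recommendation above, while the function names are illustrative. It only catches direct comparisons such as `customer_id == 12345`, not membership tests against tuples or lookups in hardcoded dicts.

```python
import ast

IDENTITY_NAMES = {"customer_id", "tenant_id", "org_id", "account_id"}


def _names_identity(node):
    """True for a bare name or attribute whose name is an identity field."""
    if isinstance(node, ast.Name):
        return node.id in IDENTITY_NAMES
    if isinstance(node, ast.Attribute):
        return node.attr in IDENTITY_NAMES
    return False


def find_literal_identity_branches(source):
    """Return line numbers where an identity field is compared to a literal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Compare):
            continue
        operands = [node.left, *node.comparators]
        has_identity = any(_names_identity(op) for op in operands)
        # Only int/str constants count; comparisons against None are allowed.
        has_literal = any(
            isinstance(op, ast.Constant) and isinstance(op.value, (int, str))
            for op in operands
        )
        if has_identity and has_literal:
            hits.append(node.lineno)
    return sorted(hits)
```

A test suite could run this over every source file and assert that the result is empty.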

Centralized pricing/plan logic 67%

A centralized pricing/plan-cost module exists at agent/usage_pricing.py, with PricingEntry and an _OFFICIAL_DOCS_PRICING snapshot plus routing to official snapshots or provider metadata. Other surfaces (notably the CLI and the agent conversation loop) reuse that module’s pricing helpers rather than duplicating pricing constants or cost calculation logic.

  • high

    Search for any remaining price/tier/discount literals outside agent/usage_pricing.py and the official metadata providers, and refactor them to call get_pricing_entry/estimate_usage_cost/resolve_billing_route.

    • agent/usage_pricing.py:1-118 — This file should become the single source of pricing constants/rules; any other literal-based pricing should be eliminated.
  • med

    For the desktop model picker, confirm the UI is only consuming backend-provided price/tier fields (not computing any pricing). If any computations exist in apps/desktop, move them behind the backend pricing endpoints.

Metering decoupled from pricing model 67%

The codebase includes a clear separation between usage metering (token counters persisted in the session DB) and pricing model logic (agent/usage_pricing.py maps normalized usage to cost estimates). However, billing/cost fields also exist alongside usage in the session schema (estimated/actual), so the decoupling is strong but not perfectly 'events only + later mapping' throughout the stack.

  • high

    Audit the session write path (where input_tokens/output_tokens/cache_*_tokens and billing_provider/billing_mode are populated) to ensure no pricing literals/rules are embedded in the metering/capture code. If billing_mode/cost_source are set during capture, refactor to store only metering inputs and defer cost-source/pricing-version selection to usage_pricing.

    • hermes_state.py:260-320 — Shows usage counters and pricing-related fields coexist in the same persisted record; capture-path coupling may exist and should be checked.
  • med

    Ensure all cost displays (CLI/status/any API) derive from estimate_usage_cost(normalized metering) rather than re-implementing token->price math elsewhere (search for token multipliers or per-million constants outside agent/usage_pricing.py).

Feature gating via flags, not forks 80%

This codebase uses a centralized entitlements/feature-state mechanism (NousFeatureState / NousSubscriptionFeatures) to gate tiered capabilities (web/image/video/tts/browser/modal) via flags derived from account entitlements and configuration, rather than introducing divergent per-plan code paths.

  • high

    Verify (and if needed, tighten) that every tool dispatch path consumes the already-computed enabled_toolsets/disabled_toolsets (from tools_config/nous_subscription) and does not re-check plan/tier in individual tool handlers.

    • agent/agent_runtime_helpers.py:1600-2362 — invoke_tool passes enabled_toolsets/disabled_toolsets down to the shared handler; audit downstream tool registry/handlers for any re-introduced tier-specific branching.
  • med

    Ensure gateway/managed-vs-direct selection remains strictly config/flag driven across all UI and runtime surfaces (CLI, TUI, gateway). Add a single contract test that compares feature-state outputs to tool registration results.

  • low

    If additional flags are introduced in the future, deprecate/retire old gating variables (e.g., legacy tier/message flags) and keep gating logic concentrated in hermes_cli/nous_subscription.py.

Documented extension interface 100%

This codebase has a strong, documented extension interface centered on a plugin system (`hermes_cli/plugins.py`) with stable hook definitions and a `PluginContext` facade for third-party customization. Separately, it uses documented ABC/profile contracts for provider customization (`ProviderProfile` and browser `BrowserProvider`) with registry-driven discovery/dispatch and (for some interfaces) compliance tests to keep the contract stable across upgrades.

  • high

    Add explicit documentation for the customer-facing extension points most likely to be used externally (e.g., browser provider lifecycle, `ProviderProfile` hooks, and dashboard-auth provider protocol), including versioning/compat rules and an example “hello world” plugin for each. Current code has strong docstrings, but external/partner docs often need a dedicated compatibility section.

    • hermes_cli/plugins.py:1-110 — Plugin contract exists, but there’s no explicit, single “public contract doc + versioning rules” artifact shown in the evidence slices.
    • providers/base.py:1-199 — ProviderProfile is documented in-code; partner adoption typically benefits from a versioned interface spec and compatibility policy.
  • med

    Ensure all extension categories (standalone/backend/platform/exclusive) have equivalent compliance tests (like `test_plugin_platform_interface.py`) so the stability guarantee covers every documented contract, not only gateway platform plugins.

  • low

    Add at least one upgrade/compat regression test for provider-profile overrides (e.g., ensuring `build_api_kwargs_extras` and alias resolution remain stable), mirroring the approach used for platform interface compliance tests.

    • providers/__init__.py:1-120 — Discovery and override semantics are critical for extension safety; a focused regression test would reduce risk during future refactors.
Customization isolation & upgrade safety 100%

This codebase implements customization with explicit, stable extension boundaries (plugin LLM access with config trust gating, shell hooks injected via an existing hook manager with allowlist/consent and idempotent config registration, and image-generation provider backends behind an ABC selected by config). It also centralizes customization-prone credential removal via a registry to avoid per-source bespoke core branching that would require re-validation on upgrades.

  • high

    Add a short, versioned extension contract doc/test for each customization boundary (plugin LLM ctx.llm, shell hooks event matcher+callback shape, ImageGenProvider.generate response schema). Ensure CI has “contract tests” that validate core+plugin integration across upgrades.

    • agent/plugin_llm.py:1-70 — Plugin LLM surface and trust-gating contract is defined here; it should be locked down with explicit contract tests.
    • agent/shell_hooks.py:1-55 — Shell hook configuration/dispatch contract is defined here; regression safety depends on preserving the wire protocol and callback behavior.
    • agent/image_gen_provider.py:1-35 — Image provider contract is defined here; adding contract tests will prevent accidental coupling during core evolution.
  • med

    For plugin trust-gated overrides, ensure there is a single authoritative place that documents the default-deny behavior for missing config and how allowed_* lists interact (including wildcard semantics), and add tests for each override dimension.

    • agent/plugin_llm.py:230-320 — Override trust gating logic lives here; expanding test coverage around all override flags reduces upgrade risk.
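The default-deny and wildcard behavior that the second item asks to document can be pinned down with a small reference function and tests. The sketch below assumes glob-style wildcards via `fnmatch`; the real matching rules in agent/plugin_llm.py were not confirmed, so treat the semantics and the function name as placeholders to be replaced with the documented contract.

```python
from fnmatch import fnmatchcase
from typing import List, Optional


def is_override_allowed(requested: str, allowed: Optional[List[str]]) -> bool:
    """Default-deny check for a plugin override against an allow list.

    Missing or empty config denies everything. Entries are treated as
    case-sensitive glob patterns (an assumption), so "*" allows any value.
    """
    if not allowed:
        return False
    return any(fnmatchcase(requested, pattern) for pattern in allowed)
```

Writing the contract this way makes each override dimension testable with the same three cases: missing config, exact match, and wildcard match.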
Theming / white-label as config 78%

The codebase does support theming/white-label as configuration. The clearest, explicitly white-label-oriented implementation is the CLI `skin_engine` (YAML-driven skins with `branding` and `colors`, requiring no code changes to add a new skin). Additionally, both the desktop and web frontends use centralized theming contexts that apply brand/skin differences by building/injecting theme tokens (CSS variables) rather than forking UI code per brand.

  • high

    Verify end-to-end partner onboarding expectations for white-label: confirm there is a supported/configurable mechanism to select or activate a skin/theme per deployment (and/or per environment) without manual code edits or per-customer branches. If selection is currently only user-local (e.g., localStorage), document the intended operational workflow for multi-tenant/partner setups.

    • apps/desktop/src/themes/context.tsx:220-335 — Desktop skin/mode are resolved via localStorage and preset lists; confirm whether there’s an operator-level config/branding assignment path for partners beyond per-user local selection.
    • hermes_cli/skin_engine.py:1-260 — The skin engine is data-driven via YAML, but this slice is primarily schema/docs and built-in skins; confirm the actual runtime selection/activation mechanism (e.g., config key or CLI flags) is operator-driven and not code-driven.
  • med

    Align naming and theme selection semantics across surfaces (CLI skins vs desktop skins vs web themes) so that ‘brand/theme id’ maps cleanly to all clients. This reduces divergence where different clients use different identifiers/selection rules.

Tenant-configurable behavior surface N/A

After inspecting the code areas most likely to implement or reference a tenant/customer-configurable behavior surface (gateway config, plugin trust/policy knobs, and tenant-named tests), I did not find a clear settings/rules model where customer/tenant-specific behavior variations are expressed as data (per-tenant config rows) rather than code branching. A number of config-driven switches exist (e.g., plugin LLM trust gate, gateway/session policies), but they are not established as a per-tenant behavior surface in the sense required by this primitive.

  • high

    Locate (or confirm absence of) the intended multi-tenant model: search for the authoritative tenant identifier (e.g., `tenant_id`, `org_id`, `customer_id`) used to select tenant-scoped settings/rules, and for a central config/rules loader that feeds behavior (not just infra config). If none exists, define a tenant-scoped settings schema (YAML/DB) and refactor the behavior gates to read from it.

    • tests/gateway/test_teams.py:1-120 — Indicates tenant is considered in config validation at least at the adapter level, but no corresponding tenant-configurable behavior surface wiring was confirmed in the inspected code.
Onboarding-by-configuration cost N/A

I did not find an implementation specifically targeting “onboarding a new customer cheaply via configuration/data (self-serve provisioning)”. The codebase has an “onboarding” concept for first-run UX hints (tracked in local config), and also config-driven behaviors (e.g., shell hooks registered from config), but these do not constitute a documented, customer/tenant onboarding-by-configuration pathway.

  • high

    Add/confirm a tenant/customer onboarding-by-configuration contract: a single documented provisioning entrypoint (ideally self-serve) that turns a new tenant’s input data/config into the required runtime setup without per-customer code edits or bespoke deploys. Include example configs and a validation script/command that verifies everything is wired correctly.

    • agent/onboarding.py:1-22 — Current onboarding is scoped to first-touch hint flags in local config, not customer/tenant provisioning; this highlights the gap to address.

Not applicable to this codebase: No hardcoded customer branching, Tenant-configurable behavior surface, Onboarding-by-configuration cost.

Procurement Code Readiness

Data-export and data-subject erase/export endpoints, region pinning, and DPA-mapped controls that survive enterprise procurement.

21% 6/10 scored
  • Self-serve trust documentation 0%
    0/2 expected sites
  • Data export mechanism 0%
    0/2 expected sites not present
  • Deletion / erase-on-request 42%
    3/4 expected sites
  • Data residency commitment 0%
    0/1 expected sites not present
  • Enterprise access controls 83%
    2/2 expected sites
  • Sub-processor transparency 0%
    0/2 expected sites not present
Self-serve trust documentation 0%

The repo contains committed security/trust documentation (SECURITY.md and a published website security page), but it is not packaged as a self-serve procurement trust set (it does not present deal-closing artifacts like DPAs/certifications/sub-processor lists/pen-test summaries and maintained control status as prospect-ready materials).

  • high

    Create/maintain a single, prospect-facing trust center (or “Trust & Security” landing page) that self-serve purchasers can use for diligence: include links to the current DPA/contract terms, a versioned sub-processor list, pen-test/audit summaries (with dates and report references), and any maintained control-status overview. Ensure links resolve to committed artifacts that are updated on a schedule.

    • SECURITY.md:1-35 — Current trust content is a policy/vulnerability disclosure document; it does not provide the packaged deal-closing self-serve procurement materials expected of this primitive.
    • website/docs/user-guide/security.md:1-40 — Current public page documents security layers, but it is not an entry point for procurement/self-serve certifications and contracting artifacts.
  • med

    Augment SECURITY.md (or the trust landing page it links to) with an explicit “Procurement packet” section: what artifacts exist, where they live in-repo (or on a published secure location), and their last-updated dates/versions.

    • SECURITY.md:1-35 — SECURITY.md is the natural place to anchor trust commitments, but today it focuses on boundaries and vulnerability reporting rather than procurement-ready packaging.
  • low

    If the project intends to keep these artifacts off-code (e.g., compliance reports in a data room), still add stable, self-serve references and an index page in the repo so buyers don’t have to re-derive everything from scratch.

Questionnaire response library N/A

No questionnaire response library (e.g., CAIQ/SIG/VSA reusable question-to-answer bank) is present in the repository. This primitive is a DATA-ROOM follow-up artifact, and its absence is expected; procurement should transition to requesting the current, framework-mapped questionnaire response package from the seller.

  • high

    Request the seller’s current security questionnaire response library (CAIQ/SIG/VSA as applicable), including version/date and the mapping to the dominant frameworks/domains used for procurement (and any supporting evidence bundle).

  • med

    Ask for a control-evidence index accompanying the questionnaire responses (i.e., what documents/reports substantiate each answer) to avoid re-deriving controls during diligence.

Controls-to-contract mapping N/A

I did not find any controls-to-contract mapping artifact (DPA/MSA -> controls -> audit evidence) that would let procurement verify commitments like encryption, retention, breach notice, and residency against implemented mechanisms and packaged evidence. The repo has security policy and security guidance/deployment documentation, but nothing that reads like the required hybrid controls-mapping doc for contract close readiness.

  • high

    Ask the seller/GC/R&W underwriter for the current DPA/MSA + the specific controls-to-contract mapping (or equivalent schedule) that enumerates each DPA commitment (encryption, retention, breach notice, residency) and attaches/points to the audit evidence used for attestation.

    • SECURITY.md:1-31 — Confirms what the repo provides today (security policy) but does not provide contract-commitment traceability.
  • med

    Request/produce a repo-adjacent mapping document under the expected location (e.g., docs/security) that explicitly links each named DPA/MSA commitment to (a) a code-visible control/mechanism and (b) the audit evidence package/version used for that control.

Data export mechanism 0%

The codebase supports exporting an individual session (web endpoint `/api/sessions/{session_id}/export` and a desktop helper `exportSession`) and the data store includes `export_all()` to export all sessions. However, there is no tenant-scoped “export ALL data out on request” handler/job wired into the web/API layer (the available export endpoint is per-session), so the procurement “data export mechanism” primitive is not fully implemented as required.

  • high

    Add a tenant-scoped “export all data” HTTP endpoint (or async job + polling/download endpoint) that invokes `SessionDB.export_all(...)` and returns data in a portable format (e.g., JSONL/JSON archive). Ensure the export is scoped to the authenticated tenant/user context (not global) and streamed/queued if large.

    • hermes_cli/web_server.py:4440-4487 — Current export handler is only per-session (`/api/sessions/{session_id}/export`). A tenant-wide route should be added alongside this pattern.
    • hermes_state.py:3068-3097 — `export_all()` is the likely correct underlying implementation for the required tenant-wide export; wire it to a request/tenant-scoped mechanism.
  • med

    Extend the desktop/web UX to request a full export (not just a single session), and plumb it to the new tenant-wide backend export endpoint.
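A scoped, streamable export of the kind recommended above could be shaped like this sketch. It deliberately does not call `SessionDB.export_all(...)`, whose signature and return type were not inspected; instead it takes any iterable of session dicts. The `owner_id` field is a hypothetical scoping key.

```python
import json
from typing import Iterable, Iterator


def export_sessions_jsonl(sessions: Iterable[dict], owner_id: str) -> Iterator[str]:
    """Yield one JSON line per session that belongs to owner_id.

    Because this is a generator, a web framework can stream the output
    without holding the whole export in memory.
    """
    for session in sessions:
        # Scope check: never emit rows owned by someone else.
        if session.get("owner_id") != owner_id:
            continue
        yield json.dumps(session, sort_keys=True, ensure_ascii=False) + "\n"
```

JSONL keeps each record independently parseable, which helps when an export is interrupted or needs to be resumed.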

Deletion / erase-on-request 42%

The codebase contains code-visible deletion/erase operations for stored memory items (Supermemory forget tool + RetainDB delete endpoints). However, the implementation is primarily id/query-scoped deletes; the code evidence available here does not demonstrate the procurement primitive’s required verifiable, tenant/subject-scoped cascade deletion reaching backups/derived stores with auditable linkage to an erase-on-request contract.

  • high

    Add/locate a data-subject/tenant erase endpoint or job that is explicitly invoked by a deletion request, and ensure it cascades to all data stores (primary + derived + backups) for that subject/tenant; include audit evidence and request identifiers in logs.

  • med

    Implement (or expose) deletion-by-subject/session for each memory backend (not only by memory_id), so an erase request can delete all items belonging to a subject across containers/scopes in one verifiable workflow.

  • low

    Ensure UI/admin delete actions (if any) map to the same server-side subject/tenant erase workflow and return a verifiable status/result (including which datasets were purged).

    • gateway/platforms/api_server.py:1-260 — API header indicates DELETE /v1/responses/{response_id}, but erase-on-request should be validated at the data-subject layer, not only response deletion.
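The subject-scoped cascade described in these items can be sketched as a fan-out over stores that returns a verifiable receipt. Everything here is illustrative: `ErasableStore`, `InMemoryStore`, and `erase_subject` are hypothetical names, and real backends (Supermemory, RetainDB, backups) would each need their own adapter.

```python
from dataclasses import dataclass, field
from typing import Protocol


class ErasableStore(Protocol):
    """Hypothetical adapter interface each data store would implement."""
    name: str

    def delete_subject(self, subject_id: str) -> int: ...


@dataclass
class InMemoryStore:
    """Stand-in store used only to demonstrate the workflow."""
    name: str
    rows: list = field(default_factory=list)

    def delete_subject(self, subject_id: str) -> int:
        before = len(self.rows)
        self.rows = [r for r in self.rows if r.get("subject_id") != subject_id]
        return before - len(self.rows)


def erase_subject(request_id, subject_id, stores):
    """Delete a subject from every store and return an auditable receipt."""
    purged = {store.name: store.delete_subject(subject_id) for store in stores}
    return {"request_id": request_id, "subject_id": subject_id, "purged": purged}
```

The receipt ties the erase request identifier to per-store deletion counts, which is the linkage an auditor would look for.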
Data residency commitment 0%

Region concepts exist, but in the audited code they are used only to route provider/API calls (e.g., the AWS Bedrock client `region_name`). There is no evidence of a tenant-region attribute that is enforced end-to-end for data/compute placement as a residency commitment.

  • high

    Provide (or implement) an explicit tenant-scoped data residency control: a tenant `data_region`/region attribute in the data model, with routing that enforces compute + all persistence/derived/backups in that region. Ensure the enforcement is applied at every persistence boundary (memory/session storage, logs, vector stores, caches, media/object storage, queues).

    • agent/bedrock_adapter.py:53-84 — Current `region` usage is only to construct a provider client (`bedrock-runtime`, `region_name=region`), which is insufficient as residency enforcement for tenant data.
  • med

    Add end-to-end tests that assert tenant A’s data never touches tenant B’s region (including any async background jobs, retries, and fallback routing). This should verify both data storage location and any region-dependent derived artifacts.

    • agent/bedrock_adapter.py:53-84 — The region routing currently tested/used is provider client creation; add tests specifically covering tenant-scoped persistence boundaries and derived/backup flows.
Enterprise access controls 83%

Enterprise access controls are only partially implemented. The codebase clearly enforces CIDR allowlisting at the boundary for the MS Graph webhook adapter (including fail-closed behavior when the host is network-accessible and CIDRs are missing). However, the code-visible evidence does not demonstrate that this network restriction is managed through an admin UI or applied as a tenant-scoped enterprise control for the admin/dashboard boundary; the dashboard gate is identity/session-based rather than IP-allowlist-based.

  • high

    Confirm and wire an admin UI + persisted configuration that lets enterprise/security teams manage IP allowlists (CIDRs) for the relevant boundaries (at minimum: dashboard/admin endpoints, and ideally per-tenant). Provide code-visible enforcement that reads the admin-managed allowlist values.

  • med

    If tenant-scoped enterprise controls are required, refactor the CIDR allowlist configuration model so it is stored per tenant (and then ensure each boundary checks the tenant’s configured CIDRs). Add tests proving tenant A and tenant B have different CIDR behavior.

  • low

    Add/extend procurement-ready documentation in-repo that explicitly lists: (a) which endpoints enforce IP allowlisting, (b) how to configure CIDRs, (c) whether restrictions are global vs tenant-scoped, and (d) how the admin UI manages it.
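The fail-closed allowlist behavior described for the MS Graph webhook adapter can be expressed in a few lines with the standard `ipaddress` module. This is a sketch of the general technique, not the adapter's code; the function name and the `network_accessible` flag are assumptions.

```python
import ipaddress
from typing import List


def is_request_allowed(client_ip: str, allowed_cidrs: List[str], network_accessible: bool) -> bool:
    """Check a client IP against a CIDR allowlist, failing closed.

    With no CIDRs configured, requests are rejected whenever the host is
    reachable from the network; a loopback-only host is let through.
    """
    if not allowed_cidrs:
        return not network_accessible
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(cidr, strict=False)
        for cidr in allowed_cidrs
    )
```

Making the allowlist a parameter is what would let the same check be driven by admin-managed or per-tenant configuration later.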

Sub-processor transparency 0%

No versioned, in-repo sub-processor transparency artifact (sub-processor list) was found/packaged in this codebase. The repo contains security/trust-model documentation, but it does not provide the required current sub-processor inventory backing a DPA sub-processor clause.

  • high

    Add a dedicated, committed, versioned sub-processor inventory (e.g., docs/subprocessors/SUBPROCESSORS.md or similar) and ensure it is referenced from SECURITY.md and the website trust/security page used by customers and procurement.

  • high

    Implement a change workflow: when adding a new sub-processor (new third-party data sink), update the inventory with a version/date and provide a documented notification flow to customers under the DPA (e.g., release notes + email template + escalation path).

    • SECURITY.md:1-200 — This repo already defines a trust/security posture; it should be extended with an explicit, procurement-grade sub-processor notification/update process.
  • med

    Cross-check the declared sub-processor inventory against actual third-party SDK imports used in the runtime (e.g., OpenAI SDK usage is present) and ensure the inventory entries match the code paths that send data externally.
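The cross-check in the last item could be automated by comparing imports against the declared inventory. In this sketch, `SDK_TO_VENDOR` is a hypothetical mapping that a maintainer would curate by hand, and the scan only sees static top-level imports, not dynamic imports or raw HTTP calls.

```python
import ast

# Hypothetical, hand-maintained mapping from SDK module to vendor name.
SDK_TO_VENDOR = {
    "openai": "OpenAI",
    "boto3": "Amazon Web Services",
}


def imported_top_level_modules(source):
    """Collect the top-level module names imported by a source file."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules


def undeclared_vendors(source, declared):
    """Return vendors whose SDK is imported but missing from the inventory."""
    used = {
        SDK_TO_VENDOR[module]
        for module in imported_top_level_modules(source)
        if module in SDK_TO_VENDOR
    }
    return sorted(used - set(declared))
```

Run across the runtime packages, a non-empty result would flag a third-party data sink that the inventory does not yet list.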

Compliance attestation readiness N/A

This primitive is a DATA-ROOM follow-up artifact (a current SOC 2 Type II / ISO / pen-test attestation readiness package plus control-to-code traceability). In this repo, there is no retrievable in-repo source evidence for a current compliance attestation readiness package. The tooling indicates the corresponding evidence category is not present/packaged in the repository (as expected for data-room artifacts).

  • high

    Request the current compliance attestation package from the seller (e.g., SOC 2 Type II report and pen-test/ISO as applicable) plus control-to-code traceability evidence showing the implemented mechanisms map to the attested controls for the Dim 5 audit evidence set.

  • med

    Ask the seller/GC to provide the report period coverage (start/end dates), scope statement (services, regions, sub-processors), and the specific mapping/traceability artifact (e.g., a controls-to-evidence spreadsheet or appendix) used during audits.

  • low

    If the seller maintains these in another location (e.g., secure customer portal), request a shareable link and confirm the artifacts are current (not expired) and match the deployed production configuration.

Reliability / SLA evidence N/A

No packaged Reliability / SLA evidence (status page/config, SLA terms, incident postmortems) was found in this repository. The only similarly named material is runtime readiness logic used to decide whether the app should proceed, which does not constitute SLA/reliability evidence.

  • high

    Request the buyer-ready Reliability/SLA evidence set from the seller (e.g., current status page URL + uptime reporting methodology, any formal SLA/credits terms, and a runbook + recent incident postmortems). Provide versioned artifacts suitable for procurement diligence.

Not applicable to this codebase: Questionnaire response library, Controls-to-contract mapping, Compliance attestation readiness, Reliability / SLA evidence.

Reporting & Data Export

Customer-accessible export endpoints (CSV, Parquet, JSON), scheduled exports, and a documented map of emitted events.

16% 6/10 scored
  • On-demand data export 0%
    0/3 expected sites not present
  • Export completeness & fidelity 0%
    0/2 expected sites
  • In-product reporting / analytics 75%
    4/4 expected sites
  • Documented export / event schema 0%
    0/3 expected sites not present
  • Export access control & audit 0%
    0/1 expected sites not present
  • Exit portability / no lock-in 22%
    2/3 expected sites
On-demand data export 0%

No tenant-scoped, permission-gated, audited on-demand data export handler for exporting an entire customer tenant/account’s data in a portable format was found. The codebase has (a) a desktop-only single-session JSON export and (b) a local CLI backup zip of the user’s Hermes home, but neither satisfies the primitive’s requirement for complete tenant-scoped data egress via a backend handler.

  • high

    Implement a backend on-demand export endpoint in the authenticated API server layer that (1) is tenant-scoped, (2) enforces authorization, (3) exports the full set of tenant/account data categories in a portable format (tabular/columnar or structured JSON/NDJSON), and (4) writes an audit log entry tied to the export request.

  • med

    Rework the desktop session export to call the backend tenant export flow (or clearly mark the capability as a limited “session export” that is not the full account/takeout export).

  • med

    If you keep the CLI backup, document it as a local device backup tool (not a customer data export primitive), and ensure the actual customer-facing export primitive exists in the API layer.

Export completeness & fidelity 0%

This codebase has export-like mechanisms, but none implement “export completeness & fidelity” as a correct, complete, portable dump of customer data with faithful coverage of all critical categories. The desktop `exportSession` exports only one session’s messages. The CLI `hermes backup` bulk-zips the local `~/.hermes/` directory, but it deliberately skips excluded directories and “secret” files, so it is not a complete or faithful customer data export in the required sense (and it is not demonstrated as tenant-scoped/permission-audited).

  • high

    Implement a tenant/account-scoped, permission-gated “full customer data export” endpoint/handler that covers all customer-critical entities (customer/profile, financial, operational, config, permissions/accounts, integration specs, and historical analytics) and verifies round-trip fidelity (types + relationships) without silent truncation.

    • apps/desktop/src/lib/session-export.ts:21-57 — Current export scope is single-session only (`getSessionMessages(sessionId)`), not an account-wide completeness export.
    • hermes_cli/backup.py:92-160 — Current bulk export is a filesystem zip with hard-coded exclusions (`_EXCLUDED_DIRS`, `_EXCLUDED_NAMES`, `_SECRET_FILE_NAMES`), which is a direct fidelity/completeness gap for customer-data portability.
  • high

    Harden the bulk export path with explicit authorization checks and an audit-log write on export initiation/completion, and ensure export scope cannot cross tenants/accounts.

    • hermes_cli/backup.py:92-260 — The CLI backup reads from local `HERMES_HOME` and writes a zip; there is no tenant scoping, explicit permission gating, or audit-log trail shown in this implementation.
  • med

    Replace/extend the current “session export” with a structured export contract that can be composed into an account export (shared schema + inclusion rules), rather than ad-hoc per-feature JSON downloads.

Large / async export handling N/A

No code-visible primitive for “Large / async export handling” was found. Searches for export-job/worker/task/queue/async-export/stream-export symbols returned no results, and the only “export” behavior found is small, UI/in-request style data export (e.g., exporting a single session to a JSON blob) and a general backup/import mechanism—neither implements an async, tenant-scoped, streaming large-dataset export pipeline with progress/notifications.

  • high

    Add a dedicated large-data export pipeline: (1) tenant-scoped authorization, (2) enqueue an async export job, (3) stream results to a blob/object store (not buffering in memory), (4) provide progress/notification and resumability, and (5) implement a secure download endpoint tied to the job and tenant.

Scheduled / recurring exports N/A

No scheduled/recurring data-export primitive is implemented in this codebase. “Export” behavior found is manual (browser download of a session JSON), and the cron-related server code appears to support scheduled message/delivery plumbing rather than a tenant-scoped, retryable scheduled export that delivers portable customer data to a destination.

  • high

    If the product intends scheduled data exports, introduce a backend schedule store (tenant-scoped), a scheduled runner/worker that batches export jobs, and an export execution pipeline that writes results to a configurable destination with retries + a dead-letter queue (DLQ).

  • high

    Connect scheduling to the actual data-export handler so scheduled exports use the same tenant scoping, permission checks, and audit logging as any on-demand export endpoint would.
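
The retry and dead-letter behavior recommended above might look like the following sketch. All names (`ScheduledExport`, `DeadLetterQueue`, `run_with_retries`) are hypothetical, the delivery function is injected by the caller, and the in-memory list stands in for a durable DLQ.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ScheduledExport:
    """Hypothetical schedule entry; a real store would persist these per tenant."""
    tenant_id: str
    destination: str
    max_attempts: int = 3


@dataclass
class DeadLetterQueue:
    """Holds exports that exhausted their retries, for operator follow-up."""
    entries: List[Tuple[ScheduledExport, str]] = field(default_factory=list)

    def put(self, export: ScheduledExport, error: str) -> None:
        self.entries.append((export, error))


def run_with_retries(export: ScheduledExport,
                     deliver: Callable[[ScheduledExport], None],
                     dlq: DeadLetterQueue,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call deliver(export) up to max_attempts times with exponential backoff.

    Returns True on success. After the final failure the export is moved to
    the dead-letter queue and False is returned.
    """
    last_error = ""
    for attempt in range(1, export.max_attempts + 1):
        try:
            deliver(export)
            return True
        except Exception as exc:
            last_error = f"attempt {attempt}: {exc!r}"
            if attempt < export.max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    dlq.put(export, last_error)
    return False
```

If `deliver` is the same handler that serves on-demand exports, scheduled runs inherit its tenant scoping, permission checks, and audit logging rather than reimplementing them.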

Warehouse sync / reverse-ETL N/A

No warehouse sync / reverse-ETL primitive is implemented in this codebase. The repository scan found no maintained warehouse connector configurations (dbt, Airbyte, Fivetran, Singer, or Meltano style) and no code path that performs incremental warehouse syncing. The occurrences of “sync” in the codebase appear to refer to internal state and media sync, not reverse-ETL to a customer warehouse.

  • high

    If warehouse sync is a product requirement for Hermes Agent, add a dedicated reverse-ETL layer: (1) connector configurations (dbt, Airbyte, Fivetran, Singer, or Meltano) committed under a dedicated folder such as warehouse_sync_config, and (2) code that provisions and runs incremental, tenant-scoped sync jobs with authorization and audit logging.
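
The "incremental, tenant-scoped" part of that recommendation can be illustrated with a per-tenant high-water mark. This is a sketch only: `incremental_sync`, the `updated_at` field, and the injected `fetch_changed` and `load` callables are all assumptions, and a real connector would normally come from one of the tools named above.

```python
from typing import Callable, Dict, Iterable, List


def incremental_sync(tenant_id: str,
                     fetch_changed: Callable[[str, float], Iterable[dict]],
                     load: Callable[[str, List[dict]], None],
                     cursors: Dict[str, float]) -> int:
    """Push rows changed since the tenant's last cursor to the warehouse.

    The cursor advances only after load() returns, so a failed load is
    retried from the same point on the next run.
    """
    since = cursors.get(tenant_id, 0.0)
    rows = list(fetch_changed(tenant_id, since))
    if not rows:
        return 0
    load(tenant_id, rows)
    cursors[tenant_id] = max(row["updated_at"] for row in rows)
    return len(rows)
```

Keeping one cursor per tenant means a slow or failing sync for one customer never skips or replays another customer's rows.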

In-product reporting / analytics 75%

This codebase includes a real in-product analytics/reporting module: the FastAPI server exposes /api/analytics/usage and /api/analytics/models, and the React dashboard page (AnalyticsPage) renders charts from those endpoints. However, the analytics appear geared toward a local, single-deployment context. They do not clearly implement tenant-scoped, portable reporting or export for customer analytics, warehouse, or exit use.

  • high

    Define and implement a customer-data portability/export path for the analytics reports (e.g., export the same aggregates returned by /api/analytics/* as CSV/JSON), ensuring permission checks and audit logging on the export endpoint.

    • hermes_cli/web_server.py:2450-2500 — Analytics are computed and returned for the UI, but this shows only the dashboard data-return mechanism (no export handler evidenced here).
  • med

    Verify authorization + data scoping guarantees for analytics endpoints in the running auth middleware/gate (ensure they cannot leak another customer’s data if multi-tenant is expected).

  • low

    Add end-to-end tests that assert the analytics UI renders correctly from API responses (including response shape stability for daily/by_model/totals and models capabilities enrichment).
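
The CSV export suggested in the first recommendation could reuse the aggregates the dashboard already receives. The sketch below assumes the response carries a list of flat row dictionaries (for example under a `daily` key). The column names in the usage note are placeholders, not the actual response shape, and `aggregates_to_csv` is a hypothetical helper.

```python
import csv
import io
from typing import Iterable, List, Mapping


def aggregates_to_csv(rows: Iterable[Mapping[str, object]],
                      columns: List[str]) -> str:
    """Serialize aggregate rows to CSV with a fixed, documented column order.

    Keys not listed in `columns` are dropped and missing keys become empty
    cells, so the exported shape stays stable if the API adds fields.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({name: row.get(name, "") for name in columns})
    return buf.getvalue()
```

For instance, `aggregates_to_csv(response["daily"], ["day", "input_tokens"])` would produce one header line plus one line per day. Pinning the column list in code gives the response-shape tests mentioned above something concrete to assert against.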

Event stream completeness N/A

No auditable “event stream completeness” primitive is present for reporting or data-export eventing. The codebase clearly emits and dispatches events internally (e.g., the Ink UI event dispatcher/emitter), but there is no maintained, documented event catalog in the repository (event name → schema/expectation) that can be diffed against the set of events the code emits. The off-graph doc scan surfaced only unrelated schema/XSD files in the export_event_schema bucket, not an event documentation map relevant to a reporting/export event stream.

  • high

    Add/restore a real documented event catalog (asyncapi/openapi/events.md or equivalent) that enumerates the externally promised reporting/export event names and payload schemas, then wire the reporting/event emission layer to a single source of truth so you can diff “documented vs emitted” in CI.

  • med

    If the intended scope is actually the Ink terminal/UI event system, explicitly document the terminal/UI event types and their payloads, and define completeness expectations. Then compare emitted event types (call sites like emit/dispatch/track) against that documented set.
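
A "documented vs emitted" diff of the kind described above can be approximated with a text scan. This sketch assumes events are emitted through calls named `emit`, `dispatch`, or `track` whose first argument is a string literal. The function names are hypothetical, and a real check would more likely parse the TypeScript sources than rely on a regular expression.

```python
import re
from typing import Iterable, Set, Tuple

# Matches call sites such as emit("session.created", ...) or dispatch('x').
_EMIT_CALL = re.compile(r"""\b(?:emit|dispatch|track)\(\s*["']([\w.:-]+)["']""")


def emitted_events(sources: Iterable[str]) -> Set[str]:
    """Collect event names from string-literal emit/dispatch/track calls."""
    found: Set[str] = set()
    for text in sources:
        found.update(_EMIT_CALL.findall(text))
    return found


def catalog_drift(documented: Set[str],
                  emitted: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (emitted but undocumented, documented but never emitted)."""
    return emitted - documented, documented - emitted
```

A CI job could fail whenever either returned set is non-empty, which forces the catalog and the code to change together.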

Documented export / event schema 0%

No consumer-facing, documented export/event schema (e.g., asyncapi/openapi/event catalog with versioned payload definitions) was found. The codebase does implement data export (e.g., exportSession builds a JSON payload and downloads it), but the exported format is not accompanied by a maintained schema document that external consumers can depend on.

  • high

    Add a versioned, documented JSON schema for the session export payload (exportSession), including the exact structure of `messages` returned by getSessionMessages, plus metadata fields like exported_at, session_id, title, message_count.

  • high

    Publish (and keep in sync) a documented event catalog/schema for any externally consumable event stream. If events are internal-only, explicitly document that and avoid implying stable external payload contracts.

  • med

    Add CI checks to prevent schema drift: generate schemas from the TypeScript types (where possible) or validate exported payload instances against the published schema during tests.
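
A drift check for the session export could start as small as the validator below. Only the field names (`exported_at`, `session_id`, `title`, `message_count`, `messages`) come from the recommendation above. The field types, the `SESSION_EXPORT_V1` name, and the rule that `message_count` equals the number of messages are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Assumed field types for a version-1 session export (hypothetical).
SESSION_EXPORT_V1: Dict[str, type] = {
    "exported_at": str,
    "session_id": str,
    "title": str,
    "message_count": int,
    "messages": list,
}


def validate_session_export(payload: Dict[str, Any]) -> List[str]:
    """Return human-readable schema violations; an empty list means valid."""
    errors: List[str] = []
    for name, expected in SESSION_EXPORT_V1.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
            continue
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    for name in sorted(set(payload) - set(SESSION_EXPORT_V1)):
        errors.append(f"unexpected field: {name}")
    # Assumed invariant: the count matches the exported message list.
    if not errors and payload["message_count"] != len(payload["messages"]):
        errors.append("message_count does not match len(messages)")
    return errors
```

Running this against a payload captured in tests turns an accidental field rename into a failing build instead of a surprise for downstream consumers.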

Export access control & audit 0%

No code-visible implementation of “Export access control & audit” was found. The only clearly identified export mechanism is the desktop-side `exportSession()` which fetches messages and downloads a JSON file, but it does not perform authorization/tenant scoping checks or write an audit log entry on the export path.

  • high

    Move export authorization + audit logging to a server-side export handler (or server API call invoked by the desktop client): enforce the caller’s permission to export the specific tenant/session, ensure tenant isolation in the data query, and write an audit event that includes who exported what (session id / tenant id), from where, and when.

    • apps/desktop/src/lib/session-export.ts:13-57 — Current export implementation directly calls `getSessionMessages(sessionId)` and triggers a browser download; there is no permission check, tenant scoping validation, or audit-log write in this export path.
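
A server-side handler along the lines recommended above might be shaped like this. It is a sketch under assumptions: `Caller`, `export_session`, the `session:export` permission string, and the in-memory `sessions` and `audit_log` structures are all hypothetical stand-ins for whatever the real server would use.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Caller:
    """Hypothetical authenticated principal resolved by the server."""
    user_id: str
    tenant_id: str
    permissions: FrozenSet[str]
    origin: str = "unknown"     # e.g. client address, for the audit record


def export_session(caller: Caller,
                   session_id: str,
                   sessions: Dict[str, dict],
                   audit_log: List[dict],
                   now: Callable[[], float] = time.time) -> str:
    """Authorize, tenant-scope, and audit a single-session export."""
    session = sessions.get(session_id)
    allowed = (
        session is not None
        and session["tenant_id"] == caller.tenant_id   # tenant isolation
        and "session:export" in caller.permissions     # explicit permission
    )
    # Denied attempts are recorded too: who, what, from where, and when.
    audit_log.append({
        "event": "session.export",
        "user_id": caller.user_id,
        "tenant_id": caller.tenant_id,
        "session_id": session_id,
        "origin": caller.origin,
        "outcome": "allowed" if allowed else "denied",
        "at": now(),
    })
    if not allowed:
        raise PermissionError("not permitted to export this session")
    return json.dumps({"session_id": session_id, "messages": session["messages"]})
```

The desktop client would then call this endpoint instead of assembling the JSON itself, so the check and the audit record cannot be bypassed from the client side.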

Exit portability / no lock-in 22%

There is a real exit-portability mechanism: `hermes_cli/backup.py` provides `hermes backup`, which scans the entire local Hermes home directory and packages it into a zip for download/transfer. For SQLite-backed state, it takes a WAL-safe snapshot through the `sqlite3` backup API (`Connection.backup()`) before writing into the archive. However, no tracked contractual no-lock-in / data-portability terms were found in the repository (the contract clause is a hand-off item to the buyer’s GC). The desktop UI also supports exporting individual sessions to JSON, but the codebase’s “full account” exit export is primarily the CLI backup.
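
The WAL-safe snapshot pattern described above generally looks like the sketch below. This is not the code in `hermes_cli/backup.py`: the function name and temporary-file layout are hypothetical, and only the standard-library calls (`sqlite3.Connection.backup`, `zipfile.ZipFile.write`) are real.

```python
import sqlite3
import tempfile
import zipfile
from pathlib import Path


def snapshot_sqlite_into_zip(db_path: Path,
                             archive: zipfile.ZipFile,
                             arcname: str) -> None:
    """Copy a consistent snapshot of a live SQLite database into an open zip.

    Copying the .db file directly can miss pages still sitting in the
    write-ahead log. The online backup API reads through SQLite itself,
    so the snapshot reflects committed state.
    """
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp) / "snapshot.db"
        src = sqlite3.connect(db_path)
        dst = sqlite3.connect(snapshot)
        try:
            src.backup(dst)
        finally:
            dst.close()
            src.close()
        archive.write(snapshot, arcname)
```

The snapshot is written to a temporary file first because `ZipFile.write` needs a path on disk, and the temporary directory is removed once the member has been added.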

  • high

    Locate and confirm the contractual no-lock-in / data-portability clause in tracked legal artifacts (MSA, termination, or offboarding terms). Hand off to the buyer’s GC to verify that it explicitly permits full-account export before termination and that revocation or termination does not block retrieval.

  • med

    Add a clear, user-facing reference in the product/CLI help (or docs) stating that `hermes backup` is the official full-account export for exit portability, including what is included and excluded and how to restore (`hermes import`). This reduces the risk of users extracting an incomplete dataset.

    • hermes_cli/backup.py:1-20 — The backup/import behavior is described in the module docstring, but this should be surfaced as authoritative product guidance for exit portability.
  • low

    Consider adding an integrity/manifest step to the exported zip (e.g., checksum list + schema/version metadata) so users can validate portability and completeness after exit/transfer.

    • hermes_cli/backup.py:78-176 — The zip is created and populated, but there is no manifest/integrity listing in the shown implementation segment.
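
An integrity manifest of the kind suggested above can be added with the standard library alone. In this sketch the `MANIFEST.json` name, the `format_version` field, and both function names are hypothetical and not part of the current backup format.

```python
import hashlib
import json
import zipfile
from pathlib import Path
from typing import Dict, List

MANIFEST_NAME = "MANIFEST.json"  # hypothetical member name


def _sha256_member(zf: zipfile.ZipFile, name: str) -> str:
    """Hash one zip member in 1 MiB chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with zf.open(name) as member:
        for chunk in iter(lambda: member.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_manifest(zip_path: Path, format_version: str = "1") -> Dict[str, str]:
    """Append a SHA-256 manifest covering every member already in the zip."""
    with zipfile.ZipFile(zip_path) as zf:
        checksums = {
            name: _sha256_member(zf, name)
            for name in zf.namelist()
            if name != MANIFEST_NAME
        }
    manifest = {"format_version": format_version, "sha256": checksums}
    with zipfile.ZipFile(zip_path, "a") as zf:
        zf.writestr(MANIFEST_NAME, json.dumps(manifest, indent=2, sort_keys=True))
    return checksums


def verify_manifest(zip_path: Path) -> List[str]:
    """Return members that are missing or whose contents changed."""
    with zipfile.ZipFile(zip_path) as zf:
        recorded = json.loads(zf.read(MANIFEST_NAME))["sha256"]
        present = set(zf.namelist())
        return sorted(
            name for name, digest in recorded.items()
            if name not in present or _sha256_member(zf, name) != digest
        )
```

After a transfer, an empty list from `verify_manifest` tells the user that every file recorded at export time arrived intact. It does not prove that the export captured everything in the first place, which is why the inclusion and exclusion rules still need documenting.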

Not applicable to this codebase: Large / async export handling, Scheduled / recurring exports, Warehouse sync / reverse-ETL, Event stream completeness.