AI / Data Foundation
Versioned data pipelines, pinned model versions, and a real vector or feature store — not scattered cron jobs and model="latest".
Declarative, tested transformations 67%
Declarative, tested transformations exist primarily through Hermes’ plugin system hooks (transform_llm_output / transform_tool_result / transform_terminal_output). The repo provides unit + integration tests that load real plugins from manifests and verify the transformation contract end-to-end (dispatch semantics, wiring of kwargs, replacement rules, truncation/redaction interactions, and exception fallback). The main gap is that the hook-invocation sites at the core application seams (e.g., the run_agent, model_tools, and terminal_tool boundaries) are not directly evidenced here, although the tests strongly indicate that the transformation layer is governed and validated.
- high
Add (or locate and reference) a direct code-evidence slice in run_agent.py/model_tools.py/terminal_tool.py showing exactly where each hook (transform_llm_output / transform_tool_result / transform_terminal_output) is invoked during production execution, so the primitive’s presence is proven at the critical seams (not just via tests).
- run_agent.py:1-260 — run_agent.py is the expected LLM-output seam, but the provided evidence here is only the module header range; the hook invocation wiring lines need confirmation.
- model_tools.py:1-260 — model_tools.py is the expected tool-result seam; the provided evidence here is only the module header range and does not yet show the hook call site.
- tools/terminal_tool.py:1-260 — tools/terminal_tool.py is the expected terminal-output seam; the provided evidence here is only the module header range and does not yet show the hook call site.
- med
Ensure transformation assets are clearly versioned per plugin (e.g., plugin.yaml version + any compatibility constraints) and that each transformation hook has at least one dataset/boundary test case for empty outputs and malformed plugin return types (some exist, but consolidate per hook into a consistent suite).
- tests/test_transform_llm_output_hook.py:1-160 — Covers empty string pass-through, non-string returns, and exception behavior for transform_llm_output.
- tests/test_transform_tool_result_hook.py:1-192 — Covers None/invalid hook returns, first-valid-string replacement, and exception fallback for transform_tool_result.
- tests/tools/test_terminal_output_transform_hook.py:1-210 — Covers first-valid-string replacement, truncation, redaction behavior, and exception fallback for transform_terminal_output.
- low
Document (in a short developer guide) the expected plugin contract for each transform hook (input kwargs, replacement semantics, return-type rules, truncation/redaction expectations) and point to the corresponding test files as the authoritative spec.
- hermes_cli/plugins.py:120-260 — VALID_HOOKS enumerates transform hooks and describes replacement semantics at a high level, but a dedicated “contract + tests” doc would reduce reliance on reading tests.
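The replacement semantics exercised by the hook tests (first valid string wins, invalid returns are ignored, exceptions fall back to the original) can be sketched as follows. This is a minimal illustration, not the Hermes implementation: `apply_transform_hooks` and its signature are hypothetical, and treating an empty string as a valid replacement is an assumption.

```python
from typing import Any, Callable, Iterable

Hook = Callable[..., Any]


def apply_transform_hooks(hooks: Iterable[Hook], original: str, **kwargs: Any) -> str:
    """Return the first valid string produced by a hook, else the original text."""
    for hook in hooks:
        try:
            result = hook(original, **kwargs)
        except Exception:
            # Exception fallback: a failing plugin must not break the caller.
            continue
        if isinstance(result, str):
            # First valid string wins; later hooks are not consulted.
            return result
        # None or any non-string return is treated as "no replacement".
    return original
```

A short developer guide could pair a sketch like this with links to the three hook test files, which remain the authoritative spec.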
Orchestrated pipelines 89%
This codebase contains an orchestrated, dependency-style pipeline implementation for the Teams meeting summary flow. It externalizes orchestration state using a durable store, persists step-by-step lifecycle statuses, classifies retryable vs terminal failures, and wires the pipeline into the gateway via an explicit scheduler callback.
- high
Add a first-class, queryable DAG/asset definition for the pipeline steps (e.g., a versioned pipeline manifest describing step graph, retry policy, and step inputs/outputs) and surface it for observability tooling. Right now, the step graph is implicit in control flow within `run_job`.
- plugins/teams_pipeline/pipeline.py:260-560 — The orchestration steps and transitions are implemented directly in `run_job` control flow, but there is no separate external artifact describing the pipeline DAG/graph.
- med
Strengthen the retry mechanism into an explicit scheduler-backed retry loop (with bounded attempts and backoff scheduling persisted per job), rather than relying only on `retry_scheduled` status updates and eventual invocation.
- plugins/teams_pipeline/pipeline.py:260-560 — `run_job` catches `TeamsPipelineRetryableError` and persists `retry_scheduled`, but the actual retry scheduling policy/worker loop is not shown in the orchestration hot path we inspected.
- low
Expose an auditable summary of pipeline runs (job_id, event_id/dedupe_key, step timestamps, last error) via a small API/CLI command that reads the durable store. This would improve operational observability and reproducibility.
- plugins/teams_pipeline/store.py:1-194 — The store contains `jobs` and timestamps/receipts, but we did not confirm a dedicated CLI/API that outputs a structured run report from these records.
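One way to make the step graph and retry policy explicit is a small versioned manifest that tooling can read without executing `run_job`. The sketch below is illustrative only: the step names, manifest fields, and the `execution_order` and `backoff_seconds` helpers are hypothetical and are not taken from `plugins/teams_pipeline`.

```python
from graphlib import TopologicalSorter

# Hypothetical manifest: step names, fields, and values are illustrative only.
PIPELINE_MANIFEST = {
    "name": "teams_meeting_summary",
    "version": "1",
    "steps": {
        "fetch_artifacts": {"needs": [], "max_attempts": 3},
        "summarize": {"needs": ["fetch_artifacts"], "max_attempts": 2},
        "deliver": {"needs": ["summarize"], "max_attempts": 3},
    },
}


def execution_order(manifest: dict) -> list[str]:
    """Return step names so that every step follows the steps it needs."""
    graph = {name: set(spec["needs"]) for name, spec in manifest["steps"].items()}
    # static_order() raises graphlib.CycleError if the manifest has a cycle.
    return list(TopologicalSorter(graph).static_order())


def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff for a 1-based attempt number, capped at `cap`."""
    return min(cap, base * (2 ** (attempt - 1)))
```

Persisting the attempt count and the next scheduled time per job, computed with a function like `backoff_seconds`, would give the bounded, scheduler-backed retry loop described in the medium-priority item.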
Data quality validation / contracts 67%
This codebase does have data-quality/contract-like validation layers, but they are not consistently applied as a single “data contracts” primitive across all ingestion boundaries. Strongest evidence is present for (1) tool JSON schema sanitization to prevent ingestion failures, and (2) file-tool input guards (size limits and blocked device paths) with unit tests that confirm rejection/quarantine behavior. Other ingestion-style boundaries (e.g., delivery routing inputs) appear to rely more on parsing and downstream logic than on a clearly governed contract gate.
- high
Introduce (or standardize) an explicit data-quality contract gate for delivery routing inputs in gateway/delivery.py—e.g., strict schema/validation for DeliveryTarget.parse inputs (and any content/metadata shape), with a quarantine/error return type that prevents malformed targets from reaching platform adapters.
- gateway/delivery.py:1-120 — Delivery routing accepts and parses target strings; evidence available shows best-effort parsing without an explicit contract gate for malformed/unsafe routing inputs.
- med
Extend file_tools validation coverage to include comprehensive handler-level shape validation for all tool entrypoints (keys/types/ranges) in addition to existing path/device/size guards, and ensure each validation rule has a unit test that asserts rejection behavior.
- tools/file_tools.py:1-520 — Existing guards cover some high-risk cases (blocked devices, size caps, path resolution); broader ingestion-boundary shape contracts appear partially covered but not comprehensively demonstrated in the sampled sections.
- low
Document the schema/validation contract pattern (what constitutes the “contract”, where it runs, what error payload looks like, and how quarantine is expressed) and reuse it across modules that accept LLM/tool inputs.
- tools/schema_sanitizer.py:1-220 — Schema sanitization already follows a contract mindset; formalizing the pattern would help apply it consistently to other ingestion boundaries.
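A contract gate for delivery routing inputs could return a structured result rather than parsing on a best-effort basis. The sketch below assumes a `platform:chat_id` target syntax and a fixed platform allowlist purely for illustration; the real `DeliveryTarget.parse` format may differ, and `TargetCheck` and `check_target` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed for illustration: the real platform list and target syntax may differ.
ALLOWED_PLATFORMS = frozenset({"telegram", "discord", "slack"})
_TARGET_RE = re.compile(r"(?P<platform>[a-z_]+):(?P<chat_id>[A-Za-z0-9_-]{1,128})")


@dataclass(frozen=True)
class TargetCheck:
    ok: bool
    platform: Optional[str] = None
    chat_id: Optional[str] = None
    error: Optional[str] = None


def check_target(raw: object) -> TargetCheck:
    """Validate a routing target; malformed input yields ok=False with a reason."""
    if not isinstance(raw, str):
        return TargetCheck(ok=False, error="target must be a string")
    match = _TARGET_RE.fullmatch(raw.strip())
    if match is None:
        return TargetCheck(ok=False, error="target does not match platform:chat_id")
    platform = match.group("platform")
    if platform not in ALLOWED_PLATFORMS:
        return TargetCheck(ok=False, error=f"unknown platform: {platform}")
    return TargetCheck(ok=True, platform=platform, chat_id=match.group("chat_id"))
```

Callers would forward only results with `ok=True` to platform adapters and log or quarantine the rest along with the `error` string.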
Raw / immutable source layer N/A
I did not find an immutable “raw/source preserved unmodified” landing layer anywhere in the codebase. The only “raw/Raw…” hits are UI-level naming (e.g., `RawAnsi`) or data-fetch scripts that output processed/normalized artifacts directly, without a governed immutable raw layer for audit/reproducibility.
- high
Introduce a dedicated immutable raw landing layer for external fetches (e.g., Wikipedia/Wikidata, YouTube transcript, other tool ingestions): persist the exact HTTP response payload (and request parameters/headers + timestamps + source identifier) to a versioned store before any parsing/normalization; write transforms downstream from this stored raw artifact.
- optional-skills/research/osint-investigation/scripts/fetch_wikipedia.py:200-267 — Currently writes enriched CSV rows directly from processed lookups; no observable immutable raw capture before transformation.
- skills/media/youtube-content/scripts/fetch_transcript.py:1-125 — Currently returns normalized structured JSON; no evidence of persisting unmodified raw inputs for later replay.
- med
Add an ingestion manifest/spec for the raw layer (schema + required fields, retention policy, and a deterministic reprocessing command that reads raw artifacts without re-fetching external sources).
- skills/media/youtube-content/scripts/fetch_transcript.py:1-125 — Script-level JSON output exists, but there is no accompanying machine-readable contract/manifest for an immutable raw storage layer.
- low
Rename or namespace UI-layer “raw” components (e.g., `RawAnsi`) to avoid confusion with the data-layer primitive name, and document clearly which “raw” refers to UI bytes vs. immutable source-data persistence.
- ui-tui/packages/hermes-ink/src/ink/components/RawAnsi.tsx:1-62 — Terminology collision risk: UI component uses `RawAnsi` name but does not represent the requested data primitive.
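The raw landing layer recommended for external fetches could be as simple as a content-addressed, write-once directory with a metadata sidecar. The following is a sketch under stated assumptions: `land_raw`, the directory layout, and the sidecar fields are hypothetical, and a production version would also need a retention policy and atomic writes.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def land_raw(root: Path, source: str, payload: bytes, request: dict) -> Path:
    """Persist the exact response bytes before any parsing; return the raw path."""
    digest = hashlib.sha256(payload).hexdigest()
    target_dir = root / source
    target_dir.mkdir(parents=True, exist_ok=True)
    raw_path = target_dir / f"{digest}.bin"
    if not raw_path.exists():
        # Write once: an existing file with this digest already holds these bytes.
        raw_path.write_bytes(payload)
        meta = {
            "source": source,
            "sha256": digest,
            "request": request,
            "fetched_at": datetime.now(timezone.utc).isoformat(),
        }
        (target_dir / f"{digest}.meta.json").write_text(json.dumps(meta, sort_keys=True))
    return raw_path
```

Downstream transforms would then read from the returned path, so reprocessing never requires re-fetching the external source.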
Data + pipeline versioning 0%
The codebase contains a strong reproducibility primitive via filesystem snapshotting (Checkpoint Manager using a shared shadow git store) and a durable pipeline state store for the Teams meeting pipeline. However, there is no clear evidence that data + pipeline versions are tightly coupled and recorded for each pipeline release/job (e.g., no explicit pipeline-version/data-provenance manifest tying code version to specific input/output versions).
- high
Add explicit pipeline versioning metadata capture and persistence: when a checkpoint/snapshot is created for a pipeline run, store (and persist into the pipeline store) the pipeline code version (git commit hash of the pipeline module), configuration hash, and the input artifact/version identifiers that determine the produced outputs.
- tools/checkpoint_manager.py:1-60 — Snapshot mechanism exists, but the snippet indicates it snapshots filesystem state; additional required binding to pipeline/data provenance appears missing.
- plugins/teams_pipeline/store.py:1-194 — Durable state exists but does not show fields for pipeline/data version linkage.
- med
Extend TeamsPipeline job/sink records to include deterministic identifiers: pipeline logic hash, input artifact keys (meeting artifact/transcript/audio versions), and output schema/version. Persist these at upsert_job/upsert_sink_record time so later runs can replay exactly.
- plugins/teams_pipeline/store.py:1-194 — Store currently persists jobs/sink_records but schema for version linkage is not present in the observed code.
- low
Add tests that verify reproducibility: given the same pipeline version + captured input versions, outputs should be identical (or produce the same sink record identifiers).
- tests/test_yuanbao_pipeline.py:1-200 — There are pipeline unit tests, but the audit did not find version-coupling tests specific to data+pipeline reproducibility.
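The version linkage described in these recommendations can be illustrated with a small provenance record that binds code version, configuration, and input versions into one stable identifier. This is a hedged sketch: `canonical_hash`, `build_provenance`, and the record fields are hypothetical and do not reflect the current store schema.

```python
import hashlib
import json


def canonical_hash(value: object) -> str:
    """Hash a JSON-serializable value independently of dict key order."""
    encoded = json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def build_provenance(code_version: str, config: dict, input_versions: dict) -> dict:
    """Build a record that ties a run to its code, config, and input versions."""
    record = {
        "code_version": code_version,  # e.g. a git commit hash supplied by the caller
        "config_hash": canonical_hash(config),
        "input_versions": dict(sorted(input_versions.items())),
    }
    # The identifier changes whenever any of the three inputs changes.
    record["provenance_id"] = canonical_hash(record)
    return record
```

A reproducibility test could then assert that two runs with equal `provenance_id` values produce the same sink record identifiers.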
Data lineage / provenance 89%
Data lineage/provenance exists in this codebase primarily as durable conversation/session provenance in the SQLite-backed Hermes state store (not as an external lineage standard like OpenLineage/DataHub). Sessions link derivations via `parent_session_id`, messages store tool-call identifiers and timestamps, and retrieval tooling (session_search_tool) reconstructs lineage roots and corrects anchor rebinding for consistency. An observability plugin (Langfuse) provides external trace emission, but the lineage primitive here is more “conversation/tool provenance” than “dataset lineage” for data pipelines.
- high
Audit and document the end-to-end provenance emission path (session creation → message insertion → tool call linkage) to ensure lineage is always emitted for every transformation/derivation event, and that identifiers (session_id/task_id/tool_call_id) are consistently propagated across modules.
- hermes_state.py:230-320 — Provenance fields exist in schema (`parent_session_id`, tool call fields), but the wiring that guarantees they are always populated should be verified/standardized at the write points.
- med
Add a machine-queryable provenance export (e.g., JSON export of a session lineage graph or an internal API endpoint) so lineage can be validated without requiring consumers to understand internal DB semantics.
- tools/session_search_tool.py:1-170 — Current lineage correctness is enforced during retrieval; providing a first-class export would externalize lineage for change-management review.
- low
If external observability is the intended governed provenance system, extend/confirm that trace metadata includes all lineage-critical ids (session lineage root, parent/child relationships, tool_call_id/session_id mapping) consistently across all trace/span creation points.
- plugins/observability/langfuse/__init__.py:1-120 — The plugin defines trace state and activation gating; ensure trace payloads always include the same lineage anchors as Hermes state.
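A machine-queryable lineage export can start from a walk over `parent_session_id` links. The sketch below assumes a table named `sessions` with an `id` column next to `parent_session_id`; only the `parent_session_id` field is confirmed by the audit evidence, and `lineage_chain` is a hypothetical helper.

```python
import sqlite3


def lineage_chain(conn: sqlite3.Connection, session_id: str) -> list[str]:
    """Return [session_id, parent, ..., root], stopping on a missing row or a cycle."""
    chain: list[str] = []
    seen: set[str] = set()
    current = session_id
    while current is not None and current not in seen:
        seen.add(current)
        row = conn.execute(
            "SELECT parent_session_id FROM sessions WHERE id = ?", (current,)
        ).fetchone()
        if row is None:
            break  # Unknown session: stop rather than invent a link.
        chain.append(current)
        current = row[0]
    return chain
```

Serializing the returned list as JSON would give reviewers a lineage view that does not depend on internal DB semantics.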
Feature management 58%
This codebase has a centralized feature-management layer for Tool Gateway entitlements: `hermes_cli/nous_subscription.py` computes governed, structured feature state (available/active/provider/managed) and is backed by unit tests. Other surfaces (CLI tool config + portal status) import and use this computation, reducing training/serving-style skew risk. However, the available audit evidence shows only that consumers import the layer, not the full wiring in which each consumer uses the computed states to gate runtime behavior, which is why the score is not higher.
- high
Verify end-to-end wiring that every runtime decision/surface using features (especially those affecting which tools/skills are exposed to the agent) derives from `get_nous_subscription_features`/`apply_nous_managed_defaults`, rather than re-implementing entitlement logic. Add/extend tests that assert consistent feature gating across the main execution entrypoints.
- hermes_cli/nous_subscription.py:220-420 — Central feature-definition computation exists; ensure all downstream consumers use its outputs.
- hermes_cli/tools_config.py:1-140 — Consumer module imports the centralized feature layer; confirm it is actually used for gating in the relevant code paths.
- agent/prompt_builder.py:900-1040 — Prompt/skills assembly is a concrete serving-time surface; confirm it uses the same feature state when filtering/gating skills.
- med
Add a single “contract test” that compares feature computation inputs/outputs across all entrypoints that call it (e.g., CLI tools selection, portal status, agent runtime). This prevents drift when new features/backends are added.
- tests/hermes_cli/test_nous_subscription.py:1-260 — Unit tests exist for the feature computation; expand to cross-entrypoint consistency checks.
- hermes_cli/portal_cli.py:1-120 — Portal status is one consumer; include other consumers in a shared contract test.
Vector / embedding store 75%
The codebase includes a persisted vector/embedding-like store for its memory system: HRR vectors are stored in SQLite tables (`facts.hrr_vector` and `memory_banks.vector`) and are recomputed on ingestion/rebuild. Retrieval code uses these persisted vectors for similarity scoring rather than keeping everything ephemeral in process memory. However, the implementation is not a managed, model+content-version governed embedding store; it primarily persists vectors locally without explicit linkage to a producing model version (beyond parameters like HRR dimension).
- high
Add explicit version governance to the persisted embeddings: store an embedding-config version (e.g., vector type + embedding parameters + producing model identifier/version if any) alongside each vector/bank, and refuse or automatically rebuild vectors when the producing configuration changes.
- plugins/memory/holographic/store.py:1-120 — Schema defines `hrr_vector` and `memory_banks.vector` but does not record a producing model/config version to tie embeddings to their generator.
- med
If the intent is to satisfy the “managed, queryable, versioned store” requirement, replace/augment the local SQLite vector persistence with a dedicated vector DB interface (or at least isolate it behind a vector-store abstraction that exposes upsert/query/delete and versioned namespaces).
- plugins/memory/holographic/store.py:1-120 — The store is implemented directly as SQLite tables and blobs; there is no external vector DB or managed indexing layer.
- med
Add an automated freshness/consistency check at query time (or before retrieval): verify that vector banks exist and match the current embedding configuration; otherwise trigger `rebuild_all_vectors` (bounded, logged) and fall back safely.
- plugins/memory/holographic/store.py:450-579 — `rebuild_all_vectors` exists, but there is no evidence of automatic, governed “config mismatch → rebuild” gating before using stored vectors.
Model version pinning 33%
Model identity pinning is partially supported: the agent’s public constructor takes an explicit `model` string and threads it into initialization, but direct enforcement that the string is not a floating alias (e.g., rejecting `latest`/`stable` as runtime model IDs) was not found in the inspected model-call wiring. The codebase includes careful model-string parsing/handling for tags like `latest`/`stable`, but pinning enforcement at invocation appears incomplete based on observed call-path slices.
- high
Add an explicit guardrail at the model-identity ingress point (agent init / request kwargs build): detect `model` values that end with or equal floating aliases (e.g., `latest`, `stable`) and either (a) reject with a clear error or (b) resolve to a pinned concrete ID via a version resolver with persisted results for reproducibility.
- run_agent.py:1080-1125 — This is the choke point where model identity is first accepted and forwarded; pinning validation should occur here or immediately after.
- med
Ensure the resolved/pinned model ID (post-alias-resolution) is the value stored in session state / logs (e.g., the session DB row) and used for the API request, not the user-provided alias string.
- run_agent.py:1080-1098 — Session/model fields are already tracked (model identity is present on the agent); align the stored value with the pinned/concrete ID after validation/resolution.
- low
Add unit tests that explicitly cover `model='latest'` and `model='stable'` for each provider path (OpenAI-compatible, OpenRouter, Anthropic, local/Ollama), asserting deterministic behavior (reject or resolve-to-pinned).
- agent/model_metadata.py:35-55 — Model string parsing currently recognizes `latest`/`stable` tags; tests should verify how this impacts runtime model selection and reproducibility guarantees.
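The reject variant of the guardrail described in the high-priority item can be sketched as a small check at model-identity ingress. The alias set, the separator list, and the names `require_pinned_model` and `UnpinnedModelError` are assumptions for illustration; the resolve-to-pinned variant would need a provider-specific resolver that is not shown here.

```python
import re

FLOATING_ALIASES = frozenset({"latest", "stable"})


class UnpinnedModelError(ValueError):
    """Raised when a model identifier equals or ends with a floating alias."""


def require_pinned_model(model: str) -> str:
    """Return the trimmed model ID, or raise if it is a floating alias."""
    normalized = model.strip().lower()
    if not normalized:
        raise UnpinnedModelError("model identifier is empty")
    # Split on common separators so 'name:latest' and 'vendor/name-stable' are caught.
    last_token = re.split(r"[:/@-]", normalized)[-1]
    if last_token in FLOATING_ALIASES:
        raise UnpinnedModelError(f"model {model!r} uses floating alias {last_token!r}")
    return model.strip()
```

Because only the final token is inspected, an identifier that merely starts with `stable` is not rejected; whether that is the right trade-off depends on the provider naming schemes in use.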
Prompt / model-call management 92%
The codebase has a clear prompt governance layer: prompt fragments live in agent/prompt_builder.py, the assembled system prompt is constructed in agent/system_prompt.py (with explicit caching described), and specialized prompts for background review are centralized in agent/background_review.py. The conversation loop further centralizes system-prompt cache restore/build via _restore_or_build_system_prompt(), helping prevent prompt literal drift near model call sites.
- high
Audit remaining model-provider call paths (e.g., where chat/completions/responses are invoked) to ensure they always consume the governed system prompt built/restored by the system_prompt + conversation_loop layers, and do not inline literal prompt strings near the SDK calls.
- agent/system_prompt.py:1-240 — Central prompt assembly entrypoint; ensure all model calls source from here.
- agent/conversation_loop.py:1-260 — Central prompt cache restore/build; ensure all turn model calls use the cached/built prompt.
- med
Where review or specialized prompts must evolve, add automated snapshot/equality tests to detect prompt drift across variants (e.g., memory vs combined review prompt) and confirm the background-review prompt strings remain the only source-of-truth.
- agent/background_review.py:1-220 — All background review prompt strings are centralized here; add/extend tests to lock them down against accidental duplication elsewhere.
- low
For the largest prompt fragments (identity/guidance blocks), consider moving the biggest text literals into versioned markdown/template assets (if not already done elsewhere) and have prompt_builder load them, so changes remain even more diffable and reviewable than code-only constants.
- agent/prompt_builder.py:1-220 — Many guidance strings are currently embedded as Python constants; externalizing to versioned text assets would further strengthen diffability.
Reproducibility / determinism 8%
This codebase contains partial determinism/reproducibility primitives: deterministic tool-call IDs (to avoid UUID-driven cache invalidation) and a trajectory persistence utility for recording conversation outcomes. However, at run/turn boundaries there is no evidence of a fully captured “replay manifest” (pinned model/provider identifiers, generation parameters, preprocessing settings, and RNG seeds) sufficient to recreate runs exactly from pinned inputs. The trajectory evidence trail exists, but it appears incomplete for determinism.
- high
Add a run-boundary “repro manifest” captured alongside each trajectory/run: exact model ID/provider/base_url, generation params (temperature/top_p/any seed), preprocessing/compression parameters, and the specific code/data versions used. Store it in the same output directory (or JSONL alongside trajectory entries) and include schema validation for required fields.
- agent/trajectory.py:1-57 — Trajectory entries currently record only `timestamp`, `model`, `completed`, and `conversations`—missing deterministic replay inputs (seed/generation params/preprocessing settings).
- agent/conversation_loop.py:1-220 — Core per-turn execution lives here; this is where determinism-affecting generation parameters should be captured/propagated into the trajectory manifest.
- med
Version and document the determinism-critical hashing inputs for tool-call IDs (argument serialization normalization). Ensure the exact serialization function and its version are included in persisted artifacts so that cached/deterministic IDs remain stable across code changes.
- agent/codex_responses_adapter.py:128-164 — Deterministic call IDs depend on `arguments` (as a string) and `index`; without explicit serialization normalization/versioning captured, the same logical tool call might not hash identically across runs.
- low
Reduce nondeterministic fields in replay artifacts where possible (e.g., keep `timestamp` for auditing, but ensure a separate `run_id` is derived deterministically from pinned inputs).
- agent/trajectory.py:1-57 — Uses `datetime.now().isoformat()` for trajectory entries, which is not deterministic; consider adding deterministic IDs derived from the repro manifest.
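A run-boundary repro manifest with a deterministic `run_id` might look like the following. The field list mirrors the inputs named in the recommendations, but the `ReproManifest` class, its field names, and the 16-character ID length are hypothetical choices, not the current trajectory format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ReproManifest:
    model: str
    provider: str
    base_url: str
    temperature: float
    top_p: float
    seed: Optional[int]
    code_version: str

    def __post_init__(self) -> None:
        # Minimal schema validation: identity fields must be non-empty.
        for name in ("model", "provider", "base_url", "code_version"):
            if not getattr(self, name):
                raise ValueError(f"repro manifest field {name!r} is required")

    def run_id(self) -> str:
        """Derive a stable ID from the pinned inputs, not from wall-clock time."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

The existing `timestamp` field can stay in trajectory entries for auditing, with `run_id` stored next to it as the deterministic key.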
AI output validation 100%
The codebase has a strong, centralized AI output validation primitive for LLM auxiliary calls: `_validate_llm_response` enforces the expected `.choices[0].message` shape and throws a clear `RuntimeError`. `call_llm` consistently applies this validator immediately after each model invocation and reuses it across retries/fallback paths, preventing raw/unvalidated model payloads from flowing downstream.
- high
Create/extend tests that intentionally return malformed LLM payloads (e.g., `response=None`, `response.choices=[]`, or missing `.message`) and assert that each retry path also fails via `_validate_llm_response` with the same error wording (no open-loop: retries must reuse the same schema gate).
- agent/auxiliary_client.py:4865-4955 — Validation error message and failure conditions are defined here; tests should assert these exact behaviors across retries.
- agent/auxiliary_client.py:4955-5300 — `call_llm` wraps model calls with `_validate_llm_response(...)` across multiple retry/fallback branches; tests should cover at least one malformed-payload trigger per major branch.
- med
Audit the main agent LLM call path(s) (non-auxiliary) to ensure the same (or equivalent) validator is applied right after model invocations, not only for auxiliary routing.
- agent/auxiliary_client.py:4955-5300 — This evidence covers auxiliary calls; the primitive should also be present at other model-call boundaries if they exist.
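For readers unfamiliar with the pattern, a shape check of this kind can be sketched in a few lines. This is an illustrative stand-in, not the body of `_validate_llm_response`: the function name `validate_chat_response` and the error messages are hypothetical, and only the `.choices[0].message` shape and the `RuntimeError` type come from the audit findings.

```python
def validate_chat_response(response: object) -> object:
    """Return response.choices[0].message, or raise RuntimeError if the shape is wrong."""
    choices = getattr(response, "choices", None)
    if not isinstance(choices, (list, tuple)) or not choices:
        raise RuntimeError("LLM response has no choices")
    message = getattr(choices[0], "message", None)
    if message is None:
        raise RuntimeError("LLM response choice has no message")
    return message
```

Tests for the retry paths can feed each malformed payload listed in the high-priority item through the real validator in the same way and assert on the raised error.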
Grounding / wrongness check 67%
This codebase has an output-grounding/wrongness check for structured plugin LLM calls: `PluginLlm.complete_structured()` enforces JSON parsing and (when a schema is provided) validates the parsed output against `json_schema` using `jsonschema`, failing closed on mismatch. However, beyond this structured path, I did not find evidence of a broader “claim-by-claim context grounding/judge” loop for free-form assistant text responses.
- high
Add a general grounding/wrongness check for free-form LLM outputs that are surfaced to users or used for actions: introduce a judge-based or context-based verification step (or retrieval-backed citation verification) and enforce a bounded re-check/retry policy on failure.
- agent/plugin_llm.py:604-746 — Current enforcement is only for structured outputs. This is evidence of the existing check pattern, which can be extended to cover non-structured/free-form claims.
- med
For structured paths, ensure failure modes are explicit and observable (e.g., include the validation error details in the audit trail and/or add a deterministic fallback output schema on validation failure).
- agent/plugin_llm.py:604-746 — Validation failures raise `ValueError` with a message, but the surrounding retry/fallback behavior is not shown in the slice; improving audit + fallback would strengthen closed-loop safety.
Self-correction / feedback loop 0%
No closed self-correction/feedback loop was found. The code detects judge-output parse/validation failures and uses fail-open + bounded pausing, but it does not feed the specific error back to the model for a re-check within the same validation path.
- high
Implement a closed retry loop inside GoalManager/judge_goal for judge contract failures: when _parse_judge_response() reports parse_failed (empty/non-JSON), re-prompt the same judge model with an error-specific instruction (e.g., 'Output exactly one JSON object with keys done and reason; your previous output was <reason snippet>'). Bound attempts (e.g., 2-3) before falling back to the current auto-pause behavior.
- hermes_cli/goals.py:225-360 — parse failure is detected and returned as parse_failed, but the failure details are not used to create an error-fed-back next judge prompt.
- hermes_cli/goals.py:600-705 — GoalManager.evaluate_after_turn() currently only pauses/limits after repeated parse failures, rather than retrying with the error fed back to the judge model.
- med
Add targeted tests that assert the loop is closed: when the judge output is non-JSON for the first attempt, the second attempt must include the specific failure and must re-run parsing. (For example, unit tests around judge_goal/_parse_judge_response with a mocked auxiliary client returning controlled invalid outputs.)
- hermes_cli/goals.py:225-360 — The parse_failed contract provides the data needed to craft error-specific feedback, so tests can verify the prompt augmentation and re-parse.
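The closed loop described in these recommendations can be sketched as a bounded retry that feeds the specific parse error back into the next judge prompt. This is a minimal sketch, assuming the judge is reachable through a plain callable: `judge_with_feedback` and `call_judge` are hypothetical names, while the `done` and `reason` keys and the `parse_failed` outcome follow the contract described in the findings.

```python
import json
from typing import Callable


def judge_with_feedback(
    call_judge: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Ask the judge up to max_attempts times, feeding each failure back to it."""
    current_prompt = prompt
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = call_judge(current_prompt)
        try:
            parsed = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
            if not isinstance(parsed, dict):
                raise ValueError("top-level JSON value is not an object")
            if not isinstance(parsed.get("done"), bool) or not isinstance(
                parsed.get("reason"), str
            ):
                raise ValueError("expected keys 'done' (bool) and 'reason' (str)")
            return {"done": parsed["done"], "reason": parsed["reason"], "parse_failed": False}
        except ValueError as exc:
            last_error = str(exc)
            # Close the loop: the next prompt carries the specific failure.
            current_prompt = (
                f"{prompt}\n\nYour previous output was invalid ({last_error}). "
                "Output exactly one JSON object with keys done and reason."
            )
    # Attempts exhausted: the caller falls back to its existing pause behavior.
    return {"done": False, "reason": last_error, "parse_failed": True}
```

When `parse_failed` is still true after the bounded attempts, the caller can keep the current auto-pause path unchanged.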
Evaluation harness + scoring N/A
I did not find an “Evaluation harness + scoring” primitive in this codebase. There are some benchmark/unit tests for a specific runtime evaluation path (browser CDP evaluation), but no offline golden set with automated scoring, no eval runner that logs inputs/outputs, and no evidence of recurring production eval/scoring distinct from the per-request execution loop.
- high
Add a dedicated eval layer (e.g., an `evals/` package + CLI entrypoint) that runs an offline golden set, scores outputs with explicit metrics/rubrics, and logs results (inputs, model versions, prompts, outputs, scores, and pass/fail) to a persistent store.
- scripts/benchmark_browser_eval.py:1-139 — Current benchmarking is ad-hoc and does not provide the governance artifacts (golden set, rubric, logging, recurring scoring) required by this primitive.
- med
Instrument the production request/agent loop to emit structured “eval candidates” (prompt/input + model version + tool context + ground-truth key/labels when available) into the logging store, but keep eval execution in the separate eval layer.
- tests/tools/test_browser_eval_supervisor_path.py:1-260 — Existing tests validate correctness of a specific eval dispatch path via mocks; this pattern should be extended into a broader, versioned, golden-set evaluation/score-and-log harness.
- low
Optionally integrate an eval framework dependency (e.g., promptfoo/deepeval/langsmith/ragas) only after establishing the repo’s own artifact structure (golden set format, scorer definitions, and logging schema).
- scripts/benchmark_browser_eval.py:1-139 — There is no indication of a third-party eval framework or a structured scoring/logging pipeline; it’s currently just timing output.
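Before adopting a framework, the repo's own artifact structure can be prototyped with a very small runner. The sketch below is one possible shape, not an existing module: the golden-set format (`id`, `input`, `expected`), the exact-match scorer, and the names `run_eval` and `exact_match` are all assumptions.

```python
import json
from pathlib import Path
from typing import Callable


def exact_match(expected: str, actual: str) -> float:
    """Score 1.0 when the trimmed strings are equal, else 0.0."""
    return 1.0 if expected.strip() == actual.strip() else 0.0


def run_eval(
    golden: list[dict],
    predict: Callable[[str], str],
    model_version: str,
    log_path: Path,
    threshold: float = 1.0,
) -> dict:
    """Run every golden case, append one JSONL record per case, return a summary."""
    records = []
    with log_path.open("a", encoding="utf-8") as log:
        for case in golden:
            output = predict(case["input"])
            record = {
                "id": case["id"],
                "input": case["input"],
                "expected": case["expected"],
                "output": output,
                "score": exact_match(case["expected"], output),
                "model_version": model_version,
            }
            log.write(json.dumps(record, sort_keys=True) + "\n")
            records.append(record)
    mean = sum(r["score"] for r in records) / len(records) if records else 0.0
    # An empty golden set never passes, so a misconfigured run cannot look green.
    return {"cases": len(records), "mean_score": mean, "passed": bool(records) and mean >= threshold}
```

A CLI entrypoint could exit non-zero when `passed` is false, which would also give the repo a runnable pass/fail signal for model behavior.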
Runnable correctness checks N/A
I did not find any documented, one-command runnable pass/fail correctness-check entrypoint for this codebase (e.g., a CI workflow or root-level `test`/`check`/`build` command wired to return an unambiguous green/red status). While the repo contains Python test files (e.g., under `tests/` and `skills/.../tests/`), the required governance layer that makes correctness checks trivially runnable and externally verifiable from a single command was not located.
- high
Add or document a single root command (e.g., `pytest` invocation or `make test`/`just test`) that runs the existing test suite and returns a clear pass/fail exit code; ensure it covers the agent-facing correctness scope (setup/config flows + any workflow logic with mocks).
- tests/hermes_cli/test_setup.py:1-200 — Unit tests exist, but there is no evidence in-repo (from the checked governance entrypoints) of a single documented pass/fail command that orchestrates them.
- skills/creative/comfyui/tests/test_run_workflow.py:1-200 — Unit tests exist for workflow logic, reinforcing that correctness checks are present, but the runnable correctness-check primitive (one-command, externally visible pass/fail entrypoint) was not found.
Actionable diagnostics 100%
The codebase includes a strong “actionable diagnostics” primitive via `hermes_cli/kanban_diagnostics.py`, which produces structured diagnostics with fix-oriented `actions`, and via `agent/lsp/reporter.py`, which externalizes LSP diagnostics with severity and exact line/column positioning. Both are runnable/consumable outputs rather than implicit logs or ad-hoc strings.
- high
Audit other failure-producing surfaces (e.g., CLI “status” outputs, update/check commands, and any tool preflight errors) to ensure they consistently emit structured diagnostics with (1) a stable diagnostic code/kind, (2) precise location/context when applicable, and (3) explicit operator actions/hints—not just error strings or stack traces. Reuse the existing `Diagnostic`/`DiagnosticAction` patterns where possible.
- hermes_cli/kanban_diagnostics.py:1-80 — Provides the canonical actionable-diagnostics shape (kind/severity/title/detail/actions) that other surfaces should emulate for consistency.
Positive confirmation 200%
The codebase contains a clear instance of positive confirmation: `tools/terminal_tool.py::check_terminal_requirements()` returns an explicit boolean success/failure signal (not only log messages), and `tests/tools/test_terminal_requirements.py` asserts on that success signal with runnable pytest pass/fail conditions.
- high
Search for other operational gates (e.g., startup readiness checks, backend selection checks, tool availability checks) and ensure they also expose an explicit positive success signal (boolean/structured status) and have corresponding tests asserting the success case.
- tools/terminal_tool.py:2369-2510 — This file demonstrates the target pattern (explicit success return + tests). Other gates should be brought to the same standard.
- med
If CI/workflow files exist elsewhere in the repository but were not indexed by the code-graph query, add/verify a documented one-command test run that guarantees an unambiguous green/red outcome (positive confirmation at the repo level).
- tests/tools/test_terminal_requirements.py:1-188 — While tests provide positive confirmation locally, the repo-level CI positive confirmation signal (e.g., GitHub Actions) was not evidenced from workflow/config queries in this audit.
Machine-readable contracts 100%
This codebase does have machine-readable contracts: it externalizes tool parameter expectations as JSON schema (e.g., computer_use), and it provides explicit schema sanitizer/translation modules for backend/provider compatibility (generic sanitizer + Gemini + Moonshot). The presence of focused schema modules plus targeted tests indicates the contracts are managed rather than implicit.
- high
Add (or confirm) a single registry/manifest that enumerates all tool contracts (e.g., where schemas live, how to load them, and their versions), so an agent can query the available contracts without knowing file locations.
- tools/computer_use/schema.py:1-214 — Currently shows one concrete tool contract, but evidence here does not demonstrate a unified registry manifest across all tools.
- med
Ensure the schema contract assets are consistently referenced/validated at the point where tool schemas are emitted to each provider (i.e., confirm call sites always use the sanitizers rather than duplicating shape assumptions).
- tools/schema_sanitizer.py:1-446 — Sanitizers exist; next step is to verify wiring at tool emission sites to ensure contracts stay source-of-truth.
Not applicable to this codebase: Raw / immutable source layer, Evaluation harness + scoring, Runnable correctness checks.