
Dokploy/dokploy

github.com/Dokploy/dokploy · audited 2026-06-04 · commit 6a0acd9

31% ERI composite

Dokploy is a self-hostable deployment platform — a Docker/Swarm-based alternative to Vercel/Heroku. Its 31% composite reads like a younger product that nailed a couple of enterprise primitives early and hasn’t yet built out the rest. The shape is consistent: strong where the team made a deliberate architectural choice, thin everywhere the platform-hardening work simply hasn’t happened yet.

Where it’s strong

Implementation & Customization (76%) is the standout — whitelabeling and branding are driven by configuration, not per-customer forks, exactly the pattern that survives scale. Identity & Access (63%) is real: federated SSO is wired through a centralized auth server (@better-auth/sso, plugin-based) with an enterprise SSO surface, rather than a hand-rolled login. After that the scores fall off.

Where the gaps are

The weak dimensions are broad, which is typical for a project at this stage. API & Extensibility (0%): there’s a script that generates an OpenAPI spec and a Swagger UI page, but no checked-in machine-readable contract artifact — so there’s nothing versioned for consumers to build against. Procurement Readiness (0%) and AI / Data Foundation (15%) are effectively greenfield. Engineering Org Resilience (19%): ownership is centralized with thin CODEOWNERS/review coverage — a real bus-factor risk. Tenancy Isolation (21%): tenant keys exist on some tables (projects.organizationId) but aren’t enforced consistently across business records, and Reliability Primitives (23%) lacks the retries/circuit-breakers/idempotency a deployment platform eventually needs.

The read

Dokploy has the two bones that are hardest to retrofit — clean config-driven customization and real SSO — which is a good sign about how the team thinks. But the breadth of low scores means an enterprise thesis here is an investment in platform hardening across many dimensions at once: a checked-in API contract, consistent tenant enforcement, reliability primitives, and ownership distribution. The dimension breakdown below is scored against the audited commit, evidence linked inline.

T1 Thesis Viability

AI / Data Foundation

Versioned data pipelines, pinned model versions, and a real vector or feature store — not scattered cron jobs and model="latest".

15% 17/19 scored
  • Declarative, tested transformations 33%
    1/3 expected sites
  • Orchestrated pipelines 0%
    0/3 expected sites
  • Data quality validation / contracts 100%
    3/3 expected sites
  • Raw / immutable source layer 0%
    0/2 expected sites not present
  • Data + pipeline versioning 0%
    0/2 expected sites not present
  • Data lineage / provenance 0%
    0/2 expected sites not present
  • Feature management 0%
    0/1 expected sites not present
  • Model version pinning 0%
    0/2 expected sites not present
  • Reproducibility / determinism 0%
    0/1 expected sites not present
  • AI output validation 0%
    0/3 expected sites
  • Grounding / wrongness check 0%
    0/2 expected sites not present
  • Self-correction / feedback loop 0%
    0/3 expected sites not present
  • Evaluation harness + scoring 0%
    0/2 expected sites not present
  • Runnable correctness checks 0%
    0/1 expected sites not present
  • Actionable diagnostics 25%
    1/4 expected sites
  • Positive confirmation 0%
    0/3 expected sites not present
  • Machine-readable contracts 92%
    4/4 expected sites

Declarative, tested transformations 33%

The codebase does include declarative, version-controlled transformation logic with unit tests—most clearly for Traefik routing label generation via `createDomainLabels`, which has extensive Vitest coverage including regression and edge cases. However, the broader “compose spec augmentation” pipeline (e.g., `addDomainToCompose`, compose YAML parsing, and write/serialization steps) appears less fully covered by similarly contract-driven transform tests, so the primitive is only partially implemented end-to-end.

  • high

    Add dedicated unit/integration tests for `addDomainToCompose` covering boundary conditions: (a) empty `domains`, (b) missing target service, (c) composeType docker-compose vs stack/swarm, (d) isolatedDeployment/randomize branches, and (e) presence/absence of existing labels (ensuring idempotent unshift behavior and network injection).

  • med

    Add tests for YAML parsing and remote compose loading behavior (`loadDockerCompose`/`loadDockerComposeRemote`) including invalid YAML, missing paths, and stderr/empty stdout cases to ensure the transformation fails closed (returns null) rather than producing incorrect outputs.

  • low

    Consider extracting a small “transformation module” surface for compose augmentation (pure functions returning transformed specs, with serialization/writes at the edges) so that the tested transformations are more clearly separated from IO concerns and are easier to audit as a transformation layer.
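
A minimal sketch of the "pure function at the core, IO at the edges" shape suggested above. The `ComposeSpec` type and `withDomainLabels` function are hypothetical simplifications for illustration, not Dokploy's actual `addDomainToCompose` implementation.

```typescript
// Hypothetical, simplified shape of a parsed compose file.
export interface ComposeService {
  labels?: string[];
  networks?: string[];
}

export interface ComposeSpec {
  services: Record<string, ComposeService>;
}

// Pure transformation: returns a new spec, never mutates its input, and
// fails closed (null) when the target service does not exist.
export function withDomainLabels(
  spec: ComposeSpec,
  serviceName: string,
  labels: string[],
): ComposeSpec | null {
  const service = spec.services[serviceName];
  if (!service) return null;

  const existing = service.labels ?? [];
  // Idempotent: only prepend labels that are not already present.
  const missing = labels.filter((label) => !existing.includes(label));

  return {
    ...spec,
    services: {
      ...spec.services,
      [serviceName]: { ...service, labels: [...missing, ...existing] },
    },
  };
}
```

Because the function takes and returns plain data, the boundary cases listed above (empty `domains`, missing service, pre-existing labels) can be asserted without touching the filesystem. YAML parsing and file writes would stay in a thin wrapper around it.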

Orchestrated pipelines 0%

This repo includes an orchestration layer for scheduled/recurring operations using BullMQ: schedules are represented as typed job payloads (Zod), jobs are scheduled centrally with cron repeat patterns, and workers execute them via a dedicated job runner. However, the pipeline layer does not clearly declare explicit retry/attempt/backoff behavior at the queue level, and the job runner catches errors and only logs (which can prevent the orchestrator from reliably marking runs as failed for retries).

  • high

    Add explicit retry behavior to the BullMQ job options in `jobQueue`/`defaultJobOptions` (e.g., `attempts` and `backoff`, alongside the existing `removeOnFail` retention setting), and ensure failure states propagate (don’t swallow errors).

    • apps/schedules/src/queue.ts:1-80 — Queue/job options currently only set `removeOnComplete` and `removeOnFail`—no retry/attempt/backoff configuration is declared.
    • apps/schedules/src/utils.ts:1-273 — `runJobs` wraps execution in `try/catch` and logs errors via `logger.error(error)` without rethrowing; this can inhibit BullMQ from treating the run as failed for retry purposes.
  • med

    Improve run observability by recording structured metadata with each job execution (inputs, schedule/job ids, timezone/cron expression, and any script/command version), and include it in both success/failure logs.

    • apps/schedules/src/queue.ts:1-80 — Job creation passes `job` as data and sets repeat pattern, but there is no visible mechanism for attaching versioned execution metadata beyond the job payload.
    • apps/schedules/src/workers.ts:1-46 — Worker logs include `job.data`, but there is no evident structured/complete run context (versions/inputs captured for determinism).
  • low

    If you intend true dependency-graph orchestration (DAG), model task dependencies explicitly (e.g., separate job types with `dependsOn` or a workflow engine), rather than using only repeatable single-step jobs.
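
The retry and error-propagation recommendations above can be sketched as follows. `RetryJobOptions` is a local mirror of the retry-related subset of BullMQ's job options, declared here so the sketch has no dependencies. `propagateFailures` and the numeric values are assumptions for illustration, not the repo's code.

```typescript
// Local mirror of a subset of BullMQ job options (attempts, backoff, retention).
export interface RetryJobOptions {
  attempts: number;
  backoff: { type: "fixed" | "exponential"; delay: number };
  removeOnComplete: boolean;
  removeOnFail: boolean | number;
}

// Illustrative values only; tune per workload.
export const defaultJobOptions: RetryJobOptions = {
  attempts: 3,
  backoff: { type: "exponential", delay: 5_000 },
  removeOnComplete: true,
  removeOnFail: 100, // keep a bounded number of failed jobs for inspection
};

export interface Logger {
  error: (err: unknown) => void;
}

// Wraps a job runner so errors are logged AND rethrown. Rethrowing is what
// lets the queue see the run as failed and apply the retry policy.
export function propagateFailures<T>(
  run: (data: T) => Promise<void>,
  logger: Logger,
): (data: T) => Promise<void> {
  return async (data: T) => {
    try {
      await run(data);
    } catch (error) {
      logger.error(error);
      throw error;
    }
  };
}
```

In a BullMQ worker, the wrapped function would be called from the processor so that a rejected promise marks the job as failed and lets the queue schedule the next attempt.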

Data quality validation / contracts 100%

This codebase does have data quality validation/contracts: it externalizes input validation into runnable Zod schemas and DB insert schemas (via `drizzle-zod`), and applies them at TRPC ingestion boundaries (router `.input(...)` gates) to reject invalid requests before side effects and persistence.

  • high

    Confirm that every ingestion boundary that leads to persistence/side-effects has a schema gate (TRPC `.input(...)` or equivalent) and not only runtime checks. Add/extend schemas where gaps exist (e.g., ensure all operations that accept ids/paths/files use a shared Zod contract module).

  • med

    Add explicit negative-path tests (unit/integration) that assert invalid inputs are rejected (400/typed TRPC error), and that no DB rows are created for invalid payloads.

  • low

    Where schemas exist (e.g., `uploadFileToContainerSchema`), standardize error messages and error codes so diagnostics clearly indicate which contract failed and why (field-level errors from Zod).

Raw / immutable source layer 0%

No immutable/raw landing layer was found. The raw compose handling deletes and recreates the on-disk compose location and writes `docker-compose.yml` directly, which makes the original source non-recoverable for later audits/reprocessing.

  • high

    Introduce an immutable/raw landing directory for raw compose inputs (e.g., store `composeFile` as `raw/docker-compose.yml` or with a content-hash/run-id). Do not delete it on subsequent renders/deploys; only create new versions when raw input changes.

  • high

    Update the raw-to-workdir wiring so downstream steps (randomizeSpecificationFile, domain injection, etc.) write into a separate mutable working directory, while the immutable/raw source remains untouched.

  • med

    Add a manifest/metadata record (at least content hash, timestamp, and sourceType) for each raw compose input version to support reproducibility and audit trails.
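
A minimal sketch of a write-once raw layer with a manifest per version, assuming content-hash addressing. `RawComposeStore` is hypothetical and keeps blobs in memory. A real implementation would put the same keys on disk or in object storage and leave the mutable working directory separate.

```typescript
import { createHash } from "node:crypto";

export interface RawManifest {
  contentHash: string; // sha256 of the raw compose input
  key: string;         // immutable storage key derived from the hash
  storedAt: string;    // ISO timestamp of first ingestion
  sourceType: string;  // e.g. "git" or "raw" (values are assumptions)
}

// Hypothetical write-once store: existing versions are never overwritten
// or deleted, and a new version appears only when the content changes.
export class RawComposeStore {
  private readonly blobs = new Map<string, string>();
  private readonly manifests: RawManifest[] = [];

  put(content: string, sourceType: string, now: Date = new Date()): RawManifest {
    const contentHash = createHash("sha256").update(content).digest("hex");
    const existing = this.manifests.find((m) => m.contentHash === contentHash);
    if (existing) return existing; // unchanged input: reuse the stored version

    const manifest: RawManifest = {
      contentHash,
      key: `raw/${contentHash}/docker-compose.yml`,
      storedAt: now.toISOString(),
      sourceType,
    };
    this.blobs.set(manifest.key, content);
    this.manifests.push(manifest);
    return manifest;
  }

  get(key: string): string | undefined {
    return this.blobs.get(key);
  }

  versions(): readonly RawManifest[] {
    return this.manifests;
  }
}
```

Downstream steps such as domain injection would read from `get(key)` and write their output elsewhere, so the original input stays recoverable for audits and reprocessing.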

Data + pipeline versioning 0%

No governed “data + pipeline versioning” primitive was found. The codebase implements deployment/job orchestration and records deployment logs and (some) commit-derived identity, but there is no evidence of a versioning system that snapshots/governs both the pipeline logic and the data/state it runs on in a reproducible, traceable way.

  • high

    Introduce a reproducibility contract for each deployment run: (1) persist a pipeline version manifest (build config/toolchain versions + repository code revision/commit) and (2) persist immutable data/state artifact identifiers (dataset snapshot/version or input artifact digests). Store both in the deployment/job database record and propagate them through the queue and execution paths.

  • high

    Add explicit, queryable linkage between runs, pipeline versions, and data snapshots: create dedicated DB tables/fields (or an external system like DVC/lakeFS) for data artifact versions and pipeline manifests, and reference them from deployments.

Data lineage / provenance 0%

No data lineage/provenance primitive is implemented as a governed, queryable mechanism. The codebase contains an audit log schema (action/resource tracking) and an Inngest-to-UI mapping that reconstructs some source relationships (events → runs), but there is no automatic, complete lineage of datasets and their transformations, nor any explicit lineage emission that can be validated during change management.

  • high

    Introduce a dedicated lineage/provenance data model and emission points in the pipeline/application layer: for every derived dataset/row returned by transformations (e.g., event/run aggregation), persist (1) source identifiers, (2) transformation/mapping version, (3) schema/version of output, and (4) timestamps and operator/job metadata. Expose it via API so it is queryable and change-management verifiable.

    • apps/api/src/service.ts:1-240 — This is where derived job rows are produced from fetched event/run sources; it should be the primary lineage emission site.
  • med

    Extend or complement the existing audit-log mechanism only if it is repurposed for data lineage: add dataset/derivation identifiers and explicit transformation edges. Otherwise, keep audit-log as security/compliance and implement lineage in a separate governed store.

Feature management 0%

No implementation of the requested “Feature management” primitive exists in this codebase. The only related concept found is an enterprise license-based UI gate, which does not externalize or unify feature definitions for training vs. serving.

  • high

    Confirm the intended scope: if you actually need ML feature-store style management (e.g., Feast/Tecton) for train/serve parity, introduce a dedicated feature layer with a single versioned definition source, and ensure both training and serving read the same compiled feature definitions.

  • med

    If the real requirement is product feature flags (not ML features), rename/re-scope this primitive in your audit rubric (feature flags vs. feature store/feature definitions).

Vector / embedding store N/A

No managed/vector embedding store primitive is present or wired in this codebase. Although there is an AI text-generation service with structured output validation, the implementation does not compute embeddings, does not persist vectors, and does not query any dedicated vector database.

  • high

    If the product intends to support RAG/semantic search over stored project/app data, introduce a dedicated vector embedding store layer (with explicit schema, indexing, and a model+content-version linkage) and wire it into the request path where retrieval is needed.

    • packages/server/src/services/ai.ts:1-258 — Current AI path is generation-only (`generateText`) with structured output validation; it provides a clear contrast point for where a retrieval step (embed→upsert/search→grounded generation) should be inserted.

Model version pinning 0%

Model calls exist (via `generateText`), but there is no evidence of “model version pinning” being enforced: the code passes through a model identifier chosen at runtime (`aiSettings.model` / `input.model`) without ensuring it is an explicit, pinned version (e.g., rejecting `latest`/`stable`-style floating aliases or requiring versioned IDs).

  • high

    Enforce pinned, explicit model IDs at the boundary where models are selected: validate `aiSettings.model` (and `input.model` in `testConnection`) against a policy (reject `latest`/`stable` and require versioned IDs), and/or map friendly names to pinned model IDs from a single config file.

  • med

    Add an internal “model registry” (versioned config) that stores supported pinned model IDs per provider, and ensure both runtime suggestions and connection tests draw from this registry.
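
One way to express the pinning policy and registry described above, as a hedged sketch. `MODEL_REGISTRY`, `resolvePinnedModel`, and the model IDs are placeholders invented for illustration. They are not real provider model names and not part of the codebase.

```typescript
// Matches floating aliases such as "latest" or "some-model:stable".
const FLOATING_ALIAS = /(^|[-_:/@.])(latest|stable|default)$/i;

// Hypothetical registry: friendly names map to explicit, dated IDs.
// The IDs below are placeholders, not real provider model names.
export const MODEL_REGISTRY: Record<string, string> = {
  fast: "example-provider/fast-model-2025-01-15",
  accurate: "example-provider/large-model-2025-03-01",
};

// Resolves a requested name to a pinned ID, rejecting floating aliases.
export function resolvePinnedModel(requested: string): string {
  const mapped: string | undefined = MODEL_REGISTRY[requested];
  const id = (mapped ?? requested).trim();
  if (id === "" || FLOATING_ALIAS.test(id)) {
    throw new Error(`Model "${requested}" is not pinned to an explicit version`);
  }
  return id;
}
```

Calling this at the point where `aiSettings.model` or `input.model` is read would give both runtime suggestions and connection tests the same policy.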

Prompt / model-call management N/A

I did not find any evidence of prompt/model-call management in this codebase: there are no centralized prompt/config artifacts (no prompt/prompts directories or prompt-management libraries), and there are no detectable LLM model-call sites (no OpenAI/Anthropic model SDK usage tied to chat/completions/generate-style calls). This repo appears to be a non-LLM application (infrastructure/dashboard/workers), so this primitive does not appear applicable here.

  • high

    If/when LLM features are added, introduce a managed, versioned prompt/config layer (single source of truth) and route all model calls through it; otherwise, do not force this primitive into a non-LLM stack.

Reproducibility / determinism 0%

The codebase does not provide a reproducibility/determinism primitive that would allow recreating runs exactly from pinned inputs (e.g., captured seeds, environment/dependency versions, and deterministic execution controls). A clear non-deterministic component exists: password generation uses `Math.random()` without any seed management or run-input capture.

  • high

    Introduce a deterministic randomness mechanism for replayable runs: replace `Math.random()` with a seeded PRNG (e.g., `seedrandom`/custom xorshift) and plumb a `seed` value from run boundaries into `generateRandomPassword`; also persist the seed alongside other run inputs.

  • med

    Add run-boundary metadata capture for determinism: at the entry points that trigger nondeterministic operations (auth flows, any “randomize” features, job/worker execution), record (a) seed(s) used, (b) environment variables relevant to behavior, and (c) code/dependency versions (e.g., git SHA + lockfile hash).
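
A sketch of seed-driven generation for values that must be replayable, such as a "randomize" suffix. It uses the public-domain mulberry32 generator as one possible seeded PRNG. `randomSuffix` is a hypothetical helper, and mulberry32 is not a cryptographic generator, so this illustrates replayability only.

```typescript
// mulberry32: a small, fast, seedable 32-bit PRNG (not cryptographic).
export function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

const ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

// Hypothetical helper: the same seed always yields the same suffix, so a
// run can be replayed if the seed is persisted with the other run inputs.
export function randomSuffix(next: () => number, length = 8): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += ALPHABET.charAt(Math.floor(next() * ALPHABET.length));
  }
  return out;
}
```

Persisting the seed next to the git SHA and lockfile hash at the run boundary is what makes the run reproducible. The generator alone does not.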

AI output validation 0%

AI output validation is partially implemented. For `suggestVariants`, the code uses a strict Zod schema at the model boundary via the AI SDK’s structured-output facility (`Output.object({ schema })`), ensuring invalid outputs are rejected before being used. However, other model call paths (e.g., `analyzeLogs`, `testConnection`) return/only lightly check raw model text with no strict schema gate, and there is no demonstrated retry/self-correction loop that reuses the same schema and consistent error messaging.

  • high

    Add strict structured output validation (Zod + `Output.object({ schema })`) to the `analyzeLogs` path so `result.text` is replaced by a schema-gated object (e.g., `{ summary, issues_found, root_cause, suggested_fix }`), rejecting non-conforming outputs instead of passing raw text through.

  • high

    Tighten `testConnection` output validation to an explicit contract (e.g., require exact `text === 'ok'` or use structured output with a schema) rather than only checking that `result.text` is non-empty.

  • med

    Implement a closed-loop retry around schema validation failures for the structured-output calls (including reusing the same schema and returning a consistent validation error message), rather than failing immediately or only logging.
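
To illustrate what a schema gate on `analyzeLogs` would enforce, here is a dependency-free sketch. The field names follow the example shape given above. `parseLogAnalysis` and `isConnectionOk` are hypothetical stand-ins for a Zod schema passed to the AI SDK, not existing functions.

```typescript
export interface LogAnalysis {
  summary: string;
  issues_found: string[];
  root_cause: string;
  suggested_fix: string;
}

export type ParseResult =
  | { ok: true; value: LogAnalysis }
  | { ok: false; errors: string[] };

// Rejects anything that is not a JSON object with exactly the expected types.
export function parseLogAnalysis(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { ok: false, errors: ["output must be a JSON object"] };
  }

  const record = data as Record<string, unknown>;
  const errors: string[] = [];
  for (const field of ["summary", "root_cause", "suggested_fix"] as const) {
    if (typeof record[field] !== "string") errors.push(`${field}: expected string`);
  }
  const issues = record["issues_found"];
  if (!Array.isArray(issues) || !issues.every((i) => typeof i === "string")) {
    errors.push("issues_found: expected string[]");
  }

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: record as unknown as LogAnalysis };
}

// Explicit contract for the connection test: the reply must be "ok".
export function isConnectionOk(text: string): boolean {
  return text.trim() === "ok";
}
```

The field-level error strings are what a later retry step would feed back to the model, which is why the validator collects all failures rather than stopping at the first.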

Grounding / wrongness check 0%

Grounding / wrongness checking for LLM outputs does not appear to be implemented as a reusable primitive. The `aiRouter` endpoints call `generateText` and directly return `result.text` (or only check it is non-empty) without any judge/validator pass that verifies claims against the provided context/logs.

  • high

    Add a centralized grounding/wrongness-check wrapper for LLM outputs (e.g., `groundingWrongnessCheck({context: input.logs, outputText})`) that either (a) produces a structured verdict per claim supported by the context, or (b) rejects and triggers a bounded retry with an evidence-only prompt and strict schema validation. Wire it into `analyzeLogs` so the endpoint never returns unverified claims.

  • high

    Harden the `testConnection` endpoint with explicit contract validation (e.g., require trimmed `result.text` equals `ok`) and treat mismatch as a failure. This is a minimum wrongness check for even simple outputs.

  • med

    Add an automated eval/verification harness (golden tests) for the `analyzeLogs` behavior: a set of log snippets with expected grounded findings, and a CI pass/fail that enforces that the grounding check gates the output.
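
A deliberately simple sketch of the gating idea: each claim must carry a verbatim excerpt, and the check only verifies that the excerpt actually occurs in the supplied logs. This is quote matching, not an LLM judge. `Claim` and `checkGrounding` are hypothetical names.

```typescript
export interface Claim {
  statement: string;
  evidence: string; // verbatim excerpt the model says supports the statement
}

export interface GroundingVerdict {
  grounded: boolean;
  unsupported: Claim[];
}

// Collapse whitespace and case so formatting differences do not matter.
const normalize = (s: string): string => s.replace(/\s+/g, " ").trim().toLowerCase();

// A claim is supported only if its evidence is non-empty and appears in the logs.
export function checkGrounding(logs: string, claims: Claim[]): GroundingVerdict {
  const haystack = normalize(logs);
  const unsupported = claims.filter((claim) => {
    const needle = normalize(claim.evidence);
    return needle.length === 0 || !haystack.includes(needle);
  });
  return { grounded: unsupported.length === 0, unsupported };
}
```

An endpoint wired this way would return the analysis only when `grounded` is true, and otherwise reject or trigger a bounded retry. The golden tests described above could assert exactly that behavior.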

Self-correction / feedback loop 0%

No self-correction / feedback loop exists. While the project uses structured output validation for the main AI suggestion generator (`suggestVariants`) via `Output.object({ schema: fullSchema })`, failures are not used to drive a bounded retry where the specific validation error is fed back into the model prompt and re-checked.

  • high

    Implement a closed self-correction loop inside `suggestVariants`: catch schema/parse/validation errors from the AI SDK/Zod gate, extract the specific error message/path, append it to the prompt (e.g., “Your previous output failed because: <error>. Fix and retry.”), and re-run `generateText` for a bounded number of attempts (e.g., 2-3) before returning a safe fallback (like a deterministic error payload).

  • med

    Ensure the API layer (`apps/dokploy/server/api/routers/ai.ts`) does not simply propagate raw errors: when the loop ultimately fails, return a structured error that includes the last validation error and the attempt count.

  • low

    Add a unit/integration test that forces an invalid model output (or mocks `generateText` to return invalid JSON) and asserts that the second attempt includes the validation error and that attempts are bounded.
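
The bounded loop described above can be sketched independently of any SDK. `generateWithCorrection` is a hypothetical helper. The `generate` and `validate` parameters stand in for the model call and the schema gate, and are injected so the loop can be tested with a mock.

```typescript
export type Validation<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

export interface LoopResult<T> {
  value: T | null;          // null means every attempt failed validation
  attempts: number;
  lastError: string | null; // most recent validation error, if any
}

// Bounded self-correction: on failure, the specific validation error is
// appended to the prompt and the output is re-checked with the same validator.
export async function generateWithCorrection<T>(
  basePrompt: string,
  generate: (prompt: string) => Promise<string>,
  validate: (raw: string) => Validation<T>,
  maxAttempts = 3,
): Promise<LoopResult<T>> {
  let prompt = basePrompt;
  let lastError: string | null = null;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await generate(prompt);
    const result = validate(raw);
    if (result.ok) return { value: result.value, attempts: attempt, lastError };

    lastError = result.error;
    prompt = `${basePrompt}\n\nYour previous output failed because: ${lastError}. Fix and retry.`;
  }
  return { value: null, attempts: maxAttempts, lastError };
}
```

Returning `attempts` and `lastError` alongside the value gives the API layer what it needs to build a structured failure response when the loop is exhausted.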

Evaluation harness + scoring 0%

I could not find any evaluation harness + scoring implementation for AI output quality (no `eval/`, `golden/`, `testset/`, or evaluation runner artifacts, and no code that logs model I/O into a scored, recurring eval process). The codebase does have runtime schema validation and AI calls for suggestion generation and log analysis, but the offline golden-set evaluation/scoring layer described by the primitive is missing.

  • high

    Add an offline evaluation harness for the AI features (starting with `suggestVariants` and `analyzeLogs`): create a versioned golden dataset (inputs like user requests/log samples + expected properties), run model generations in batch, score outputs with deterministic judges, and store results per run (including model/version, prompt version, and seed/params).

  • high

    Implement production logging specifically for eval readiness: persist (a) the feature name, (b) prompt/prompt-template version, (c) model identifier + provider, and (d) the structured model output (or model text), along with the input payload. Wire logs to the eval runner so the next recurring eval can replay and score them.

  • med

    Create a documented one-command check/CI job that runs the eval suite, reports pass/fail with clear thresholds, and fails the build on regressions.

Runnable correctness checks 0%

The codebase contains a Vitest configuration and many unit tests, but I could not find a documented one-command “correctness check” entrypoint (or CI workflow) that an agent can run to get an unambiguous pass/fail verdict. Runnable correctness checks are therefore not governed or documented in the way this primitive requires.

  • high

    Add a root-level, documented command that runs the repository’s correctness suite and returns a clear exit code (e.g., a `test` script like `vitest run --config apps/dokploy/__test__/vitest.config.ts`), and ensure CI uses the same command.

  • med

    Add/confirm a CI workflow that runs the same command and fails the build on test failure, providing the positive confirmation needed for this primitive.

  • low

    Document the command in a contributor-facing file (README/CONTRIBUTING) alongside expected duration and how to run locally (same command as CI).

Actionable diagnostics 25%

The codebase does include actionable diagnostics for core authorization failures (structured `TRPCError` with error `code` and `message`). However, some upstream API failures in `apps/api/src/service.ts` are handled primarily via logging and then returning empty/truncated data, which reduces actionability for callers.

  • high

    Standardize upstream API failure handling in `apps/api/src/service.ts`: when `fetchInngestEvents` / `fetchInngestRunsForEvent` encounter non-OK responses, propagate a structured error to callers (with a code/type and fix hint), rather than only logging and returning empty/partial results.

  • med

    Add actionable diagnostics around `resolveRole()` failure modes (especially `JSON.parse(entry.permission)`) so that malformed role permissions produce a clear message and stable error code rather than failing indirectly later.

  • low

    Ensure environment-variable errors (e.g., missing `INNGEST_BASE_URL`) use the same structured diagnostic pattern (stable code/type) as other API errors, not just a raw `Error` string.

    • apps/api/src/service.ts:175-193 — Throws generic `Error("INNGEST_BASE_URL is required...")` when config is missing; improve consistency with stable error codes/types.
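
A minimal sketch of a shared diagnostic shape with a stable code and a fix hint, covering the missing-config and upstream-failure cases above. `DiagnosticError`, its codes, `requireEnv`, and `assertUpstreamOk` are hypothetical names, not existing exports.

```typescript
export type DiagnosticCode =
  | "CONFIG_MISSING"
  | "UPSTREAM_UNAVAILABLE"
  | "UPSTREAM_BAD_RESPONSE";

// One error type for all API failures: stable code, message, and fix hint.
export class DiagnosticError extends Error {
  readonly code: DiagnosticCode;
  readonly hint: string;

  constructor(code: DiagnosticCode, message: string, hint: string) {
    super(message);
    this.name = "DiagnosticError";
    this.code = code;
    this.hint = hint;
  }
}

// Replaces a raw `Error` string for missing configuration.
export function requireEnv(name: string, env: Record<string, string | undefined>): string {
  const value = env[name];
  if (!value) {
    throw new DiagnosticError(
      "CONFIG_MISSING",
      `${name} is required`,
      `Set ${name} in the service environment and restart.`,
    );
  }
  return value;
}

// Turns a non-OK upstream status into a structured error instead of
// logging and returning empty or partial data.
export function assertUpstreamOk(status: number, url: string): void {
  if (status >= 200 && status < 300) return;
  const code: DiagnosticCode = status >= 500 ? "UPSTREAM_UNAVAILABLE" : "UPSTREAM_BAD_RESPONSE";
  throw new DiagnosticError(
    code,
    `Upstream request to ${url} failed with status ${status}`,
    "Check that the upstream service is reachable and the credentials are valid.",
  );
}
```

Callers can then branch on `code` rather than parsing messages, and an empty result keeps its meaning of "no data" instead of doubling as "the fetch failed".
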
Positive confirmation 0%

No explicit “positive confirmation” success contract (e.g., a CI workflow or other governed pass/fail gate that clearly indicates correctness/safe-to-stop) was found in the repository. While the project uses Vitest and includes test files, I did not find any CI/build definition that externally and unambiguously signals test success.

  • high

    Add a GitHub Actions workflow (or equivalent CI) that runs the repo’s canonical test command (e.g., `vitest run` / `pnpm test`) and treats a passing run as the explicit positive-confirmation signal (green status). Ensure the workflow is the single source of truth for the one-command check.

  • med

    Create/verify a single documented root-level script for the “green check” (e.g., `pnpm test`), and ensure it runs the full relevant test suite with deterministic settings (no watch mode).

Machine-readable contracts 92%

Machine-readable contracts are present and well-externalized: the repo generates and commits an OpenAPI specification (`openapi.json`) from the server router, and the Swagger UI consumes the generated OpenAPI document. Additionally, Chatwoot integration expectations are expressed via a `.d.ts` declaration file.

  • high

    Add/ensure a CI check that re-runs `apps/dokploy/scripts/generate-openapi.ts` and fails if the generated output differs from the committed `openapi.json` (keeps contract in strict sync with implementation).

Not applicable to this codebase: Vector / embedding store, Prompt / model-call management.

Tenancy Isolation

A tenant_id on every business table, row-level security in the database, and tests that prove a cross-tenant request returns 403.

21% 11/12 scored
  • Tenant key on every record 50%
    1/2 expected sites
  • Database-enforced isolation 0%
    0/3 expected sites not present
  • Default-scoped queries 0%
    0/2 expected sites not present
  • Tenant context at the boundary 100%
    2/2 expected sites
  • Object/blob partitioning 0%
    0/8 expected sites not present
  • Tenant context in async work 0%
    0/5 expected sites not present
  • Per-tenant resource limits 0%
    0/2 expected sites
  • Tenant-scoped key management 0%
    0/2 expected sites not present
  • Admin / role scoping 80%
    4/5 expected sites
  • Uniform not-found vs. forbidden 0%
    0/2 expected sites not present
  • Cross-tenant isolation tests 0%
    0/1 expected sites not present

Tenant key on every record 50%

The codebase does have tenant keys on some tables (e.g., `projects.organizationId`), but the primitive is not enforced consistently across all business records at the schema/model layer. At least `deployment` lacks any tenant key column, and `session` does not enforce tenant keying on every row via an organization FK.

  • high

    Add a tenant/owner key column to `deployment` (likely `organizationId` referencing `organization.id`) and ensure all access paths filter by/derive this tenant key from the row (or enforce with DB-level policies).

  • high

    For `session`, decide the intended tenant scoping model and enforce it at rest: either add `organizationId` with an FK (if the session is tenant-bound) or remove `activeOrganizationId` if it is purely UI state, ensuring no tenant-dependent logic relies on an unenforced field.

  • med

    Run a repository-wide audit over all `pgTable(...)` writeable schemas to confirm every business table has a required tenant key column (e.g., `organizationId`). For any exceptions, justify and ensure tenant isolation is still enforced by DB-layer defaults (e.g., RLS/forced policies) or by an architectural partition (schema-per-tenant).

Database-enforced isolation 0%

Database-enforced isolation (RLS/schema-per-tenant/DB router enforcement that filters by tenant even for table owners) appears to be absent. The codebase defines org/tenant foreign keys on some tables (e.g., `projects.organizationId`), but schemas like `deployment` do not carry a tenant discriminator and no database policy mechanism is evident from the inspected DB schema files. As a result, isolation appears to rely on application-level checks (which is soft isolation and vulnerable to a single missed filter).

  • high

    Add database-level tenant enforcement using Row Level Security (RLS) (or schema-per-tenant / database-per-tenant). For shared tables like `deployment`, enforce policies that automatically restrict rows to the caller’s organization, and ensure the policy is forced (e.g., `FORCE ROW LEVEL SECURITY`) so it applies even for table owners.

    • packages/server/src/db/schema/deployment.ts:1-120 — Current `deployment` schema lacks a tenant column, so an RLS policy will need either (a) a tenant column added to the table, or (b) RLS predicates that resolve tenant via joins through related org-scoped tables.
  • high

    Implement RLS policies for every tenant-scoped table (not only top-level ones). Verify by testing: issue queries from a role that is the table owner (or otherwise privileged) without applying org filters in the app query and confirm cross-organization reads are denied.

  • med

    Establish a cross-tenant integration test suite that attempts to list/read/export tenant data using only a different organization context, and assert denial. This should include direct DB-access attempts that bypass application query filters (e.g., through an admin/owner DB role) to validate the defense-in-depth goal.
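
As a hedged illustration of forced RLS, the helper below emits the PostgreSQL statements for one table. It assumes the table already has a text tenant column and that the application sets a per-transaction setting before querying. The column name `organization_id`, the setting name `app.current_organization_id`, and `tenantRlsStatements` itself are assumptions, not names from the repo.

```typescript
const IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;
const TENANT_SETTING = "app.current_organization_id"; // assumed custom setting

// Statement the application would run per transaction, with the verified
// organization id bound as $1 (is_local = true scopes it to the transaction).
export const SET_TENANT_SQL = `SELECT set_config('${TENANT_SETTING}', $1, true)`;

// Emits ENABLE + FORCE + policy statements for one tenant-scoped table.
export function tenantRlsStatements(table: string, tenantColumn = "organization_id"): string[] {
  for (const name of [table, tenantColumn]) {
    if (!IDENTIFIER.test(name)) throw new Error(`Unsafe identifier: ${JSON.stringify(name)}`);
  }
  const predicate = `"${tenantColumn}" = current_setting('${TENANT_SETTING}')`;
  return [
    `ALTER TABLE "${table}" ENABLE ROW LEVEL SECURITY;`,
    // FORCE applies the policy to the table owner as well.
    `ALTER TABLE "${table}" FORCE ROW LEVEL SECURITY;`,
    `CREATE POLICY "${table}_tenant_isolation" ON "${table}" USING (${predicate}) WITH CHECK (${predicate});`,
  ];
}
```

Because `current_setting` raises an error when the setting is absent, a query that never established tenant context fails rather than returning every row. The cross-tenant tests above can rely on that fail-closed behavior.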

Default-scoped queries 0%

I did not find a default-scoped query mechanism that automatically applies the tenant/org filter at the ORM/repository layer. Instead, tenant isolation appears to rely on schema FKs (e.g., `organizationId` on some tables) and higher-level access checks, while individual data-access functions can perform lookups without an automatic tenant filter.

  • high

    Introduce (or verify) a tenant-aware base repository / default scope for Drizzle queries so that any `find/findMany/select` on tenant-partitioned resources automatically injects `organizationId` (or an equivalent tenant join condition). Ensure the default cannot be bypassed except via an explicitly audited escape hatch.

  • med

    Audit all data-access functions that read by opaque IDs (e.g., `findXById(id)`) and ensure each call site either (a) uses the new default-scoped repository or (b) performs a tenant-scoped join/filter underneath in the data layer.
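
The default-scope idea can be sketched with an in-memory stand-in for the data layer. `TenantScopedRepo` is hypothetical. In a Drizzle codebase the same guarantee would come from a base query helper that always adds the organization condition, which this sketch models with a plain array filter.

```typescript
export interface TenantOwned {
  id: string;
  organizationId: string;
}

// Every read goes through the tenant filter; there is no unscoped method.
export class TenantScopedRepo<T extends TenantOwned> {
  private readonly rows: readonly T[];
  private readonly organizationId: string;

  constructor(rows: readonly T[], organizationId: string) {
    if (!organizationId) throw new Error("organizationId is required");
    this.rows = rows;
    this.organizationId = organizationId;
  }

  // Lookup by opaque id is still tenant-filtered, so a foreign id
  // behaves exactly like a missing one.
  findById(id: string): T | undefined {
    return this.rows.find(
      (row) => row.id === id && row.organizationId === this.organizationId,
    );
  }

  findMany(predicate: (row: T) => boolean = () => true): T[] {
    return this.rows.filter(
      (row) => row.organizationId === this.organizationId && predicate(row),
    );
  }
}
```

Since the organization id is fixed at construction, an individual call site cannot forget the filter. Any escape hatch would have to be a separate, explicitly audited code path.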

Tenant context at the boundary 100%

This codebase implements “tenant context at the boundary” for the primary API surface (tRPC). `createTRPCContext` derives the trusted session and user via `validateRequest(req)`, and `validateRequest` sets `session.activeOrganizationId` from the authenticated membership’s organization id. I did not find evidence that the tRPC path trusts tenant identifiers supplied in request params or the request body.

  • high

    Confirm all non-tRPC entry points (e.g., Next.js route handlers under `apps/dokploy/pages/api/**`, webhooks, background workers) also re-establish/derive `activeOrganizationId` (or equivalent) from verified identity, rather than accepting client-supplied org/tenant ids.

    • apps/dokploy/pages/api/providers/github/setup.ts:1-69 — This API route reads `organizationId`-related values from query/URL state (`state` parsing and `req.query`) to call `createGithub(...)`. It is not part of the tRPC boundary implementation audited above, so it should be checked for tenant-context validation via verified identity.
  • med

    Add/verify an automated test that attempts to call an authenticated API while forcing a different `orgId`/`organizationId`, asserting denial (or that the response is based strictly on `activeOrganizationId` from the membership).

Cache key namespacing N/A

I did not find any cache get/set usage where cache keys could be tenant-prefixed (no discovered `cache/redis` key-building patterns for `get/set/mget/mset/...`). The codebase uses Redis primarily for operational concerns (e.g., service setup/Swarm, queue connection) rather than an application-level shared cache with key namespaces, so this tenancy isolation primitive does not appear to be implemented or applicable as a cache-layer concern in this repo.

  • high

    If the system uses a shared application cache in production (e.g., Redis-backed caching via a library like ioredis-cache, node-cache, cache-manager, or custom key/value helpers), locate that cache wrapper and ensure all keys are tenant-prefixed (e.g., `tenant:{id}:...`) and/or enforce cache-level ACLs; then add tests that attempt cross-tenant cache reads.

Object/blob partitioning 0%

Across the blob/object storage interactions used for backups/restores (compose backups, web-server backups, and volume backups), the code constructs S3 object paths using `appName`/`prefix`/`backupFileName` and the configured `destination.bucket`, but does not include a tenant/org identifier in the object key/path. No evidence was found of per-tenant bucket/prefix partitioning (or tenant-bound signed/unguessable URL strategy) being enforced by default at the storage layer for these artifacts.

  • high

    Make storage object keys tenant-scoped by default (e.g., `orgId/` or `tenantId/` prefix) at the point where `bucketDestination`/`s3Path`/`backupPath` are constructed for all backup and restore flows.

  • high

    Ensure restores cannot read arbitrary objects across tenants by validating tenant-scoped prefixes (and/or using tenant-scoped paths stored in DB and enforced during restore).

  • med

    Add integration tests that attempt cross-tenant backup artifact read/list/restore using guessed keys/IDs and assert denial (or same-not-found behavior if you intentionally hide existence).
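
As a rough illustration of the tenant-scoped key recommendation above, the sketch below builds backup object keys under a per-organization prefix and checks that prefix again on restore. All names here (`BackupKeyParts`, `buildTenantBackupKey`, `assertKeyBelongsToTenant`, the `org/<id>/` layout) are hypothetical and do not come from the Dokploy codebase.

```typescript
// Hypothetical helper types and names; not taken from the audited repository.
interface BackupKeyParts {
  organizationId: string; // tenant identifier taken from verified session state
  appName: string;
  prefix: string; // user-configured prefix, may contain slashes
  backupFileName: string;
}

// Reject values that could escape the tenant prefix (path separators, "..").
function assertSafeSegment(value: string, label: string): void {
  if (value.length === 0 || value.includes("/") || value.includes("\\") || value.includes("..")) {
    throw new Error(`invalid ${label}`);
  }
}

// Build an object key that always starts with the tenant's own prefix.
function buildTenantBackupKey(parts: BackupKeyParts): string {
  assertSafeSegment(parts.organizationId, "organizationId");
  assertSafeSegment(parts.appName, "appName");
  assertSafeSegment(parts.backupFileName, "backupFileName");
  // Normalise the configured prefix: drop empty, "." and ".." segments.
  const prefixSegments = parts.prefix
    .split("/")
    .filter((segment) => segment !== "" && segment !== "." && segment !== "..");
  return ["org", parts.organizationId, ...prefixSegments, parts.appName, parts.backupFileName].join("/");
}

// On restore, refuse any key that lies outside the caller's tenant prefix.
function assertKeyBelongsToTenant(objectKey: string, organizationId: string): void {
  assertSafeSegment(organizationId, "organizationId");
  const insidePrefix = objectKey.startsWith(`org/${organizationId}/`);
  if (!insidePrefix || objectKey.split("/").includes("..")) {
    throw new Error("object key is outside the tenant prefix");
  }
}
```

Keeping the check in one helper means the restore path can call `assertKeyBelongsToTenant` before any storage read, regardless of where the key came from.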

Tenant context in async work 0%

Tenant context in async work is not implemented defensively in this codebase: queue job payloads and async workers (BullMQ) do not carry organization/tenant identifiers, and the worker-side data access helpers load entities by ID only (no tenant constraint visible). As a result, async processing appears to rely on implicit correctness rather than mandatory tenant context re-establishment from the job message.

  • high

    Extend all async queue job payload schemas/types to include tenant context (e.g., organizationId/orgId) and ensure enqueue calls populate it from the verified request/session identity.

  • high

    In each async worker, re-establish tenant context from the job payload before any DB access, and enforce that DB queries are tenant-scoped by default (either via a tenant-aware repository/base query layer or explicit mandatory tenant filtering).

  • med

    Harden data-access helpers used by async workers to require tenant/org constraints (e.g., change findServerById/findBackupById/findScheduleById/findVolumeBackupById to accept tenant context and apply it in the query), so isolation is enforced beneath application code.
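
The recommendations above could look roughly like the following sketch, in which the job message carries the tenant identifier and the worker-side lookup takes it as a mandatory argument. The types and functions (`TenantJob`, `findBackupForTenant`, `runBackupJob`) and the in-memory table are assumptions for illustration, not the repository's BullMQ job types or data-access helpers.

```typescript
// Hypothetical job envelope: every payload travels with its tenant identifier.
interface TenantJob<T> {
  organizationId: string; // populated at enqueue time from the verified session
  payload: T;
}

interface BackupRow {
  backupId: string;
  organizationId: string;
}

// In-memory stand-in for a database table.
const backupTable: BackupRow[] = [
  { backupId: "b1", organizationId: "org_a" },
  { backupId: "b2", organizationId: "org_b" },
];

// Tenant-aware lookup: the organization constraint is part of the query itself.
function findBackupForTenant(backupId: string, organizationId: string): BackupRow | undefined {
  return backupTable.find(
    (row) => row.backupId === backupId && row.organizationId === organizationId,
  );
}

// Worker entry point: re-establish tenant context before touching any data.
function runBackupJob(job: TenantJob<{ backupId: string }>): string {
  if (!job.organizationId) {
    throw new Error("job is missing tenant context");
  }
  const backup = findBackupForTenant(job.payload.backupId, job.organizationId);
  if (!backup) {
    // Missing and cross-tenant rows are indistinguishable to the caller.
    throw new Error("backup not found");
  }
  return `running backup ${backup.backupId} for ${backup.organizationId}`;
}
```

Because the lookup cannot be called without an `organizationId`, a job enqueued for one organization cannot load another organization's row by ID alone.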

Per-tenant resource limits 0%

This codebase models rate limiting for API keys (rateLimitEnabled/rateLimitTimeWindow/rateLimitMax, etc.) and associates API key metadata with an organizationId. However, I did not find evidence of true per-tenant resource-limit enforcement via tenant-keyed quota/rate-limit buckets. What exists is closer to per-API-key throttling configuration than to per-tenant noisy-neighbor isolation.

  • high

    Locate the enforcement path that actually consumes the API key rateLimitEnabled/timeWindow/max fields (likely in better-auth integration or a request middleware). Verify and implement tenant-scoped limiter keying (e.g., tenant:{orgId}:rate:...) so quota is shared/limited at the tenant level, not only per API key.

  • med

    Add/adjust integration tests to confirm noisy-neighbor resistance: generate multiple API keys in two organizations, saturate limits for Org A, and verify Org B requests still succeed under concurrent load.
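
To make the difference between per-key and per-tenant limiting concrete, here is a minimal fixed-window limiter keyed by organization, using the `tenant:{orgId}:rate` key shape suggested above. It is an in-memory sketch under assumed names (`TenantRateLimiter`, `allow`); a real deployment would more likely keep the counters in a shared store.

```typescript
// Hypothetical in-memory limiter; names and structure are illustrative only.
class TenantRateLimiter {
  private readonly windows = new Map<string, { windowStart: number; count: number }>();
  private readonly max: number;
  private readonly windowMs: number;

  constructor(max: number, windowMs: number) {
    this.max = max;
    this.windowMs = windowMs;
  }

  // Returns true if the request fits in the tenant's current window.
  allow(organizationId: string, nowMs: number): boolean {
    // One bucket per tenant, shared by all of that tenant's API keys.
    const bucketKey = `tenant:${organizationId}:rate`;
    const current = this.windows.get(bucketKey);
    if (!current || nowMs - current.windowStart >= this.windowMs) {
      // First request, or the previous window has elapsed: start a new window.
      this.windows.set(bucketKey, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (current.count >= this.max) {
      return false;
    }
    current.count += 1;
    return true;
  }
}
```

With this keying, saturating the bucket for one organization leaves every other organization's bucket untouched, which is the property the noisy-neighbor test above is meant to assert.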

Tenant-scoped key management 0%

No evidence of tenant-scoped key management (per-tenant KMS/envelope encryption) was found. Secrets such as SSH private keys and certificate private keys appear to be handled/stored without any tenant-specific envelope-key reference or crypto-erase capable design. Existing code uses global env secrets (e.g., `INNGEST_SIGNING_KEY`) and directly writes sensitive key material to storage paths/files.

  • high

    Introduce tenant-scoped envelope encryption for all persisted/transit key material (e.g., SSH `privateKey`, certificate `privateKey`). Implement a per-organization key-wrapping scheme (BYOK/CMEK-ready) and store only ciphertext + per-tenant key reference metadata.

  • med

    Add cryptographic context plumbing so the lowest crypto layer always resolves tenant/organization from the verified session context, then uses that tenant’s key to encrypt/decrypt. Prevent any code paths from using a global encryption key for tenant secrets.

  • low

    Add integration tests that attempt cross-tenant secret access and verify crypto-erase behavior (after tenant deletion, decryption should fail; ciphertext should remain undecryptable).
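
The envelope-encryption idea above can be sketched with Node's built-in `crypto` module: each secret is encrypted with a fresh data key, and the data key is wrapped with a per-organization key. The key store here is an in-memory `Map` standing in for a managed KMS, and every name (`seal`, `unseal`, `encryptForTenant`, `decryptForTenant`, `eraseTenant`) is hypothetical.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

interface Sealed {
  iv: Buffer;
  tag: Buffer;
  data: Buffer;
}

// AES-256-GCM encryption with a random 96-bit IV.
function seal(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

// Decrypts and authenticates; throws if the key or ciphertext is wrong.
function unseal(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  return Buffer.concat([decipher.update(sealed.data), decipher.final()]);
}

// Assumption: in-memory stand-in for a per-organization key store (a KMS in practice).
const tenantKeys = new Map<string, Buffer>();

interface EnvelopeRecord {
  organizationId: string; // key reference stored next to the ciphertext
  wrappedDataKey: Sealed;
  ciphertext: Sealed;
}

function encryptForTenant(organizationId: string, secret: string): EnvelopeRecord {
  let tenantKey = tenantKeys.get(organizationId);
  if (!tenantKey) {
    tenantKey = randomBytes(32);
    tenantKeys.set(organizationId, tenantKey);
  }
  const dataKey = randomBytes(32); // fresh data key per secret
  return {
    organizationId,
    wrappedDataKey: seal(tenantKey, dataKey),
    ciphertext: seal(dataKey, Buffer.from(secret, "utf8")),
  };
}

function decryptForTenant(record: EnvelopeRecord): string {
  const tenantKey = tenantKeys.get(record.organizationId);
  if (!tenantKey) {
    throw new Error("tenant key unavailable");
  }
  const dataKey = unseal(tenantKey, record.wrappedDataKey);
  return unseal(dataKey, record.ciphertext).toString("utf8");
}

// Crypto-erase: destroying the tenant key makes that tenant's records unreadable.
function eraseTenant(organizationId: string): void {
  tenantKeys.delete(organizationId);
}
```

Only the wrapped data key and the ciphertext would be persisted, so deleting the tenant key is enough to render stored secrets undecryptable without touching each row.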

Admin / role scoping 80%

Admin/elevated roles appear to be correctly scoped to a tenant (organization) membership model. The codebase stores roles on `member` with `organizationId`, derives `activeOrganizationId` from the authenticated user’s verified membership, and tenant-checks are applied at major entrypoints (organization, custom-role, project) by filtering on `ctx.session.activeOrganizationId` / verifying membership before acting.

  • high

    Add/verify a negative integration test that attempts to change/view custom roles or access projects under a different organizationId than `activeOrganizationId`, asserting denial (including “owner/admin” users). This ensures tenant scoping cannot regress silently.

  • med

    Audit `findUserById` / `findOrganizationById` / other generic admin/owner lookup helpers for missing tenant scoping guarantees in their callers. If these helpers are used in authorization decisions, ensure callers always provide/verify the target organizationId.

  • low

    Confirm that any remaining “admin presence” or “owner exists” checks are either (a) intentionally global and non-authorizing, or (b) properly constrained to the relevant organization. If these checks ever gate tenant-specific capabilities, they should take an organizationId parameter.

Uniform not-found vs. forbidden 0%

The codebase does not implement a uniform “not-found vs forbidden” behavior. At least for organization and project item fetches, access denied and missing resources result in different tRPC error codes (e.g., "FORBIDDEN" / "UNAUTHORIZED" vs "NOT_FOUND"), which can leak resource existence across organizations/tenants.

  • high

    For every single-resource read path that can be influenced by an id (e.g., organization/project/environment/etc. “one” queries), change authorization-denied outcomes to throw the same tRPC error code/message as the missing-resource outcome (typically "NOT_FOUND"). This should be done consistently at the lowest layer that decides between “not found” and “access denied”.

  • med

    Add integration tests that attempt cross-organization reads for each affected resource type (read by id, list, and any export/export-like endpoints) and assert the client always receives the same not-found response whether the target id exists but is forbidden or truly does not exist.
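
A minimal sketch of the uniform response described above: the missing-resource case and the cross-tenant case take the same branch and raise the same code and message. `ApiError` is a local stand-in for a tRPC-style error, and `findProjectForCaller` and the in-memory project list are hypothetical.

```typescript
// Local stand-in for a tRPC-style error; not the library's own class.
class ApiError extends Error {
  code: string;

  constructor(code: string, message: string) {
    super(message);
    this.code = code;
  }
}

interface ProjectRow {
  projectId: string;
  organizationId: string;
  title: string;
}

// In-memory stand-in for the projects table.
const projectTable: ProjectRow[] = [
  { projectId: "p1", organizationId: "org_a", title: "alpha" },
  { projectId: "p2", organizationId: "org_b", title: "beta" },
];

// Single-resource read: "does not exist" and "exists in another tenant"
// produce the same error, so the response does not leak existence.
function findProjectForCaller(projectId: string, activeOrganizationId: string): ProjectRow {
  const project = projectTable.find((row) => row.projectId === projectId);
  if (!project || project.organizationId !== activeOrganizationId) {
    throw new ApiError("NOT_FOUND", "Project not found");
  }
  return project;
}
```

A cross-tenant test can then compare the full error (code and message) for a foreign ID against the error for a random ID and require them to be identical.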

Cross-tenant isolation tests 0%

No cross-tenant isolation integration/security tests were found. Existing tests under `apps/dokploy/__test__` focus on permissions and environment access logic, but they do not attempt cross-tenant read/write/list/export/async operations to prove denial at the boundary.

  • high

    Add a dedicated cross-tenant isolation security test suite that, for at least these paths, creates two tenants/orgs and verifies that tenant A cannot read/write/list/export tenant B resources: deployments, environments/env vars, backups, schedules, projects/services, audits/logs, and any export endpoints. Include async paths by enqueueing background jobs under tenant A and asserting workers cannot access tenant B without the correct tenant context.

    • apps/dokploy/__test__/setup.ts:1-44 — Test setup exists but mocks only DB connectivity; it does not provide any cross-tenant isolation test coverage (no evidence of tenant-crossing security assertions in the found test files).
  • high

    Implement explicit assertions in tests for both read and write denial (including list/export) and ensure responses are uniform (e.g., not leaking existence).

  • med

    Extend the test harness to create real or sufficiently representative multi-tenant fixtures (two orgs/members) and run through actual API routes/routers instead of only calling permission helpers with mocked DB.

Not applicable to this codebase: Cache key namespacing.

Identity & Access

SAML/OIDC libraries, SCIM provisioning endpoints, and a real roles/permissions schema — not a hard-coded isAdmin boolean.

63% 11/11 scored
  • Federated SSO (SAML/OIDC) 67%
    3/3 expected sites
  • Directory provisioning (SCIM) 0%
    0/2 expected sites not present
  • RBAC modeled as data 33%
    1/1 expected sites
  • Centralized authorization 117%
    3/2 expected sites
  • No hardcoded privilege shortcuts 100%
    2/2 expected sites
  • Deny-by-default 40%
    2/5 expected sites
  • AuthN before AuthZ at the boundary 133%
    4/3 expected sites
  • MFA / step-up auth 0%
    0/2 expected sites
  • Session & token hygiene 100%
    4/4 expected sites
  • Scoped machine credentials 100%
    3/2 expected sites
  • IP allowlists / network constraints 0%
    0/1 expected sites
Federated SSO (SAML/OIDC) 67%

Federated SSO is present and wired through a centralized authentication server using `@better-auth/sso` (plugin-based). The app also includes an enterprise SSO management router and an SSO provider data model supporting both OIDC and SAML configuration, plus a user-facing sign-in component that initiates the SSO flow via the auth client and redirects to the IdP flow URL. While the wiring is clear, direct evidence of cryptographic token/assertion validation at the boundary is not explicitly shown in the audited slices (it is likely handled inside the Better-Auth SSO plugin).

  • high

    Add/confirm explicit boundary evidence for cryptographic verification (OIDC token signature verification / JWKS validation; SAML signature/assertion validation, audience/issuer checks) and ensure it occurs before any session creation or authorization logic. Document which module performs verification (Better-Auth SSO internal implementation) and add targeted tests for signature failures and audience/issuer mismatches.

    • packages/server/src/lib/auth.ts:1-420 — SSO is wired via the `sso()` plugin, but the audited code slice does not show the actual verification logic, so verification correctness should be confirmed via deeper code reading or tests.
  • med

    Verify the SSO callback/redirect handling path is fully covered end-to-end (including any ACS/callback endpoints if applicable) and ensure every callback result is tied to the correct organization/provider record (per-org connection enforcement).

  • low

    Harden SSO admin UX paths with clearer enforceSSO behavior guarantees (e.g., when enforceSSO is enabled, ensure local password/email auth is properly suppressed at the UI and server policy layers, not only on the client).

Directory provisioning (SCIM) 0%

SCIM directory provisioning (SCIM v2.0 Users/Groups with create/update and, critically, deprovisioning that revokes access) is not implemented anywhere obvious in the codebase. The enterprise identity layer shows SSO/provider management, and user provisioning appears to be invitation/credential-based rather than directory-driven SCIM.

  • high

    Add a dedicated SCIM v2.0 router (e.g., under the same proprietary/enterprise identity area as SSO) exposing the required endpoints for Users and Groups, including PATCH handling for attributes like `active`/`disabled`.

  • high

    Implement a true deprovision/revocation workflow for SCIM offboarding: when SCIM marks a user inactive, revoke their access by disabling their organization membership and invalidating active sessions/tokens (and ensure follow-on authorization checks stop granting access).

  • med

    Add end-to-end tests that simulate SCIM lifecycle: create → update attributes → deactivate → verify access is actually revoked (authorization denied and sessions/tokens no longer work).
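
As a hedged sketch of the deprovisioning path, the code below handles a SCIM 2.0 PatchOp that sets `active` to false by disabling the membership and dropping the user's sessions. The request shape follows the SCIM PatchOp message format; the stores and function names (`memberships`, `sessionsByUser`, `applyScimPatch`, `canAccess`) are in-memory assumptions, not part of the audited code.

```typescript
interface ScimPatchOperation {
  op: string;
  path?: string;
  value?: unknown;
}

interface ScimPatchRequest {
  schemas: string[];
  Operations: ScimPatchOperation[];
}

const PATCH_OP_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp";

// Hypothetical in-memory stores standing in for membership and session tables.
const memberships = new Map<string, { active: boolean }>();
const sessionsByUser = new Map<string, string[]>();

// Apply "replace active" operations; deactivation also revokes sessions.
function applyScimPatch(userId: string, request: ScimPatchRequest): void {
  if (!request.schemas.includes(PATCH_OP_SCHEMA)) {
    throw new Error("unsupported SCIM schema");
  }
  const membership = memberships.get(userId);
  if (!membership) {
    throw new Error("user not found");
  }
  for (const operation of request.Operations) {
    if (operation.op.toLowerCase() === "replace" && operation.path === "active") {
      // Only a strict boolean true keeps the account active.
      membership.active = operation.value === true;
      if (!membership.active) {
        sessionsByUser.delete(userId);
      }
    }
  }
}

// Authorization check used after provisioning changes.
function canAccess(userId: string, sessionToken: string): boolean {
  const membership = memberships.get(userId);
  const tokens = sessionsByUser.get(userId) ?? [];
  return membership !== undefined && membership.active && tokens.includes(sessionToken);
}
```

The point of the sketch is the pairing: marking the user inactive and invalidating sessions happen in the same step, so an existing token stops working as soon as the directory deactivates the account.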

RBAC modeled as data 33%

RBAC modeled as data is present: the codebase has role/permission data tables (`organizationRole`, `member`) and a centralized, data-driven policy module (`packages/server/src/lib/access-control.ts`) plus custom-role persistence (`proprietary/custom-role.ts`). However, at least one important authorization surface (organization create/update/delete) still uses inline hardcoded role checks instead of routing the decision through the centralized, data-driven permission model.

  • high

    Refactor `apps/dokploy/server/api/routers/organization.ts` to authorize organization mutations (create/update/delete) via the centralized permission/role engine (permission resolution based on `member.role` + `organizationRole.permission`) rather than inline checks against `ctx.user.role` / `userMember.role === 'owner'`.
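
To illustrate what routing the decision through data means in practice, the sketch below resolves permissions from a role-to-permission table instead of comparing role names inline. The table contents and the names `rolePermissions`, `roleAllows`, and `requirePermission` are hypothetical and are not the repository's permission engine.

```typescript
type PermissionKey = string; // e.g. "organization:update"

// Stand-in for role/permission rows loaded from the database.
const rolePermissions = new Map<string, Set<PermissionKey>>([
  ["owner", new Set<PermissionKey>(["organization:update", "organization:delete"])],
  ["admin", new Set<PermissionKey>(["organization:update"])],
  ["member", new Set<PermissionKey>()],
]);

// Data-driven decision: unknown roles and missing grants both resolve to false.
function roleAllows(role: string, permission: PermissionKey): boolean {
  const granted = rolePermissions.get(role);
  return granted !== undefined && granted.has(permission);
}

// Guard for handlers: throws instead of comparing role names inline.
function requirePermission(role: string, permission: PermissionKey): void {
  if (!roleAllows(role, permission)) {
    throw new Error("FORBIDDEN");
  }
}

// A custom role is just another row; no handler code has to change.
rolePermissions.set("release-manager", new Set<PermissionKey>(["organization:update"]));
```

A handler that calls `requirePermission(role, "organization:delete")` keeps working when roles are added or renamed, which is not true of a check such as `role === "owner"`.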

Centralized authorization 117%

The codebase does have centralized authorization: permissions are defined once in `packages/server/src/lib/access-control.ts`, evaluated in a single decision module `packages/server/src/services/permission.ts`, and enforced across the API via a tRPC permission wrapper `withPermission` in `apps/dokploy/server/api/trpc.ts`. Some parts still include direct role-string checks (e.g., owner/admin/member), but the actual resource+action permission checks are centralized through the shared permission engine.

  • high

    Ensure every permission-requiring tRPC procedure uses `withPermission(...)` (or calls `checkPermission/hasPermission`) and remove/avoid inline per-handler authorization logic. Add a lint/test that fails builds when routers add ad-hoc `role`/permission checks instead of routing through the shared permission service.

  • med

    Add auditable logging for every authorization decision at the centralized chokepoint (`checkPermission` / `withPermission`), so decisions are consistently logged for both allow and deny cases (not only where routers happen to import an audit helper).

  • low

    Reduce the number of direct `memberRecord.role === "owner" || "admin"` / `ctx.user.role !== "owner" && !== "admin"` checks by funneling any privileged bypass logic through the centralized permission engine, to minimize future drift.

No hardcoded privilege shortcuts 100%

This codebase generally avoids hardcoded superuser privilege shortcuts: privilege is represented via a role string (`user.role`) and enforced via a centralized permission/policy service (`checkPermission` + `role.authorize(...)`). However, there is at least one router-level ad-hoc privileged-role shortcut based on role names (`member.role !== "owner" && member.role !== "admin"`) instead of routing the decision through the centralized permission layer.

  • high

    Remove router-level privileged shortcuts like `member.role !== "owner" && member.role !== "admin"` and replace them with centralized permission checks (e.g., express the requirement in the permission model and enforce via `checkPermission` / `checkServicePermissionAndAccess`).

  • med

    In `permission.ts`, ensure any “privileged static role” behavior (e.g., owner/admin bypass behavior) is consistently modeled within the policy engine, and document it as part of the role/permission model so it remains auditable and not treated as a scattered shortcut.

Deny-by-default 40%

The codebase implements deny-by-default correctly at the main tRPC boundary (`validateRequest` + 401 on missing session/user) and at the procedure layer (`protectedProcedure` throws UNAUTHORIZED). However, several concrete Next.js API endpoints (health, Stripe webhook, OAuth callbacks, GitHub webhook) are public and rely on per-endpoint trust/verification rather than the shared deny-by-default gate.

  • high

    For every public API entry point (health, Stripe webhook, OAuth callbacks, GitHub webhook), add/verify strong request authentication/verification and ensure there is an explicit, documented trust model (e.g., strict signature verification, state anti-CSRF guarantees). This prevents new endpoints from accidentally remaining open without a deny-by-default rationale.

  • med

    Create a checklist/pattern to ensure new endpoints are either (a) routed through the shared tRPC auth boundary, or (b) explicitly documented as public and secured with the correct verification mechanism (not merely “no session required”).

  • low

    Audit the deployment service separately for consistent deny-by-default semantics across all its exposed endpoints (ensure every route that should be protected has the middleware applied, not just `/deploy` and `/cancel-deployment`).
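
For public webhook endpoints, "strict signature verification" usually means an HMAC over the raw request body compared in constant time. The sketch below shows that generic pattern with Node's `crypto` module; the function names and the hex header format are assumptions and do not describe how any particular provider (or Dokploy) formats its signature header.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute a hex-encoded HMAC-SHA256 over the raw request body.
function signPayload(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Verify a hex signature header; any missing or malformed input is rejected.
function verifyWebhookSignature(
  secret: string,
  rawBody: string,
  signatureHeader: string | undefined,
): boolean {
  if (!signatureHeader) {
    return false; // deny by default when the header is absent
  }
  const expected = Buffer.from(signPayload(secret, rawBody), "hex");
  const provided = Buffer.from(signatureHeader, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (provided.length !== expected.length) {
    return false;
  }
  return timingSafeEqual(provided, expected);
}
```

The handler would call `verifyWebhookSignature` on the unparsed body before doing anything else and return an error response when it yields false.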

AuthN before AuthZ at the boundary 133%

This codebase implements a centralized boundary pattern for AuthN-before-AuthZ: tRPC requests authenticate first via `validateRequest(req)` (cryptographic API key verification and session validation) to populate `ctx.session`/`ctx.user`, and then authorization is enforced by `protectedProcedure`/role/enterprise guards that check `ctx.user`/`ctx.session`. Next.js handler(s) also call `validateRequest` before delegating to the tRPC/OpenAPI stack.

  • med

    Audit any additional non-tRPC entrypoints (e.g., custom REST routes, webhook handlers, background workers with user context) to ensure they either (a) use the same `validateRequest` boundary before authorization, or (b) are explicitly public/non-authorized flows with no authz decisions based on unverified identity.

MFA / step-up auth 0%

This codebase does implement MFA at sign-in time (2FA redirect + enforced TOTP/backup-code verification, plus 2FA enrollment UI). However, the “step-up auth” requirement for sensitive actions (additional second-factor enforcement for high-risk operations after baseline authentication) is not clearly implemented: the privileged admin procedures appear to rely on ordinary authorization/role checks without an additional step-up boundary.

  • high

    Add an enforceable step-up MFA policy for high-risk routes/procedures (starting with admin/owner operations and any security/privilege-changing actions). Require a fresh second-factor verification (or a time-bound, audited break-glass) before executing the sensitive operation.

  • med

    Implement server-side enforcement and auditing: record whether the request has a recent step-up verification (timestamp + method), require it in step-up-protected handlers, and include it in the audit log entries for sensitive actions.

    • apps/dokploy/server/api/routers/security.ts:1-75 — There is auditing for security actions (`audit(ctx, ...)`) but no indication of a step-up MFA check prior to the sensitive mutation. Extend audit events to include step-up context and enforce the check.
  • low

    Ensure the step-up requirement is consistent across both UI and API entry points (tRPC procedures and any REST endpoints). Avoid client-only gating (UI prompt) as the sole enforcement mechanism.

    • apps/dokploy/pages/index.tsx:1-230 — The MFA challenge/verification is shown in the UI, but step-up for sensitive actions must be enforced server-side at the privileged procedure boundary.
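
A server-side step-up check can be as small as a freshness test on the last second-factor verification, applied before the sensitive handler runs. The sketch below assumes a five-minute window and a `lastSecondFactorAt` field on the session; both, along with `hasFreshStepUp` and `runSensitive`, are illustrative assumptions rather than existing fields or helpers.

```typescript
interface SessionState {
  userId: string;
  lastSecondFactorAt: number | null; // epoch milliseconds of the last 2FA check
}

// Assumed policy: a second-factor check is considered fresh for five minutes.
const STEP_UP_MAX_AGE_MS = 5 * 60 * 1000;

function hasFreshStepUp(session: SessionState, nowMs: number): boolean {
  if (session.lastSecondFactorAt === null) {
    return false;
  }
  const age = nowMs - session.lastSecondFactorAt;
  // Negative ages (timestamps in the future) are treated as invalid.
  return age >= 0 && age <= STEP_UP_MAX_AGE_MS;
}

// Wrap a sensitive action so it only runs after a fresh step-up verification.
function runSensitive<T>(session: SessionState, nowMs: number, action: () => T): T {
  if (!hasFreshStepUp(session, nowMs)) {
    throw new Error("STEP_UP_REQUIRED");
  }
  return action();
}
```

The same timestamp can be copied into the audit entry for the action, which covers the "timestamp + method" recommendation above.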
Session & token hygiene 100%

Session & token hygiene is implemented via better-auth with server-side sessions backed by a sessions table containing expiresAt and a unique token. The centralized auth configuration sets an explicit session TTL (3 days) and includes session.delete lifecycle handling for logout. Request authentication derives identity only from server-side validated session state (api.getSession).

  • med

    Verify that refresh/rotation and revocation are fully enforced for sessions/tokens beyond expiry (e.g., confirm whether refresh rotates the session token or merely extends it, and whether logout actively deletes/invalidates the stored session row in the backing adapter).

  • low

    Add explicit automated tests asserting that after logout (session.delete), subsequent requests with the same cookie/token are rejected.

Scoped machine credentials 100%

This codebase does implement scoped machine credentials via an `apikey` table and an API-key verification path (`x-api-key`) in `packages/server/src/lib/auth.ts`. Keys are database-backed (not a single shared global secret) and can be revoked via deletion. However, the scope/least-privilege enforcement appears to rely primarily on `organizationId` binding and member role, while explicit use of the `apikey.permissions` field for fine-grained authorization is not visible in the reviewed boundary wiring.

  • high

    Ensure API-key scopes/permissions are actually enforced at authorization time. Concretely: when `validateRequest` maps an API key to a session, propagate/derive the API key's scoped `permissions` (from `apikey.permissions`) into authorization checks, and/or enforce deny-by-default based on those permissions for every sensitive endpoint.

    • packages/server/src/lib/auth.ts:420-520 — API key validation loads `apikeyRecord` and uses `apikeyRecord.metadata.organizationId` and `member.role` to set authorization context; reviewed code does not show usage of `apikey.permissions` for fine-grained scope enforcement.
    • packages/server/src/db/schema/account.ts:220-252 — The schema includes a `permissions` field on `apikey`, which should be exercised for scoped/least-privilege enforcement but is not demonstrated in the boundary mapping we inspected.
  • med

    Add explicit checks that API keys respect `enabled` and `expiresAt` (either via `api.verifyApiKey` behavior or an additional DB check). This ensures revoked/expired keys cannot authenticate even if the upstream verification library changes behavior.
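
The two recommendations above can be combined into one check at authorization time: the key must be enabled, unexpired, and explicitly granted the requested action. The record shape below, with `permissions` as a map from resource to allowed actions, is an assumption for illustration and is not confirmed against the `apikey` schema.

```typescript
// Hypothetical API key record; field shapes are assumptions.
interface ApiKeyRecord {
  enabled: boolean;
  expiresAt: number | null; // epoch milliseconds, null means no expiry
  permissions: Record<string, string[]>; // resource -> allowed actions
}

// Deny by default: anything not explicitly granted is refused.
function apiKeyAllows(
  apiKey: ApiKeyRecord,
  resource: string,
  action: string,
  nowMs: number,
): boolean {
  if (!apiKey.enabled) {
    return false;
  }
  if (apiKey.expiresAt !== null && apiKey.expiresAt <= nowMs) {
    return false;
  }
  const allowedActions = apiKey.permissions[resource];
  return allowedActions !== undefined && allowedActions.includes(action);
}
```

Calling this alongside the role check means an API key can never do more than its own grant, even when the member who created it has a broader role.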

IP allowlists / network constraints 0%

The repository includes CIDR matching logic (`isIPInCIDR`) and a Traefik `ipWhiteList` middleware type, but there is no concrete per-tenant IP allowlist enforcement wired into the request path before business logic. In the main Traefik security middleware construction, only BasicAuth middleware is created/managed, not IP allowlists.

  • high

    Identify the tenant/app configuration source for network constraints (e.g., per-server or per-organization settings) and wire a Traefik `ipWhiteList` middleware into the same security chokepoint used for BasicAuth (`createSecurityMiddleware` / middleware chain attachment). Ensure the allowlist executes before any downstream handlers for protected routes/services.

  • med

    Add automated tests that validate deny-by-default behavior for requests from disallowed client IPs at the edge (Traefik), including cases where allowlist is configured per tenant and where it is empty/disabled.
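
For the application-side half of an allowlist, the core operation is a CIDR membership test. The sketch below is an independent IPv4-only illustration, not the repository's `isIPInCIDR`; the function names are hypothetical, and treating an empty allowlist as "deny everything" is an assumed policy that a real implementation would need to decide explicitly.

```typescript
// Parse a dotted-quad IPv4 address into an unsigned 32-bit number.
function ipv4ToNumber(ip: string): number {
  const parts = ip.split(".");
  if (parts.length !== 4) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  let result = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part) || Number(part) > 255) {
      throw new Error(`invalid IPv4 address: ${ip}`);
    }
    result = result * 256 + Number(part);
  }
  return result;
}

// True if the address falls inside the CIDR block (IPv4 only).
function ipv4InCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf("/");
  if (slash === -1) {
    throw new Error(`invalid CIDR: ${cidr}`);
  }
  const base = cidr.slice(0, slash);
  const bitsText = cidr.slice(slash + 1);
  if (!/^\d{1,2}$/.test(bitsText) || Number(bitsText) > 32) {
    throw new Error(`invalid CIDR: ${cidr}`);
  }
  // Addresses in the same block share the same quotient by the block size.
  const blockSize = Math.pow(2, 32 - Number(bitsText));
  return Math.floor(ipv4ToNumber(ip) / blockSize) === Math.floor(ipv4ToNumber(base) / blockSize);
}

// Assumed policy: an empty allowlist matches nothing, so every request is denied.
function isIpAllowed(ip: string, allowlist: string[]): boolean {
  return allowlist.some((cidr) => ipv4InCidr(ip, cidr));
}
```

Plain arithmetic is used instead of 32-bit bitwise operators to avoid sign issues with addresses above `127.255.255.255`.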

Compliance Code Patterns

Envelope encryption, enforced TLS, validated inputs, and zero secrets anywhere in the full git history.

40% 9/11 scored
  • Encryption in transit 0%
    0/5 expected sites not present
  • Encryption at rest 33%
    1/3 expected sites
  • Secrets management 67%
    2/2 expected sites
  • Input validation at boundaries 100%
    7/7 expected sites
  • Injection-safe data access 100%
    3/3 expected sites
  • Data classification & PII handling 0%
    0/4 expected sites not present
  • Access logging on protected routes 0%
    0/2 expected sites
  • Retention & secure deletion 8%
    1/4 expected sites
  • Secure defaults / hardening 56%
    2/3 expected sites
Encryption in transit 0%

I did not find evidence that encryption in transit (TLS forced everywhere, including internal hops) is enforced by the application code paths visible in this repo. The main server bootstrap creates a plain HTTP server (`http.createServer`) and starts it with `server.listen` without TLS wrapping in the same file, and an internal token exchange uses a configurable base URL without enforcing HTTPS.

  • high

    Enforce TLS at the application edge entrypoint: replace the plain HTTP server bootstrap with an HTTPS server (or ensure the reverse proxy terminator is configured to redirect all HTTP to HTTPS and to set HSTS), and ensure this is consistently applied for every listener (including WebSocket upgrades).

  • high

    Enforce HTTPS for all internal/external service-to-service HTTP calls: in `fetchAccessToken`, validate that `baseUrl` is https:// (or upgrade it), and fail closed (do not allow http) for token exchanges and any similar provider integrations.

  • med

    Harden transport security headers/redirects for every public endpoint: ensure HSTS is set (and HTTP→HTTPS redirects happen) for user-agent flows that use `res.redirect`, and ensure the redirect targets are always HTTPS.

  • med

    Add TLS configuration for the monitoring service listener (or ensure the deployment layer wraps it with TLS): confirm that `app.Listen` is behind HTTPS and that clients cannot reach this service over plain HTTP.
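
The "fail closed on http" recommendation for token exchanges can be expressed as a small guard that runs before any outbound call. `requireHttpsUrl` is a hypothetical helper; how it would be wired into `fetchAccessToken` is not shown in the audited slices and is left as an assumption.

```typescript
// Hypothetical guard: parse the configured base URL and refuse anything but HTTPS.
function requireHttpsUrl(baseUrl: string): URL {
  const parsed = new URL(baseUrl); // throws on malformed input
  if (parsed.protocol !== "https:") {
    throw new Error(`refusing non-HTTPS URL with protocol ${parsed.protocol}`);
  }
  return parsed;
}
```

Calling the guard once at configuration-load time and again immediately before the request keeps a later config change from silently downgrading the connection.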

Encryption at rest 33%

Encryption-at-rest appears partially implemented: there is explicit symmetric encryption/decryption for 2FA `secret` and `backupCodes` (including a re-encryption migration script). However, the database schema shows other sensitive credentials stored as plaintext text columns (e.g., account tokens/password and API keys), and no code paths for their at-rest field encryption were identified in this audit slice.

  • high

    Implement and enforce field-level at-rest encryption for all sensitive columns shown in the schema as plaintext text fields (at minimum: `account.accessToken`, `account.refreshToken`, `account.idToken`, `account.password`, and `apikey.key`), bring the existing encryption of `two_factor.secret` and `two_factor.backupCodes` under the same scheme, and ensure the same encryption is preserved for backup/export paths.

  • high

    Remove or gate any insecure legacy defaults for encryption/auth secrets (hardcoded fallback secrets), and require runtime injection from environment/Docker secrets so encryption keys are not embedded in code.

  • med

    Add automated checks/tests to verify encrypted-at-rest guarantees: (1) encrypted ciphertext is stored for each sensitive column, (2) decrypt round-trips work, and (3) exports/backups do not introduce plaintext values.

Centralized key management N/A

This codebase does not show evidence of centralized key management (managed key store + rotation/revocation) in the current codebase structure. Searches for KMS/Vault/KeyVault/SecretsManager-like dependencies and key-rotation/revocation symbols returned no results. As additional context, a full-history secret scan found multiple committed secrets in the repo history, reinforcing that key/secret handling appears ad-hoc rather than centrally governed and rotated via a managed key service.

  • high

    Implement centralized key management using a managed KMS/Vault/KeyVault service, and refactor encryption key usage to fetch keys from the managed store only (no local/shadow keys). Add scheduled rotation and a tested emergency revocation workflow.

    • repo (code graph query results):N/A — The codebase contains no detectable wiring to KMS/Vault/KeyVault/SecretsManager-style libraries and no detectable key rotation/revocation symbols, indicating the primitive is not implemented.
  • high

    Remove/rotate any compromised secrets/keys found in git history; treat them as leaked and replace with runtime-secret retrieval from the managed secret/key system.

Secrets management 67%

This codebase has a partial secrets-management implementation: it supports runtime injection of sensitive values via environment variables and Docker secret files (e.g., BETTER_AUTH_SECRET_FILE and POSTGRES_PASSWORD_FILE). However, it also contains legacy plaintext fallbacks (a hardcoded better-auth secret and a hardcoded production database URL) that undermine the “never hardcoded” requirement, so the control is only partially enforced.

  • high

    Remove insecure hardcoded fallbacks for BETTER_AUTH_SECRET and the production DATABASE URL. Enforce that the secret must be provided via BETTER_AUTH_SECRET or BETTER_AUTH_SECRET_FILE; otherwise fail fast at startup.

  • high

    Enforce POSTGRES_PASSWORD_FILE (or a dedicated secret manager integration) and delete plaintext hardcoded DB credentials. Fail startup if neither DATABASE_URL nor POSTGRES_PASSWORD_FILE is set (or if DATABASE_URL is set, require operational controls).

  • med

    Add/verify rotation documentation and operational checks to ensure secrets are reloaded appropriately (e.g., container restart behavior) and that secret-file paths are used in all deployment manifests.
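
The fail-fast behaviour recommended above can be sketched as a resolver that accepts either the plain variable or its `_FILE` variant and throws when neither yields a value. The environment and file reader are passed in as parameters so the logic stays testable; `resolveSecret` and the precedence of the plain variable over the file are assumptions, not the behaviour of the existing code.

```typescript
type EnvMap = Record<string, string | undefined>;
type ReadFileFn = (path: string) => string;

// Resolve a secret from NAME or NAME_FILE; there is no hardcoded fallback.
function resolveSecret(secretName: string, env: EnvMap, readFile: ReadFileFn): string {
  const direct = env[secretName];
  if (direct !== undefined && direct !== "") {
    return direct;
  }
  const filePath = env[`${secretName}_FILE`];
  if (filePath !== undefined && filePath !== "") {
    // Secret files commonly end with a newline; trim it off.
    const fromFile = readFile(filePath).trim();
    if (fromFile !== "") {
      return fromFile;
    }
  }
  throw new Error(`${secretName} or ${secretName}_FILE must be set`);
}
```

At startup this might be called as `resolveSecret("BETTER_AUTH_SECRET", process.env, (p) => readFileSync(p, "utf8"))`, with `readFileSync` imported from `node:fs`, so a missing secret stops the process instead of falling back to a default.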

No secrets in git history N/A

The codebase does not meet the requirement of “no secrets in git history”: it contains a hardcoded authentication secret fallback in `packages/server/src/lib/auth-secret.ts` (legacy default). Any such committed secret should be treated as compromised and rotated.

  • high

    Remove the hardcoded legacy secret fallback and require `BETTER_AUTH_SECRET` or `BETTER_AUTH_SECRET_FILE` to be provided via environment variables/Docker secrets in all non-test deployments; then rotate any previously deployed secret(s) derived from the hardcoded value.

  • med

    Use the full-history secret scan results to identify every committed credential/token/password/private key pattern, validate whether any are real (not placeholders/test fixtures), and rotate them; then rewrite history (if required by policy) and enforce secret scanning in CI.

Input validation at boundaries 100%

This codebase applies input validation at the API boundary using tRPC `.input(...)` with Zod schemas (defined under `packages/server/src/db/schema/*`). The router handlers consume `input` only after schema validation, and invalid inputs are handled through the tRPC/Zod error plumbing.

  • high

    Continue enforcing a `.input(api*Schema)` on every tRPC procedure that reads `input.*` fields; add `.input(...)` for any remaining procedures that currently read request-derived values without a corresponding Zod schema.

  • med

    Audit schema coverage for range/constraint completeness (e.g., min/max, enums, ID formats) in all `api*` schemas to ensure boundary checks are not only present but also strict enough.

Injection-safe data access 100%

The codebase applies injection-safe data access for the observed DB interaction paths: Drizzle ORM query builders are used for Node/TS routes (binding parameters), and the Go monitoring DB layer consistently uses `?` placeholders with argument binding. No evidence of SQL injection anti-patterns (untrusted values string-concatenated into SQL) was found in the inspected sites.

  • med

    Extend auditing to the remaining DB access points (other routers/services) to ensure there are no raw SQL strings that interpolate request data (e.g., via `db.query.*` with `sql.raw(...)` or similar).

Data classification & PII handling 0%

No clear implementation of “Data classification & PII handling” (sensitivity tagging + masking/minimization for PII in logs/audit/exports) was found. While audit logging and access-log cleanup exist, audit records write userEmail directly and there is no evidence of field-level redaction or sensitivity-based filtering on log persistence or retrieval.

  • high

    Add a centralized PII/sensitivity classification and redaction layer for log/audit writes. Concretely: (1) implement masking/redaction for userEmail (e.g., store hash/pseudonym or redact) and (2) scrub/limit auditLog.metadata to remove/avoid PII before JSON.stringify/persistence.

  • high

    Enforce masking on the audit-log retrieval boundary: ensure getAuditLogs either omits sensitive fields or returns masked values depending on authorization, and ensure metadata is filtered/sanitized before returning to clients.

  • med

    Harden the global logger configuration with redaction/serializers so PII never reaches logs by default (rather than relying on each callsite).

  • med

    Review and add sanitization to the access-log ingestion/storage path. Cleanup jobs alone are not sufficient; ensure sensitive fields (IPs, user identifiers, tokens) are redacted at write-time or prior to export.
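
A centralized redaction layer could start from two small functions applied before any audit write: one that masks email addresses and one that scrubs metadata keys known to be sensitive. The key list, the masking format, and the names `maskEmail` and `redactMetadata` are illustrative assumptions; the sketch only handles top-level keys.

```typescript
// Mask an email address, keeping the first character and the domain.
function maskEmail(email: string): string {
  const at = email.lastIndexOf("@");
  if (at <= 0) {
    return "[redacted]";
  }
  return `${email.charAt(0)}***${email.slice(at)}`;
}

// Assumed list of metadata keys that must never be persisted in clear text.
const SENSITIVE_KEYS = new Set(["password", "token", "secret", "authorization"]);

// Return a redacted shallow copy; nested objects are passed through unchanged.
function redactMetadata(metadata: Record<string, unknown>): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(metadata)) {
    const lowered = key.toLowerCase();
    if (SENSITIVE_KEYS.has(lowered)) {
      result[key] = "[redacted]";
    } else if (lowered.includes("email") && typeof value === "string") {
      result[key] = maskEmail(value);
    } else {
      result[key] = value;
    }
  }
  return result;
}
```

Running `redactMetadata` immediately before serialization keeps the policy in one place instead of relying on each call site to remember it.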

Access logging on protected routes 0%

The codebase implements an audit logging mechanism for authenticated/sensitive actions using `audit(ctx, ...)` → `createAuditLog(...)`, which records a unique actor identifier (`ctx.user.id`) into the audit log table. However, access logging is not consistently applied to every authenticated/protected route: at least `settings.getWebServerSettings` and `application.one` are authenticated endpoints that return data/perform authorization without calling `audit(ctx, ...)` on those request paths.

  • high

    Add audit/access log calls to authenticated read routes like `settings.getWebServerSettings` and `application.one` so that every protectedProcedure path emits an auditable entry with `ctx.user.id` (actor attribution).

  • med

    Consider centralizing the audit enforcement in tRPC middleware for `protectedProcedure`/`adminProcedure` (e.g., a single wrapper that logs on every successful/meaningful request), to avoid future “present-but-not-on-every-path” drift.

    • apps/dokploy/server/api/trpc.ts:1-220 — Defines `protectedProcedure` and other procedure wrappers; moving/adding audit emission here would help guarantee coverage across all protected routes.
  • low

    Revisit the “fire-and-forget safe” error swallowing in `createAuditLog` if audit completeness is required for compliance questionnaires; ensure failures are observable (e.g., metrics/alerts) even if they don’t break the main operation.

Retention & secure deletion 8%

The codebase does implement retention-style cleanup jobs (metrics row deletion on a time cutoff, scheduled access-log truncation, and S3 volume-backup retention via rclone deletions). However, for the secure-deletion portion, there is no evidence of cryptographic wipe or explicit secure-disposal guarantees reaching backups/derived data beyond ordinary delete operations. As a result, retention enforcement exists, but the secure-deletion guarantees appear weak.

  • high

    Define and implement secure deletion semantics for backup/log/derived artifacts: document deletion guarantees (including impact on backups and retention copies), and where cryptographic wipe is required, use encryption-at-rest with per-item keys that can be revoked/destroyed so that deleted data becomes unrecoverable (including in backup systems).

  • med

    Add explicit deletion-on-request and ensure it cascades to all derived/related data and any external storage locations (e.g., backups, exports).

  • med

    For each retention job, add/configure measurable retention policy parameters (time window in days/months) and verify correctness via tests: e.g., ensure log retention is time-based (not just 'last N lines') and confirm cutoff behavior matches compliance requirements.

Secure defaults / hardening 56%

Secure defaults/hardening is partially implemented: Traefik middleware forces HTTP→HTTPS redirects and auth logging is disabled in production. However, the codebase also ships unsafe defaults: Better-Auth cookie hardening is explicitly weakened (`useSecureCookies: false`, `secure: false` under `!IS_CLOUD`) and Traefik’s API is configured as insecure (`api.insecure: true`). These defaults indicate that production hardening is not consistently enforced across all configuration paths.

  • high

    Harden Better-Auth cookie attributes for production: set both `advanced.useSecureCookies` and `defaultCookieAttributes.secure` to `true` when running in production (and ensure this is not relaxed merely based on `!IS_CLOUD`).

  • high

    Disable Traefik insecure API/dashboard in production: set `api.insecure` to false (and ensure dashboard access is protected via auth/middleware, not exposed insecurely).

  • med

    Audit debug/verbose error exposure across the full runtime path (Next.js and API handlers). Ensure production builds never return stack traces or verbose debug responses, and that any dev-mode behavior is enabled only when `NODE_ENV !== 'production'`.

    • apps/dokploy/server/server.ts:1-81 — Only sets `dev = process.env.NODE_ENV !== "production"` for Next.js; additional hardening for error/debug responses is not established in this bootstrap slice.

Not applicable to this codebase: Centralized key management, No secrets in git history.

Audit, Governance, Residency

An append-only audit_events table, a queryable audit API, and per-region infrastructure keyed on each tenant’s region.

28% 7/10 scored
  • Dedicated audit event store 100%
    3/3 expected sites
  • Append-only / tamper-evidence 0%
    0/3 expected sites not present
  • Comprehensive event coverage 93%
    5/5 expected sites
  • Queryable, provable audit access 0%
    0/3 expected sites
  • Audit retention & separation of duties 0%
    0/3 expected sites not present
  • No cross-region leakage 0%
    0/4 expected sites not present
  • Data-subject rights (export & erase) 0%
    0/2 expected sites not present
Dedicated audit event store 100%

This codebase includes a dedicated, structured audit event store (`audit_log`) with a dedicated schema and persistence layer. Sensitive governance/security mutations (e.g., security resource changes and custom role create/update/remove) are written to the dedicated audit store via a central `audit(ctx, ...)` helper, and there is a tenant-scoped audit log read endpoint. However, the audit writer explicitly swallows errors (fire-and-forget), and there’s no evidence of tamper-evidence (e.g., append-only enforcement or hash chaining) in the audited code slices.

  • high

    Make audit recording reliability provable: replace the current “fire-and-forget safe” behavior (errors swallowed) with a mechanism that guarantees persistence (or a compensating strategy) and surfaces failures to an auditable monitoring channel. Evidence: audit writes are currently wrapped in a try/catch that only logs to console.

  • high

    Enforce audit immutability/tamper-evidence at the storage layer: add/verify DB constraints (no UPDATE/DELETE paths), and ideally implement integrity validation (e.g., append-only with hash chaining/signatures) so an auditor can detect tampering.

  • med

    Expand and validate coverage for all sensitive actions beyond the slices confirmed: ensure permission changes, exports/downloads, and authentication events (login/logout/session changes) all emit structured audit events through the same dedicated store. Use `audit(ctx, ...)` usage coverage to identify any sensitive handler that lacks it.

    • apps/dokploy/server/api/utils/audit.ts:1-32 — Central audit helper exists; completeness depends on every sensitive action correctly calling this helper. The code slices reviewed show some usage, but do not prove full coverage.
Append-only / tamper-evidence 0%

The codebase has a structured audit_log table and a create/get API for audit entries, but there is no provable append-only / tamper-evidence implementation. The audit record schema lacks integrity/hash-chain fields and the write/read code does not implement or verify tamper-evidence (e.g., prev-hash chaining, signing, or integrity validation).

  • high

    Add tamper-evidence to the audit_log record: introduce integrity fields (e.g., prev_hash and record_hash computed over the full canonical event payload + prev_hash) and store them immutably with every append. Optionally sign records (or sign hash roots) for stronger non-repudiation.

  • high

    Enforce append-only behavior at the database layer for the audit evidence store: restrict UPDATE/DELETE on audit_log (via migrations/DB permissions, triggers, or storage engine policies). If deletes are required for lifecycle, require them to be restricted and separately audited and handled via a controlled purge mechanism.

  • med

    Implement verification support on the read/export path: when retrieving audit logs, compute/verify hash-chain continuity (or validate signatures) so an auditor can detect any gap/alteration in the returned evidence sequence.

  • low

    Improve audit-log write semantics to ensure tamper-evidence computation is deterministic: canonicalize metadata serialization (stable JSON canonicalization) before hashing so that re-serialization cannot produce different hashes for the same logical event.
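The hash-chaining and canonicalization recommendations above can be sketched together. The example below is an illustration under assumptions, not the project's implementation: the `AuditEvent` shape, the `prevHash`/`recordHash` field names, and the in-memory array standing in for the `audit_log` table are all hypothetical. It uses SHA-256 from Node's built-in `crypto` module.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes; the real audit_log schema may differ.
interface AuditEvent {
  action: string;
  actorId: string;
  metadata: Record<string, unknown>;
}
interface ChainedRecord extends AuditEvent {
  prevHash: string;
  recordHash: string;
}

const GENESIS = "0".repeat(64);

// Stable serialization: object keys are sorted so equal events hash equally.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`);
    return `{${parts.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

// Hash covers the previous record's hash plus the canonical event payload.
function hashRecord(event: AuditEvent, prevHash: string): string {
  return createHash("sha256").update(prevHash).update(canonicalize(event)).digest("hex");
}

function appendEvent(chain: ChainedRecord[], event: AuditEvent): ChainedRecord[] {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].recordHash : GENESIS;
  return [...chain, { ...event, prevHash, recordHash: hashRecord(event, prevHash) }];
}

// Returns the index of the first broken record, or -1 if the chain is intact.
function verifyChain(chain: ChainedRecord[]): number {
  let prevHash = GENESIS;
  for (let i = 0; i < chain.length; i++) {
    const { prevHash: storedPrev, recordHash, ...event } = chain[i];
    if (storedPrev !== prevHash || recordHash !== hashRecord(event, prevHash)) return i;
    prevHash = recordHash;
  }
  return -1;
}
```

A chain like this only detects tampering; it does not prevent it. Database-level restrictions on UPDATE/DELETE would still be needed, and an attacker who can rewrite the entire table could recompute every hash unless hash roots are signed or anchored externally.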

Comprehensive event coverage 93%

This codebase implements a dedicated structured audit event store (audit_log) and a central audit() helper that records action/resource events with tenant and actor attribution. Sensitive actions such as custom role create/update/delete, application create, and various admin settings operations are instrumented with audit() calls, and there is a tenant-scoped, permission-guarded API to read/paginate audit events. Overall, implementation is solid, though createAuditLog is “fire-and-forget safe” (errors are swallowed), which can reduce audit completeness guarantees.

  • high

    Strengthen audit completeness guarantees: change createAuditLog so that failures are not silently swallowed (or at least provide an out-of-band alerting mechanism and/or a durable retry queue), since current behavior can produce gaps without an auditable signal.

  • med

    Expand/verify coverage for other sensitive surfaces (especially login/logout, permission grant/revoke, and data exports) by confirming each such handler emits audit() with appropriate resourceType/resourceId and that auditLogRouter’s action enum covers them.

  • low

    Add documentation/tests that assert each security-relevant route calls audit() (coverage regression tests), ensuring future changes don’t remove audit instrumentation.

Queryable, provable audit access 0%

Queryable, tenant-scoped audit access exists: there is a structured `audit_log` store and a tRPC `auditLogRouter.all` endpoint that returns paginated, filtered audit entries for the caller’s active organization. However, the primitive’s “provable” requirements are not met in code that was found: there is no demonstrated export/evidence packaging endpoint, no cryptographic/tamper-evident integrity fields, and the audit write path swallows failures (reducing confidence in completeness).

  • high

    Add an auditable export path for audit evidence (e.g., `auditLog.export`) that produces independently-verifiable evidence for a selected tenant scope and time range, with an export format (and integrity checks) suitable for customer/auditor verification.

  • high

    Introduce cryptographic/tamper-evident integrity for audit records (e.g., hash-chain over entries, or signed records with verification data) and persist the integrity fields in the audit table.

  • med

    Remove silent failure behavior for audit writes (or make failures auditable and fail-safe). Ensure audit insertion failures cannot silently cause gaps without detectable signals.

Audit retention & separation of duties 0%

The codebase implements a structured audit-log storage model (audit_log) and provides a tenant-scoped, permission-gated read API and an audit() helper that writes audit entries. However, for the specific primitive “Audit retention & separation of duties”, there is no provable retention enforcement (no TTL/purge/retention window job/config found in the audited parts of the code), and no evidence that insiders/admins cannot shorten retention or that audit deletion/purging is itself audited. Therefore this primitive is not provably satisfied.

  • high

    Implement and document enforced audit-log retention: add a scheduled purge/archival job (or DB-level TTL/partition drop) that deletes/archives only after the required compliance window. Ensure the retention window is stored in a protected configuration (not editable by the same role that can administer the system) and that the purge action itself emits an audited event to the audit-log evidence store.

  • high

    Add separation-of-duties controls around retention changes and audit deletion: introduce role-gated APIs/admin endpoints for retention configuration (e.g., require a dedicated 'auditLog:adminRetention' permission), and add server-side authorization checks preventing admins who can manage infrastructure from being able to shorten the audit retention window or delete audit-log rows without generating an auditable record.

  • med

    Strengthen audit-log write reliability so missing audit events are detectable: avoid silent swallowing of audit logging failures (currently errors are swallowed). Instead, route failures to an auditable fallback and/or fail fast for privileged auditing paths, or at least record an internal “audit_log_write_failed” event in a separate evidence channel.

Data residency / region pinning N/A

The codebase does not implement data residency / region pinning for tenants. While there is a `region` input in the UI for external “destination” configuration (e.g., S3 region/endpoint selection), there is no evidence of a tenant-level region attribute that drives region-keyed data placement, compute placement, or internal routing to keep all tenant data in a pinned region.

  • high

    Introduce first-class tenant/org data residency region modeling (e.g., tenant.region / data_residency_region) and enforce it in the data-access layer and deployment/compute orchestration paths, with all write/read operations routed by that region.

  • high

    Ensure every tenant data sink (primary DB, backups/snapshots, replication, analytics/export pipelines, and any scheduled backup jobs) is region pinned and cannot fall back to a global/unpinned default. Add explicit region selection/validation to backup orchestration.

  • med

    Add region-aware configuration validation so that if a tenant is assigned to a pinned region, the system rejects or constrains configurations that would cause cross-region writes (including external destination/backup targets).

No cross-region leakage 0%

No evidence was found that the codebase enforces “no cross-region leakage” across backup/export/derived sinks. While destinations include a `region` attribute and backup code uploads to S3 using rclone flags derived from that destination region, there is no visible tenant-level residency constraint or validation that would block cross-region placement for backup sinks, so the primitive is not provably implemented.

  • high

    Introduce tenant-scoped residency policy and enforce it at every data-sink configuration point (at least backup destination create/update and at job execution time). Concretely: validate `destination.region` (and any endpoint/provider settings that imply geography) against the tenant’s allowed in-region set; reject or fail jobs when they don’t match.

  • high

    Add an execution-time “region gate” to backup job runners (fail closed). Even if a bad destination slips into configuration, the job should stop before calling rclone/exports.

  • med

    Perform a full sink inventory and extend the enforcement beyond object-store backups: ensure derived stores, scheduled volume backups, and any compose/restore/export paths also validate destination region/routing constraints.
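The execution-time “region gate” described above could look roughly like the following. This is a sketch under assumptions: the `TenantResidency` and `BackupDestination` shapes, the `allowedRegions` field, and the `runBackup` wrapper are hypothetical, and the `upload` callback stands in for whatever actually invokes rclone.

```typescript
// Hypothetical shapes for illustration only.
interface TenantResidency {
  organizationId: string;
  allowedRegions: string[];
}
interface BackupDestination {
  destinationId: string;
  region: string;
}

// Throws when the destination region is outside the tenant's allowed set.
// An empty allowed set rejects everything, so the gate fails closed.
function assertDestinationInRegion(tenant: TenantResidency, dest: BackupDestination): void {
  const allowed = new Set(tenant.allowedRegions.map((r) => r.trim().toLowerCase()));
  if (!allowed.has(dest.region.trim().toLowerCase())) {
    throw new Error(
      `destination ${dest.destinationId} (region "${dest.region}") is outside the allowed regions for ${tenant.organizationId}`,
    );
  }
}

// The gate runs before the upload callback, so a misconfigured destination
// stops the job before any data leaves the region.
function runBackup(
  tenant: TenantResidency,
  dest: BackupDestination,
  upload: (d: BackupDestination) => string,
): string {
  assertDestinationInRegion(tenant, dest);
  return upload(dest);
}
```

The same assertion can be reused at destination create/update time, so bad configurations are rejected early and the job-time gate acts as a second line of defense.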

Data-subject rights (export & erase) 0%

No end-to-end data-subject rights (DSR) system for export and erase is provably implemented in this codebase. There is a direct admin-style `removeUserById` that deletes the `user` row, but the expected DSR workflow (export-all, erase-on-request with cascade to backups/derived stores, and auditable proof) is not evidenced.

  • high

    Implement a dedicated data-subject export endpoint/handler that, for a verified subject, aggregates and returns all personal data from every relevant primary and derived store (not just one table). Ensure the export request and results are tied to an auditable DS-rights event.

  • high

    Replace/augment `removeUserById` with a DSR erase handler that performs cascade deletion across related personal-data tables (account/org membership, API keys, schedules, backups/derived stores) and also triggers/records erasure for backups/retention mechanisms as required by the primitive. Emit a structured, queryable, immutable audit record specifically for DS erasure.

    • packages/server/src/services/admin.ts:92-101 — Current implementation is a direct `db.delete(user).where(eq(user.id, userId))` with no demonstrated cascade/backup/derived-store erasure and no demonstrated DS-erasure audit mechanism in the shown code.
Customer-controlled keys N/A

No implementation of customer-controlled encryption keys (BYOK / customer-managed keys with per-tenant import, rotation, and revocation suitable for crypto-shredding) was found. While the codebase includes customer-supplied key material for other purposes (e.g., SSH key management), it does not implement per-tenant encryption key governance: there are no tenant-managed KMS/KV references, key-import/rotation/revocation flows, or crypto-shredding mechanisms tied to a customer-controlled key lifecycle.

  • high

    Confirm whether the platform encrypts tenant data at rest using per-tenant keys and whether any BYOK requirements exist for customers. If BYOK is required, implement a crypto-governance layer that stores per-tenant customer key references (e.g., KMS/KV URIs), supports customer key import/update, scheduled rotation, and customer-triggered revocation (crypto-shred), and wire it into all encryption/decryption paths.

Sub-processor / data-flow transparency N/A

No in-repo, versioned sub-processor / data-flow transparency artifact (e.g., a sub-processor inventory list for DPAs) was found in this codebase. While the code appears to include third-party integrations (e.g., Stripe and OpenAI SDK imports), there is no corresponding declared, auditable, versioned inventory of sub-processors/data flows that could be used to substantiate DPA governance for this primitive.

  • high

    Add a versioned, in-repo sub-processor / data-flow inventory (e.g., SUBPROCESSORS.md or SUBPROCESSORS.json) that lists each third party, the specific data categories touched, the purpose, the data-flow direction(s), and the relevant service endpoints (e.g., Stripe webhooks; OpenAI calls). Ensure it is kept current as integrations change (PR checklist + CI validation).

  • high

    Cross-check the inventory against the actual SDK usages (e.g., Stripe/OpenAI) and ensure the inventory entries match the code paths where data is sent to those vendors.

  • med

    If the organization uses DPA templates, make the DPA sub-processor section dynamically reference the versioned inventory (or at minimum require an explicit, reviewable update when integrations are added/changed).

Not applicable to this codebase: Data residency / region pinning, Customer-controlled keys, Sub-processor / data-flow transparency.

T2 Execution Velocity

Performance Primitives

A caching layer, an async job runtime, connection pooling, and indexes on the columns that actually need them.

42% 11/11 scored
  • Redundant work in loops 0%
    0/1 expected sites not present
  • Bounded interfaces 0%
    0/1 expected sites not present
  • Memoization / caching 67%
    1/1 expected sites
  • Resource reuse / pooling 33%
    1/3 expected sites
  • Off-critical-path execution 0%
    0/2 expected sites
  • Lookup data structures 100%
    1/1 expected sites
  • Batching round-trips 0%
    0/1 expected sites not present
  • Shared-state synchronization 100%
    1/1 expected sites
  • Bounded concurrency / backpressure 50%
    1/2 expected sites
  • Lazy / minimal computation 100%
    2/2 expected sites
  • Streaming over buffering 17%
    1/6 expected sites
Redundant work in loops 0%

No site was found where this codebase correctly applies the optimization pattern for 'redundant work in loops' (e.g., batching/hoisting expensive work out of an inner loop). However, there is at least one concrete should-be site: per-iteration DB writes inside the container metrics collection loop.

  • high

    Batch container metric persistence instead of calling cm.db.SaveContainerMetric(metric) once per iteration. For example: accumulate metrics in a slice during the loop, then insert them with a single bulk insert and/or a single transaction after the loop.

Bounded interfaces 0%

No correctly bounded collection-returning public surface was found for this primitive. In particular, `fetchTemplatesList(...)` returns an unbounded array from a JSON endpoint without any limit/cursor mechanism, indicating the bounded-interfaces primitive is not implemented at this code surface.

  • high

    Change `fetchTemplatesList` to be bounded: add `limit` (and ideally cursor/offset) parameters, pass them to the upstream API if supported, or enforce a hard cap client-side (e.g., slice to a maximum) and document it.
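A client-side hard cap with offset paging, as suggested above, might be sketched like this. The `paginate` helper, the `Page` shape, and the `DEFAULT_LIMIT`/`MAX_LIMIT` values are assumptions for illustration; they are not part of `fetchTemplatesList` today.

```typescript
// Hypothetical limits; real values would depend on payload size and UI needs.
const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 200;

interface Page<T> {
  items: T[];
  nextOffset: number | null; // null when there is nothing left to fetch
}

// Clamp caller-supplied paging values so a response can never exceed MAX_LIMIT items.
function paginate<T>(all: readonly T[], offset = 0, limit = DEFAULT_LIMIT): Page<T> {
  const safeLimit = Number.isFinite(limit)
    ? Math.min(Math.max(1, Math.floor(limit)), MAX_LIMIT)
    : DEFAULT_LIMIT;
  const safeOffset = Number.isFinite(offset) ? Math.max(0, Math.floor(offset)) : 0;
  const items = all.slice(safeOffset, safeOffset + safeLimit);
  const next = safeOffset + items.length;
  return { items, nextOffset: next < all.length ? next : null };
}
```

Note that slicing after download bounds only what callers receive; if the upstream endpoint supports `limit`/cursor parameters, passing them through would also bound the transfer itself.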

Memoization / caching 67%

The codebase contains a correct memoization/caching implementation: a module-level in-memory TTL cache for `getTrustedOrigins`. The cache is bounded and reused within a time window, but invalidation is only time-based (not immediately updated when underlying trusted origins change).

  • high

    Consider stronger invalidation than TTL alone for correctness/freshness. For example, clear/update `trustedOriginsCache` when trusted origins are modified (e.g., in the SSO/issuer update flows), or switch to a cache keyed by a version/updatedAt timestamp from the DB.

  • med

    Reduce stale failure risk: when `runQuery()` throws, the code returns `trustedOriginsCache?.data ?? []`. Consider preserving last known good values but only if they are still within TTL (or explicitly mark stale).
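To illustrate combining TTL expiry with explicit invalidation, here is a minimal sketch. The `TtlCache` class and its injected clock are hypothetical and not the existing `trustedOriginsCache`; the loader is synchronous to keep the example small, whereas the real query is presumably asynchronous.

```typescript
// Hypothetical single-value cache; a stand-in for a module-level cache variable.
class TtlCache<T> {
  private entry: { value: T; expiresAt: number } | null = null;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value while it is fresh; otherwise reload and re-cache.
  get(load: () => T): T {
    const t = this.now();
    if (this.entry !== null && t < this.entry.expiresAt) return this.entry.value;
    const value = load();
    this.entry = { value, expiresAt: t + this.ttlMs };
    return value;
  }

  // Call from any write path that changes the underlying data.
  invalidate(): void {
    this.entry = null;
  }
}
```

With this shape, a flow that modifies trusted origins would call `invalidate()` right after its write commits, so the next read reloads immediately instead of waiting for the TTL to lapse.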

Resource reuse / pooling 33%

The codebase has one clear, correct instance of resource reuse/pooling: a lazily-initialized, module-level singleton websocket/trpc client on the client side. Elsewhere (notably server websocket deployment log streaming), expensive handles are created per websocket connection (`new Client()` for SSH and `spawn('tail')`), which are candidates for pooling/reuse but are not implemented as such.

  • high

    For server-side deployment log streaming, evaluate reusing SSH clients and/or maintaining a bounded pool keyed by (serverId/connection params) to avoid creating `new Client()` per websocket connection when feasible. If full connection pooling is unsafe, consider at least caching established clients within a bounded scope and reusing them for multiple exec streams.

  • med

    For the non-cloud tailing path, consider reducing per-connection `spawn('tail')` overhead—e.g., a shared tailer per (logPath, serverId) with multiplexed websocket subscribers, or a bounded process pool.

  • low

    Keep the existing client-side singleton approach, but ensure it is robust across tab lifecycle changes (e.g., reconnection semantics, correct teardown on unmount if needed).

Off-critical-path execution 0%

This codebase does implement off-critical-path execution using BullMQ queues/workers for scheduled jobs and deployments (workers run `runJobs(...)` / deployment tasks asynchronously). However, some tRPC handlers (notably in the Postgres router) still perform deployment work inline via `await deployPostgres(...)`, which is exactly the anti-pattern this primitive targets.

  • high

    Refactor the Postgres tRPC handlers to enqueue deployment work instead of calling `deployPostgres(...)` inline. Concretely: replace `await deployPostgres(input.postgresId)` in `saveExternalPort` and the `deploy` mutation with a queue `add(...)` call that a worker processes (mirroring `apps/dokploy/server/queues/deployments-queue.ts`).

  • med

    Ensure the queued deployment job handler updates status and handles failures idempotently (retry-safe). If the existing deployment worker already provides this, reuse it; otherwise, wrap slow operations with retry/idempotency guarantees similar to the worker-based architecture.

  • low

    Optionally standardize background execution behavior for tRPC endpoints (e.g., always return an immediate “queued” response plus job id/state, and stream logs separately where needed) to avoid future inline regressions.

Lookup data structures 100%

The codebase does apply the Lookup data structures primitive at least in authorization logic: membership is checked via `.has` on a precomputed `accessibleIds` collection (Set-like behavior). No other clearly verifiable lookup-vs-linear-scan hot loop patterns were confirmed from the evidence gathered.

  • high

    Audit for remaining anti-patterns: in hot loops, replace repeated `array.includes(...)`, `array.find(...)`, or `array.some(...)` calls over the same collection with precomputed `Set`/`Map` lookups. Start with authorization and filtering paths that run per request or per item.

  • med

    Add/ensure helper utilities return lookup-ready structures (e.g., `getAccessibleServerIds` returning a Set) so call sites avoid converting or scanning arrays repeatedly.

Batching round-trips 0%

No implementation of true “batching round-trips” (i.e., coalescing many per-item I/O operations into fewer grouped/bulk calls) was found. The main discovered hot path performs one fetch per item (event → runs) rather than batching those run lookups into a single bulk request.

  • high

    Replace the per-event `fetchInngestRunsForEvent(ev.id)` round-trips with a bulk endpoint (if Inngest supports fetching runs for multiple events at once) or add/introduce a server-side batching layer that issues fewer grouped requests (e.g., `fetchInngestRunsForEvents(eventIds[])`).

  • med

    If the upstream API cannot batch, implement batching at the boundary by chunking event ids into groups (e.g., fetch 10–50 event ids per grouped call) using whatever aggregation technique is available (parallelism is not batching; it only hides latency).

    • apps/api/src/service.ts:220-240 — Current implementation uses per-item fetches inside `Promise.all`; chunking would reduce total I/O operations if a grouped transport exists.
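The chunking approach described above can be sketched as follows. It assumes a grouped transport exists: `fetchGroup` is a hypothetical callback that fetches runs for several event ids in one request, and nothing here confirms that the upstream API offers such an endpoint.

```typescript
// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Issue one grouped request per chunk instead of one request per event id.
// `fetchGroup` is an assumed bulk call, not a known upstream API.
async function fetchRunsInBatches<R>(
  eventIds: readonly string[],
  batchSize: number,
  fetchGroup: (ids: string[]) => Promise<R[]>,
): Promise<R[]> {
  const results: R[] = [];
  for (const group of chunk(eventIds, batchSize)) {
    results.push(...(await fetchGroup(group)));
  }
  return results;
}
```

For 100 event ids and a batch size of 25, this issues 4 requests rather than 100. Running the same 100 single-item fetches inside `Promise.all` would still be 100 round-trips.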
Shared-state synchronization 100%

The codebase contains one clear, correct application of shared-state synchronization: a per-service volume backup lock (`flock`/fallback) wrapped around the stop/backup/upload/start workflow to prevent concurrent backup executions from interfering with shared resources.

  • low

    Add/verify tests or runtime instrumentation around the locking behavior (e.g., ensure the lock is always released on failure paths and that lockPath naming cannot collide unexpectedly between services).

Bounded concurrency / backpressure 50%

The codebase uses BullMQ workers and (in the schedules service) configures an explicit `concurrency` cap, which provides bounded concurrency for scheduled backup jobs. However, the deployment worker in the dokploy server does not set any explicit concurrency/backpressure options, leaving deployment work potentially unbounded at the worker level.

  • high

    Add an explicit BullMQ worker concurrency/backpressure configuration to the deployments worker (e.g., `concurrency: <number>` and/or queue-level settings) so deployment tasks cannot run with unbounded in-flight execution.

  • med

    Review whether the existing backup worker `concurrency: 100` is appropriate for backpressure (e.g., lower it or make it configurable per environment/resource type) to prevent resource exhaustion under load.

Lazy / minimal computation 100%

The codebase applies Lazy/Minimal Computation correctly in a few key boundary spots: UI data fetching is gated by permissions and derived collections are computed lazily; server-side aggregation limits the amount of follow-up I/O by capping the event subset before fetching runs.

  • high

    Audit other API/service functions that do fan-out I/O (fetching secondary resources per item) for early bounding (slice/limit) and conditional short-circuiting similar to `fetchDeploymentJobs`.

    • apps/api/src/service.ts:220-240 — Shows the desired pattern (cap `toFetch` before `Promise.all` fetches). Replicate/standardize this across other fan-out fetch flows.
  • med

    In UI components, prefer `enabled` gating and `useMemo` early-return patterns for any derived sort/filter logic tied to optional data (similar to `recentDeployments`).

Streaming over buffering 17%

The primitive exists: `readMonitoringConfig(readAll=false)` correctly uses streaming + bounded early termination for `access.log`. However, the same area has a buffering anti-pattern when `readAll=true` (whole-file `readFileSync`). Several other config/compose/template loaders read whole files/outputs into memory before parsing, which would ideally be streamed or size-limited for arbitrarily large inputs.

  • high

    Fix the `readAll === true` path in `readMonitoringConfig` to avoid whole-file buffering: either always stream with `readline` (and collect/return via a bounded mechanism), or implement a hard size/line limit and/or return an iterator/chunked response instead of a single full string. Evidence: the `readAll=false` branch already demonstrates the preferred streaming pattern.

  • med

    For compose/config/template loaders that currently do `readFileSync` / remote `cat` + full-string parsing, add explicit size/line guards and consider streaming parsing where feasible. At minimum, enforce maximum allowed file size before buffering/`parse()`.

Reliability Primitives

Retries, circuit breakers, idempotency keys, health checks, and a runbook for each service.

23% 11/11 scored
  • Timeouts 0%
    0/4 expected sites
  • Retry with backoff + jitter 0%
    0/1 expected sites not present
  • Idempotency 0%
    0/2 expected sites not present
  • Circuit breaking / fail-fast 0%
    0/1 expected sites
  • Graceful degradation / fallback 100%
    2/2 expected sites
  • Error handling & propagation 0%
    0/4 expected sites
  • Deterministic resource cleanup 50%
    1/2 expected sites
  • Atomicity / all-or-nothing 0%
    0/3 expected sites
  • Input / boundary validation 57%
    4/7 expected sites
  • Failure isolation / bulkheading 0%
    0/2 expected sites not present
  • Graceful shutdown 44%
    3/3 expected sites
Timeouts 0%

The codebase does implement timeouts for some external calls (notably `fetch` via `AbortSignal.timeout(...)` and a TCP readiness check via `socket.setTimeout` + an overall cap). However, several high-risk I/O boundaries still lack timeouts—especially external HTTP calls to the jobs service, Gitea token refresh, and SSH/remote command execution used for docker log/command streaming.

  • high

    Add a timeout/abort signal to all `fetch(...)` calls that contact external services (e.g., wrap with `AbortSignal.timeout(<ms>)`), starting with `apps/dokploy/server/utils/backup.ts` where the jobs-service endpoints are called without any deadline.

  • high

    Bound Gitea token refresh HTTP requests with an `AbortSignal.timeout(...)` so `refreshGiteaToken` can’t hang indefinitely on a slow/unresponsive Gitea endpoint.

  • high

    Add explicit timeouts to the SSH remote execution path: ensure the Promise rejects on connection timeout and command timeout, and always clean up (`conn.end()`) in the timeout/error path.

  • med

    For the docker log WebSocket SSH streaming path, enforce timeouts on both SSH connect and the `exec` command; ensure the WebSocket handler terminates/cleans up when the timeout triggers.
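For operations that do not accept an abort signal, such as the SSH execution paths above, a deadline can be imposed from the outside. The helper below is a generic sketch: `withDeadline` and its `onTimeout` cleanup hook are hypothetical names, and the hook is where a call such as `conn.end()` would go.

```typescript
// Reject if `work` has not settled within `ms`; run `onTimeout` first so the
// caller can release resources (for example, close an SSH connection).
function withDeadline<T>(work: Promise<T>, ms: number, onTimeout?: () => void): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      if (onTimeout) onTimeout();
      reject(new Error(`operation exceeded ${ms} ms`));
    }, ms);
    work.then(
      (value) => {
        clearTimeout(timer); // settled in time: cancel the pending deadline
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

This wrapper does not cancel the underlying work by itself, which is why the cleanup hook matters. For `fetch(...)` calls, passing `AbortSignal.timeout(<ms>)` as the recommendations suggest is preferable because it aborts the request itself.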

Retry with backoff + jitter 0%

The codebase does not implement the targeted reliability primitive (exponential backoff + jitter with a capped budget). A Postgres readiness loop exists and retries transient failures with a fixed sleep delay, but it does not include exponential backoff and jitter—so it remains an unmatched should-be site.

  • high

    Replace the fixed-delay retry in wait-for-postgres.ts with exponential backoff plus jitter (and keep the existing TIMEOUT_MS as the capped budget). Also consider retrying only for transient connection errors (e.g., ECONNREFUSED/timeout), and fail fast for non-transient misconfigurations.
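Exponential backoff with jitter under a capped budget can be sketched as a pure delay calculation. The names `backoffDelay`, `planDelays`, and the option fields are illustrative assumptions; the variant shown is commonly called "full jitter", where each delay is drawn uniformly below an exponentially growing ceiling.

```typescript
interface BackoffOptions {
  baseMs: number; // ceiling for the first retry
  maxDelayMs: number; // upper bound for any single delay
  random?: () => number; // injectable for tests; defaults to Math.random
}

// Full jitter: uniform delay in [0, min(maxDelayMs, baseMs * 2^attempt)).
function backoffDelay(attempt: number, opts: BackoffOptions): number {
  const random = opts.random ?? Math.random;
  const ceiling = Math.min(opts.maxDelayMs, opts.baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Plan retry delays until either maxAttempts or the total time budget is reached.
function planDelays(maxAttempts: number, budgetMs: number, opts: BackoffOptions): number[] {
  const delays: number[] = [];
  let spent = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const delay = backoffDelay(attempt, opts);
    if (spent + delay > budgetMs) break; // next sleep would exceed the budget
    delays.push(delay);
    spent += delay;
  }
  return delays;
}
```

In a readiness loop, the budget would be the existing `TIMEOUT_MS`, and each failed connection attempt would sleep for the next planned delay instead of a fixed interval.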

Idempotency 0%

No clear idempotency/deduplication primitive or pattern (e.g., Stripe event-id guard, request-id dedup, or upsert/uniqueness-based mutation protection) was found in the code paths that perform retryable external-triggered mutations. The Stripe webhook handler is the main high-risk entry point: it mutates the database and triggers side effects but does not deduplicate webhook events.

  • high

    Make the Stripe webhook handler idempotent by persisting processed `event.id` (or Stripe `event.type`+unique identifiers) in a DB table with a unique constraint, and return 200 immediately when an event was already processed. Ensure this check happens before any mutation (including server-status updates and notification emails).

    • apps/dokploy/pages/api/stripe/webhook.ts:56-404 — This handler switches on `event.type` and performs DB updates and calls to `sendInvoiceEmail` / `sendPaymentFailedEmail` and `updateServersBasedOnQuantity`, but there is no observed dedup check for the Stripe event id before mutations.
  • high

    Strengthen server status updates to be idempotent under replay: either (a) rely on the webhook event dedup gate above, or (b) make `activateServer`/`deactivateServer` conditional updates that only write when the status actually needs to change (and consider optimistic checks with row counts).
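The event-id dedup gate recommended above can be illustrated with an in-memory stand-in. Everything here is a sketch under assumptions: `ProcessedEventStore` substitutes for a database table with a unique constraint on the event id, and `handleWebhook` and `applyEffects` are hypothetical names rather than the existing handler.

```typescript
interface WebhookEvent {
  id: string;
  type: string;
}

// In-memory stand-in for a table with a UNIQUE constraint on event id.
class ProcessedEventStore {
  private readonly seen = new Set<string>();

  // Returns true only for the first caller to claim this id.
  claim(eventId: string): boolean {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }

  // Undo a claim so a provider retry can process the event again.
  release(eventId: string): void {
    this.seen.delete(eventId);
  }
}

function handleWebhook(
  event: WebhookEvent,
  store: ProcessedEventStore,
  applyEffects: (e: WebhookEvent) => void,
): "processed" | "duplicate" {
  // The dedup check runs before any mutation or notification.
  if (!store.claim(event.id)) return "duplicate";
  try {
    applyEffects(event);
    return "processed";
  } catch (err) {
    store.release(event.id);
    throw err;
  }
}
```

In a real handler, the claim and the mutations would ideally share one database transaction, so a crash between the two cannot leave an event marked as processed without its effects applied.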

Circuit breaking / fail-fast 0%

A `CircuitBreakerMiddleware` type/interface exists for Traefik configuration, but there is no evidence of an application-level circuit breaker (tracking failure rate, opening, and probing before closing) wrapping calls to unreliable dependencies. At least the Gitea token refresh flow performs external `fetch` with basic error handling but no circuit-breaker fail-fast behavior.

  • high

    Introduce an application-level circuit breaker around external dependency calls (start with token refresh calls like Gitea/GitLab providers). Ensure it tracks failure rate, opens to fail immediately, and uses a probe/half-open state before closing; also add a timeout to bound hangs.

  • med

    If Traefik-level circuit breaking is intended, ensure the middleware is actually wired into generated Traefik dynamic config (not just typed). Add/verify production config generation that sets Traefik’s circuit breaker parameters and attaches the middleware to relevant routers/services.
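
An application-level breaker of the kind recommended above might look like the sketch below. It is not taken from the codebase: `CircuitBreaker` and its parameters are hypothetical, and for brevity it counts consecutive failures rather than tracking a failure rate over a window.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical breaker: opens after N consecutive failures, probes after a cooldown.
export class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  currentState(): BreakerState {
    return this.state;
  }

  async call<T>(op: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast");
      }
      this.state = "half-open"; // cooldown elapsed, let one probe through
    }
    try {
      const result = await op();
      this.state = "closed";
      this.consecutiveFailures = 0;
      return result;
    } catch (err) {
      this.consecutiveFailures++;
      // A failed probe, or too many failures in a row, (re)opens the circuit.
      if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

A token refresh call could then be wrapped as `breaker.call(() => refresh())`, with the wrapped function also carrying its own timeout so a hung request counts as a failure.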

Graceful degradation / fallback 100%

The codebase does include graceful degradation/fallback. The clearest implementation is in `packages/server/src/services/admin.ts`: it caches trusted origins and, if the DB fetch fails, returns the cached value (staleness bounded by the TTL) or an explicit empty list. A second instance degrades safely when loading trusted providers by returning `[]` on error.

  • med

    Search for other places where the system loads optional configuration/auxiliary lists (e.g., auth-related allowlists, feature metadata, dashboard-only data) and ensure their error branches return an explicit reduced default or cached value instead of throwing.

Error handling & propagation 0%

The primitive exists (there are correct error-handling patterns), but several fallible operations still collapse failures into a bare `false` without logging or wrapping, notably through empty catch blocks. Where the catch block has a body, it often logs and propagates (e.g., updateGitea and rawConfig parsing).

  • high

    Replace empty catches that silently convert failures into false (dockerSwarmInitialized/dockerNetworkInitialized) with either (a) logging the error and then returning false, or (b) propagating a wrapped error with context, depending on caller expectations.

  • high

    For containerExists, stop silently collapsing inspect failures into a 'does not exist' result; log and/or rethrow with context (or return a structured result that distinguishes 'not found' vs 'inspect failed').

  • med

    For the health-check hook, log fetch errors (or propagate them into state) rather than swallowing them and returning false; at minimum include context (e.g., URL, error message) to avoid silent failures during polling.
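
The structured-result option mentioned for containerExists can be sketched as below. This is an illustration only: `checkContainer`, the `ContainerCheck` type, and the injected `inspect` / `isNotFound` callbacks are hypothetical and do not reflect the actual Docker client calls in the repository.

```typescript
// Three outcomes instead of a boolean, so "inspect failed" is never read as "absent".
type ContainerCheck =
  | { kind: "exists" }
  | { kind: "not-found" }
  | { kind: "inspect-failed"; cause: unknown };

export async function checkContainer(
  containerId: string,
  inspect: (id: string) => Promise<unknown>, // hypothetical wrapper around the Docker inspect call
  isNotFound: (err: unknown) => boolean, // hypothetical classifier, e.g. a 404 check
  log: (message: string) => void = console.error,
): Promise<ContainerCheck> {
  try {
    await inspect(containerId);
    return { kind: "exists" };
  } catch (err) {
    if (isNotFound(err)) return { kind: "not-found" };
    // Log with context rather than swallowing the failure.
    log(`inspect failed for container ${containerId}: ${String(err)}`);
    return { kind: "inspect-failed", cause: err };
  }
}
```

Callers that only need a boolean can still derive one from `kind`, but they have to decide explicitly how to treat `inspect-failed`.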

Deterministic resource cleanup 50%

The primitive exists in parts of the codebase: `runWebServerBackup` deterministically cleans up its temp directory via `finally`, including on error paths. However, other acquired resources are not deterministically cleaned up: the log write stream in the backup flow is ended explicitly on individual code paths rather than in a `finally`, and the SSH connection in `execAsyncRemote` is not ended when `conn.exec(...)` itself fails before stream ‘close’/connection error handlers fire.

  • high

    In `runWebServerBackup`, deterministically close/destroy `writeStream` in a `finally` that covers both the inner success path and the outer `catch` path. Ensure it runs even when an error occurs after the stream is created, since `writeStream.end()` is currently reached only on specific paths.

    • packages/server/src/utils/backups/web-server.ts:36-129 — A `writeStream` is created and used throughout. `writeStream.end()` is called on the success path and again in the outer `catch` after writing an error, but there is no `finally` guaranteeing stream closure across all internal early exits.

  • high

    In `execAsyncRemote`, ensure `conn.end()` is called when `conn.exec(command, (err, stream) => ...)` hits the immediate `if (err) { reject(...); return; }` branch (i.e., add cleanup there or use a shared `finally`/guard).

  • med

    Standardize this pattern: whenever the code acquires a handle (fs stream, SSH client, db client, temp dir), ensure release is in the same lexical scope via `try/finally` (or equivalent) at the acquisition site.
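
The acquire/release pattern above can be captured in one small helper. This is a sketch under assumptions: `withResource` is a hypothetical name, and the callbacks stand in for whatever opens and closes the stream, SSH client, or temp directory.

```typescript
// Hypothetical helper: release always runs, whether `use` resolves or throws.
export async function withResource<R, T>(
  acquire: () => Promise<R>,
  release: (resource: R) => Promise<void> | void,
  use: (resource: R) => Promise<T>,
): Promise<T> {
  const resource = await acquire();
  try {
    return await use(resource);
  } finally {
    await release(resource);
  }
}
```

For an SSH client, `release` would be the place for `conn.end()`, so the early `reject(...)` branch no longer needs its own cleanup call.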

Atomicity / all-or-nothing 0%

Atomicity is present in at least one place: notification creation/update correctly wraps related multi-table inserts in a single DB transaction. However, several deployment/backup creation flows do multi-step sequences (external log setup + DB writes) without any transaction spanning the whole operation, meaning failures can leave the system in an observable half-completed or inconsistent state.

  • high

    For deployment creation flows, wrap the entire related DB mutation sequence in a single transaction (e.g., insert the deployment/backup row and any corresponding status updates), and ensure the catch path does not introduce a second inconsistent row without compensating/rollback logic. If external side effects (remote mkdir/echo) must remain, gate them to occur after the DB transaction commits, or use a transactional outbox / compensating action pattern.

  • med

    Avoid inserting a second “error” deployment/backup record after a failure that occurs mid-flow unless it is strictly designed to be consistent. Prefer updating the originally inserted record to `error` (within the same transaction) rather than inserting a new one.

Input / boundary validation 57%

This codebase includes boundary validation for several high-risk public API entry points—especially webhook handlers (GitHub and Stripe) and OAuth (Gitea). These entry points validate required headers/query parameters, verify cryptographic signatures, and reject invalid/unsupported event types before acting. However, other API entry points (e.g., deploy/[refreshToken] and deploy/compose/[refreshToken]) are also trust boundaries that should validate refreshToken/query-derived inputs at the boundary; based on the portions read, the handlers rely heavily on downstream DB lookups and internal branching rather than explicit “shape/range” validation for those entry inputs.

  • high

    Add explicit boundary validation for the deploy webhook entry points’ query params (refreshToken) before using them in DB queries and downstream logic. For example: reject missing refreshToken, array refreshToken, and non-string/empty values with a clear 400 response.

  • med

    Harden request-body derived values used in branching logic (e.g., req.body.commits, refs, repository/owner fields) with explicit type/shape checks at the boundary, not only through optional chaining and comparisons. This prevents malformed payloads from flowing into arrays (flatMap) or string operations.

  • low

    For completeness, apply a consistent error response contract for invalid webhook payload shape (400 with a non-sensitive message) across all webhook providers, so bad inputs fail fast and uniformly.
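
The boundary check for `refreshToken` can be sketched as a pure function. The names `parseRefreshToken` and `MAX_TOKEN_LENGTH`, and the length cap itself, are assumptions for illustration; the input type reflects that Next.js query values arrive as `string | string[] | undefined`.

```typescript
type TokenParse =
  | { ok: true; token: string }
  | { ok: false; status: 400; message: string };

// Assumed upper bound; the real token format is not specified in the audit.
const MAX_TOKEN_LENGTH = 256;

export function parseRefreshToken(raw: string | string[] | undefined): TokenParse {
  if (raw === undefined) {
    return { ok: false, status: 400, message: "refreshToken is required" };
  }
  if (Array.isArray(raw)) {
    return { ok: false, status: 400, message: "refreshToken must be a single value" };
  }
  if (raw.trim().length === 0) {
    return { ok: false, status: 400, message: "refreshToken must not be empty" };
  }
  if (raw.length > MAX_TOKEN_LENGTH) {
    return { ok: false, status: 400, message: "refreshToken is too long" };
  }
  return { ok: true, token: raw };
}
```

A handler would call this first and return the 400 before any database lookup runs.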

Failure isolation / bulkheading 0%

The codebase does not implement clear failure-isolation/bulkheading for independent workloads sharing a common queue/worker resource. Jobs for different logical subsystems (applications vs compose vs previews; deploy vs redeploy) are handled by a single shared BullMQ queue/worker, without visible per-partition resource caps or separate isolation boundaries. Additionally, the worker error branch catches errors but only logs them, rather than ensuring isolation via bounded, partitioned execution.

  • high

    Introduce bulkheading by splitting the shared `deployments` queue into separate queues/workers for the independent workload classes (e.g., applications vs compose vs previews, and/or deploy vs redeploy), and/or enforce per-partition concurrency limits. Ensure each queue has its own Worker instance and (ideally) its own redis/pool configuration if that’s part of the failure mode.

  • med

    Add explicit safeguards to prevent one partition from starving the rest: separate queue names, set BullMQ worker `concurrency` per queue, and consider per-tenant/per-application rate limiting (so a hot application can’t monopolize the shared worker).

  • low

    Improve the worker failure path to propagate/retry in a controlled manner (with backoff and bounded retries) rather than only `console.log`. While this isn’t bulkheading by itself, it reduces the chance of repeated failures consuming the shared worker capacity.
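
The per-partition limit idea can be shown without any queue library. The sketch below is hypothetical (`Bulkhead` is not a BullMQ API): each workload class gets its own in-flight cap, so one saturated class cannot consume capacity reserved for another.

```typescript
// Hypothetical in-process bulkhead; partition names and limits are examples.
export class Bulkhead {
  private readonly limits: Record<string, number>;
  private readonly active = new Map<string, number>();

  constructor(limits: Record<string, number>) {
    this.limits = limits;
  }

  async run<T>(partition: string, task: () => Promise<T>): Promise<T> {
    const limit = this.limits[partition];
    if (limit === undefined) throw new Error(`unknown partition: ${partition}`);
    const inFlight = this.active.get(partition) ?? 0;
    // Reject instead of queueing, so a hot partition sheds its own load.
    if (inFlight >= limit) throw new Error(`partition ${partition} is at capacity`);
    this.active.set(partition, inFlight + 1);
    try {
      return await task();
    } finally {
      this.active.set(partition, (this.active.get(partition) ?? 1) - 1);
    }
  }
}
```

With separate BullMQ queues, the equivalent control would be one Worker per queue, each with its own `concurrency` setting.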

Graceful shutdown 44%

The codebase contains a `gracefulShutdown` primitive in `apps/schedules/src/index.ts`, and there is at least one SIGTERM handler elsewhere (`apps/dokploy/server/queues/queueSetup.ts`). However, the shutdown logic frequently ends with `process.exit(0)` immediately after closing some resources, and the main dokploy server (`apps/dokploy/server/server.ts`) does not show any termination-signal draining/closing wiring at the process entry point—risking dropped in-flight HTTP/WS work and incomplete flushing during shutdown.

  • high

    Replace immediate `process.exit(0)` in shutdown handlers with a proper sequence: stop accepting new requests (close/stop the HTTP listener), drain in-flight requests and WS connections, stop background job processing, await all worker/queue shutdown promises, and only then exit (optionally with a bounded timeout).

  • high

    Add a termination-signal handler at the dokploy main server entry (`apps/dokploy/server/server.ts`) that coordinates shutdown of: HTTP server (`server.close()`), all WebSocket servers, and the background workers/cron jobs started in this entry point.

    • apps/dokploy/server/server.ts:1-81 — This file starts the long-running HTTP server and initializes WS and background work, but the shown code does not register SIGTERM/SIGINT graceful-drain logic.

  • med

    For BullMQ shutdown (`queueSetup.ts`), ensure the handler also prevents new jobs from being enqueued/processed during shutdown, and waits for running jobs to finish (or cancels with a bounded grace period) rather than only closing the queue handle.
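
The ordered, time-bounded shutdown sequence can be sketched as below. `shutdownInOrder` and `ShutdownStep` are hypothetical names; the steps are plain async callbacks, so the sketch makes no claim about how the Dokploy server, its WebSocket servers, or its workers are actually closed.

```typescript
interface ShutdownStep {
  label: string;
  run: () => Promise<void>;
}

// Runs the steps in order, but gives up once the grace period has elapsed.
export async function shutdownInOrder(
  steps: ShutdownStep[],
  graceMs: number,
): Promise<"clean" | "timed-out"> {
  let fireTimeout: () => void = () => {};
  const deadline = new Promise<"timed-out">((resolve) => {
    fireTimeout = () => resolve("timed-out");
  });
  const timer = setTimeout(() => fireTimeout(), graceMs);
  const work = (async (): Promise<"clean"> => {
    for (const step of steps) {
      await step.run();
    }
    return "clean";
  })();
  try {
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer); // do not keep the process alive after a clean finish
  }
}
```

A SIGTERM handler could pass steps such as "stop the HTTP listener", "close WebSocket servers", and "close queue workers", then choose the exit code from the returned outcome instead of calling `process.exit(0)` unconditionally.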

API & Extensibility

A checked-in OpenAPI spec, versioned routes, a webhook system with retries and signing, and tenant-scoped rate limits.

0% 7/10 scored
  • Machine-readable API contract 0%
    0/2 expected sites not present
  • Versioning & backward compatibility 0%
    0/3 expected sites
  • Programmatic auth with scopes 0%
    0/3 expected sites
  • Idempotent writes 0%
    0/5 expected sites not present
  • Consistent pagination & filtering 0%
    0/1 expected sites
  • Outbound events / webhooks 0%
    0/3 expected sites not present
  • Consistent errors & status codes 0%
    0/6 expected sites not present

Machine-readable API contract 0%

The repository includes logic to generate and serve an OpenAPI contract (a generation script and a Swagger UI page), but the machine-readable API contract artifact (e.g., checked-in `openapi.json`/`openapi.yaml`) is not present in the repo. Therefore, third-party consumers cannot rely on a stable, discoverable, checked-in spec without running generation.

  • high

    Check in the generated OpenAPI artifact (e.g., commit `openapi.json` at the repo root, or `openapi.yaml` under the API module) and ensure it covers the full public TRPC surface exposed via `appRouter`. Wire CI to regenerate and fail if the committed spec drifts from the router.

  • med

    Make the contract versioning explicit in the spec (title/version already exists in the generator) and expose a stable docs URL that points to the committed artifact (not only a runtime TRPC call).

Versioning & backward compatibility 0%

The codebase generates an OpenAPI document (with version metadata), but there is no clear HTTP versioning strategy, no deprecation/sunset markers, and no version negotiation in the actual public endpoints defined in apps/api/src/index.ts (e.g., /deploy, /cancel-deployment, /jobs are unversioned). As a result, backward compatibility appears to rely on convention rather than an explicit, discoverable contract governance mechanism.

  • high

    Introduce an explicit versioning strategy for the public HTTP service in apps/api/src/index.ts (e.g., /v1/deploy, /v1/cancel-deployment, /v1/jobs) and define deprecation/sunset headers (e.g., Deprecation, Sunset) plus migration links when changing request/response schemas or error shapes.

  • med

    Align contract governance with the OpenAPI generation: ensure the generated spec covers the full published HTTP surface, that only breaking changes trigger a new version (minor changes evolve additively), and that contract tests prevent silent breaking schema drift.

Programmatic auth with scopes 0%

The codebase does have programmatic API-key authentication using an `x-api-key` header verified by `better-auth` and integrated into the shared request validation (`validateRequest`). However, the implementation does not clearly enforce per-credential scopes/permissions on the request path; it authenticates the key and associates an organization, but scope enforcement is not evident in the request context setup. A third-party integrator can therefore authenticate but cannot rely on stable per-credential scope semantics, and revocation, rotation, and least-privilege behavior are not exposed as an enforced contract.

  • high

    Enforce per-API-key scope/permissions in `validateRequest`: load the API key’s `permissions` (or whatever scope model is intended), and map it into `ctx.user`/authorization checks so every protected endpoint consistently applies least-privilege based on the credential’s scopes.

    • packages/server/src/lib/auth.ts:240-520 — Shows `validateRequest` verifying `x-api-key`, loading the `apikey` record, extracting `organizationId` from `apiKeyRecord.metadata`, loading the member, and constructing a mock `session`/`user`—but there is no visible step that applies API-key-specific permissions/scopes to the authorization model.

  • high

    Add/confirm API key scope fields are actually written and used: ensure API-key creation (`createApiKey` + related endpoints) stores scopes/permissions in a dedicated, enforceable place (not only UI-only fields), and that `validateRequest` reads those exact fields.

    • packages/server/auth-schema2.ts:60-130 — The `apikey` table includes `permissions`, `expiresAt`, and last-used/rate-limit fields, but the request validation path must explicitly use them to provide a true scoped-credential primitive.

  • med

    Create a stable public contract for API credentials (scope model + revocation/rotation semantics + last-used + rotation cadence expectations) and expose it consistently in the API docs/spec (or the existing Swagger route page if it is intended to be consumable).

    • apps/dokploy/pages/swagger.tsx:1-120 — A Swagger page exists (indicating intent for a machine-discoverable contract). The scoped-credential primitive should be discoverable there with the scope model and auth scheme, not just implemented internally.
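
A deny-by-default scope check of the kind recommended for `validateRequest` might look like this sketch. The `Permissions` shape (resource name mapped to allowed actions) is an assumption about how the `permissions` field could be structured, and `isAllowed` / `requireScope` are hypothetical helpers.

```typescript
// Assumed shape: { deployments: ["read", "create"], backups: ["read"] }
type Permissions = Record<string, string[]>;

export function isAllowed(
  granted: Permissions | null | undefined,
  resource: string,
  action: string,
): boolean {
  if (!granted) return false; // a key with no recorded scopes gets nothing
  const actions = granted[resource];
  return Array.isArray(actions) && actions.indexOf(action) !== -1;
}

// Throws so that a protected procedure cannot proceed without the scope.
export function requireScope(
  granted: Permissions | null | undefined,
  resource: string,
  action: string,
): void {
  if (!isAllowed(granted, resource, action)) {
    throw new Error(`forbidden: missing scope ${resource}:${action}`);
  }
}
```

The key design choice is that absence of data means denial, so keys created before scopes existed do not silently gain full access.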

Per-tenant rate limiting N/A

No implementation of per-tenant/per-organization/per-consumer rate limiting was found at the server API edge. Searches for rate-limit/ratelimit/throttle/limiter/bucket4j-style middleware and related symbols returned no wiring, and the API routing appears to be handled via tRPC procedures without any tenant-scoped limiter middleware or standardized 429 + limit/remaining headers.

  • high

    Add an API-edge rate limiting middleware for tRPC requests that keys the bucket by the tenant/organization/consumer identifier (e.g., ctx.session.activeOrganizationId or authenticated user/app key), and returns 429 with standardized Limit/Remaining headers plus retry guidance.

  • med

    Implement and enforce a shared error/response contract for rate-limit rejections (machine code + correlation/request id) so integrators can programmatically detect rate limiting and back off safely.

  • low

    Document the rate limiting policy in the generated OpenAPI/tRPC doc surface (or a dedicated API docs page), including limits, scope, and header semantics.

    • apps/dokploy/pages/swagger.tsx:1-116 — Swagger spec is served dynamically from tRPC, but this file does not show any rate-limit policy or headers being modeled/enforced at the API layer.
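
A tenant-keyed limiter can be sketched with a fixed-window counter. Everything here is hypothetical (`TenantRateLimiter`, the `RateDecision` fields, the window policy); a production version would normally keep its counters in a shared store such as Redis rather than in process memory.

```typescript
interface RateDecision {
  allowed: boolean;
  status: 200 | 429;
  limit: number;
  remaining: number;
  retryAfterSec: number; // 0 when the request is allowed
}

// Fixed-window counter keyed by tenant id (assumed policy, for illustration).
export class TenantRateLimiter {
  private readonly windows = new Map<string, { startedAt: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(limit: number, windowMs: number, now: () => number = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  check(tenantId: string): RateDecision {
    const t = this.now();
    let w = this.windows.get(tenantId);
    if (w === undefined || t - w.startedAt >= this.windowMs) {
      w = { startedAt: t, count: 0 }; // start a fresh window for this tenant
      this.windows.set(tenantId, w);
    }
    if (w.count >= this.limit) {
      const retryAfterSec = Math.ceil((w.startedAt + this.windowMs - t) / 1000);
      return { allowed: false, status: 429, limit: this.limit, remaining: 0, retryAfterSec };
    }
    w.count++;
    return {
      allowed: true,
      status: 200,
      limit: this.limit,
      remaining: this.limit - w.count,
      retryAfterSec: 0,
    };
  }
}
```

The decision object maps directly onto a 429 response with a `Retry-After` header plus limit and remaining headers, which is what lets integrators back off programmatically.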

Idempotent writes 0%

Idempotent writes are not implemented as a public, consumable contract in the examined public mutation endpoints. The code handling for `/deploy`, `/cancel-deployment`, `/create-backup`, `/update-backup`, and `/remove-job` performs side-effecting work without reading an idempotency key or providing replay/deduplication behavior.

  • high

    Add support for an idempotency key on each public mutation endpoint (start with `/deploy` and `/create-backup`). Accept a standard header (e.g., `Idempotency-Key`) and persist the key->result mapping (or key->operation status) in a durable store scoped to the authenticated principal/org. On retry, return the original result instead of repeating side effects.

  • high

    Ensure timeouts/retries produce replayable behavior: store request input hash (or canonicalized payload) alongside the idempotency key; if the same key is reused with different inputs, return a distinct conflict error (e.g., HTTP 409) rather than executing a new operation.

  • med

    Extend the same idempotency contract consistently to the remaining mutation endpoints (`/cancel-deployment`, `/update-backup`, `/remove-job`) so integrators can safely retry across the whole write surface.
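
The key-to-result mapping and the 409 conflict rule can be sketched as follows. `IdempotencyStore` is a hypothetical name, the in-memory `Map` stands in for a durable store, and `JSON.stringify` is used as a simple payload fingerprint (it is not a canonical form, so key order matters).

```typescript
type IdempotentOutcome<T> =
  | { status: 200; body: T; replayed: boolean }
  | { status: 409; error: string };

// Hypothetical store; a real one would be durable and would expire old keys.
export class IdempotencyStore<T> {
  private readonly entries = new Map<string, { fingerprint: string; body: T }>();

  async run(
    principalId: string,
    key: string,
    payload: unknown,
    execute: () => Promise<T>,
  ): Promise<IdempotentOutcome<T>> {
    const scopedKey = `${principalId}:${key}`; // keys are scoped to the caller
    const fingerprint = JSON.stringify(payload);
    const existing = this.entries.get(scopedKey);
    if (existing !== undefined) {
      if (existing.fingerprint !== fingerprint) {
        return { status: 409, error: "Idempotency-Key reused with a different payload" };
      }
      return { status: 200, body: existing.body, replayed: true };
    }
    const body = await execute();
    this.entries.set(scopedKey, { fingerprint, body });
    return { status: 200, body, replayed: false };
  }
}
```

This sketch does not guard against two concurrent first requests with the same key; a durable implementation would reserve the key with a unique insert before executing.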

Consistent pagination & filtering 0%

Pagination/filtering exists for at least one collection (audit logs) using bounded `limit` and `offset` plus multiple filters, but the required cursor-based convention is not implemented there. No list endpoint was found to correctly and consistently apply cursor pagination + a shared filter convention.

  • high

    Introduce cursor-based pagination for the audit-log list endpoint (and any other list endpoints), using a shared convention for cursor parameter name (e.g., `cursor` / `nextCursor`) and returning a deterministic `nextCursor` (or `hasMore`) alongside results.

  • med

    Define and reuse a common filtering contract for list endpoints (same query parameter names and semantics across collections), and ensure all list endpoints interpret filters consistently (e.g., date range boundaries, string match mode).
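
The cursor convention can be illustrated on an in-memory list. This is a sketch with assumed names (`paginate`, `Page`, `nextCursor`) and an assumed page-size cap of 100; it uses the last returned `id` as the cursor, which requires rows to be sorted by a unique, stable key.

```typescript
interface Page<T> {
  items: T[];
  nextCursor: string | null; // null means there are no further pages
}

// Keyset pagination over rows already sorted by ascending id.
export function paginate<T extends { id: string }>(
  sortedById: T[],
  limit: number,
  cursor?: string,
): Page<T> {
  const size = Math.max(1, Math.min(limit, 100)); // bound the page size
  const start =
    cursor === undefined ? 0 : sortedById.findIndex((row) => row.id > cursor);
  if (start === -1) return { items: [], nextCursor: null };
  const items = sortedById.slice(start, start + size);
  const hasMore = start + size < sortedById.length;
  return { items, nextCursor: hasMore ? items[items.length - 1].id : null };
}
```

Against a database the same idea becomes a `WHERE id > :cursor ORDER BY id LIMIT :n` query, which stays stable when rows are inserted between requests, unlike `offset`.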

Outbound events / webhooks 0%

No general-purpose outbound events/webhooks delivery primitive (subscription store + delivery worker + HMAC-signed, versioned payloads + exponential-backoff retry capped with alerting + idempotent delivery + documented event catalog) was found. The codebase does perform outbound POSTs for notifications to third-party services (e.g., Discord/Slack/custom endpoints), but these calls are not implemented as a webhook event delivery primitive with signing, retries, idempotency, and a stable public event contract.

  • high

    Implement a dedicated outbound webhook/event system: (1) subscription store (per tenant/credential), (2) background delivery worker, (3) HMAC signing of every payload with versioned schema, (4) exponential-backoff retries with a max attempt limit then flag-and-alert, (5) idempotent delivery using a persisted delivery id/key, and (6) an event catalog + redelivery semantics in the public API.

    • packages/server/src/utils/notifications/utils.ts:1-394 — Outbound notification delivery is done via direct fetch POST calls to webhook URLs, without any shared webhook event contract, signing, versioning, retry/backoff, or idempotency guarantees visible in the delivery utility.

  • med

    Add consistent integrity and failure-handling around outbound webhook delivery: include an HMAC signature header, payload version field, retry policy, and correlation/delivery ids returned/stored per attempt.

  • low

    Document a public, stable webhook event catalog (events, payload schemas, signature algorithm/header, retry semantics, and redelivery rules) and ensure the entire exported route inventory is covered by the spec.
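
The signing step in the recommendations above can be sketched with Node's built-in `crypto` module. The signed string format (`timestamp.body`), the hex encoding, and the function names are assumptions for illustration, not an existing Dokploy contract.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sender side: HMAC-SHA256 over "<timestamp>.<raw body>" (assumed format).
export function signPayload(secret: string, timestamp: number, body: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${body}`).digest("hex");
}

// Receiver side: recompute and compare in constant time.
export function verifySignature(
  secret: string,
  timestamp: number,
  body: string,
  signature: string,
): boolean {
  const expected = Buffer.from(signPayload(secret, timestamp, body), "hex");
  const given = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```

Including the timestamp in the signed string lets receivers reject old deliveries, which limits replay of a captured request.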

Consistent errors & status codes 0%

A consistent, machine-parseable error envelope with uniform status-code semantics and a correlation id on every error does not appear to be implemented as a shared response-layer primitive. API endpoints return endpoint-specific ad-hoc JSON error shapes (e.g., Unauthorized 401, invalid API key 403, various 4xx/5xx bodies) without a common error contract that clients can integrate against without per-endpoint handling.

  • high

    Introduce a single, shared error response envelope for all API surfaces (tRPC/Next API route handler and the Hono service). Include: HTTP status, a stable machine error code, a human message, and a correlation/request id on every error response.

  • high

    Standardize status-code mapping across the API: 400 for malformed input, 401/403 for auth, 409 for idempotency conflicts (if used), 422 for semantic/validation failures, 429 for throttling, and restrict 5xx to true server faults (no 200 fallbacks on error cases).

  • med

    Ensure correlation id propagation: generate a correlation id at the top of request handling (middleware) and attach it to all error responses; optionally also log it.

  • med

    Update the OpenAPI generation/error mapping to reflect the shared error envelope (so integrators can rely on documented error schemas and codes).
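
A shared envelope and status mapping could be as small as the sketch below. The error codes, the envelope field names, and `toErrorResponse` are hypothetical; the status numbers follow the mapping listed in the recommendations above.

```typescript
type ErrorCode =
  | "BAD_REQUEST"
  | "UNAUTHORIZED"
  | "FORBIDDEN"
  | "CONFLICT"
  | "UNPROCESSABLE"
  | "RATE_LIMITED"
  | "INTERNAL";

// One place that decides which HTTP status each machine code maps to.
const STATUS_BY_CODE: Record<ErrorCode, number> = {
  BAD_REQUEST: 400,
  UNAUTHORIZED: 401,
  FORBIDDEN: 403,
  CONFLICT: 409,
  UNPROCESSABLE: 422,
  RATE_LIMITED: 429,
  INTERNAL: 500,
};

interface ErrorEnvelope {
  error: { code: ErrorCode; message: string; correlationId: string };
}

export function toErrorResponse(
  code: ErrorCode,
  message: string,
  correlationId: string,
): { status: number; body: ErrorEnvelope } {
  return {
    status: STATUS_BY_CODE[code],
    body: { error: { code, message, correlationId } },
  };
}
```

Because every surface goes through the same function, clients can branch on `error.code` and quote `correlationId` in support requests without per-endpoint handling.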

Sandbox / test mode N/A

No consumer-facing Sandbox / test-mode contract (i.e., a documented sandbox base URL plus test keys and isolated test data) was found in the codebase. The only obvious “test keys” are for the repo’s own unit/integration tests (Vitest), not for external integrators.

  • high

    Add a documented sandbox/test environment contract for integrators: publish a sandbox base URL, create dedicated test credentials/keys with least-privilege scopes, and ensure sandbox data is isolated from production (with clear lifecycle/reset behavior).

  • med

    Add explicit configuration entries for sandbox mode (e.g., SANDBOX_BASE_URL / SANDBOX_API_KEY) and ensure runtime code selects sandbox endpoints/keys when SANDBOX_MODE is enabled, with documentation in the repo (README/docs).

Extension points / plugins N/A

No stable, documented, versioned extension/plugin interface (with a registry and isolation from core) was found. The codebase appears to support configurable integrations (e.g., registries/providers) and internal Next.js API routes/webhooks, but not a third-party plugin contract that an external developer can implement and register without forking.

Not applicable to this codebase: Per-tenant rate limiting, Sandbox / test mode, Extension points / plugins.

Integration Depth

Per-system adapters behind one shared interface with bi-directional sync — not per-customer scripts held together with spreadsheets.

33% 9/10 scored
  • Shared integration abstraction 0%
    0/6 expected sites not present
  • Metadata-driven mappings 67%
    3/4 expected sites
  • Per-integration reliability 0%
    0/2 expected sites not present
  • Sync state & reconciliation 0%
    0/1 expected sites not present
  • Inbound validation & normalization 50%
    3/4 expected sites
  • Per-tenant integration credentials 33%
    3/3 expected sites
  • Per-integration observability 0%
    0/6 expected sites not present
  • Connector breadth for the category 61%
    5/6 expected sites
  • Build-vs-buy posture 83%
    4/4 expected sites

Shared integration abstraction 0%

The codebase has provider-specific integration implementations and provider-specific TypeScript interfaces/types (e.g., `gitea-utils.ts`, and separate token/clone/branch/repo logic for Gitea/GitHub/GitLab). However, there is no clearly shared integration abstraction: no single common adapter interface backed by a canonical data model that multiple distinct provider integrations implement consistently. As a result, the integration surface appears architected as separate per-provider “snowflakes” rather than a shared interface + canonical entities.

  • high

    Introduce a shared integration abstraction layer: define a common adapter interface (per provider) that covers at least (1) OAuth credential lifecycle (authorize/callback handling + token refresh), (2) canonical repo/branch retrieval, and (3) canonical actions (e.g., clone/checkout inputs).

  • high

    Define and enforce canonical domain entities used by all adapters (at minimum: ProviderIdentity, Credential/Token state, RepositoryRef, BranchRef). Add validation/normalization at the adapter boundary so each provider maps external representations into the canonical model.

    • apps/dokploy/utils/gitea-utils.ts:1-88 — Gitea-specific response/entity interfaces exist, but the absence of corresponding shared canonical entities across other providers indicates the canonical-model layer is missing.

  • med

    Refactor provider HTTP surfaces (e.g., `/api/providers/*/authorize` and `/callback`, webhook endpoints) to delegate to the shared integration abstraction instead of duplicating provider-specific OAuth/event logic in route handlers.
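
One possible shape for the shared adapter interface and canonical entities is sketched below. All names (`GitProviderAdapter`, `RepositoryRef`, `BranchRef`, `AdapterRegistry`) and method signatures are hypothetical; the audit does not prescribe a specific interface.

```typescript
// Canonical entities every provider maps its own responses into.
interface RepositoryRef {
  provider: string;
  owner: string;
  name: string;
}

interface BranchRef {
  repository: RepositoryRef;
  name: string;
}

// One interface implemented once per provider (GitHub, GitLab, Gitea, ...).
interface GitProviderAdapter {
  readonly provider: string;
  refreshToken(credentialId: string): Promise<void>;
  listRepositories(credentialId: string): Promise<RepositoryRef[]>;
  listBranches(credentialId: string, repo: RepositoryRef): Promise<BranchRef[]>;
}

// Route handlers look adapters up by provider name instead of branching per provider.
export class AdapterRegistry {
  private readonly adapters = new Map<string, GitProviderAdapter>();

  register(adapter: GitProviderAdapter): void {
    this.adapters.set(adapter.provider, adapter);
  }

  get(provider: string): GitProviderAdapter {
    const adapter = this.adapters.get(provider);
    if (adapter === undefined) throw new Error(`no adapter registered for ${provider}`);
    return adapter;
  }
}
```

With this in place, a route such as the provider callback would resolve `registry.get(provider)` and call the shared methods, leaving provider-specific HTTP details inside each adapter.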

Bidirectional sync N/A

No bidirectional sync primitive (read+write synchronization of external system entities with sync state/reconciliation) was found. The provider endpoints observed are for OAuth/setup and webhook routing/redirects, not for maintaining two-way synchronization with external systems.

  • high

    If the product requires two-way synchronization with external systems (e.g., mirroring issues/PRs/resources back into Dokploy, or updating Dokploy state from webhook events), add an explicit bidirectional sync adapter layer per external provider (shared interface, canonical entities, write-back handlers).

  • med

    Implement a persistent sync state mechanism (cursor/watermark) and idempotent upserts for incremental sync, plus drift handling and failure handling per provider/adapter.

Metadata-driven mappings 67%

Metadata-driven mappings appear in the SSO provider configuration flow: the UI constructs per-provider `mapping` objects inside `oidcConfig`/`samlConfig`, the server schema validates them, and the SSO router forwards them as config updates. However, the shared runtime interpreter for applying these mappings during authentication is not evident in the visible shared SSO service layer (only generic helpers are present), so implementation depth looks partial.

  • high

    Locate and verify the runtime interpreter that applies `oidcConfig.mapping` / `samlConfig.mapping` to identity claims during SSO login. Ensure mapping evaluation is centralized (one canonical function/service) and not reimplemented per connector/flow.

  • med

    Confirm end-to-end persistence and versioning: ensure `mapping` is stored exactly as validated by `ssoProviderBodySchema` and that updates are migration-safe (e.g., handle mapping schema changes gracefully).

    • packages/server/src/db/schema/sso.ts:1-134 — Schema defines mapping sub-objects under `oidcConfig.mapping` and `samlConfig.mapping` with optional/required fields; validate that the actual persisted data matches and is used consistently.

  • low

    Reduce UI-time conditional mapping logic (e.g., Azure vs non-Azure) by expressing differences as config variants or defaults computed server-side, so mapping rules remain metadata-driven.

Per-integration reliability 0%

No correctly implemented per-integration reliability pattern was found. While the codebase uses BullMQ workers for deployment tasks and configures Stripe SDK network retries, the integration execution paths do not show the required combination of (1) retry with backoff, (2) a dead-letter queue/parking area for items that still fail, and (3) alerting/visibility per integration when retries are exhausted.

  • high

    Add retry-with-backoff + DLQ/parking for failures inside the BullMQ deployment worker. Concretely: configure BullMQ job attempts/backoff and ensure failed jobs are routed to a dedicated dead-letter queue (or are persisted and re-processed), and emit alerting/metrics when retries are exhausted.

  • med

    For Stripe webhook processing, implement an event processing failure strategy: on handler failures, enqueue the webhook event payload/id to a DLQ (or separate retry queue) and ensure the system reports alerting when max attempts are exceeded. Consider idempotency so replays are safe.

Sync state & reconciliation 0%

No implementation of a reusable “Sync state & reconciliation” primitive (stored cursor/watermark + idempotent upserts + drift detection between systems) was found. The codebase includes external-system fetching and event pagination, but it does not persist integration sync state or perform reconciliation across runs.

  • high

    Add durable per-integration sync state (cursor/watermark) and reconciliation for the external event polling path: persist the last processed position (e.g., Inngest cursor/internal_id or receivedAt) in DB, use it on the next run for incremental fetch, and implement idempotent upserts into your internal job/event tables. Optionally add drift detection (e.g., compare expected vs observed counts/status transitions, or detect missing/changed runs and repair).

    • apps/api/src/service.ts:64-116 — Pagination uses a local variable `cursor` and fetches events/runs without any durable stored watermark or cross-run reconciliation logic.
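
The watermark plus idempotent-upsert idea can be sketched with two maps standing in for database tables. `syncOnce`, `ExternalEvent`, and the use of `receivedAt` as the watermark are assumptions for illustration only.

```typescript
interface ExternalEvent {
  id: string;
  receivedAt: number;
}

// `state` stands in for a durable sync-state table, `table` for the event table.
export function syncOnce(
  integration: string,
  fetchSince: (watermark: number) => ExternalEvent[], // inclusive lower bound
  state: Map<string, number>,
  table: Map<string, ExternalEvent>,
): number {
  const watermark = state.get(integration) ?? 0;
  let newest = watermark;
  for (const evt of fetchSince(watermark)) {
    table.set(evt.id, evt); // upsert keyed by id, so re-fetched rows do not duplicate
    if (evt.receivedAt > newest) newest = evt.receivedAt;
  }
  state.set(integration, newest); // persist the position for the next run
  return newest;
}
```

Fetching with an inclusive bound re-reads the boundary event on each run; the upsert makes that harmless and avoids missing events that share a timestamp.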

Inbound validation & normalization 50%

The codebase contains a validation/normalization pattern (Zod schemas with constraints and transforms) and some integration-boundary validation for external inputs (Stripe webhook signature + event-type allowlist; GitLab/Gitea OAuth callbacks validate query parameters and token exchange results before persisting). However, the full inbound validation & normalization primitive (canonical modeling + dedup + quarantining of bad records) does not appear consistently at the integration boundaries—e.g., GitHub’s webhook endpoint is effectively a redirect-only handler with no payload validation/canonicalization.

  • high

    Introduce a consistent inbound boundary contract for external integration endpoints: (1) validate payload/query via schema (e.g., Zod), (2) normalize into a canonical internal input DTO, (3) add idempotency/dedup (event id / OAuth replay protection), and (4) quarantine bad records (store failures with reason, do not silently redirect/throw away). Start by standardizing Stripe webhook, GitHub webhook, and the OAuth callbacks into the same pattern.

  • med

    For each integration boundary, add explicit idempotency: Stripe should key off Stripe event id; OAuth callbacks should add replay protection (e.g., one-time state storage / nonce verification) so repeated callbacks don’t re-write tokens or cause inconsistent state.

  • low

    Leverage existing Zod schemas in `packages/server/src/db/validations/*` as the canonical normalization layer (or create integration-specific schemas) so inbound external data is consistently normalized before persistence.

Per-tenant integration credentials 33%

This codebase does perform per-provider/per-tenant credential refresh for GitHub/GitLab/Gitea integrations (token refresh is keyed by a specific provider id and persisted back to the corresponding credential row). However, the implementation appears to store sensitive OAuth/app credentials directly in database columns (e.g., `refreshToken`, `accessToken`, `clientSecret`, `privateKey`) rather than retrieving them from a dedicated secret manager per tenant, so the “secret-store + revocable per tenant” portion of the primitive is not demonstrated.

  • high

    Move integration credential material (OAuth client secrets, refresh tokens, GitHub private keys) out of plain DB columns into a per-tenant secret manager (e.g., Vault/AWS Secrets Manager/GCP Secret Manager). Change refresh flows to fetch credentials from the secret manager on-demand and ensure rotation/revocation operates per tenant/integration.

  • high

    Update the refresh endpoints (`refreshGiteaToken`, `refreshGitlabToken`) to use secret-manager lookups keyed by organizationId + providerId, and ensure failures revoke/disable invalid tokens for that tenant/provider.

  • med

    Confirm/strengthen revocation semantics when a tenant removes or rotates an integration: ensure secret-manager deletion/disablement is performed and that subsequent token refresh attempts fail safely for that tenant/provider.
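
A rough sketch of what a refresh flow keyed by organizationId + providerId could look like when credentials live behind a secret-store interface. The `SecretStore` interface, the in-memory implementation, and `refreshIntegrationToken` are all hypothetical; a real backend (Vault, AWS Secrets Manager, GCP Secret Manager) would be asynchronous, and the `exchange` callback stands in for the provider's token endpoint.

```typescript
// Hypothetical interface; a real secret manager client would be async.
type SecretRef = { organizationId: string; providerId: string };
type TokenPair = { accessToken: string; refreshToken: string };

interface SecretStore {
  get(ref: SecretRef): TokenPair | undefined;
  put(ref: SecretRef, tokens: TokenPair): void;
  revoke(ref: SecretRef): void;
}

class InMemorySecretStore implements SecretStore {
  private readonly entries = new Map<string, TokenPair>();

  private static key(ref: SecretRef): string {
    // JSON encoding avoids collisions between ids that contain separators.
    return JSON.stringify([ref.organizationId, ref.providerId]);
  }

  get(ref: SecretRef): TokenPair | undefined {
    return this.entries.get(InMemorySecretStore.key(ref));
  }

  put(ref: SecretRef, tokens: TokenPair): void {
    this.entries.set(InMemorySecretStore.key(ref), { ...tokens });
  }

  revoke(ref: SecretRef): void {
    this.entries.delete(InMemorySecretStore.key(ref));
  }
}

type RefreshOutcome =
  | { ok: true; accessToken: string }
  | { ok: false; reason: "not_found" | "rejected_and_revoked" };

// `exchange` returns null when the provider rejects the refresh token.
function refreshIntegrationToken(
  store: SecretStore,
  ref: SecretRef,
  exchange: (refreshToken: string) => TokenPair | null,
): RefreshOutcome {
  const current = store.get(ref);
  if (current === undefined) {
    return { ok: false, reason: "not_found" };
  }
  const next = exchange(current.refreshToken);
  if (next === null) {
    // Fail safely: disable the invalid credential for this tenant/provider only.
    store.revoke(ref);
    return { ok: false, reason: "rejected_and_revoked" };
  }
  store.put(ref, next);
  return { ok: true, accessToken: next.accessToken };
}
```

Because every lookup goes through a `SecretRef`, revoking one tenant's integration cannot affect another tenant that happens to use the same provider id.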

Per-integration observability 0%

No per-integration observability (per-provider metrics/status/last-sync) was found in the code paths related to external integrations (GitHub/Gitea/Stripe). The integration handlers/utilities primarily rely on redirects and console.log/console.error/console.warn calls; they do not emit structured per-integration metrics, update a stored last-sync/status, or otherwise surface health, throughput, or failures to ops.

  • high

    Introduce a shared per-integration telemetry contract (e.g., IntegrationHealth/SyncStatus store + metrics emitter) and wire it into each integration entrypoint and adapter operation (webhook handlers, OAuth flows, token refresh, provider API calls).

  • high

    Persist per-integration last attempt + last success + last failure reason (and timestamps) in the database (or monitoring store), and ensure ops/customer UIs can read it.

  • med

    Add metrics emission per integration operation: counters for success/failure by provider + latency histograms by operation (e.g., token refresh, webhook processing, repo listing).

  • med

    Standardize error handling so failure reasons are structured (error codes) and are included in telemetry (instead of only console.warn/error strings).
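
One possible shape for the shared telemetry contract described above is a small wrapper that records attempt, success, and failure details per provider and operation. The sketch below is an assumption-laden illustration: `IntegrationTelemetry`, `IntegrationHealth`, and `errorCode` are hypothetical names, state is kept in memory rather than in a database or metrics backend, and the wrapped operation is synchronous for brevity.

```typescript
// Hypothetical names; nothing here is taken from the audited repo.
type IntegrationHealth = {
  attempts: number;
  failures: number;
  lastAttemptAt?: number;
  lastSuccessAt?: number;
  lastFailureAt?: number;
  lastFailureCode?: string;
};

// Prefer a structured string `code` on the error; fall back to a generic code.
function errorCode(err: unknown): string {
  if (typeof err === "object" && err !== null && "code" in err) {
    const code = (err as { code: unknown }).code;
    if (typeof code === "string") {
      return code;
    }
  }
  return "UNKNOWN";
}

class IntegrationTelemetry {
  private readonly health = new Map<string, IntegrationHealth>();

  // Wraps one integration operation and records its outcome.
  // `now` is injectable so behaviour is deterministic under test.
  track<T>(
    provider: string,
    operation: string,
    run: () => T,
    now: () => number = Date.now,
  ): T {
    const entry = this.entryFor(provider, operation);
    entry.attempts += 1;
    entry.lastAttemptAt = now();
    try {
      const result = run();
      entry.lastSuccessAt = now();
      return result;
    } catch (err) {
      entry.failures += 1;
      entry.lastFailureAt = now();
      entry.lastFailureCode = errorCode(err);
      throw err; // telemetry observes the failure but does not swallow it
    }
  }

  // Returns a copy so callers cannot mutate the stored record.
  snapshot(provider: string, operation: string): IntegrationHealth | undefined {
    const entry = this.health.get(`${provider}:${operation}`);
    return entry ? { ...entry } : undefined;
  }

  private entryFor(provider: string, operation: string): IntegrationHealth {
    const key = `${provider}:${operation}`;
    let entry = this.health.get(key);
    if (!entry) {
      entry = { attempts: 0, failures: 0 };
      this.health.set(key, entry);
    }
    return entry;
  }
}
```

The same wrapper could sit around webhook handling, OAuth callbacks, and token refresh, so ops and customer UIs read one record shape regardless of provider.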

Connector breadth for the category 61%

This codebase does maintain connector breadth via explicit, enumerated provider catalogs (notably S3-compatible destination providers) and via dedicated runtime provider entrypoints under /pages/api/providers/ (e.g., GitHub webhook, Gitea authorize/callback, GitLab callback). However, the breadth story across “vertical table-stakes” categories (identity/CRM/data warehouse/etc.) is not directly confirmable from the inspected evidence; it should be validated by building a full connector inventory beyond Git providers and storage/billing surfaces.

  • high

    Create a connector inventory report (for the audit denominator): enumerate all distinct external systems supported by runtime connector entrypoints (all /pages/api/providers/** routes, plus any server-side provider modules) and compare against vertical table-stakes relevant to this product (identity, CRM, data warehouse, etc.).

  • med

    For each supported external system, confirm breadth is consistent between UI catalogs and runtime onboarding surfaces (authorize/callback/webhook). Missing parity (catalog lists provider but runtime endpoints absent) is a common breadth gap.

  • med

    Ask the team for (and document) known “table-stakes gaps” explicitly (e.g., whether identity/CRM/data-warehouse connectors are intentionally out-of-scope). Treat this primitive as a lightly sourced breadth follow-up: the code evidence needs to be completed with product context.

Build-vs-buy posture 83%

The codebase clearly implements owned (first-party) integration depth for external Git providers (GitHub, Gitea, GitLab) via custom OAuth/token and API logic in per-provider modules. Evidence does not indicate reliance on an embedded third-party iPaaS/connectors platform for these integrations (supporting a build posture rather than buy/rented depth).

  • high

    Confirm whether there is (or should be) a shared canonical adapter interface across providers (e.g., a GitProvider contract implemented by GitHub/Gitea/GitLab). Right now, the build posture is present, but cross-provider consistency of the abstraction/interface is not fully evidenced from the inspected files.

  • med

    Do a quick inventory check for any embedded third-party integration platforms (Nango/Zapier/Workato/etc.) or generic connector outsourcing. If none exist, document the build-vs-buy intent and acceptable scope/boundaries in an integration strategy note for future maintainers.
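
For illustration only, a shared adapter contract of the kind the first recommendation asks about might look like the sketch below. The `GitProvider` interface, its method set, and the registry are hypothetical; the audit did not confirm whether the repo has an equivalent, and real provider calls would be asynchronous.

```typescript
// Hypothetical contract; the method set is illustrative, not the repo's API.
type ProviderKind = "github" | "gitlab" | "gitea";
type RepoRef = { owner: string; name: string };

interface GitProvider {
  readonly kind: ProviderKind;
  listRepositories(): RepoRef[];
  cloneUrl(repo: RepoRef): string;
}

class GitProviderRegistry {
  private readonly providers = new Map<ProviderKind, GitProvider>();

  register(provider: GitProvider): void {
    if (this.providers.has(provider.kind)) {
      throw new Error(`provider already registered: ${provider.kind}`);
    }
    this.providers.set(provider.kind, provider);
  }

  // Callers depend on the contract, never on a concrete provider module.
  resolve(kind: ProviderKind): GitProvider {
    const provider = this.providers.get(kind);
    if (!provider) {
      throw new Error(`no provider registered for: ${kind}`);
    }
    return provider;
  }
}
```

With a contract like this, cross-provider consistency becomes checkable: each per-provider module either implements the interface or fails to compile.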

Not applicable to this codebase: Bidirectional sync.

Deployability

CI/CD as code, infrastructure as code, per-environment isolation, and a one-command local boot.

38% 11/11 scored
  • Reproducible one-command build 0%
    0/2 expected sites not present
  • Automated CI pipeline 50%
    1/2 expected sites
  • Automated deployment (CD) 100%
    4/4 expected sites
  • Infrastructure as code 0%
    0/2 expected sites not present
  • Environment isolation 0%
    0/3 expected sites not present
  • Local/production parity 44%
    2/3 expected sites
  • Config & secrets externalized per env 0%
    0/3 expected sites
  • Decouple deploy from release 0%
    0/4 expected sites not present
  • Reversibility / rollback 67%
    3/3 expected sites
  • Delivery cadence (DORA proxy) 89%
    3/3 expected sites
  • Deploy-tooling ownership 67%
    3/3 expected sites
Reproducible one-command build 0%

A deterministic dependency mechanism exists (pnpm lockfile + `pnpm install --frozen-lockfile` in the Docker build). However, the repository does not provide the core “reproducible one-command build” primitive: there is no root-level documented single command / bootstrap script that a developer can run from a clean clone to build and boot locally. Onboarding currently emphasizes a VPS curl|bash install.

  • high

    Add a root bootstrap entry point that supports a clean clone and one-command local build+boot (e.g., `make dev` / `./setup.sh` / `just up`) and document it in README. Prefer using the existing Dockerfile (or a docker-compose/devcontainer) so the command is truly one-shot.

    • README.md:18-34 — README currently documents a VPS install curl|bash workflow instead of a clean-clone local one-command build+boot.
  • high

    Ensure the one-command path includes env + required dependencies setup in a reproducible way (generate/populate `.env` from templates, and start required services like Postgres/Redis via versioned configuration). If you use the Dockerfile, pair it with a versioned `docker-compose`/dev orchestration or scripts that bring up dependencies automatically.

    • package.json:1-27 — Root scripts expose build/start, but there is no single documented command that performs full bootstrap (env + services) for a clean clone.
    • apps/dokploy/.env.example:1-4 — An env template exists, but the current repo onboarding does not wire it into a reproducible one-command bootstrap.
  • med

    Add/confirm CI gating for a “local build” equivalent (not just Docker image builds): run the build steps in CI (install with frozen lockfile + compile/test) so the one-command process has green evidence on main.

    • .github/workflows/dokploy.yml:1-80 — The existing workflow focuses on building/pushing Docker images; it does not demonstrate a developer-equivalent clean-clone local build+boot gate in code.
Automated CI pipeline 50%

An automated CI pipeline exists for pull requests: `.github/workflows/pull-request.yml` runs build/test/typecheck automatically on every PR targeting main/canary. However, the push-triggered workflow (`.github/workflows/deploy.yml`) appears to only build and push Docker images and does not run tests or typecheck, so CI coverage for direct pushes, and for keeping main green, falls short of the primitive definition.

  • high

    Add a push-triggered CI workflow (or extend `deploy.yml`) that runs the same `pnpm build`, `pnpm test`, and `pnpm typecheck` steps on every push to `main` and `canary` (and make it an explicit required check for merge).

  • med

    Ensure the CI checks produced by the PR workflow are actually required for merge protection in repository settings (so merges cannot proceed when build/test/typecheck fails).

Automated deployment (CD) 100%

Automated deployment (CD) exists and is implemented as a code-driven event pipeline: GitHub/SCM webhooks validate incoming events and automatically create/enqueue deployment jobs; an Inngest-backed deployment service executes the `deploy(...)` function and emits completion/failure events. Additionally, GitHub Actions workflows automate build/release artifacts on main, supporting an automated release-to-deploy workflow.

  • high

    Add/confirm a documented end-to-end “release → webhook → Inngest deploy → production” path for all supported trigger types (push vs tag vs docker/git source types), including required environment variables and how to test the pipeline safely in staging.

  • med

    Ensure the CD pipeline has an explicit rollback/revert strategy per deployment job type (e.g., database/media/image rollbacks) and that failures produce actionable remediation (not just events).

    • apps/api/src/index.ts:1-115 — The pipeline emits `deployment/failed`, but rollback semantics are not shown in the audited slices; confirm and codify rollback behavior.
Infrastructure as code 0%

No infrastructure-as-code (IaC) implementation was found in this repository: the git artifact scan shows no Terraform/CloudFormation/Pulumi/Helm/wrangler/serverless/K8s-manifest IaC files. The GitHub workflows present are focused on building/pushing Docker images and release automation, but they do not include or reference reproducible, versioned infrastructure definitions to provision/manage environments.

  • high

    Add versioned IaC for the production environment (choose the stack your target platform expects—e.g., Terraform for cloud resources, or Helm/Kubernetes manifests for cluster workloads) and make it the single source of truth for environment provisioning.

  • high

    Wire the deploy/CD workflow to run IaC changes (plan/apply) for each environment with a reproducible path from a clean checkout; ensure prod is reproducible from the same IaC and supports drift control.

  • med

    Create environment directory structure and golden templates for dev/staging/prod (separate state backends/secrets per env), so the infrastructure path to production is code-reviewed and reusable.

Environment isolation 0%

No clear implementation of 'Environment isolation' (separate dev/staging/prod with isolated data + credentials for the platform itself) was found. While the project supports user-defined 'environments' and has a compose-level 'isolatedDeployment' option to prevent Docker collisions, the repository does not show stage-specific deployment configuration/templates or a dev/staging/prod isolation model for credentials/data at the application/infra boundary.

  • high

    Introduce explicit stage separation for the platform deployment (dokploy itself): add stage-specific env templates (e.g., apps/dokploy/.env.staging.example and .env.production.example), ensure secrets/endpoints are injected per stage (no localhost/prod defaults committed), and wire stage selection into the startup/deployment flow.

    • apps/dokploy/.env.example:1-4 — Current config example is single-environment/local-focused (DATABASE_URL points to localhost, NODE_ENV=development) and does not demonstrate stage isolation.
  • high

    Clarify and enforce the mapping between 'dokploy environments' and deployment stages (dev/staging/prod). If they are meant to represent stages, implement first-class stage fields and restrict sharing of underlying accounts/volumes/data across stages; remove/adjust the hard block that prevents a 'production' named environment.

  • med

    Extend isolation beyond Docker collision avoidance: ensure stage selection affects credentials/accounts/state used during deployments (e.g., per-stage docker registry credentials, per-stage database endpoints, per-stage backup/restore targets), rather than only compose network/volume naming.

Local/production parity 44%

Local/production parity mechanisms do exist: the repo includes a VS Code devcontainer that forwards the expected runtime service ports and uses pinned Node and PNPM versions. However, the devcontainer’s Dockerfile is not the same as (or obviously derived from) the production Dockerfile—there is likely some runtime drift (e.g., different Node image variants and production-specific tooling/config expectations).

  • high

    Make the devcontainer build from the same production Dockerfile (or a shared base) so the runtime truly matches. Concretely, point devcontainer.json to the production Dockerfile (or extract a common “runtime base” Dockerfile used by both).

    • .devcontainer/devcontainer.json:1-54 — devcontainer.json builds from Dockerfile='Dockerfile' in the .devcontainer folder, not the root production Dockerfile; this is the main parity decision point.
    • .devcontainer/Dockerfile:1-21 — Local base image is node:24.4.0-bullseye-slim and installs only a minimal toolchain; production Dockerfile is different and includes additional setup.
  • med

    Ensure local config loading mirrors production behavior: document and align which env files/variables are used at runtime (names and semantics), and avoid relying on untracked local/manual setup.

    • Dockerfile:1-73 — Production sets NODE_ENV=production and copies .env.production into the image; local should use the same variable expectations and behavior.
    • apps/dokploy/.env.example:1-4 — Local example config currently hardcodes a localhost DATABASE_URL, which may not reflect production deployment configuration patterns.
  • low

    Add a quick parity check script/README entry (e.g., 'start-dev' that runs the same container command used in production plus migrations/seed steps) to make it easy to recreate the same runtime locally.

Config & secrets externalized per env 0%

The codebase shows partial adoption of environment-driven configuration (e.g., Traefik port configuration via process.env and presence of .env.example templates). However, there are key anti-patterns for this primitive: at least one production/cloud URL is hardcoded in code (https://app.dokploy.com), and deployment-version defaults (e.g., TRAEFIK_VERSION) are not fully externalized, requiring code changes to adjust per environment.

  • high

    Externalize the cloud base URL used in getDokployUrl into configuration (env var / per-env config), and remove the hardcoded https://app.dokploy.com literal.

  • high

    Make TRAEFIK_VERSION fully configurable per environment (remove or minimize hardcoded defaults). Prefer requiring TRAEFIK_VERSION (or a validated config struct) to be provided by env/config rather than defaulting to "3.6.7" in code.

  • med

    Reduce embedded environment-dependent endpoint defaults in createDefaultServerTraefikConfig by moving internal hostnames/URL templates into env/config (or clearly documenting and parameterizing them through the existing config layer).
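
A minimal sketch of fail-fast, per-environment configuration loading, assuming both values arrive as environment variables. `DOKPLOY_CLOUD_URL` is a hypothetical variable name introduced for illustration; `TRAEFIK_VERSION` is the name the audit refers to, though treating it as a required variable with no in-code default is the recommendation, not the current behaviour. `loadPlatformConfig` and `PlatformConfig` are likewise hypothetical.

```typescript
type Env = Record<string, string | undefined>;
type PlatformConfig = { cloudBaseUrl: string; traefikVersion: string };

function loadPlatformConfig(env: Env): PlatformConfig {
  const missing: string[] = [];
  const required = (name: string): string => {
    const value = (env[name] ?? "").trim();
    if (value === "") {
      missing.push(name);
    }
    return value;
  };

  // DOKPLOY_CLOUD_URL is a hypothetical name; no default is baked into code.
  const cloudBaseUrl = required("DOKPLOY_CLOUD_URL");
  const traefikVersion = required("TRAEFIK_VERSION");

  // Report every missing variable at once rather than one per restart.
  if (missing.length > 0) {
    throw new Error(`Missing required config: ${missing.join(", ")}`);
  }
  if (!/^https?:\/\/\S+$/.test(cloudBaseUrl)) {
    throw new Error("DOKPLOY_CLOUD_URL must be an http(s) URL");
  }
  if (!/^\d+\.\d+\.\d+$/.test(traefikVersion)) {
    throw new Error("TRAEFIK_VERSION must look like MAJOR.MINOR.PATCH");
  }

  // Strip trailing slashes so callers can append paths consistently.
  return { cloudBaseUrl: cloudBaseUrl.replace(/\/+$/, ""), traefikVersion };
}
```

Passing the environment in as an argument (rather than reading `process.env` inside the function) keeps the loader testable and makes the per-environment inputs explicit.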

Decouple deploy from release 0%

No implementation of decouple-deploy-from-release was found. While the codebase includes an `isEnabled` concept, it appears to be permission/environment gating for UI rather than a feature-flag/rollout system that separates deployment from activation. The GitHub workflows build/publish images and create releases without showing flag-based or percentage/canary rollout control for production activation.

  • high

    Introduce a real feature-flag/rollout mechanism (library + server-side gating) and wire it into production activation points. For example, guard new/changed backup scheduling behavior in `backupRouter.create` behind a flag checked at request time, with percentage/canary support.

  • high

    Add rollout-aware gating to user-facing UI routes/entries that should only become available after rollout. Replace/augment permission-only gating with flag-based activation (e.g., show/enable buttons and navigation only when the rollout flag is active).

  • med

    Connect CI/CD releases to progressive activation. If you keep publishing images for `main`, ensure production traffic/users only see new behavior via flags or canary routing rather than all-or-nothing exposure tied to `latest`.
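
To make the percentage/canary idea concrete, here is a small sketch of deterministic bucketing: each subject (for example an organization id) hashes into a stable bucket from 0 to 99, and a flag is active when the bucket falls below the rollout percentage. The names `RolloutFlag` and `isFlagActive` are hypothetical, and the hash is an FNV-1a-style function over UTF-16 code units chosen only because it needs no dependencies; a production system would more likely use an established feature-flag library.

```typescript
type RolloutFlag = { enabled: boolean; rolloutPercent: number };

// FNV-1a-style 32-bit hash over UTF-16 code units (not cryptographic).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

function isFlagActive(
  flagName: string,
  flag: RolloutFlag | undefined,
  subjectId: string,
): boolean {
  // Unknown or disabled flags are off: deployed code stays dark by default.
  if (!flag || !flag.enabled) {
    return false;
  }
  if (flag.rolloutPercent >= 100) {
    return true;
  }
  if (flag.rolloutPercent <= 0) {
    return false;
  }
  // Including the flag name gives each flag an independent bucketing.
  const bucket = fnv1a(`${flagName}:${subjectId}`) % 100;
  return bucket < flag.rolloutPercent;
}
```

Because a subject's bucket never changes, raising the percentage only ever adds subjects to the active set; nobody who already has the feature loses it mid-rollout.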

Reversibility / rollback 67%

This codebase does implement a production rollback path for deployments (API + service-layer Swarm service update using stored `fullContext` and an image tag), with healthcheck/rollback config wired into the Swarm TaskTemplate. However, migration reversibility is not demonstrated: the migration script runs forward migrations only and provides no rollback/down/reverse support, which undermines full “reversibility without corruption” for data/schema changes.

  • high

    Add reversible, backward-compatible migrations and a documented rollback workflow (down/reverse migrations or a strategy like expand/contract with versioned compatibility). Ensure the rollback process (or deploy job) runs the matching reverse step when rolling back a release.

  • med

    Strengthen rollback readiness in the rollback API/executor: add/trigger canary or explicit verification steps (e.g., wait for service health/replica readiness, run smoke checks, capture metrics/log correlation IDs) before marking rollback successful.

  • low

    Ensure failed rollback attempts are handled transactionally and surfaced clearly (e.g., if `service.update`/createService fails, return a structured error and preserve rollback state for investigation).
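
The "matching reverse step" in the first recommendation can be illustrated with a small planner that, given the applied migrations and a rollback target, returns the down steps in the order they must run. This is a sketch under assumptions: the `Migration` shape with paired `up`/`down` statements and the `planRollback` helper are hypothetical, and the audited migration script currently has no such reverse support.

```typescript
// Hypothetical shape: every forward step carries its own reverse step.
type Migration = { id: string; up: string; down: string };

// Returns the migrations to revert, newest first, to get back to `targetId`.
// A null target means "revert everything that is applied".
function planRollback(
  all: readonly Migration[],
  appliedIds: readonly string[],
  targetId: string | null,
): Migration[] {
  const byId = new Map<string, Migration>();
  all.forEach((migration) => byId.set(migration.id, migration));

  const targetIndex = targetId === null ? -1 : appliedIds.indexOf(targetId);
  if (targetId !== null && targetIndex === -1) {
    throw new Error(`rollback target is not applied: ${targetId}`);
  }

  // Everything applied after the target is reverted in reverse order.
  const toRevert = appliedIds.slice(targetIndex + 1).reverse();
  return toRevert.map((id) => {
    const migration = byId.get(id);
    if (!migration) {
      // Refuse to guess: an applied migration without a known down step
      // cannot be reverted safely.
      throw new Error(`no definition found for applied migration: ${id}`);
    }
    return migration;
  });
}
```

Planning before executing also gives the rollback job something to log and verify, which fits the recommendation to surface failed rollbacks clearly.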

Delivery cadence (DORA proxy) 89%

Delivery cadence appears strong. Git history shows frequent commits/merges and steady tagging. On the repo side, CI automation for building/pushing container images is triggered on pushes to main and canary, and the canary→main promotion PR flow is automated when the version changes. Together, these indicate a mature, incremental release process rather than occasional big-bang releases.

  • high

    Add/verify an explicit CD step that deploys to a staging/preview environment on each main/canary change (or PR) rather than only building/publishing images and creating releases. Ensure this is automated and reversible (e.g., environment-per-PR or preview deployments).

    • .github/workflows/dokploy.yml:1-242 — Current workflow content shown includes building/pushing images, combining manifests, and creating releases; it does not (in the observed excerpt) demonstrate an automated staging/preview deployment step.
  • med

    Instrument the release workflow to tighten the small-batch feedback loop (e.g., build/test gates + link deploy artifacts to runs) so that the time-to-production remains consistently short, not just the commit cadence.

    • .github/workflows/deploy.yml:1-109 — Workflow builds and pushes images but (in the shown excerpt) does not clearly show test gates or downstream production/staging rollout wiring.
Deploy-tooling ownership 67%

Deploy/tooling ownership (single-engineer CI/CD time-bomb risk) appears to be mitigated: git history over deploy/CI paths shows 15 distinct authors. However, the top author share is still high (0.89), so while it’s not a strict single-owner failure mode, pipeline ownership is meaningfully concentrated.

  • high

    Reduce concentration of ownership for CI/CD workflows by explicitly assigning codeowners/review rotation for .github/workflows/* (especially deploy.yml and dokploy.yml).

  • med

    Add lightweight workflow-level smoke checks (or reuse existing PR quality signals) that validate the deploy pipeline itself (e.g., verify required secrets/metadata are present in non-publishing mode) so more contributors can safely make changes.
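
The two figures quoted above (distinct authors and top author share) can be reproduced from a list of commit authors. The sketch assumes that "top author share" means the fraction of commits on deploy/CI paths made by the single most active author; the audit does not spell out its exact formula, so treat `ownershipConcentration` as a hypothetical helper rather than the audit's method.

```typescript
type Concentration = {
  distinctAuthors: number;
  topAuthor: string | null;
  topAuthorShare: number; // fraction of commits by the most active author
};

// `commitAuthors` holds one entry per commit touching the paths of interest.
function ownershipConcentration(commitAuthors: readonly string[]): Concentration {
  const counts = new Map<string, number>();
  for (const author of commitAuthors) {
    counts.set(author, (counts.get(author) ?? 0) + 1);
  }

  let topAuthor: string | null = null;
  let topCount = 0;
  counts.forEach((count, author) => {
    if (count > topCount) {
      topAuthor = author;
      topCount = count;
    }
  });

  return {
    distinctAuthors: counts.size,
    topAuthor,
    // An empty history has no concentration to report.
    topAuthorShare: commitAuthors.length === 0 ? 0 : topCount / commitAuthors.length,
  };
}
```

Read this way, a high distinct-author count and a high top share are compatible: many people can have touched the pipeline once while one person still makes nearly all changes.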

T3 Exit Cleanliness

Engineering Org Resilience

No single-author critical paths: git-blame concentration, CODEOWNERS coverage, and reviewer diversity across the codebase.

19% 6/10 scored
  • Critical-path bus factor 0%
    0/5 expected sites
  • Ownership clarity 0%
    0/1 expected sites
  • Documentation density ("why") 0%
    0/6 expected sites not present
  • Operational runbooks 0%
    0/3 expected sites not present
  • Onboarding reproducibility 111%
    4/3 expected sites
  • Decision history legibility 0%
    0/2 expected sites not present
Critical-path bus factor 0%

The repo shows partial safeguards against critical-path bus-factor risk: there is substantial executable knowledge in deployment tests, but ownership is centralized by default in CODEOWNERS (single default owner) and there are no organizational runbooks/ADRs present (per org-artefacts scan). The deployment pipeline itself (queue worker, deploy/cancel utilities, control-plane router, GitHub webhook trigger) is critical-path; however, I could not confirm distributed co-ownership for each specific critical component from code-only evidence, so only the presence of strong deployment tests was confidently identified as a durability mechanism.

  • high

    Add explicit multi-owner coverage for critical-path directories (deployment queue, deploy/cancel utilities, deployment router, GitHub webhook trigger) in CODEOWNERS, ensuring at least 2 humans (>=3 if possible for the most critical surfaces like deploy + incident response).

    • .github/CODEOWNERS:1-3 — Default owners are centralized to a single user, which undermines critical-path co-ownership guarantees.
  • high

    Introduce operational runbooks for the deployment pipeline (what to check when deployments fail, where to find logs, common failure modes, rollback/cancel procedures).

  • med

    Map each critical-path code site to a corresponding test suite and requirement checklist (e.g., webhook->queue job creation->worker execution->status updates->log tailing), and ensure tests assert the most failure-prone branches and error messages.

Single-author hotspots N/A

No single-author hotspots were detected. In the last 12 months, the git-history hotspots scan returned an empty `danger_files` list (i.e., no high-churn files were simultaneously limited to ≤2 lifetime distinct authors). Therefore, there were no concrete file sites to verify via code inspection.

  • med

    Re-run the hotspots scan for a different window (e.g., 6 months and 24 months) to confirm no emergence of new single/dual-owner gravity wells, then inspect any newly flagged danger files with `code_read` to ensure intent is captured in tests/docs.

Review diversity N/A

This primitive is about *review/merge process diversity* (i.e., whether work lands via PRs and is integrated by multiple people). In this repo, git-history signals show that PRs exist and are merged by many humans (distinct_mergers_human=27), but no artifact or config file in the codebase (e.g., CODEOWNERS rules or workflow files) implements or enforces review diversity. Since the audit requires file+line evidence for “present in the codebase,” the primitive is treated as absent for the purposes of this report.

  • high

    Add/verify process artifacts that make review diversity enforceable at the repo level (e.g., branch protection rules requiring PRs, CODEOWNERS to spread ownership, and required status checks). Then ensure PR merges involve multiple human integrators rather than a single gatekeeper.

  • med

    If review diversity is already happening in practice, capture it in repo configuration so it is auditable from the codebase (e.g., required reviewers, CODEOWNERS, and CI checks).

Ownership clarity 0%

An ownership manifest exists (.github/CODEOWNERS), but it does not provide ownership clarity as defined: it assigns all paths to one default owner, defines no per-critical-path ownership groups, and therefore cannot satisfy the >=2-people requirement. No correctly applied ownership-clarity sites were found.

  • high

    Replace the single default CODEOWNERS entry with explicit ownership mappings for the repo’s critical path segments (at least apps/* and packages/* subtrees), and ensure each critical mapping lists 2+ people (or a team handle) so knowledge is not concentrated behind one individual.

    • .github/CODEOWNERS:1-3 — Current state: only a wildcard default owner for everything, with a single handle; no per-critical-path or multi-owner mapping.
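
The "2+ people (or a team handle)" rule can be checked mechanically. The sketch below is a simplified, hypothetical linter (`findUnderOwnedRules` is not part of the repo): it treats everything after `#` as a comment, does not handle escaped `#` characters in patterns, and counts any `@org/team` handle as satisfying the rule on its own.

```typescript
type OwnershipGap = { line: number; pattern: string; owners: string[] };

// Returns every CODEOWNERS rule that lists fewer than `minOwners` individuals
// and no team handle. Simplified parsing: see the assumptions in the lead-in.
function findUnderOwnedRules(codeowners: string, minOwners: number = 2): OwnershipGap[] {
  const isTeamHandle = (owner: string): boolean => /^@[^/\s]+\/\S+$/.test(owner);
  const gaps: OwnershipGap[] = [];

  codeowners.split(/\r?\n/).forEach((rawLine, index) => {
    const line = rawLine.replace(/#.*$/, "").trim();
    if (line === "") {
      return; // blank line or comment-only line
    }
    const parts = line.split(/\s+/);
    const pattern = parts[0] ?? "";
    const owners = parts.slice(1);
    const satisfied = owners.length >= minOwners || owners.some(isTeamHandle);
    if (!satisfied) {
      gaps.push({ line: index + 1, pattern, owners });
    }
  });

  return gaps;
}
```

Run in CI against `.github/CODEOWNERS`, a check like this would fail on a file that contains only a single-handle wildcard rule and pass once each critical path names two people or a team.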
Retained vs. departed knowledge N/A

This primitive is not implemented as an artifact/mechanism in the codebase. While git-history signals indicate a non-trivial departed-authorship share (recency-based proxy), there is no code/runbook/ownership process artifact here that specifically captures “retained vs departed knowledge” (i.e., ensures critical knowledge remains after authors leave).

  • high

    Create and maintain knowledge-capture artifacts for any critical areas with elevated departed authorship risk: add/expand ownership manifest (more than one owner per area), and add runbook/ADR-style rationale for operational and architectural decisions so knowledge is not tied to single authorship history.

    • .github/CODEOWNERS:1-3 — Only one default owner is listed; without additional owners/knowledge artifacts, the project is vulnerable to knowledge concentration when that person becomes unavailable.
  • med

    Add onboarding checklists and service setup/run instructions for critical paths that explicitly include: where the authoritative operational understanding lives, what tests/commands validate behavior, and who the current co-owners are.

    • CONTRIBUTING.md:1-197 — The contributing guide contains setup/build/test expectations, but does not provide critical-path operational/runbook knowledge capture or a retained-vs-departed continuity plan.
Documentation density ("why") 0%

Across the repo’s tracked documentation artifacts, the content is primarily “how to run/configure/contribute” (setup steps, endpoints, environment variables, minimal READMEs). I did not find durable architecture/design rationale documentation that explains the system’s decisions (“why”), so the documentation density primitive does not appear to be correctly implemented anywhere in the codebase.

  • high

    Create durable architecture/decision “why” docs (e.g., an ADR folder and at least one architecture/design overview) and link them from the root README and CONTRIBUTING. Ensure they cover major components (API, scheduler, monitoring, server setup) and explain tradeoffs, not just instructions.

    • README.md:1-65 — Root README lacks architecture/design rationale and only points to external docs.
    • CONTRIBUTING.md:1-120 — Contribution guide is process-focused and does not point to durable architecture/decision rationale.
  • high

    Augment each critical service README (API, schedules, monitoring) with a short “Architecture / Why this design” section describing core design choices, constraints, and integration rationale (e.g., callback/threshold design, scheduling semantics, API structure).

  • med

    Add a minimal “documentation map” (what docs exist, where to look for ‘why’, how to update them when changing architecture). Put it in README/CONTRIBUTING so new contributors learn the durable rationale locations.

Operational runbooks 0%

Operational runbooks are not present as tracked artifacts anywhere in the repository (the artifact scan reports no "runbook" category). While the code contains critical operational workflows (deployment webhook/queueing, restore pipelines, and the Compose service layer), there are no corresponding written runbooks to guide deployment, incident response, or recovery.

  • high

    Create runbooks for each critical service/workflow that can be incident-triggered: (1) Compose deployment webhook/queueing (include replay/retry procedures, watch-path/branch mismatch handling, and queue/job inspection), (2) Compose/database restore procedure (include rclone pipeline expectations, DB-type-specific credentials/verification, and remote-vs-local execution notes), and (3) the Compose service lifecycle (how to validate compose/service loading and diagnose remote exec/compose spec issues).

  • med

    Add an ownership/coverage manifest that names runbook owners (and backup owners) for the runbook-covered workflows, and ensure those owners actually participate in maintaining the docs (to mitigate the gravity-well risk of a single person “who just knows”).

    • .github/CODEOWNERS:1-3 — A CODEOWNERS file exists; use it (or extend it) to cover runbook responsibilities rather than relying on implicit knowledge.
Onboarding reproducibility 111%

Onboarding reproducibility is partially present: there is real written onboarding material (CONTRIBUTING.md) that includes a runnable setup entrypoint and the commands to reach a local dev server, and the referenced setup script exists in code. However, the 'clean clone to productive' flow is not purely one-command (it requires at least `dokploy:setup`, plus additional commands like `server:script` and `dokploy:dev`), so ramp-up may still rely on knowing which follow-up commands/options matter most.

  • high

    Add a single canonical 'from clean clone to productive' command in the onboarding docs (e.g., `pnpm run dokploy:up` that internally runs setup + migrations + starts the dev server, or clearly document that the experience is inherently multi-step and why).

    • CONTRIBUTING.md:103-123 — Docs require multiple commands after `pnpm run dokploy:setup` (`server:script`, then `dokploy:dev`), so it’s not strictly one-command reproducible.
  • med

    Link the docs to the exact setup semantics (what the setup script does/doesn’t cover). For example, clarify in CONTRIBUTING.md which parts are Docker swarm/network/traefik/Redis/Postgres initialization vs. which parts are app boot/migrations, so new engineers don’t need a person to infer gaps.

    • apps/dokploy/setup.ts:1-40 — The setup script performs infrastructure initialization and pulls traefik, but the docs still call extra commands afterwards—clarifying the division of responsibility will improve reproducibility.
  • low

    Add a short 'known-good' verification checklist to onboarding (e.g., expected log lines, health checks, or a smoke test endpoint) to make the doc path objectively verifiable without narration.

    • CONTRIBUTING.md:108-123 — Docs provide the access URL but not objective verification signals beyond 'go to localhost:3000'.
Tests as executable knowledge N/A

The codebase has a substantial and meaningful test suite under `apps/dokploy/__test__`, and tests act as executable knowledge: they include detailed assertions about key business logic (environment variable resolution, template processing including secret/JWT generation, and deployment workflow behavior). Based on the sampled files read, test intent is captured in runnable form rather than only smoke-level checks.

  • med

    Pick the highest-business-risk flows (e.g., deployment/template processing endpoints) and ensure each has at least one focused regression test with clear “inputs → expected outputs” assertions (similar to the env/template suites) plus one integration-style test (like the deploy real test) that covers the workflow wiring.

  • low

    For any remaining high-level ‘real’ tests (that rely on Docker/filesystem/exec), standardize naming and comments to document what is intentionally mocked vs. executed for real, to keep the executable knowledge durable over time.

Decision history legibility 0%

The repo shows a convention for commit message formatting (CONTRIBUTING), but durable decision records (ADRs) are absent, and the key setup/infrastructure code paths carry no decision rationale that could be reliably reconstructed after the original authors leave. Therefore, this primitive is treated as not applied anywhere in this codebase.

  • high

    Create ADRs (or equivalent durable decision records) for the major infrastructure/setup decisions: (1) why swarm/network are initialized with the specific address/network settings, (2) why the setup order in apps/dokploy/setup.ts is the chosen dependency order.

  • med

    Ensure commit history consistently carries WHY (not only WHAT) for setup/infrastructure changes: require explanatory bodies for PR commits touching docker/swarm/network/traefik/dependencies.

    • apps/dokploy/setup.ts:1-40 — This file is the likely target for future infrastructure changes; without strong decision-history legibility, edits become risky.
  • low

    Add brief in-code rationale comments at the decision points (e.g., address choice, network driver choice, idempotency approach) as a fallback to complement history/ADRs.

Not applicable to this codebase: Single-author hotspots, Review diversity, Retained vs. departed knowledge, Tests as executable knowledge.

IP & OSS License Hygiene

An SBOM in CI, no AGPL/GPLv3 in the dependency tree, CVEs triaged by severity, and no outside-contributor commits without IP assignment.

27% 11/12 scored
  • Software bill of materials 0%
    0/3 expected sites not present
  • License compliance 17%
    1/2 expected sites
  • Known-vulnerability scan 0%
    0/2 expected sites not present
  • Known-exploited CVEs 0%
    0/2 expected sites
  • Dependency usage & reachability 50%
    1/2 expected sites
  • Dependency freshness 22%
    2/3 expected sites
  • Upstream maintenance 0%
    0/3 expected sites not present
  • Remediation velocity 0%
    0/3 expected sites not present
  • Supply-chain integrity 108%
    5/4 expected sites
  • Dependency-confusion resistance 100%
    4/3 expected sites
  • IP ownership / provenance 0%
    0/2 expected sites not present
Software bill of materials 0%

I did not find any SBOM generation practice wired into the codebase (no syft/cyclonedx/spdx-style generation script referenced in package.json, and no SBOM-related filenames like cyclonedx/syft/spdx were detectable in the code graph). The repo does have committed dependency manifests/lockfiles (pnpm-lock.yaml and Go go.mod), but the primitive (producing and publishing an SBOM as a release artifact) appears absent.

  • high

    Add an SBOM generation script and wire it into CI/release. For example: generate CycloneDX or SPDX using syft (or equivalent) for the pnpm workspace and the Go module(s), then publish the resulting SBOM artifact (e.g., sbom.json / sbom.spdx.json) per release.

    • package.json:1-79 — No SBOM generation script or release hook exists in the root workspace scripts.
  • high

    Ensure the SBOM is based on the exact pinned dependency graphs: pnpm-lock.yaml for the npm ecosystem and the Go module graph for apps/monitoring. Validate that the SBOM covers direct + transitive dependencies and matches the resolved dependency inventory from lockfiles.

  • med

    Add a CI check that fails the build if SBOM artifact generation is missing or empty, and (optionally) compare SBOM contents against the repository’s lockfile-based inventory to prevent drift.
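
The drift check in the last recommendation could be sketched as follows. This is a minimal illustration, assuming the SBOM and the lockfile have each already been reduced to a flat list of name/version pairs; `SbomComponent`, `diffSbomAgainstLockfile`, and `sbomGatePasses` are hypothetical names, not part of Dokploy or of any SBOM tool.

```typescript
// Hypothetical flat inventory entry; real SBOM and lockfile formats carry more fields.
type SbomComponent = { name: string; version: string };

type DriftReport = {
  missingFromSbom: string[]; // resolved by the lockfile but absent from the SBOM
  extraInSbom: string[]; // listed in the SBOM but not resolved by the lockfile
};

const componentKey = (c: SbomComponent): string => `${c.name}@${c.version}`;

function diffSbomAgainstLockfile(
  sbom: SbomComponent[],
  lockfile: SbomComponent[],
): DriftReport {
  const sbomKeys = new Set(sbom.map(componentKey));
  const lockKeys = new Set(lockfile.map(componentKey));
  return {
    missingFromSbom: Array.from(lockKeys).filter((k) => !sbomKeys.has(k)).sort(),
    extraInSbom: Array.from(sbomKeys).filter((k) => !lockKeys.has(k)).sort(),
  };
}

// A CI gate would fail on an empty SBOM or on drift in either direction.
function sbomGatePasses(sbom: SbomComponent[], lockfile: SbomComponent[]): boolean {
  if (sbom.length === 0) return false;
  const report = diffSbomAgainstLockfile(sbom, lockfile);
  return report.missingFromSbom.length === 0 && report.extraInSbom.length === 0;
}
```

Keying on `name@version` keeps the comparison strict: a version bump in the lockfile that is not reflected in the SBOM shows up as both a missing and an extra entry.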

License compliance 17%

License compliance is partially present (lockfiles exist for both npm and Go), but the dependency license scan placed node-forge@1.3.3, declared as 'BSD-3-Clause OR GPL-2.0', in the strong-copyleft tier. Because this is an `OR` expression, the licensee may elect the BSD-3-Clause option, so the flag reflects a worst-case reading. Until that election is recorded, the GPL-2.0 alternative can still complicate deal terms for a proprietary SaaS, and attribution/NOTICE obligations need to be confirmed either way. Additionally, repository-level LICENSE/NOTICE files were not found at the repo root in this audit run (evidence gap).

  • high

Either record and document an explicit election of the BSD-3-Clause option for node-forge@1.3.3, or replace the dependency with an alternative that carries a single permissive license. Re-run the license scan to confirm the strong-copyleft flag is resolved.

    • pnpm-lock.yaml:1-16685 — Strong-copyleft flagged dependency: node-forge@1.3.3 detected as 'BSD-3-Clause OR GPL-2.0' (tier: strong-copyleft).
  • high

    Verify and add/restore release attribution artifacts: ensure a NOTICE file and/or a complete third-party licenses bundle exists (and is current with lockfile changes). This should cover all transitive dependencies, including any that require attribution.

    • pnpm-lock.yaml:1-16685 — Shipping third-party components in a proprietary SaaS carries attribution/NOTICE obligations; no root LICENSE/NOTICE files were available in this run (evidence gap).
  • med

    Document the license compliance process in CI/release (e.g., fail builds on strong/network copyleft tiers; generate an SBOM + third-party licenses report during release).

    • pnpm-lock.yaml:1-16685 — Lockfile is present, but enforcement/documentation evidence was not found in this run; add CI gates around the same license scan logic.
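
A CI license gate of the kind recommended above might look like the following sketch. The tier table is a small illustrative subset, the tier names are assumptions modeled on the wording of this section, and the parser only handles flat `OR`/`AND` expressions; none of this reflects the scanner actually used in the audit.

```typescript
type Tier = "permissive" | "weak-copyleft" | "strong-copyleft" | "network-copyleft" | "unknown";

// Illustrative subset only; a real policy would cover the full SPDX license list.
const TIER_BY_LICENSE = new Map<string, Tier>([
  ["MIT", "permissive"],
  ["Apache-2.0", "permissive"],
  ["BSD-3-Clause", "permissive"],
  ["MPL-2.0", "weak-copyleft"],
  ["GPL-2.0", "strong-copyleft"],
  ["GPL-3.0", "strong-copyleft"],
  ["AGPL-3.0", "network-copyleft"],
]);

// Higher rank means more restrictive; "unknown" ranks highest so it is never silently accepted.
const RANK: Record<Tier, number> = {
  "permissive": 0,
  "weak-copyleft": 1,
  "strong-copyleft": 2,
  "network-copyleft": 3,
  "unknown": 4,
};

function tierOf(licenseId: string): Tier {
  return TIER_BY_LICENSE.get(licenseId.trim()) ?? "unknown";
}

function effectiveTier(expression: string): Tier {
  const expr = expression.trim().replace(/^\((.*)\)$/, "$1");
  if (expr.includes("(")) return "unknown"; // nested groups are out of scope for this sketch
  const alternatives = expr.split(/\s+OR\s+/).map((alternative) => {
    // AND: every license applies, so the most restrictive one wins.
    const tiers = alternative.split(/\s+AND\s+/).map(tierOf);
    return tiers.reduce((a, b) => (RANK[a] >= RANK[b] ? a : b));
  });
  // OR: the licensee may choose, so the least restrictive alternative wins.
  return alternatives.reduce((a, b) => (RANK[a] <= RANK[b] ? a : b));
}

// The gate fails on strong-copyleft, network-copyleft, and unknown licenses.
function licenseGatePasses(expression: string): boolean {
  return RANK[effectiveTier(expression)] < RANK["strong-copyleft"];
}
```

Under this elective reading, `BSD-3-Clause OR GPL-2.0` resolves to permissive, since an SPDX `OR` lets the licensee choose. A scanner that reports the worst-case alternative would instead place it in the strong-copyleft tier, which matches the flag described above.
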
Known-vulnerability scan 0%

I did not find any wiring that performs a known-vulnerability scan as part of the repo’s automation. The root package.json has no vulnerability-scan script or command hook. Separately, osv-scanner reports a large number of HIGH/CRITICAL OSV findings against the dependency lockfiles, so a scan would be meaningful, but the repo does not appear to apply the primitive.

  • high

    Add a CI job (e.g., GitHub Actions) that runs a lockfile-based vulnerability scan (osv-scanner/osv) over all relevant manifests/lockfiles (pnpm-lock.yaml and apps/monitoring/go.mod), and fails the build when there are un-triaged HIGH/CRITICAL findings (plus produce a SARIF/artifact report).

  • med

    Create/standardize a dedicated script (e.g., `pnpm run security:vuln-scan`) that executes osv-scanner against the repo lockfiles, and document the triage workflow for each HIGH/CRITICAL finding (remediate vs. exception with justification).

    • package.json:1-79 — Root scripts are currently focused on build/test/lint only; add a dedicated security script to ensure consistent execution.
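
The "fail on un-triaged HIGH/CRITICAL findings" rule can be expressed as a small pure function that a CI script calls after the scanner has run. The sketch below assumes the scanner output has already been mapped into a simplified `Finding` shape; that shape, the `TriageException` record, and `untriagedBlockers` are hypothetical and do not mirror osv-scanner's actual output schema.

```typescript
type Severity = "LOW" | "MEDIUM" | "HIGH" | "CRITICAL";

// Hypothetical, simplified finding; not osv-scanner's real JSON schema.
type Finding = { id: string; pkg: string; severity: Severity };

// A recorded exception needs a justification and an expiry date (YYYY-MM-DD).
type TriageException = { id: string; justification: string; expires: string };

function untriagedBlockers(
  findings: Finding[],
  exceptions: TriageException[],
  today: string, // YYYY-MM-DD, compared lexicographically
): Finding[] {
  const activeExceptions = new Set(
    exceptions
      .filter((e) => e.justification.trim() !== "" && e.expires >= today)
      .map((e) => e.id),
  );
  return findings.filter(
    (f) =>
      (f.severity === "HIGH" || f.severity === "CRITICAL") &&
      !activeExceptions.has(f.id),
  );
}
```

A CI step would exit non-zero whenever the returned list is non-empty. Expiring exceptions force each accepted risk to be revisited rather than lingering indefinitely.
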
Known-exploited CVEs 0%

The 'known-exploited CVEs' hygiene primitive is applicable and was effectively executed at scan time: osv-scanner’s known-exploited detector returned known_exploited_count=0. However, I did not find a concrete in-repo implementation of this primitive (e.g., a CI step/config) to cite—only the off-graph scan results and the presence of lockfiles/manifests.

  • high

    Add/confirm a CI gate that runs osv-scanner (or equivalent) in the known-exploited mode on every PR and fails the build if known_exploited=true findings appear; ensure it covers pnpm-lock.yaml and Go modules.

    • pnpm-lock.yaml:1-120 — This lockfile is the required anchor for the known-exploited check; wire a CI job to scan it.
    • apps/monitoring/go.mod:1-35 — The Go manifest is another required dependency anchor; wire it into the same known-exploited scan gate.
    • osv_dep_scan(mode=vulns):n/a — Scan result indicates known_exploited_count=0, but without a cited CI implementation, this is not guaranteed to be enforced over time.
Dependency usage & reachability 50%

The codebase shows concrete on-graph reachability for at least one major dependency surface: drizzle-orm is imported and used directly to construct the application DB handle (packages/server/src/db/index.ts). The remaining checks (unused/phantom dependencies and call-site reachability) could not be fully enumerated in this run: the call-site queries (receiver→callee resolution) returned no results for specific receivers, likely because of graph modeling/receiver-binding behavior. Audit coverage is therefore partial rather than a clean, comprehensive mapping.

  • high

    Extend reachability validation beyond drizzle-orm by running call-site-based queries for each frequently imported external library (e.g., axios, hono, vitest, protobufjs), then reading (code_read) the highest-importance call sites to confirm vulnerable-function reachability (not just import presence).

  • med

    Create an explicit “declared-but-never-imported” and “imported-but-undeclared/phantom” review list by diffing manifest deps (pnpm lock/importers + go.mod) against virgil_query raw_import for each external package family, then confirm removals/manifest corrections with code_read of the relevant module boundaries.

    • apps/monitoring/go.mod:1-21 — Go manifest declares fiber and other modules; ensure virgil_query raw_import shows actual imports in the monitoring app code paths (currently only reachability for TS/ORM was confirmed).
    • pnpm-lock.yaml:1-60 — pnpm lock declares app/API dependencies; reachability diff requires comparing these declared deps to raw_import usage in source.
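
The declared-versus-imported diff described above can be sketched without any graph tooling, assuming the declared dependency names and the raw import specifiers have already been collected. `packageNameOf` and `diffDeclaredVsImported` are hypothetical helpers; project path aliases (for example a specifier starting with `@/`) would need to be filtered out by the caller.

```typescript
// Reduce an import specifier to the package that must be declared,
// or null for relative paths, absolute paths, and Node built-ins.
function packageNameOf(specifier: string): string | null {
  if (
    specifier.startsWith(".") ||
    specifier.startsWith("/") ||
    specifier.startsWith("node:")
  ) {
    return null;
  }
  const parts = specifier.split("/");
  if (specifier.startsWith("@")) {
    // Scoped packages keep their first two segments: @scope/name
    return parts.length >= 2 ? `${parts[0]}/${parts[1]}` : null;
  }
  return parts[0] ?? null;
}

function diffDeclaredVsImported(declared: string[], importSpecifiers: string[]) {
  const declaredSet = new Set(declared);
  const imported = new Set<string>();
  for (const specifier of importSpecifiers) {
    const pkg = packageNameOf(specifier);
    if (pkg !== null) imported.add(pkg);
  }
  return {
    declaredButNeverImported: declared.filter((d) => !imported.has(d)).sort(),
    importedButUndeclared: Array.from(imported)
      .filter((i) => !declaredSet.has(i))
      .sort(),
  };
}
```

Entries in `declaredButNeverImported` are review candidates rather than automatic removals, since some packages are used only by tooling or configuration and never appear in an import statement.
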
Dependency freshness 22%

Dependency freshness is only partially implemented: the repo commits lockfiles/go.mod with pinned versions (good for determinism), but it lacks evidence of an active dependency-update mechanism (update bot not configured) and OSV findings show critical/high vulnerabilities tied to pinned versions (notably Fiber v2.52.6 in apps/monitoring/go.mod). Overall, freshness hygiene exists as “pinning,” but not as “ongoing remediation,” so it’s weak.

  • high

    Upgrade the pinned runtime dependency github.com/gofiber/fiber/v2 from v2.52.6 to a fixed version (per OSV/GHSA advisories), and re-lock. Start with Fiber because the vuln scan flags CRITICAL issues on the exact pinned version.

  • high

    Enable and operationalize an automated dependency update mechanism (dependabot/renovate) and ensure update PRs actually merge (remediation velocity).

    • git history (tool output):N/A — git_dep_provenance shows update_bot_configured=false (no bot configured). Without a mechanism, lockfile freshness tends to decay.
  • med

    Add/verify CI steps that generate and publish SBOM/CVE/License reports on each release and (ideally) on PRs, so freshness is continuously measured—not just pinned.

    • pnpm-lock.yaml:1-30 — pnpm lockfile is present, but CI freshness/SBOM generation could not be confirmed from the provided evidence set; introduce explicit CI gates for freshness.
Upstream maintenance 0%

Upstream-maintenance hygiene (i.e., actively detecting and replacing deprecated/abandoned upstream dependencies) is not evidenced anywhere in the codebase in a concrete way. While dependencies are pinned via go.mod and pnpm-lock.yaml files, there is no demonstrable upstream-maintenance control/verification wired into these dependency sources.

  • high

    Add an upstream-maintenance gate to CI that fails the build (or opens an automated PR) when dependencies are deprecated/abandoned upstream or no longer maintained. Concretely, run an OSV/deprecation/abandonment check over the resolved lockfile(s) (Go + pnpm) and block merges until replacements are proposed.

  • med

    Ensure remediation velocity is actionable: enable and configure Renovate/Dependabot (or equivalent) and require successful lockfile-upgrade PRs for any deprecated/abandoned upstream hits.

Remediation velocity 0%

No automated dependency-update mechanism is evident in this codebase: no dependabot/renovate config files were found in the repo. The primitive therefore cannot be verified as an operating mechanism, even though git history appears to show dependency-update activity.

  • high

    Add/enable an automated dependency-update bot (Dependabot and/or Renovate) with CI/workflow configuration, and ensure it is actually triggered (config present in-repo).

    • package.json:1-79 — Current root configuration does not include any bot/workflow mechanism for dependency updates.
  • high

    Verify end-to-end velocity: ensure dependency-update PRs are created and merged regularly (especially within the last 90 days), then confirm the number of merged dependency-update commits remains non-zero.

    • package.json:1-79 — No evidence of an update-automation pipeline is present in repository configuration files read so far.
  • med

    Ensure CI includes a dependency scanning/SBOM generation step so that upgrade blast radius stays visible and the CVE backlog is remediated on time (a precondition for meaningful velocity).

    • package.json:1-79 — No CI/SBOM/scan steps are defined in package.json; rely on CI workflows to implement this.
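
The "merged dependency-update commits in the last 90 days" signal mentioned above can be computed from a plain commit list. The sketch below is a heuristic under stated assumptions: the `CommitInfo` shape is hypothetical, and the bot names and message prefixes in `isDependencyUpdate` are conventions to adapt per repository, not something observed in Dokploy's history.

```typescript
type CommitInfo = { message: string; author: string; mergedAt: string }; // mergedAt: ISO 8601

// Heuristic only: adjust bot names and message prefixes to the repository's conventions.
function isDependencyUpdate(commit: CommitInfo): boolean {
  const botAuthor = /dependabot|renovate/i.test(commit.author);
  const conventionalPrefix = /^(chore|build|fix)\(deps[^)]*\):/i.test(commit.message);
  return botAuthor || conventionalPrefix;
}

function dependencyUpdatesInWindow(
  commits: CommitInfo[],
  now: Date,
  windowDays = 90,
): number {
  const end = now.getTime();
  const start = end - windowDays * 24 * 60 * 60 * 1000;
  return commits.filter((commit) => {
    const mergedAt = Date.parse(commit.mergedAt);
    return isDependencyUpdate(commit) && mergedAt >= start && mergedAt <= end;
  }).length;
}
```

A result of zero over the window is the condition the recommendation asks to rule out.
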
Supply-chain integrity 108%

Supply-chain integrity is present: the JS ecosystem uses committed pnpm lockfiles with explicit integrity hashes (`resolution.integrity`), and the Go monitoring service uses pinned go.mod versions backed by go.sum content hashes. Lockfile integrity verification is in good shape. The remaining caveat is that the codebase appears to use multiple lockfiles (root + nested), and the audit did not confirm a single unified CI enforcement point in this pass.

  • high

    Confirm CI/build uses the committed lockfiles for installation (e.g., `pnpm install --frozen-lockfile` for both root and nested lockfile scopes, and `go mod download` with go.sum verification).

    • pnpm-lock.yaml:1-60 — Evidence of committed pinning/integrity-capable lockfile; ensure CI actually enforces frozen usage.
  • med

    For multi-lockfile setups (root pnpm + nested pnpm under packages/server/src/emails), document and standardize which workflows/commands target which lockfile to avoid drift or accidental installs that bypass one lockfile.

Dependency-confusion resistance 100%

Dependency-confusion resistance appears implemented primarily via committed, pinned lockfiles (root pnpm-lock.yaml and a package-specific pnpm-lock.yaml for server emails) plus explicit, fully-qualified Go module dependencies in apps/monitoring/go.mod. I did not find evidence of unscoped private names/typo-squatted package specs in the lockfile sections reviewed, and the workspace dependency is correctly treated as a local link rather than a registry package.

  • high

    Also read each package.json (and any .npmrc/pnpmrc specifying registries) for unscoped private dependencies, typo-similar names, or floating version ranges (e.g., ^/*) that could allow resolution drift beyond the lockfile’s guarantees.

    • pnpm-lock.yaml:1-120 — Lockfile pinning is present, but the primitive’s full verification requires confirming the corresponding manifests (package.json) don’t contain unscoped/private or ambiguous dependency specs.
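
The manifest review recommended above could be partly automated. The following sketch assumes the `dependencies` map of a package.json has been loaded and that the team maintains a list of internal package names; `auditManifest`, `ManifestFinding`, and the `internalNames` input are hypothetical.

```typescript
type ManifestFinding = { dependency: string; spec: string; reason: string };

function auditManifest(
  dependencies: Record<string, string>,
  internalNames: string[], // names the organization treats as private (assumed input)
): ManifestFinding[] {
  const internal = new Set(internalNames);
  const findings: ManifestFinding[] = [];
  for (const dependency of Object.keys(dependencies)) {
    const spec = String(dependencies[dependency]).trim();
    // Local protocols never hit a registry, so they cannot be confused.
    if (/^(workspace|link|file):/.test(spec)) continue;
    if (internal.has(dependency) && !dependency.startsWith("@")) {
      findings.push({ dependency, spec, reason: "unscoped internal name" });
    }
    if (spec === "" || spec === "*" || spec === "latest") {
      findings.push({ dependency, spec, reason: "unbounded version" });
    } else if (/^[\^~><]/.test(spec)) {
      findings.push({ dependency, spec, reason: "floating range" });
    }
  }
  return findings;
}
```

Floating ranges are not wrong on their own when a frozen lockfile is enforced; they are flagged here because they become the effective resolution rule whenever an install runs without that lockfile.
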
IP ownership / provenance 0%

I did not find any explicit contributor IP assignment/provenance mechanism (e.g., CLA/contributor agreement) in the repository documentation. The Contributing Guide provides contribution workflow guidance but does not include any CLA/assignment terms, so the IP ownership / provenance primitive is not demonstrably implemented in a durable way in this codebase.

  • high

    Add/enable a concrete contributor IP assignment mechanism (e.g., CLA assistant or DCO + explicit IP license/assignment), and document it in CONTRIBUTING.md (requirements, how to sign, enforcement on PRs).

  • high

    Create a dedicated legal artifact (e.g., CLA.md / Contributor Agreement) referenced from CONTRIBUTING.md, including terms for inbound IP assignment/license and how exceptions are handled.

    • CONTRIBUTING.md:1-197 — Currently contains workflow/setup/build guidance but no links or statements about contributor agreement/IP assignment.
  • med

    Add a short README section that points contributors to the CLA/contributor agreement page (or a link to it from the contributing section).

    • README.md:1-65 — README does not include any pointer to contributor IP assignment/provenance terms.
AI-coding-tool provenance N/A

AI features exist in the codebase (e.g., AI provider selection and UI wiring), but there is no evidence that AI-coding provenance is tracked or that AI-generated code is labeled/attributed via a documented convention (no AI-usage/provenance policy or generated-code markers observed). Per this primitive’s rubric, this should be treated as N/A for actionable site matching because the required provenance-tracking machinery is not present.

  • high

    Add an explicit AI-coding provenance policy and convention (e.g., required PR description and/or file header/trailer for AI-generated/assisted code; include how to record prompts, model/provider, and review sign-off).

  • med

    Introduce repository-level generated-code/provenance markers (examples: standardized comment header for files or block-level markers indicating AI assistance; optionally enforce via lint/CI checks).

Not applicable to this codebase: AI-coding-tool provenance.

Implementation & Customization

Configuration over per-customer branches: no "if customer_id == 12345", no pricing literals scattered outside the billing module.

76% 8/10 scored
  • Configuration over code branches 100%
    3/3 expected sites
  • Centralized pricing/plan logic 33%
    1/3 expected sites
  • Metering decoupled from pricing model 0%
    0/4 expected sites not present
  • Feature gating via flags, not forks 100%
    6/6 expected sites
  • Customization isolation & upgrade safety 100%
    5/4 expected sites
  • Theming / white-label as config 100%
    7/7 expected sites
  • Tenant-configurable behavior surface 100%
    3/3 expected sites
  • Onboarding-by-configuration cost 75%
    4/4 expected sites
Configuration over code branches 100%

This codebase applies the “configuration over code branches” primitive for at least one concrete variation surface: whitelabeling/branding. Branding differences (meta title, favicon, and custom CSS) are stored in a structured whitelabelingConfig JSON configuration, served via API endpoints, and injected into the client via a provider component—supporting different instances/customizations without creating divergent code paths.

  • high

    Audit other variation surfaces beyond whitelabeling (e.g., billing/entitlements, feature availability, deployment templates) and refactor any remaining plan/customer-specific behavior to be driven from configuration/state models similar to whitelabelingConfig.

  • med

    If more customization knobs are expected, extend the existing webServerSettings.whitelabelingConfig JSON schema (and its TRPC input/output validation) rather than introducing new UI branches.

No hardcoded customer branching N/A

No hardcoded customer/tenant/org/account ID branching (e.g., `if customerId === 123`) was found. Where tenant identity is used, it is for data scoping/authorization via variable values from session/context (e.g., `orgId`, `ctx.session.activeOrganizationId`), which is the correct approach.

  • low

    Keep validating future changes by searching for direct comparisons of identity fields against literals (e.g., `customerId === <number|string>`, `tenantId === '<literal>'`, `orgId === '<literal>'`) in business-logic/business-layer code.
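
The search described above could run as a lightweight CI check. The regular expression below is a rough heuristic, not a parser: it only catches an identity field on the left-hand side compared against a numeric or quoted literal, and the list of field names is an assumption to extend as needed.

```typescript
// Matches e.g. `customerId === 12345`, `tenant_id == 'acme'`, `orgId !== "x"`.
const IDENTITY_LITERAL_COMPARISON =
  /\b(?:customer|tenant|org|organization|account)_?id\s*(?:===?|!==?)\s*(?:\d+|'[^']*'|"[^"]*")/i;

function findLiteralIdentityComparisons(
  source: string,
): { line: number; text: string }[] {
  return source
    .split("\n")
    .map((text, index) => ({ line: index + 1, text: text.trim() }))
    .filter((entry) => IDENTITY_LITERAL_COMPARISON.test(entry.text));
}
```

Comparisons between two variables, such as a session organization ID checked against a request input, are deliberately not matched, since that is the data-scoping pattern the audit considers correct.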

Centralized pricing/plan logic 33%

A centralized pricing/plan module exists at `apps/dokploy/server/utils/stripe.ts` (tier definitions + `getStripeItems`). However, pricing/plan rules are still duplicated elsewhere: the billing UI re-implements price calculations, and the Stripe webhook re-encodes Startup included-server rules with local constants. The router’s checkout flow is correctly wired to the centralized module, but overall pricing logic is not fully centralized end-to-end.

  • high

    Remove/replace the duplicated client-side pricing math in `show-billing.tsx` with calls to the centralized pricing module (or expose a shared “pricing preview” helper from `server/utils/stripe.ts` and reuse it on the client).

  • high

    Update `pages/api/stripe/webhook.ts` to compute included server quantity using the centralized tier/price mapping (e.g., reuse Startup base price IDs and included-server quantity from `server/utils/stripe.ts`).

  • med

    Ensure all tier identification/detection paths rely on centralized constants (expand the set of exported constants/rules from `server/utils/stripe.ts` as needed, and remove any local tier-identification logic elsewhere).
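
One way to close the duplication is a single pricing module that the checkout router, the billing UI, and the webhook all import. The sketch below is illustrative only: every tier name and amount is a placeholder rather than Dokploy's real pricing, and `monthlyPriceCents` and `includedServers` are hypothetical helpers, not the contents of `server/utils/stripe.ts`.

```typescript
// All tier names and amounts below are placeholders, not real pricing.
type TierId = "hobby" | "startup";

type TierRule = {
  basePriceCents: number;
  includedServers: number;
  extraServerCents: number;
};

const TIER_RULES: Record<TierId, TierRule> = {
  hobby: { basePriceCents: 500, includedServers: 1, extraServerCents: 400 },
  startup: { basePriceCents: 1500, includedServers: 3, extraServerCents: 300 },
};

// Used by the webhook instead of local constants.
function includedServers(tier: TierId): number {
  return TIER_RULES[tier].includedServers;
}

// Used by both checkout and the billing UI preview, so the math lives in one place.
function monthlyPriceCents(tier: TierId, servers: number): number {
  if (!Number.isInteger(servers) || servers < 1) {
    throw new RangeError("servers must be a positive integer");
  }
  const rule = TIER_RULES[tier];
  const extraServers = Math.max(0, servers - rule.includedServers);
  return rule.basePriceCents + extraServers * rule.extraServerCents;
}
```

Because the module is pure data plus arithmetic, it can be imported by client code for previews without pulling in any Stripe SDK dependency.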

Metering decoupled from pricing model 0%

The codebase does not implement a metering layer that captures usage generically and maps it to charges in a separate billing layer. Instead, Stripe subscription price IDs and webhook events are used to compute entitlement quantities (serversQuantity) and immediately drive core behavior (server activation/inactivation and user entitlement fields). This is indicative of pricing/plan mechanics being coupled to core product logic rather than decoupled via a generic metering subsystem.

  • high

    Introduce a generic metering/usage capture component (e.g., record server usage events or current usage counters) that writes usage records without any Stripe pricing knowledge. Then implement a billing/mapping component that converts usage meters to charges/entitlements, and finally have core apply entitlements from that billing result.

  • med

    Move plan determination away from inline Stripe price-id checks in core/routers. Instead, compute entitlements centrally (from billing mapping) and expose a stable entitlement interface (e.g., maxServers) to the rest of the app.

  • low

    Add tests around the separation boundary: (a) metering correctness independent of Stripe, (b) billing mapping correctness independent of core activation logic, and (c) core entitlement application behavior independent of Stripe pricing structures.
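
The three-layer separation recommended above can be sketched as follows. All type and function names are hypothetical apart from `serversQuantity` and `maxServers`, which come from this section; the point is only that each layer can be tested without the others.

```typescript
// Layer 1: metering records usage and knows nothing about plans or prices.
type UsageSample = { orgId: string; meter: "active_servers"; value: number; at: string };

function currentUsage(samples: UsageSample[], orgId: string): number {
  let latest: UsageSample | null = null;
  for (const sample of samples) {
    if (sample.orgId !== orgId) continue;
    // ISO 8601 timestamps in the same zone sort lexicographically.
    if (latest === null || sample.at > latest.at) latest = sample;
  }
  return latest === null ? 0 : latest.value;
}

// Layer 2: billing maps a subscription to entitlements. Only this layer sees plan data.
type SubscriptionState = { planId: string; serversQuantity: number };
type Entitlements = { maxServers: number };

function entitlementsFor(subscription: SubscriptionState | null): Entitlements {
  return { maxServers: subscription ? subscription.serversQuantity : 0 };
}

// Layer 3: core logic consumes the stable entitlement interface only.
function canActivateServer(usage: number, entitlements: Entitlements): boolean {
  return usage < entitlements.maxServers;
}
```

With this shape, a pricing change touches only `entitlementsFor`, while server activation logic continues to ask a plan-agnostic question.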

Feature gating via flags, not forks 100%

The codebase has a strong, centralized entitlement-gating approach for enterprise features: a reusable `EnterpriseFeatureGate` component for the UI, plus `enterpriseProcedure` and router-level checks for backend enforcement. Enterprise-locked pages and modules (whitelabeling, SSO, audit logs) consistently use the gate rather than introducing forked per-plan/per-customer logic.

  • med

    For proprietary routers, prefer consistently using `enterpriseProcedure` (or a single shared server-side guard) where feasible, to reduce duplicated `hasValidLicense(...)` checks (e.g., compare `audit-log.ts` with the `enterpriseProcedure` pattern).
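
A single shared server-side guard, as suggested above, could take the form of a higher-order wrapper. The sketch below is framework-agnostic and synchronous for brevity; `withEnterpriseGuard`, `GuardContext`, and the injected `LicenseCheck` are hypothetical, since the audit does not show the real signatures of `enterpriseProcedure` or `hasValidLicense`.

```typescript
type GuardContext = { organizationId: string };
type GuardedHandler<I, O> = (ctx: GuardContext, input: I) => O;

// Hypothetical license lookup injected by the caller.
type LicenseCheck = (organizationId: string) => boolean;

class EnterpriseRequiredError extends Error {
  constructor() {
    super("Enterprise license required");
    this.name = "EnterpriseRequiredError";
  }
}

// Wraps any handler so the license check runs in exactly one place.
function withEnterpriseGuard<I, O>(
  check: LicenseCheck,
  handler: GuardedHandler<I, O>,
): GuardedHandler<I, O> {
  return (ctx, input) => {
    if (!check(ctx.organizationId)) throw new EnterpriseRequiredError();
    return handler(ctx, input);
  };
}
```

Routers that currently repeat an inline license check would instead export handlers already wrapped by the guard, so a missed check becomes a visible omission in code review.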

Documented extension interface N/A

No documented extension/plugin interface (a stable, versioned contract for customer/partner extension isolated from core) was found. The codebase instead handles variation (providers/source types and webhook behavior) via core conditional logic and hardcoded component registrations, which implies extension requires code changes rather than config-driven plugin registration.

  • high

    Introduce a documented extension contract (interfaces + registration mechanism) for deploy/webhook handling and provider implementations, so new providers or webhook behaviors can be added by registering an implementation instead of editing core `if/else` logic.

  • med

    Refactor the provider UI (and any provider-specific server logic) to consume a registry of provider definitions/components rather than hardcoded imports + union types.
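
A registry-based extension contract of the kind recommended here might start as small as the sketch below. `SourceProvider`, `ProviderRegistry`, and the `github-like` example are all hypothetical; the branch extraction assumes a push payload that carries a `ref` of the form `refs/heads/<branch>`.

```typescript
interface SourceProvider {
  readonly id: string;
  // Returns the branch to deploy for a webhook payload, or null to ignore it.
  resolveBranch(payload: Record<string, unknown>): string | null;
}

class ProviderRegistry {
  private readonly providers = new Map<string, SourceProvider>();

  register(provider: SourceProvider): void {
    if (this.providers.has(provider.id)) {
      throw new Error(`provider already registered: ${provider.id}`);
    }
    this.providers.set(provider.id, provider);
  }

  get(id: string): SourceProvider {
    const provider = this.providers.get(id);
    if (!provider) throw new Error(`unknown provider: ${id}`);
    return provider;
  }

  ids(): string[] {
    return Array.from(this.providers.keys()).sort();
  }
}

// Illustrative provider; the payload shape is an assumption.
const githubLikeProvider: SourceProvider = {
  id: "github-like",
  resolveBranch(payload) {
    const ref = payload["ref"];
    const prefix = "refs/heads/";
    return typeof ref === "string" && ref.startsWith(prefix)
      ? ref.slice(prefix.length)
      : null;
  },
};
```

The webhook handler then calls `registry.get(providerId).resolveBranch(payload)` instead of branching on provider type, and the provider UI can render from `registry.ids()`.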

Customization isolation & upgrade safety 100%

This codebase implements customization isolation for whitelabeling in a largely upgrade-safe, config-driven way. Whitelabeling is centralized behind a dedicated API router and applied through a provider that injects branding/meta/CSS from persisted configuration, rather than forking core UI logic per customer.

  • high

    Add/verify sanitization and safety controls around customCss since it is injected via dangerouslySetInnerHTML. Ensure the contract clearly defines what CSS is allowed so upgrades and security posture remain consistent across versions.

  • med

    Document the whitelabelingConfig schema as a stable, versioned contract (fields, expected formats, backward-compat rules). This reduces the chance that future core upgrades break older saved customer configurations.

  • low

    If whitelabeling is intended to be tenant-scoped (multiple tenants), confirm the persistence layer (getWebServerSettings/updateWebServerSettings) truly scopes config per tenant rather than using a single global value.
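
The first two recommendations can be combined in a tolerant reader for saved configuration. The sketch below is an assumption-heavy illustration: the `schemaVersion` field, the 20,000-character limit, and the denylist patterns are invented for the example, and a denylist is not a complete CSS sanitizer. Only the field names are taken from this report.

```typescript
const WHITELABEL_FIELDS = [
  "metaTitle",
  "faviconUrl",
  "customCss",
  "loginLogoUrl",
  "footerText",
] as const;
type WhitelabelField = (typeof WHITELABEL_FIELDS)[number];

type WhitelabelingConfig = { schemaVersion: number } & Record<WhitelabelField, string | null>;

// Conservative denylist for CSS injected into a <style> tag; illustrative, not exhaustive.
const FORBIDDEN_CSS = [/<\/style/i, /@import/i, /expression\s*\(/i, /javascript:/i];

function isCssAllowed(css: string): boolean {
  return css.length <= 20000 && !FORBIDDEN_CSS.some((pattern) => pattern.test(css));
}

// Tolerant reader: unknown fields are ignored, missing fields become null,
// and configs saved before versioning existed are treated as version 1.
function parseWhitelabelingConfig(raw: unknown): WhitelabelingConfig {
  const input = (typeof raw === "object" && raw !== null ? raw : {}) as Record<string, unknown>;
  const version = input["schemaVersion"];
  const config: WhitelabelingConfig = {
    schemaVersion: typeof version === "number" ? version : 1,
    metaTitle: null,
    faviconUrl: null,
    customCss: null,
    loginLogoUrl: null,
    footerText: null,
  };
  for (const field of WHITELABEL_FIELDS) {
    const value = input[field];
    if (typeof value === "string") config[field] = value;
  }
  if (config.customCss !== null && !isCssAllowed(config.customCss)) {
    config.customCss = null; // drop unsafe CSS rather than render it
  }
  return config;
}
```

Reading old configurations through one function gives future upgrades a single place to add migrations keyed on `schemaVersion`.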

Theming / white-label as config 100%

The codebase implements white-label theming as a persisted, data-driven configuration (whitelabelingConfig) with TRPC endpoints and React hooks. Branding is applied via a dedicated WhitelabelingProvider that injects runtime CSS and document metadata, and key public/error auth surfaces read config via useWhitelabelingPublic(). This supports onboarding new partners by updating configuration rather than forking builds.

  • low

    Consider adding a small set of automated checks/tests to ensure all required themable fields (metaTitle, faviconUrl, customCss, loginLogoUrl, errorPageTitle/Description, footerText) are consistently read on each branded surface after UI refactors.

Tenant-configurable behavior surface 100%

Tenant-configurable behavior surface exists and is implemented as a configuration model for whitelabeling/branding. The system provides an owner-gated mutation to update persisted `whitelabelingConfig`, a public read to expose branding fields to unauthenticated pages, and UI components that render onboarding branding from that configuration.

  • high

    Audit other customer-requested variation areas (e.g., workspace limits, feature rules, workflow fields) for the same pattern: a persisted settings/rules model + centralized update/read APIs + consumption points in UI/business logic. Whitelabeling is present; verify whether other behavior requests are still hardcoded in code paths.

  • med

    Confirm that non-public read paths and all rendering entrypoints consistently use the configuration model (avoid any lingering hardcoded defaults that require code edits for further customization).

Onboarding-by-configuration cost 75%

The codebase supports low-touch onboarding by relying on generic, self-serve flows (register + invitation acceptance) and tenant provisioning via data-layer mutations (organization creation inserts DB records). Additionally, onboarding UI branding is driven by whitelabeling configuration rather than per-customer code forks. Overall, this aligns with “onboarding-by-configuration cost” as new customers/orgs appear to be onboarded by creating/updating data, not editing or forking code.

  • high

    Add/confirm a documented, self-serve onboarding runbook that explicitly states: (1) how a new org/tenant is created (API/DB provisioning flow), (2) how invitations are issued and accepted, and (3) what whitelabeling configuration options are required for brand onboarding—so onboarding is operational/config, not engineering.

  • med

    Verify and centralize the whitelabeling configuration source-of-truth (used by onboarding UI) and ensure it supports all onboarding-relevant brand fields without requiring code changes.

Not applicable to this codebase: No hardcoded customer branching, Documented extension interface.

Procurement Code Readiness

Data-export and data-subject erase/export endpoints, region pinning, and DPA-mapped controls that survive enterprise procurement.

0% 7/10 scored
  • Self-serve trust documentation 0%
    0/1 expected sites
  • Controls-to-contract mapping 0%
    0/1 expected sites not present
  • Data export mechanism 0%
    0/4 expected sites not present
  • Deletion / erase-on-request 0%
    0/3 expected sites not present
  • Data residency commitment 0%
    0/3 expected sites not present
  • Enterprise access controls 0%
    0/2 expected sites not present
  • Sub-processor transparency 0%
    0/3 expected sites not present
Self-serve trust documentation 0%

A single committed doc exists (SECURITY.md), but it is limited to vulnerability disclosure expectations. The repository does not provide a self-serve trust documentation set suitable for procurement deal-closing (certifications/attestations, DPA/contract commitments, versioned sub-processor transparency artifact, pen-test summaries, and operational control/status evidence are not packaged in the trust docs).

  • high

    Create or expand a prospect-facing trust-center doc set (e.g., docs/trust/ + a trust landing page) that self-serves the standard procurement artifacts: current SOC 2/ISO attestations (with version/date), DPA/contract commitments (or the current DPA/terms), a maintained versioned sub-processor list that prospect reviewers can reconcile to the service’s actual integrations, pen-test/security assessment summaries, and control/status evidence (and how it is kept current).

    • SECURITY.md:1-29 — Current content covers vulnerability reporting but does not package the required trust artifacts for self-serve procurement diligence.
  • med

    Add a dedicated maintained sub-processor transparency document (in docs/trust or similar) that is explicitly prospect-consumable (not just code templates) and includes versioning/date, and ensure the entries match the third-party integrations actually used.

    • packages/server/src/templates/processors.ts:1-20 — A ‘processors’ template file is present, but it is implementation code and not a prospect-consumable, maintained sub-processor inventory doc; a real trust-list doc should exist alongside this.
Questionnaire response library N/A

No questionnaire response library (CAIQ/SIG/VSA response bank) is present in the repository. Per the primitive’s definition, this is a DATA-ROOM artifact; its absence here is expected. Request the current, versioned questionnaire response set from the seller for procurement diligence.

  • high

    Ask the seller’s GC / R&W underwriter for the current, versioned security questionnaire response library (e.g., CAIQ/SIG/VSA) mapped to the relevant frameworks/controls and aligned to the system versions in production.

    • git_artifact_scan:n/a — `questionnaire` category count=0 (absent). This is a DATA-ROOM follow-up; do not treat as a code gap.
Controls-to-contract mapping 0%

The controls-to-contract mapping primitive is not packaged in this codebase. The only relevant doc-adjacent security artifact found (SECURITY.md) does not include any DPA/MSA mapping of commitments to implemented controls and audit evidence. No DPA/MSA/legal mapping artifact was found, so there is nothing to grade for deal-closing traceability.

  • high

    Create/locate the seller’s DPA/MSA controls-to-contract mapping document (controls mapping table) that explicitly maps each DPA commitment (e.g., encryption, retention, breach notice, data residency) to (a) the implemented system mechanism(s) and (b) the audit evidence artifact(s) (e.g., SOC 2 Type II control tests, configuration evidence, logs/reports). Ensure it is versioned and cross-references any code-visible enforcement points.

    • SECURITY.md:1-29 — Current doc content does not contain the required mapping/traceability statements; serves as evidence of what is currently packaged.
  • high

    Add a packaged traceability section to the trust/security documentation set that names the DPA/MSA commitments and points reviewers to the controls mapping document and its evidence sources (e.g., SOC 2 report version, audit evidence index).

    • SECURITY.md:1-29 — Shows the repository’s only trust/security doc and indicates where the mapping should be integrated or linked.
Data export mechanism 0%

No complete tenant-scoped 'export all my data on request' mechanism was found in the codebase. The only 'download/export' behaviors observed are partial: feature/user downloads (2FA backup codes), view-specific downloads (Docker logs), and backup/restore tooling for specific artifacts (volume backups, web-server backup jobs) rather than a packaged, tenant-scoped export endpoint/job covering ALL tenant data in a portable format.

  • high

    Add a tenant-scoped 'export all data' request flow: (1) a protected API/TRPC endpoint that initiates an async export job for the tenant; (2) job execution that gathers all tenant-owned data across products/modules; (3) a portable output format (e.g., tenant data JSON/CSV + media bundles) packaged into a downloadable archive; (4) completion status + secure download link; (5) explicit pagination/streaming and size limits.

  • high

    Ensure tenant scoping and permission checks are explicit in the export job scheduler and data retrieval layer (e.g., require tenantId/orgId context and enforce membership).

  • med

    Wire the export mechanism into the UI as a single 'Download my data' action that triggers the tenant export job rather than providing multiple partial feature downloads.
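
The request flow outlined in the first recommendation can be sketched with an in-memory job store. Everything here is hypothetical (`TenantExportService`, `ModuleCollector`, the JSON archive format); a real implementation would run the job asynchronously, stream large datasets, and hand out a signed download link instead of holding the archive in memory.

```typescript
type ExportStatus = "queued" | "running" | "completed" | "failed";
type ExportJob = { id: string; orgId: string; status: ExportStatus; archive: string | null };

// Each module contributes a collector that must return only rows owned by the org.
type ModuleCollector = { module: string; collect: (orgId: string) => unknown[] };

class TenantExportService {
  private readonly jobs = new Map<string, ExportJob>();
  private readonly collectors: ModuleCollector[];
  private nextId = 1;

  constructor(collectors: ModuleCollector[]) {
    this.collectors = collectors;
  }

  // Protected entry point: only queues work for an org the caller belongs to.
  request(orgId: string, callerOrgIds: string[]): ExportJob {
    if (callerOrgIds.indexOf(orgId) === -1) {
      throw new Error("caller is not a member of this organization");
    }
    const job: ExportJob = { id: `export-${this.nextId++}`, orgId, status: "queued", archive: null };
    this.jobs.set(job.id, job);
    return job;
  }

  // Gathers every module's data for the org and packages it in a portable format.
  run(jobId: string): ExportJob {
    const job = this.jobs.get(jobId);
    if (!job) throw new Error(`unknown export job: ${jobId}`);
    job.status = "running";
    try {
      const data: Record<string, unknown[]> = {};
      for (const collector of this.collectors) {
        data[collector.module] = collector.collect(job.orgId);
      }
      job.archive = JSON.stringify({ orgId: job.orgId, data });
      job.status = "completed";
    } catch {
      job.status = "failed";
    }
    return job;
  }
}
```

Registering one collector per module makes coverage auditable: a module without a collector is a visible gap in the "export all my data" promise.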

Deletion / erase-on-request 0%

The codebase contains many generic “delete/remove” operations (e.g., deleting projects and backup records), but there is no verifiable, tenant/subject-scoped erase-on-request implementation that demonstrably cascades through backups/derived data and ties the deletion to an auditable erase request. The primary evidence found is consistent with row deletions rather than an erase-on-request primitive.

  • high

    Implement a dedicated, customer/data-subject-initiated erase workflow (e.g., `eraseOnRequest(subjectId|tenantId, requestId)`) that (1) validates authorization, (2) determines all data domains/derived stores/backups to remove for that tenant/subject, (3) executes a verified cascade (including backup/object-store cleanup), and (4) records an auditable, end-to-end deletion status report keyed by the erase request ID.

  • high

    Augment deletion handlers to explicitly remove associated backup artifacts in storage (or create a job that does so) and link that action to the erase request/audit log; do not rely on deleting only DB metadata for backups.

  • med

    Add/extend automated tests that prove cascade behavior: when an erase request is executed for a tenant/subject, derived tables and backup/storage artifacts are absent afterward (or marked as deleted) and the final deletion result is recorded against the request ID.
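
The erase workflow and its cascade test can be illustrated together. The sketch below assumes three hypothetical in-memory stores (`dbRows`, `backupMeta`, `objectStore`) in place of the real database and object storage; `eraseOnRequest` follows the signature suggested above but is not an existing function.

```typescript
interface EraseReport {
  requestId: string;
  tenantId: string;
  removed: Record<string, number>; // store name -> number of items removed
  complete: boolean;               // true only if no tenant data remains anywhere
}

// Hypothetical stores, each mapping an item key to its owning tenant.
const dbRows = new Map<string, string>([["p1", "t1"], ["p2", "t2"]]);
const backupMeta = new Map<string, string>([["b1", "t1"]]);
const objectStore = new Map<string, string>([
  ["backups/b1.tar", "t1"],
  ["backups/b9.tar", "t2"],
]);
const eraseLog: EraseReport[] = [];

// Remove every item owned by the tenant and report how many were removed.
function purge(store: Map<string, string>, tenantId: string): number {
  let removed = 0;
  for (const [key, owner] of Array.from(store)) {
    if (owner === tenantId) {
      store.delete(key);
      removed++;
    }
  }
  return removed;
}

function eraseOnRequest(tenantId: string, requestId: string, authorized: boolean): EraseReport {
  if (!authorized) throw new Error("erase request not authorized");
  const stores = { dbRows, backupMeta, objectStore };
  const removed: Record<string, number> = {};
  for (const [name, store] of Object.entries(stores)) {
    removed[name] = purge(store, tenantId);
  }
  // Verification pass: the cascade is complete only if nothing is left behind.
  const complete = Object.values(stores).every(
    (store) => !Array.from(store.values()).includes(tenantId),
  );
  const report: EraseReport = { requestId, tenantId, removed, complete };
  eraseLog.push(report); // auditable record keyed by the erase request ID
  return report;
}
```

The point of the design is that backup objects are purged in the same request as the metadata rows, and the report records a verified result rather than assuming the deletes succeeded.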

Data residency commitment 0%

No end-to-end “data residency commitment” mechanism was found. While the codebase contains a `region` field, it is used only for configuring S3 backup/destination connectivity (passed to `rclone` as `--s3-region`). There is no evidence of tenant/org region pinning plus region-keyed routing that enforces where tenant data and compute run.

  • high

    Add a tenant-scoped residency attribute (org/tenant “data residency region”) and enforce it end-to-end: data placement (storage location/bucket/DB/cluster) and request routing should be keyed off this tenant residency value, with checks at boundaries (API/middleware/router) so cross-region writes/reads are blocked.

  • high

    Update backup/export logic to enforce residency rather than only target S3 region. If backups must also be residency-bound, derive the required region from the tenant residency setting (not only from a per-destination credential field) and validate compliance at backup upload time.

  • med

    Add explicit audit/traceability for residency enforcement (e.g., log the tenant residency region used for placement/routing decisions and persist it with job/run metadata) so procurement reviewers can verify enforceability.
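
A residency check at backup upload time, with the decision persisted for reviewers, might look like the following. This is a sketch under stated assumptions: the region names, `Tenant.residencyRegion`, `checkResidency`, and `uploadBackup` are hypothetical, and only the destination-level `region` field corresponds to something the audit observed.

```typescript
type Region = "eu-central" | "us-east"; // hypothetical region names

interface Tenant {
  id: string;
  residencyRegion: Region; // hypothetical tenant-level residency attribute
}

interface BackupDestination {
  id: string;
  region: string; // mirrors the per-destination S3 region field noted above
}

interface ResidencyDecision {
  tenantId: string;
  destinationId: string;
  required: Region;
  actual: string;
  allowed: boolean;
}

// Every decision is recorded, whether it allowed or blocked the operation.
const residencyAudit: ResidencyDecision[] = [];

function checkResidency(tenant: Tenant, destination: BackupDestination): ResidencyDecision {
  const decision: ResidencyDecision = {
    tenantId: tenant.id,
    destinationId: destination.id,
    required: tenant.residencyRegion,
    actual: destination.region,
    allowed: destination.region === tenant.residencyRegion,
  };
  residencyAudit.push(decision);
  return decision;
}

// The required region comes from the tenant, not from the destination credentials.
function uploadBackup(tenant: Tenant, destination: BackupDestination, payload: string): string {
  const decision = checkResidency(tenant, destination);
  if (!decision.allowed) {
    throw new Error(
      `residency violation: tenant requires ${decision.required}, destination is ${decision.actual}`,
    );
  }
  return `uploaded ${payload.length} bytes to ${destination.id}`;
}
```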

Enterprise access controls 0%

The codebase shows partial IP-related primitives (an IP-in-CIDR helper and Traefik whitelist middleware typing), but there is no evidence that a tenant-configurable IP allowlist is enforced at the request boundary with a corresponding admin surface. Router middleware wiring appears to cover auth and path/redirect behaviors, not network restriction allowlisting.

  • high

    Implement tenant-scoped IP allowlist enforcement at the Traefik edge boundary: add middleware generation that writes a Traefik `ipAllowList` middleware (named `ipWhiteList` in Traefik v2) based on tenant configuration, and attach it to the router during `createRouterConfig` (or an equivalent per-tenant router builder).

  • high

    Add an admin surface for managing the allowlist (per tenant) that persists the CIDRs and triggers propagation to the dynamic Traefik config (local and remote/serverId paths), so procurement can obtain an auditable control mapping.

  • med

    If CDN IP handling is intended to be part of access control, wire it into the tenant allowlist enforcement path (instead of only providing helper functions and hardcoded CDN ranges). Ensure the allowlist logic is actually used to accept/deny requests at the edge.
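
Generating the per-tenant middleware from stored CIDRs could be sketched as below. The helper names (`isValidCidr`, `buildAllowlistMiddleware`) and the middleware naming scheme are assumptions; the emitted shape follows Traefik's dynamic configuration for the `ipAllowList` middleware with its `sourceRange` option (Traefik v2 used the key `ipWhiteList`). Only IPv4 CIDRs are validated here to keep the example short.

```typescript
const CIDR_V4 = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\/(\d{1,2})$/;

// Accepts IPv4 CIDR notation only; each octet must be <= 255 and the prefix <= 32.
function isValidCidr(cidr: string): boolean {
  const match = CIDR_V4.exec(cidr);
  if (!match) return false;
  const octetsOk = match.slice(1, 5).every((octet) => Number(octet) <= 255);
  return octetsOk && Number(match[5]) <= 32;
}

// Builds a dynamic-config fragment plus the middleware name a router would reference.
function buildAllowlistMiddleware(tenantId: string, cidrs: string[]) {
  const invalid = cidrs.filter((cidr) => !isValidCidr(cidr));
  if (invalid.length > 0) {
    throw new Error(`invalid CIDR(s): ${invalid.join(", ")}`);
  }
  const name = `allowlist-${tenantId}`; // hypothetical naming scheme
  return {
    name,
    config: {
      http: {
        middlewares: {
          [name]: { ipAllowList: { sourceRange: cidrs } },
        },
      },
    },
  };
}
```

A router builder would then add the returned `name` to the router's `middlewares` list, so the allowlist is enforced at the edge rather than inside application code.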

Sub-processor transparency 0%

The repository does not provide evidentiary, versioned sub-processor transparency artifacts suitable for closing a DPA procurement clause. The only repo artifact surfaced under “subprocessors” is `packages/server/src/templates/processors.ts`, but its contents are unrelated to maintaining a sub-processor list (it’s a template processing utility). Code does include third-party integrations (e.g., Stripe and AI provider SDKs), but there is no corresponding packaged, versioned sub-processor inventory available in the repo to match those integrations to a DPA clause.

  • high

    Create/restore the expected maintained, versioned sub-processor inventory artifact (e.g., under `docs/subprocessors/` or a committed `SUBPROCESSORS.md`) and make it explicitly DPA-backed (including: version/date, named sub-processors, and what they do / data categories as applicable). Ensure new third-party processor additions trigger an update to this list.

  • high

    Cross-check the new/updated sub-processor list against actual third-party SDK usage in code (at minimum: Stripe integrations, and the AI-provider selection/invocation utilities). Add any missing third parties to the inventory and bump the version/date.
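
The cross-check between dependencies and the inventory lends itself to a small CI script. The sketch below is illustrative only: the `PROCESSOR_PACKAGES` mapping is a hypothetical, hand-maintained table (the `openai` entry is an example, not a claim about which AI SDKs the repo uses), and the inventory shape is assumed.

```typescript
interface SubprocessorInventory {
  version: string;
  updated: string; // ISO date of the last inventory revision
  vendors: string[];
}

// Hypothetical mapping from npm package names to the vendor that would process data.
const PROCESSOR_PACKAGES: Record<string, string> = {
  stripe: "Stripe",
  openai: "OpenAI",
};

// Returns vendors implied by the dependency list that the inventory does not name.
function findUnlistedVendors(dependencies: string[], inventory: SubprocessorInventory): string[] {
  const listed = new Set(inventory.vendors.map((vendor) => vendor.toLowerCase()));
  const unlisted = new Set<string>();
  for (const dependency of dependencies) {
    const vendor = PROCESSOR_PACKAGES[dependency];
    if (vendor !== undefined && !listed.has(vendor.toLowerCase())) {
      unlisted.add(vendor);
    }
  }
  return Array.from(unlisted).sort();
}
```

Failing the build when the returned list is non-empty would force an inventory update (and version bump) whenever a new third-party processor is introduced.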

Compliance attestation readiness N/A

The scan did not find the procurement data-room artifact this primitive requires: a current compliance attestation readiness package (e.g., SOC 2 Type II report plus control-to-code traceability/control mapping). This is expected, because the artifact should be supplied from the seller’s data room, not derived from source code.

  • high

    Request the current SOC 2 Type II report (or an equivalent such as ISO 27001, plus any required pen-test/assurance materials) and the corresponding control-to-code traceability package (the Dim 5 audit evidence) from the seller/data owner. Ensure it is current (latest report period) and includes a clear mapping of each relevant control to implemented mechanisms and evidence artifacts.

  • med

    Ask for versioned documentation that ties the attestation to the specific product/release scope being procured (e.g., what services/tenants are in-scope for the Type II report, and how the control mapping corresponds to that scope).

Reliability / SLA evidence N/A

No packaged Reliability/SLA evidence artifacts (e.g., status page configuration, published SLA terms, or incident postmortems/runbooks) were found in the repository. The git-based evidence scan reports the `status_sla` category as absent. While the codebase contains health-check/monitoring logic (operational mechanics), that does not constitute deal-ready procurement evidence for uptime/SLA track record.

  • high

    Request the seller’s current, published SLA terms and status/uptime reporting artifacts (status page URL or repo config, uptime/availability metrics definition, and any incident/postmortem write-ups) so procurement can map the operational track record.

  • med

    Ask for runbooks/post-incident reports showing reliability handling (e.g., how incidents are detected, triaged, communicated, and resolved) and versioned evidence that these practices are maintained.

Not applicable to this codebase: Questionnaire response library, Compliance attestation readiness, Reliability / SLA evidence.

Reporting & Data Export

Customer-accessible export endpoints (CSV, Parquet, JSON), scheduled exports, and a documented map of emitted events.

36% 6/10 scored
  • On-demand data export 0%
    0/2 expected sites not present
  • Scheduled / recurring exports 71%
    6/7 expected sites
  • In-product reporting / analytics 67%
    3/4 expected sites
  • Documented export / event schema 78%
    3/3 expected sites
  • Export access control & audit 0%
    0/2 expected sites
  • Exit portability / no lock-in 0%
    0/4 expected sites not present
On-demand data export 0%

No on-demand tenant data export primitive was found. The codebase exposes backup/volume-backup APIs (with permission checks and audit logging), but these are infrastructure/data-backup workflows (create/schedule/run/restore), not a tenant-scoped “download/export my data” takeout in portable formats suitable for customer analytics/warehouse/exit.

  • high

    Implement a true tenant-scoped on-demand export/download handler (e.g., a tRPC/HTTP route) that exports the full tenant dataset (all customer-relevant entities) into portable formats (tabular/columnar/structured), with streaming or chunking for large exports.

  • high

    Ensure the export endpoint is authorization-gated at tenant scope (not only service-level), and write an audit log entry on each export request/result (including export scope and output metadata).

  • med

    Add/standardize a portable export format contract (e.g., versioned schema + manifests) and validate completeness coverage against the tenant’s data model before allowing downloads.

Export completeness & fidelity N/A

No code-visible primitive for “Export completeness & fidelity” was found. The repository appears to implement scheduled/manual backups (database + filesystem/volume backup artifacts) with S3 uploads and restore flows, but it does not include a customer-facing, tenant-scoped export endpoint/job that exports the customer’s complete data model with correct types/relationships for round-trippable analytics/warehouse ingestion.

  • high

    Add a true tenant-scoped “data export” mechanism (export endpoint + export job) that serializes ALL exportable entities/fields from the customer’s data model (customer, financial, operational, config, permissions/accounts, integration specs, and historical analytics if applicable) into a portable format, with explicit inclusion/exclusion and stable typing/relations.

  • high

    Ensure the export path is permission-gated and tenant-scoped, and includes auditing of export initiation/completion plus integrity checks (e.g., row counts/hashes) to prevent silent truncation.

  • med

    Define and test an export coverage matrix (entity/field → export output → schema) and add regression tests that diff the export output against the expected data model so that adding/removing entities doesn’t silently break export completeness.
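
The integrity checks and coverage matrix can be combined into an export manifest. The following is a minimal sketch: `buildManifest` and `verifyManifest` are hypothetical names, and the FNV-1a checksum is used only because it needs no imports; it is not cryptographic, and a real export would more likely use SHA-256.

```typescript
type Bundle = Record<string, unknown[]>; // entity name -> exported rows

interface ManifestEntry {
  entity: string;
  rowCount: number;
  checksum: string;
}

// 32-bit FNV-1a over UTF-16 code units; illustrative, not cryptographic.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Fails loudly if any expected entity is absent, then records counts and checksums.
function buildManifest(bundle: Bundle, expectedEntities: string[]): ManifestEntry[] {
  const missing = expectedEntities.filter((name) => bundle[name] === undefined);
  if (missing.length > 0) {
    throw new Error(`export is missing entities: ${missing.join(", ")}`);
  }
  return expectedEntities.map((name) => {
    const rows = bundle[name] ?? [];
    return { entity: name, rowCount: rows.length, checksum: fnv1a(JSON.stringify(rows)) };
  });
}

// Re-derives counts and checksums so a consumer can detect truncation or tampering.
function verifyManifest(bundle: Bundle, manifest: ManifestEntry[]): boolean {
  return manifest.every((entry) => {
    const rows = bundle[entry.entity] ?? [];
    return rows.length === entry.rowCount && fnv1a(JSON.stringify(rows)) === entry.checksum;
  });
}
```

Keeping `expectedEntities` in one checked-in list gives regression tests a single place to fail when a new table is added to the data model but not to the export.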

Large / async export handling N/A

No code-visible mechanism matching the “Large / async export handling” primitive (i.e., an async export job that handles large tenant datasets with streamed output/download and progress/notification for data portability) was found. The repository primarily implements backup/restore and job scheduling for operational volume/database backups, including some streaming of restore logs, but this does not constitute the customer-facing bulk dataset export primitive this audit targets.

  • high

    Confirm whether the product has a customer data export/takeout feature at all (tenant-scoped, permissions + audit), and if so, locate its implementation. If it exists outside the codebase (or under a different term than export/backup/takeout), update the search accordingly (e.g., “takeout”, “data export”, “dump”, “export file”, “download job”).

  • med

    If you intend backups to satisfy this primitive, replace/extend the backup workflow with an export-job pipeline that (1) covers all tenant dataset categories in a portable format, (2) runs asynchronously (job/queue), (3) streams the output (no buffering the whole dataset in-request), and (4) provides progress/notification and a downloadable artifact.
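
Points (3) and (4) above, streaming output and progress reporting, can be illustrated with a cursor-paged export loop. This is a sketch with hypothetical names (`exportInPages`, `fetchFromDataset`); it is synchronous for brevity, whereas a real job would be asynchronous and write to a file or object-store stream.

```typescript
interface Page<T> {
  items: T[];
  nextCursor: number | null; // null signals the last page
}

// Writes one NDJSON chunk per page, so memory use is bounded by the page size.
function exportInPages<T>(
  fetchPage: (cursor: number, limit: number) => Page<T>,
  pageSize: number,
  write: (chunk: string) => void,
  onProgress: (exportedSoFar: number) => void,
): number {
  let cursor: number | null = 0;
  let exported = 0;
  while (cursor !== null) {
    const page: Page<T> = fetchPage(cursor, pageSize);
    if (page.items.length > 0) {
      write(page.items.map((item) => JSON.stringify(item)).join("\n") + "\n");
    }
    exported += page.items.length;
    onProgress(exported);
    cursor = page.nextCursor;
  }
  return exported;
}

// Stand-in data source: five records served through offset-based pages.
const dataset = Array.from({ length: 5 }, (_, index) => ({ id: index + 1 }));

function fetchFromDataset(cursor: number, limit: number): Page<{ id: number }> {
  const items = dataset.slice(cursor, cursor + limit);
  const next = cursor + limit;
  return { items, nextCursor: next < dataset.length ? next : null };
}
```

Because each page is serialized and handed to `write` before the next one is fetched, the whole dataset is never held in memory within a single request.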

Scheduled / recurring exports 71%

Scheduled/recurring execution exists and is implemented via a persistent schedule model (`cronExpression`, `enabled`, `organizationId`), a runner that initializes enabled schedules into recurring queue jobs, and background BullMQ workers that execute them. For “scheduled exports”, the concrete recurring export delivery path is the backup scheduling logic: enabled backup schedules are cron-triggered and run backup routines that use configured destinations (S3 credentials are constructed). Schedule creation/update/delete is permission-gated and audited.

  • high

    Add/verify retry policy and a dead-letter queue (DLQ) on the recurring export queue jobs. Current queue setup removes jobs on completion/failure, but no explicit retry/backoff/DLQ handling is shown in the scheduler/worker wiring.

  • med

    Confirm tenant isolation end-to-end for scheduled execution. The runner queries enabled schedules broadly (no organization filter in the shown bootstrapping), so verify the underlying DB/data model and job payload always confine execution to the owning organization (and that backups/destinations are organization-scoped).
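
The retry and dead-letter behavior recommended above can be modeled without a queue library. The sketch below is a simulation with hypothetical names (`RetryPolicy`, `processWithRetry`): it records the delays a worker would wait instead of sleeping. In BullMQ the equivalent knobs are the `attempts` and `backoff` job options; the dead-letter list here is an application-level stand-in.

```typescript
interface RetryPolicy {
  attempts: number;    // total tries, including the first
  baseDelayMs: number; // delay after the first failure
}

interface RetryOutcome {
  ok: boolean;
  delaysMs: number[]; // delays that would be waited between attempts
}

// Exponential backoff: base, 2x base, 4x base, ...
function backoffDelayMs(policy: RetryPolicy, failedAttempt: number): number {
  return policy.baseDelayMs * 2 ** (failedAttempt - 1);
}

// Runs the handler up to `attempts` times; exhausted payloads go to the dead-letter list.
function processWithRetry<T>(
  payload: T,
  handler: (payload: T) => void,
  policy: RetryPolicy,
  deadLetter: T[],
): RetryOutcome {
  const delaysMs: number[] = [];
  for (let attempt = 1; attempt <= policy.attempts; attempt++) {
    try {
      handler(payload);
      return { ok: true, delaysMs };
    } catch (_error) {
      if (attempt < policy.attempts) {
        delaysMs.push(backoffDelayMs(policy, attempt));
      }
    }
  }
  deadLetter.push(payload); // kept for inspection instead of being silently removed
  return { ok: false, delaysMs };
}
```

Retaining exhausted jobs matters here because the audit notes that the current queue setup removes jobs on failure, which leaves no record of a scheduled backup that never ran.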

Warehouse sync / reverse-ETL N/A

No warehouse-sync / reverse-ETL primitive was found. The repo contains no off-graph warehouse connector configuration artifacts (e.g., dbt/airbyte/fivetran/singer/meltano configs) in the expected `warehouse_sync_config` category, and the codebase appears to focus on infrastructure orchestration plus backups/restores rather than exporting customer data into external BI/warehouse destinations via incremental sync.

  • high

    Add a warehouse-sync reverse-ETL layer with maintained, tenant-scoped connector configs (dbt/airbyte/fivetran/singer/meltano or an equivalent internal sync service), including incremental sync state storage and documented target support.

  • high

    Implement and document a complete data export contract for warehouse sync: supported destinations, sync frequency/incremental semantics, and the exported schema so customers can rely on portable results.

  • med

    Ensure the sync/export path is permission-gated and audited (tenant scope + audit logs for export actions and failures).

In-product reporting / analytics 67%

The repo contains a real in-product reporting/analytics module: customer dashboard pages (e.g., Docker container dashboard) backed by tenant-scoped, permission-gated tRPC queries. The UI supports typical analytics interactions (filter/sort/paginate) and is not just a static admin chart. However, the discovered reporting paths offer no portable “data-out” export; this primitive scores only the in-product reporting itself.

  • high

    Extend the reporting surfaces with explicit customer data-export paths (tenant-scoped, permission-gated, auditable) so “insight out” is possible beyond in-product views.

  • med

    Confirm tenant-scoping and permission enforcement consistently across all reporting dashboards (not only Docker). For the monitoring dashboard page, verify the server-side permission checks complete the flow through to the metric source endpoints.

Event stream completeness N/A

N/A for this primitive in this codebase. While there are runtime event-emission usages (e.g., WebSocket/EventEmitter style 'emit' for log streaming), the repository does not expose a complete, documented internal event catalog for an export/reporting event stream that can be diffed against internal event emit/publish/dispatch/track sites. As a result, there is no implementable 'emitted-vs-documented' completeness loop to audit for drift.

  • high

    Add (and maintain) a documented internal event catalog for the reporting/export event stream (event names + payload schema + versioning) in a doc-adjacent artifact that matches what the backend actually emits (e.g., an AsyncAPI document, an EVENTS.md file, or equivalent), then wire backend emission to that catalog.

  • med

    Implement an internal event emission layer with a single dispatch function (e.g., publish/track wrapper) and ensure all product event occurrences flow through it, so the emitted set can be deterministically compared to the documented catalog.

  • low

    Create an automated CI check that extracts emitted event names from the code (emit/publish/track/dispatch sites) and diffs against the documented catalog to detect drift.
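
The drift check in the last recommendation can be sketched as a regex scan plus a set difference. This is a rough illustration, not a robust parser: `extractEventNames` and `diffCatalog` are hypothetical, the pattern only catches string-literal event names passed directly to `emit`/`publish`/`track`/`dispatch`, and the event names in the sample are invented.

```typescript
// Matches calls such as emit("name"), track('name'), publish(`name`).
const EMIT_CALL = /\b(?:emit|publish|track|dispatch)\(\s*["'`]([\w.:-]+)["'`]/g;

// Returns the sorted, de-duplicated set of literal event names found in the source text.
function extractEventNames(source: string): string[] {
  const names = new Set<string>();
  const pattern = new RegExp(EMIT_CALL.source, "g"); // fresh regex so lastIndex starts at 0
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    if (match[1]) names.add(match[1]);
  }
  return Array.from(names).sort();
}

// undocumented: emitted but not in the catalog; stale: in the catalog but never emitted.
function diffCatalog(emitted: string[], documented: string[]) {
  return {
    undocumented: emitted.filter((name) => !documented.includes(name)),
    stale: documented.filter((name) => !emitted.includes(name)),
  };
}

// Invented sample source, standing in for files read from the repository.
const sampleSource = [
  'socket.emit("deploy.started", payload);',
  "track('backup.completed');",
  'emitter.emit("deploy.started");',
].join("\n");
```

A CI job would fail when either list is non-empty. Routing all emission through a single wrapper, as the medium-priority item suggests, is what makes a scan like this reliable, since dynamically built event names would otherwise be missed.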

Documented export / event schema 78%

This codebase has documented schema artifacts for consumer integration via a maintained, versioned OpenAPI specification (openapi.json) generated from the live API router and synced through CI. However, this appears to be API contract documentation rather than an explicit event-catalog schema (e.g., an AsyncAPI-style event catalog) covering export/event payloads specifically.

  • high

    Add/maintain an explicit export/event schema catalog (e.g., AsyncAPI or EVENTS.md-style documentation) that enumerates exported event names and their payload shapes, and version it alongside the OpenAPI contract.

    • openapi.json:1-120 — Current documented schema evidence is the OpenAPI spec artifact; no separate, explicitly versioned export/event payload catalog was identified in the doc-adjacent scan output.
  • med

    Ensure the documented schema clearly marks which endpoints correspond to bulk/export or event streams, and include response/payload examples for the egress surfaces.

Export access control & audit 0%

The codebase does include an export-adjacent access control and audit mechanism for backup lifecycle operations: backup create/update/delete and manual backup runs are permission-checked (tenant/service scoping) and consistently written to the audit trail using audit(ctx, ...). However, two data-movement endpoints, the backup file listing (rclone lsjson) and the restore-with-logs streaming endpoint, appear to be permission/tenant-checked but show no audit log writes. That is a gap for an “export access control & audit” primitive.

  • high

    Add audit log writes to the listBackupFiles handler on the actual data-egress/export-like operation path (after permission + org checks, before/after rclone listing). Include action (e.g., 'list'), resourceType (e.g., 'backupFile' or 'destination'), and resourceId (destinationId/serverId if applicable).

  • high

    Add audit log writes to restoreBackupWithLogs for the restore action, using the same audit(ctx, ...) utility. Audit should occur once per restore request (and optionally include destinationId/source identifiers) after permission checks succeed.

  • med

    Create a small internal convention helper (e.g., auditBackupAction(ctx, action, backupId, destinationId)) and use it across all backup/export-like endpoints (create/update/delete/run/list/restore) to reduce drift and ensure every portable data-access path is auditable.
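
The convention helper might take the following shape. This is a sketch: the local `audit` function is a stand-in for the repository's `audit(ctx, ...)` utility, whose real signature is not shown in this report, and `Ctx`, `AuditEntry`, and `restoreBackup` are hypothetical.

```typescript
type BackupAction = "create" | "update" | "delete" | "run" | "list" | "restore";

interface Ctx {
  userId: string;
  organizationId: string;
}

interface AuditEntry {
  userId: string;
  organizationId: string;
  action: BackupAction;
  resourceType: string;
  resourceId: string;
  metadata: Record<string, string>;
}

const auditTrail: AuditEntry[] = [];

// Stand-in for the existing audit utility; the real one presumably persists to a table.
function audit(ctx: Ctx, entry: Omit<AuditEntry, "userId" | "organizationId">): void {
  auditTrail.push({ userId: ctx.userId, organizationId: ctx.organizationId, ...entry });
}

// One helper for every backup/export-like endpoint, so entries share a uniform shape.
function auditBackupAction(
  ctx: Ctx,
  action: BackupAction,
  backupId: string,
  destinationId?: string,
): void {
  audit(ctx, {
    action,
    resourceType: "backup",
    resourceId: backupId,
    metadata: destinationId ? { destinationId } : {},
  });
}

// Hypothetical handler: the audit write happens once, after permission checks succeed.
function restoreBackup(ctx: Ctx, backupId: string, destinationId: string): string {
  // ...permission and organization checks would run here...
  auditBackupAction(ctx, "restore", backupId, destinationId);
  return `restore of ${backupId} started`;
}
```

Typing `action` as a closed union means a new endpoint cannot invent an ad hoc action string without the helper's type being extended, which keeps the audit vocabulary reviewable in one place.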

Exit portability / no lock-in 0%

Exit portability (no lock-in) is not implemented as a complete, customer-accessible full-account data export/takeout mechanism. The codebase has backup/retention automation (DB/compose dumps to destinations) and schedule management with permission checks, but there is no evidence of a tenant-scoped full-account export endpoint/job that guarantees complete, portable data extraction for exit. The available terms file also does not contain an explicit data portability / termination export-rights clause.

  • high

    Add a tenant-scoped, permissioned “full account export / data takeout” API endpoint that triggers a job to export ALL tenant-relevant data (at minimum: services/instances, configurations, volumes/metadata needed to restore, and historical operational data that the product uses). Ensure completeness is verified against the tenant data model (no silent truncation) and export output is in a portable format (e.g., versioned JSON/CSV bundles).

  • high

    Extend/introduce an export-job pipeline that is auditable and tenant-scoped end-to-end: (1) authz + tenant validation at job creation, (2) export execution with streaming/packaging, (3) audit-log writes on job start/completion/failure, and (4) a secure download link or user-notified artifact location.

  • med

    Add an export completeness checklist + tests that assert all tenant entities are included in the export bundle. Use this to prevent drift as the schema evolves.

  • med

    Contract hand-off: confirm with the buyer’s GC that the MSA/terms include a termination/off-boarding data portability clause (export rights prior to lock-in). The current repository terms file does not include such language.

Not applicable to this codebase: Export completeness & fidelity, Large / async export handling, Warehouse sync / reverse-ETL, Event stream completeness.