HYPERFLEET-634 - doc: hyperfleet-api operator guide #76
86254860 wants to merge 1 commit into openshift-hyperfleet:main from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Walkthrough
A new comprehensive API Operator Guide documentation file is added to the repository. The guide covers the HyperFleet API component architecture, including resource models, lifecycle management, status aggregation logic, adapter registration and readiness computation, configuration precedence and settings, deployment guidance, and operational troubleshooting. The document provides technical reference material and practical operational guidance without introducing code changes.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes
🚥 Pre-merge checks: ✅ 3 passed.
Actionable comments posted: 4
🧹 Nitpick comments (1)
docs/api-operator-guide.md (1)
73-87: Add explicit languages to the plain fenced blocks.
These fences are currently unlabeled and will keep tripping MD040. Tagging them as `text` would make the markdownlint warnings go away without changing the rendered content.
Also applies to: 229-235, 396-402, 524-531, 901-903, 907-909
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/api-operator-guide.md` around lines 73 - 87, The fenced code blocks like the one showing the Resource schema (the block beginning with "Resource (e.g., Cluster)" and similar plain blocks elsewhere) are unlabeled and trigger MD040; update each plain triple-backtick fence to include the "text" language tag (e.g., change ``` to ```text) so markdownlint stops flagging them, ensuring you edit the code fences around the Resource/Cluster schema and the other unlabeled blocks referenced in the review.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8c273be3-42dd-4378-a513-f0f629725a1c
📒 Files selected for processing (1)
docs/api-operator-guide.md
> ├── id (UUID, auto-generated)
> ├── kind (Resource "Cluster" or "NodePool")
Align the documented id format with the example.
The guide says id is a UUID, but the sample response uses a non-UUID opaque identifier. Please make these consistent; otherwise client implementations may bake in the wrong validation/parsing rule.
Suggested doc fix:
-├── id (UUID, auto-generated)
+├── id (opaque auto-generated identifier)

or replace the sample value with an actual UUID.
Also applies to: 129-129
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/api-operator-guide.md` around lines 75 - 76, The docs currently state
the "id" field is a UUID but the example uses an opaque non-UUID value; update
the documentation so they match by either (a) changing the field description for
"id" from "(UUID, auto-generated)" to reflect the opaque identifier format used
in the sample, or (b) replace the sample response value with a valid UUID to
match the current "(UUID, auto-generated)" description—ensure you update both
occurrences referenced (the listing that includes "id (UUID, auto-generated)"
and the sample response example) so client validation expectations are
consistent.
> 1. **Desired State (spec)**: When you create or update a resource, you provide a `spec` containing the desired configuration (e.g., cluster region, version, node count). The API stores this as-is without validation or interpretation.
>
> 2. **Automatic Version Tracking (generation)**: Every time you update the `spec`, the API automatically increments the `generation` counter. This allows distributed adapters to detect when they need to reconcile infrastructure changes.
>
> 3. **Observed State (status)**: Adapters report their progress and results back to the API via status endpoints. The API aggregates these reports into unified resource-level conditions (e.g., `Ready`, `Available`).
Clarify whether spec is validated or stored blindly.
Section 2.1 says the API stores spec without validation, but Section 3.7 says spec is validated against the OpenAPI schema. Those behaviors are materially different for operators and API consumers. Please reconcile this and spell out the real rule, e.g. “stored without business-logic interpretation, but schema-validated when a schema is configured.”
Also applies to: 625-629
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/api-operator-guide.md` around lines 100 - 105, The doc is contradictory
about validation of spec; update the text to state clearly that the API stores
the resource spec as-is with no business-logic interpretation (so adapters
perform reconciliation based on generation and status), but that the API will
perform schema validation against the configured OpenAPI schema if one exists.
Edit the paragraphs that mention "stored without validation or interpretation"
and the OpenAPI validation section (references: spec, generation, status,
OpenAPI schema) so both say: spec is persisted unchanged for controller/adapters
to act on, while the API will reject or validate requests against an OpenAPI
schema when a schema is configured; keep generation increment semantics and
status reporting unchanged.
> The secret must contain these fields:
>
> ```
> secrets/db.host → PostgreSQL hostname
> secrets/db.port → PostgreSQL port
> secrets/db.name → Database name
> secrets/db.user → Database username
> secrets/db.password → Database password
> secrets/db.rootcert → (Optional) SSL root certificate
> ```
Fix the secret key names; the current guide contradicts itself.
The “must contain these fields” list uses secrets/db.*, but the creation command and expected output use db.*. The field list should match the actual secret keys, otherwise operators will create the wrong secret shape.
Suggested doc fix:
-secrets/db.host → PostgreSQL hostname
-secrets/db.port → PostgreSQL port
-secrets/db.name → Database name
-secrets/db.user → Database username
-secrets/db.password → Database password
-secrets/db.rootcert → (Optional) SSL root certificate
+db.host → PostgreSQL hostname
+db.port → PostgreSQL port
+db.name → Database name
+db.user → Database username
+db.password → Database password
+db.rootcert → (Optional) SSL root certificate

Also applies to: 661-667, 674-674
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 524-524: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/api-operator-guide.md` around lines 522 - 531, The listed secret field
names use the prefix secrets/db.* which contradicts the rest of the doc that
expects db.*; update the field list to use db.host, db.port, db.name, db.user,
db.password, and db.rootcert so it matches the secret creation command and
expected output, and make the same correction for the other occurrences that
currently show secrets/db.* (the sections showing db.* outputs and examples
should remain unchanged).
> - [ ] Monitor pod startup:
>   ```bash
>   kubectl get pods -n hyperfleet-system -l app=hyperfleet-api -w
>   ```
>   Wait until all pods show `1/1` READY and `Running` status (press Ctrl+C to stop watching)
Avoid hardcoding 1/1 READY in the rollout check.
This will be wrong as soon as the pod includes an additional container, so the guidance should be phrased as “all containers Ready” or use kubectl wait instead of a fixed 1/1 expectation. Based on learnings, the pgbouncer sidecar can be rendered into the deployment when the in-chart PostgreSQL path is used.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/api-operator-guide.md` around lines 719 - 723, Replace the hardcoded
"1/1 READY" rollout check text in the kubectl command guidance (the line
referencing `kubectl get pods -n hyperfleet-system -l app=hyperfleet-api -w`)
with wording that checks that all containers are Ready (e.g., "all containers
Ready") and add an alternative recommended command using `kubectl wait` (mention
`kubectl wait` by name) to wait for pod readiness instead of relying on a fixed
READY fraction so sidecars like pgbouncer don't break the guidance.
> **How registration works:** You define which adapters are required for each resource type to be marked as `Ready`. Adapters are registered at API startup via environment variables or Helm configuration, and the API will not start if this configuration is missing.
>
> Only **registered adapters** participate in status aggregation:
> - The `Ready` condition checks if all registered adapters report `Available=True`
At the current `resource.spec.generation`, that is. This is not the case for adapters reporting `Available=True` for older generations.
> | Type | Meaning |
> |------|---------|
> | **Available** | Adapter's work is complete and operational |
> | **Applied** | Kubernetes resources were created/configured |
May we say "Kubernetes/Maestro resources" and "created/applied"? Because what does it mean for a Kubernetes resource to be "configured"?
> - `status` must be `True`, `False`, or `Unknown`
> - `observed_generation` must be a valid integer
>
> #### 2.3.3 Aggregation logic
I think one subtle missing part here is that aggregated conditions also come with `observed_generation`:
- `Available` can be `True` for an older generation
- `Ready` always refers to the newest generation

It can be VERY confusing for users when they have `Available=True` while `resource.spec.generation != resource.status.condition[Available].observed_generation`, so I think it would be worth adding this comment explicitly.
Also to comment on the meaning of that state... which is to keep the "last known good generation" while a new generation is reconciling.
Because without this reasoning it seems like a random rule (at least for me)
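The "last known good" reasoning above can be made concrete with a small sketch. This is illustrative Python with hypothetical helper names (`available`, `ready`), not the real `ComputeAvailableCondition`/`ComputeReadyCondition` from `pkg/services/status_aggregation.go`:

```python
# Sketch of the semantics discussed: Available may hold at an older
# generation (last known good), while Ready is pinned to the current one.

def available(statuses, required):
    # True if every required adapter reports Available=True at ANY generation
    return all(statuses.get(a, {}).get("status") == "True" for a in required)

def ready(statuses, required, current_generation):
    # True only if every required adapter reports Available=True
    # for the CURRENT spec generation
    return all(
        statuses.get(a, {}).get("status") == "True"
        and statuses.get(a, {}).get("observed_generation") == current_generation
        for a in required
    )

statuses = {
    "dns":   {"status": "True", "observed_generation": 2},  # still on old gen
    "infra": {"status": "True", "observed_generation": 3},
}
required = ["dns", "infra"]
assert available(statuses, required) is True   # last known good preserved
assert ready(statuses, required, 3) is False   # dns hasn't observed gen 3 yet
```

In other words: while generation 3 is reconciling, `Available=True` tells users the generation-2 configuration is still operational, and `Ready=False` tells them the new spec has not fully landed.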
> **When `Available=Unknown` is reported for the first time** (no prior status from this adapter):
> - ✅ Accepts and stores the status
> - ⏭️ Skips status aggregation (resource-level conditions unchanged)
I'm not sure if we skip the aggregation, or the aggregation just does a no-op.
I think this is an implementation detail not worth mentioning here.
The only thing relevant is that getting resource/statuses will contain some conditions with status=Unknown which convey the meaning of "in-progress"
> - Returns `204 No Content` (no action taken)
>
> **When `Available=True` or `Available=False` is reported** (any time):
> - ✅ Always stored and triggers status aggregation
Always is a bit strong.... if an adapter reports for an older observed_generation the response may be discarded if there is already a newer condition stored.
> This pattern allows adapters to signal initial progress without affecting resource readiness, while ensuring eventual convergence to a definitive state (`True` or `False`).
>
> ### 2.4 Generation
Should it be worth mentioning that there is no optimistic locking in place?
When reading about generation, my first intuition is to say: ah, if I want to change a spec I have to provide the current generation in the POST because it will be checked.
But we are not checking anything, so it is worth stating that explicitly just to avoid confusion.
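If the no-locking behavior described here is accurate, it could be modeled like this (hypothetical `update_spec` helper; the assumption that the API ignores any caller-supplied generation is the commenter's claim, not verified code):

```python
# Sketch of last-write-wins updates with NO optimistic locking:
# the caller's expected generation is never compared, every update
# simply bumps the counter.

def update_spec(resource, new_spec, expected_generation=None):
    # expected_generation is deliberately ignored: the API does not check it
    resource["spec"] = new_spec
    resource["generation"] += 1
    return resource["generation"]

resource = {"spec": {"replicas": 1}, "generation": 1}
update_spec(resource, {"replicas": 3}, expected_generation=1)  # client A
update_spec(resource, {"replicas": 5}, expected_generation=1)  # client B, stale view
assert resource["spec"] == {"replicas": 5}  # B silently overwrote A, no 409
assert resource["generation"] == 3
```

Documenting this explicitly would tell concurrent clients they cannot rely on a generation precondition and must coordinate (or tolerate overwrites) themselves.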
> | **401 Unauthorized** | Missing or invalid JWT token | Verify authentication is enabled (`auth.enableJwt=true`). If production, ensure valid JWT token is provided. Reference: [Authentication Guide](authentication.md). |
> | **404 Not Found** | Resource doesn't exist | Verify resource ID is correct. Check if resource was deleted: `curl http://<api-service>:8000/api/hyperfleet/v1/clusters/$CLUSTER_ID`. |
> | **409 Conflict** | Concurrent update or generation mismatch | Retry with exponential backoff. Ensure only one controller updates the same resource. |
> | **500 Internal Server Error** | Database error or unexpected panic | Check API logs: `kubectl logs -n hyperfleet-system -l app=hyperfleet-api --tail=100`. Verify database connectivity with `/readyz` endpoint. |
The troubleshooting row references --db-timeout=60s but this flag does not exist in the codebase. There is no db-timeout flag registered in pkg/config/db.go or anywhere else. Did you mean a different flag, or is this a planned feature?
> ### Phase 4: Post-Deployment Validation
>
> **Verify Service Health**
This kubectl exec command assumes the API container has psql installed and that DB_HOST, DB_USER, DB_NAME are available as environment variables. In the actual deployment, database credentials are mounted as files at /app/secrets/db.host, /app/secrets/db.user, etc., not env vars. The container image is also a minimal Go binary without psql.
Consider replacing this with the pg-debug pod pattern already shown above (line 684), or use:

```bash
kubectl run pg-debug --rm -it --image=postgres:15-alpine \
  --restart=Never -n hyperfleet-system -- \
  psql -h <db-host> -U hyperfleet -d hyperfleet -c "\dt"
```
Mind you that when we change the config to standard form (@yasun1 is on it) there will be no more files for configuring the credentials (🎉)
> **Helm deployment:**
>
> ```yaml
The default bind addresses in the code are localhost:8000, localhost:8080, and localhost:9090 — these bind to loopback only. The doc shows :8000, :8080, :9090 (all interfaces) which is what the Helm chart overrides to, but isn't the built-in default.
This matters for the "Direct binary execution" examples — running ./hyperfleet-api serve without flags will bind to localhost, not all interfaces. The examples here show --api-server-bindaddress=:8000 which works, but the parenthetical "(default: :8000)" is misleading since the actual default is localhost:8000.
> - **Authentication**: See [Authentication Guide](authentication.md)
> - **Helm chart values**: See [Deployment Guide](deployment.md)
>
> ### 3.1 Configuration Overview
I'm worried that we will duplicate the configuration here and then in a configuration.md doc that will contain all the settings
Could we leave out this section in favor of that configuration.md file? (which is pending)
> Production should use `LOG_LEVEL=info` and `LOG_FORMAT=json` for structured logging.
>
> ### 3.7 Schema Validation
I think there is some "magic" happening here.
How does the API know what to validate the spec against?
Right now, the API has hardcoded validations for Clusters and Nodepools:
- It uses the `payload.kind` field to determine the type
- It then looks within the schema for `ClusterSpec` or `NodePoolSpec`
- We may want to generalize this for the future
> **Define Adapter Requirements**
>
> - [ ] List required adapters for each resource type (see [Adapter Registration](#22-adapter-registration))
I would highlight that these are examples; the names are convincing enough to make me think this should be the real configuration.
Also, for validation we could have cluster-validation and nodepool-validation as I doubt there will be a single adapter for both
> | Condition | Meaning | When True |
> |-----------|---------|-----------|
> | **Available** | Resource is operational at any known good configuration | All registered adapters report `Available=True` for the **same** `observed_generation` |
Priority: Bug
The Available condition definition is factually wrong in the table (line 323) and in the pseudocode (lines 349-351). Both say adapters must report Available=True for the same observed_generation, but ComputeAvailableCondition in pkg/services/status_aggregation.go checks for Available=True at any generation; no generation matching is performed.
This is actually the key semantic difference between Available and Ready:
- Available = all adapters report Available=True at any generation (preserves last known good state)
- Ready = all adapters report Available=True at the current generation
Suggested fix for the table:
-| **Available** | Resource is operational at any known good configuration | All registered adapters report `Available=True` for the **same** `observed_generation` |
+| **Available** | Resource is operational at any known good configuration | All registered adapters report `Available=True` (at any generation) |
Suggested fix for pseudocode (lines 349-351):
// True when all registered adapters report Available=True (at any generation)
// False when any registered adapter reports Available=False (at any generation)

> HYPERFLEET_NODEPOOL_ADAPTERS='[]'
>
> This makes all resources immediately `Ready=True`.
Priority: Bug
Line 505 says empty adapter arrays ('[]') make resources "immediately Ready=True", but the code does the opposite. Both ComputeReadyCondition and ComputeAvailableCondition in pkg/services/status_aggregation.go return false when requiredAdapters is empty:

```go
if len(adapterStatuses) == 0 || len(requiredAdapters) == 0 {
    return false
}
```

So empty arrays make resources permanently Ready=False / Available=False. An operator following this guidance would get stuck.
Either the doc should be corrected to say resources stay Not Ready, or (if the intent is to allow "no adapter" mode) the code needs to be updated. Worth clarifying which is the desired behavior.
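A minimal Python mirror of the quoted Go guard makes the operator-facing consequence obvious (`compute_condition` is a hypothetical stand-in for both Go functions, not the actual implementation):

```python
# Mirror of the Go guard: with no required adapters, the condition is
# always False, so '[]' yields permanently Not Ready rather than the
# "immediately Ready=True" the doc currently claims.

def compute_condition(adapter_statuses, required_adapters):
    if len(adapter_statuses) == 0 or len(required_adapters) == 0:
        return False
    return all(adapter_statuses.get(a) == "True" for a in required_adapters)

assert compute_condition({}, []) is False                    # empty config: never Ready
assert compute_condition({"infra": "True"}, ["infra"]) is True
```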
Does running the API without any required adapter make sense?
I will not document this level of detail in the doc, as it is an exceptional situation
> 1. **CRUD operations** for resources (cluster, nodepool) with provider-agnostic specifications
> 2. **Status aggregation** from multiple adapters into unified resource conditions
> 3. **Generation tracking** to coordinate spec changes across distributed adapters
> 4. **Resource lifecycle management** with cascade deletion and referential integrity
Priority: Inconsistency
Line 57 says the API provides "cascade deletion and referential integrity", but the DB migration for node_pools uses ON DELETE RESTRICT (prevents parent deletion when children exist), not ON DELETE CASCADE (auto-delete children). Also, clusterHandler.Delete returns errors.NotImplemented("delete"), so cluster deletion isn't implemented yet.
Suggested fix:
-4. **Resource lifecycle management** with cascade deletion and referential integrity
+4. **Resource lifecycle management** with referential integrity between parent and child resources
We also will implement soft/logical deletion, so I don't think CASCADE will work with that
> - **Exponential backoff** on retries (5xx errors, conflicts)
> - **Polling intervals** of at least 5-10 seconds for status checks
> - **Batch operations** when creating multiple resources
We don't have batch operation capabilities in the API; what should clients do?
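Until batch endpoints exist, one reasonable answer for clients is per-resource requests with exponential backoff plus jitter on 5xx/409 responses. A sketch (`backoff_delays` is a hypothetical helper, not part of any HyperFleet SDK):

```python
# Exponential backoff with full jitter: each retry waits a random
# duration in [0, min(cap, base * 2**attempt)] seconds.
import random

def backoff_delays(base=0.5, cap=30.0, attempts=5):
    """Return one jittered delay per retry attempt."""
    return [random.uniform(0, min(cap, base * (2 ** n))) for n in range(attempts)]

delays = backoff_delays()
assert len(delays) == 5
assert all(0 <= d <= 30.0 for d in delays)
```

Full jitter keeps many clients creating resources in a loop from retrying in lockstep against the API.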