
HYPERFLEET-634 - doc: hyperfleet-api operator guide#76

Open
86254860 wants to merge 1 commit intoopenshift-hyperfleet:mainfrom
86254860:HYPERFLEET-634

Conversation


@86254860 86254860 commented Mar 10, 2026

Summary by CodeRabbit

  • Documentation
    • Added comprehensive HyperFleet API Operator Guide with deployment procedures, configuration reference, and troubleshooting guidance.

@openshift-ci openshift-ci bot requested review from rh-amarin and yasun1 March 10, 2026 10:46

openshift-ci bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yasun1 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Mar 10, 2026

Walkthrough

A new comprehensive API Operator Guide documentation file is added to the repository. The guide covers the HyperFleet API component architecture, including resource models, lifecycle management, status aggregation logic, adapter registration and readiness computation, configuration precedence and settings, deployment guidance, and operational troubleshooting. The document provides technical reference material and practical operational guidance without introducing code changes.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Suggested labels

lgtm

Suggested reviewers

  • ciaranRoche
  • vkareh
🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately describes the main change: adding a comprehensive HyperFleet API operator guide documentation. It is specific, concise, and clearly identifies the documentation addition as the primary change. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
docs/api-operator-guide.md (1)

73-87: Add explicit languages to the plain fenced blocks.

These fences are currently unlabeled and will keep tripping MD040. Tagging them as text would make the markdownlint warnings go away without changing the rendered content.

Also applies to: 229-235, 396-402, 524-531, 901-903, 907-909

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/api-operator-guide.md` around lines 73 - 87, The fenced code blocks like
the one showing the Resource schema (the block beginning with "Resource (e.g.,
Cluster)" and similar plain blocks elsewhere) are unlabeled and trigger MD040;
update each plain triple-backtick fence to include the "text" language tag
(e.g., change ``` to ```text) so markdownlint stops flagging them, ensuring you
edit the code fences around the Resource/Cluster schema and the other unlabeled
blocks referenced in the review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8c273be3-42dd-4378-a513-f0f629725a1c

📥 Commits

Reviewing files that changed from the base of the PR and between d48668e and b8bab2b.

📒 Files selected for processing (1)
  • docs/api-operator-guide.md

Comment on lines +75 to +76
├── id (UUID, auto-generated)
├── kind (Resource "Cluster" or "NodePool")

⚠️ Potential issue | 🟡 Minor

Align the documented id format with the example.

The guide says id is a UUID, but the sample response uses a non-UUID opaque identifier. Please make these consistent; otherwise client implementations may bake in the wrong validation/parsing rule.

Suggested doc fix:

```diff
-├── id                    (UUID, auto-generated)
+├── id                    (opaque auto-generated identifier)
```

or replace the sample value with an actual UUID.

Also applies to: 129-129

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/api-operator-guide.md` around lines 75 - 76, The docs currently state
the "id" field is a UUID but the example uses an opaque non-UUID value; update
the documentation so they match by either (a) changing the field description for
"id" from "(UUID, auto-generated)" to reflect the opaque identifier format used
in the sample, or (b) replace the sample response value with a valid UUID to
match the current "(UUID, auto-generated)" description—ensure you update both
occurrences referenced (the listing that includes "id (UUID, auto-generated)"
and the sample response example) so client validation expectations are
consistent.

Comment on lines +100 to +105
1. **Desired State (spec)**: When you create or update a resource, you provide a `spec` containing the desired configuration (e.g., cluster region, version, node count). The API stores this as-is without validation or interpretation.

2. **Automatic Version Tracking (generation)**: Every time you update the `spec`, the API automatically increments the `generation` counter. This allows distributed adapters to detect when they need to reconcile infrastructure changes.

3. **Observed State (status)**: Adapters report their progress and results back to the API via status endpoints. The API aggregates these reports into unified resource-level conditions (e.g., `Ready`, `Available`).


⚠️ Potential issue | 🟠 Major

Clarify whether spec is validated or stored blindly.

Section 2.1 says the API stores spec without validation, but Section 3.7 says spec is validated against the OpenAPI schema. Those behaviors are materially different for operators and API consumers. Please reconcile this and spell out the real rule, e.g. “stored without business-logic interpretation, but schema-validated when a schema is configured.”

Also applies to: 625-629

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/api-operator-guide.md` around lines 100 - 105, The doc is contradictory
about validation of spec; update the text to state clearly that the API stores
the resource spec as-is with no business-logic interpretation (so adapters
perform reconciliation based on generation and status), but that the API will
perform schema validation against the configured OpenAPI schema if one exists.
Edit the paragraphs that mention "stored without validation or interpretation"
and the OpenAPI validation section (references: spec, generation, status,
OpenAPI schema) so both say: spec is persisted unchanged for controller/adapters
to act on, while the API will reject or validate requests against an OpenAPI
schema when a schema is configured; keep generation increment semantics and
status reporting unchanged.

Comment on lines +522 to +531
The secret must contain these fields:

```
secrets/db.host → PostgreSQL hostname
secrets/db.port → PostgreSQL port
secrets/db.name → Database name
secrets/db.user → Database username
secrets/db.password → Database password
secrets/db.rootcert → (Optional) SSL root certificate
```

⚠️ Potential issue | 🟠 Major

Fix the secret key names; the current guide contradicts itself.

The “must contain these fields” list uses secrets/db.*, but the creation command and expected output use db.*. The field list should match the actual secret keys, otherwise operators will create the wrong secret shape.

Suggested doc fix:

```diff
-secrets/db.host         → PostgreSQL hostname
-secrets/db.port         → PostgreSQL port
-secrets/db.name         → Database name
-secrets/db.user         → Database username
-secrets/db.password     → Database password
-secrets/db.rootcert     → (Optional) SSL root certificate
+db.host                → PostgreSQL hostname
+db.port                → PostgreSQL port
+db.name                → Database name
+db.user                → Database username
+db.password            → Database password
+db.rootcert            → (Optional) SSL root certificate
```

Also applies to: 661-667, 674-674

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 524-524: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/api-operator-guide.md` around lines 522 - 531, The listed secret field
names use the prefix secrets/db.* which contradicts the rest of the doc that
expects db.*; update the field list to use db.host, db.port, db.name, db.user,
db.password, and db.rootcert so it matches the secret creation command and
expected output, and make the same correction for the other occurrences that
currently show secrets/db.* (the sections showing db.* outputs and examples
should remain unchanged).
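For clarity, a creation command matching the corrected flat db.* key names might look like the following sketch (the secret name, namespace, and values are placeholders, not taken from the guide):

```shell
# Hypothetical example: create the DB secret with flat db.* keys.
kubectl create secret generic hyperfleet-db-credentials \
  -n hyperfleet-system \
  --from-literal=db.host=postgres.example.com \
  --from-literal=db.port=5432 \
  --from-literal=db.name=hyperfleet \
  --from-literal=db.user=hyperfleet \
  --from-literal=db.password='changeme' \
  --from-file=db.rootcert=./root.crt
```

The db.rootcert entry is optional and can be dropped when SSL verification is not used.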

Comment on lines +719 to +723
- [ ] Monitor pod startup:
```bash
kubectl get pods -n hyperfleet-system -l app=hyperfleet-api -w
```
Wait until all pods show `1/1` READY and `Running` status (press Ctrl+C to stop watching)

⚠️ Potential issue | 🟡 Minor

Avoid hardcoding 1/1 READY in the rollout check.

This will be wrong as soon as the pod includes an additional container, so the guidance should be phrased as “all containers Ready” or use kubectl wait instead of a fixed 1/1 expectation. Based on learnings, the pgbouncer sidecar can be rendered into the deployment when the in-chart PostgreSQL path is used.
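For reference, the `kubectl wait` variant might look like this sketch (namespace and label taken from the guide's existing command; the timeout value is an arbitrary choice):

```shell
# Wait until every matching pod reports the Ready condition, regardless
# of how many containers it runs (e.g. with a pgbouncer sidecar).
kubectl wait pod -n hyperfleet-system -l app=hyperfleet-api \
  --for=condition=Ready --timeout=300s
```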

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/api-operator-guide.md` around lines 719 - 723, Replace the hardcoded
"1/1 READY" rollout check text in the kubectl command guidance (the line
referencing `kubectl get pods -n hyperfleet-system -l app=hyperfleet-api -w`)
with wording that checks that all containers are Ready (e.g., "all containers
Ready") and add an alternative recommended command using `kubectl wait` (mention
`kubectl wait` by name) to wait for pod readiness instead of relying on a fixed
READY fraction so sidecars like pgbouncer don't break the guidance.

**How registration works:** You define which adapters are required for each resource type to be marked as `Ready`. Adapters are registered at API startup via environment variables or Helm configuration, and the API will not start if this configuration is missing.

Only **registered adapters** participate in status aggregation:
- The `Ready` condition checks if all registered adapters report `Available=True`
Contributor

At the current `resource.spec.generation`.

This is not the case for adapters reporting `Available=True` for older generations.

| Type | Meaning |
|------|---------|
| **Available** | Adapter's work is complete and operational |
| **Applied** | Kubernetes resources were created/configured |
Contributor

Could we say "Kubernetes/Maestro resources" and "created/applied"? What does it mean for a Kubernetes resource to be "configured"?

- `status` must be `True`, `False`, or `Unknown`
- `observed_generation` must be a valid integer

#### 2.3.3 Aggregation logic
Contributor

I think one subtle missing part here is that aggregated conditions also come with an `observed_generation`:

- `Available` can be `True` for an older generation
- `Ready` always refers to the newest generation

I think it can be VERY confusing for users to see `Available=True` when `resource.spec.generation != resource.status.condition[Available].observed_generation`, so it would be worth calling this out explicitly.

Contributor

Also worth commenting on the meaning of that state: it keeps the "last known good generation" while a new generation is reconciling.

Without this reasoning it reads like an arbitrary rule (at least to me).
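To make the state in question concrete, here is an illustrative sketch (field names assumed to mirror the guide's condition model, not the API's actual payload):

```python
# Illustrative only: Available was last confirmed at generation 3 while the
# spec has moved on to generation 4, so Ready (always pinned to the newest
# generation) is False even though Available is still True.
resource = {
    "spec": {"generation": 4},
    "status": {"conditions": [
        {"type": "Available", "status": "True", "observed_generation": 3},
        {"type": "Ready", "status": "False", "observed_generation": 4},
    ]},
}

available = next(c for c in resource["status"]["conditions"]
                 if c["type"] == "Available")

# The confusing-but-intentional state: the last known good generation is
# kept while the new generation reconciles.
stale_but_available = (available["status"] == "True"
                       and available["observed_generation"]
                       != resource["spec"]["generation"])
print(stale_but_available)  # True
```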


**When `Available=Unknown` is reported for the first time** (no prior status from this adapter):
- ✅ Accepts and stores the status
- ⏭️ Skips status aggregation (resource-level conditions unchanged)
Contributor

I'm not sure whether we skip the aggregation or the aggregation just does a no-op.

I think this is an implementation detail not worth mentioning here.

The only relevant point is that fetching the resource's statuses will return some conditions with `status=Unknown`, which convey "in progress".

- Returns `204 No Content` (no action taken)

**When `Available=True` or `Available=False` is reported** (any time):
- ✅ Always stored and triggers status aggregation
Contributor

"Always" is a bit strong: if an adapter reports for an older `observed_generation`, the report may be discarded when a newer condition is already stored.


This pattern allows adapters to signal initial progress without affecting resource readiness, while ensuring eventual convergence to a definitive state (`True` or `False`).

### 2.4 Generation
Contributor

Is it worth mentioning that there is no optimistic locking in place?

When reading about generation, my first intuition was: if I want to change a spec, I have to provide the current generation in the POST because it will be checked.

But we are not checking anything, so this is worth stating to avoid misunderstanding.

| **401 Unauthorized** | Missing or invalid JWT token | Verify authentication is enabled (`auth.enableJwt=true`). If production, ensure valid JWT token is provided. Reference: [Authentication Guide](authentication.md). |
| **404 Not Found** | Resource doesn't exist | Verify resource ID is correct. Check if resource was deleted: `curl http://<api-service>:8000/api/hyperfleet/v1/clusters/$CLUSTER_ID`. |
| **409 Conflict** | Concurrent update or generation mismatch | Retry with exponential backoff. Ensure only one controller updates the same resource. |
| **500 Internal Server Error** | Database error or unexpected panic | Check API logs: `kubectl logs -n hyperfleet-system -l app=hyperfleet-api --tail=100`. Verify database connectivity with `/readyz` endpoint. |
Contributor

The troubleshooting row references --db-timeout=60s but this flag does not exist in the codebase. There is no db-timeout flag registered in pkg/config/db.go or anywhere else. Did you mean a different flag, or is this a planned feature?


### Phase 4: Post-Deployment Validation

**Verify Service Health**
Contributor

This kubectl exec command assumes the API container has psql installed and that DB_HOST, DB_USER, DB_NAME are available as environment variables. In the actual deployment, database credentials are mounted as files at /app/secrets/db.host, /app/secrets/db.user, etc. — not env vars. The container image is also a minimal Go binary without psql.

Consider replacing this with the pg-debug pod pattern already shown above (line 684), or use:

```shell
kubectl run pg-debug --rm -it --image=postgres:15-alpine \
  --restart=Never -n hyperfleet-system -- \
  psql -h <db-host> -U hyperfleet -d hyperfleet -c "\dt"
```

Contributor

Mind you, when we change the config to the standard form (@yasun1 is on it) there will be no more files for configuring the credentials (🎉)


**Helm deployment:**

```yaml
Contributor

The default bind addresses in the code are localhost:8000, localhost:8080, and localhost:9090 — these bind to loopback only. The doc shows :8000, :8080, :9090 (all interfaces) which is what the Helm chart overrides to, but isn't the built-in default.

This matters for the "Direct binary execution" examples — running ./hyperfleet-api serve without flags will bind to localhost, not all interfaces. The examples here show --api-server-bindaddress=:8000 which works, but the parenthetical "(default: :8000)" is misleading since the actual default is localhost:8000.

- **Authentication**: See [Authentication Guide](authentication.md)
- **Helm chart values**: See [Deployment Guide](deployment.md)

### 3.1 Configuration Overview
Contributor

I'm worried that we will duplicate the configuration here and then again in a configuration.md doc that will contain all the settings.

Could we leave this section out in favor of that configuration.md file (which is pending)?


Production should use `LOG_LEVEL=info` and `LOG_FORMAT=json` for structured logging.

### 3.7 Schema Validation
Contributor

I think there is some "magic" happening here.

How does the API know what to validate the spec against?

Right now, the API has hardcoded validations for Clusters and Nodepools

  • It uses the payload.kind field to determine the type
  • It then looks within the schema for ClusterSpec or NodePoolSpec
  • We may want to generalize this for the future
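The dispatch described above could be sketched like this (hypothetical names; the real validation lives in the API's Go code and is hardcoded, not table-driven):

```python
# Hypothetical kind -> schema-component dispatch mirroring the comment:
# the payload's "kind" field selects ClusterSpec or NodePoolSpec.
SPEC_SCHEMAS = {"Cluster": "ClusterSpec", "NodePool": "NodePoolSpec"}

def schema_for(payload):
    kind = payload.get("kind")
    try:
        return SPEC_SCHEMAS[kind]
    except KeyError:
        raise ValueError(f"no spec schema registered for kind {kind!r}")

print(schema_for({"kind": "Cluster"}))  # ClusterSpec
```

Generalizing would mean making this mapping configurable instead of hardcoded per resource type.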


**Define Adapter Requirements**

- [ ] List required adapters for each resource type (see [Adapter Registration](#22-adapter-registration))
Contributor

I would highlight that these are examples; the names are convincing enough to make me think this is the real configuration.

Contributor

Also, for validation we could have cluster-validation and nodepool-validation, as I doubt there will be a single adapter for both.


| Condition | Meaning | When True |
|-----------|---------|-----------|
| **Available** | Resource is operational at any known good configuration | All registered adapters report `Available=True` for the **same** `observed_generation` |
Contributor

Priority: Bug

The Available condition definition is factually wrong in the table (line 323)
and in the pseudocode (lines 349-351). Both say adapters must report
Available=True for the same observed_generation, but ComputeAvailableCondition
in pkg/services/status_aggregation.go checks for Available=True at any
generation — no generation matching is performed.

This is actually the key semantic difference between Available and Ready:

  • Available = all adapters report Available=True at any generation (preserves
    last known good state)
  • Ready = all adapters report Available=True at the current generation
Suggested change

```diff
-| **Available** | Resource is operational at any known good configuration | All registered adapters report `Available=True` for the **same** `observed_generation` |
+| **Available** | Resource is operational at any known good configuration | All registered adapters report `Available=True` (at any generation) |
```

Suggested fix for pseudocode (lines 349-351):

```go
// True when all registered adapters report Available=True (at any generation)
// False when any registered adapter reports Available=False (at any generation)
```
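A runnable sketch of these semantics, as the review describes them (illustrative only — this mirrors the described behavior of ComputeAvailableCondition/ComputeReadyCondition, not the actual Go source; adapter names are made up):

```python
def compute_available(required, statuses):
    """Available: every required adapter reports Available=True at ANY
    observed_generation (preserves the last known good state)."""
    if not statuses or not required:
        return False  # empty adapter config never yields True
    return all(
        any(s["adapter"] == a and s["status"] == "True" for s in statuses)
        for a in required
    )

def compute_ready(required, statuses, current_generation):
    """Ready: every required adapter reports Available=True at the
    CURRENT spec generation."""
    if not statuses or not required:
        return False
    return all(
        any(s["adapter"] == a and s["status"] == "True"
            and s["observed_generation"] == current_generation
            for s in statuses)
        for a in required
    )

statuses = [
    {"adapter": "cluster-provisioner", "status": "True", "observed_generation": 3},
]
required = ["cluster-provisioner"]
print(compute_available(required, statuses))  # True  (any generation counts)
print(compute_ready(required, statuses, 4))   # False (generation 4 not yet reported)
print(compute_ready([], [], 4))               # False (empty adapter list)
```

Note the empty-list guard: with no registered adapters, both conditions stay False rather than trivially True.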

HYPERFLEET_NODEPOOL_ADAPTERS='[]'
```

This makes all resources immediately `Ready=True`.
Contributor

Priority: Bug

Line 505 says empty adapter arrays ('[]') make resources "immediately
Ready=True", but the code does the opposite. Both ComputeReadyCondition and
ComputeAvailableCondition in pkg/services/status_aggregation.go return false
when requiredAdapters is empty:

```go
if len(adapterStatuses) == 0 || len(requiredAdapters) == 0 {
    return false
}
```

So empty arrays make resources permanently Ready=False / Available=False. An
operator following this guidance would get stuck.

Either the doc should be corrected to say resources stay Not Ready, or (if the
intent is to allow "no adapter" mode) the code needs to be updated. Worth
clarifying which is the desired behavior.

Contributor

Does running the API without any required adapters make sense?

I would not document this level of detail in the doc, as it is an exceptional situation.

1. **CRUD operations** for resources (cluster, nodepool) with provider-agnostic specifications
2. **Status aggregation** from multiple adapters into unified resource conditions
3. **Generation tracking** to coordinate spec changes across distributed adapters
4. **Resource lifecycle management** with cascade deletion and referential integrity
Contributor

Priority: Inconsistency

Line 57 says the API provides "cascade deletion and referential integrity",
but the DB migration for node_pools uses ON DELETE RESTRICT (prevents parent
deletion when children exist), not ON DELETE CASCADE (auto-delete children).
Also, clusterHandler.Delete returns errors.NotImplemented("delete"), so
cluster deletion isn't implemented yet.

Suggested fix:

```diff
-4. **Resource lifecycle management** with cascade deletion and referential integrity
+4. **Resource lifecycle management** with referential integrity between parent and child resources
```

Contributor

We will also implement soft/logical deletion, so I don't think CASCADE will work with that.
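The RESTRICT-versus-CASCADE distinction discussed here can be demonstrated with a small SQLite stand-in (table and column names are assumptions; the real migration targets PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("CREATE TABLE clusters (id TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE node_pools (
        id TEXT PRIMARY KEY,
        cluster_id TEXT NOT NULL
            REFERENCES clusters(id) ON DELETE RESTRICT
    )""")
conn.execute("INSERT INTO clusters VALUES ('c1')")
conn.execute("INSERT INTO node_pools VALUES ('np1', 'c1')")

# RESTRICT blocks deleting a cluster that still has node pools;
# CASCADE would silently delete the children instead.
try:
    conn.execute("DELETE FROM clusters WHERE id = 'c1'")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True
print(blocked)  # True
```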


- **Exponential backoff** on retries (5xx errors, conflicts)
- **Polling intervals** of at least 5-10 seconds for status checks
- **Batch operations** when creating multiple resources
Contributor

We don't have batch operation capabilities in the API; what should clients do?
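The retry guidance above can be sketched as follows (illustrative; RetryableError stands in for 5xx responses and 409 conflicts, and the delay values are examples, not anything the API mandates):

```python
import random
import time

class RetryableError(Exception):
    """Stand-in for a 5xx response or a 409 conflict."""

def with_backoff(call, max_attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Retry `call` with full-jitter exponential backoff: 0.5s, 1s, 2s, ...
    capped at `cap`, re-raising after `max_attempts` failures."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise
            sleep(min(cap, base * 2 ** attempt) * random.random())

# Usage: fail twice, then succeed (sleep stubbed out for the demo).
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RetryableError("409 Conflict")
    return "ok"

print(with_backoff(flaky, sleep=lambda s: None))  # ok
```

In the absence of batch endpoints, clients would create resources one by one, applying this backoff per request.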
