Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NE-1277: status: Add Gateway API objects to relatedObjects #933

Conversation

Miciah
Copy link
Contributor

@Miciah Miciah commented May 18, 2023

  • pkg/operator/controller/names.go (GlobalOperatorsNamespace): New const for the "openshift-operators" namespace.
    (ServiceMeshSubscriptionName): Use GlobalOperatorsNamespace.
  • pkg/operator/controller/status/controller.go (New): Add a watch for subscriptions so that the status controller updates relatedObjects when the OSSM subscription is added or removed.
    (Config): Add GatewayAPIEnabled field.
    (Reconcile): If GatewayAPIEnabled is true, add resources related to Gateway API to relatedObjects. If haveOSSMSubscription is true, add resources that require the subscription.
    (operatorState): Add haveOSSMSubscription field.
    (getOperatorState): Set haveOSSMSubscription.
  • pkg/operator/operator.go (New): Watch the "openshift-operators" namespace, and specify GatewayAPIEnabled in the status controller's config.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 18, 2023
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented May 18, 2023

@Miciah: This pull request references NE-1277 which is a valid jira issue.

In response to this:

  • pkg/operator/controller/names.go (GlobalOperatorsNamespace): New const for the "openshift-operators" namespace.
    (ServiceMeshSubscriptionName): Use GlobalOperatorsNamespace.
  • pkg/operator/controller/status/controller.go (New): Add a watch for subscriptions so that the status controller updates relatedObjects when the OSSM subscription is added or removed.
    (Config): Add GatewayAPIEnabled field.
    (Reconcile): If GatewayAPIEnabled is true, add resources related to Gateway API to relatedObjects. If haveOSSMSubscription is true, add resources that require the subscription.
    (operatorState): Add haveOSSMSubscription field.
    (getOperatorState): Set haveOSSMSubscription.
  • pkg/operator/operator.go (New): Watch the "openshift-operators" namespace, and specify GatewayAPIEnabled in the status controller's config.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 18, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 18, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 31, 2023
@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2023
@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 29, 2023
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Sep 29, 2023

@Miciah: This pull request references NE-1277 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

  • pkg/operator/controller/names.go (GlobalOperatorsNamespace): New const for the "openshift-operators" namespace.
    (ServiceMeshSubscriptionName): Use GlobalOperatorsNamespace.
  • pkg/operator/controller/status/controller.go (New): Add a watch for subscriptions so that the status controller updates relatedObjects when the OSSM subscription is added or removed.
    (Config): Add GatewayAPIEnabled field.
    (Reconcile): If GatewayAPIEnabled is true, add resources related to Gateway API to relatedObjects. If haveOSSMSubscription is true, add resources that require the subscription.
    (operatorState): Add haveOSSMSubscription field.
    (getOperatorState): Set haveOSSMSubscription.
  • pkg/operator/operator.go (New): Watch the "openshift-operators" namespace, and specify GatewayAPIEnabled in the status controller's config.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 29, 2023
@openshift-bot
Copy link
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Oct 29, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 29, 2023

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Miciah
Copy link
Contributor Author

Miciah commented Mar 6, 2025

/reopen

@openshift-ci openshift-ci bot reopened this Mar 6, 2025
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Mar 6, 2025

@Miciah: This pull request references NE-1277 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

  • pkg/operator/controller/names.go (GlobalOperatorsNamespace): New const for the "openshift-operators" namespace.
    (ServiceMeshSubscriptionName): Use GlobalOperatorsNamespace.
  • pkg/operator/controller/status/controller.go (New): Add a watch for subscriptions so that the status controller updates relatedObjects when the OSSM subscription is added or removed.
    (Config): Add GatewayAPIEnabled field.
    (Reconcile): If GatewayAPIEnabled is true, add resources related to Gateway API to relatedObjects. If haveOSSMSubscription is true, add resources that require the subscription.
    (operatorState): Add haveOSSMSubscription field.
    (getOperatorState): Set haveOSSMSubscription.
  • pkg/operator/operator.go (New): Watch the "openshift-operators" namespace, and specify GatewayAPIEnabled in the status controller's config.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

openshift-ci bot commented Mar 6, 2025

@Miciah: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Miciah Miciah force-pushed the NE-1277-status-add-Gateway-API-objects-to-relatedObjects branch from 6246acc to 78c31dd Compare March 6, 2025 21:04
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2025
@Miciah
Copy link
Contributor Author

Miciah commented Mar 6, 2025

https://github.com/openshift/cluster-ingress-operator/compare/6246acc7f400ebe0376589efa793d8033c5a8683..78c31dd241a7d0e8a579a4d1f8b870d569c1c96d rebases onto #1152 in order to get the Gateway API v1 and Sail Operator v1 CRDs. Some other changes were made redundant by #1115 and are thus dropped from this PR.

@Miciah
Copy link
Contributor Author

Miciah commented Mar 6, 2025

/test e2e-aws-operator

@Miciah Miciah force-pushed the NE-1277-status-add-Gateway-API-objects-to-relatedObjects branch from 78c31dd to 459761a Compare March 11, 2025 17:06
@Miciah
Copy link
Contributor Author

Miciah commented Mar 11, 2025

Rebased onto #1152.

@Miciah Miciah added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Mar 12, 2025
@candita
Copy link
Contributor

candita commented Mar 20, 2025

/assign

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2025
Namespace: subscriptionName.Namespace,
Name: subscriptionName.Name,
})
if state.IngressNamespace != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for a gateway to be installed outside the IngressNamespace and still be a related object?

Also, how about HTTPRoutes, GRPCRoutes, and ReferenceGrants?

Is there a way to make sure the pod that runs conformance tests ends up as a related object so we can check the logs in artifacts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for a gateway to be installed outside the IngressNamespace and still be a related object?

I don't think it would be appropriate to include objects that constitute user workload, and for component-owned objects, the precedent is that the component that owns the object specifies it in its relatedObjects field. For example, the authentication operator specifies its route in the authentication clusteroperator's relatedObjects field: https://github.com/openshift/cluster-authentication-operator/blob/11b2201203fd5ba0dec937015fa542f4cd4c8879/pkg/operator/starter.go#L566

Also, how about HTTPRoutes, GRPCRoutes, and ReferenceGrants?

We don't have any. Again, we wouldn't want to include other components' objects or objects related to user workload. This is analogous to how the ingress clusteroperator specifies ingresscontrollers but not routes in relatedObjects. (Come to think of it, maybe we ought to specify the canary route. I can file an issue for that if you agree.)

Is there a way to make sure the pod that runs conformance tests ends up as a related object so we can check the logs in artifacts?

I think that that would be misuse of relatedObjects. Aside from that, the must-gather artifacts can only include objects that exist when must-gather runs, and CI runs must-gather after the tests have finished, so unless the conformance tests leak objects, must-gather wouldn't collect them anyway.

@lihongan
Copy link
Contributor

/test e2e-aws-gatewayapi-conformance

@Miciah Miciah force-pushed the NE-1277-status-add-Gateway-API-objects-to-relatedObjects branch from bef05ce to 86f0c17 Compare March 26, 2025 14:29
@Miciah
Copy link
Contributor Author

Miciah commented Mar 26, 2025

https://github.com/openshift/cluster-ingress-operator/compare/bef05ce144a269f1a170e1e8c22e53a769a5128b..86f0c17f610eee033337b3710a0e728a02319db6 adds "dnsrecords" in the "openshift-ingress" namespace to the expected relatedObjects in TestClusterOperatorStatusRelatedObjects and adds some logging to help diagnose the TestCanaryRoute failure on e2e-aws-operator.

@Miciah Miciah force-pushed the NE-1277-status-add-Gateway-API-objects-to-relatedObjects branch from 86f0c17 to 77beec1 Compare March 26, 2025 20:27
@Miciah
Copy link
Contributor Author

Miciah commented Mar 26, 2025

@candita
Copy link
Contributor

candita commented Mar 27, 2025

/test e2e-aws-operator-techpreview

@@ -36,65 +36,66 @@ import (
func TestCanaryRoute(t *testing.T) {
kubeConfig, err := config.GetConfig()
if err != nil {
t.Fatalf("failed to get kube config: %v", err)
t.Fatalf("Failed to get kube config: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any more test changes are needed, could you please put them in a separate PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean additional changes to TestCanaryRoute or to other tests?

Copy link
Contributor Author

@Miciah Miciah Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can split the other test fixes off into separate PRs and Jira issues if you prefer. I've just been adding them as commits in this PR as I diagnose CI failures that impact this PR.

Miciah added 2 commits March 27, 2025 14:03
This commit resolves NE-1277.

https://issues.redhat.com/browse/NE-1277

* pkg/operator/controller/status/controller.go (New): Add a watch for
subscriptions so that the status controller updates relatedObjects with the
OSSM subscription when it is added or removed.
(Config): Add GatewayAPIEnabled field.
(Reconcile): Add dnsrecords in the operand namespace to relatedObjects.
If GatewayAPIEnabled is true, add resources related to Gateway API.
If haveOSSMSubscription is true, add resources that require the subscription.
(operatorState): Add haveOSSMSubscription field.
(getOperatorState): Set haveOSSMSubscription.
* pkg/operator/operator.go (New): Specify GatewayAPIEnabled in the status
controller's config.
* test/e2e/operator_test.go (TestClusterOperatorStatusRelatedObjects): Expect to
observe "dnsrecords" for the "openshift-ingress" namespace as well as
"gatewayclasses" in relatedObjects.
* test/e2e/canary_test.go (TestCanaryRoute): Add some more logging to make it
easier to diagnose test failures.
Comment on lines +222 to +226
{
Group: iov1.GroupVersion.Group,
Resource: "dnsrecords",
Namespace: "openshift-ingress",
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +240 to +241
// This test runs before TestGatewayAPI, so we do *not* expect
// to see subscriptions, istios, or gateways in relatedObjects.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this runs before TestGatewayAPI, what created the gatewayClass?

Copy link
Contributor Author

@Miciah Miciah Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gatewayapi controller creates the CRD if the featuregate is enabled. We only need the CRD to exist in order to add the gatewayclasses resource to relatedObjects; we don't need any CR to exist. (However, if TestGatewayAPI does run as part of CI, then having "gatewayclasses" in relatedObjects will cause the CI job to gather the gatewayclass that TestGatewayAPI created when the CI job runs the must-gather step, after the test step.)

@@ -562,14 +563,14 @@ func assertGatewaySuccessful(t *testing.T, namespace, name string) (*gatewayapiv

err := wait.PollUntilContextTimeout(context.Background(), 1*time.Second, 1*time.Minute, false, func(context context.Context) (bool, error) {
if err := kclient.Get(context, nsName, gw); err != nil {
t.Logf("failed to get gateway %s, retrying...", name)
t.Logf("Failed to get gateway %v: %v", nsName, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
t.Logf("Failed to get gateway %v: %v", nsName, err)
t.Logf("Failed to get gateway %v: %v, retrying...", nsName, err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -598,15 +601,15 @@ func assertHttpRouteSuccessful(t *testing.T, namespace, name string, gateway *ga
// Wait 1 minute for parent/s to update
err := wait.PollUntilContextTimeout(context.Background(), 1*time.Second, 1*time.Minute, false, func(context context.Context) (bool, error) {
if err := kclient.Get(context, nsName, httproute); err != nil {
t.Logf("failed to get httproute %s/%s, retrying...", namespace, name)
t.Logf("Failed to get httproute %v: %v", nsName, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more cases here that had the "retrying" before. Also, I expect that the error message is not going to change much on each iteration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added back "retrying" in https://github.com/openshift/cluster-ingress-operator/compare/77beec1a8b6af8c8c5ce00f29f7a60683a84cd75..92e89a2a02fccb3d149e3e0491a30f19fe160774.

I don't think it's worth the effort to save and compare the last err.String() to make logging slightly more compact. If it is worth that effort, then we should probably write a getWithRetry helper and update all tests to use it.

if err := assertDNSRecord(t, dnsRecordName); err != nil {
return err
}

// Wait and check that the dns name resolves first. Takes a long time, so
// if the hostname is actually an IP address, skip this.
t.Logf("Attempting to resolve %s...", hostname)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be precise, it won't attempt to resolve the hostname if it is actually an IP address. This log message is in accurate in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -843,20 +846,21 @@ func assertDNSRecord(t *testing.T, recordName types.NamespacedName) error {

err := wait.PollUntilContextTimeout(context.Background(), 1*time.Second, 1*time.Minute, false, func(context context.Context) (bool, error) {
if err := kclient.Get(context, recordName, dnsRecord); err != nil {
t.Logf("failed to get DNSRecord %s/%s: %v, retrying...", recordName.Namespace, recordName.Name, err)
t.Logf("Failed to get DNSRecord %v: %v", recordName, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again a case where "retrying" in the log message is helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add more logging to functions that testGatewayAPIObjects uses in order to make
it easier to diagnose test failures.

* test/e2e/gateway_api_test.go (ensureGatewayObjectSuccess): Remove t.Helper()
and add more logging.  This function has core test logic that can mark the test
as failed, and marking this function as a helper obfuscates the point of
failure.
* test/e2e/util_gatewayapi_test.go (assertGatewayClassSuccessful): Add more
logging.
(assertGatewaySuccessful): Add more logging.
(assertHttpRouteSuccessful):  Add more logging.
(assertHttpRouteConnection): Remove t.Helper() and add more logging.
(assertDNSRecord): Add more logging.
@Miciah Miciah force-pushed the NE-1277-status-add-Gateway-API-objects-to-relatedObjects branch from 77beec1 to 92e89a2 Compare March 27, 2025 23:03
@Miciah
Copy link
Contributor Author

Miciah commented Mar 27, 2025

@Miciah
Copy link
Contributor Author

Miciah commented Mar 28, 2025

Some earlier runs of e2e-aws-operator and e2e-aws-operator-techpreview failed on TestIngressControllerCustomEndpoints. This test is failing on multiple PRs, and these failures are being tracked with OCPBUGS-53432.

@Miciah
Copy link
Contributor Author

Miciah commented Mar 28, 2025

Many CI errors earlier today were caused by a Marketplace outage.

@candita
Copy link
Contributor

candita commented Mar 28, 2025

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 28, 2025
Copy link
Contributor

openshift-ci bot commented Mar 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 28, 2025
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 7f0fd6d and 2 for PR HEAD 92e89a2 in total

1 similar comment
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 7f0fd6d and 2 for PR HEAD 92e89a2 in total

Copy link
Contributor

openshift-ci bot commented Mar 29, 2025

@Miciah: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 92e89a2 link false /test e2e-aws-ovn-single-node
ci/prow/okd-scos-e2e-aws-ovn 92e89a2 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@Miciah
Copy link
Contributor Author

Miciah commented Mar 30, 2025

hypershift-e2e-aks failed because EnsureNoCrashingPods failed:

{Failed  === RUN   TestNodePool/HostedCluster2/ValidateHostedCluster/EnsureNoCrashingPods
    util.go:155: Successfully waited for kubeconfig to be published for HostedCluster e2e-clusters-nnpcz/node-pool-b94t4 in 25ms
    util.go:172: Successfully waited for kubeconfig secret to have data in 25ms
    util.go:606: Container snapshot-controller in pod csi-snapshot-controller-69dcc9f5c4-wchld has a restartCount > 0 (1)
            --- FAIL: TestNodePool/HostedCluster2/ValidateHostedCluster/EnsureNoCrashingPods (0.33s)
}

/test hypershift-e2e-aks

@openshift-merge-bot openshift-merge-bot bot merged commit 81e314f into openshift:master Mar 30, 2025
17 of 19 checks passed
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-ingress-operator
This PR has been included in build ose-cluster-ingress-operator-container-v4.19.0-202503311540.p0.g81e314f.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants