CORS-3923, CORS-3927: Support confidential cluster installation on SEV-SNP and TDX nodes on GCP #9395

Merged
merged 4 commits into from
Apr 1, 2025

Conversation

bgartzi
Contributor

@bgartzi bgartzi commented Jan 23, 2025

This patch series aims to support installing a cluster on AMD SEV-SNP or Intel TDX confidential nodes on GCP.
Previously, only AMD SEV nodes were supported, which were configured through the confidentialCompute: Enabled configuration flag.
Now that GCP supports specifying the confidential instance type, we are letting users specify which node type (SEV/SEV-SNP/TDX) they would like to deploy the cluster on.
This can be done by using any of the newly available values in confidentialCompute: AMDEncryptedVirtualization (similar to Enabled), AMDEncryptedVirtualizationNestedPaging (AMD SEV-SNP) or IntelTrustedDomainExtensions (Intel TDX).

This series depends on the following patches:

These have now been merged and are vendored in this patch.

A new cluster-api-provider-gcp release containing the changes I submitted isn't available yet. I created the data/data/cluster-api/gcp-infrastructure-components.yaml myself. I'm not sure if that's a valid approach, or if we should rather wait until a new version is out.
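
As a rough illustration of the value set described above, here is a minimal Go sketch that maps each confidentialCompute value to the technology it selects. The type name, constant names, and the `technologyFor` helper are hypothetical and are not taken from openshift/api or the installer; only the string values and their meanings come from this PR's description and the explain output.

```go
package main

import "fmt"

// ConfidentialComputePolicy is a hypothetical stand-in for the real API type.
type ConfidentialComputePolicy string

// The string values below are the ones listed in this PR; the constant
// names are made up for this sketch.
const (
	PolicyDisabled ConfidentialComputePolicy = "Disabled"
	PolicyEnabled  ConfidentialComputePolicy = "Enabled"
	PolicySEV      ConfidentialComputePolicy = "AMDEncryptedVirtualization"
	PolicySEVSNP   ConfidentialComputePolicy = "AMDEncryptedVirtualizationNestedPaging"
	PolicyTDX      ConfidentialComputePolicy = "IntelTrustedDomainExtensions"
)

// technologyFor returns a short label for the selected technology and
// reports whether confidential computing is turned on at all.
func technologyFor(p ConfidentialComputePolicy) (string, bool, error) {
	switch p {
	case "", PolicyDisabled:
		// An omitted value currently defaults to Disabled.
		return "", false, nil
	case PolicyEnabled, PolicySEV:
		// Enabled currently resolves to AMD SEV; that default may change.
		return "AMD SEV", true, nil
	case PolicySEVSNP:
		return "AMD SEV-SNP", true, nil
	case PolicyTDX:
		return "Intel TDX", true, nil
	default:
		return "", false, fmt.Errorf("unsupported confidentialCompute value %q", p)
	}
}

func main() {
	for _, p := range []ConfidentialComputePolicy{
		"", PolicyDisabled, PolicyEnabled, PolicySEV, PolicySEVSNP, PolicyTDX,
	} {
		tech, on, err := technologyFor(p)
		fmt.Printf("value=%q enabled=%v technology=%q err=%v\n", p, on, tech, err)
	}
}
```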

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 23, 2025
Contributor

openshift-ci bot commented Jan 23, 2025

Hi @bgartzi. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@bgartzi bgartzi force-pushed the gcp-sev_snp branch 3 times, most recently from abf4942 to ca991db Compare January 28, 2025 16:30
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 30, 2025
@bgartzi bgartzi force-pushed the gcp-sev_snp branch 2 times, most recently from 82ded89 to a143152 Compare January 30, 2025 11:36
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 30, 2025
@barbacbd
Contributor

barbacbd commented Feb 7, 2025

/uncc

@openshift-ci openshift-ci bot removed the request for review from barbacbd February 7, 2025 19:30
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2025
@bgartzi bgartzi changed the title DRAFT: Support installation on SEV SNP confidential nodes CORS-3923, CORS-3927: Support confidential cluster installation on SEV-SNP and TDX nodes on GCP Mar 6, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 6, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 6, 2025

@bgartzi: This pull request references CORS-3923 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.19.0" version, but no target version was set.

This pull request references CORS-3927 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.19.0" version, but no target version was set.

In response to this:

This patch series aims to support installing a cluster on AMD SEV SNP confidential nodes.
Previously, only AMD SEV nodes were supported, which were configured through the confidentialCompute configuration flag.
Now that GCP supports specifying the confidential instance type, we are letting users specify which node type (SEV/SEV-SNP) they would like to deploy the cluster on.

This series depends on the following patches:
kubernetes-sigs/cluster-api-provider-gcp#1410
openshift/api#2165
openshift/machine-api-operator#1324
openshift/machine-api-provider-gcp#107

I will update the go.mod replace sections pointing to personal forks once they are accepted.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bgartzi bgartzi force-pushed the gcp-sev_snp branch 2 times, most recently from a8c7227 to 87b325e Compare March 17, 2025 09:48
@jianli-wei
Contributor

@bgartzi Here are some suggestions on the explain command output.

(1) The current output:

$ ./openshift-install explain installconfig.compute.platform.gcp.confidentialCompute
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <string>
  Valid Values: "","Enabled","Disabled","AMDEncryptedVirtualization","AMDEncryptedVirtualizationNestedPaging","IntelTrustedDomainExtensions"
  confidentialCompute is an optional field defining whether the instance should have confidential compute enabled or not, and the confidential computing technology of choice.
Allowed values are omitted, Disabled, Enabled, AMDEncryptedVirtualization, AMDEncryptedVirtualizationNestedPaging, and IntelTrustedDomainExtensions
When set to Disabled, the machine will not be configured to be a confidential computing instance.
When set to Enabled, the machine will be configured as a confidential computing instance with no preference on the confidential compute policy used. In this mode, the platform chooses a default that is subject to change over time. Currently, the default is to use AMD Secure Encrypted Virtualization.
When set to AMDEncryptedVirtualization, the machine will be configured as a confidential computing instance with AMD Secure Encrypted Virtualization (AMD SEV) as the confidential computing technology.
When set to AMDEncryptedVirtualizationNestedPaging, the machine will be configured as a confidential computing instance with AMD Secure Encrypted Virtualization Secure Nested Paging (AMD SEV-SNP) as the confidential computing technology.
When set to IntelTrustedDomainExtensions, the machine will be configured as a confidential computing instance with Intel Trusted Domain Extensions (Intel TDX) as the confidential computing technology.
If any value other than Disabled is set the selected machine type must support that specific confidential computing technology. The machine series supporting confidential computing technologies can be checked at https://cloud.google.com/confidential-computing/confidential-vm/docs/supported-configurations#all-confidential-vm-instances
Currently, AMDEncryptedVirtualization is supported in c2d, n2d, and c3d machines.
AMDEncryptedVirtualizationNestedPaging is supported in n2d machines.
IntelTrustedDomainExtensions is supported in c3 machines.
If any value other than Disabled is set, the selected region must support that specific confidential computing technology. The list of regions supporting confidential computing technologies can be checked at https://cloud.google.com/confidential-computing/confidential-vm/docs/supported-configurations#supported-zones
If any value other than Disabled is set onHostMaintenance is required to be set to "Terminate".
If omitted, the platform chooses a default, which is subject to change over time, currently that default is Disabled.

$ 

(2) A sample output for comparison:

$ openshift-install explain installconfig.platform.gcp.userTags
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <[]object>
  userTags has additional keys and values that the installer will add as
tags to all resources that it creates on GCP. Resources created by the
cluster itself may not include these tags. Tag key and tag value should
be the shortnames of the tag key and tag value resource.

FIELDS:
    key <string> -required-
      key is the key part of the tag. A tag key can have a maximum of 63 characters and
cannot be empty. Tag key must begin and end with an alphanumeric character, and
must contain only uppercase, lowercase alphanumeric characters, and the following
special characters `._-`.

    parentID <string> -required-
      parentID is the ID of the hierarchical resource where the tags are defined,
e.g. at the Organization or the Project level. To find the Organization ID or Project ID refer to the following pages:
https://cloud.google.com/resource-manager/docs/creating-managing-organization#retrieving_your_organization_id,
https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects.
An OrganizationID must consist of decimal numbers, and cannot have leading zeroes.
A ProjectID must be 6 to 30 characters in length, can only contain lowercase letters,
numbers, and hyphens, and must start with a letter, and cannot end with a hyphen.

    value <string> -required-
      value is the value part of the tag. A tag value can have a maximum of 63 characters
and cannot be empty. Tag value must begin and end with an alphanumeric character, and
must contain only uppercase, lowercase alphanumeric characters, and the following
special characters `_-.@%=+:,*#&(){}[]` and spaces.

$ 

(3) Suggestions:

  • Limit each line to 85~90 characters, as in the sample output.
  • Remove the statement "Allowed values are omitted, Disabled, Enabled, AMDEncryptedVirtualization, AMDEncryptedVirtualizationNestedPaging, and IntelTrustedDomainExtensions", since the "Valid Values: " line already lists them.
  • Remove the statement "If any value other than Disabled is set, the selected region must support that specific confidential computing technology. The list of regions supporting confidential computing technologies can be checked at https://cloud.google.com/confidential-computing/confidential-vm/docs/supported-configurations#supported-zones".
  • Remove the statement "If omitted, the platform chooses a default, which is subject to change over time, currently that default is Disabled."
  • Replace phrases like "When set to Disabled" with " With Disabled" (please note the 2 leading spaces).
  • Replace phrases like "the machine will (not )?be configured as a confidential computing instance" with "confidential computing is enabled|disabled".
  • Replace the statement "When set to Enabled, the machine will be configured as a confidential computing instance with no preference on the confidential compute policy used. In this mode, the platform chooses a default that is subject to change over time. Currently, the default is to use AMD Secure Encrypted Virtualization." with " With Enabled, confidential computing is enabled, but without preferred Confidential Computing technology. The platform chooses a default, i.e. AMD SEV-SNP, which is subject to change over time. "
  • Replace the statement "If any value other than Disabled is set the selected machine type must support that specific confidential computing technology." with "If any value other than Disabled is set, the machine type, which supports the selected Confidential Computing technology, must be specified. "
  • In the statement "If any value other than Disabled is set the selected machine type must support that specific confidential computing technology. The machine series supporting confidential computing technologies can be checked at https://cloud.google.com/confidential-computing/confidential-vm/docs/supported-configurations#all-confidential-vm-instances", please replace the link with "https://cloud.google.com/confidential-computing/confidential-vm/docs/supported-configurations#machine-type-cpu-zone".
  • With the above link in place, consider removing the lines "Currently, AMDEncryptedVirtualization is supported in c2d, n2d, and c3d machines. AMDEncryptedVirtualizationNestedPaging is supported in n2d machines. IntelTrustedDomainExtensions is supported in c3 machines."
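
The line-length suggestion above could be checked mechanically. The following is a small Go sketch, not part of the installer, that reports which lines of a description exceed a given width. The `longLines` helper and its rule of skipping lines that consist of a single URL are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// longLines returns the 1-based numbers of the lines in text that are longer
// than limit characters. Lines that consist of a single https:// URL are
// skipped (an assumption of this sketch), because a URL cannot be wrapped.
func longLines(text string, limit int) []int {
	var tooLong []int
	for i, line := range strings.Split(text, "\n") {
		trimmed := strings.TrimSpace(line)
		isURL := strings.HasPrefix(trimmed, "https://") && !strings.Contains(trimmed, " ")
		if !isURL && utf8.RuneCountInString(line) > limit {
			tooLong = append(tooLong, i+1)
		}
	}
	return tooLong
}

func main() {
	doc := "confidentialCompute is an optional field.\n" +
		"When set to AMDEncryptedVirtualization, the machine will be configured as a confidential computing instance with AMD SEV.\n" +
		"https://cloud.google.com/confidential-computing/confidential-vm/docs/supported-configurations#machine-type-cpu-zone"
	fmt.Println("lines over 90 characters:", longLines(doc, 90))
}
```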

@bgartzi
Contributor Author

bgartzi commented Mar 20, 2025

I just addressed your concerns in the latest force-pushed version. Could you confirm, @jianli-wei?

I also fixed the validation issue you found with onHostMaintenance=Migrate being configured while the ConfidentialCompute value wasn't Disabled.

Thanks a lot!
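
For readers following along, the rule discussed here (onHostMaintenance must be "Terminate" whenever confidentialCompute is anything other than Disabled) can be sketched in Go as below. This is an illustrative sketch only: `validateOnHostMaintenance` is a hypothetical function that takes plain strings, whereas the installer's real validation works on its own types and error lists.

```go
package main

import "fmt"

// validateOnHostMaintenance is a hypothetical helper. It rejects any
// onHostMaintenance value other than "Terminate" when confidential
// computing is requested, including all of the newly added values.
func validateOnHostMaintenance(confidentialCompute, onHostMaintenance string) error {
	// An omitted value currently defaults to Disabled, so nothing to check.
	if confidentialCompute == "" || confidentialCompute == "Disabled" {
		return nil
	}
	if onHostMaintenance != "Terminate" {
		return fmt.Errorf(
			"onHostMaintenance must be set to %q when confidentialCompute is %q, got %q",
			"Terminate", confidentialCompute, onHostMaintenance)
	}
	return nil
}

func main() {
	cases := [][2]string{
		{"Disabled", "Migrate"},
		{"Enabled", "Terminate"},
		{"IntelTrustedDomainExtensions", "Migrate"},
	}
	for _, c := range cases {
		err := validateOnHostMaintenance(c[0], c[1])
		fmt.Printf("confidentialCompute=%s onHostMaintenance=%s -> %v\n", c[0], c[1], err)
	}
}
```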

@sadasu
Contributor

sadasu commented Mar 20, 2025

/retest-required

@jianli-wei
Contributor

I just addressed your concerns in the latest force-pushed version. Could you confirm, @jianli-wei?

@bgartzi Thanks for the quick action! The updates look good to me, just 2 minor suggestions.

(1) The current output:

$ ./openshift-install explain installconfig.platform.gcp.defaultMachinePlatform.confidentialCompute
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <string>
  Default: "Disabled"
  Valid Values: "","Enabled","Disabled","AMDEncryptedVirtualization","AMDEncryptedVirtualizationNestedPaging","IntelTrustedDomainExtensions"
  confidentialCompute is an optional field defining whether the instance should have
confidential compute enabled or not, and the Confidential Computing technology of choice.
 With Disabled, Confidential Computing is disabled.
 With Enabled, Confidential Computing is enabled with no preference on the
Confidential Compute policy. The platform chooses a default i.e. AMD SEV,
which is subject to change over time.
 With AMDEncryptedVirtualization, Confidential Computing is enabled with
AMD Secure Encrypted Virtualization (AMD SEV).
 With AMDEncryptedVirtualizationNestedPaging, Confidential Computing is
enabled with AMD Secure Encrypted Virtualization Secure Nested Paging
(AMD SEV-SNP).
 With IntelTrustedDomainExtensions, Confidential Computing is enabled with
Intel Trusted Domain Extensions (Intel TDX).
If any value other than Disabled is set, a machine type and region that supports
Confidential Computing must be specified. Machine series and regions supporting
Confidential Computing technologies can be checked at
https://cloud.google.com/confidential-computing/confidential-vm/docs/supported-configurations#machine-type-cpu-zone
If any value other than Disabled is set onHostMaintenance is required to be set
to "Terminate".

$ 

(2) Minor suggestions:

  • Replace "confidential compute" with "Confidential Computing", and replace "Confidential Compute policy" with "Confidential Computing technologies".
  • Insert 4 leading spaces before statements like "With {Disabled|Enabled|AMDEncryptedVirtualization|AMDEncryptedVirtualizationNestedPaging|IntelTrustedDomainExtensions}, ..." and before the statements starting with "If any value other than Disabled is set".
  • Insert a comma after the word "set" in "If any value other than Disabled is set onHostMaintenance is required...".

FYI, with the above suggestions applied, the output should look like this:

$ ./openshift-install explain installconfig.platform.gcp.defaultMachinePlatform.confidentialCompute
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <string>
  Default: "Disabled"
  Valid Values: "","Enabled","Disabled","AMDEncryptedVirtualization","AMDEncryptedVirtualizationNestedPaging","IntelTrustedDomainExtensions"
  confidentialCompute is an optional field defining whether the instance should have
Confidential Computing enabled or not, and the Confidential Computing technology of choice.
    With Disabled, Confidential Computing is disabled.
    With Enabled, Confidential Computing is enabled with no preference on the
Confidential Computing technologies. The platform chooses a default i.e. AMD SEV,
which is subject to change over time.
    With AMDEncryptedVirtualization, Confidential Computing is enabled with
AMD Secure Encrypted Virtualization (AMD SEV).
    With AMDEncryptedVirtualizationNestedPaging, Confidential Computing is
enabled with AMD Secure Encrypted Virtualization Secure Nested Paging
(AMD SEV-SNP).
    With IntelTrustedDomainExtensions, Confidential Computing is enabled with
Intel Trusted Domain Extensions (Intel TDX).
    If any value other than Disabled is set, a machine type and region that supports
Confidential Computing must be specified. Machine series and regions supporting
Confidential Computing technologies can be checked at
https://cloud.google.com/confidential-computing/confidential-vm/docs/supported-configurations#machine-type-cpu-zone
    If any value other than Disabled is set, onHostMaintenance is required to be set
to "Terminate".

$ 

@patrickdillon
Contributor

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 21, 2025
@patrickdillon
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 21, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 24, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 24, 2025
@bgartzi
Contributor Author

bgartzi commented Mar 24, 2025

Hey @jianli-wei, @patrickdillon, thanks for the review and testing. I just pushed a new version. Compared to the previous one:

  • I rebased the branch on latest main due to conflicts.
  • I dropped the commit bumping openshift/api as the commits I needed are included in the latest vendored version.
  • I addressed explain's output style concerns.
  • I added a validation step that checks that the configured confidentialCompute matches the confidential computing technology supported by the configured instance type, as requested in https://issues.redhat.com/browse/CORS-3924.

@bgartzi
Contributor Author

bgartzi commented Mar 24, 2025

Fixing some linter and gofmt issues that the CI found...

I'm not sure why ci/prow/okd-scos-images failed, though. I couldn't find much helpful information in the job logs. Is it really hitting the rate limiter?

bgartzi added 4 commits March 25, 2025 11:07
In the latest openshift/api bump confidentialCompute was extended from
the previous Enabled/Disabled enum to also consider:
    - AMDEncryptedVirtualization
    - AMDEncryptedVirtualizationNestedPaging
    - IntelTrustedDomainExtensions

Update godocs in gcp machinepools to consider those values that will let
the user choose from different confidential instance types.

Re-generate data/data/install.openshift.io_installconfigs.yaml based on
latest openshift/api bump.

OnHostMaintenance has to be set to Terminate when a ConfidentialCompute
value that is not Disabled is configured. Extend the validation so the
new values added to ConfidentialCompute are also considered.

Confidential computing in GCP is supported in a limited set of instance
types only (see [0]). Other openshift components such as cluster-api
already cover this validation. However, running this validation previous
to manifest generation can be beneficial for the user.

This commit adds validation to GCP instances that have been configured a
confidentialCompute value other than Disabled. It checks that the
instance type supports the configured confidential computing technology.

[0] https://cloud.google.com/confidential-computing/confidential-vm/docs/supported-configurations#machine-type-cpu-zone
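
The instance-type check described in the last commit message can be approximated with the Go sketch below. It is not the code from this PR: the `supportedSeries` table, `machineSeries`, and `validateInstanceType` are hypothetical names. The series lists are the ones quoted in the earlier explain output in this conversation (c2d, n2d, c3d for AMD SEV; n2d for AMD SEV-SNP; c3 for Intel TDX) and may be out of date compared with the GCP documentation at [0].

```go
package main

import (
	"fmt"
	"strings"
)

// supportedSeries lists machine series per technology, as quoted earlier in
// this conversation. Treat it as illustrative data, not an authoritative list.
var supportedSeries = map[string][]string{
	"AMDEncryptedVirtualization":             {"c2d", "n2d", "c3d"},
	"AMDEncryptedVirtualizationNestedPaging": {"n2d"},
	"IntelTrustedDomainExtensions":           {"c3"},
}

// machineSeries extracts the series prefix from an instance type,
// for example "n2d" from "n2d-standard-4".
func machineSeries(instanceType string) string {
	series, _, _ := strings.Cut(instanceType, "-")
	return series
}

// validateInstanceType checks that the instance type's series supports the
// configured confidential computing technology.
func validateInstanceType(confidentialCompute, instanceType string) error {
	if confidentialCompute == "" || confidentialCompute == "Disabled" {
		return nil
	}
	if confidentialCompute == "Enabled" {
		// Enabled currently resolves to AMD SEV; that default may change.
		confidentialCompute = "AMDEncryptedVirtualization"
	}
	allowed, ok := supportedSeries[confidentialCompute]
	if !ok {
		return fmt.Errorf("unknown confidentialCompute value %q", confidentialCompute)
	}
	series := machineSeries(instanceType)
	for _, s := range allowed {
		if s == series {
			return nil
		}
	}
	return fmt.Errorf("instance type %q (series %q) does not support %s; supported series: %s",
		instanceType, series, confidentialCompute, strings.Join(allowed, ", "))
}

func main() {
	fmt.Println(validateInstanceType("IntelTrustedDomainExtensions", "c3-standard-4"))
	fmt.Println(validateInstanceType("IntelTrustedDomainExtensions", "c3d-standard-4"))
	fmt.Println(validateInstanceType("AMDEncryptedVirtualizationNestedPaging", "n2d-standard-4"))
}
```

Matching on the exact series rather than on a string prefix keeps c3 and c3d apart, which matters here because the two series support different technologies.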
Contributor

openshift-ci bot commented Mar 25, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bgartzi
Contributor Author

bgartzi commented Mar 25, 2025

I just dropped the commit bumping the cluster-api-provider-gcp version. The changes I needed were already included by #9528.

@patrickdillon
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2025
@jianli-wei
Contributor

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Mar 28, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 28, 2025

@bgartzi: This pull request references CORS-3923 which is a valid jira issue.

This pull request references CORS-3927 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.19.0" version, but no target version was set.

Contributor

openshift-ci bot commented Mar 31, 2025

@bgartzi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                               Commit   Details  Required  Rerun command
ci/prow/e2e-aws-ovn-single-node         c310fda  link     false     /test e2e-aws-ovn-single-node
ci/prow/e2e-gcp-ovn-xpn                 06d73a5  link     false     /test e2e-gcp-ovn-xpn
ci/prow/e2e-azurestack                  c310fda  link     false     /test e2e-azurestack
ci/prow/e2e-vsphere-ovn-multi-network   06d73a5  link     false     /test e2e-vsphere-ovn-multi-network
ci/prow/e2e-gcp-ovn-byo-vpc             06d73a5  link     false     /test e2e-gcp-ovn-byo-vpc
ci/prow/e2e-azure-ovn-shared-vpc        c310fda  link     false     /test e2e-azure-ovn-shared-vpc
ci/prow/okd-scos-e2e-aws-ovn            06d73a5  link     false     /test okd-scos-e2e-aws-ovn
ci/prow/e2e-azure-ovn-resourcegroup     06d73a5  link     false     /test e2e-azure-ovn-resourcegroup
ci/prow/e2e-vsphere-externallb-ovn      06d73a5  link     false     /test e2e-vsphere-externallb-ovn

Full PR test history. Your PR dashboard.


@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD f404808 and 2 for PR HEAD 06d73a5 in total

@openshift-merge-bot openshift-merge-bot bot merged commit 97875b2 into openshift:main Apr 1, 2025
18 of 26 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. qe-approved Signifies that QE has signed off on this PR

7 participants