Skip to content

fix: add support for cpu nodes in default cluster configuration #13

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

imprateeksh
Copy link
Member

Description

This PR aims to fix the gap in validation by moving GPU configuration to additional worker pool and some minor changes emerged during DA rally syncup.
Issue: #13375

Release required?

  • No release
  • Patch release (x.x.X)
  • Minor release (x.X.x)
  • Major release (X.x.x)
Release notes content
  • Updated the default worker pool to include non GPU machine types.
  • Support for GPU configuration in the additional worker pool.

Run the pipeline

If the CI pipeline doesn't run when you create the PR, the PR requires a user with GitHub collaborators access to run the pipeline.

Run the CI pipeline when the PR is ready for review and you expect tests to pass. Add a comment to the PR with the following text:

/run pipeline

Checklist for reviewers

  • If relevant, a test for the change is included or updated with this PR.
  • If relevant, documentation for the change is included or updated with this PR.

For mergers

  • Use a conventional commit message to set the release level. Follow the guidelines.
  • Include information that users need to know about the PR in the commit message. The commit message becomes part of the GitHub release notes.
  • Use the Squash and merge option.

@imprateeksh imprateeksh self-assigned this Apr 27, 2025
@imprateeksh imprateeksh requested a review from Vipin654 as a code owner April 27, 2025 18:36
@imprateeksh
Copy link
Member Author

/run pipeline

@imprateeksh
Copy link
Member Author

Validations:

1. When provided incorrect machine_type in additional worker pools i.e. GPU pool

image

2. When provided some wrong GPU configuration - less than 8CPUs and 32 GB memory

image

For Default pool (Non GPU) machine type - This has no restrictions over having machine type as GPU or non GPU but default is non GPU

3. When provided unsupported CPU and Memory configuration
image

image

4. No Error when using GPU as default machine type but if the configuration does not matches then error is shown
image


Comment on lines 90 to 113
validation {
condition = alltrue([
for pool in var.additional_worker_pools : (
startswith(pool.machine_type, "gx2") ||
startswith(pool.machine_type, "gx3") ||
startswith(pool.machine_type, "gx4")
)
])
error_message = "All machine types must be from the 'gx2', 'gx3', or 'gx4' series. Please choose valid machine types with GPU support."
}

validation {
condition = alltrue([
for pool in var.additional_worker_pools : tonumber(split("x", split(".", pool.machine_type)[1])[0]) >= 8
])
error_message = "To install Red Hat OpenShift AI, all worker nodes in all pools must have at least an 8-core CPU."
}

validation {
condition = alltrue([
for pool in var.additional_worker_pools : tonumber(split("x", split(".", pool.machine_type)[1])[1]) >= 32
])
error_message = "To install Red Hat OpenShift AI, all worker nodes in all pools must have at least 32GB memory."
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need separate validations here? These should be validated in base-ocp.

Consider a scenario, if an user adds gpu nodes as default worker pool and non-gpu as additional, then this wont work.

Copy link
Member Author

@imprateeksh imprateeksh Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We needed validations here too because it holds good for all the validations in this file. Please refer this comment.
    Moreover in base-ocp these validations are not part of additional_worker_pools variable but the addons variable hence it will not fail fast for additional worker pools.

  2. Regarding the scenario you mentioned above, should not be considered as a valid scenario as:

  • We need to restrict somewhere on GPU nodes as GPU nodes are mandatory (though not in documentation yet), so enforcing this restriction to have only GPU node should be placed in additional worker pool. The default worker pool will continue to support both GPU and non-GPU nodes as needed.
  • Another point is that for GPU configuration, we need additional storage which is now only part of additional_worker_pool as now removed the default_gpu_worker_pool_storage variable.

Maybe I can include this thing in description i.e machine type has to be GPU only. Please let me know if you think otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, got the point regarding storage.

  1. But still additional worker pools can be added with cpu worker nodes.
  2. If I chose to install OCP AI addon, then does the base-ocp code not validate the inputs for gpu worker node, 32 gb memory etc? If no, that should be added.

Copy link
Member Author

@imprateeksh imprateeksh Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think for this we need to get clarification explicitly. My understanding says, we have to enforce having a GPU as mandatory but I do get your point now to have another item in the list as non GPU type.
    In last DA rally, it was mentioned to have GPU support in additional worker pool, so does it means we need to restrict default pool for only non GPU machines?

If that is so how should we enforce the GPU restriction? With default being GPU machine type, it was easy to handle. I think this needs to be bring to next deep dive.

On a second note, I think base-ocp alone if sufficient to verify this which runs across all worker pools and all machine types to verify if any machine is GPU using anytrue here

  1. Base ocp already handles it, I had provided the link in previous response. Providing again here - https://github.com/terraform-ibm-modules/terraform-ibm-base-ocp-vpc/blob/cc39fcc9521c498fc2800fca96c2314104fdc2a4/variables.tf#L311
    But base-ocp is not handling it inside additional worker pools which is now mandatory and the new change with this PR.
    The validation in base-ocp will be checked when OCP AI addon is added. Also, I can confirm this after testing it with combination of base-ocp and ocp-ai.
    IMHO, even if validations are removed from here, will be gracefully handled in base-ocp.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think no validation is required in this OCP AI code if it is already there in base-ocp.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have validation here because the plan was successful without any validations when I provided no GPU machine type. This plan ran only for OCP AI code and as no validation passed.

image

Finished quickly -
image

If we rely for validation on OCP AI, the failure will be seen quickly when we start the deployment process during validate but when we depend on validations of base-ocp, it will be validated but after quite some time as individual addons (in this case OCP DA) will be validated and that too depend on order of validation if OCP DA chosen at last, error will be caught very late.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants