fix: add support for cpu nodes in default cluster configuration #13
base: main
Conversation
/run pipeline
```hcl
validation {
  condition = alltrue([
    for pool in var.additional_worker_pools : (
      startswith(pool.machine_type, "gx2") ||
      startswith(pool.machine_type, "gx3") ||
      startswith(pool.machine_type, "gx4")
    )
  ])
  error_message = "All machine types must be from the 'gx2', 'gx3', or 'gx4' series. Please choose valid machine types with GPU support."
}

validation {
  condition = alltrue([
    for pool in var.additional_worker_pools : tonumber(split("x", split(".", pool.machine_type)[1])[0]) >= 8
  ])
  error_message = "To install Red Hat OpenShift AI, all worker nodes in all pools must have at least an 8-core CPU."
}

validation {
  condition = alltrue([
    for pool in var.additional_worker_pools : tonumber(split("x", split(".", pool.machine_type)[1])[1]) >= 32
  ])
  error_message = "To install Red Hat OpenShift AI, all worker nodes in all pools must have at least 32GB memory."
}
```
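For context on the second and third conditions: they assume the machine type name encodes vCPU and memory as `<family>.<cpu>x<memory>...` (for example, a profile like `gx3.16x80.l4` — an illustrative name, the exact profiles may differ). A minimal sketch of the parsing logic those validations rely on:

```hcl
# Illustrative sketch (not part of the PR diff): how the CPU/memory checks
# decompose a machine type string. "gx3.16x80.l4" is an assumed example name.
locals {
  machine_type = "gx3.16x80.l4"

  size_part = split(".", local.machine_type)[1]        # "16x80"
  cpu       = tonumber(split("x", local.size_part)[0]) # 16
  memory_gb = tonumber(split("x", local.size_part)[1]) # 80

  # Mirrors the >= 8 CPU and >= 32GB memory conditions above.
  meets_openshift_ai_minimums = local.cpu >= 8 && local.memory_gb >= 32
}
```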
Why do we need separate validations here? These should be validated in base-ocp.
Consider a scenario: if a user adds GPU nodes as the default worker pool and non-GPU nodes as additional pools, then this won't work.
- We needed validations here too because they apply to all the validations in this file. Please refer to this comment. Moreover, in base-ocp these validations are not part of the `additional_worker_pools` variable but of the addons variable, hence it will not fail fast for additional worker pools.
- Regarding the scenario you mentioned above, it should not be considered a valid scenario because:
  - We need to restrict on GPU nodes somewhere, as GPU nodes are mandatory (though not in the documentation yet), so the restriction to have only GPU nodes should be placed on the additional worker pool. The default worker pool will continue to support both GPU and non-GPU nodes as needed.
  - Another point is that for GPU configuration we need additional storage, which is now only part of `additional_worker_pool`, since the `default_gpu_worker_pool_storage` variable was removed.

Maybe I can include this in the description, i.e. that the machine type has to be GPU only. Please let me know if you think otherwise.
Ok, got the point regarding storage.
- But additional worker pools can still be added with CPU worker nodes.
- If I choose to install the OCP AI addon, does the base-ocp code not validate the inputs for GPU worker nodes, 32 GB memory, etc.? If not, that should be added.
- I think we need to get explicit clarification on this. My understanding is that we have to enforce a GPU as mandatory, but I now do get your point about having another non-GPU item in the list.
  In the last DA rally, it was mentioned to have GPU support in the additional worker pool, so does that mean we need to restrict the default pool to only non-GPU machines?
  If so, how should we enforce the GPU restriction? With the default being a GPU machine type, it was easy to handle. I think this needs to be brought to the next deep dive.
- On a second note, I think base-ocp alone is sufficient to verify this, since it runs across all worker pools and all machine types to check whether any machine is GPU using `anytrue`
here
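As a hedged sketch, an `anytrue`-based check of the kind suggested above could look like the following (the variable name and object shape are assumptions for illustration, not the actual base-ocp code):

```hcl
# Hypothetical sketch: fail unless at least one pool anywhere in the
# cluster uses a GPU-class (gx2/gx3/gx4) machine type.
variable "all_worker_pools" {
  type = list(object({
    machine_type = string
  }))

  validation {
    condition = anytrue([
      for pool in var.all_worker_pools :
      can(regex("^gx[234]", pool.machine_type))
    ])
    error_message = "At least one worker pool must use a GPU machine type from the 'gx2', 'gx3', or 'gx4' series."
  }
}
```

Unlike the per-pool `alltrue` validations in this PR, a check like this would permit mixing GPU and non-GPU pools as long as one GPU pool exists.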
- Base-ocp already handles it; I had provided the link in my previous response. Providing it again here: https://github.com/terraform-ibm-modules/terraform-ibm-base-ocp-vpc/blob/cc39fcc9521c498fc2800fca96c2314104fdc2a4/variables.tf#L311
  But base-ocp is not handling it inside additional worker pools, which are now mandatory and are the new change in this PR.
  The validation in base-ocp will be checked when the OCP AI addon is added. Also, I can confirm this after testing it with a combination of base-ocp and ocp-ai.
  IMHO, even if the validations are removed from here, this will be gracefully handled in base-ocp.
I think no validation is required in this OCP AI code if it is already there in base-ocp.
I think we should have validation here, because the plan succeeded without any validation errors when I provided a non-GPU machine type. That plan ran only the OCP AI code and, with no validations in place, it passed.

If we rely on validation in OCP AI, the failure is seen quickly during the validate step at the start of the deployment process. If we depend on the validations in base-ocp instead, the input will still be validated, but only after quite some time, because individual addons (in this case the OCP DA) are validated in order; if the OCP DA is validated last, the error will be caught very late.

Description
This PR aims to fix the gap in validation by moving the GPU configuration to the additional worker pool, along with some minor changes that emerged during the DA rally syncup.
Issue: #13375
Release required?
- x.x.X
- x.X.x
- X.x.x

Release notes content
Run the pipeline
If the CI pipeline doesn't run when you create the PR, the PR requires a user with GitHub collaborators access to run the pipeline.
Run the CI pipeline when the PR is ready for review and you expect tests to pass. Add a comment to the PR with the following text:
Checklist for reviewers
For mergers