[Bug] TPU_WORKER_HOSTNAMES with long RayCluster names #2923
Tbh, I'm not sure where the changes should be done, whether in this repo or here, or both.
I already opened an issue regarding the pods' naming, which was affected by a similar issue (if we consider naming to be the issue): #2227
cc @ryanaoleary
A short-term fix for this is just to limit the name of RayJobs or RayServices to 13 characters, since, as you noted, the generated RayCluster ends up with a much longer name, which is used to build the headless service name and is then truncated here. You're right that this is pretty inconvenient though; I'll put out a longer-term fix. The possible options I see are:
I'm leaning towards the first one since I can roll that change out faster, especially if you're manually installing the webhook.
Thinking about it a little more, creating a service informer to watch all the headless services seems expensive and a bit overcomplicated. I can just apply the same truncation in the webhook and it should work:
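For illustration only (this is not the actual operator or webhook code), a minimal Go sketch of the kind of truncation that would reproduce the service name reported in this issue. The 50-character limit, the tail-keeping behavior, and the leading-letter substitution are assumptions inferred from the names below, not taken from the KubeRay source:

```go
package main

import "fmt"

// truncateHeadlessServiceName is a hypothetical helper that mimics the kind of
// truncation the operator appears to apply when it builds the headless worker
// service name. The 50-character limit and the leading-"r" substitution are
// inferred from the names in this issue, not copied from KubeRay.
func truncateHeadlessServiceName(clusterName string) string {
	name := clusterName + "-headless-worker-svc"
	const maxLen = 50
	if len(name) > maxLen {
		name = name[len(name)-maxLen:] // keep the tail of the name
	}
	// Service names are DNS labels and must start with a letter.
	if name[0] == '-' || (name[0] >= '0' && name[0] <= '9') {
		name = "r" + name[1:]
	}
	return name
}

func main() {
	fmt.Println(truncateHeadlessServiceName("experiment-0353b325-c892-raycluster-ndbdj"))
	// Prints: r353b325-c892-raycluster-ndbdj-headless-worker-svc
}
```

If the webhook built TPU_WORKER_HOSTNAMES with the same truncation, the injected hostnames would match the Service the operator actually creates.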
This should be fixed with GoogleCloudPlatform/ai-on-gke#963. Once it's merged and the image bumped, you can redeploy the webhook by following the commands here: https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/ray-on-gke/guides/tpu#manually-installing-the-tpu-initialization-webhook
This is out of scope, but is there an automated way? I'm currently using the Helm chart to deploy the webhook.
Once the change is merged, an image will be automatically built and pushed here: us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/tpu-webhook, likely with the tag
The alternative to manually installing/updating the webhook is to use the Ray Operator Addon, which automatically installs this webhook in your cluster. This change would be included in the rapid channel, but it will take a bit longer to go through qualification and become available than just manually installing the latest version with Helm.
Understood. Thank you very much.
I see that the PR got merged, but it seems that the new image wasn't automatically pushed to the registry; the previous version was
I tried the
Sorry for the delay; I needed to tag the new release. The new image has been pushed as
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
I'm trying to run some TPU RayJob, but I couldn't find a way to make even a simple `jax.devices()` call work. So far I understand that I need to install this webhook in order to have multi-host TPU with the 2 required environment variables injected in the containers: `TPU_WORKER_ID` and `TPU_WORKER_HOSTNAMES`.

The problem is:

1. My RayJob is named `experiment-0353b325-c892`, which is not a very long name in my opinion.
2. The generated RayCluster is named `experiment-0353b325-c892-raycluster-ndbdj`. The size almost doubled; why `-raycluster-`? We already know it's a RayCluster since it's the resource kind. The extra hash at the end is fair enough, Kubernetes usually does that on child resources.
3. The headless service ends up named `r353b325-c892-raycluster-ndbdj-headless-worker-svc`. Note that at this point, more than half of the name is a group of constants added to the RayCluster name plus a suffix added on the Service.

Anyway, that's just naming stuff. The main issue is that the generated `TPU_WORKER_HOSTNAMES` does not match the headless service name:

TPU_WORKER_HOSTNAMES=tpu-pool-0-0.experiment-0353b325-c892-raycluster-ndbdj-headless-worker-svc,tpu-pool-0-1.experiment-0353b325-c892-raycluster-ndbdj-headless-worker-svc,tpu-pool-0-2.experiment-0353b325-c892-raycluster-ndbdj-headless-worker-svc,tpu-pool-0-3.experiment-0353b325-c892-raycluster-ndbdj-headless-worker-svc

So the TPU workers fail to connect to each other.
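To make the mismatch concrete, here is a small illustrative sketch; the composition `<pod-hostname>.<headless-service-name>` and the worker group name `tpu-pool-0` are taken from the values above, while the code itself is hypothetical and not from the webhook:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	clusterName := "experiment-0353b325-c892-raycluster-ndbdj"

	// What the webhook injects: hostnames built from the full, untruncated
	// cluster name.
	injectedSvc := clusterName + "-headless-worker-svc"

	// What the operator actually creates: the truncated Service name shown above.
	actualSvc := "r353b325-c892-raycluster-ndbdj-headless-worker-svc"

	hostnames := make([]string, 0, 4)
	for i := 0; i < 4; i++ {
		hostnames = append(hostnames, fmt.Sprintf("tpu-pool-0-%d.%s", i, injectedSvc))
	}
	fmt.Println("TPU_WORKER_HOSTNAMES=" + strings.Join(hostnames, ","))
	fmt.Println("resolvable service:  " + actualSvc)
	// The domain part of every injected hostname refers to a Service that was
	// never created, so DNS resolution fails and the workers cannot reach each other.
}
```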
Reproduction script
Part of the RayJob
Script is https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/tpu/kuberay-tpu-webhook/samples/tpu-test.py
Anything else
No response
Are you willing to submit a PR?