
[Bug] TPU_WORKER_HOSTNAMES with long rayclusters name #2923

Open · a-pichard opened this issue Feb 5, 2025 · 11 comments
Labels: bug (Something isn't working), triage

Comments

a-pichard commented Feb 5, 2025

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I'm trying to run a TPU RayJob, but I couldn't find a way to make even a simple jax.devices() call work. So far I understand that I need to install this webhook in order to run multi-host TPU workloads, with the two required environment variables, TPU_WORKER_ID and TPU_WORKER_HOSTNAMES, injected into the containers.

The problem is:

  • I named my RayJob experiment-0353b325-c892, which is not a very long name in my opinion.
  • It then creates a RayCluster named experiment-0353b325-c892-raycluster-ndbdj. The length almost doubles; why the -raycluster- infix? We already know it is a RayCluster, since that's the resource kind. The extra hash at the end is fine, as Kubernetes usually does that for child resources.
  • The headless worker service created for worker-to-worker communication in the TPU context has its prefix cropped: r353b325-c892-raycluster-ndbdj-headless-worker-svc. Note that at this point more than half of the name is made up of constants: the infix added to the RayCluster name plus the suffix added to the service.

Anyway, that's just naming. The main issue is that the generated TPU_WORKER_HOSTNAMES does not match the headless service name: TPU_WORKER_HOSTNAMES=tpu-pool-0-0.experiment-0353b325-c892-raycluster-ndbdj-headless-worker-svc,tpu-pool-0-1.experiment-0353b325-c892-raycluster-ndbdj-headless-worker-svc,tpu-pool-0-2.experiment-0353b325-c892-raycluster-ndbdj-headless-worker-svc,tpu-pool-0-3.experiment-0353b325-c892-raycluster-ndbdj-headless-worker-svc

So the TPU workers fail to connect to each other.
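
For illustration, here is a minimal, self-contained Go sketch of the mismatch. It is not the actual KubeRay or webhook code: the truncateServiceName helper is hypothetical and only models the front-cropping visible in the names above, using the 50-character budget discussed later in this thread.

package main

import "fmt"

const headlessSuffix = "-headless-worker-svc"

// truncateServiceName models the controller-side cropping observed above: when the full
// name exceeds the budget, the front of the name is dropped so the result still fits.
func truncateServiceName(name string) string {
	const maxLength = 50
	if len(name) > maxLength {
		name = name[len(name)-maxLength:]
	}
	return name
}

func main() {
	rayCluster := "experiment-0353b325-c892-raycluster-ndbdj"

	// What the controller actually creates: prefix cropped.
	serviceName := truncateServiceName(rayCluster + headlessSuffix)

	// What the webhook injects today: built from the full RayCluster name, no cropping.
	hostname := "tpu-pool-0-0." + rayCluster + headlessSuffix

	fmt.Println("service: ", serviceName)
	fmt.Println("hostname:", hostname) // the part after the first dot no longer resolves
}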

Reproduction script

Part of the RayJob

          numOfHost: 4
          name: worker
          resources:
            limits:
              cpu: "8"
              google.com/tpu: "4"
              memory: 8G
            requests:
              cpu: "4"
              google.com/tpu: "4"
              memory: 8G
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 4x4

Script is https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/tpu/kuberay-tpu-webhook/samples/tpu-test.py

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
a-pichard added the bug (Something isn't working) and triage labels on Feb 5, 2025
@a-pichard (Author) commented Feb 5, 2025

Tbh, I'm not sure where the changes should be made, whether in this repo or here, or both.

@a-pichard (Author)

I already opened an issue about pod naming, which was affected by a similar problem (if we consider the naming to be the issue): #2227

@andrewsykim (Collaborator)

cc @ryanaoleary

@ryanaoleary (Contributor)

The short-term fix for this is just to limit the names of RayJobs or RayServices to 13 characters since, as you noted, the generated RayCluster ends up with a much longer name, which is used to build the headless service name and is then truncated here.

You're right that this is pretty inconvenient though, so I'll put out a longer-term fix. The options I see are:

  • look up the service name using the Kubernetes client in the webhook, based on the label ray.io/headless-worker-svc=<raycluster-name> (a rough sketch follows below)
  • remove the truncation in the RayCluster controller

I'm leaning towards the first one since I can roll that change out faster, especially if you're manually installing the webhook.
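
For context, a rough sketch of what that label-based lookup could look like, assuming the webhook has a client-go kubernetes.Interface and the RayCluster's namespace at hand; the helper name is hypothetical and this is not the actual webhook code.

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// lookupHeadlessWorkerSvc (hypothetical) lists Services labeled
// ray.io/headless-worker-svc=<raycluster-name> and returns the actual (possibly
// truncated) service name, so the webhook never has to re-derive it.
func lookupHeadlessWorkerSvc(ctx context.Context, client kubernetes.Interface, namespace, rayClusterName string) (string, error) {
	svcs, err := client.CoreV1().Services(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "ray.io/headless-worker-svc=" + rayClusterName,
	})
	if err != nil {
		return "", err
	}
	if len(svcs.Items) != 1 {
		return "", fmt.Errorf("expected one headless worker service for RayCluster %q, found %d", rayClusterName, len(svcs.Items))
	}
	return svcs.Items[0].Name, nil
}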

@ryanaoleary (Contributor) commented Feb 5, 2025

Thinking about it a little more, creating a service informer to watch all the headless services seems expensive and a bit overcomplicated. I can just apply the same truncation in the webhook and it should work:

maxLength := 50 // 63 - (max(len("-head-"), len("-worker-")) + 5): 6-8 chars go to the "-head-" or "-worker-" infix plus a 5-char generated suffix

if len(s) > maxLength {
	// Shorten the name by cropping from the front, so the generated suffixes still fit.
	offset := len(s) - maxLength
	fmt.Printf("pod name is too long: len = %v, we will shorten it by offset = %v\n", len(s), offset)
	s = s[offset:]
}
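
If the webhook applies that same cropping before it builds the hostnames, the injected env var and the actual service name should line up again. Below is a hedged sketch of what that could look like, assuming a checkName-style helper wrapping the truncation above and the tpu-pool-<replica>-<host> hostname format from the report; none of this is the exact webhook code.

import (
	"fmt"
	"strings"
)

// checkName wraps the truncation shown above (crop from the front past 50 characters).
func checkName(s string) string {
	const maxLength = 50
	if len(s) > maxLength {
		s = s[len(s)-maxLength:]
	}
	return s
}

// buildTPUWorkerHostnames (hypothetical) constructs TPU_WORKER_HOSTNAMES so that the
// service portion matches the truncated headless service created by the controller.
func buildTPUWorkerHostnames(rayClusterName, groupName string, replicaIndex, numOfHosts int) string {
	serviceName := checkName(rayClusterName + "-headless-worker-svc")

	hostnames := make([]string, 0, numOfHosts)
	for host := 0; host < numOfHosts; host++ {
		// e.g. tpu-pool-0-0.<truncated-service-name>
		hostnames = append(hostnames, fmt.Sprintf("%s-%d-%d.%s", groupName, replicaIndex, host, serviceName))
	}
	return strings.Join(hostnames, ",")
}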

@ryanaoleary (Contributor) commented Feb 6, 2025

This should be fixed by GoogleCloudPlatform/ai-on-gke#963. Once it's merged and the image is bumped, you can redeploy the webhook by following the commands here: https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/ray-on-gke/guides/tpu#manually-installing-the-tpu-initialization-webhook

@a-pichard (Author) commented Feb 6, 2025

> especially if you're manually installing the webhook.

This is out of scope, but is there an automated way?

I'm currently using the helm chart to deploy the webhook.

@ryanaoleary (Contributor)

Once the change is merged, an image will be automatically built and pushed here: us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/tpu-webhook, likely with the tag v1.2.2-gke.0. I'll then update the Chart.yaml and values.yaml files so that image is used by default. Before that point, feel free to update the values.yaml file yourself to use the new image and redeploy with helm once the image has been pushed.

The alternative to manually installing/updating the webhook is to use the Ray Operator Addon, which automatically installs this webhook in your cluster. This change would be included in the rapid channel, but it will take a bit longer to go through qualification and become available than just manually installing the latest version with helm.

@a-pichard (Author)

Understood. Thank you very much.

@a-pichard (Author)

I see that the PR got merged, but it seems the new image wasn't automatically pushed to the registry; the previous version was v1.2.1-gke.0:

> gcloud artifacts docker tags list us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook
...
v1.2.1-gke.0  us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/tpu-webhook          sha256:7501c3e35f034b30c35c93fbabcd9f1e828408cb4d30016b8678bdfa6d4b5bc8
v1.2.1-gke.1  us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/tpu-webhook          sha256:06d1b7f653251f46f4741a746ee36f1f7fe93810892b6ddea41a92525334433a

I tried v1.2.1-gke.1 thinking it would be the new one with the fix, but it isn't.

@ryanaoleary (Contributor)

Sorry for the delay, I needed to tag the new release. The new image has been pushed as us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/tpu-webhook:v1.2.2-gke.0. I'll put out a PR updating the defaults.
