
[Bug] TPU_WORKER_HOSTNAMES with long rayclusters name #2923

Open · a-pichard opened this issue Feb 5, 2025 · 11 comments
Labels: bug (Something isn't working), triage

Comments

a-pichard commented Feb 5, 2025

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I'm trying to run a TPU RayJob, but I couldn't find a way to make even a simple jax.devices() call work. So far I understand that I need to install this webhook in order to run multi-host TPU workloads, with the two required environment variables, TPU_WORKER_ID and TPU_WORKER_HOSTNAMES, injected into the containers.

The problem is:

  • I named my RayJob experiment-0353b325-c892, which is not a very long name in my opinion.
  • It then creates a RayCluster named experiment-0353b325-c892-raycluster-ndbdj. The length almost doubles; why the -raycluster- infix? We already know it is a RayCluster, since that's the resource kind. The extra hash at the end is fine, as Kubernetes usually does that for child resources.
  • The headless worker service created for worker-to-worker communication in the TPU context has its prefix cropped: r353b325-c892-raycluster-ndbdj-headless-worker-svc. Note that at this point more than half of the name is made up of constants: the infix added to the RayCluster name plus the suffix added to the service.

Anyway, that's just naming. The main issue is that the generated TPU_WORKER_HOSTNAMES does not match the headless service name: TPU_WORKER_HOSTNAMES=tpu-pool-0-0.experiment-0353b325-c892-raycluster-ndbdj-headless-worker-svc,tpu-pool-0-1.experiment-0353b325-c892-raycluster-ndbdj-headless-worker-svc,tpu-pool-0-2.experiment-0353b325-c892-raycluster-ndbdj-headless-worker-svc,tpu-pool-0-3.experiment-0353b325-c892-raycluster-ndbdj-headless-worker-svc

So the TPU workers fail to connect to each other.
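
For illustration, here is a minimal, self-contained Go sketch of the mismatch. It is not the actual KubeRay or webhook code: the truncateServiceName helper is hypothetical and only models the front-cropping visible in the names above, using the 50-character budget discussed later in this thread.

package main

import "fmt"

const headlessSuffix = "-headless-worker-svc"

// truncateServiceName models the controller-side cropping observed above: when the full
// name exceeds the budget, the front of the name is dropped so the result still fits.
func truncateServiceName(name string) string {
	const maxLength = 50
	if len(name) > maxLength {
		name = name[len(name)-maxLength:]
	}
	return name
}

func main() {
	rayCluster := "experiment-0353b325-c892-raycluster-ndbdj"

	// What the controller actually creates: prefix cropped.
	serviceName := truncateServiceName(rayCluster + headlessSuffix)

	// What the webhook injects today: built from the full RayCluster name, no cropping.
	hostname := "tpu-pool-0-0." + rayCluster + headlessSuffix

	fmt.Println("service: ", serviceName)
	fmt.Println("hostname:", hostname) // the part after the first dot no longer resolves
}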

Reproduction script

Part of the RayJob

          numOfHost: 4
          name: worker
          resources:
            limits:
              cpu: "8"
              google.com/tpu: "4"
              memory: 8G
            requests:
              cpu: "4"
              google.com/tpu: "4"
              memory: 8G
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 4x4

Script is https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/tpu/kuberay-tpu-webhook/samples/tpu-test.py

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
a-pichard added the bug (Something isn't working) and triage labels on Feb 5, 2025
@a-pichard (Author) commented Feb 5, 2025

Tbh, I'm not sure where the changes should be made, whether in this repo or here, or both.

@a-pichard (Author)

I already opened an issue about pod naming, which was affected by a similar problem (if we consider the naming to be the issue): #2227

@andrewsykim (Collaborator)

cc @ryanaoleary

@ryanaoleary (Contributor)

The short-term fix for this is just to limit the names of RayJobs or RayServices to 13 characters since, as you noted, the generated RayCluster ends up with a much longer name, which is used to build the headless service name and is then truncated here.

You're right that this is pretty inconvenient though, so I'll put out a longer-term fix. The options I see are:

  • look up the service name using the Kubernetes client in the webhook, based on the label ray.io/headless-worker-svc=<raycluster-name> (a rough sketch follows below)
  • remove the truncation in the RayCluster controller

I'm leaning towards the first one since I can roll that change out faster, especially if you're manually installing the webhook.
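
For context, a rough sketch of what that label-based lookup could look like, assuming the webhook has a client-go kubernetes.Interface and the RayCluster's namespace at hand; the helper name is hypothetical and this is not the actual webhook code.

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// lookupHeadlessWorkerSvc (hypothetical) lists Services labeled
// ray.io/headless-worker-svc=<raycluster-name> and returns the actual (possibly
// truncated) service name, so the webhook never has to re-derive it.
func lookupHeadlessWorkerSvc(ctx context.Context, client kubernetes.Interface, namespace, rayClusterName string) (string, error) {
	svcs, err := client.CoreV1().Services(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "ray.io/headless-worker-svc=" + rayClusterName,
	})
	if err != nil {
		return "", err
	}
	if len(svcs.Items) != 1 {
		return "", fmt.Errorf("expected one headless worker service for RayCluster %q, found %d", rayClusterName, len(svcs.Items))
	}
	return svcs.Items[0].Name, nil
}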

@ryanaoleary (Contributor) commented Feb 5, 2025

Thinking about it a little more, creating a service informer to watch all the headless services seems expensive and a bit overcomplicated. I can just apply the same truncation in the webhook and it should work:

maxLength := 50 // 63 - (max(len("-head-"), len("-worker-")) + 5): 6-8 chars go to the "-head-" or "-worker-" infix plus a 5-char generated suffix

if len(s) > maxLength {
	// Shorten the name by cropping from the front, so the generated suffixes still fit.
	offset := len(s) - maxLength
	fmt.Printf("pod name is too long: len = %v, we will shorten it by offset = %v\n", len(s), offset)
	s = s[offset:]
}
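
If the webhook applies that same cropping before it builds the hostnames, the injected env var and the actual service name should line up again. Below is a hedged sketch of what that could look like, assuming a checkName-style helper wrapping the truncation above and the tpu-pool-<replica>-<host> hostname format from the report; none of this is the exact webhook code.

import (
	"fmt"
	"strings"
)

// checkName wraps the truncation shown above (crop from the front past 50 characters).
func checkName(s string) string {
	const maxLength = 50
	if len(s) > maxLength {
		s = s[len(s)-maxLength:]
	}
	return s
}

// buildTPUWorkerHostnames (hypothetical) constructs TPU_WORKER_HOSTNAMES so that the
// service portion matches the truncated headless service created by the controller.
func buildTPUWorkerHostnames(rayClusterName, groupName string, replicaIndex, numOfHosts int) string {
	serviceName := checkName(rayClusterName + "-headless-worker-svc")

	hostnames := make([]string, 0, numOfHosts)
	for host := 0; host < numOfHosts; host++ {
		// e.g. tpu-pool-0-0.<truncated-service-name>
		hostnames = append(hostnames, fmt.Sprintf("%s-%d-%d.%s", groupName, replicaIndex, host, serviceName))
	}
	return strings.Join(hostnames, ",")
}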

@ryanaoleary (Contributor) commented Feb 6, 2025

This should be fixed by GoogleCloudPlatform/ai-on-gke#963. Once it's merged and the image is bumped, you can redeploy the webhook by following the commands here: https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/ray-on-gke/guides/tpu#manually-installing-the-tpu-initialization-webhook

@a-pichard (Author) commented Feb 6, 2025

> especially if you're manually installing the webhook.

This is out of scope, but is there an automated way?

I'm currently using the helm chart to deploy the webhook.

@ryanaoleary (Contributor)

Once the change is merged, an image will be automatically built and pushed here: us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/tpu-webhook, likely with the tag v1.2.2-gke.0. I'll then update the Chart.yaml and values.yaml files so that image is used by default. Before that point, feel free to update the values.yaml file yourself to use the new image and redeploy with helm once the image has been pushed.

The alternative to manually installing/updating the webhook is to use the Ray Operator Addon, which automatically installs this webhook in your cluster. This change would be included in the rapid channel, but it will take a bit longer to go through qualification and become available than just manually installing the latest version with helm.

@a-pichard (Author)

Understood. Thank you very much.

@a-pichard (Author)

I see that the PR got merged, but it seems the new image wasn't automatically pushed to the registry; the previous version was v1.2.1-gke.0:

> gcloud artifacts docker tags list us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook
...
v1.2.1-gke.0  us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/tpu-webhook          sha256:7501c3e35f034b30c35c93fbabcd9f1e828408cb4d30016b8678bdfa6d4b5bc8
v1.2.1-gke.1  us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/tpu-webhook          sha256:06d1b7f653251f46f4741a746ee36f1f7fe93810892b6ddea41a92525334433a

I tried v1.2.1-gke.1 thinking it would be the new one with the fix, but it isn't.

@ryanaoleary (Contributor)

Sorry for the delay, I needed to tag the new release. The new image has been pushed as us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/tpu-webhook:v1.2.2-gke.0. I'll put out a PR updating the defaults.
