Advanced load balancing optimizations

This page describes how to use a service load balancing policy to support advanced cost, latency, and resiliency optimizations for the following load balancers:

  • Global external Application Load Balancer
  • Cross-region internal Application Load Balancer
  • Global external proxy Network Load Balancer
  • Cross-region internal proxy Network Load Balancer

Cloud Service Mesh also supports advanced load balancing optimizations. For details, see Advanced load balancing overview in the Cloud Service Mesh documentation.

A service load balancing policy (serviceLbPolicy) is a resource associated with the load balancer's backend service. A service load balancing policy lets you customize the parameters that influence how traffic is distributed within the backends associated with a backend service:

  • Customize the load balancing algorithm used to determine how traffic is distributed within a particular region or zone.
  • Enable auto-capacity draining so that the load balancer can quickly drain traffic from unhealthy backends.
  • Set a failover threshold to determine when a backend is considered unhealthy. This lets traffic fail over to a different backend to avoid unhealthy backends.

Additionally, you can designate specific backends as preferred backends. These backends are used to capacity before requests are sent to the remaining backends.

The following diagram shows how Cloud Load Balancing evaluates routing, load balancing, and traffic distribution.

Figure: How Cloud Load Balancing makes routing and traffic distribution decisions.

Before you begin

Before reviewing the contents of this page, carefully review the Request distribution process described on the External Application Load Balancer overview page. For load balancers that are always Premium Tier, all the load balancing algorithms described on this page support spilling over between regions if a first-choice region is already full.

Supported backends

Service load balancing policies and all of the features described on this page require compatible backends that support a balancing mode. Supported backends are summarized in the following table:

  • Instance groups: Supported. Zonal unmanaged and zonal managed instance groups are supported, but regional managed instance groups are not.
  • Zonal NEGs (GCE_VM_IP_PORT endpoints): Supported.
  • Zonal NEGs (GCE_VM_IP endpoints): Not supported. Application Load Balancers and proxy Network Load Balancers don't support these NEGs.
  • Hybrid NEGs (NON_GCP_PRIVATE_IP_PORT endpoints): Supported.
  • Serverless NEGs: Not supported. These NEGs don't support a balancing mode.
  • Internet NEGs: Not supported. These NEGs don't support a balancing mode.
  • Private Service Connect NEGs: Not supported. These NEGs don't support a balancing mode.

Load balancing algorithms

This section describes the load balancing algorithms that you can configure in a service load balancing policy. If you don't configure an algorithm, or if you don't configure a service load balancing policy at all, the load balancer uses WATERFALL_BY_REGION by default.
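
For example, the following sketch creates a policy that selects one of the non-default algorithms described below, using the gcloud command covered later in Create a policy; the policy name is a placeholder:

gcloud network-services service-lb-policies create SERVICE_LB_POLICY_NAME \
    --load-balancing-algorithm=SPRAY_TO_REGION \
    --location=global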

Waterfall by region

WATERFALL_BY_REGION is the default load balancing algorithm. With this algorithm, in aggregate, all the Google Front Ends (GFEs) in the region closest to the user attempt to fill backends in proportion to their configured target capacities (modified by their capacity scalers).

Each individual second-layer GFE prefers to select backend instances or endpoints in a zone that's as close as possible (defined by network round-trip time) to the second-layer GFE. Because WATERFALL_BY_REGION minimizes latency between zones, at low request rates, each second-layer GFE might exclusively send requests to backends in the second-layer GFE's preferred zone.

If all the backends in the closest region are running at their configured capacity limit, traffic starts to overflow to the next closest region while still optimizing network latency.

Spray to region

The SPRAY_TO_REGION algorithm modifies the behavior of each second-layer GFE so that it has no preference for backend instances or endpoints in a nearby zone. With SPRAY_TO_REGION, each second-layer GFE sends requests to all backend instances or endpoints in all zones of the region, without preference for a shorter round-trip time between the second-layer GFE and the backend instances or endpoints.

Like WATERFALL_BY_REGION, in aggregate, all second-layer GFEs in the region fill backends in proportion to their configured target capacities (modified by their capacity scalers).

While SPRAY_TO_REGION provides more uniform distribution among backends in all zones of a region, especially at low request rates, this uniform distribution comes with the following considerations:

  • When backends go down (but continue to pass their health checks), more second-layer GFEs are affected, though individual impact is less severe.
  • Because each second-layer GFE has no preference for one zone over another, the second-layer GFEs create more cross-zone traffic. Depending on the number of requests being processed, each second-layer GFE might create more TCP connections to the backends as well.

Waterfall by zone

The WATERFALL_BY_ZONE algorithm modifies the behavior of each second-layer GFE so that it has a very strong preference for backend instances or endpoints in the zone closest to it. With WATERFALL_BY_ZONE, each second-layer GFE sends requests to backend instances or endpoints in other zones of the region only after it has filled (or proportionally overfilled) the backend instances or endpoints in its most favored zone.

Like WATERFALL_BY_REGION, in aggregate, all second-layer GFEs in the region fill backends in proportion to their configured target capacities (modified by their capacity scalers).

While WATERFALL_BY_ZONE minimizes latency, be aware of the following considerations:

  • WATERFALL_BY_ZONE does not inherently minimize cross-zone connections. The algorithm is steered by latency only.
  • WATERFALL_BY_ZONE does not guarantee that each second-layer GFE always fills its most favored zone before filling other zones. Maintenance events can temporarily cause all traffic from a second-layer GFE to be sent to backend instances or endpoints in another zone.
  • WATERFALL_BY_ZONE can result in less uniform distribution of requests among all backend instances or endpoints within the region as a whole. For example, backend instances or endpoints in the second-layer GFE's most favored zone might be filled to capacity while backends in other zones are not filled to capacity.

Compare load balancing algorithms

The following list compares the different load balancing algorithms.

  • Uniform capacity usage within a single region: Yes for Waterfall by region and Spray to region; No for Waterfall by zone.
  • Uniform capacity usage across multiple regions: No for all three algorithms.
  • Uniform traffic split from the load balancer: Yes for Spray to region; No for Waterfall by region and Waterfall by zone.
  • Cross-zone traffic distribution: Yes for all three algorithms, with different behavior:
      • Waterfall by region: Traffic is distributed evenly across zones in a region while optimizing network latency; traffic might be sent across zones if needed.
      • Spray to region: Traffic is sent to all zones without preference.
      • Waterfall by zone: Traffic first goes to the nearest zone until it is at capacity, and then goes to the next closest zone.
  • Sensitivity to traffic spikes in a local zone:
      • Waterfall by region: Average; depends on how much traffic has already been shifted to balance across zones.
      • Spray to region: Lower; single-zone spikes are spread across all zones in the region.
      • Waterfall by zone: Higher; single-zone spikes are likely to be served entirely by the same zone until the load balancer is able to react.

Auto-capacity draining and undraining

Auto-capacity draining and undraining combine the concepts of health checks and backend capacity. With auto-capacity draining, health checks are used as an additional signal to set effective backend capacity to zero. With auto-capacity undraining, health checks are used as an additional signal to restore the effective backend capacity to its previous value.

Without auto-capacity draining and undraining, if you want to direct requests away from all backends in a particular region, you must manually set the effective capacity of each backend in that region to zero. For example, you can use the capacity scaler to do this.
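
For example, the following sketch manually drains a single instance group backend by setting its capacity scaler to zero; the resource names are placeholders, and you would repeat the command for every backend in the region:

gcloud compute backend-services update-backend BACKEND_SERVICE_NAME \
    --instance-group=INSTANCE_GROUP_NAME \
    --instance-group-zone=INSTANCE_GROUP_ZONE \
    --capacity-scaler=0.0 \
    --global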

With auto-capacity draining and undraining, health checks can be used as a signal to adjust the capacity of a backend, either by draining or undraining.

To enable auto-capacity draining and undraining, see Configure a service load balancing policy.
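
In the service load balancing policy resource, this behavior corresponds to the autoCapacityDrain field, as shown in this fragment of the YAML sample from Create a policy later on this page:

autoCapacityDrain:
    enable: True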

Auto-capacity draining

Auto-capacity draining sets the capacity of a backend to zero when both of the following conditions are true:

  • Fewer than 25% of the backend's instances or endpoints pass health checks.
  • The total number of backend instance groups or NEGs that are drained automatically doesn't exceed 50% of the total backend instance groups or NEGs. When calculating the 50% ratio, backends with zero capacity are not included in the numerator, but all backends are included in the denominator. For example, if a backend service has ten backends and two of them already have zero capacity, at most five backends can be drained automatically, and the two zero-capacity backends don't count toward that limit of five.

Backends with zero capacity are the following:

  • Backend instance groups with no member instances, where the instance group capacity is defined on a per instance basis
  • Backend NEGs with no member endpoints, where the NEG capacity is defined on a per endpoint basis
  • Backend instance groups or NEGs with capacity scalers set to zero

Automatically drained backend capacity is functionally equivalent to manually setting a backend's backendService.backends[].capacityScaler to 0, but without setting the capacity scaler value.

Auto-capacity undraining

Auto-capacity undraining returns the capacity of a backend to the value controlled by the backend's capacity scaler when 35% or more of the backend instances or endpoints pass health checks for at least 60 seconds. The 60-second requirement reduces the chance of sequential draining and undraining when health checks fail and pass in rapid succession.

Failover threshold

The load balancer determines the distribution of traffic among backends in a multi-level fashion. In the steady state, it sends traffic to backends that are selected based on one of the previously described load balancing algorithms. These backends, called primary backends, are considered optimal in terms of latency and capacity.

The load balancer also keeps track of other backends that can be used if the primary backends become unhealthy and are unable to handle traffic. These backends are called failover backends. These backends are typically nearby backends with remaining capacity.

If instances or endpoints in the primary backend become unhealthy, the load balancer doesn't shift traffic to other backends immediately. Instead, the load balancer first shifts traffic to other healthy instances or endpoints in the same backend to help stabilize traffic load. If too many endpoints in a primary backend are unhealthy, and the remaining endpoints in the same backend are not able to handle the extra traffic, the load balancer uses the failover threshold to determine when to start sending traffic to a failover backend. The load balancer tolerates unhealthiness in the primary backend up to the failover threshold. After that, traffic is shifted away from the primary backend.

The failover threshold is a value between 1 and 99, expressed as a percentage of endpoints in a backend that must be healthy. If the percentage of healthy endpoints falls below the failover threshold, the load balancer tries to send traffic to a failover backend. By default, the failover threshold is 70.

If the failover threshold is set too high, unnecessary traffic spills can occur due to transient health changes. If the failover threshold is set too low, the load balancer continues to send traffic to the primary backends even though there are a lot of unhealthy endpoints.

Failover decisions are localized. Each local Google Front End (GFE) behaves independently of the others. It is your responsibility to make sure that your failover backends can handle the additional traffic.

Failover traffic can result in overloaded backends. Even if a backend is unhealthy, the load balancer might still send traffic there. To exclude unhealthy backends from the pool of available backends, enable the auto-capacity drain feature.
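
In the service load balancing policy resource, the failover threshold corresponds to the failoverConfig.failoverHealthThreshold field, as shown in this fragment of the YAML sample from Create a policy later on this page (the value 70 mirrors the default):

failoverConfig:
    failoverHealthThreshold: 70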

Preferred backends

Preferred backends are backends whose capacity you want to use completely before spilling traffic over to other backends. You can configure your load balancer to prefer and completely use one or more backends attached to a backend service before routing subsequent requests to the remaining backends. Any traffic over the configured capacity of the preferred backends is routed to the remaining non-preferred backends, and the load balancing algorithm then distributes traffic among those non-preferred backends.

Consider the following limitations when you use preferred backends:

  • The backends configured as preferred backends might be further away from the clients and result in higher average latency for client requests. This happens even if there are other closer backends which could have served the clients with lower latency.
  • The load balancing algorithms (WATERFALL_BY_REGION, SPRAY_TO_REGION, and WATERFALL_BY_ZONE) don't apply to backends configured as preferred backends.

To learn how to set preferred backends, see Set preferred backends.

Configure a service load balancing policy

The service load balancing policy resource lets you configure the following fields:

  • Load balancing algorithm
  • Auto-capacity draining
  • Failover threshold

To set a preferred backend, see Set preferred backends.

Create a policy

To create and configure a service load balancing policy, complete the following steps:

  1. Create a service load balancing policy resource. You can do this either by importing a YAML file or directly by using gcloud parameters.

    • With a YAML file. You specify service load balancing policies in a YAML file. Here is a sample YAML file that shows you how to configure a load balancing algorithm, enable auto-capacity draining, and set a custom failover threshold:

      name: projects/PROJECT_ID/locations/global/serviceLbPolicies/SERVICE_LB_POLICY_NAME
      autoCapacityDrain:
          enable: True
      failoverConfig:
          failoverHealthThreshold: FAILOVER_THRESHOLD_VALUE
      loadBalancingAlgorithm: LOAD_BALANCING_ALGORITHM
      

      Replace the following:

      • PROJECT_ID: the project ID.
      • SERVICE_LB_POLICY_NAME: the name of the service load balancing policy.
      • FAILOVER_THRESHOLD_VALUE: the failover threshold value. This should be a number between 1 and 99.
      • LOAD_BALANCING_ALGORITHM: the load balancing algorithm to use: SPRAY_TO_REGION, WATERFALL_BY_REGION, or WATERFALL_BY_ZONE.

      After you create the YAML file, import the file to a new service load balancing policy.

      gcloud network-services service-lb-policies import SERVICE_LB_POLICY_NAME \
       --source=PATH_TO_POLICY_FILE \
       --location=global
      
    • Without a YAML file. Alternatively, you can configure service load balancing policy features without using a YAML file.

      To set the load balancing algorithm, enable auto-capacity draining, and set the failover threshold, use the following parameters:

      gcloud network-services service-lb-policies create SERVICE_LB_POLICY_NAME \
       --load-balancing-algorithm=LOAD_BALANCING_ALGORITHM \
       --auto-capacity-drain \
       --failover-health-threshold=FAILOVER_THRESHOLD_VALUE \
       --location=global
      

      Replace the following:

      • SERVICE_LB_POLICY_NAME: the name of the service load balancing policy.
      • LOAD_BALANCING_ALGORITHM: the load balancing algorithm to use: SPRAY_TO_REGION, WATERFALL_BY_REGION, or WATERFALL_BY_ZONE.
      • FAILOVER_THRESHOLD_VALUE: the failover threshold value. This should be a number between 1 and 99.
  2. Update a backend service so that its --service-lb-policy field references the newly created service load balancing policy resource. A backend service can only be associated with one service load balancing policy resource.

    gcloud compute backend-services update BACKEND_SERVICE_NAME \
      --service-lb-policy=SERVICE_LB_POLICY_NAME \
      --global
    

    Alternatively, you can associate a service load balancing policy with a backend service when you create the backend service.

    gcloud compute backend-services create BACKEND_SERVICE_NAME \
        --protocol=PROTOCOL \
        --port-name=NAMED_PORT_NAME \
        --health-checks=HEALTH_CHECK_NAME \
        --load-balancing-scheme=LOAD_BALANCING_SCHEME \
        --service-lb-policy=SERVICE_LB_POLICY_NAME \
        --global
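
    To verify the association, you can describe both resources. This is a sketch; the assumption is that the attached policy appears in the backend service's serviceLbPolicy field, matching the --service-lb-policy flag used earlier:

    gcloud network-services service-lb-policies describe SERVICE_LB_POLICY_NAME \
        --location=global

    gcloud compute backend-services describe BACKEND_SERVICE_NAME \
        --global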
    

Remove a policy

To remove a service load balancing policy from a backend service, use the following command:

gcloud compute backend-services update BACKEND_SERVICE_NAME \
    --no-service-lb-policy \
    --global

Set preferred backends

You can configure preferred backends by using either the Google Cloud CLI or the API.

gcloud

Add a preferred backend

To set a preferred backend, use the gcloud compute backend-services add-backend command with the --preference flag when you add the backend to the backend service.

gcloud compute backend-services add-backend BACKEND_SERVICE_NAME \
    ...
    --preference=PREFERENCE \
    --global

Replace PREFERENCE with the level of preference you want to assign to the backend. This can be either PREFERRED or DEFAULT.

The rest of the command depends on the type of backend you're using (instance group or NEG). For all the required parameters, see the gcloud compute backend-services add-backend command.
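
For example, a minimal sketch for a zonal instance group backend looks like the following. The instance group name and zone are placeholders, and additional flags (such as the balancing mode and capacity settings) depend on your configuration:

gcloud compute backend-services add-backend BACKEND_SERVICE_NAME \
    --instance-group=INSTANCE_GROUP_NAME \
    --instance-group-zone=INSTANCE_GROUP_ZONE \
    --preference=PREFERRED \
    --global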

Update a backend's preference

To update a backend's --preference parameter, use the gcloud compute backend-services update-backend command.

gcloud compute backend-services update-backend BACKEND_SERVICE_NAME \
    ...
    --preference=PREFERENCE \
    --global

The rest of the command depends on the type of backend you're using (instance group or NEG). The following example command updates a backend instance group's preference and sets it to PREFERRED:

gcloud compute backend-services update-backend BACKEND_SERVICE_NAME \
    --instance-group=INSTANCE_GROUP_NAME \
    --instance-group-zone=INSTANCE_GROUP_ZONE \
    --preference=PREFERRED \
    --global

API

To set a preferred backend, set the preference field on each backend by using the global backendServices resource.

Here is a sample that shows you how to configure the backend preference:

  name: projects/PROJECT_ID/locations/global/backendServices/BACKEND_SERVICE_NAME
  ...
  backends:
  - name: BACKEND_1_NAME
    preference: PREFERRED
    ...
  - name: BACKEND_2_NAME
    preference: DEFAULT
    ...

Replace the following:

  • PROJECT_ID: the project ID
  • BACKEND_SERVICE_NAME: the name of the backend service
  • BACKEND_1_NAME: the name of the preferred backend
  • BACKEND_2_NAME: the name of the default backend

Troubleshooting

Traffic distribution patterns can change when you attach a new service load balancing policy to a backend service.

To debug traffic issues, use Cloud Monitoring to look at how traffic flows between the load balancer and the backend. Cloud Load Balancing logs and metrics can also help you understand load balancing behavior.

This section summarizes a few common scenarios that you might see after you configure a service load balancing policy.

Traffic from a single source is sent to too many distinct backends

This is the intended behavior of the SPRAY_TO_REGION algorithm. However, you might experience issues caused by wider distribution of your traffic. For example, cache hit rates might decrease because backends see traffic from a wider selection of clients. In this case, consider using other algorithms like WATERFALL_BY_REGION.

Traffic is not being sent to backends with lots of unhealthy endpoints

This is the intended behavior when autoCapacityDrain is enabled. Backends with a lot of unhealthy endpoints are drained and removed from the load balancing pool. If you don't want this behavior, you can disable auto-capacity draining. However, this means that traffic can be sent to backends with a lot of unhealthy endpoints and requests can fail.

Traffic is being sent to more distant backends before closer ones

This is the intended behavior if your preferred backends are further away than your default backends. If you don't want this behavior, update the preference settings for each backend accordingly.

Traffic is not being sent to some backends when using preferred backends

This is the intended behavior when your preferred backends have not yet reached capacity. The preferred backends are assigned first based on round-trip time latency to these backends.

If you want traffic sent to other backends, you can do one of the following:

  • Update preference settings for the other backends.
  • Set a lower target capacity for your preferred backends. The target capacity is configured by using either the max-rate or the max-utilization fields, depending on the backend service's balancing mode. For an example, see the sketch after this list.
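
For example, here is a minimal sketch that lowers a preferred instance group backend's target capacity by reducing its maximum utilization. The 0.6 value and the resource names are illustrative, and whether you use max-rate or max-utilization depends on your balancing mode:

gcloud compute backend-services update-backend BACKEND_SERVICE_NAME \
    --instance-group=INSTANCE_GROUP_NAME \
    --instance-group-zone=INSTANCE_GROUP_ZONE \
    --max-utilization=0.6 \
    --global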

Traffic is being sent to a remote backend during transient health changes

This is the intended behavior when the failover threshold is set to a high value. If you want traffic to keep going to the primary backends when there are transient health changes, set this field to a lower value.

Healthy endpoints are overloaded when other endpoints are unhealthy

This is the intended behavior when the failover threshold is set to a low value. When endpoints are unhealthy, the traffic intended for these unhealthy endpoints is instead spread among the remaining endpoints in the same backend. If you want the failover behavior to be triggered sooner, set this field to a higher value.

Limitations

  • Each backend service can only be associated with a single service load balancing policy resource.