Estimate and control costs

This page describes best practices for estimating and controlling costs in BigQuery.

The primary costs in BigQuery are compute, which is used for query processing, and storage, which is used for data loaded into BigQuery. BigQuery offers two pricing models for query processing: on-demand and capacity-based pricing. Each model calls for different cost-control practices. For data stored in BigQuery, costs depend on the storage billing model configured for each dataset.

Control query costs

To control the costs of individual queries, we recommend that you first follow best practices for optimizing query computation and optimizing storage.

The following sections outline additional best practices that you can use to further control your query costs.

Check the estimated cost before running a query

Best practice: Before running queries, preview them to estimate costs.

When you use the on-demand pricing model, queries are billed according to the number of bytes read. To estimate costs before running a query, use one of the following methods:

Use the query validator

When you enter a query in the Google Cloud console, the query validator verifies the query syntax and provides an estimate of the number of bytes read. You can use this estimate to calculate query cost in the pricing calculator.

  • If your query is not valid, then the query validator displays an error message. For example:

    Not found: Table myProject:myDataset.myTable was not found in location US

  • If your query is valid, then the query validator provides an estimate of the number of bytes required to process the query. For example:

    This query will process 623.1 KiB when run.
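You can turn the validator's byte estimate into a rough dollar figure. The following sketch is illustrative only: the per-TiB rate is a placeholder, so look up the current on-demand price for your region in the pricing calculator or on the pricing page.

```python
# Convert an estimated number of bytes read into an approximate on-demand cost.
# HYPOTHETICAL_USD_PER_TIB is a placeholder rate; check the current
# on-demand price for your region before relying on the result.
HYPOTHETICAL_USD_PER_TIB = 6.25

def estimate_on_demand_cost(bytes_processed: float,
                            usd_per_tib: float = HYPOTHETICAL_USD_PER_TIB) -> float:
    """Return the estimated query cost in USD for the given bytes read."""
    return (bytes_processed / 2**40) * usd_per_tib

# A query that reads exactly 1 TiB costs the per-TiB rate.
print(estimate_on_demand_cost(2**40))  # → 6.25
```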

Perform a dry run

To perform a dry run, do the following:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. Enter your query in the query editor.

    If the query is valid, then a check mark automatically appears along with the amount of data that the query will process. If the query is invalid, then an exclamation point appears along with an error message.

bq

Enter a query like the following using the --dry_run flag.

bq query \
--use_legacy_sql=false \
--dry_run \
'SELECT
   COUNTRY,
   AIRPORT,
   IATA
 FROM
   `project_id`.dataset.airports
 LIMIT
   1000'
 

For a valid query, the command produces the following response:

Query successfully validated. Assuming the tables are not modified,
running this query will process 10918 bytes of data.

API

To perform a dry run by using the API, submit a query job with dryRun set to true in the JobConfiguration type.
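As a sketch, a jobs.insert request body with a dry run enabled looks like the following. The field names follow the JobConfiguration and JobConfigurationQuery REST types; the query itself is only an example.

```python
# Illustrative jobs.insert request body with dryRun set in the
# JobConfiguration. With dryRun enabled, no job is executed and
# nothing is billed; the response reports the estimated bytes processed.
job_body = {
    "configuration": {
        "dryRun": True,
        "query": {
            "query": (
                "SELECT name FROM "
                "`bigquery-public-data.usa_names.usa_1910_2013` LIMIT 10"
            ),
            "useLegacySql": False,
        },
    }
}
```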

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import (
	"context"
	"fmt"
	"io"

	"cloud.google.com/go/bigquery"
)

// queryDryRun demonstrates issuing a dry run query to validate query structure and
// provide an estimate of the bytes scanned.
func queryDryRun(w io.Writer, projectID string) error {
	// projectID := "my-project-id"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	q := client.Query(`
	SELECT
		name,
		COUNT(*) as name_count
	FROM ` + "`bigquery-public-data.usa_names.usa_1910_2013`" + `
	WHERE state = 'WA'
	GROUP BY name`)
	q.DryRun = true
	// Location must match that of the dataset(s) referenced in the query.
	q.Location = "US"

	job, err := q.Run(ctx)
	if err != nil {
		return err
	}
	// Dry run is not asynchronous, so get the latest status and statistics.
	status := job.LastStatus()
	if err := status.Err(); err != nil {
		return err
	}
	fmt.Fprintf(w, "This query will process %d bytes\n", status.Statistics.TotalBytesProcessed)
	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.JobStatistics;
import com.google.cloud.bigquery.QueryJobConfiguration;

// Sample to run dry query on the table
public class QueryDryRun {

  public static void runQueryDryRun() {
    String query =
        "SELECT name, COUNT(*) as name_count "
            + "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
            + "WHERE state = 'WA' "
            + "GROUP BY name";
    queryDryRun(query);
  }

  public static void queryDryRun(String query) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      QueryJobConfiguration queryConfig =
          QueryJobConfiguration.newBuilder(query).setDryRun(true).setUseQueryCache(false).build();

      Job job = bigquery.create(JobInfo.of(queryConfig));
      JobStatistics.QueryStatistics statistics = job.getStatistics();

      System.out.println(
          "Query dry run performed successfully. Total bytes processed: "
              + statistics.getTotalBytesProcessed());
    } catch (BigQueryException e) {
      System.out.println("Query not performed \n" + e.toString());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

// Import the Google Cloud client library
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function queryDryRun() {
  // Runs a dry query of the U.S. given names dataset for the state of Texas.

  const query = `SELECT name
    FROM \`bigquery-public-data.usa_names.usa_1910_2013\`
    WHERE state = 'TX'
    LIMIT 100`;

  // For all options, see https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query
  const options = {
    query: query,
    // Location must match that of the dataset(s) referenced in the query.
    location: 'US',
    dryRun: true,
  };

  // Run the query as a job
  const [job] = await bigquery.createQueryJob(options);

  // Print the status and statistics
  console.log('Status:');
  console.log(job.metadata.status);
  console.log('\nJob Statistics:');
  console.log(job.metadata.statistics);
}

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

use Google\Cloud\BigQuery\BigQueryClient;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $query = 'SELECT id, view_count FROM `bigquery-public-data.stackoverflow.posts_questions`';

// Construct a BigQuery client object.
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);

// Set job configs
$jobConfig = $bigQuery->query($query);
$jobConfig->useQueryCache(false);
$jobConfig->dryRun(true);

// Extract query results
$queryJob = $bigQuery->startJob($jobConfig);
$info = $queryJob->info();

printf('This query will process %s bytes' . PHP_EOL, $info['statistics']['totalBytesProcessed']);

Python

Set the QueryJobConfig.dry_run property to True. Client.query() always returns a completed QueryJob when provided a dry run query configuration.

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Start the query, passing in the extra configuration.
query_job = client.query(
    (
        "SELECT name, COUNT(*) as name_count "
        "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
        "WHERE state = 'WA' "
        "GROUP BY name"
    ),
    job_config=job_config,
)  # Make an API request.

# A dry run query completes immediately.
print("This query will process {} bytes.".format(query_job.total_bytes_processed))

Estimate query costs

When using the on-demand pricing model, you can estimate the cost of running a query by calculating the number of bytes processed.

On-demand query size calculation

To calculate the number of bytes processed by the various types of queries, see the on-demand query size calculation rules in the BigQuery pricing documentation.

Avoid running queries to explore table data

Best practice: Don't run queries to explore or preview table data.

If you are experimenting with or exploring your data, you can use table preview options to view data at no charge and without affecting quotas.

BigQuery supports the following data preview options:

  • In the Google Cloud console, on the table details page, click the Preview tab to sample the data.
  • In the bq command-line tool, use the bq head command and specify the number of rows to preview.
  • In the API, use tabledata.list to retrieve table data from a specified set of rows.
  • Avoid using a LIMIT clause to preview data in non-clustered tables. For non-clustered tables, a LIMIT clause doesn't reduce the number of bytes read, so it doesn't reduce compute costs.
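The tabledata.list option is available in the Python client library through list_rows, which reads rows directly without starting a query job. The following is a minimal sketch that assumes the google-cloud-bigquery package is installed and Application Default Credentials are configured; the import is deferred into the function so the snippet stays self-contained.

```python
def preview_rows(table_id: str, num_rows: int = 5) -> None:
    """Print sample rows using tabledata.list, which is free and doesn't
    create a query job."""
    # Assumes the google-cloud-bigquery package is installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    for row in client.list_rows(table_id, max_results=num_rows):
        print(dict(row))
```

For example, `preview_rows("bigquery-public-data.usa_names.usa_1910_2013")` prints five rows without incurring query charges.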

Restrict the number of bytes billed per query

Best practice: Use the maximum bytes billed setting to limit query costs when using the on-demand pricing model.

You can limit the number of bytes billed for a query by using the maximum bytes billed setting. When you set maximum bytes billed, BigQuery estimates the number of bytes that the query reads before the query executes. If the estimate exceeds the limit, then the query fails without incurring a charge.

For clustered tables, the estimate of the number of bytes billed is an upper bound and can be higher than the actual number of bytes billed after the query runs. As a result, if you set maximum bytes billed, a query on a clustered table can fail even though the actual bytes billed wouldn't exceed the limit.

If a query fails because of the maximum bytes billed setting, an error similar to the following is returned:

Error: Query exceeded limit for bytes billed: 1000000. 10485760 or higher required.

To set the maximum bytes billed:

Console

  1. In the Query editor, click More > Query settings > Advanced options.
  2. In the Maximum bytes billed field, enter an integer.
  3. Click Save.

bq

Use the bq query command with the --maximum_bytes_billed flag.

  bq query --maximum_bytes_billed=1000000 \
  --use_legacy_sql=false \
  'SELECT
     word
   FROM
     `bigquery-public-data`.samples.shakespeare'

API

Set the maximumBytesBilled property in JobConfigurationQuery or QueryRequest.
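With the Python client library, the same limit can be set through QueryJobConfig. The following sketch assumes the google-cloud-bigquery package is installed; the 10 MiB default matches the example error message above.

```python
def query_with_byte_cap(sql: str, max_bytes: int = 10 * 1024**2):
    """Run a query that fails, without charge, if the estimated bytes read
    exceed max_bytes."""
    # Assumes the google-cloud-bigquery package is installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(maximum_bytes_billed=max_bytes)
    return client.query(sql, job_config=job_config).result()
```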

Avoid using LIMIT in non-clustered tables

Best practice: For non-clustered tables, don't use a LIMIT clause as a method of cost control.

For non-clustered tables, applying a LIMIT clause to a query doesn't affect the amount of data that is read. You are billed for reading all bytes in the entire table as indicated by the query, even though the query returns only a subset. With a clustered table, a LIMIT clause can reduce the number of bytes scanned, because scanning stops when enough blocks are scanned to get the result. You are billed for only the bytes that are scanned.

Materialize query results in stages

Best practice: If possible, materialize your query results in stages.

If you create a large, multi-stage query, each time you run it, BigQuery reads all the data that is required by the query. You are billed for all the data that is read each time the query is run.

Instead, break your query into stages where each stage materializes the query results by writing them to a destination table. Querying the smaller destination table reduces the amount of data that is read and lowers costs. The cost of storing the materialized results is much less than the cost of processing large amounts of data.
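One way to materialize a stage with the Python client library is to give the query job a destination table, as in the following sketch. The project, dataset, and table names are placeholders, and the snippet assumes the google-cloud-bigquery package is installed.

```python
def materialize_stage(sql: str, destination_table: str) -> None:
    """Write query results to a destination table so that later stages query
    the smaller materialized result instead of rescanning the source data."""
    # Assumes the google-cloud-bigquery package is installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        destination=destination_table,  # e.g. "my_project.my_dataset.stage1"
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.query(sql, job_config=job_config).result()
```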

Control workload costs

This section describes best practices for controlling costs within a workload. A workload is a set of related queries. For example, a workload can be a data transformation pipeline that runs daily, a set of dashboards run by a group of business analysts, or several ad-hoc queries run by a set of data scientists.

Use the Google Cloud pricing calculator

Best practice: Use the Google Cloud pricing calculator to create an overall monthly cost estimate for BigQuery based on projected usage. You can then compare this estimate to your actual costs to identify areas for optimization.

On-demand

To estimate costs in the Google Cloud pricing calculator when using the on-demand pricing model, follow these steps:

  1. Open the Google Cloud pricing calculator.
  2. Click Add to estimate.
  3. Select BigQuery.
  4. Select "On-demand" for Service type.
  5. Choose the location where your queries will run.
  6. For Amount of data queried, enter the estimated bytes read from your dry run or the query validator.
  7. Enter your estimated storage usage for Active storage, Long-term storage, Streaming inserts, and Streaming reads. You only need to estimate either physical storage or logical storage, depending on the dataset storage billing model.
  8. The estimate appears in the Cost details panel. For more information about the estimated cost, click Open detailed view. You can also download and share the cost estimate.

For more information, see On-demand pricing.

Editions

To estimate costs in the Google Cloud pricing calculator when using the capacity-based pricing model with BigQuery editions, follow these steps:

  1. Open the Google Cloud pricing calculator.
  2. Click Add to estimate.
  3. Select BigQuery.
  4. Select "Editions" for Service type.
  5. Choose the location where the slots are used.
  6. Choose your Edition.
  7. Choose the Maximum slots, Baseline slots, optional Commitment, and Estimated utilization of autoscaling.
  8. Choose the location where the data is stored.
  9. Enter your estimated storage usage for Active storage, Long-term storage, Streaming inserts, and Streaming reads. You only need to estimate either physical storage or logical storage, depending on the dataset storage billing model.
  10. The estimate appears in the Cost details panel. For more information about the estimated cost, click Open detailed view. You can also download and share the cost estimate.

For more information, see Capacity-based pricing.

Use reservations and commitments

Best practice: Use BigQuery reservations and commitments to control costs.

When you use the on-demand billing model, the only way to restrict costs is to configure project-level or user-level daily quotas. However, these quotas enforce a hard cap that prevents users from running queries beyond the quota limit.

With slot reservations, you specify the maximum number of slots that are available to specific workloads. You can also purchase slot commitments that provide discounted prices for a committed period of time. For more information, see Introduction to BigQuery reservations.

Use the slot estimator

Best practice: Use slot estimator to estimate the number of slots required for your workloads.

The BigQuery slot estimator helps you to manage slot capacity based on historical performance metrics.

In addition, if you use the on-demand pricing model, the slot estimator provides sizing recommendations for commitments and autoscaling reservations that would deliver similar performance if you move to capacity-based pricing.

View costs using a dashboard

Best practice: Create a dashboard to analyze your Cloud Billing data so you can monitor and make adjustments to your BigQuery usage.

You can export your billing data to BigQuery and visualize it in a tool such as Looker Studio. For a tutorial about creating a billing dashboard, see Visualize Google Cloud billing using BigQuery and Looker Studio.

Use billing budgets and alerts

Best practice: Use Cloud Billing budgets to monitor your BigQuery charges in one place.

Cloud Billing budgets let you track your actual costs against your planned costs. After you've set a budget amount, you set budget alert threshold rules that are used to trigger email notifications. Budget alert emails help you stay informed about how your BigQuery spend is tracking against your budget.

Create custom query quotas

Best practice: Use custom daily query quotas to limit the amount of data processed per day.

You can manage costs by setting a custom quota that limits the amount of data processed per day, per project or per user. After the quota is reached, users can't run queries until the daily quota resets.
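Because on-demand billing is per byte read, a daily quota implies a worst-case daily spend. The following sketch is a hypothetical calculation: the per-TiB rate is a placeholder, so use the on-demand price for your region.

```python
# A daily quota expressed in TiB processed implies a worst-case daily
# on-demand spend. The rate below is a placeholder; check the current
# on-demand price for your region.
HYPOTHETICAL_USD_PER_TIB = 6.25

def max_daily_spend(quota_tib_per_day: float) -> float:
    """Return the maximum possible daily on-demand query cost under a quota."""
    return quota_tib_per_day * HYPOTHETICAL_USD_PER_TIB

print(max_daily_spend(10))  # → 62.5
```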

Control storage costs

Use these best practices for optimizing the cost of BigQuery storage. You can also optimize storage for query performance.

Use long-term storage

Best practice: Use long-term storage pricing to reduce cost of older data.

When you load data into BigQuery storage, the data is subject to BigQuery storage pricing. For older data, you can automatically take advantage of BigQuery long-term storage pricing.

If you have a table that is not modified for 90 consecutive days, the price of storage for that table automatically drops by 50 percent. If you have a partitioned table, each partition is considered separately for eligibility for long-term pricing, subject to the same rules as non-partitioned tables.
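The 50 percent discount can be sketched as simple arithmetic. The per-GiB monthly rate below is a placeholder; use the storage rates for your region.

```python
# Placeholder monthly rate; check current storage pricing for your region.
ACTIVE_USD_PER_GIB_MONTH = 0.02

def monthly_storage_cost(gib: float, long_term: bool) -> float:
    """Long-term storage (table unmodified for 90 days) is billed at half
    the active storage rate."""
    rate = ACTIVE_USD_PER_GIB_MONTH / 2 if long_term else ACTIVE_USD_PER_GIB_MONTH
    return gib * rate

print(monthly_storage_cost(1000, long_term=False))  # → 20.0
print(monthly_storage_cost(1000, long_term=True))   # → 10.0
```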

Configure the storage billing model

Best practice: Optimize the storage billing model based on your usage patterns.

BigQuery supports storage billing using logical (uncompressed) or physical (compressed) bytes, or a combination of both. The storage billing model configured for each dataset determines your storage pricing, but it does not impact query performance.

You can use the INFORMATION_SCHEMA views to determine the storage billing model that works best based on your usage patterns.
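For example, you can compare logical and physical bytes per table with a query over the INFORMATION_SCHEMA.TABLE_STORAGE view. The following is a sketch: the region qualifier is a placeholder, and any cost comparison built on top of it should use the rates from the pricing page.

```python
# SQL over INFORMATION_SCHEMA.TABLE_STORAGE comparing logical (uncompressed)
# and physical (compressed) bytes per table. Replace region-us with your
# dataset's region.
COMPARE_STORAGE_SQL = """
SELECT
  table_schema,
  table_name,
  total_logical_bytes,
  total_physical_bytes,
  SAFE_DIVIDE(total_logical_bytes, total_physical_bytes) AS compression_ratio
FROM
  `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY
  total_logical_bytes DESC
"""
```

A high compression ratio suggests that the physical (compressed) billing model may be cheaper for that dataset.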

Avoid overwriting tables

Best practice: When you are using the physical storage billing model, avoid repeatedly overwriting tables.

When you overwrite a table, for example by using the --replace flag in batch load jobs or the TRUNCATE TABLE SQL statement, the replaced data is kept for the duration of the time travel and fail-safe windows. If you overwrite a table frequently, you incur additional storage charges.

Instead, load data into a table incrementally by using the WRITE_APPEND disposition in load jobs, the MERGE SQL statement, or the Storage Write API.

Reduce the time travel window

Best practice: Based on your requirements, you can lower the time travel window.

Reducing the time travel window from the default value of seven days reduces the retention period for data deleted from or changed in a table. You are billed for time travel storage only when using the physical (compressed) storage billing model.

The time travel window is set at the dataset level. You can also set the default time travel window for new datasets by using default configurations.
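A dataset's time travel window can be changed with an ALTER SCHEMA statement. The following sketch sets a hypothetical 48-hour window on a placeholder dataset; valid values are multiples of 24 hours between 48 and 168.

```python
# ALTER SCHEMA statement that lowers a dataset's time travel window from the
# 168-hour (7-day) default to 48 hours. The dataset name is a placeholder.
SET_TIME_TRAVEL_SQL = """
ALTER SCHEMA my_dataset
SET OPTIONS (max_time_travel_hours = 48)
"""
```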

Use table expiration for destination tables

Best practice: If you are writing large query results to a destination table, use the default table expiration time to remove the data when it's no longer needed.

Keeping large result sets in BigQuery storage has a cost. If you don't need permanent access to the results, use the default table expiration to automatically delete the data for you.
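With the Python client library, a dataset-level default table expiration can be set as in the following sketch. The 7-day value and dataset name are placeholders, and the snippet assumes the google-cloud-bigquery package is installed.

```python
def set_default_table_expiration(dataset_id: str, days: int = 7) -> None:
    """Set a dataset-level default so that new tables in the dataset are
    deleted automatically after the given number of days."""
    # Assumes the google-cloud-bigquery package is installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    dataset = client.get_dataset(dataset_id)  # e.g. "my_project.my_dataset"
    dataset.default_table_expiration_ms = days * 24 * 60 * 60 * 1000
    client.update_dataset(dataset, ["default_table_expiration_ms"])
```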

Archive data to Cloud Storage

Best practice: Consider archiving data in Cloud Storage.

You can move data from BigQuery to Cloud Storage based on the business need for archival. As a best practice, consider long-term storage pricing and the physical storage billing model before exporting data out of BigQuery.

What's next