Create datasets

This document describes how to create datasets in BigQuery.

You can create datasets in the following ways:

  • Using the Google Cloud console.
  • Using a SQL query.
  • Using the bq mk command in the bq command-line tool.
  • Calling the datasets.insert API method.
  • Using the client libraries.
  • Copying an existing dataset.

To see steps for copying a dataset, including across regions, see Copying datasets.

This document describes how to work with regular datasets that store data in BigQuery. To learn how to work with Spanner external datasets see Create Spanner external datasets. To learn how to work with AWS Glue federated datasets see Create AWS Glue federated datasets.

To learn to query tables in a public dataset, see Query a public dataset with the Google Cloud console.

Dataset limitations

BigQuery datasets are subject to the following limitations:

  • The dataset location can only be set at creation time. After a dataset is created, its location cannot be changed.
  • All tables that are referenced in a query must be stored in datasets in the same location.
  • External datasets don't support table expiration, replicas, time travel, default collation, default rounding mode or the option to enable or disable case insensitive tables name.

  • When you copy a table, the datasets that contain the source table and destination table must reside in the same location.

  • Dataset names must be unique for each project.

  • If you change a dataset's storage billing model, you must wait 14 days before you can change the storage billing model again.

  • You can't enroll a dataset in physical storage billing if you have any existing legacy flat-rate slot commitments located in the same region as the dataset.

Before you begin

Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document.

Required permissions

To create a dataset, you need the bigquery.datasets.create IAM permission.

Each of the following predefined IAM roles includes the permissions that you need in order to create a dataset:

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.user
  • roles/bigquery.admin

For more information about IAM roles in BigQuery, see Predefined roles and permissions.

Name datasets

When you create a dataset in BigQuery, the dataset name must be unique for each project. The dataset name can contain the following:

  • Up to 1,024 characters.
  • Letters (uppercase or lowercase), numbers, and underscores.

Dataset names are case-sensitive by default. mydataset and MyDataset can coexist in the same project, unless one of them has case-sensitivity turned off.

Dataset names cannot contain spaces or special characters such as -, &, @, or %.

Hidden datasets

A hidden dataset is a dataset whose name begins with an underscore. You can query tables and views in hidden datasets the same way you would in any other dataset. Hidden datasets have the following restrictions:

  • They are hidden from the Explorer panel in the Google Cloud console.
  • They don't appear in any INFORMATION_SCHEMA views.
  • They can't be used with linked datasets.
  • They don't appear in Data Catalog.

Create datasets

To create a dataset:

Console

  1. Open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the Explorer panel, select the project where you want to create the dataset.

  3. Expand the Actions option and click Create dataset:

    Use the action menu of the project to create a dataset.

  4. On the Create dataset page:

    • For Dataset ID, enter a unique dataset name.
    • For Location type, choose a geographic location for the dataset. After a dataset is created, the location can't be changed.

    • Optional: If you want tables in this dataset to expire, select Enable table expiration , then specify the Default maximum table age in days.

    • Optional: If you want to use a customer-managed encryption key (CMEK), expand Advanced options, then select Customer-managed encryption key (CMEK).

    • Optional: If you want to use case-insensitive table names, expand Advanced options, then select Enable case insensitive table names.

    • Optional: If you want to use a default collation, expand Advanced options, select Enable default collation, then select the Default Collation to use.

    • Optional: If you want to use a default rounding mode, expand Advanced options, then select the Default Rounding Mode to use.

    • Optional: If you want to enable the physical storage billing model, expand Advanced options, then select Enable physical storage billing model.

      When you change a dataset's billing model, it takes 24 hours for the change to take effect.

      Once you change a dataset's storage billing model, you must wait 14 days before you can change the storage billing model again.

    • Optional: If you want to set the dataset's time travel window, expand Advanced options, then select the Time travel window to use.

    • Click Create dataset.

SQL

Use the CREATE SCHEMA statement.

To create a dataset in a project other than your default project, add the project ID to the dataset ID in the following format: PROJECT_ID.DATASET_ID.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE SCHEMA PROJECT_ID.DATASET_ID
      OPTIONS (
        default_kms_key_name = 'KMS_KEY_NAME',
        default_partition_expiration_days = PARTITION_EXPIRATION,
        default_table_expiration_days = TABLE_EXPIRATION,
        description = 'DESCRIPTION',
        labels = [('KEY_1','VALUE_1'),('KEY_2','VALUE_2')],
        location = 'LOCATION',
        max_time_travel_hours = HOURS,
        storage_billing_model = BILLING_MODEL);

    Replace the following:

    • PROJECT_ID: your project ID
    • DATASET_ID: the ID of the dataset that you're creating
    • KMS_KEY_NAME: the name of the default Cloud Key Management Service key used to protect newly created tables in this dataset unless a different key is supplied at the time of creation. You cannot create a Google-encrypted table in a dataset with this parameter set.
    • PARTITION_EXPIRATION: the default lifetime (in days) for partitions in newly created partitioned tables. The default partition expiration has no minimum value. The expiration time evaluates to the partition's date plus the integer value. Any partition created in a partitioned table in the dataset is deleted PARTITION_EXPIRATION days after the partition's date. If you supply the time_partitioning_expiration option when you create or update a partitioned table, the table-level partition expiration takes precedence over the dataset-level default partition expiration.
    • TABLE_EXPIRATION: the default lifetime (in days) for newly created tables. The minimum value is 0.042 days (one hour). The expiration time evaluates to the current time plus the integer value. Any table created in the dataset is deleted TABLE_EXPIRATION days after its creation time. This value is applied if you do not set a table expiration when you create the table.
    • DESCRIPTION: a description of the dataset
    • KEY_1:VALUE_1: the key-value pair that you want to set as the first label on this dataset
    • KEY_2:VALUE_2: the key-value pair that you want to set as the second label
    • LOCATION: the dataset's location. After a dataset is created, the location can't be changed.
    • HOURS: the duration in hours of the time travel window for the new dataset. The HOURS value must be an integer expressed in multiples of 24 (48, 72, 96, 120, 144, 168) between 48 (2 days) and 168 (7 days). 168 hours is the default if this option isn't specified.
    • BILLING_MODEL: sets the storage billing model for the dataset. Set the BILLING_MODEL value to PHYSICAL to use physical bytes when calculating storage charges, or to LOGICAL to use logical bytes. LOGICAL is the default.

      When you change a dataset's billing model, it takes 24 hours for the change to take effect.

      Once you change a dataset's storage billing model, you must wait 14 days before you can change the storage billing model again.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.

bq

To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.

To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_ID.

bq --location=LOCATION mk \
    --dataset \
    --default_kms_key=KMS_KEY_NAME \
    --default_partition_expiration=PARTITION_EXPIRATION \
    --default_table_expiration=TABLE_EXPIRATION \
    --description="DESCRIPTION" \
    --label=KEY_1:VALUE_1 \
    --label=KEY_2:VALUE_2 \
    --add_tags=KEY_3:VALUE_3[,...] \
    --max_time_travel_hours=HOURS \
    --storage_billing_model=BILLING_MODEL \
    PROJECT_ID:DATASET_ID

Replace the following:

  • LOCATION: the dataset's location. After a dataset is created, the location can't be changed. You can set a default value for the location by using the .bigqueryrc file.

  • KMS_KEY_NAME: the name of the default Cloud Key Management Service key used to protect newly created tables in this dataset unless a different key is supplied at the time of creation. You cannot create a Google-encrypted table in a dataset with this parameter set.

  • PARTITION_EXPIRATION: the default lifetime (in seconds) for partitions in newly created partitioned tables. The default partition expiration has no minimum value. The expiration time evaluates to the partition's date plus the integer value. Any partition created in a partitioned table in the dataset is deleted PARTITION_EXPIRATION seconds after the partition's date. If you supply the --time_partitioning_expiration flag when you create or update a partitioned table, the table-level partition expiration takes precedence over the dataset-level default partition expiration.

  • TABLE_EXPIRATION: the default lifetime (in seconds) for newly created tables. The minimum value is 3600 seconds (one hour). The expiration time evaluates to the current time plus the integer value. Any table created in the dataset is deleted TABLE_EXPIRATION seconds after its creation time. This value is applied if you don't set a table expiration when you create the table.

  • DESCRIPTION: a description of the dataset

  • KEY_1:VALUE_1: the key-value pair that you want to set as the first label on this dataset, and KEY_2:VALUE_2 is the key-value pair that you want to set as the second label.

  • KEY_3:VALUE_3: the key-value pair that you want to set as a tag on the dataset. Add multiple tags under the same flag with commas between key:value pairs.

  • HOURS: the duration in hours of the time travel window for the new dataset. The HOURS value must be an integer expressed in multiples of 24 (48, 72, 96, 120, 144, 168) between 48 (2 days) and 168 (7 days). 168 hours is the default if this option isn't specified.

  • BILLING_MODEL: sets the storage billing model for the dataset. Set the BILLING_MODEL value to PHYSICAL to use physical bytes when calculating storage charges, or to LOGICAL to use logical bytes. LOGICAL is the default.

    When you change a dataset's billing model, it takes 24 hours for the change to take effect.

    Once you change a dataset's storage billing model, you must wait 14 days before you can change the storage billing model again.

  • PROJECT_ID: your project ID.

  • DATASET_ID is the ID of the dataset that you're creating.

For example, the following command creates a dataset named mydataset with data location set to US, a default table expiration of 3600 seconds (1 hour), and a description of This is my dataset. Instead of using the --dataset flag, the command uses the -d shortcut. If you omit -d and --dataset, the command defaults to creating a dataset.

bq --location=US mk -d \
    --default_table_expiration 3600 \
    --description "This is my dataset." \
    mydataset

To confirm that the dataset was created, enter the bq ls command. Also, you can create a table when you create a new dataset using the following format: bq mk -t dataset.table. For more information about creating tables, see Creating a table.

Terraform

Use the google_bigquery_dataset resource.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

Create a dataset

The following example creates a dataset named mydataset:

resource "google_bigquery_dataset" "default" {
  dataset_id                      = "mydataset"
  default_partition_expiration_ms = 2592000000  # 30 days
  default_table_expiration_ms     = 31536000000 # 365 days
  description                     = "dataset description"
  location                        = "US"
  max_time_travel_hours           = 96 # 4 days

  labels = {
    billing_group = "accounting",
    pii           = "sensitive"
  }
}

When you create a dataset using the google_bigquery_dataset resource, it automatically grants access to the dataset to all accounts that are members of project-level basic roles. If you run the terraform show command after creating the dataset, the access block for the dataset looks similar to the following:

Access block for a dataset created by using Terraform.

To grant access to the dataset, we recommend that you use one of the google_bigquery_iam resources, as shown in the following example, unless you plan to create authorized objects, such as authorized views, within the dataset. In that case, use the google_bigquery_dataset_access resource. Refer to that documentation for examples.

Create a dataset and grant access to it

The following example creates a dataset named mydataset, then uses the google_bigquery_dataset_iam_policy resource to grant access to it.

resource "google_bigquery_dataset" "default" {
  dataset_id                      = "mydataset"
  default_partition_expiration_ms = 2592000000  # 30 days
  default_table_expiration_ms     = 31536000000 # 365 days
  description                     = "dataset description"
  location                        = "US"
  max_time_travel_hours           = 96 # 4 days

  labels = {
    billing_group = "accounting",
    pii           = "sensitive"
  }
}

# Update the user, group, or service account
# provided by the members argument with the
# appropriate principals for your organization.
data "google_iam_policy" "default" {
  binding {
    role = "roles/bigquery.dataOwner"
    members = [
      "user:[email protected]",
    ]
  }
  binding {
    role = "roles/bigquery.admin"
    members = [
      "user:[email protected]",
    ]
  }
  binding {
    role = "roles/bigquery.user"
    members = [
      "group:[email protected]",
    ]
  }
  binding {
    role = "roles/bigquery.dataViewer"
    members = [
      "serviceAccount:bqcx-1234567891011-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com",
    ]
  }
}

resource "google_bigquery_dataset_iam_policy" "default" {
  dataset_id  = google_bigquery_dataset.default.dataset_id
  policy_data = data.google_iam_policy.default.policy_data
}

Create a dataset with a customer-managed encryption key

The following example creates a dataset named mydataset, and also uses the google_kms_crypto_key and google_kms_key_ring resources to specify a Cloud Key Management Service key for the dataset. You must enable the Cloud Key Management Service API before running this example.

resource "google_bigquery_dataset" "default" {
  dataset_id                      = "mydataset"
  default_partition_expiration_ms = 2592000000  # 30 days
  default_table_expiration_ms     = 31536000000 # 365 days
  description                     = "dataset description"
  location                        = "US"
  max_time_travel_hours           = 96 # 4 days

  default_encryption_configuration {
    kms_key_name = google_kms_crypto_key.crypto_key.id
  }

  labels = {
    billing_group = "accounting",
    pii           = "sensitive"
  }
  depends_on = [google_project_iam_member.service_account_access]
}

resource "google_kms_crypto_key" "crypto_key" {
  name     = "example-key"
  key_ring = google_kms_key_ring.key_ring.id
}

resource "random_id" "default" {
  byte_length = 8
}

resource "google_kms_key_ring" "key_ring" {
  name     = "${random_id.default.hex}-example-keyring"
  location = "us"
}

# Enable the BigQuery service account to encrypt/decrypt Cloud KMS keys
data "google_project" "project" {
}

resource "google_project_iam_member" "service_account_access" {
  project = data.google_project.project.project_id
  role    = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member  = "serviceAccount:bq-${data.google_project.project.number}@bigquery-encryption.iam.gserviceaccount.com"
}

To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.

Prepare Cloud Shell

  1. Launch Cloud Shell.
  2. Set the default Google Cloud project where you want to apply your Terraform configurations.

    You only need to run this command once per project, and you can run it in any directory.

    export GOOGLE_CLOUD_PROJECT=PROJECT_ID

    Environment variables are overridden if you set explicit values in the Terraform configuration file.

Prepare the directory

Each Terraform configuration file must have its own directory (also called a root module).

  1. In Cloud Shell, create a directory and a new file within that directory. The filename must have the .tf extension—for example main.tf. In this tutorial, the file is referred to as main.tf.
    mkdir DIRECTORY && cd DIRECTORY && touch main.tf
  2. If you are following a tutorial, you can copy the sample code in each section or step.

    Copy the sample code into the newly created main.tf.

    Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution.

  3. Review and modify the sample parameters to apply to your environment.
  4. Save your changes.
  5. Initialize Terraform. You only need to do this once per directory.
    terraform init

    Optionally, to use the latest Google provider version, include the -upgrade option:

    terraform init -upgrade

Apply the changes

  1. Review the configuration and verify that the resources that Terraform is going to create or update match your expectations:
    terraform plan

    Make corrections to the configuration as necessary.

  2. Apply the Terraform configuration by running the following command and entering yes at the prompt:
    terraform apply

    Wait until Terraform displays the "Apply complete!" message.

  3. Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.

API

Call the datasets.insert method with a defined dataset resource.

C#

Before trying this sample, follow the C# setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery C# API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.


using Google.Apis.Bigquery.v2.Data;
using Google.Cloud.BigQuery.V2;

public class BigQueryCreateDataset
{
    public BigQueryDataset CreateDataset(
        string projectId = "your-project-id",
        string location = "US"
    )
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var dataset = new Dataset
        {
            // Specify the geographic location where the dataset should reside.
            Location = location
        };
        // Create the dataset
        return client.CreateDataset(
            datasetId: "your_new_dataset_id", dataset);
    }
}

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// createDataset demonstrates creation of a new dataset using an explicit destination location.
func createDataset(projectID, datasetID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	ctx := context.Background()

	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	meta := &bigquery.DatasetMetadata{
		Location: "US", // See https://cloud.google.com/bigquery/docs/locations
	}
	if err := client.Dataset(datasetID).Create(ctx, meta); err != nil {
		return err
	}
	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetInfo;

public class CreateDataset {

  public static void runCreateDataset() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    createDataset(datasetName);
  }

  public static void createDataset(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      DatasetInfo datasetInfo = DatasetInfo.newBuilder(datasetName).build();

      Dataset newDataset = bigquery.create(datasetInfo);
      String newDatasetName = newDataset.getDatasetId().getDataset();
      System.out.println(newDatasetName + " created successfully");
    } catch (BigQueryException e) {
      System.out.println("Dataset was not created. \n" + e.toString());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

// Import the Google Cloud client library and create a client
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function createDataset() {
  // Creates a new dataset named "my_dataset".

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = "my_new_dataset";

  // Specify the geographic location where the dataset should reside
  const options = {
    location: 'US',
  };

  // Create a new dataset
  const [dataset] = await bigquery.createDataset(datasetId, options);
  console.log(`Dataset ${dataset.id} created.`);
}
createDataset();

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

use Google\Cloud\BigQuery\BigQueryClient;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';

$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->createDataset($datasetId);
printf('Created dataset %s' . PHP_EOL, $datasetId);

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set dataset_id to the ID of the dataset to create.
# dataset_id = "{}.your_dataset".format(client.project)

# Construct a full Dataset object to send to the API.
dataset = bigquery.Dataset(dataset_id)

# TODO(developer): Specify the geographic location where the dataset should reside.
dataset.location = "US"

# Send the dataset to the API for creation, with an explicit timeout.
# Raises google.api_core.exceptions.Conflict if the Dataset already
# exists within the project.
dataset = client.create_dataset(dataset, timeout=30)  # Make an API request.
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))

Ruby

Before trying this sample, follow the Ruby setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Ruby API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

require "google/cloud/bigquery"

def create_dataset dataset_id = "my_dataset", location = "US"
  bigquery = Google::Cloud::Bigquery.new

  # Create the dataset in a specified geographic location
  bigquery.create_dataset dataset_id, location: location

  puts "Created dataset: #{dataset_id}"
end

Dataset security

To control access to datasets in BigQuery, see Controlling access to datasets. For information about data encryption, see Encryption at rest.

What's next

Try it for yourself

If you're new to Google Cloud, create an account to evaluate how BigQuery performs in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Try BigQuery free