I am attempting to manually add a table to the columnar engine in AlloyDB, but am running into an error indicating that I need to increase the population heap size parameter:
db=> select google_columnar_engine_add(relation => 'my_table_name');
WARNING: Error occurred: Columnar engine encountered out of memory during population. Set google_columnar_engine.population_heap_size_in_mb to a bigger value.
db=> show google_columnar_engine.population_heap_size_in_mb;
google_columnar_engine.population_heap_size_in_mb
---------------------------------------------------
250
db=> SET google_columnar_engine.population_heap_size_in_mb TO 1000;
ERROR: parameter "google_columnar_engine.population_heap_size_in_mb" cannot be changed now
As you can see above, I don't seem to be able to configure this parameter at runtime and I don't see an AlloyDB flag that would control it. The table size is ~4.6 GB which should fit within the available columnar engine memory capacity of 14000 MB (see below):
db=> \dt+ my_table_name;
List of relations
 Schema |     Name      | Type  |  Owner  | Persistence | Access method |  Size   | Description
--------+---------------+-------+---------+-------------+---------------+---------+-------------
 public | my_table_name | table | db_user | permanent   | heap          | 4663 MB |
db=> show google_columnar_engine.memory_size_in_mb;
google_columnar_engine.memory_size_in_mb
------------------------------------------
14000
(1 row)
db=> SELECT google_columnar_engine_memory_available();
google_columnar_engine_memory_available
-----------------------------------------
12242
(1 row)
So it seems the issue isn't that the table wouldn't fit in memory. Rather, the working heap that this internal parameter allocates for loading data into the columnar engine is being exhausted somehow.
Is there any way to control the `google_columnar_engine.population_heap_size_in_mb` parameter? I can't find any documentation for it, nor does it appear in `pg_settings`.
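For reference, this is the sort of check I mean when I say it doesn't show up in `pg_settings` (the LIKE filter is just illustrative):

-- Look for any columnar engine GUCs exposed through pg_settings.
SELECT name, setting, context
FROM pg_settings
WHERE name LIKE 'google_columnar_engine%'
ORDER BY name;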
---
Update:
I've tried adding specific columns one at a time and this succeeds up to a point. None of the columns is particularly large (most consume about 35MB when added to the columnar engine, as indicated in the g_columnar_columns table), but once I've added 16 columns to the engine in this way, attempting to add the 17th column throws the original heap size error.
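For context, the incremental approach looked roughly like this; the column name is a placeholder, and the exact columns exposed by the g_columnar_columns view may vary by AlloyDB version, so check \d g_columnar_columns first:

-- Add a single column ('col_a' is a placeholder); the columns argument
-- takes a comma-separated list of column names.
SELECT google_columnar_engine_add(relation => 'my_table_name', columns => 'col_a');

-- Inspect the per-column footprint reported by the engine.
SELECT relation_name, column_name, size_in_bytes
FROM g_columnar_columns
WHERE relation_name = 'my_table_name';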
The error message indicating an out-of-memory condition during population suggests that this per-population heap limit is being exceeded, even though the table fits comfortably within the overall columnar engine memory (google_columnar_engine.memory_size_in_mb). The fact that you can add columns incrementally until hitting a limit further supports this theory. Unfortunately, this parameter is not configurable at runtime or through any documented database flag.
Reaching out to Google Cloud Support is the best approach for a definitive solution. They can provide insights into the parameter's intended behavior, suggest workarounds, potentially adjust the parameter on the backend, and offer guidance on optimal data loading strategies for large tables.
If immediate adjustments are needed, consider workarounds such as the following alternative loading method, which populates the table one column at a time:
-- Populate the columnar engine one column at a time to stay under the
-- population heap limit.
CREATE OR REPLACE FUNCTION load_table_to_columnar(target_table text) RETURNS void AS $$
DECLARE
    col record;
BEGIN
    FOR col IN
        SELECT column_name
        FROM information_schema.columns
        WHERE table_schema = 'public'   -- assumes the table lives in the public schema
          AND table_name = target_table
    LOOP
        -- Pass a single column per call; the columns argument is a
        -- comma-separated list of column names.
        PERFORM google_columnar_engine_add(relation => target_table,
                                           columns  => col.column_name::text);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT load_table_to_columnar('my_table_name');
Memory consumption during columnar population can be significantly influenced by data types, with complex types like arrays or JSON requiring more heap space. Additionally, the distribution of data across columns can lead to uneven memory usage during loading, particularly with highly skewed data.
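If you want to see which columns are likely to be the heavy ones before populating, here is a rough sketch using planner statistics (it assumes the table has been ANALYZEd recently; the schema name is an assumption):

-- Per-column average width and distinct-value estimates from pg_stats;
-- wide or highly skewed columns are candidates to add last or on their own.
SELECT attname, avg_width, n_distinct
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'my_table_name'
ORDER BY avg_width DESC;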
Contacting Google Cloud Support offers the most direct path to resolving the heap size limitation; in the meantime, restructuring the table (for example, partitioning it) or employing alternative loading methods can serve as effective interim solutions.
Hi @ms4446, I have been going through AlloyDB recently too, since we are using it in a POC. Our source database uses partitions, and since you mention partitioning in your answer, it caught my interest. I have gone through the AlloyDB documentation but haven't come across anything about partitions. However, there is a supported extension called "pg_partman" that supports partitioning. I assume there is no managed partitioning service/strategy in AlloyDB (like in BigQuery, for example). That is, if we were to use this extension, we'd have to manage everything about the partitions ourselves. Is my understanding correct?
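To make the manual-management point concrete, here is a minimal sketch of plain declarative partitioning in PostgreSQL (table, column, and range values are placeholders); as I understand it, pg_partman automates creating and retiring such child partitions on a schedule, but it still runs inside the database rather than as a managed service like BigQuery's partitioning:

-- Parent table partitioned by range; every child partition below is
-- created by hand (placeholder names and ranges).
CREATE TABLE events (
    event_time timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (event_time);

CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');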