The ML.BAG_OF_WORDS function
Use the ML.BAG_OF_WORDS
function to compute a representation of tokenized
documents as the bag (multiset) of its words, disregarding word ordering and
grammar. You can use ML.BAG_OF_WORDS
within the TRANSFORM
clause.
Syntax
ML.BAG_OF_WORDS( tokenized_document [, top_k] [, frequency_threshold] ) OVER()
Arguments
ML.BAG_OF_WORDS
takes the following arguments:
tokenized_document
:ARRAY<STRING>
value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis. For more information about tokenization in BigQuery, seeTEXT_ANALYZE
.top_k
: Optional argument. Takes anINT64
value, which represents the size of the dictionary, excluding the unknown term. Thetop_k
terms that appear in the most documents are added to the dictionary until this threshold is met. For example, if this value is20
, the top 20 unique terms that appear in the most documents are added and then no additional terms are added.frequency_threshold
: Optional argument. Takes anINT64
value that represents the minimum number of documents a term must appear in to be included in the dictionary. For example, if this value is3
, a term must appear at least three times in the tokenized document to be added to the dictionary.
Terms are added to a dictionary of terms if they satisfy the criteria for
top_k
and frequency_threshold
, otherwise they are considered
the unknown term. The unknown term is always the first term in the dictionary
and represented as 0
. The rest of the dictionary is ordered alphabetically.
Output
ML.BAG_OF_WORDS
returns a value for every row in the input. Each value has the
following type:
ARRAY<STRUCT<index INT64, value FLOAT64>>
Definitions:
index
: The index of the term that was added to the dictionary. Unknown terms have an index of0
.value
: The corresponding counts in the document.
Quotas
See Cloud AI service functions quotas and limits.
Example
The following example calls the ML.BAG_OF_WORDS
function on an input column
f
, with no unknown terms:
WITH ExampleTable AS ( SELECT 1 AS id, ['a', 'b', 'b', 'c'] AS f UNION ALL SELECT 2 AS id, ['a', 'c'] AS f ) SELECT ML.BAG_OF_WORDS(f, 32, 1) OVER() AS results FROM ExampleTable ORDER BY id;
The output is similar to the following:
+----+---------------------------------------------------------------------------------------+ | id | results | +----+---------------------------------------------------------------------------------------+ | 1 | [{"index":"1","value":"1.0"},{"index":"2","value":"2.0"},{"index":"3","value":"1.0"}] | | 2 | [{"index":"1","value":"1.0"},{"index":"3","value":"1.0"}] | +----+---------------------------------------------------------------------------------------+
Notice that there is no index 0
in the result, as there are no unknown terms.
The following example calls the ML.BAG_OF_WORDS
function on an input column
f
:
WITH ExampleTable AS ( SELECT 1 AS id, ['a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd'] AS f UNION ALL SELECT 2 AS id, ['a', 'c', NULL] AS f ) SELECT ML.BAG_OF_WORDS(f, 4, 2) OVER() AS results FROM ExampleTable ORDER BY id;
The output is similar to the following:
+----+---------------------------------------------------------------------------------------+ | id | results | +----+---------------------------------------------------------------------------------------+ | 1 | [{"index":"0","value":"5.0"},{"index":"1","value":"1.0"},{"index":"2","value":"4.0"}] | | 2 | [{"index":"0","value":"1.0"},{"index":"1","value":"1.0"},{"index":"2","value":"1.0"}] | +----+---------------------------------------------------------------------------------------+
Notice that the values for b
and d
are not returned as they appear in only
one document when the value of frequency_threshold
is set to 2
.
The following example calls the ML.BAG_OF_WORDS
function with a lower value of
top_k
:
WITH ExampleTable AS ( SELECT 1 AS id, ['a', 'b', 'b', 'c'] AS f UNION ALL SELECT 2 AS id, ['a', 'c', 'c'] AS f ) SELECT ML.BAG_OF_WORDS(f, 2, 1) OVER() AS results FROM ExampleTable ORDER BY id;
The output is similar to the following:
+----+---------------------------------------------------------------------------------------+ | id | results | +----+---------------------------------------------------------------------------------------+ | 1 | [{"index":"0","value":"2.0"},{"index":"1","value":"1.0"},{"index":"2","value":"1.0"}] | | 2 | [{"index":"1","value":"1.0"},{"index":"2","value":"2.0"}] | +----+---------------------------------------------------------------------------------------+
Notice how the value for b
is not returned since we specify we want the top
two terms, and b
only appears in one document.
The following example contains two terms with the same frequency. One of the terms is excluded from the results due to the alphabetical order.
WITH ExampleData AS ( SELECT 1 AS id, ['a', 'b', 'b', 'c', 'd', 'd', 'd'] as f UNION ALL SELECT 2 AS id, ['a', 'c', 'c', 'd', 'd', 'd'] as f ) SELECT id, ML.BAG_OF_WORDS(f, 2 ,2) OVER() as result FROM ExampleData ORDER BY id;
The results look like the following:
+----+---------------------------------------------------------------------------------------+ | id | result | +----+---------------------------------------------------------------------------------------+ | 1 | [{"index":"0","value":"5.0"},{"index":"1","value":"1.0"},{"index":"2","value":"1.0"}] | | 2 | [{"index":"0","value":"3.0"},{"index":"1","value":"1.0"},{"index":"2","value":"2.0"}] | +----+---------------------------------------------------------------------------------------+
What's next
- Learn about the
BAG_OF_WORDS
function outside of machine learning.