BigQuery
Legacy vs Standard SQL
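Legacy SQL – table references are enclosed in square brackets, e.g. [project:dataset.table];
“:” separates the project from the dataset. UDFs are available in the web console.
Standard SQL – table references are enclosed in backticks, e.g. `project.dataset.table`; the
separator is “.”. Does not support TABLE_DATE_RANGE and TABLE_QUERY; use wildcard tables with
_TABLE_SUFFIX instead. Supports querying nested and repeated data.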
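The difference in table reference syntax can be sketched as a small converter. This is a minimal illustration, not an official migration tool: the function name `legacy_to_standard_ref` and the example project, dataset and table names are hypothetical, and domain-scoped project IDs (which themselves contain “:”) are not handled.

```python
import re

# Legacy SQL form: [project:dataset.table]; the project part is optional.
_LEGACY_REF = re.compile(
    r"^\[(?:(?P<project>[^:\]]+):)?(?P<dataset>[^.\]]+)\.(?P<table>[^\]]+)\]$"
)


def legacy_to_standard_ref(legacy_ref: str) -> str:
    """Rewrite a legacy SQL table reference in standard SQL backtick form.

    Hypothetical helper for illustration only.
    """
    match = _LEGACY_REF.match(legacy_ref.strip())
    if match is None:
        raise ValueError(f"not a legacy table reference: {legacy_ref!r}")
    parts = [match["project"], match["dataset"], match["table"]]
    # Drop the project when the legacy reference omitted it.
    return "`" + ".".join(p for p in parts if p) + "`"


if __name__ == "__main__":
    print(legacy_to_standard_ref("[my-project:my_dataset.my_table]"))
```

The only structural changes are the enclosing characters and the “:” between project and dataset; dataset and table are separated by “.” in both dialects.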
Standard SQL advantages:
▪ Composability using WITH clauses and SQL functions.
▪ Subqueries in the SELECT list and WHERE clause.
▪ Correlated subqueries
▪ ARRAY and STRUCT data types (legacy had repeated and record data types)
▪ Inserts, updates, and deletes (DML)
▪ COUNT(DISTINCT <expr>) is exact and scalable, providing the accuracy of
EXACT_COUNT_DISTINCT without its limitations
▪ Automatic predicate push-down through JOINs
▪ Complex JOIN predicates, including arbitrary expressions
▪ Table wildcards with _TABLE_SUFFIX
▪ Stricter timestamp checking
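As an illustration of table wildcards, the sketch below builds a standard SQL query over date-sharded tables, which is the usual replacement for the legacy TABLE_DATE_RANGE function. It assumes the shards are named with a common prefix followed by YYYYMMDD; the function name `wildcard_range_query` and the table and column names are hypothetical. It uses plain string formatting, so it is not suitable for untrusted input.

```python
from datetime import date
from typing import Sequence


def wildcard_range_query(
    table_prefix: str, columns: Sequence[str], start: date, end: date
) -> str:
    """Build a standard SQL query over shards named <table_prefix>YYYYMMDD.

    Hypothetical helper: only the selected columns are read, and
    _TABLE_SUFFIX restricts which shards the wildcard matches.
    """
    if not columns:
        raise ValueError("at least one column is required")
    if start > end:
        raise ValueError("start must not be after end")
    return (
        f"SELECT {', '.join(columns)}\n"
        f"FROM `{table_prefix}*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'"
    )


if __name__ == "__main__":
    print(
        wildcard_range_query(
            "my-project.my_dataset.events_",
            ["user_id", "event_name"],
            date(2017, 1, 1),
            date(2017, 1, 31),
        )
    )
```

Keeping the prefix long (here `events_`) and filtering on _TABLE_SUFFIX limits the number of tables the wildcard scans.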
Best practices/Performance
▪ Avoid self-joins; use window functions instead
▪ If data is skewed (some partitions are much larger than others), filter early. Use
APPROX_TOP_COUNT to determine skew
▪ Avoid joins that produce more output rows than input rows
▪ Avoid point-specific DML (statements that touch one row at a time); batch the DML statements
▪ Sub-queries can be more efficient than joins, for example when they filter or aggregate rows
before the data is combined
▪ Select only the columns that are needed; avoid SELECT *
▪ Filter with a WHERE clause as early as possible so that later stages process minimal rows
▪ With joins, do bigger joins first. The left side of the join should be the bigger table
▪ GROUP BY on low-cardinality columns is faster. Low cardinality means that the column contains
few distinct values, i.e. a lot of “repeats” in its data range
▪ LIMIT does not affect cost, as it only controls how many rows are displayed
▪ Built-in functions are faster than JavaScript UDFs
▪ Exact functions are slower than approximate built-in functions; use the approximate version
when an approximate result is acceptable. For example, instead of COUNT(DISTINCT), use
APPROX_COUNT_DISTINCT()
▪ Apply ORDER BY in the outermost query, not in inner queries. The outer query is performed
last, so put complex operations at the end, when all filtering is done.
▪ Wildcards – make the table prefix as specific as possible
▪ Performance – query time is split between stages; this can also be seen in Stackdriver.
▪ Each stage is broken down into wait, read, write and compute
▪ Tail skew – the maximum time spent is significantly more than the average, because some
partitions are much bigger than others. Tail skew can be found using an approximate aggregate
function such as APPROX_TOP_COUNT
▪ Avoid tail skew – filter as early as possible
▪ Batch loading is free; streaming has a cost. Unless data is needed in real time, use batch
loading when possible.
▪ Denormalize when possible, using STRUCTs and ARRAYs to keep nested and repeated data.
▪ External data sources are slow; use them only when needed.
▪ Monitor query performance using the “details” page. It shows whether there is read, compute
or write latency. The query plan shows the different stages and the breakdown of time between
the different activities in each stage
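The idea behind the tail-skew check can be sketched locally on a sample of key values. This is only an illustration under stated assumptions: `top_keys` and `skew_ratio` are hypothetical helpers, the counts are exact rather than approximate, and inside BigQuery the equivalent check would use APPROX_TOP_COUNT on the join or grouping key.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def top_keys(keys: Iterable[Hashable], n: int) -> List[Tuple[Hashable, int]]:
    """Return the n most frequent keys with their counts.

    Exact local stand-in for what APPROX_TOP_COUNT estimates.
    """
    return Counter(keys).most_common(n)


def skew_ratio(keys: Iterable[Hashable]) -> float:
    """Size of the largest key group divided by the average group size.

    A value of 1.0 means the keys are evenly spread; a large value means
    one group dominates, which is the pattern that causes tail skew.
    """
    counts = Counter(keys)
    if not counts:
        raise ValueError("no keys given")
    average = sum(counts.values()) / len(counts)
    return max(counts.values()) / average


if __name__ == "__main__":
    sample = ["a"] * 8 + ["b", "c"]  # hypothetical key column sample
    print(top_keys(sample, 1))
    print(skew_ratio(sample))
```

When the ratio is high, filtering out or separately handling the dominant keys early in the query keeps one worker from receiving most of the rows.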