Google BigQuery offers hassle-free, fully managed and scalable analytical data warehousing at an affordable price and as such many businesses are opting for this ‘NoOps’ approach which doesn’t add additional workload to (often stretched) IT departments.
Apart from a basic knowledge of SQL and data types there is very little technical knowledge required to get started with BigQuery at an introductory level. For this reason, oftentimes adoption of BigQuery is seeded from BI, Finance or Marketing functions rather than IT.
Here are a few tips, tricks, best practices - whatever you want to call them - to help ensure you get the most out of your current or future BigQuery environment.
Create Your Datasets in the Same Location
You can think of a Dataset in BigQuery as a place where you house a number of related Tables, Views or Models. It may seem obvious, but it’s important to ensure (unless there is a valid reason otherwise) that you create all your Datasets in the same location:
'Example' Dataset will be located in Sydney. If you were to create another Dataset down the track you would also want to have the data location set to Sydney. Why? You can only query objects across Datasets if they are in the same Data location. This means you cannot join a table in Sydney with another table located in the US.
Partition Tables On Date
I’ve seen BigQuery PoC’s fail miserably because tables were not partitioned so this is a must for large data sets.
Partition your Tables on Day (Date or Timestamp data types). If there are multiple date fields in your Table use the field you would most likely include in a WHERE clause to restrict your results. If your Table is not partitioned then each time you run a query against this Table the entire Table is scanned regardless of whether or not you have a Date in your WHERE clause. If your Table is partitioned then BigQuery will only process the Partitions which are inside the WHERE clause.
A real life example might be identifying how many orders were placed yesterday. Your query might look something like this:
SELECT COUNT(DISTINCT OrderID) FROM `Example.Orders` WHERE OrderDate = [Yesterday’s Date]
For an OrderDate partitioned Table, only Yesterday’s partition is read by BigQuery and the results are returned quickly and efficiently. If 'Example.Orders' was not partitioned by OrderDate then the entire table would be scanned - this could be months, years or even decades of data.
The advantages being:
- Increased performance and query responsiveness
- Decreased BigQuery charges
In the above example, if the Orders table had one year worth of data in it then running this select statement on a partitioned table would result in 0.3% of the data being read (and therefore 0.3% of the cost) compared to a non-partitioned Table. Big costs savings can be had here.
Create ‘Tables’ with More Than 4,000 Partitions
As of writing, Google only allows a maximum of 4,000 partitions per Table - this is a little over 10 years ( 4,000 / 365 = 10.96) of data. For a lot of businesses this will be more than sufficient, however, Report Simple recently worked with a client with close to 20 years of data which needed to be partitioned. Here’s how we did it:
Split your tables into decades. Below Orders_00 has data from 1st Jan 2000 to the 31st Dec 2009 and Orders_10 has data from 1st Jan 2010 to 31st Dec 2019 both tables are Partitioned on Date.
Create a view which Unions both tables (which should have the same schema) together. Your View query might look something like this:
SELECT * FROM `Project.Example.Orders_00`
SELECT * FROM `Project.Example.Orders_10`
When querying the Orders View with the Partition Date in the WHERE clause the benefits of the Partitions in the underlying Tables are passed through to the View:
Use Google Sheets as a Table Data Source
Tables with an External or Federated data source are a great way to query data in BigQuery which doesn’t actually reside in BigQuery. It’s also a great way to include tactical or ad hoc information from Google Sheets in your BigQuery analysis.
A few use cases might be:
- Email addresses to be excluded as ‘Test’ customers
- Employee names who manage certain products or accounts
- Targets which might be set monthly by someone without BigQuery access or knowledge
The Google Sheet can be maintained by someone without any knowledge of BigQuery and is often useful when data changes on an ad hoc basis or when data needs to be quickly exposed to BigQuery without a formal ETL processes being developed.
Here’s how to do it:
Create your Google Sheet
Create a BigQuery Table sourced from Google Sheets
Query your BigQuery sourced Table and Google Sheets sourced Table
Stream BigQuery Logs back into BigQuery
If you require visibility on who in your organisation is using BigQuery and how much their associated query charges are then listen up!
The BigQuery pricing structure means you only pay for what you use - meaning you pay per Query rather than a set monthly fee. This is great as your BigQuery costs scale according to how much you are using the service, however this does means that your costs can be volatile.
This is a great way to identify spikes in cost, expensive queries or scheduled queries which may need to be paused.
Go to Logging -> Exports -> Create Export
Configure your Logs Export and Create sink:
Your BigQuery Logs will then start streaming into BigQuery Tables. Here’s a useful select statement which can be saved as a View for easy analysis:
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.endTime AS DateTime,
protopayload_auditlog.authenticationInfo.principalEmail as PrincipalEmail,
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.queryPriority as QueryPriority,
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query as Query,
SUM(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes) as TotalBytesBilled,
((((SUM(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes)/1024)/1024)/1024)/1024)*500 as USDCentsCharge
WHERE protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = 'query_job_completed'
ORDER BY 6 DESC
Use the above query to identify your most expensive BigQuery queries along with who ran the query and when they ran it.
Hope this is useful! Any questions or comments please feel free to leave below, or alternatively you can contact us at firstname.lastname@example.org