The Tableau + BigQuery combination is a strong choice for businesses of all sizes wanting best-in-class visualisations, report interactivity and data exploration coupled with hassle-free, fully managed and scalable analytical data warehousing. Tableau and Google have a shared vision when it comes to extracting insight from data and as a result the two companies work closely to ensure these offerings integrate nicely.
Here are a few tips, tricks, best practices - whatever you want to call them - for your current or future Tableau + BigQuery setup.
Spin up your Tableau VM in the same Region/Zone as your BigQuery Datasets
BigQuery allows you select where you would like your Dataset located. If you create your Dataset in 'australia-southeast1' do yourself a favour and spin up your Tableau Server VM on the Google Cloud Platform in the same location:
Doing this will:
- Ensure effectively zero latency between BigQuery and Tableau (as data transfer is within a single data centre and doesn’t travel over the internet)
- Save on Google Cloud Platform Network charges
Tableau have some useful documentation on installing Tableau Server on Google Cloud Platform here.
Choose your Connection Type Wisely
Tableau has two connection types to BigQuery; (1) Live and (2) Extract. Here’s a short explanation of both.
Live - when a Tableau Workbook is loaded or a filter adjusted Tableau sends a request to the BigQuery API which returns data to Tableau for visualising. This is all done on-the-fly as the report user interacts with Tableau.
Extract - on a schedule (monthly, weekly, hourly etc.) Tableau will ask BigQuery for all the underlying data available to a given Tableau report. This is usually a large table with all the dimensions and measures available within the entire scope of the Tableau report. Tableau then stores a copy of this data in an in-memory analytical engine. When a user loads or interacts with a report, Tableau sends a request to the in-memory engine rather than going all the way back to Google BigQuery.
Connection types are selected per Workbook or per Datasource meaning you can use a combination of both Live and Extract. My opinion is, use Extract wherever possible and fall back to Live wherever Extract isn’t feasible. Here are some examples of when to use each:
Extract:
- The speed of loading workbooks and changing filters is important
- Data doesn’t change too frequently (hourly or longer)
- The scope of your report in narrow and therefore your dataset is small-ish
Live:
- Your data changes frequently and your reporting requirement is real-time
- The scope of the report is too large and it would take too long to extract the data
A rough rule of thumb might be to use Extract unless your dataset is greater than 50M rows. This isn’t because Tableau can’t handle more than 50M rows (it definitely can!), rather, the BigQuery connector is optimised for Live connections and extracting more than 50M rows can take quite some time.
Extract Best Practices
Avoid Related Tables or Custom SQL and Use Views for Extracts
Joining or creating relationships between tables is something every data guru does and you can do this simply and easily in the Tableau Data Connection window. However, the recommendation here is that if you have knowledge of SQL then perform these joins in a BigQuery View which pushes the workload back to BigQuery and saves Tableau’s analytical processing power for loading and filtering Workbooks.
Similarly, writing a query in BigQuery and copy-pasting it into the ‘New Custom SQL’ connector in Tableau is very tempting (I know I’m guilty of doing this), however, I’d recommend taking the time to publish a View in BigQuery and connecting to said View from Tableau. This has an added advantage of keeping all your SQL in one, easily accessible place.
BigQuery has huge analytical horsepower. Even on the free tier your queries can scale to use up to 2,000 nodes and it would be a sin not to utilise this. This is why all transformations on the data should be pushed back to BigQuery rather than being done in Tableau. A great way to do this is by using BigQuery Views.
Shape, Aggregate & Filter
Any Tableau report likely doesn't require access to every piece of data in your data warehouse at the transactional level and therefore it shouldn’t get it. Ensure you shape, aggregate and restrict your data based on scope, relative to the report it’s servicing.
Here’s how to do it:
Shape
Only include the columns you’re actually going to use in Tableau and make sure any columns you do require are calculated in BigQuery and not Tableau. For example, if you need to know if a promotion code has been used for a particular sale but you don’t need to know what specific promotion code was used then:
- Don’t include the PromotionCode column in your dataset (here we are excluding non-essential columns)
- Include a PromotionCodeUsed flag in your dataset by using:
CASE WHEN PromotionCode is NULL THEN FALSE ELSE TRUE END AS PromotionCodeUsed
(here we are pushing calculations back to BigQuery)
Another common calculation that can be pushed back to BigQuery is when you have a Sales and CostOfGoodsSold column and you need GrossMargin. Instead of creating a calculated field in Tableau, do this calculation in BigQuery first. Be careful though because some calculation will need to be done in Tableau, for example, Gross Margin % which might be something like this in Tableau:
SUM(GrossMargin) / SUM(Sales)
Gross Margin % cannot be pushed back to BigQuery as this will not aggregate properly in Tableau.
The general rule-of-thumb being that any calculation that can be done across a row should be done in BigQuery and any aggregation calculated down a column should be done in Tableau.
Aggregate
Ensure you aggregate your data to the lowest required level. Including data down to the Minute or even Second level is useless if the report only shows data down to the Day level. All you will be doing here is make the size of your data larger than it needs to be and giving Tableau more work to do.
Filter
Similarly to Aggregate, including data from 5 years ago is useless if the report only uses data from this year. Again, all you will be doing here is make the size of your data larger than it needs to be and giving Tableau more work to do.
Happy shaping, aggregating and filtering!
Live Connection Best Practices
Partition, Partition, Partition
Because BigQuery is a fully managed data warehousing solution, all maintenance and optimisation is done on-the-fly and behind the scenes. This is great as it means you don’t need to hire a database administrator or add to your current DBA’s workload. Bigquery does offer one very powerful feature to help you manage your data and that’s through the use of partitioned tables.
Key advantages:
- Increased Performance (report responsiveness under a Live connection)
- Reduced Cost of query
Partition your tables on a commonly used date field wherever possible and whenever said date field is often in your WHERE clause or Tableau filter. This will limit the data BigQuery will scan before returning the query results.
For example, to identify how many orders were placed yesterday your query might look something like this:
SELECT COUNT(DISTINCT OrderID) FROM `Example.Orders` WHERE OrderDate = [Yesterday’s Date]
If your table is not partitioned then BigQuery scans the entire Orders table which could be years or even decades worth of data. However if your table is partitioned by OrderDate then BigQuery will only scan yesterday’s data. In this example, if the Orders table had 1 year worth of data (and all things being equal) your query size on a partitioned Table would be 0.3% of what it would be on an unpartitioned Table.
Under a Live Tableau connection and partitioned BigQuery Table reporting yesterday’s Orders is going to (1) return results a lot faster to the report user and allow for more responsive report interaction and (2) save costs by scanning just a small portion of the Table.
On a side note, I’ve seen BigQuery PoC’s fail miserably because Tables were not partitioned so this is a must for large data sets regardless of whether you’re using Tableau or not.
Materialise your Data before Connecting Tableau
When a Live connection is required, wherever possible:
- Don’t use Custom SQL!
- Don’t use Related Tables!
- Don’t use complex Views!
And here’s what to do:
Materialise the data you need at the level of aggregation and scope you require in a single BigQuery table.
Remember with a Live connection queries are sent from Tableau to the BigQuery API on-the-fly so any transformations which need to be done will also happen on-the-fly and will increase the time taken to return results. By materialising (and by this I mean saving a copy of the data as a Table) a Table which is custom-built to your reporting needs, BigQuery is going to have a far better chance of returning results in a timely fashion. Your materialised Table should:
- Only contain the columns required
- Be at the lowest level of aggregation required (this could be Day or Month for example)
- Only contain data within the required scope (i.e. if you don’t need data from 5 years ago then don’t include it)
- Be partitioned on a Date field which will be filtered in Tableau
Use Continuous Filters on you Partitioned Field in Tableau
It’s imperative report users or colleagues self-serving from a Datasource with a Live connection can access the performance and cost benefits of partitioned Tables via Tableau. That is, if in Tableau there is a filter on OrderDate (for example) and your Table in BigQuery is partitioned on OrderDate the SQL which Tableau generates should utilise the partitioned field resulting in BigQuery only scanning part of the Table.
How do you ensure this happens? Use a Continuous Date filter in Tableau. If you’re not sure whether your filter is Continuous or not, check the colour. Green means Continuous and blue means Discrete. Keep in mind that Discrete filters will often lead to Tableau generated SQL which still scans the entire table.
Hope this is useful! Any questions or comments please feel free to leave below, or alternatively you can contact us at info@reportsimple.com.au
Comments