When it comes to Google BigQuery, there are plenty of articles and online courses out there. Most are “tech to tech” explanations—which are great. But they can be intimidating for those beginning their marketing-to-tech journey.
So where exactly do you start? Or, if you’re already using BigQuery, how can you go further and do some really cool stuff with it?
Get started—or move faster—with this marketer-focused tutorial.
What is Google BigQuery?
Google BigQuery is a warehouse for analytics data. It’s a place where you can:
- House your data for $0.02 per gigabyte (equivalent of 256 MP3 files).
- Query your data for $5.00 per 5 terabytes of queries (about 1 million 5-minute songs).
- Create nice (and free!) dashboards in Google Data Studio (or any other visualization tool, for the price of that tool).
The first terabyte of query data and the first 10 gigabytes of storage per month are free.
BigQuery isn’t the only game in town. Alternatives include Amazon Redshift, Snowflake, Microsoft Azure SQL Data Warehouse, Apache Hive, etc.
Oft-cited advantages of BigQuery include:
- No database administration;
- No limitations;
- Quick imports;
- Process terabytes in seconds;
- Data stored in EU or United States;
- Access control;
- Two-step verification;
- Data-loss prevention;
- Integration with other Google tools;
- $300 in test credits;
- Huge community;
- Educational resources (courses, labs, etc.).
Still, why would you go beyond your usual digital analytics tool and try a cloud solution like BigQuery?
What are the advantages of a tool like BigQuery?
When you work with Google Analytics or other digital analytics tools, you usually have control only over data collection and analysis. With a tool like BigQuery, you have more control over every stage of the analytics infrastructure:
It’s not the only difference. While Google Analytics makes it possible to add CRM, back-office, or call-tracking data (via the API or Measurement Protocol), it’s still a suboptimal solution to consolidate your data.
You have little control over the Google Analytics system—if your data is sampled or altered because Analytics wants to, well, that’s your problem.
A (not so) hypothetical use case
Imagine you need a monthly report with data from Google Analytics, your CRM, call tracking software, and some other sources. Every month, you go to each of these tools and search for useful data to fill your report.
But, sometimes, you can’t really access the CRM because you don’t have permissions (i.e. enterprise politics), or you’re at an agency and your client doesn’t want you to touch their CRM. So, you wait for someone to send you the necessary data to integrate into your report, which—as it’s often happened to me—takes time.
Then, you integrate all this data manually, which also takes time. By the 10th of the month, you have everything you need, but it’s kind of late to present these figures and make a decision about the actions to take that month.
Then you think, “We can’t do this anymore—we have to automate!” You propose some tools to your client who says “too expensive,” “too complicated,” etc., to every option. So, you look for the cheapest and simplest solution.
For me, BigQuery was that solution. I thought (and, ultimately, was right) that the amount of client data would never go beyond the free threshold, and that we could connect it to a free and simple Data Studio dashboard.
And that was it—a cheap and simple solution for the monthly reporting struggle.
Another use case
Imagine you want to know how much revenue your campaigns generated…
…and you sell houses.
…and you have a shitty custom CRM that can never connect to your Ads or Analytics platforms.
You know the number of leads, but you can’t connect them to house purchases. The solution is to give every lead and every purchase a userID (like an encrypted email), to pull CRM and Google Analytics data into your BigQuery data warehouse, and then—with a simple SQL query—join the two tables.
You can then say that userID X, who came on January 11 from Google Ads, brought us $500,000 in revenue. Therefore, it’s okay that we spent $500 to get that lead.
Inspired? Here’s how to get started.
Getting started with BigQuery
BigQuery is part of the Google Cloud Platform. To start working with it, you have to create (or log in to) a Gmail account and then go to Google Cloud Console to create a Cloud Project. The creation of these elements is straightforward.
Note: When you enter a Cloud account, it asks you to provide a credit card to get $300 in credits to test the platform. Don’t be afraid—$300 is more than enough for vetting or educational purposes, and they won’t charge you without notifying you that your credits have run out.
If you don’t want to enter your credit card and only want to play with BigQuery and public data (there are plenty of public datasets within BigQuery), you can use a BigQuery sandbox.
Just enter a BigQuery service after creating a Cloud Project and accepting all the terms, etc. You’ll see a “Sandbox” label in the top-left corner.
If you’re using only BigQuery in your Cloud Project, the schema below is a good explanation of your project structure:
- Your BigQuery interface with datasets and tables (covered later);
- Jobs (i.e. tasks), which include every operation in your Cloud Project—query, save, import, export, etc.;
- Team access, where you can give access to specific elements and tasks in your project (e.g., BigQuery dataViewer access).
Accesses are managed via Google Cloud IAM. You can find it in the menu (top-left corner) of your Cloud Project.
You also have the option to create an Organization in your Google Cloud account. Organizations are available to GSuite users (paid Gmail, basically) or Cloud Identity owners.
The Organization can have its own billing account and projects, and it can have access to other projects without access to their billing account:
In our agency, we have an Organization as a GSuite user. In some cases, we create projects for our clients and link them to our billing account. In other cases (when the client already has a project on the Cloud Platform), we just link their project to our organization to work without access to our client’s billing account.
Once the project is created and you’re in BigQuery, you’ll need to know some SQL to start playing with your BigQuery data. SQL is not rocket science; you can learn the basic concepts quickly and find plenty of SQL query examples to tailor to your needs. (Here’s a great tutorial for using SQL in BigQuery.)
Now, let’s look at some important steps for using BigQuery. I divide these into three stages:
- Before. Get data into BigQuery.
- During. Work with BigQuery data.
- After. Build dashboards with BigQuery data.
1. Get data into BigQuery
Build your data schema
Before starting your BigQuery journey, I recommend that you build a data schema. Plan out the datasets, tables, and table fields you’ll need. To do this, ask yourself these questions:
- What do I want to accomplish with BigQuery? For example, “I want to automate my current reports.”
- What data do you have in your current reports? For example, “I have CRM, Google Analytics, and back-office data.”
- What KPIs do you export from your CRM, Google Analytics, and back office? For example, “Sessions per source per month per store, purchases per source per month per store, leads per source per month per store, etc.”
The taxonomy of BigQuery flows as follows:
Projects > datasets > tables and views.
For me, one dataset = one data source. So, to answer the questions above, you would need three datasets (CRM, Google Analytics, back office).
If you have several brands, you can say that one table is one brand of your company. If you want to store previous years separately (because you rarely use previous years’ data) you can have one table per year. That’s for you to decide. You’re charged less for long-term data storage (i.e. data you didn’t change in the last 90 days).
Once you’ve answered all the above questions, you can start building your schema. I did it with a database schema tool. (There are plenty of them on the Internet—and always one that’s absolutely free.).
After building a schema—which, honestly, you can sketch out on paper—start creating your datasets. Click on your project name (e.g., “angular-radar-255111” on the image below). Project names are based on a random project ID assigned by Google Cloud; you can change it.
Next, click on “Create Dataset.”
Usually, you only need to name your dataset and choose a location for your data. Choose an EU location if your client is in the EU (GDPR!). You’ll notice a table expiration of 60 days if you use a BigQuery Sandbox, the free version mentioned earlier.
Use Google services
Google provides some built-in services to import your data into BigQuery. The first one is BigQuery Data Transfer, which can get data from Google Ads, Cloud Storage, Amazon S3, Google Play, and YouTube. It’s free for Amazon S3 and Cloud Storage.
BigQuery also connects to Google Drive (Google Sheets and CSV, Avro, or JSON files), but the data is stored in Drive—not in BigQuery. You can, however, query it from Drive directly.
That has an interesting use-case: Imagine that data must be added manually to Google Sheets on a daily basis. For example, a recruitment agency fills in a sheet at the end of the day with the number of candidates received and candidates placed.
You can get to that data using a Google Sheets link:
Google Analytics 360, Firebase (Blaze plan), and Google Analytics App + Web provide free integration with BigQuery. For other tools and a standard Google Analytics version, you’ll have to use non-Google connectors.
Use non-Google services
Some CRMs provide a native integration with different cloud data warehouses, including BigQuery. In most cases, our clients have custom CRMs, so we had to ask their developers to build a custom connector to Cloud Storage or BigQuery.
For a standard Google Analytics account, there are a bunch of paid connectors available, starting around $100 per 100K monthly visitors. Pulling your Google Analytics data into BigQuery has benefits:
BigQuery is a popular service—it’s not hard to find connectors for just about any ad or analytics platform.
Put your developers to work
Previously, we talked about a solution to create your own connector. There are two options here—to BigQuery directly or, first, to Cloud Storage. Both have API documentation to help your developers.
In one of our use cases, we asked the developers to send two CSV files (one from our CRM and a second with back-office data) every midnight with the previous day’s data to Cloud Storage.
In terms of development, it was the cheapest solution—the dev team had to export only two CSVs, once per day. Then, we used a Cloud Function to pull the updated files from Cloud Storage into our BigQuery tables.
Here’s a code that you can use in your project:
Some BigQuery professionals won’t like this solution. It has pitfalls:
- Cloud function triggering two times for one file. It’s happened to me a few times, and the reason remains unclear.
- Cloud function not triggering at all. Sometimes, it was related to the wrong data format (different from the BigQuery table) in a CSV file; in other cases, the reason was unclear—I just had to re-import the CSV file manually.
- Difficulties with quality control. If there’s a problem with CSV file data, you can’t block it from getting into your BigQuery table, so you have to clean the data afterward using SQL. Only CSV files with the same number of columns and data formats as your BigQuery table will be imported, so—in most cases—the data structure and data format issues are managed.
I chose it because it was the simplest and the cheapest for my client and it works pretty well—for now.
Batch vs. stream processing
There are two ways to send your data to Cloud: batch or streaming. Batch processing sends data once per period (e.g., data from the previous day at 1:00 a.m.). It’s a good option unless you want real-time data.
Mobile applications are a great example—you may want to know in real time if there are issues with your application. Streaming your data is a bit more complicated than batching it.
You may need a Cloud Dataflow and/or additional services to create a streaming pipeline. It would take a separate article to address that subject. We’ll stick to batch processing for now.
2. Work with data in BigQuery
The BigQuery interface
Let’s take a look at the BigQuery interface. Some courses/articles show the old version of BigQuery:
The new interface is similar to the old one:
Here are the component parts:
- Query history. Queries you’ve run previously. It’s especially useful when you run tests but forget to register the best queries you might need later.
- Saved query. Where to find your registered queries. Name them clearly so you can find them quickly later.
- Job history. The history of what happened in BigQuery—imports, exports, task history, etc.
- Transfers. Where you see and configure Data Transfers, a Google service to import Google data (e.g. Ads, Play, YouTube) into BigQuery.
- Scheduled queries. Register queries and run them every hour/day/week, etc.
- BI engines. A new feature that integrates with familiar Google tools like Google Data Studio to accelerate data exploration and analysis. Basically, a BigQuery enhancer.
- Resources. Pin a project to see it at the top of the list. It’s super useful when you have lots of projects but work often on just a few. You can also add a public dataset to play with data if you want to learn the tool but don’t have your own.
Introduction to datasets, tables, views, and queries
It’s easiest to understand the structure of a BigQuery project with an analogy from Google Analytics:
- A BigQuery project is like a Google Analytics account. You (usually) create one per company/brand. So one company = one Analytics account = one BigQuery project.
- A BigQuery dataset is like a Google Analytics property—you create one per data source (e.g., website, application).
- A BigQuery table or view is like a Google Analytics view. You create a table or view to view or subdivide your data.
Within a project, you can create/delete/copy datasets and tables:
When you click on a table, you have options to query, copy, delete, or export:
You can export your table to Cloud Storage, explore it in Data Studio, or scan it with the Google Data Loss Prevention service (all via the “Export” button).
If you find yourself running a particular query often, it’s simpler to create a view. A view is a table based on your query that gets created whenever you work with it. You can use it in Data Studio, which we’ll talk about later.
A query is your SQL code—how you communicate with your BigQuery data. Create a query to get the data you need from the tables you have: SELECT (fields) FROM (your table), LIMIT (quantity of lines).
Of course, this is the simplest example of a query. It can be a long piece of code, but the object of this article isn’t to teach you SQL.
Some best practices for working with data in BigQuery
To use BigQuery more efficiently, here are some tips:
- Use denormalized tables. Try to group your data into only a few tables; don’t create a table for a small number of fields. For example, say you have a table that contains data about “purchases per agency,” and another table that contains the postal addresses of these agencies. To “denormalize” this data, you put all these fields into one table:
- Use preview mode. Before you query your table, take a look at the Preview tab. It’ll help you understand the kind of data you have available in your table before you start creating:
- Use Query Validator. When you create a query, you’ll see a red or green circle in the right corner under your query table. Red means your query contains an error, so fix it before you run it.
- NO * SELECT. Don’t use the “Select all” option (Select *). You can always go to the Preview mode to see all your table data. Remember: You pay for the queries, even if that’s simply consuming your free allocation.
- Use table expiration. If you know that you won’t need a table after some period of time, you can set a table expiration option. Remember: You also pay for storage, so this option can be quite useful.
- Use partition. It’s possible to “split” a table into small subtables using a date column. It usually splits by day. It’s a way to optimize query cost—when you need data for a particular period, BigQuery will search for subtables per day first, then get data from subtables without diving into the full gazillion-row table. You create a partition during table creation.
- Use pricing calculator. BigQuery provides a pricing calculator so that you can see an estimate for data storage or a query.
- Control costs (custom quotas). There are a bunch of possibilities in Google Cloud to restrict its usage (i.e. make sure no one on your team launches a huge, expensive query). You can set limitations for “maximum per query cost,” “per user daily budget,” and “per project daily budget.”
3. Build dashboards with BigQuery data
Google Data Studio
To create a Data Studio dashboard using your BigQuery data, open your existing dashboard or create a new one. Click on “Create New Data Source”:
Choose “BigQuery” from all possible sources. After that, you’ll refine your selection by project and dataset.
From there, you can connect to a table or a view. My choice was a view, as it’s basically a pre-created query with only the data I need. You can also connect directly to a table and do all the magic in Google Data Studio directly.
When you connect to a view or a table, you’ll see the fields available in your data source:
When you click on “Add to Report,” you create a connection between your data source (BigQuery view or table) and Data Studio. So go ahead, you’re ready to create a dataviz with your BigQuery data.
If you’re a GSuite user, you can use a native BigQuery connector to connect with your Google Sheet:
When you click on “Connect to BigQuery,” you’ll have to choose a BigQuery project and then create a query as you would in the BigQuery interface:
One problem: You can’t schedule a query—at least not yet. (The BigQuery connector is new.) You’ll have to refresh the query regularly to fill your Google Sheets table with the newest data.
Once your data is pulled into Google Sheets, you can start creating Google Sheets dashboards. For non-GSuite users, there are some Google Sheets Add-ons (free and paid) that can pull in BigQuery data. At our agency, we use OWOX BI BigQuery Reports, which also lets you schedule your queries.
Of course, you’re not limited to Google Data Studio or Google Sheets. Here are some common data tools that integrate easily with BigQuery:
The list is limited to my own knowledge—I’m sure there are tons of other options.
BigQuery is a great option to start consolidating your data. You have plenty of possibilities to test, learn, and embrace this service. To improve your knowledge of Google Cloud, Google BigQuery, and SQL, check out these courses:
There’s a great BigQuery community out there, too, so don’t be afraid to search for answers or ask questions.
As you progress, you can go further with BigQuery, using its integrated machine-learning models, which include pre-built templates. Segment your audiences based on the potential to purchase, predict customer lifetime value, etc.
If you learn the basics, you’re most of the way there.