Dataform Beginner Guide: Data Transformation in Bigquery [2024]

Table of Contents

What is Dataform? Why use it? Is Bigquery not enough?

You probably heard of the Transform, extract, and load process (a.k.a. ETL). If you never heard of it before, the short version is that this is how we turn raw and unusable data to production-ready tables.

An obvious example of where such tool can be useful is the GA4 raw export. Transforming the messy dataset with nested fields to production ready tables requires a lot of work, which can be handled effectively by Dataform.

We will see in later articles how to start working with the GA4 export and Dataform.

Why use Dataform: the key advantages of a dedicated transformation tool

Dataform was born from the need for scalability that anyone managing large amounts of data understands.

Yes, you can use just Bigquery and plain SQL, if you are just starting out. Your set-up becomes increasingly harder to manage as you scale, without mentioning the cost.

Collaboration and version control

In large (and complex) projects, Even the smallest error can be costly. The simple approach of writing a few paragraphs of SQL in BigQuery, hoping everything goes well doesn’t cut it anymore.

Moreover, chances are you are not working alone, which adds an extra layer of complexity. Who did what and when?

That is why the concept of version control is key here. Keeping track of changes made by teammates or even just your own allows you to write and maintain high quality code.

ℹ️ Using GIT for Version control

Although I’m going to walk you through how to set up GitHub with Data form, this article does a great job explaining it.

Sequencing and managing dependencies

You have 2 tables:

  1. The first makes your raw data into a session level table: in other words, instead of using the default events table, you transform it to a session-based table.
  2. Your second table manages purchases and classify users into tiers based on the purchase amount (gold, silver, etc.)

If 2 depends on data from 2, you need to make sure that data is available first. Only a professional tool like Dataform can give you this degree of control.

Make your code reusable

Your tables, like Lego pieces, can be combined to make different shapes.

Instead of duplicating and updating code over and over again, you write reusable blocks that you can reuse.

Setting the environment for Data Form

If you do not have a Google cloud account, please follow the steps in this guide.

Enable The Dataform API

Go to APIs and Services and enable the data form API.

Next, make sure that the Secret Manager API is enabled as well. It will help us store any sensitive data in safety.

Create a repository

Repositories are like folders for your projects. Before creating our files for queries, we need to have a repository to host them. Create a new repository by browsing to Dataform.

Next, use save the default access. we need it to give permissions to Dataform

Give proper roles

Next, let’s give our newly created repository some basic access. Use the search bar to browse to IAM & Admin.

Look for the address we used earlier to create the repository. The name you should look for is Dataform Service Account.

Click on the pen to edit the access for Dataform.

Next, give the following access to the service account.

Connect Dataform to GitHub

The secret sauce that makes Dataform powerful is version control and collaboration. This is all possible because we can link it to GitHub. Let’s go ahead and Link our Dataform with a GitHub repository.

Go back to Dataform and create a secret using the secret manager API we enabled earlier.

Next, click on CREATE SECRET.

Add the developer token from your GitHub and name the secret whatever you want. Save the changes by clicking on create secret button.


Finally, let’s connect the data form with GitHub. In the settings section of your repository, click on CONNECT WITH GIT.

Fill in the information in the fields.

Make sure to use your own HTTPS link, which you can find in your GitHub repository.

If everything works as expected, you will see the message below.

Phew! We’ve done a lot of clicking, but we are almost done. The last thing we need to do is create and initialize a workspace to finally start working on some queries.

I recommend using your own name to distinguish your workspace from that of your teammates.

Finally, after clicking on your workspace, you can write some actual code in Dataform. Lets tart by initializing our workspace.

New directories and files will be accessible in the side panel, providing useful resources as we prepare our workflow.

We will dive deeper into how to work with the different directories. For now, just remember that:

  • definitions is where we will have our SQL queries + some JavaScript code to save us time and remove redundancies. Below is an example of how to structure this directory.

definitions/
  ├── ga4_sessions.sqlx
  └── gads_conversions.sqlx
JSON

  • includes this file contains JavaScript functions and variables we can leverage in definitions. I added an example of a JavaScript function I use to store BigQuery parameters that store page information.

let page_fields='page_title, page_location, page_referrer';
JavaScript

Final words

Don’t worry if you do not understand everything from the get-go. The next article in this series will expand on how to start using JavaScript and SQL to make your workflow much smoother.

In the next article, we will discuss how to run our first SQL queries and save changes to GitHub. We will also start using some useful JavaScript code to enhance our workflow.

0 0 votes
Article Rating
Subscribe
Notify of
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Go Further