My journey introducing the data build tool (dbt) in project’s analytical stacks
Posted rongfengliang
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了My journey introducing the data build tool (dbt) in project’s analytical stacks相关的知识,希望对你有一定的参考价值。
转自:https://www.lantrns.co/my-journey-introducing-the-data-build-tool-dbt-in-projects-analytical-stacks/
Not sure I remember how, but I had the good luck a few weeks ago to stumble upon posts from Tristan Handy where he mentioned a tool his team built, simply called the Data Build Tool (dbt). I immediately took interest, as I was struggling with ETL (Extract, Transform, Load) at that moment, and had the extreme good fortune of having 2 clients who were willing to let me experiment with it, and even eventually use it for their own data warehouse.
Starting out with such a tool can sometimes be a bit confusing, as you’re not sure how and where to start. You can envision how dbt will fit in your overall analytical stack, but you might need a little bit of guidance to get things started. Although the documentation is really good, I feel like it’s always a good thing to read about someone else’s experimenting before starting my own.
So I decided to put together this little article that just goes over my own experience working with dbt, what I’ve picked up along the way, and how it all fits in our overall data engineering journey. I’m no power user of dbt and this may seem superficial to more experienced users, but the goal here is to help new users by sharing my own journey so far.
Getting started
As I mentioned, I have the good fortune to be working with clients who are really keen on adopting the technologies that will make their analytical stack more robust and scalable. And that was actually one problem I was facing with one project. What had been deployed so far worked perfectly for the needs that were specified, but as new requirements came in, it was obvious that our solution wasn’t scalable. We were using python scripts to do the ETL stuff, and that was neither agile, nor DRY (Don’t Repeat Yourself), nor fun. So we wanted to improve our ETL approach, and dbt just fell on our lap right there and then.
Another project is with a startup that just started investing into their analytical infrastructure. This is pretty rare, but here I was with the opportunity to propose and build an analytical stack from scratch. Of course, I wanted dbt to be part of it.
So this is how I started looking more seriously at what dbt had to offer, how it could be used in a production environment, how we could test it and how it could handle future needs.
What’s interesting with dbt is how instead of doing ETL, you’re really taking advantage of the power of cloud-based data warehouses to do ETL instead.
What we are doing is that instead of adopting a traditional ETL approach that requires the maintenance of scripts that does all 3 steps (extract, transform and load data towards the data warehouse), we are moving to a ELT approach where data is first moved (extract + load) to a data warehouse and transformation is done on the data warehouse directly.
That has the benefit of reducing the complexity of scripts (we are now only dealing with pure SQL scripts), maintenance (all models are simple select statements that are ran through a tool called dbt – data build tool) and integrity (once business logic is added within a model, all further models down the chain will always reuse that same business logic).
What is dbt and how does it fit
I should first mention that dbt is the work of the good folks at Fishtown Analytics. I believe that dbt was built while the co-founders were working for another company and that they kinda kept dbt with them as they built their new company (I might be completely wrong on this one). But the point is that they developed that tool, made it publicly available and are building a great community around it. My hat’s off to them – we should all aspire to bring such excellent tools to our data engineering community.
So, what is dbt?
I like to think of it as SQL on steroids.
Essentially, what we do is that we are building SQL models on top of source tables (coming from your production database, cloud services, website events, etc.) which are then referred to by further models, acting as building blocks towards a final model that’s part of a mart within your data warehouse.
For example, let’s say I want to build a fact table for transactions. But transactions may be coming from different sources. I would build models that refer to one another until you get to that final table. Here’s an example taken from a project I’m working on…
That graphic above might seem like a nightmare, but there is method to that madness (I think, lol). Essentially, you’re looking at models on the left that refer to tables in staging area, a bunch of transformations in the middle, and the final entity at the far right. We’ll talk more about this below, as well as how to generate that graph.
HAVE YOU SEEN OUR PRODUCT ANALYTICS NEWSLETTER?
OUR MISSION IS TO SHARE DATA ANALYTICS’ BEST PRACTICES AND NEW IDEAS WITH PRODUCT OWNERS, SO THEY CAN IMPROVE PRODUCT-MARKET FIT AND ACCELERATE THEIR DEVELOPMENT CYCLE.
Community
Now that we have a high-level understanding of what dbt is and how it fits within your analytical stack, I think we should talk a little bit about its community, because it does make working with dbt such a positive experience.
We should mention first that dbt is totally open source and you can go play inside the beast if that’s your thing.
Once you’ve satisfied your curiosity, you should definitely consider joining the dbt community on Slack. This is a pretty active community composed of members who are really helpful to one another. And the creators of dbt are also always present and ready to help whenever you stumble upon a problem. But what’s pretty cool also is that the discussions will definitely make you a better data engineer, as some of the questions/answers and the resources shared do open up your horizons to many other facets on how to build a modern analytical stack.
One other resource that’s starting to pick up steam is dbt’s Discourse environment. This is where you can find some pretty thorough answers to commonly asked questions. Or at least that’s what the idea of it is. This is still pretty new, but I think there’s a lot of potential with this and there are already some gems that are living there.
And last but not least, dbt’s documentation. This of course will be a resource you refer to often as you start working with dbt, but also as you get more sophisticated with the use of the tool. There are solid documentations here, which pretty much answers all the questions you might have in regards to using the different features of this tool.
So, all in all, I think that this community is as important to the success of dbt as the tool itself. You can always count on the documentation, the already answered questions in Discourse, or the availability of other community members on Slack, to point you in the right direction.
First Steps
Alright, now that we completed that grand tour of dbt, it’s about time we get our hands dirty.
To get things started, I would definitely recommend going through that dbt tutorial provided in the documentation. You’ll learn how to install dbt and create a project.
There are also useful guides, which are to be consulted as you discover some of those aspects of dbt. But I would definitely recommend going through the “Configuring models” guide, as this is essential skill to work with dbt.
I would also recommend that you take 20 minutes of your time and listen to the “Project Walkthrough Video”, which actually confirmed to me that dbt was actually as easy as I suspected it to be. Up to that video, I was convinced that I was missing something that would make dbt complex to use, but nope. If you know SQL, you’ll be confortable with dbt right from the start.
Hopefully, by this point you have a good understanding of dbt, how it fits within your architecture, how to set it up and have created your first project. Now what?
Well, you would need some data of course. Without going over the details, because this is out of scope for this article, let’s assume you have a few source tables in a Redshift/PostgreSQL database.
An example for me was to transform an export of Eventbrite registrations into a clean DW structure to be used in Tableau. My staging model would look something like this…
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
SELECT CAST ( ‘eventbrite‘ AS VARCHAR ) AS source, CAST ( "Order #" AS integer ) AS registration_natural_key, CAST ( "Event ID" AS integer ) AS event_natural_key, CAST ( "Event Name" AS varchar (64)) AS event_name, CAST ( "Attendee #" AS integer ) AS attendee_natural_key, CAST ( "Email" AS varchar (128)) AS attendee_email, CAST ( "First Name" AS varchar (32)) AS attendee_first_name, CAST ( "Last Name" AS varchar (32)) AS attendee_last_name, CAST ( "Job Title" AS varchar (128)) AS attendee_job_title, CAST ( "Company" AS varchar (32)) AS company_name, CAST ( "Ticket Type" AS varchar (32)) AS tickets_type, CAST ( "Quantity" AS smallint ) AS tickets_quantity, CAST ( "Total Paid" AS decimal (6, 2)) AS transaction_amount, CAST ( "Currency" AS char (3)) AS currency, CAST ( "Order Date" AS timestamp ) AS registration_datetime</pre> FROM eventbrite.registrations |
I could then derive entities such as dim_event…
1
2
3
4
5
6
7
8
9
10
11
12
|
WITH registration AS ( SELECT * FROM ref( ‘stg_eventbrite_registrations‘ ) ) SELECT DISTINCT dbt_utils.surrogate_key( ‘registration.source‘ , ‘registration.event_natural_key‘ , ‘registration.event_name‘ ) AS surrogate_key, registration.source, registration.event_natural_key AS natural_key, registration.event_name AS name FROM registration |
dim_attendee…
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
WITH registration AS ( SELECT * FROM ref( ‘stg_eventbrite_registrations‘ ) ) SELECT DISTINCT dbt_utils.surrogate_key( ‘registration.source‘ , ‘registration.attendee_natural_key‘ , ‘registration.attendee_email‘ ) AS surrogate_key, registration.source, registration.attendee_natural_key AS natural_key, registration.attendee_email AS email, registration.attendee_first_name AS first_name, registration.attendee_last_name AS last_name, registration.attendee_job_title AS job_title, registration.company_name FROM registration |
and fct_registration…
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
WITH registration AS ( SELECT * FROM ref( ‘stg_eventbrite_registrations‘ ) ) SELECT dbt_utils.surrogate_key( ‘registration.source‘ , ‘registration.registration_natural_key‘ , ‘registration.attendee_natural_key‘ , ‘registration.event_natural_key‘ ) AS surrogate_key, registration.source, registration.registration_natural_key AS natural_key, registration.attendee_natural_key, registration.event_natural_key, registration.tickets_quantity, registration.tickets_type, registration.transaction_amount, registration.currency, registration.registration_datetime FROM registration |
SinterData provides a tool to view the relationships between our models, which when our data warehouse becomes complex, is a great tool to have at our disposal. For our current example, we end up with the following DAG (directed acyclic graph)…
And that’s it, we just created a first version of our data warehouse.
Good practices
The example above is pretty basic, but as things become more complex, you’ll need to start picking up some best practices. Some of them, you can introduce them right away, some of them you’ll learn of as you go, and some of them will impose themselves upon you.
Here’s a non-exhaustive list of “good practices” I picked up during my first few projects.
Folder Structure
How you structure your data warehouse and how you structure your dbt/models folder are 2 entirely different things. There was some good discussions around folder organization in dbt, but I think in the end it all depends on how you would like to work.
Here’s what works for me. I have 4 main folders:
- Staging – This is where I create my foundational models on top of the source tables that were exported to the data warehouse. I do a selection of fields, typecast, filter and rename the fields if necessary. You know, basic maintenance.
- Transform – This is where the bulk of the work happens. It’s where staging models are being used and transformed to become the final entities. You would join sources, add aggregates, filter, etc.
- Entities – This is where I have my final fact and dimension models.
- FinalModels (I usually have a more business-focused name, such as ‘Marketplace’) – This is where I create the main queries that are being used by BI tools. There are discussions in regards to building your final model within dbt or directly in your BI tool, but I personally like to have the “big ones” here in dbt with all other queries.
In regards to the Transform table, even here I have a specific folder structure underneath. Here’s how it looks:
- EntityName
- Source1
- EntityName_Source1_1_FirstTransformation.sql
- EntityName_Source1_2_SecondTransformation.sql
- Source2
- EntityName_1_MergeSources.sql
- Source1
As you can see, I like to keep it structured by source and sequence. This really helps whenever I have to introduce new features or when debugging.
Testing Suite
Here’s something that I didn’t use first, but that was always in the back of my mind.
I actually was looking for a solution outside of dbt, because I had a specific test scenario that I wanted to do and didn’t know much about the testing framework in dbt to get it working with dbt itself. So I had asked my question on Slack directly and Tristan pointed me out to custom schema tests.
And that kinda propelled me in the world of testing within the confines of dbt. The Data Build Tool already has a few tests that can be used right off the box, but there’s also the dbt-utils package which provides other tests that can be used in your own suite of testing.
Right now, for one project, I have over 60 tests that I run in my test environment after doing changes and that I run automatically every hour in the prod environment. Whenever a test fails in prod, an alert is being sent to our Slack channel.
Amazing!
Data Quality Exploration
Once I’ve made changes in my dev environment, built my models and tested them, I like to use Tableau to actually visualize my Data Warehouse before it gets built in prod. Some sort of Quality Assessment.
I’m not doing anything fancy at this point, but I do like to plot my main entities’ data sets to make sure that no weird anomalies were introduced during development.
The testing suite that was discussed previously safeguards my data at the atomic level, but my Tableau QA workbook safeguards data quality at the aggregated level.
Work Flow
That brings me to my overall work flow. Here are the steps I take when introducing a new feature within a data warehouse:
- Research/Exploration into required fields and their sources
- Write up tests as acceptance criterias
- Build data models in dbt
- Testing in development environment
- Quality assessment with Tableau
- Commit code to git repository
- Pull new code in prod
- Build new ETL in prod
- Testing in prod environment
- Integrate new feature in final BI report
- Push to Sandbox for client approval
Extras
I did briefly touch on those previously, but there are a few extras worth knowing about:
- dbt-utils – This package is provided by the creators of dbt and packs in a few extras that are worth looking at. I personally use only a few of what’s provided (such as the ‘surrogate_key’ SQL helper, as seen in the models above), but there is a lot to explore here and that might also be useful.
- Sinter – Not sure how that service is associated to dbt and Fishtown Analytics, but their service is all about dbt. Essentially, they provide the means to run, schedule and manage your dbt jobs. Simple, yet really well made.
- dbt Graph Viewer – Actually provided by the folks at Sinter, this little tool is something I use frequently. Not only can you see the whole DAG of your project, but you can also focus on only one model and look at their parent and/or children models.
I should also quickly talk about macros, which are potentially a big thing, but that I haven’t gotten around to. Although I did just read that article by Claire Carole on design patterns which talks about maintaining your code DRY, using Jinja’s templating language and the introduction of macros. Anyways, still a whole lot of stuff to discover
The Big Picture
As you can see, dbt is more than just a simple tool. I like to think of it as a core tool within your data warehouse building arsenal.
You can think of it as following the principles of the Unix Tools Philosophy, where the goal is to have an ensemble of tools that do one job only, but do it really well.
But something else that is important in that philosophy. It is for those tools to easily interact together (by the use of the pipe “|” symbol which pipes output from one tool as input to another tool). That dbt keeps really close to SQL is what makes it so easy to adopt, but also so easy to interact with many other tools. It’s a common language that allows for outputs to become inputs for another tool.
But also, the choice of SQL is one that assures interoperability within a modular BI architecture. As is argued by Ajay Kulkarni, co-founder of TimeScaleDB:
“We believe that SQL has become the universal interface for data analysis.”
(Source: Why SQL is beating NoSQL, and what this means for the future of data)
There is a lot that’s been said and written about the modern BI architecture, with cloud-based services as their underlying infrastructure and modularity as a core principle. Having tools such as dbt adds to that philosophy by being itself modular (and replacable), but to actually be very flexible so that it can work with many different inputs (data sources) and outputs (data warehouses).
Here’s to your dbt journey, to the future development of dbt and the community that is being fostered around that project!
以上是关于My journey introducing the data build tool (dbt) in project’s analytical stacks的主要内容,如果未能解决你的问题,请参考以下文章
/*unfinished*/ Summaries and my Random Thoughts of Core Java (10th edition) by Cay S. Horstmann(示例代码
[Journey with golang] 0. Planning for the New year and new journey