Layering Your Data Warehouse
Hello I am Mitchell Silverman! I have worked in data over the last decade at a number of tech companies including WayUp, WeWork, and now at Spotify. I am starting this blog as a means of sharing some of my experiences so that people can learn from some stumbling blocks I have hit throughout my data journey.
I never thought I would be comparing my work in data engineering to the great Mike Myers but Ogres and Data Warehouses have a lot in common. Both are misunderstood by most and both can save the day when called upon.
In a past life I was a data engineer within the Member Experience mission at WeWork. Our team worked on the technology that made our members’ lives easier and turned their office routines into seamless experiences. We owned complex data sets that would weave an intricate web with each other. These data sets are used in different combinations by stakeholders on every side of the business, and therefore have unique data wants and needs from their counterparts
- Business stakeholders: need to understand their team’s performance so that they can tweak what they are working on to make things more effective and efficient. The data models they request are typically to see how many active users we have or if we are getting users to come back more often.
- Technology stakeholders: want to make sure that the new feature that they are testing or recently released is working correctly and that they are able to track success metrics associated with it. The data models they request are typically to see how long it takes a user to do something, or if they have been able to get users to convert to using a new feature.
Most data professionals have run into at least one of these kinds of stakeholders in their careers and many have to balance the needs of both, leading me to write my first blog post about layering within warehouses because it is a key first step to satisfying the needs of both of these stakeholders with the least amount of discrepancies. In order to do this, companies must use the same business logic for multiple kinds of analysis and seems repetitive but idk they are going to need to layer their data (thanks for hanging in there while you tried to figure out why this guy is using a Shrek GIF).
What’s a dbt?
So I am here to tell you a little bit about how we define our layers using the magic of dbt.
For those who are not familiar with dbt: it is an open-source command line tool, created by Fishtown Analytics for data professionals, that allows users to create data models without the worry or fear of creating circular dependencies. It also has many bells and whistles like macro creation, schema testing, and post hooks for your commands to run after your models are run in dbt.
I had some experience with dbt from a previous employer and I knew how much it could help our organization and our data modeling efforts. So when I joined the team, my first big project was to implement DBT and come up with a proposal for how we would structure our new data warehouse. I based my recommendation off of the work that the great people at Fishtown Analytics but tweaked it slightly to fit the needs of our business.
What are “Data Layers”, and why use them?
Layers are important to data warehouse solutions because they allow data professionals to be agile and base all their reporting tables on the same underlying logic. The below layers represent how we layer data tables on top of each other so that we can ensure that all logic is coming from a single source and not being created on an ad hoc basis.
What are the 4 layers in our dbt project?
Depending on your business needs, you might find that a different configuration or number of of layers will fit your needs the best, our particular team had identified 4 layers of warehouse is what we need for our particular use cases
- Root Layer
- Logic Layer
- Dimension and Activity Layer
- Reporting Layer
But Mitchell, why do you need 4 layers? For us 4 layers were key for our business needs. We have multiple data visualization tools, multiple technical teams and business stakeholders using our data and scaling issues to solve.
#1 The Root Layer
The first layer is the landing zone for any initial data ingestion. Dedicating a layer to the root, or source data, helps to safeguard against future changes in data structure and field names.
You can create tables in this layer as either as views or as what dbt calls ‘ephemeral’ tables. This means that the logic you are creating is not stored in your database and taking up valuable store space. The model is only stored as logic to be created as a CTE (Common Table Expression) when called upon as a reference. This layer provides a layer of protection in case your organization decides to change the names of source tables or changes tools to ingest your data.
This layer also serves as a place to cast fields to the correct data types as well as do simple transformations like timezone conversions.
Let’s say for example that your team is using Heap to ingest data from your application and tomorrow you decide to just use Segment instead, or a data source that you rely on upstream is changed overnight and the schema no longer matches exactly.
The Root layer exists to make sure that when that migration happens, you do not need to union every table referencing your Heap tables or change every reference in your downstream layers.
The root layer serves as a pass-through for the rest of your data models to easily be able to put things into dbt syntax and ensure that all tables that use business logic have been cleaned up and are ready to be used for building data models
(Don’t try to fix it upstream)
#2 The Logic Layer
The logic layer is where the heavy lifting of your data transformation is done.
This layer is rightfully named the logic layer because this is where you will begin to institute your business logic. This is the layer where you can also begin to materialize your SQL code into views and tables depending on your needs and the size of your warehouse storage.
This is also where you will begin to introduce complex joins and bring your data sources together. For a data professional, this is the layer where the sausage is made. It is not always the cleanest or the easiest, but you get a wonderful product at the end of it.
The logic layer is where our team will do things like user mapping across different tools and sources and creating our own session definition for our app users. These are data models that will get referenced often throughout the rest of the project and because of dbt, we don’t need to worry about scheduling issues or circular dependencies on these tables.
This layer will also serve as the foundation for our activity and dimension tables and is meant as a data engineering layer. This is where data engineers earn the big bucks and do the heavy lifting to allow for easier joins and aggregations downstream. Data scientists can use this layer but only for very targeted projects that look to analyze one individual dimension or feature.
#3 The Dimension and Activity Layer
This layer is where we begin to surface data models for others to use and create their own analysis
This layer is where we begin to surface data models for others to use in their own analysis and projects. You can also look at this as a traditional star or snowflake schema, with clean dimension and fact tables.
It is also the layer where you build your activity streams to allow for data analysts and scientists to do analysis on your data. This layer exists to make sure that data scientists can focus on analysis and not worrying about needing to do complex joins and unions to get their work to make sense.
The dimensional models in this layer should be created in order to include all relevant dimensions related to an entity. These tables should be slowly changing over time (because they will be rebuilt everyday) and should not include any kinds of aggregations.
The activity streams in this layer are meant to include all activities related to an entity. These tables should use uniform user id’s and session ids where applicable. Make sure to check out this snazzy macro in dbt for faster and easier Union queries.
#4 The Reporting Layer
The reporting layer is needed for all your analysis and visualization needs
This layer is for your metric calculations and any levels of aggregation you might need for reporting. We also include some raw totals in this layer to make this more flexible, but overall we encourage data engineers to capture as many metrics as they can in this layer. This layer should allow for three main functions on your team:
- Allow business analysts to run their own SQL queries without needing to know much more than the SELECT, FROM, WHERE, and GROUP BY functions in SQL.
- Standardize metrics for any data visualization tool that you might be using.
- Complete complex joins for pre-joined data that allows data scientists to run more performant queries.
Some Lessons Learned
On initiating Adoption Across Data Teams
Once we decided to move forward with dbt, our team needed to figure out the best way to adopt the tool across the other data teams and disciplines
Initiate interest via a roadshow. I went on a dbt world tour, scheduling introduction sessions with teams outside of my mission, to teach other data professionals how dbt could help their data pipeline and make their lives easier. Since this roadshow, we have had more teams adopting the tool, seeing the value in it as we prepare to change data warehousing solutions.
On Stakeholders who sacrifice Long-Term Vision for Short-Term “Urgent Requests”
At times it can be difficult to say no to a stakeholder with a two-week deadline for a big meeting that is going to change the course of history as we know it. We ran into this pitfall during the early stages of our dbt project but since then have taken steps to refactor our work and be more design-driven.
It is important to say no or to make your stakeholders be a little bit patient so that you do not sacrifice the design and integrity of your data warehouse for a ad-hoc request that needs to be filled. It’s always helpful to also call out and vouch for the benefits of a well-designed data warehouse that will outweigh near-term output so they understand it’s all in their best interests.
We are currently working on creating a dbt project that is more driven by the entities that matter to the team and less about the reports that we need to output for others. This will allow us to be both agile in our work, as well as bring down wait times for requests that come our way.
It’s totally worth the effort!!
Within a couple of quarters of work, we started seeing the benefits of standardizing our business logic and giving our data scientists the confidence to believe in our data modeling efforts. We also now have robust data dictionaries that allow for easier discovery.
Since writing this original blog post I have now implemented this system at a new company and learned even more about the benefits of dbt, stay tuned for my next blog post about folder structure and adopting Bigquery as a data warehousing tool.