Edit [241010]: I’ve received some feedback from people about my writing. I will take some time to rewrite this, to make my point clearer and invite better discussion.
Data has become just about the most important thing to any business in recent years. Companies and organizations try to record everything under the sun, hoarding zettabytes of data, and churn out job title after job title to crunch and make sense of it: Data Analyst, Data Scientist, Data Engineer, Data Architect, ML Engineer… you name it. It doesn’t stop there: Developers need to care about the data in the systems they are building, and Managers and Executives need to get reasonable insights out of that data. The more chefs involved in cooking the broth, the messier it becomes.
My perspective is that everything can be treated as data, and there is no way to model a system that encompasses all kinds of data and all possible data use cases. What we can do, however, is conceptualize the way we look at data into what I call the three main Data Perspectives. Each way of looking at the data brings a different understanding of it, and different concerns about it.
The Transactional Perspective
This is how the system builders (the engineers) look at the data and the entities involved. The data is viewed from the perspective of the business process: a sequence of events or instructions with some data passed in between. For an Ecommerce store, that could be fetching a list of SKUs based on a certain filter, or a user placing an order from the checkout page. For a ride-hailing app, it could be fetching a list of nearby drivers.
One could say the people building the system care about how the data is Created, Read, Updated and Deleted. Nicely summed up: CRUD, hah.
The complexity of the data models comes from the different access patterns they need to cater to. The engineers think about the different ways an entity will be accessed and updated, which other entities need to be associated with it, and which conditions will be used for filtering or search. Those requirements become the basis for the database design: tables, indices, keys and constraints, relational models, etc.
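To make the access-pattern idea a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the index name are all made up for illustration; the point is only that one access pattern ("list a user's orders by status") drives one design decision (a composite index), with the four CRUD operations around it.

```python
import sqlite3

# Hypothetical e-commerce schema; every table and column name is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        user_id     INTEGER NOT NULL,
        status      TEXT    NOT NULL,
        total_cents INTEGER NOT NULL CHECK (total_cents >= 0)
    )
    """
)
# Access pattern "list a user's orders by status" motivates this composite index.
conn.execute("CREATE INDEX idx_orders_user_status ON orders (user_id, status)")

# Create: three orders for two users.
conn.executemany(
    "INSERT INTO orders (user_id, status, total_cents) VALUES (?, ?, ?)",
    [(1, "paid", 2500), (1, "pending", 1200), (2, "paid", 900)],
)
# Update: user 1's pending order gets paid.
conn.execute(
    "UPDATE orders SET status = 'paid' WHERE user_id = ? AND status = 'pending'",
    (1,),
)
# Delete: remove all of user 2's orders.
conn.execute("DELETE FROM orders WHERE user_id = ?", (2,))
# Read: the query that the index above is meant to serve.
paid_orders = conn.execute(
    "SELECT id, total_cents FROM orders WHERE user_id = ? AND status = ? ORDER BY id",
    (1, "paid"),
).fetchall()
print(paid_orders)
```

If a second access pattern shows up later (say, "all pending orders across users"), this index may no longer be a good fit, and the schema has to grow with it.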
In the iterative development process of a tech product, there is a greater-than-zero chance (well, a lot greater than zero) that a new access pattern will emerge that is not well served by the incumbent data model. Under constraints of time and resources, developers sometimes keep using the old data model in a less-than-ideal way, and sometimes a data migration happens to convert the data to a new version of the model.
The Reporting Perspective
The “biz people”, maybe the marketing managers or the executives, might not spend much time thinking about the nitty-gritty details of database design. Their concerns generally revolve around “What are we doing? Are we doing okay? How can we do it better?”. From the data perspective, those questions are answered through reports and metrics, the source of all knowledge, insights and decision-making rituals.
This is where all the jargon and the impressive-sounding abbreviations come from. How’s my GMV doing this month? What was the CTR for last month’s campaigns, and which one did the best? Are we hitting the ARR mark for this fundraising round?
The “Data people” comb through all the data and find a way to build those reports and report on those metrics. That’s not all there is to it. People want to see the metrics at different levels of granularity: time (year, quarter, month, week, day), geography (region, country, city, or even district), category (product, price range, brand, origin)... you name it. You start from the smallest unit of reporting (the grain), and that’s where dimensional modeling concepts come in. Data from the source systems does not come nicely packaged and ready to use; a lot of preparation work is needed to convert it into a more usable format. “Data people” build reports and metrics on top of dim and fact tables, construct data cubes for reporting purposes, plug in their favorite BI tools, and draw up all those nice graphics. Tada, happy executives (well, only if the numbers look nice).
There’s also an operational side to the reporting perspective. Sometimes, upon inspecting the higher-level data, people discover an anomaly in a subset of it, and they request data at a lower level to investigate more deeply. Being able to report on high-level data and being able to dig deeper for operational needs are both crucial for the business from the reporting point of view.
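Here is a small sketch of the grain, roll-up and drill-down ideas in plain Python. The fact rows, their field names, and the roll_up helper are hypothetical; a real setup would live in a warehouse with proper dim and fact tables, but the mechanics are the same: pick a grain, aggregate upward for the report, then filter and re-aggregate at a finer level to investigate.

```python
from collections import defaultdict

# Hypothetical fact rows at the grain "one row per day, city and category".
fact_sales = [
    {"date": "2024-09-30", "country": "SG", "city": "Singapore", "category": "snacks", "gmv": 120.0},
    {"date": "2024-10-01", "country": "SG", "city": "Singapore", "category": "snacks", "gmv": 80.0},
    {"date": "2024-10-01", "country": "VN", "city": "Hanoi", "category": "drinks", "gmv": 50.0},
    {"date": "2024-10-02", "country": "VN", "city": "Ho Chi Minh City", "category": "snacks", "gmv": 70.0},
]


def roll_up(rows, key_fn):
    """Sum GMV over the groups that key_fn assigns each row to."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_fn(row)] += row["gmv"]
    return dict(totals)


# Roll up from the daily grain to month x country for the high-level report.
gmv_by_month_country = roll_up(fact_sales, lambda r: (r["date"][:7], r["country"]))

# Drill down: something looks odd for VN in October, so break it out by city.
vn_october = [r for r in fact_sales if r["country"] == "VN" and r["date"].startswith("2024-10")]
gmv_by_city = roll_up(vn_october, lambda r: r["city"])

print(gmv_by_month_country)
print(gmv_by_city)
```

Both views come from the same rows; only the grouping key changes, which is why choosing the grain carefully matters so much.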
The Entity-based Perspective
Every business has a few core entities that it wants to observe a lot more closely. User is always one of them, and the rest depend on the business: it could be Merchant for a marketplace business, SKU for Ecommerce, or Driver for a ride-hailing one. Depending on its strategy and operations, the business wants to tag, label, or annotate these entities with different information, whether for operational purposes or for research and experimentation.
Let’s take User as an example. There are many, many ways to tag users:
Descriptive details: age, gender, signup date, or maybe location
Historical interactions: basket size, order frequency, favorite brand, last 5 orders, search history
Classification, behavioral or predictive: membership tier, deal hunter or daily user, fraud rate
Experiment: control group, experiment group
and many others. You can tag someone with their astrology sign or favorite comic series, because why not.
There is no upper limit on the ways we can tag an entity, and no boundary on what we can use the tags for. You can also tag just a subset of users during an experiment and ignore the rest. This way of thinking about the data might not fit very well in a relational data model, and might fit better in a non-relational, document-based structure.
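As a rough sketch of what "document-based" tagging can look like, here is an in-memory version in Python where each user is just a free-form dictionary of tags. The user IDs, tag names, and the tag_user and users_with helpers are all hypothetical; a real system would more likely sit on a document store, but the shape of the data is the point.

```python
# Hypothetical tag store: one free-form document of tags per user.
user_tags = {
    "u1": {"age": 31, "membership_tier": "gold", "experiment": "control"},
    "u2": {"membership_tier": "silver", "favorite_brand": "Acme"},
    "u3": {},  # a user does not have to carry any tag at all
}


def tag_user(store, user_id, **tags):
    """Attach or overwrite arbitrary tags on a user, creating the user if needed."""
    store.setdefault(user_id, {}).update(tags)


def users_with(store, **conditions):
    """Return the sorted IDs of users whose tags match every given condition."""
    return sorted(
        user_id
        for user_id, tags in store.items()
        if all(key in tags and tags[key] == value for key, value in conditions.items())
    )


# Tag only a subset of users for an experiment and leave the rest untouched.
tag_user(user_tags, "u2", experiment="treatment")
# A brand-new kind of tag needs no schema change.
tag_user(user_tags, "u4", astrology_sign="leo")

print(users_with(user_tags, experiment="control"))
print(users_with(user_tags, membership_tier="silver", experiment="treatment"))
```

Notice that no two users need to share the same set of keys, which is exactly what makes this awkward to squeeze into fixed relational columns.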
The most common approach to this Entity-based perspective is what people often tout as a CDP: Customer Data Platform. In this case, the Customer is the main entity that we want to observe; we decorate the entity with as much information as possible and view everything from the Customer perspective. One of the main use cases for a CDP is user segmentation and marketing automation: we automatically tag customers with metrics and other information, and based on some filter conditions we trigger marketing activities (e.g., distributing vouchers to the top 10 users with the highest spending in the last month).
The Entity-based perspective is also what powers most data science systems. We look for ways to classify the data, or, based on the classified data, we set up experiments to test a hypothesis or algorithm, or to tackle an optimization problem. It could be feature engineering, A/B testing setup, online model serving, fraud detection, you name it.
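The voucher example can be sketched in a few lines of Python. Everything here is an assumption for illustration: the order tuples, the top_spenders and send_voucher helpers, and the tie-breaking rule (by user ID) are mine, and a real CDP would do this through its own segmentation and campaign features.

```python
from collections import Counter

# Hypothetical (user_id, order_amount) pairs for last month's orders.
orders_last_month = [("u1", 40.0), ("u2", 15.0), ("u1", 25.0), ("u3", 90.0), ("u4", 15.0)]


def top_spenders(orders, n):
    """Return the n user IDs with the highest total spend, ties broken by user ID."""
    spend = Counter()
    for user_id, amount in orders:
        spend[user_id] += amount
    ranked = sorted(spend.items(), key=lambda item: (-item[1], item[0]))
    return [user_id for user_id, _ in ranked[:n]]


def send_voucher(user_id):
    """Stand-in for the marketing action; a real system would call a campaign tool."""
    return f"voucher issued to {user_id}"


# Segment first, then trigger the action for everyone in the segment.
# n=2 keeps the toy data readable; the rule in the text would use n=10.
segment = top_spenders(orders_last_month, n=2)
vouchers = [send_voucher(user_id) for user_id in segment]
print(vouchers)
```

The segment is just a derived tag ("top spender last month") on the Customer entity, and the marketing action is whatever gets triggered for users carrying that tag.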
Combining the views
In recent years, most businesses have already started their data journey, and at decently tech-capable businesses, people are already looking at the data from multiple perspectives. In some parts of the business, incorporating multiple ways of thinking is already a must-have when designing large-scale data systems.
For example, with an online fraud detection system, the operational system needs to retrieve the data and the predictive model (which comes from the entity-based perspective) to flag and handle fraudulent transactions. Reporting systems need to work closely with user classification systems to identify strengths, weaknesses, or even new business opportunities. For data science infrastructure, one huge initiative is the feature store, which takes input from, and serves output to, both online and offline processes.
The main dissonance in a business’s data initiatives can arise from this concept of data perspectives: the underlying data is the same, but people look at it differently and have different concerns. The problem gets worse when people try to tackle organization-wide data initiatives without enough care for the different perspectives. Examples include data access and security initiatives (which may work for operational and reporting systems, but might fall short for tag-based systems where tags can be created and generated arbitrarily), data dictionaries, data governance, and PII classification and protection.
Conclusion, or not
Business data at scale is still a hard problem for anyone to tackle. The Three Data Perspectives is just one way that I personally approach it, and maybe a starting point for looking at Data Architecture. Understanding how people look at and use the data differently can be the start of building a strong data foundation for any company.