Data is Good
Written by Luke Ferguson
The more pedantic film fans amongst you will have picked up on how the title of this post is a play on the iconic Gordon Gekko line in Wall Street (1987). Others will know it as the often misquoted “greed is good”, but either way, you get the sentiment, right?
This isn’t the only film quote that goes through a process of manipulation before being repeated and printed on mugs and t-shirts the world over, as we can see from the following query:
That “SELECT * FROM movie_misquotes;” is a query, a select query to be more specific. But you already knew that, right?
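If you fancy poking at that query yourself, here's a tiny sketch using Python's built-in sqlite3 module; the movie_misquotes table and its contents are, of course, entirely made up for illustration:

```python
import sqlite3

# In-memory database standing in for the (entirely fictional) movie_misquotes table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie_misquotes (film TEXT, actual_quote TEXT, popular_misquote TEXT)"
)
conn.execute(
    "INSERT INTO movie_misquotes VALUES "
    "('Wall Street (1987)', "
    "'The point is, ladies and gentleman, that greed ... is good.', "
    "'Greed is good.')"
)

# The select query from the post: fetch every row and every column.
rows = conn.execute("SELECT * FROM movie_misquotes;").fetchall()
for film, actual, misquote in rows:
    print(f"{film}: '{misquote}' (actually: '{actual}')")
```

`SELECT *` means "give me every column", and with no WHERE clause you get every row too, which is exactly what you want when browsing a table of misquotes.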
Yeah, you know all about data and databases, what a data lake is, the difference between a data lake, data warehouse, a database, and a data dictionary. And data modelling, schemas, CTEs, foreign keys, normalisation, surrogate keys and cardinality, right?
Of course not! And you don’t need to. Don’t worry about it, I don’t even need to know what any of that stuff means. At least I hope I don’t. The only thing round here that needs normalising is my pedantic obsession with classic movie quotes, am I right?
This does slowly and meanderingly bring me to my point though. The writers and actors involved in these movie quotes have no control over how their art (or information) will be remembered, manipulated, and misinterpreted.
This is rarely deliberate; we all have that one friend or family member who is forever misquoting song lyrics — my particular favourite being Monster (What’s That Coming Over The Hill) by The Automatic being reimagined as “What’s that coming over the hill? Is it a bus stop? Is it a bus stop?”
But it’s a challenge that we in data must be extra mindful of. In other areas of the business, tech change is often represented by new processes and functionality (where you click button X to initiate task Y): defined causality that’s difficult to stray from, and when someone does, we implement an unhappy path or an exception process. Nice and clean, easy to manage, everybody’s happy.
In data though, when we deliver new functionality (a new report, additional data points, whatever it is), once an end user consumes that information, we have very little control over where it ends up and how it’s used. Before we know it, it’s on an Excel spreadsheet, in the body of an email, or printed out and stuck on the wall. The dynamic, live data source that we were so proud of lives on as a static snapshot.
The intended use of data doesn’t persist in the way that other IT change does. Software engineering teams are often custodians over changes of use and process tweaks, but if a stakeholder wants to use a piece of data in a way other than it was intended there’s very little we can do to manage that.
This risk can be very small, but soon rises when we start to introduce ambiguity. Talking about cost data, for example, what is “cost” anyway? Is it the price we pay a supplier for a product? Does that price include delivery to Dunelm? Does it include the delivery cost out to the store, or the customer? What about labour costs to pick/pack/re-stock the product, or serve the customer? Do we need to consider consumables like packaging? Or overheads like rent, head office staffing costs, insurances? In a way all the liabilities on the balance sheet are “costs”, right?
One way we can mitigate these risks is by providing stakeholders with context around the data they’re consuming, being transparent around the processes that have generated that data and encouraging critical thinking in all data related activities. There’s a drive at Dunelm to become People Led, Data Driven, and in order to do that we need to foster an environment that encourages data fluency.
I’m a Business Analyst in one of our Data teams at Dunelm, and a key part of that role is to act as an intermediary between groups of people to encourage effective communication, collaboration, and understanding. Historically, the role developed out of a need for business users to interact with software teams to ensure that business needs are being prioritised during development and that the end product aligns to the business goals and values. One way I ensure that I’m performing this aspect of my role is through what I refer to as making the complex simple.
There’s no set of techniques and tools I use to do this; it varies based on the situation and the stakeholders I’m dealing with. So here goes: how can I demystify some of this data nonsense and perhaps make you all a little more data fluent? How can we make some of this a bit more accessible?
Let’s start with something familiar: an Excel file or workbook, each page consisting of a grid of cells arranged in rows and columns. Generally, there’s a header row that tells you what each column means and various values in those columns relating to each row.
Essentially this sheet is a table. You’re probably familiar with that term — there are features and buttons in Excel that use it, and a database is made up of tables. A good high-level analogy for a database, therefore, is an Excel workbook: it contains multiple sheets (tables) that are all grouped together within the file (database).
One type of database we use at Dunelm is called a relational database, because the tables in that database can be related to each other. So you might have a table called CUSTOMER that contains the names of all your customers (and is accompanied by a robust GDPR compliant data management policy), and another table called CUSTOMER_ADDRESS that contains all of their addresses. You can relate the two through some kind of identifying value, for example a customer ID number.
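To make that concrete, here's a minimal sketch in Python's sqlite3 of two related tables joined through a customer ID; the table and column names (and Ada herself) are invented and won't match any real Dunelm schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schemas: one table of customers, one of their addresses,
# related through the shared customer_id column.
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_address (
        address_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        address     TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO customer VALUES (1, 'Ada Example')")
conn.execute("INSERT INTO customer_address VALUES (10, 1, '1 Sample Street, Leicester')")

# A JOIN relates the two tables through the identifying value.
result = conn.execute("""
    SELECT c.name, a.address
    FROM customer c
    JOIN customer_address a ON a.customer_id = c.customer_id
""").fetchall()
print(result)  # [('Ada Example', '1 Sample Street, Leicester')]
```

The JOIN clause is the relational database's version of looking a value up on another sheet, only it's enforced, indexed, and doesn't fall over when someone sorts a column.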
Relational databases are very good at pulling information together for analysis. They can work with large volumes of data at speed and maintain data integrity. Although you could set up a similar structure within Excel (a CUSTOMER sheet that has a unique column which is referenced from a CUSTOMER ADDRESS sheet), a relational database is much more powerful (yes — even if you’ve made the switch from VLOOKUP to XLOOKUP) than an Excel file. And yes, I have seen some of the macro enabled workbooks we have at Dunelm.
OK so table = sheet, database = workbook. In case you’ve not noticed, we’re getting bigger, so let’s move on.
What is a relational database?
A relational database organizes data into tables (like spreadsheets), where each table represents a different topic and rows in the table represent items within that topic. It’s great for managing data where relationships between different items matter, like linking customers to their addresses.
A data warehouse is essentially, well, a warehouse. Only instead of aisle after aisle of racking consisting of box after similar-sized box of flat pack furniture that you wish you’d made a note of the location of because the labels are tiny and none of the end caps tell you what’s in the aisles and the maze upstairs was bad enough but now you’ve got to put a warehouse shift in just to get Billy the bookcase home and find out there’s a part missing, or you picked two of the same box up because you’re not a qualified picker, you work in an office, you don’t know what you’re doing, I hope this place hasn’t got a self-checkout too, it’s bad enough you have to build the stuff, nobody ever mentioned you staffing the shop too, why didn’t you just order something off Dunelm.com, you could have done that from the sofa, wouldn’t have had to drive all the way out here stressed about fitting it in the car, full of meatballs that you never actually enjoy, let’s be honest… Sorry, back on task… instead of racking where each shelf location might contain a product, a data warehouse’s shelf location will contain a relational database. Now this is an analogy and there isn’t an actual warehouse, but if you can visualise it, we might be able to keep the analogy going. We’ve already talked about our CUSTOMER Database, which contains tables related to our customers; we might also have a SALES Database that contains (you’ve already guessed it, haven’t you) sales data, and maybe a PRODUCT Database too. Even an EMPLOYEE Database where you and I might be sat next to each other (again fully GDPR compliant, maybe even anonymised), imagine that?
Imagine a little digital picker in a little digital forklift truck who knows the layout of this data warehouse like the back of their little digital hand. They know where everything is and have access to every one of these databases. You could ask this little digital picker in a little digital voice (to protect their little digital ears) to tell you which customers have bought a specific product. Now, this little digital picker can nip up and down those aisles at the speed of light. They can go to the PRODUCT database and grab the unique product ID, then go to the SALES database to grab all the sales orders where the product ID was sold, and use the customer IDs on those orders to grab a list of customers who made those purchases. Remember this little digital picker is very fast, very small and most of all, most of all, completely imaginary.
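Stripped of the forklift, the picker's route is just a couple of joins. Here's a hedged sketch, with all three "databases" compressed into one SQLite database and all names, products, and customers invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-ins for the PRODUCT, SALES, and CUSTOMER databases.
conn.executescript("""
    CREATE TABLE product  (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales    (order_id INTEGER PRIMARY KEY, product_id INTEGER, customer_id INTEGER);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);

    INSERT INTO product  VALUES (100, 'Billy-style bookcase');
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO sales    VALUES (500, 100, 1), (501, 100, 2);
""")

# The picker's route: product -> sales orders -> customers, in one query.
buyers = conn.execute("""
    SELECT DISTINCT c.name
    FROM product p
    JOIN sales s    ON s.product_id  = p.product_id
    JOIN customer c ON c.customer_id = s.customer_id
    WHERE p.name = 'Billy-style bookcase'
    ORDER BY c.name
""").fetchall()
print([name for (name,) in buyers])  # ['Ada', 'Grace']
```

The database engine plans that route for you: you describe what you want (customers who bought this product) and the picker works out which aisles to visit and in what order.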
But what you can hopefully infer from this little analogy is that those individual databases in the warehouse shelf locations not only contain tables that are related to each other, but that tables across databases within the data warehouse can also be related back and forth. So we’ve got tables/sheets inside shelf locations (databases/workbooks) inside (data) warehouses.
What is a data warehouse?
A data warehouse is a large storage space that collects data from many sources within a company, organizing it for easy analysis and reporting. It’s designed to help businesses make decisions by providing a comprehensive view of their data over time.
Where does a data lake come into it? Well it doesn’t, not into this same analogy anyway. Well, not with that attitude… Imagine if you will, that our warehouse sits on the shores of a great lake. What kind of warehouse would sit on the shores of a great lake I hear you ask; I don’t know I’m making it up as we go along. Let’s call it a sea food and ocean salvage warehouse (it’s a really big, saltwater lake). I hope you’re still with me, I’m nearly finished I promise.
OK so this warehouse has little craft — shipping vessels, trawlers, submarines, random YouTubers with magnets tied to the end of ropes (magnet fishing, check it out) — that go out into the lake to retrieve things of value to store in the warehouse.
These craft go out into the lake, fishing up all sorts of seafood, treasure, shipwrecks, seaweed, tyres, old boots, shopping trolleys etc. They throw all the rubbish back in (remember it’s imaginary, so still sustainable, I think) and take anything of value back to the warehouse.
It really sounds like I’ve lost track now, doesn’t it, but here goes. So far, we’ve talked about structured data, which (you’ve guessed it) is in a structure: it’s formatted, and we know exactly where it is, in a field, in a table, in a database, in a warehouse. We even knew in advance what it was, so we assigned the field a data type. You can only store a number in a number field, a date in a date field and so on. We’ve gone through a lot of effort to keep that little digital picker sane and efficient.
The term data lake stems from the idea that data from various sources can be thought of as streams and rivers flowing into a larger lake. A data lake stores data in its raw format. It can be anything from a structured value (a person’s name, a product SKU, a date) to an image, or an email, or a semi-structured block of text including a person’s name, a product SKU, and a date, amongst other information (as it would be if a sales order landed in the data lake). The power of a data lake lies in its ability to store all this data in its natural state, allowing for greater flexibility in future analysis (I suppose I could have even added a punny “fluidity” comment here to increase the fun). We then have routines that run to bring data out of the data lake, clean it, validate it (“throwing back” the old boots and seaweed) and move it through into a database table in the data warehouse, so it’s ready to be analysed and aggregated into actionable data insights that enable the business to make informed decisions.
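A toy version of such a clean-and-validate routine might look like the sketch below; the record shapes, field names, and validation rules are all invented for illustration:

```python
import sqlite3

# Raw "lake" records as they arrived: some structured, some malformed, one old boot.
raw_landing = [
    {"customer": "Ada",   "sku": "CUSH-001", "qty": "2",   "date": "2024-03-01"},
    {"customer": "Grace", "sku": "CUSH-001", "qty": "one", "date": "2024-03-01"},  # bad qty
    "old boot",                                                                    # not even a record
    {"customer": "Alan",  "sku": "THRW-009", "qty": "1",   "date": "2024-03-02"},
]

def clean(record):
    """Validate and type a raw record; return None to 'throw it back'."""
    if not isinstance(record, dict):
        return None
    try:
        return (record["customer"], record["sku"], int(record["qty"]), record["date"])
    except (KeyError, ValueError):
        return None

# The destination: a typed table in the (toy) warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, sku TEXT, qty INTEGER, sale_date TEXT)")

loaded = [row for row in map(clean, raw_landing) if row is not None]
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", loaded)
print(len(loaded), "of", len(raw_landing), "records loaded")  # 2 of 4
```

Real pipelines are a great deal more sophisticated than this, of course, but the shape is the same: pull raw records out of the lake, throw back anything that fails validation, and land the clean, typed rows in a warehouse table.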
Ingesting data through a data lake mechanism enables flexibility and scalability in data management. It serves as a central repository for data to be collated before being processed.
So, the lake (data lake) acts as a vast holding area for raw data before it gets cleansed and moved through into tables/sheets inside shelf locations (databases/workbooks) inside (data) warehouses.
What is a data lake?
A data lake is a vast pool of raw data stored in its original format. Unlike the more organized data warehouse, it can hold all types of data (texts, images, videos) until someone needs to use it. It’s like a big digital library where you don’t organize the books until you need them.
OK, so what’s Snowflake?
Snowflake is the data platform we use at Dunelm: a cloud-based data warehousing platform where all of these components (the data lake, the data warehouses, and the databases) are seamlessly hosted and integrated. There are other examples of data warehouses at Dunelm (e.g. SAP BW), and almost every system will contain some kind of back-end database (e.g. Dunelm.com, SAP ECC, ServiceNow etc.), but the real analytical power comes from the work we do in Snowflake — bringing data from those systems together and modelling it into more accessible formats, enabling powerful analytics and insights (although I would say that, because my team is doing some of that work).
In wrapping up, it’s clear that the journey through the realms of data — from databases to data warehouses, and onto the expansive shores of data lakes — mirrors the essence of Business Analysis itself. The role of a Business Analyst at Dunelm transcends merely navigating between these vast data landscapes (other Business Analysis focus areas are available); it’s about distilling complexity into clarity, ensuring that the raw, sometimes chaotic influx of data is transformed into structured, insightful information that powers decision-making across the business.
This blog post, in its essence, is a demonstration of the very principles I advocate for in my profession. By unravelling the intricacies of data management and making these concepts accessible, I’ve not only sought to demystify the world of data for you, but also aimed to showcase the value that lies in understanding, restructuring, and rendering complex information consumable for a diverse audience. It’s a testament to the pivotal role that Business Analysts can play in bridging gaps, facilitating understanding, and fostering collaboration among diverse stakeholders.
Moreover, this endeavour aligns perfectly with Dunelm’s mission to cultivate a culture where being People Led and Data Driven isn’t just an aspiration but a tangible reality. Through encouraging data fluency and promoting a deeper understanding of how data shapes our decisions, strategies, and ultimately, our success, we take significant strides towards realising that ambition.
In essence, the ability to make the complex simple, to guide others through the dense fog of data ambiguity to the clear shores of insight, is not just a part of what we do — it’s the cornerstone of the value the Business Analysis Chapter brings to the business and the broad community of stakeholders we serve. By illustrating these principles through the analogy of data structures and their management, I hope not only to have illuminated the path to greater data fluency but also to have exemplified the transformative power of Business Analysis in action.
Data, in its myriad forms and functions, holds the key to unlocking potential, driving innovation, and fostering growth. And if this blog post has, in any way, made that key more accessible to you, then it has achieved its purpose. Here’s to making the complex simple, to navigating the vast data lakes with ease, and to empowering each and every one of us to harness the true power of data at Dunelm.