A reflection on distributed Data Labs
Since data lies at the heart of many corporate strategies, a natural first step is to set up a Data Lab. The purpose of such a lab is to experiment with new technologies and to address a range of business use-cases.
First things first: how is a Data Lab organized? As you can see in the image below, a lab is composed of 3 main teams. Most of the team members are “technical” (Data Scientist, Data Engineer, Data Analyst…) and report to a “product owner” who aligns data technologies with the business vision.
There are obviously many reasons for success and failure, but one interesting starting point for evaluating success is the position a Data Lab holds in the organization.
Here is a typical chain of thought:
- My data is of critical importance
- My data is centralized in a Data Lake
- I build a centralized Data Lab
Then, after many millions are invested (sometimes even tens of millions), senior management will at some point ask some painful questions: what is my actual return on investment? Which AI use cases are in production and adopted by end users?
Since answers are hard to provide, management changes, political battles continue, budgets are adjusted, and at the end of the day the centralized Data Lab just lingers on or is even quietly phased out.
Let’s reflect a little bit on the reasons for failure:
- Is this a People problem? Possibly, when people and teams are not aligned
- Is this a Technology problem? Maybe, depending on technology choices
- Is this an Organization problem? Yes: the main problem of the centralized Data Lab is its place within the Organization.
Here at Saagie we have the privilege to work with a number of large corporate accounts and we have come to the conclusion that Distributed Data Labs is a concept worth considering.
Large corporations have complex organizations with different geographies, business lines and infrastructure deployment preferences. The distance between a centralized Data Lab and decentralized business units is simply too large, which makes it very challenging to set up the multidisciplinary teams that make AI projects a success.
The alignment between different teams and skills is a critical success factor. Moreover, we have learned that business should be in the driving seat, albeit with a strong technology component.
A distributed Data Lab is by definition very close to the business, which means there can be several Data Labs, depending on the different business units and geographies. But don’t get me wrong, this does not mean that a distributed Data Lab is disconnected from the rest of the enterprise. Quite the contrary: IT should assume the role of a data broker and make sure that Data Lab experiments actually make it into production. While IT keeps control, Data Labs should have as much autonomy as possible and be free to experiment with both business use cases and technologies.
A second lesson we learned is that, next to the technical aspects, distributed Data Labs need some form of centralized governance (defining priorities among use cases, alignment with company strategy, exchanging best practices between Data Labs…).
That is the role of a Data Office under the responsibility of a CDO. But that is a topic we’ll surely elaborate on in another post.
And the last lesson is probably the most important one. What really drives AI projects is embodied in a single word: Trust. IT should trust business and Data Labs with an appropriate level of autonomy. Business and Data Labs should trust IT as the sole data supplier.
And that means: No more shadow IT.
Moving beyond the traditional conflicts of interest is possible. Cross-fertilization is achieved when business has the possibility and the tools to take the lead on improving data quality, suggesting additional data sources, declaring tags on datasets, contributing to the enterprise data catalog or even declaring personal data.
Only then do you start to become data-centric, AI-driven or whatever buzzword you have in mind.
Easier said than done?
Over the last three years, we at Saagie have built a Data Fabric, which is essentially an abstraction and orchestration layer that standardizes the integration and deployment of a wide range of data technologies (open-source and commercial). It reduces complexity all the way from infrastructure and Data Lakes to Data Science and containerized applications.
This helps Data Labs create and share projects in an agnostic manner, meaning independently of the selected infrastructure or the Data Lake storage technology. In a hybrid world, globally operating multidisciplinary teams need a high degree of self-service and autonomy while IT remains firmly in control.