DataOps to industrialize Data projects
Only about half of all AI projects ever make it to production. Why? Because it often takes longer than it looks: between 12 and 18 months. That is the conclusion of numerous studies by Gartner, Capgemini, and BCG: the need to industrialize. Let’s take a look at the past and at what could be improved to help data projects come to life.
A bit of history
Not so long ago, data processing was tied to IT. Big Data was not so big, and analyzing it actually meant analyzing structured data. As for storage, it essentially meant SQL databases and data warehouses that were fed with data.
Everything changed around 2010 with Data Lakes. Tools allowed us to look for data where it lives, and sources multiplied. Unstructured data (images, text, audio…) could now be processed, and in much larger quantities. That is what we now call Big Data.
Along with Data Lakes came Data Labs; along with Data Labs came Data Scientists and Data Engineers. And here we are! All of these changes brought freedom to the teams, but deployment remains a central issue, as Analytics ecosystems don’t share the same IT delivery criteria as software development frameworks. Turning an idea into a POC has become possible, but getting it deployed is still complicated.
The main issues
Infrastructure and security issues often come up, but we decided to focus on the ones that cause the most problems and can be the hardest to address:
- the technological challenge: finding use cases is good; having the technologies to support them is better. The main problem here is that no single tool lets you manage such a project. Many technologies are involved, and you don’t only have to choose them, you also need to make them all work together and keep up with their evolution, as new versions are released very frequently.
- the human challenge: once you find the right technologies, you need to make them fit the teams: IT Ops maintain the infrastructure, Data Engineers prepare the data, and Data Scientists turn it into value. Their jobs are quite different, and getting them to collaborate can be tough.
A new approach is getting more and more attention, as it seems to answer many of the challenges mentioned above. Although it has only been adopted recently, it is based on two well-known approaches:
- Agility: setting up use cases that can be deployed quickly, reinforcing confidence within the teams by demonstrating value.
- DevOps: the approach that became the link between the development team, which wants an innovative project it can freely work on, and the operations team, which needs stability and security. How? By “simply” having them work together as one team. It encourages collaboration and iteration to, eventually, automate job deployment.
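To make “automating job deployment” concrete, here is a minimal, purely illustrative Python sketch of the shared path both teams agree on: a job is released only if its checks pass, with no manual hand-off. All function and job names are hypothetical, not any specific tool’s API.

```python
def run_checks(job: str) -> bool:
    """Stand-in for the dev team's gate: unit tests, data quality checks, ..."""
    return len(job) > 0  # trivially passes for any named job in this sketch

def deploy(job: str, env: str) -> str:
    """Stand-in for the ops team's deployment step."""
    return f"{job} deployed to {env}"

def release(job: str, env: str = "production") -> str:
    """The single automated path shared by dev and ops: check, then deploy."""
    if not run_checks(job):
        raise RuntimeError(f"checks failed for {job}")
    return deploy(job, env)

print(release("churn-model"))  # → churn-model deployed to production
```

The point is not the three tiny functions but the shape: one pipeline owned by one team, where deployment cannot happen without the checks that precede it.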
DataOps is a mix of both, but harder to set up, as it applies to data and not just software development: the two teams, Data and IT, often don’t even work for the same department.
Its main goal is to ease and accelerate delivery while involving new profiles in the process, managing several environments, and orchestrating many technologies. A game changer once you manage to apply it properly within the firm, and having the right tool can help you do it.
The solution: the DataOps orchestrator
“We often hear that ‘data is the new oil’, but however important data may be, the key is what you use to make it valuable. What you need is a refinery, because people don’t want oil, they want gas.” Adrien Blind, Tech Evangelist at Saagie
What if the refinery were the combination of an approach, DataOps, and a technology? The latter can be a DataOps orchestrator, which comes with a precise list of features:
- it is a true technological toolbox that allows managing the whole data cycle (preparation, processing, modeling, and visualization) while remaining operational (stable infrastructure)
- it doesn’t just gather these technologies, it organizes them by task (preparation, processing…), so it needs an orchestration feature
- it has governance features, so your team can manage data access in a secure way
- ergonomics and usability also matter, as business teams are involved; it needs a simple, intuitive interface that everyone can use, no matter how technical they are
Why is it so valuable? Because such a platform lets you manage data processing pipelines and deploy them in an automated, possibly scheduled way, and do all of that from one environment to another.
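As a hedged sketch of what such an orchestrator does conceptually, here is a minimal Python model: tasks covering steps of the data cycle are chained into a pipeline, and the pipeline can be executed against a named environment. Every name here (`Task`, `Pipeline`, the example steps) is illustrative and does not correspond to any specific product’s API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Task:
    """One step of the data cycle (preparation, processing, ...)."""
    name: str
    run: Callable[[Dict], Dict]  # takes the pipeline context, returns it updated

@dataclass
class Pipeline:
    """An ordered chain of tasks, runnable against different environments."""
    name: str
    tasks: List[Task] = field(default_factory=list)

    def execute(self, env: str) -> Dict:
        # Each task receives the context produced by the previous one,
        # so the same pipeline can run unchanged in staging or production.
        context: Dict = {"env": env}
        for task in self.tasks:
            context = task.run(context)
        return context

# Illustrative tasks covering part of the cycle mentioned above.
prepare = Task("preparation", lambda ctx: {**ctx, "rows": [1, 2, 3]})
process = Task("processing", lambda ctx: {**ctx, "total": sum(ctx["rows"])})

pipeline = Pipeline("demo", [prepare, process])
result = pipeline.execute(env="staging")
print(result["total"])  # → 6
```

Promoting the pipeline from one environment to another is then just a matter of calling `execute` with a different `env`, which is the automation-across-environments idea the paragraph above describes.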