Quantum AI – Part 4 – The key role of DataOps
This article is the fourth of our quantum AI column. Previously, we have seen that:
- Chapter 1: quantum computing will significantly accelerate the execution of some machine learning algorithms and cryptographic processing;
- Chapter 2: quantum phenomena (superposition and entanglement) responsible for parallel computing can only be exploited for a very short time under strict isolation conditions (decoherence problem);
- Chapter 3: despite all its promises, quantum AI will probably not be able to give birth to an artificial consciousness or a so-called strong AI.
In this new chapter, I want to share my point of view on the theoretical implementation of a quantum production system in a company. This could be, for example, a decision-making (BI) process or machine learning running on an infrastructure that mixes conventional and quantum machines. The interest is of course to take advantage of quantum processors to solve problems that are intractable today, whatever the servers used. Nevertheless, it is clear that such a system cannot be set up without a DataOps approach. Let's break it down.
Towards an inevitable hybridization of infrastructures
The initial problem is that when real quantum computers become available, they will very likely be in the hands of big American players such as IBM, Google or Microsoft, for the simple reason that this new equipment will be expensive… So whatever your urbanization policy, you will have to deal with the cloud to experiment with calculations on quantum virtual machines. In the absence of a “full cloud” policy, it will therefore be highly desirable to rely on hybrid infrastructures (cloud and on-premise) – or multi-cloud if need be – to limit the risk of dependence on a single provider.
Today, there are already similar resource-allocation issues. For example, we may want to provision a cloud experimentation environment (for testing cognitive services, say) while maintaining an on-premise production environment. However, while exploiting HPC servers in a cluster is now easier thanks to container orchestrators, of which Kubernetes is the most emblematic representative, the parallel exploitation of several disjoint and heterogeneous clusters proves to be extremely perilous. The difficulty is to be able to use different environments without losing the thread of your analytical project, namely the location of data, the continuity of processing pipelines, the centralization of logs…
We touch here on a problem well known on the other side of the Atlantic. The agility of infrastructures and the orchestration of processing on heterogeneous compute grids are among the main issues that DataOps addresses in the long term.
In the rest of this article, I will call “supercluster” a logical set of heterogeneous and hybrid clusters. For example, this might be the bundling of an on-premise environment running a commercial distribution of Hadoop or Kubernetes, coupled to an AKS cluster in Azure, an EKS cluster in AWS or a GKE cluster in Google Cloud.
What is DataOps?
Before continuing on the management of superclusters, it is necessary to define the term DataOps, which is still quite rare in Europe and more particularly in France. DataOps is an organizational and technological scheme inspired by DevOps. It involves bringing agility, automation and control to the various stakeholders of a data project, namely the IT department (IS operators, developers, architects), the data teams (data product owners, data scientists, engineers, stewards) and the business units.
The aim is to industrialize the analytical processes by making the most of the vast and diverse big-data technology ecosystem and the multiplicity of skills of each actor. In summary, I believe there are 9 functional pillars within the scope of a DataOps approach:
– DevOps principles: CI/CD, sprints
– Data processing: learning, testing, inference
– Trust & Share: virtualization of data sources, repository and metadata management, API exposure
– Workflows: conditional branching, batch mode, streaming (continuous)
– Data science: experiment design, active and incremental learning
– Security: project isolation, entitlement management, encryption of communications and data
– Reuse: code and template management, simplified re-use of generic artifacts
– Monitoring: quality and effectiveness, logging, versioning, notifications
– Clusters & containers: self-provisioning of physical servers or VMs, intelligent load distribution
These principles apply to each stage of the data life cycle (also known as the data value chain or data pipeline).
The DataOps approach is vast and complex, but it alone provides a sufficient level of abstraction to orchestrate analytical processing in a supercluster, as summarized in the following diagram, which we will walk through in the rest of this article:
Step 1 – Building a pipeline by abstraction levels
The DataOps interface (the vertical block at the center) has four technology components: CI/CD tooling, shared files and artifacts, a data virtualizer, and a meta-orchestrator. The first stage of implementing this interface consists in cutting the analytical process (pipeline) into independent links roughly corresponding to the life cycle of the data: extraction, cleaning, modeling…
This approach has a double interest:
- First, it allows taking advantage of different programming languages (and a fortiori different frameworks) depending on the link considered (we do not use the same tools for ETL as for machine learning, for example);
- Secondly, this division makes it possible to optimize the distribution of loads (which we will see in step 4).
But this division does not stop there: even within a specific activity (a link), it is recommended to split the code according to its different levels of functional abstraction. This amounts to imagining a multi-stage pipeline: the top pipeline consists of business bricks (e.g. “segment customers”) and each of these bricks relies on a sub-pipeline that chains slightly more basic steps (e.g. “detect missing values”, “execute a K-Means”, etc.), and so on. Below is an example of a pipeline and its sub-pipelines. Note that the higher the pipeline, the more abstract it is; each dark cell uses a lower-abstraction sub-pipeline.
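To make the idea concrete, here is a minimal sketch of such a two-level pipeline in Python. All names (the business brick “segment customers” and its basic steps) are illustrative, and the K-Means call is a stand-in rather than a real clustering implementation:

```python
def detect_missing_values(data):
    # Basic step: drop records containing missing fields.
    return [row for row in data if None not in row.values()]

def run_kmeans(data, k=3):
    # Basic step: placeholder for a real clustering routine.
    return {cluster_id: [] for cluster_id in range(k)}

def segment_customers(data):
    # Business brick = a sub-pipeline chaining basic steps.
    cleaned = detect_missing_values(data)
    return run_kmeans(cleaned)

def pipeline(data):
    # Top-level pipeline made of business bricks only:
    # the "what", with the "how" pushed down a level.
    return segment_customers(data)

customers = [{"age": 42, "spend": 100.0}, {"age": None, "spend": 50.0}]
print(pipeline(customers))
```

Each level can then be versioned, deployed and orchestrated as a unit, which is exactly what the following steps exploit.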
This division into levels of abstraction can also take place directly in the code (via functions and classes) rather than in the pipeline. But it is important to keep a certain breakdown in the pipeline itself, because that is what makes it possible to isolate and orchestrate the fragments of algorithms that may or may not benefit from quantum acceleration (see step 4).
Indeed, we must remember that only certain steps of an algorithm can benefit from the contributions of quantum computing (see the first article of the quantum AI column): typically matrix inversion, the search for a global extremum, modular arithmetic (the basis of cryptography), etc. Beyond knowing whether quantum can or cannot speed up a given computation, this breakdown of the code makes it possible to reduce the cloud bill by limiting the use of quantum VMs to the bare minimum (because their hourly cost will probably be steep).
Digression – An example of a classical algorithm partially converted to quantum
Within the framework of DBNs (Deep Belief Networks), it is possible to isolate the unsupervised pre-training of the stacked RBMs from the “fine-tuning” phase (final adjustment). Indeed, some researchers have been interested in a quantum acceleration of the sampling steps in the case of a convolutional deep belief network (yes, the name is quite barbaric). The goal is to compare the performance of quantum sampling against classical methods such as the CD (Contrastive Divergence) algorithm. This study shows that quantum can boost the pre-training phase, but not the discrimination phase! Hence the importance of properly decomposing the steps of the algorithms, to avoid needlessly soliciting a quantum machine for classical calculations that are long and cannot be transposed to quantum.
Moreover, beyond cost optimization, splitting code into abstraction levels is also and above all an essential methodology for writing scripts. An interesting article on this subject shows that managing abstraction (distinguishing the “what” from the “how”) is a good development practice that encompasses many others.
Step 2 – Integration with repository and CI/CD tools
Now that the code (and other artifacts) of the general pipeline is carefully split by abstraction levels, it should be centralized in the repository. This approach usually goes hand in hand with a standardization of the code. The goal is to be able to reuse it easily in different contexts.
The generic and repeatable character of a piece of code can be obtained by means of a double parameterization. The first is an intuitive generalization of the code by creating variables related to data processing (via classes, methods, functions…). The second is the creation of environment variables, meaning that the code is configured dynamically according to the environment (in the infrastructure sense) in which it runs. For example, the variable “password” can have multiple values, each of which is linked to a specific cluster.
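A minimal sketch of this double parameterization, with hypothetical variable names (`DB_HOST`, `DB_PASSWORD`): the table to process is a plain function argument (data-level variable), while the connection settings are read from the environment, so the same script runs unchanged on any cluster:

```python
import os

def get_connection_settings():
    # Environment-level variables: each cluster injects its own values.
    return {
        "host": os.environ.get("DB_HOST", "localhost"),
        "password": os.environ.get("DB_PASSWORD", ""),
    }

def extract(table, settings=None):
    # Data-level variable: the table name is passed in, not hard-coded.
    settings = settings or get_connection_settings()
    return f"SELECT * FROM {table} -- executed on {settings['host']}"

# On a production cluster, DB_HOST would be set by the platform, not the code.
os.environ["DB_HOST"] = "prod-cluster.example.internal"
print(extract("customers"))
```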
As for the automation of testing and deployment of these scripts, the DataOps solution can either integrate CI/CD functionality or connect to existing tools such as Maven, Gradle, SBT, Jenkins/JenkinsX, etc. These retrieve the centralized binaries from the repository and integrate them into the processing pipeline. The code then becomes “jobs” that will run in dedicated clusters. Finally, the pipeline must be able to log the versions of the jobs that compose it, to keep track of all previous deliveries and possibly perform “rollbacks”.
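The version-tracking and rollback idea can be sketched as follows; the registry class and job names are hypothetical, and a real DataOps tool would back this with its repository rather than an in-memory dictionary:

```python
class JobRegistry:
    """Keeps the deployment history of each job to allow rollbacks."""

    def __init__(self):
        self.history = {}  # job name -> ordered list of deployed versions

    def deploy(self, name, version):
        self.history.setdefault(name, []).append(version)
        return version

    def rollback(self, name):
        versions = self.history.get(name, [])
        if len(versions) < 2:
            raise ValueError(f"no previous version of {name} to roll back to")
        versions.pop()       # discard the current (faulty) version
        return versions[-1]  # the previous delivery becomes current again

registry = JobRegistry()
registry.deploy("segment-customers", "1.0.0")
registry.deploy("segment-customers", "1.1.0")
print(registry.rollback("segment-customers"))  # back to 1.0.0
```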
Step 3 – Data virtualization
The penultimate step is storage abstraction. Indeed, since the purpose is to exploit scattered infrastructures – which already requires a huge programming effort to make the code generic – it is better not to have to take into account the exact location of the data, or to have to replicate it with each processing run.
This is typically the role of a data virtualizer, which allows an implicit connection to intrinsically different storage sources and facilitates memory management to avoid needless data replication. In addition, data virtualization solutions provide an undeniable advantage in the implementation of cross-infrastructure data governance, by which I mean a transverse, unique repository with metadata and authorization management.
The data virtualizer intervenes at the time of reading the data (to perform the processing) and also at the end of the chain to write the intermediate or final results in a local or remote database (or cluster).
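The principle can be sketched as a thin facade; the catalog entries, cluster names and paths below are all hypothetical, and a real virtualizer would of course return data rather than a description string:

```python
# Logical dataset names mapped to physical locations (hypothetical examples).
CATALOG = {
    "sales.customers": {"cluster": "on-prem-hadoop", "path": "/data/customers"},
    "sales.orders": {"cluster": "aws-eks", "path": "s3://bucket/orders"},
}

class DataVirtualizer:
    def __init__(self, catalog):
        self.catalog = catalog

    def resolve(self, logical_name):
        # Jobs never hard-code a location; it is looked up at run time.
        return self.catalog[logical_name]

    def read(self, logical_name):
        loc = self.resolve(logical_name)
        return f"reading {loc['path']} on {loc['cluster']}"

virtualizer = DataVirtualizer(CATALOG)
print(virtualizer.read("sales.orders"))
```

Because jobs only ever mention logical names, the same pipeline runs whether the data sits on-premise or in a cloud cluster.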
Step 4 – Advanced load balancing
Now that all the data is made available (via the virtualizer) and the codes are standardized and broken down into coherent functional units, the idea is to orchestrate the associated processing within the supercluster. In other words, we try to execute the algorithmic containers in the appropriate clusters.
Each cluster is governed by a solution that acts as a scheduler / dispatcher of tasks and as a container and physical resource manager (eg Kubernetes). Today, all cloud providers offer Kubernetes-as-a-service offerings in their virtual clusters.
The DataOps solution needs to go a step further and play the role of “meta-orchestrator”, whose aim is to distribute jobs among the orchestrators (Kubernetes) underlying each cluster. The meta-orchestrator is therefore an additional layer of abstraction above Kubernetes. When a quantum acceleration is needed, the meta-orchestrator is responsible for redirecting the algorithmic container to one of the Kubernetes clusters orchestrating the quantum VMs. Thus, the meta-orchestrator ensures that only the relevant jobs are routed to the quantum VMs, while the others can run on-premise or on cloud clusters composed of traditional VMs.
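The routing rule at the heart of the meta-orchestrator can be sketched in a few lines; the cluster names and the `quantum_accelerable` flag are hypothetical, and a real implementation would submit containers to each cluster's Kubernetes API rather than return a name:

```python
CLUSTERS = {
    "quantum": "gke-quantum-vms",       # Kubernetes fronting quantum VMs
    "classical": "on-prem-kubernetes",  # default classical cluster
}

def route(job):
    # Only steps that truly benefit from quantum speed-up (matrix inversion,
    # global-extremum search, ...) should carry the flag, to keep the cloud
    # bill low: everything else stays on classical infrastructure.
    target = "quantum" if job.get("quantum_accelerable") else "classical"
    return CLUSTERS[target]

jobs = [
    {"name": "clean-data", "quantum_accelerable": False},
    {"name": "invert-kernel-matrix", "quantum_accelerable": True},
]
for job in jobs:
    print(job["name"], "->", route(job))
```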
In summary, the rise of quantum machines in the cloud will encourage companies to optimize the way they orchestrate their analytical processing across hybrid environments. The DataOps interface (which can be a unified software solution), in addition to automating the deployment of complex pipelines, oversees the routing of certain (quantum-compatible) code to Kubernetes clusters with quantum VMs. Companies would thus be able to control their costs (especially the cloud bill) by requesting the right clusters at the right time, especially when it comes to using HPC servers (GPU, TPU, NPU, QPU, etc.).
Quantum AI Column
This article is part of a column dedicated to quantum AI. Find all posts of the same theme:
- Part 1 – Ending impotence!
- Part 2 – The die has been cast
- Part 3 – Rise of the AIs