Our (extended) Big Data / AI recap!
The Big Data ecosystem is still on the rise! Data has made its way into our companies as well as our everyday lives and coffee machine conversations. Since Matt Turck published his article “Great Power, Great Responsibility” in June, now seems a good time to deliver our extended 2018 Big Data / AI recap (extended because we added the first quarter of 2019).
The Big Data / AI market
Big Data was a 141-billion-dollar market in 2016. A huge figure, you think? Not that big when you consider that three out of four companies plan to launch a Big Data / AI project by the end of 2020. As a result, the market is expected to be worth more than 200 billion dollars within the next two years.
The Big Data ecosystem is more diversified than ever, with a huge number of startups – more than 1,500 from 70 countries. At the same time, more and more firms – whatever their size – are launching very promising Big Data / AI initiatives.
And such projects require storage. Cloud providers keep up their momentum as companies increasingly choose hybrid and cloud solutions over their own datacenters.
The cloud market is still very competitive. Amazon's AWS keeps the lead, but Microsoft's Azure is getting closer. As shown on the Synergy Research Group's graphic below, the gap between the market giants and the rest tends to grow.
The leading cloud providers (AWS, Azure, Google Cloud Platform, IBM) keep competing by offering various Big Data, Data Engineering and Machine Learning tools through their platforms (Amazon Neptune, Google AutoML, etc.) at aggressive prices to attract developers.
As they keep expanding their scale and improving the performance of their tools, they reshape the Big Data landscape more and more, and startups are having a hard time competing.
As mentioned in Matt Turck’s panorama, the market is becoming concentrated, which was inevitable given how fragmented it is. There will surely be a slow restructuring, with tech giants acquiring startups or other big firms, as has already begun.
Smaller but important acquisitions also took place within the market, such as:
- Qlik with Attunity, Podium Data or CrunchBot
- Tableau with Empirical Systems
- DataRobot with Nexosis and Cursor
And there are more if you look at external acquisitions (Big Data / AI companies purchased by actors from other sectors): Salesforce acquiring MuleSoft, Roche doing the same with Flatiron Health, or AppNexus being bought by AT&T. Interesting fact: before becoming part of IBM, Red Hat had acquired CoreOS. Apple keeps adding startups such as Lattice Data, and so does SAP with Gigya.
Still, saying that startups have trouble competing is not entirely true. Many actually grow, especially those working on real-time processing, data governance or data fabrics. AI has also brought attention to databases, DevOps tools and platforms that help deploy projects.
The need for operationalization
Only 5 to 15% of Big Data / AI projects are actually deployed. That is what comes out of Gartner’s 2018 “CIO Survey”, the 2017 BCG Henderson Institute report “Putting Artificial Intelligence to Work” and the 2016 Capgemini & Informatica survey “The Big Data Payoff: Turning Big Data into Business Value”. The need, as we have watched it grow over the last few years, is now for industrialisation.
Machine Learning Engineer: the new kid on the block
That is why a new profile has emerged: the Machine Learning Engineer, the missing link between Data Engineering and Data Science. The job, to put it simply, consists in helping deploy algorithms.
His profile combines Data Scientist and Data Engineer skills. He typically has a Data Engineer background – or very close to it – completed with mathematics or AI courses so he can handle Data Science and its infrastructure. His Data Science knowledge allows him to optimise algorithms for implementation, and his programming skills let him actually implement them.
He understands engineering methods and follows an agile approach. As the one who supervises the launch into production, he also needs to master his environment to make sure it is safe.
It was DevOps, now it’s DataOps
DevOps got a lot of hype over the last few years, but it focuses on software delivery. As new considerations emerged around data processing, the approach was “replicated” and named DataOps.
To define it best, we could say it is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization.
IT teams (IT Ops managers, IT architects and coders), analytics teams (Data Engineers and Data Scientists) and business teams (Data Analysts and Data Stewards) need to work closely together, now more than ever. To make that easier, collaboration and iteration are key. The goal of DataOps is to deliver value faster by creating predictable delivery and change management of data and data models.
The Deep Learning hype
Artificial Intelligence is now a staple. Machine Learning got a lot of attention in 2017, but 2018 brought along a lot of Deep Learning blogs and articles which showed a growing interest.
As you can see, Deep Learning is a subset of Machine Learning, which makes them both subsets of AI. Now that you know that (you probably already knew it but play along), you may guess that they are pretty similar.
Deep Learning also aims to have a machine learn a task so it can perform it on its own, but it actually goes even beyond that. Apart from the fact that you may not need feature engineering, Deep Learning brings more differences. To see them, let’s first explain both.
Take Machine Learning:
It consists in creating an algorithm which can train itself on structured data – without any human intervention – in order to produce the desired output (often a prediction, made by identifying patterns).
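To make that concrete, here is a minimal sketch of the idea (ours, not from any particular library): a model fits structured data with plain NumPy and recovers the hidden pattern y = 2x + 1 from examples alone, with no human telling it the coefficients.

```python
import numpy as np

# Structured training data: the hidden pattern is y = 2x + 1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Add a bias column so the model can also learn an intercept.
A = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares fit: the algorithm recovers slope and intercept
# purely from the data.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Use the learned coefficients to predict on an unseen input (x = 5).
prediction = np.array([[5.0, 1.0]]) @ coef
```

Here the training data is exactly linear, so the fitted coefficients come out as (2, 1) and the prediction for x = 5 is 11 – the "pattern identification" the paragraph above describes, in its simplest possible form.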
Now Deep Learning:
Imagine a similar algorithm, but built from numerous layers. Each layer provides a different interpretation of the data it processes (structured or not) – but such models require larger quantities of data. These networks of algorithms are called artificial neural networks, and the way they operate imitates – or at least is inspired by – how our brains work.
That explains why Machine Learning is essentially used to find correlations or differences in datasets – to detect trends or anomalies, or to make predictions – while Deep Learning is more useful for text or image recognition.
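As a toy illustration of the layered approach (our sketch – the layer sizes and learning rate are arbitrary choices, not from the article), here is a two-layer neural network in plain NumPy trained on XOR, a pattern that no single linear model can capture:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR is not linearly separable, so it needs at least one hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Two layers of weights: each layer re-interprets the previous one's output.
W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = np.tanh(X @ W1 + b1)        # hidden layer
    return h, sigmoid(h @ W2 + b2)  # output layer

def loss(out):
    # Binary cross-entropy between predictions and targets.
    return -np.mean(y * np.log(out) + (1 - y) * np.log(1 - out))

_, out = forward(X)
initial_loss = loss(out)

for _ in range(3000):              # plain full-batch gradient descent
    h, out = forward(X)
    g_out = (out - y) / len(X)     # gradient of BCE w.r.t. the logits
    g_h = (g_out @ W2.T) * (1 - h ** 2)
    W2 -= 0.5 * (h.T @ g_out); b2 -= 0.5 * g_out.sum(0)
    W1 -= 0.5 * (X.T @ g_h);   b1 -= 0.5 * g_h.sum(0)

_, out = forward(X)
final_loss = loss(out)             # lower than initial_loss
```

Nothing here is a production recipe; the point is the structure: stacked layers and backpropagated gradients let the network learn a relationship that the single-layer model from the Machine Learning sketch above could not.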
The growing interest for Docker
The main point of using Docker is the ease of deployment, which is one of the main needs of the market.
Docker packages applications in containers (hence the name!) so they can run on any server, virtual or not. It helps with both infrastructure management and – when it comes to data projects – industrialisation. Why? Because moving a container from a test environment to a production environment is easy. That is how Docker managed to stand out from VMs and, in the end, succeed.
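As a hedged illustration (the file names and base image below are hypothetical examples, not from the article), the whole recipe for such a container can fit in a short Dockerfile; the image built from it then runs unchanged in test and in production:

```dockerfile
# Hypothetical example: package a Python data application.
FROM python:3.7-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "main.py"]
```

Built once with `docker build -t myapp .`, the resulting image can be moved between environments with `docker push` and `docker pull` – which is exactly the ease of deployment described above.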
Back in March, while celebrating its 5th anniversary, Docker also announced that 37 billion of its container images had been downloaded. Check out their website today and the figure is up to… 80 billion! That is why we can say the interest for Docker is surely growing.
The massive use of Kubernetes
How does one talk about Docker without mentioning Kubernetes? The two do not only complement each other, they also share an impressive growth.
As a reminder, Kubernetes is an open-source technology released by Google in 2015. It is used to deploy multi-container applications at scale: it lets you manage and organise the way applications run across server clusters.
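For instance – a hypothetical, minimal sketch whose names and counts are ours, not from the article – a Kubernetes Deployment declares how many copies of a containerized application should run across the cluster:

```yaml
# Hypothetical example: keep 3 replicas of a container image running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:1.0
          ports:
            - containerPort: 8080
```

Applied with `kubectl apply -f deployment.yaml`, Kubernetes then keeps three instances running and reschedules them if a server fails – the "manage and organise across server clusters" part in practice.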
Even if Kubernetes runs with any container system, the fact that it works so well with Docker is no surprise, and helps explain why both succeeded. Since Docker announced in 2017 that Kubernetes would be part of its offer – despite having an orchestrator of its own (Swarm) – the figures have kept growing.
According to Datadog, 85% of the firms that manage containers on Google Cloud Platform use Kubernetes. The numbers also increase when it comes to Azure or AWS, which confirms the K8s hype.
The decline of Hadoop
Time-consuming and complex to integrate, Hadoop is no longer the success story we all remember. Firms that invested massively in Hadoop-powered Data Lakes are now facing serious challenges in delivering business value or turning into data-centric organizations.
The main reason for this lack of success is that the Hadoop Distributed File System (HDFS) is only one building block of a fast-moving and complex ecosystem of literally dozens of technologies, covering storage, scheduling, real-time processing, querying, Deep Learning compute and analytics. These technologies are cutting edge, tend to be open-source driven and change dramatically fast.
The merger between Cloudera and Hortonworks only confirmed Hadoop’s decline. Now, cloud-based SQL databases or simpler storage options like Amazon S3 are set to take the lead.
AI’s dark side revealed
Data technologies (Big Data, Data Science, Machine Learning, Deep Learning) keep on their innovative path and are used more and more in businesses. Articles, white papers and infographics are published every day, proving that firms understand the stakes of being data-driven and are ready to take the next step.
But on the other hand, AI – and above all the use of data – has shown its downsides, and how dangerous it can be in the wrong hands.
So why does everyone suddenly seem to care, carefully reading everything about AI being a threat to humanity or their privacy being taken away? Because they now have reasons to. A few reminders of what happened last year:
- The personal data of 87 million Facebook users being handed to Cambridge Analytica (and Palantir?) for political use.
- The Chinese Citizen Score: China deciding to grade its citizens, which makes the show Black Mirror feel so much more realistic
- Google and its controversial partnership with the US military on the surveillance project called Maven.
We all knew about the two sides of the coin, but we may now have to brace ourselves for other scandals. That is why this sentence now makes perfect sense: “with great power comes great responsibility”.
The oaths: the ethical answer
Facing these scandals, tech personalities took the oath initiative (so did we). The first to do so were Microsoft’s Brad Smith and Harry Shum, who took part in a book about the sector’s future and the ethical aspects of technology that need to be promoted. Then, from Allen Institute CEO Oren Etzioni to Sundar Pichai (Google), everyone calls for calm, tries to be reassuring about personal data, and reminds us that AI is only a useful tool if it serves humanity.
In addition to Silicon Valley leaders, other initiatives exist. Many aim to have data professionals take an oath, the same way doctors do. If you are interested in taking responsibility, just search for an AI oath and find the right one for you. MIT and Harvard also understood the need to reassure and set up a joint course called “Ethics and Governance of AI”, so students learn not only how to master these technologies but also the consequences of their actions.
GDPR: the legal answer
Oaths represent an important first step but the field needs to be regulated. Europe set an example with GDPR. The General Data Protection Regulation (GDPR) is a set of rules adopted by the European Parliament in April 2016 to replace an outdated directive that was more than 20 years old.
As soon as EU citizens’ personal data are stored or processed, the company needs to comply with the GDPR. The same goes for any company located within the EU. The regulation has been effective since May 25th, 2018, and companies that do not comply are now exposed to sanctions.
It is only a start, but others are now following the European move. In the US, both California and Oregon have passed regulations over the last few months to protect their citizens’ personal data.
Between the growing need for industrialisation, the decline of some technologies and the rising interest in others, the last months were big on data (pun intended). After showing off business opportunities, data processing is now making people wary. Not everyone agrees on the moral aspects – scary and concerning for Elon Musk, Bill Gates or Stephen Hawking, full of possibilities for the optimistic Mark Zuckerberg and Sundar Pichai. Ethics and regulations are essential but cannot become a barrier to innovation. One thing is now sure: companies dealing with data that offer secure personal data management will eventually prevail.