Data Lake for Dummies
A Data Lake… what is that?
The rise of Big Data and its constant evolution are making companies more and more demanding of powerful data analysis technologies. The data lake was created to meet these needs. It is a computer system able to store all of a company's data in one place, and it is gradually replacing its ancestor, the data warehouse. What are the differences between the two systems? The first is the nature of the data each one can ingest and process: a data warehouse handles only structured data, whereas a data lake can handle all types of data, which brings additional flexibility. The second is structure: a data warehouse has a fixed, vertical structure, i.e. the data is sent to thematically specialised datamarts, which in turn deliver it to the end user. A data lake has a more flexible structure and makes the data more malleable.
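To make the difference concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the records, the field names and the functions warehouse_load, lake_ingest and lake_read are all hypothetical and do not come from any real product. The "warehouse" rejects anything that does not fit its fixed structure at load time, whereas the "lake" stores everything as-is and applies structure only when the data is read.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (illustrative only).
RAW_RECORDS = [
    '{"source": "crm", "customer_id": 42, "country": "FR"}',
    '{"source": "social", "user": "@anna", "text": "Great service!"}',
    'free text from a support e-mail, not JSON at all',
]


def warehouse_load(records, required_fields):
    """Warehouse-style load: keep only records matching a fixed structure."""
    accepted = []
    for raw in records:
        try:
            row = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unstructured data is rejected at load time
        if isinstance(row, dict) and all(f in row for f in required_fields):
            accepted.append(row)
    return accepted


def lake_ingest(records):
    """Lake-style ingest: store every record untouched, whatever its shape."""
    return list(records)


def lake_read(store, source):
    """Apply structure only at read time, when a question is actually asked."""
    result = []
    for raw in store:
        try:
            row = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not parseable for this particular question
        if isinstance(row, dict) and row.get("source") == source:
            result.append(row)
    return result


warehouse = warehouse_load(RAW_RECORDS, ["customer_id", "country"])
lake = lake_ingest(RAW_RECORDS)
print(len(warehouse), "record(s) in the warehouse,", len(lake), "in the lake")
print(lake_read(lake, "social"))
```

In this sketch the free-text e-mail never reaches the warehouse, but it is still sitting in the lake, ready for a later analysis that nobody has thought of yet.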
The advantages of a data lake
With a data lake, users can define their need, extract the data related to it, and combine that data to put it to the right use. The great step forward of the data lake is that it makes data processing more operational, thanks to its ability to react to data in real time. This flexibility and operational dimension let companies focus on their value proposition and on the innovative solutions they can build, so that their innovation cycle is optimized and accelerated. A data lake can process very large amounts of heterogeneous data even when its eventual use is not yet known. The ingested raw data remains untouched, which leaves the field of possible analyses wide open. Moreover, a data lake makes it possible to combine a company's internal data with external data such as the weather, pollution, traffic, the number of bicycles circulating in Paris, etc., in order to obtain a powerful tool for predicting behaviors.
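As a rough illustration of combining internal and external data, the Python sketch below joins a company's daily sales with a weather feed, by date. Everything in it is an assumption made for the example: the figures are invented, and the names internal_sales, external_weather and average_by_weather are hypothetical.

```python
from collections import defaultdict

# Hypothetical internal data: daily sales figures (invented for the example).
internal_sales = [
    {"date": "2024-05-01", "sales": 120},
    {"date": "2024-05-02", "sales": 45},
    {"date": "2024-05-03", "sales": 130},
    {"date": "2024-05-04", "sales": 50},
]

# Hypothetical external data: weather observed on the same days.
external_weather = {
    "2024-05-01": "sunny",
    "2024-05-02": "rainy",
    "2024-05-03": "sunny",
    "2024-05-04": "rainy",
}


def average_by_weather(sales, weather):
    """Join both sources on the date, then average sales per weather type."""
    grouped = defaultdict(list)
    for row in sales:
        condition = weather.get(row["date"])
        if condition is not None:  # skip days with no weather observation
            grouped[condition].append(row["sales"])
    return {cond: sum(vals) / len(vals) for cond, vals in grouped.items()}


averages = average_by_weather(internal_sales, external_weather)
print(averages)
```

An average per weather type is of course far from a real prediction model, but it shows the principle: once both sources sit in the same place, relating them takes only a few lines.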
How does a data lake operate?
As explained previously, a data lake takes in all the data present in the company, whatever its nature and source: social networks, CRM, geolocation, etc. But what could be better than a diagram to explain how a data lake operates and make it accessible to everyone!
A data lake can be deployed in two different ways: the classical way, "on premise", i.e. in a physical data center, or in the cloud. The cloud appears to be the better option because it makes it possible to scale the infrastructure according to needs and to reduce infrastructure costs. It also offers more choice of application components and allows real-time interaction.
It is important to note that the concept of data governance naturally goes hand in hand with a data lake. Indeed, the ingested data is of varied nature and can be combined ad infinitum, which is why it is important to exploit all of its richness in order to improve processes and to create and enhance customer experiences. Data governance is a great mix of several skills: technology, data science, digital marketing, project management, etc. To enrich the data stored in data lakes, it is essential to have a set of integrated tools that help set up analytics solutions and business applications.
In conclusion, data lakes aim to address the following challenges: not only data storage and processing, but also additional capabilities such as visualization, data science, data governance, and real-time processing.
The uses of a data lake
The early adopters of data lakes were marketing and the media, but today all sectors (industry, services, medicine, etc.) and professions (data scientists, developers, etc.) are concerned. The business applications of data lakes are now very varied: from sales forecasting to inventory control, by way of predictive maintenance, segmentation and behavior prediction projects, and the tailoring of medical care.
The best-known data lake technology is Hadoop, an open source framework written in Java, which consists of a storage core, the Hadoop Distributed File System (HDFS), and a data processing part, MapReduce. Hadoop's infrastructure can scale out almost indefinitely, and it processes the whole data set more quickly and efficiently by splitting it across nodes (independent computers within the cluster). Hadoop is today one of the most widely used systems for storing big data.
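The idea behind MapReduce can be sketched in a few lines of plain Python. This is a toy, single-machine illustration and not Hadoop's actual API: the function names are hypothetical, and the "nodes" are simply lists. The input is split into blocks, each block is mapped independently (as each node would do on its own share of the data), and the intermediate results are then grouped by key and reduced.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(lines, n_nodes):
    """Spread the input lines over n_nodes blocks, round-robin."""
    return [lines[i::n_nodes] for i in range(n_nodes)]


def map_phase(block):
    """Map: each node turns its own block into (word, 1) pairs."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(mapped_blocks):
    """Shuffle: group the values coming from all nodes by key."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped_blocks):
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


lines = [
    "data lake stores raw data",
    "the lake feeds analytics",
    "raw data stays raw",
]
blocks = split_into_blocks(lines, n_nodes=2)
counts = reduce_phase(shuffle(map_phase(block) for block in blocks))
print(counts["data"], counts["raw"], counts["lake"])  # prints: 3 3 2
```

In a real cluster the blocks would live on different machines and the map step would run in parallel, which is what allows the processing to scale with the number of nodes.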
Spark is also an open source framework, written in Scala. Unlike Hadoop MapReduce, which processes data in successive stages and writes intermediate results to disk, Spark can keep the whole working data set in memory. However, it has no file management system of its own and has to rely on a third-party one (HDFS, Cassandra, etc.).
Saagie, the heron that soars over the (data) lake
With Saagie, your data is gathered in one place and freed up, which makes it easily available. Just as the heron flies over the lake to catch its food, Saagie draws from the data lake the precise data you need, so that you can exploit it in real time and extract all of its added value for your company. Saagie offers a ready-to-use platform that accompanies you from the extraction of your data to the resulting business applications, by way of data storage and processing. Deployable in the cloud (Saagie Kumo) or in your own data center (Saagie Su), the platform is completely flexible and adapts to your organization. It works with all existing, proven technologies.