Before the era of big data, two similar roles already existed. We can call them the “ancestors” of today's data scientist and data engineer: the data analyst (who, as the name suggests, analyzes data) and the BI developer. With the development of AI and the emergence of data science, these roles have evolved and clearly diverged, but cooperation between them remains essential.
The sexiest job of the 21st century...
That 2012 phrase is already a bit dated, but data scientists remain among the most sought-after and “spoiled” profiles in the industry. Nevertheless, data scientists would not be as important without data engineers.
Data scientists build models using mathematical tools, machine learning and business knowledge. They also use programming languages.
But before the model can be built, the data must be cleaned and prepared for use. Who is in charge of this? The data engineer, who builds an environment adapted to the data flows and makes it available to the data scientist.
As you can see, data engineers are a vital part of the data science team, and theirs is a highly sought-after profile in any data project.
Who is the Data Engineer?
To get to know this profile a little better, let's define the role. The data engineer builds the data structures and technological architectures needed to acquire and analyze data and to deploy, at scale, applications that use massive amounts of data.
Data engineers must be able to model and build data warehouses and define how data is integrated and transformed so that it is ready for analysis (ETL: Extract, Transform and Load). In this way, they build the pipeline that is then handed to the data scientists so that they can put their models into production, with the guarantee of a continuous flow between servers and applications.
They must ensure that the data is easily accessible, that the flow runs smoothly, and ideally that it is optimized for the company's data ecosystem.
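As a rough illustration of the extract, transform and load steps described above, here is a minimal Python sketch using only the standard library. The sample data, the `orders` table and the function names are hypothetical, chosen purely for illustration. A real pipeline would typically read from production sources and write to a proper data warehouse rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this could come from a file, an API or a queue.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,alice,5.00
4,,12.50
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows and convert fields to proper types."""
    clean = []
    for row in rows:
        if not row["customer"] or not row["amount"]:
            continue  # skip rows with a missing customer or amount
        clean.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a table ready for analysis."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(text, conn):
    """Chain the three steps: extract, then transform, then load."""
    load(transform(extract(text)), conn)


# An in-memory SQLite database stands in for the warehouse (an assumption).
conn = sqlite3.connect(":memory:")
run_pipeline(RAW_CSV, conn)
```

Keeping each step in its own function is one possible design choice: it lets the cleaning rules be tested separately from the source and the destination, which tend to change independently.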
Does a data engineer need in-depth knowledge of the models developed by data scientists? The answer is no. Still, even if data engineers are not necessarily experts in machine learning, they need to know the basics of these models in order to decide which architecture to build for them.
As you can see, data engineers are in charge of the big data infrastructure underlying the analyses made by the data scientist.
The most important tools and technologies used by the data engineer include Hadoop, MapReduce, Hive, Pig, SQL and NoSQL databases, dashDB, MySQL, MongoDB and Cassandra.
The Data Scientist, a not-so-well-known job
Data scientists are like the alchemists of the 21st century. They convert raw data into precise insights, applying statistics and machine learning with an analytical approach to solve business problems.
In addition, data scientists must master programming, modeling algorithms, and the processing of large volumes of data. They must also have some experience with, and knowledge of, the industrial and commercial issues behind the project. Finally, they must know how to interpret and present their results so that a wider audience can understand their work. To do this, they need good visualizations and a vocabulary adapted to non-technical listeners.
The backbone of a data scientist's work is the development of machine learning models. To do this, they must master data analysis libraries such as pandas and know the ins and outs of programming languages such as Python or R. Jupyter notebooks hold no secrets for them. They are also able to launch Spark clusters and develop deep learning models with TensorFlow.
However, to communicate well with data engineers, they must also be able to work with SQL and NoSQL databases and know the basics of Hadoop.
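To make the SQL point concrete, here is a small sketch of the kind of query a data scientist might run to pull model features from a table prepared by a data engineer. The `orders` table, its columns and the `customer_features` helper are assumptions made for this example, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical table, standing in for data delivered by the data engineer's pipeline.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("alice", 5.0), ("bob", 12.5)],
)


def customer_features(conn):
    """Aggregate one row of features per customer: order count and total spend."""
    query = """
        SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total_spent
        FROM orders
        GROUP BY customer
        ORDER BY customer
    """
    return conn.execute(query).fetchall()


features = customer_features(conn)
for customer, n_orders, total_spent in features:
    print(customer, n_orders, total_spent)
```

Doing the aggregation in SQL, rather than loading every row into memory first, is one reasonable choice when the underlying table is large.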
Collaboration is strength
Since data scientists are involved to a greater or lesser extent in every step of building a data science system, they must be able to work with other professions in addition to data engineers.
In a data team, we see more and more specialized professions: the Chief Data Officer (in charge of organizing the project with a customer vision), the big data architect (the infrastructure modeling expert), the business analyst (who explains the results and turns them into business knowledge for the customer), and so on.
Keep in mind that 85% of data projects do not go into production. The cause is often poor management of the different profiles on the team and of the way they work together. Since data scientists and data engineers are the two pillars of these projects, you have every interest in investing in these profiles.