In recent years, data analysis and data exploitation, through BI and Machine Learning, have become major concerns in the business world. As a result, many companies have emerged or adapted to meet this new and growing need. However, another notable consequence of this trend is often overlooked: the appearance or specialization of the many profiles that make these data projects succeed, such as Data Scientist, Data Engineer, Data Analyst, Chief Data Officer, etc.
This job description is for those who want to make sense of this maze of curious job titles, and also for those who want to set up a data team in their company. Because yes, the data engineer is probably the first person you will need.
What Does the Data Engineer Do?
Data Engineers are in charge of creating and maintaining the environment that allows almost every other function in the data world to operate. In practice, they are responsible for developing, building, testing, and maintaining architectures such as databases and processing systems. Basically, they are the gatekeepers of the data production chain and ensure that it runs smoothly, from data extraction to data visualization.
Data Engineer vs. Data Scientist
Even though the two roles are now distinct, the boundary between the Data Engineer’s and the Data Scientist’s missions is still sometimes blurred. This is not surprising when you consider that many data scientists and engineers wear both hats or have moved from one role to the other. It is also a good reason to recall the tasks that fall to each job and that differentiate them.
In simple terms, the work of the Data Engineer comes before the work of the Data Scientist, and serves to put it into production. The engineer focuses on setting up the data pipeline and, by maintaining it, ensures that others can do their job properly. Generally speaking, the Data Engineer is therefore much more concerned with the infrastructure and architecture that generate the data and sort it so that it is usable. The Data Scientist then uses this data to apply algorithms and detect trends. This is why the position of Data Engineer is essential, just like that of Data Scientist or Data Miner, and it is this interdependence that makes the roles complementary. The gap between their missions has also given rise to a new role: the machine learning engineer, who is gaining in popularity and whose job sits precisely at the crossroads of data engineering and data science.
Building the Foundation to Implement Your Data Strategy
You don’t start building a house from the roof. The same applies to data analysis: before you start calculations, you need to be able to access the data.
In a data project, we often consider that the value lies in the algorithm that automatically and rapidly transforms massive data into valuable information: product recommendation, translation, facial recognition… As a result, the data scientist, who is in charge of developing this algorithm, seems to be the key player in the project.
This is partially false, for a very simple reason: the added value lies, for the most part, in the data itself. Indeed, machine learning algorithms are often open source, whereas the data is carefully guarded, and there is a reason for this. To caricature, if you have precise, relevant and well-documented data, the added value of the data scientist will logically be reduced. Quite simply, the data scientist will never be able to do better than what the data allows.
To make a quick analogy, consider a Formula 1 team: the driver is the center of attention, yet he cannot go beyond the technical limits of his car. It is a safe bet that a driver would be unable to drive 100 meters without the army of technicians preparing the car and the engineers designing it. Conversely, an engineer who has spent months designing a car will surely be able to finish a race even without driving skills, albeit in last place.
That’s why the data engineer is essential: he or she is the one who creates, maintains and improves the information systems that allow the other members of the data team to do their jobs. Without a data engineer, your data scientist is likely to spend more time handling data than analyzing it.
What Skills Does the Data Engineer Need?
The Data Engineer is very focused on the company’s data management infrastructure, so the required skills are predictably focused on the data architecture:
- In-depth knowledge of SQL and other database languages: the Data Engineer must master database management tools and have a good knowledge of SQL and of RDBMS (DB2…). Mastery of other database technologies such as Cassandra or Bigtable is also valuable, depending on the technologies used by the company, especially since large companies rarely rely on a single database technology;
- Data storage and ETL tools: mastering the data storage tools (Hadoop) and ETL tools (Talend, NiFi…) on the market is essential;
- Hadoop-based analysis (HBase, Hive, etc.): a good understanding of Apache Hadoop-based data analysis is increasingly expected in the Data Engineer profession, and knowledge of HBase or Hive is often considered a must-have for mid-level and senior Data Engineer positions. Less will be asked of junior Data Engineers;
- Mastering code: knowledge of one or more programming languages is a definite asset and is even becoming a prerequisite. Familiarity with, if not expertise in, one of the following languages will be required: Python, C/C++, Java, Scala, Perl, or other similar languages;
- Machine Learning, Deep Learning, and Artificial Intelligence: although this is primarily the data scientist’s area of expertise, a certain level of understanding in these areas is an asset for working with Data Scientists. For this reason, some knowledge of statistical analysis and data modeling can be useful, as well as knowledge of machine learning. These extra skills also help a candidate stand out, because being able to “put on both hats” is invaluable for a company;
- Various operating systems: UNIX, Linux and Solaris.
Among the responsibilities of data engineers are the following:
- Design and management of databases and/or data lakes;
- Gathering data from different sources and matching it;
- Setting up a pipeline to automate the various stages of data acquisition, from extraction to storage;
- Creation of tools allowing access to the data;
- Managing the scalability of the infrastructure (horizontal and vertical) in a seamless way for other stakeholders.
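To make these responsibilities more concrete, here is a minimal sketch of such a pipeline, using only the Python standard library. It is an illustration under stated assumptions, not a reference implementation: the two sample sources, the `customer_totals` table and all function names are hypothetical, and a real pipeline would typically read from files, APIs or production databases and rely on a dedicated ETL tool or scheduler.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample sources (invented for this sketch): a CSV export and a JSON feed.
CUSTOMERS_CSV = "customer_id,name\n1,Alice\n2,Bob\n3,Carol\n"
ORDERS_JSON = (
    '[{"customer_id": 1, "amount": 20.0},'
    ' {"customer_id": 1, "amount": 5.5},'
    ' {"customer_id": 3, "amount": 12.0}]'
)


def extract_customers(raw_csv):
    """Extract: parse the CSV source into a list of dicts."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def extract_orders(raw_json):
    """Extract: parse the JSON source into a list of dicts."""
    return json.loads(raw_json)


def match(customers, orders):
    """Transform: match orders to customers on customer_id and total the amounts."""
    totals = {}
    for order in orders:
        key = int(order["customer_id"])
        totals[key] = totals.get(key, 0.0) + float(order["amount"])
    return [
        (int(c["customer_id"]), c["name"], totals.get(int(c["customer_id"]), 0.0))
        for c in customers
    ]


def load(conn, rows):
    """Load: store the matched rows in a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, total REAL)"
    )
    # INSERT OR REPLACE keeps the load idempotent when the pipeline is re-run.
    conn.executemany("INSERT OR REPLACE INTO customer_totals VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Chain the stages so that the whole acquisition runs in one call."""
    rows = match(extract_customers(CUSTOMERS_CSV), extract_orders(ORDERS_JSON))
    load(conn, rows)


def top_customers(conn, limit=2):
    """Access tool: a small query helper for the rest of the data team."""
    cur = conn.execute(
        "SELECT name, total FROM customer_totals ORDER BY total DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_pipeline(connection)
    print(top_customers(connection))  # [('Alice', 25.5), ('Carol', 12.0)]
```

Each function maps to one responsibility from the list: the two extract functions gather data from different sources, `match` reconciles them, `load` handles storage, and `top_customers` stands in for a tool that gives others access to the data. Keeping the stages separate is a design choice that makes it easier to swap a source or a storage backend without touching the rest.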
It is difficult to draw up a list, even an indicative one, of the skills strictly required, because the diversity of cases would make it too long. To see why, consider a few examples: a data engineer should know the language used to analyze the data, which may be R, Python or Java (or many others). He or she will also need cloud service skills if the company has chosen AWS, Microsoft Azure or Google Cloud, as well as a good knowledge of the storage technologies suited to the data being used: a SQL-type database, or an Elasticsearch cluster…
Using Saagie could make this work less complex, but it is already clear that a job offer requiring mastery of all these technologies would not really make sense. Ultimately, beyond any precise list, the data engineer’s best skill is surely the ability to quickly learn how to use an unfamiliar technology (without becoming an expert, of course) in order to integrate a new data source if necessary. He or she can thus quickly adapt the work infrastructure that the other members of the data team need to meet business needs.