Why are data scientists not developers (and vice versa)?
Data scientists sometimes have a hard time explaining what their job consists of, beyond saying that a typical working day is made of mathematics and code.
At some point in the conversation, the other person often concludes “well, so you’re some kind of developer, right?”. If this shortcut seems inaccurate to you, reader, then you can probably stop your coffee break here. Otherwise, take a few minutes to read this post. It could be especially useful if you have (or intend to have) data scientists in your team or company.
Data scientists and developers have different work approaches
When you hire developers, you usually know what kind of software you need. Even if the way is unclear, you have a good understanding of your final goal.
The role of the developers will then be to create the path from your starting point to this final goal:
- defining technical specifications
- choosing technologies that best suit the project, the team skills and the company infrastructure
- splitting the project into modules (services) and actually implementing it.
To make it short, you have a point A and a point B, and you need someone to draw you the map. Your development team then tells you everything is under control, just like Google Maps does when your friend says “don’t worry, I know the route” while driving to the airport… (Dear developer friends, I hope you will agree with this comparison.)
On the other hand, the data scientist usually doesn’t have a precise goal in sight. They are provided with data (structured and labelled in the best case) and a direction to take.
There is no point B, only a compass that shows the direction. In the best case, the data scientist will be able to produce useful insights for the business and create value out of these data.
Data scientists and developers produce different workflows
Enough with analogies: this difference has a clear impact on working methods.
The workflow of the developer can be summed up by the branching scheme of a git repository (after all, git is one of the must-have tools in software development today):
Workflow of developer team (copyright: Atlassian)
To explain this scheme briefly:
- the master branch contains the production-ready version (i.e., a version of the software that is operational)
- the develop branch contains the software with ongoing modifications, which are implemented feature by feature
- each feature has its dedicated (and temporary) branch, which is merged back into the develop branch when the feature is ready
Finally, when enough features have been implemented, a new version, prepared for deployment, can be released on the master branch.
This is a simplified representation of the workflow of most dev teams today: developers implement each feature, and the tech lead monitors the whole merging process.
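The branching scheme described above can be sketched as a toy model in Python. This is only an illustration, not how git works internally: the `Repo` class, its method names and the feature names are all hypothetical, and each branch is reduced to a plain list of merged feature names.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy model of the branching scheme: a branch is a list of feature names."""

    branches: dict = field(
        default_factory=lambda: {"master": [], "develop": []}
    )

    def start_feature(self, name: str) -> None:
        # A feature branch starts from the current state of develop.
        self.branches[f"feature/{name}"] = list(self.branches["develop"])

    def finish_feature(self, name: str) -> None:
        # Merge the feature into develop and delete the temporary branch.
        self.branches.pop(f"feature/{name}")
        self.branches["develop"].append(name)

    def release(self) -> None:
        # Promote everything accumulated on develop to master.
        self.branches["master"] = list(self.branches["develop"])


repo = Repo()
for feature in ("login", "search"):  # hypothetical feature names
    repo.start_feature(feature)
    repo.finish_feature(feature)
repo.release()
print(repo.branches)
```

After the release, master and develop hold the same two features and no feature branch remains, which mirrors the "temporary branch" idea in the list above.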
Data scientist’s workflow
For data scientists, forget about the master or develop branch. The workflow is instead more “horizontal”: each branch can potentially become a master branch as soon as interesting results appear.
During the research process, data scientists will take several directions; many will be dead ends, while others will yield interesting results. In both cases, the direction is worth exploring!
Indeed, the work done in one direction (even a dead-end one) can often be reused in another: the data scientist likes to create a patchwork made of several pieces of code.
As a result, the code can look especially “ugly”. In my opinion, it is essential for a data scientist to take time to refactor the code once in a while. Otherwise the code becomes more complex than the problem itself.
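To give an idea of what such a refactoring can look like, here is a minimal sketch: a smoothing step of the kind that tends to be copy-pasted across experiments, extracted into a single reusable function. The function name `moving_average` and the sales figures are made up for the example.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` points."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Made-up numbers, shared by two separate experiments.
weekly_sales = [10, 12, 14, 16]
smoothed_short = moving_average(weekly_sales, window=2)
smoothed_long = moving_average(weekly_sales, window=4)
print(smoothed_short, smoothed_long)
```

Once the logic lives in one place, each new research direction can call it instead of carrying its own slightly different copy.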
How to make data scientists and developers work together?
Data scientists are very good at producing unexpected results that can potentially bring a lot of value to your product or your tools. But there is a gap between the beautiful matplotlib (a popular visualization library in Python) sales prediction curve on your data scientist's local machine and the same result running on your company's intranet with daily updates and without bugs.
It is essential to connect data scientists with the software engineering team from the beginning. Data scientists should have a good understanding of the limits of the existing infrastructure and services. There is nothing more frustrating than dedicating months to a project and finally realizing that it cannot be integrated for technical or security reasons.
When a concrete use case emerges, data scientists can start working with the developers to implement it. At this point, using tools like git is essential: although git is not designed for the data scientist's workflow, it remains a very powerful collaborative tool.
Having data scientists who are able to use these tools will be a huge step toward good teamwork. Of course, you would ideally recruit a cross-disciplinary profile for your team, someone able to understand the issues on both sides.
The good news is that data science is growing fast, and a lot of tools are being developed (including Saagie’s solutions) to help developers and data scientists work effectively together. These tools offer data scientists an infrastructure that helps them manage their projects and makes integration easier for developers.
As a final thought, it is essential to mention that even within data science teams you can have very different profiles, and this diversity can also be a key to success for your projects. Beyond management, a big challenge for you will be to understand which data analyses can really have a positive impact on your business.