
Put Open Source in your Data Projects

It has become impossible to talk about data without mentioning open source. Just look at the platforms that offer Big Data solutions: the vast majority are built around open source. And for good reason: technologies such as Cassandra, Hadoop, Apache Spark, Talend and many others now provide high-quality building blocks for data projects and have quickly become the most widely adopted choice among companies. Find out what open source is and why it is gaining ground.

Definition of Open Source

Let’s start with a reminder of the definition. The term emerged in the 1990s and has evolved considerably since. Wikipedia describes it as follows: “open source, or ‘open source code’, applies to software whose license meets criteria precisely established by the Open Source Initiative, i.e. the possibilities for free redistribution, access to the source code and the creation of derivative works”.

But the very essence of open source goes much further than simply accessing the source code of a software program.

Open Source vs. Proprietary

Unlike proprietary technologies, open source is a model in its very conception. It is about people collaborating to build a technology that everyone can use, often without deriving any personal benefit from it. To get a clearer picture and choose between the open source model and the proprietary model, let’s look at their main differences:

[Figure: open source versus proprietary comparison]

The Main Principles and Culture

So let’s go back to open source by looking at its main principles:

Collaboration: in the context of open or inner source, it has no limits. Developers around the world share their work with a wide audience, not just with their manager or team. Everyone is welcome and decisions are based on merit, not imposed.

Communication: it’s written and open to all. The goal is for everyone to be able to take part in and contribute to the project, with no prior training requirements.

The quality guarantee: this consists in setting up processes to ensure a high level of quality. The code is available to everyone and therefore optimized, tested and verified by hundreds if not thousands of contributors.

What is called “inner source” differs only in that it applies within the company itself. The notion of openness is not lost, but extends here only to the company’s own teams, allowing it to benefit from the creativity and ideas of its employees without compromising the private nature of its activity.

Open Source Technologies

Saagie is not alone in believing in open source. Many specialists, such as Kafka co-creator Jay Kreps, now at Confluent, and Mike Tuchen, former CEO of Talend, see the future of Big Data in open source technologies.

It has almost become the default choice in certain areas, such as Big Data stores, advanced Data Science, and Machine Learning languages and frameworks.

Open source software has always competed against proprietary software, whose main drawback is “lock-in”, i.e. the difficulty of switching away from a product once you have adopted it.

A white paper from Ovum highlights this risk and concludes that the most suitable practice for companies would be a hybrid approach, with the right balance between open source and proprietary. In fact, many companies say they are ready to turn to open source, provided they have the proper support.

Why Is Open Source Trending Right Now?

Open source is more than a trend: it is now an undeniable reality, and it no longer affects software development alone. It opens up the prospect of a society more focused on collaboration and transparency, a subject at the heart of today’s societal concerns. All well and good, but there are also much more concrete benefits than those mentioned above.

The Price Criterion

The first, and most obvious, reason is cost. Open source software is essentially free to acquire: there are no licensing, subscription or royalty fees attached to the source code. A company can therefore significantly reduce both its expenses and the time needed to release a product.

The less time and money you spend building your product, the lower its price can be, and the more your customers save.

But even though it may seem completely free, open source still involves costs related to integration, support and updates. These are often handled by the IT department or covered by annual subscriptions offered by the vendors of these technologies themselves. Depending on the maturity of your open source project, security or scaling costs may be added.

In spite of these constraints, the cost remains the first advantage of open source. But beyond this issue, there are other reasons that may lead you to turn to it.

Community-Wide Access

Behind every open source technology there is a community united around one objective: to achieve the most powerful, functional, reliable and secure solution.

Developers in these communities are passionate about what they do and are motivated by the recognition that other members show for their work. The resulting solutions are therefore often of better quality, even when less time was spent developing them.

For example, the scikit-learn Python library is free and used all over the world to tackle Machine Learning problems. It offers a large choice of pre-packaged algorithms that are the delight of many Data Science projects. Launched in 2007 as a Google Summer of Code project by David Cournapeau, the library as we know it today is the result of many hours of work from the community.
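To give an idea of what “pre-packaged” means in practice, here is a minimal sketch, assuming scikit-learn is installed. It trains a k-nearest neighbors classifier on the iris dataset bundled with the library. The function name `train_and_score` and the parameter values are ours, chosen purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def train_and_score(test_size=0.25, seed=0):
    """Hypothetical helper: fit a ready-made classifier and return its test accuracy."""
    # Load a small example dataset that ships with scikit-learn.
    X, y = load_iris(return_X_y=True)

    # Hold out part of the data to evaluate the model on unseen samples.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

    # The algorithm itself is provided by the library; we only configure it.
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)

    # score() returns the mean accuracy on the held-out samples.
    return model.score(X_test, y_test)


if __name__ == "__main__":
    print(f"Test accuracy: {train_and_score():.2f}")
```

Swapping in another algorithm is usually a matter of changing the import and the line that builds `model`, since scikit-learn estimators share the same `fit` and `score` interface.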

Freedom

Another distinguishing feature of open source is that we can decide to change technologies whenever we want.

Another advantage is that we can build directly on the work of others, since we have access to the architecture of the solution in question. We can therefore reuse code and save time, skills and, once again, money.

Support Responsiveness

Thanks to the size of the different communities, it is easy to find a solution to your problem, since many tools, plugins and pieces of code are available. There is no longer any need to wait for a release announcement to add or access new features.

Moreover, it is not uncommon to want or need new features and to realize that they already exist.

However, some still complain that there is no number to call in case of need or problem. Even though open source technologies benefit from extensive access to documentation, wikis, forums and active communities, it is still possible to opt for paid support. This type of support can be more personalized and will therefore be able to respond more quickly to specific needs.

It is now time to take a look at Python and R, two open source programming languages that are widely used in Data Science, and at their major differences.

Which Programming Language to Use When Launching a Data Science Project?

Python and R are free languages whose code is developed entirely by their communities.

Both languages are widely used by data scientists, developers and data miners. Python has been named language of the year by the TIOBE index (an indicator of the popularity of programming languages), and R is ranked 12th.

R

This language focuses on data analysis, statistics and graphical models. It presents itself as user-friendly. It is widely used by academics and researchers, but also increasingly by statisticians.

“The closer you are to statistics, research and Data Science, the more you will go to R.” Theuwissen

Python

Python helps you be productive and write readable code. It is favored by engineers, who face more deployment constraints, and it is very widely adopted by developers.

“Those who work in an engineering-like environment will tend to prefer Python.”
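As a small illustration of that readability, here is a sketch that uses only Python’s standard library to summarize a list of measurements. The function name `summarize` and the sample values are ours, for illustration only.

```python
from statistics import mean, median


def summarize(values):
    """Hypothetical helper: return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    # Made-up sample values, purely for demonstration.
    response_times_ms = [120, 95, 130, 110, 250]
    for name, value in summarize(response_times_ms).items():
        print(f"{name}: {value}")
```

The code reads almost like the sentence describing it, which is a large part of why Python is appreciated in engineering teams.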

At Saagie, we have chosen open source. Our solution has been designed to integrate the widest possible range of open source technologies (the R and Python we mentioned, but also Spark, Scala and many others), always kept up to date, so that you can easily develop and deploy applications for business use.