The Saagie Platform provides best-of-breed data technologies.
All of these technologies come preconfigured, tuned and ready to use. The lineup covers not only big data technologies but also classical data frameworks.
Apache Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed, scalable and portable file system developed by the Apache Software Foundation.
You can store terabytes of data simply by adding commodity servers. HDFS also tolerates server failure: by default, every block of data is replicated three times across the cluster.
Apache Impala
Impala is an open source analytic query engine that runs on Apache Hadoop.
We love Impala because it is one of the fastest SQL engines on Hadoop. You can process terabytes of data with nothing more than basic SQL.
Apache Hive
Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
In our experience, Hive is the most stable SQL query engine of the lot.
Apache Drill
Ready for data exploration? Drill lets you query heterogeneous data sources with a single SQL query: HDFS, MongoDB, Hive or Elasticsearch.
Apache Spark
Spark is an open source cluster computing framework for processing huge data volumes, originally developed at UC Berkeley's AMPLab and now maintained by the Apache Software Foundation, with major contributions from Databricks.
Spark is the best tool for distributing machine learning across a cluster of servers. The other benefit is that you can cover the whole data pipeline with a single technology. We have supported every version of Spark since 1.5.
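Spark expresses a pipeline as chained transformations on a dataset (flatMap, map, reduceByKey and friends). As a minimal, cluster-free sketch of that style, here is the classic word count written with only the Python standard library; the input lines are invented for the example, and a real job would use the PySpark API against the cluster.

```python
# Illustrative input standing in for lines read from the data lake.
lines = ["big data is just data", "data is data"]

# flatMap-style step: split every line into words.
words = [w for line in lines for w in line.split()]

# map-style step: pair each word with a count of 1.
pairs = [(w, 1) for w in words]

# reduceByKey-style step: sum the counts per word.
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)
```

In PySpark, the three commented steps map one-to-one onto `flatMap`, `map` and `reduceByKey` calls on an RDD, which is what lets Spark run the same logic across a cluster.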
Apache Sqoop
Sqoop is a command-line application, developed by the Apache Software Foundation, for transferring data between relational databases and Hadoop.
If you need to import a SQL database from Oracle, SQL Server, MySQL or PostgreSQL, simply use Sqoop, and your data will be imported into your data lake.
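Sqoop itself is driven from the command line, but the idea is simple: pull rows out of a relational table and land them as flat files. As a rough, runs-anywhere stand-in (no Hadoop involved), here is the same idea using Python's built-in sqlite3 and csv modules; the table, columns and file name are invented for the example.

```python
import csv
import sqlite3

# A throwaway relational table standing in for an Oracle/MySQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])

# "Import" the table into a delimited file, the way Sqoop lands data on HDFS.
with open("customers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])
    writer.writerows(conn.execute("SELECT id, name FROM customers ORDER BY id"))

conn.close()
```

The real `sqoop import` command does the equivalent over JDBC, in parallel, straight onto HDFS.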
Talend
Talend is a complete suite of open source software for extracting and integrating data.
Don't forget that big data is just data. If you are a data consultant, Talend will be your best friend for ingesting data and computing aggregations.
Java and Scala
Java and Scala jobs let you process data directly on the JVM.
Don't forget that big data is just data. If you are a developer, you can write data ingestion and aggregation jobs in Java or Scala. Java 7 and 8 are supported.
R
R is a programming language and an environment for statistical data analysis.
Use R to build tailor-made algorithms and statistical computations. In case you hadn't noticed, R has been on the rise for the last three years and is set to replace SAS in many companies.
Python
Python is a programming language that lets you work quickly and integrate systems effectively.
Python has been used for years in university data science laboratories, and it offers some of the most complete and stable data science libraries.
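Even before reaching for the big data science libraries, the standard library already ships statistical building blocks. A tiny taste, with invented revenue figures:

```python
import statistics

# Monthly revenue figures (illustrative numbers only).
revenue = [120.0, 135.5, 128.0, 150.25, 142.0]

mean = statistics.mean(revenue)
stdev = statistics.stdev(revenue)  # sample standard deviation

print(f"mean={mean:.2f} stdev={stdev:.2f}")
```

Libraries such as pandas and scikit-learn build the full data science toolchain on top of this kind of foundation.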
Jupyter Notebooks
We provide several versions of the Jupyter notebook, bundled with best-of-breed libraries from each language ecosystem (Python, R, Scala, Spark, Ruby, Haskell and Julia).
Notebooks let you test processing and machine learning algorithms against the real data lake. You can share your notebook files (including charts and maps) with your teammates to gather feedback.
Docker
Docker lets you deploy dedicated dataviz applications or APIs. You can also deploy specific processing (Fortran, C++, Go, Rust) or any Dockerfile as a notebook or a dedicated data application.
Two benefits: first, we keep your container running so you can focus on your code; second, you can test anything that fits into Docker on the Saagie Platform.
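As a hedged sketch of what "anything that fits into Docker" means in practice, a minimal Dockerfile for a small Python dataviz API could look like this; the base image, file names and port are illustrative, not a Saagie requirement.

```dockerfile
FROM python:3-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["python", "app.py"]
```

Anything that builds from a Dockerfile like this can be shipped as a container.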
MongoDB
MongoDB is a cross-platform document-oriented database.
We love using MongoDB as a datamart because its schema is flexible and it is easy for dataviz app developers to work with.
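What makes the schema flexibility useful for a datamart is that documents in the same collection need not share the same fields. As a local stand-in (plain Python dicts, no MongoDB server, all names invented), the idea looks like this:

```python
# Two documents in the same "collection" with different fields --
# exactly what a rigid SQL schema would reject.
sales_datamart = [
    {"country": "FR", "revenue": 1200, "top_product": "A"},
    {"country": "DE", "revenue": 900},  # no top_product field: still fine
]

def find(collection, **criteria):
    """Minimal Mongo-style find(): match documents on field equality."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(sales_datamart, country="DE"))
```

With the real database, the equivalent call is `collection.find({"country": "DE"})` through a driver such as PyMongo.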
MySQL
MySQL Community is the world's most popular open source database.
Sometimes you just need a plain old SQL database to store your results.