MapReduce is a framework for processing massive amounts of data in parallel, either on a single machine or distributed across a cluster of nodes. It is the programming model widely used in the Hadoop ecosystem for data processing. MapReduce breaks a job down into a number of map and reduce tasks, spreads them across a cluster of computers, and has built-in fault tolerance to recover any failed tasks. MapReduce abstracts away the complexity inherent in distributed computing and presents the user with a framework that lets them focus on the data problem at hand.
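The map, shuffle, and reduce phases can be sketched as a local simulation in plain Python; the real framework runs these phases as tasks on many nodes, but the data flow is the same:

```python
from collections import defaultdict

# A minimal local sketch of the MapReduce model, using word counting as the
# classic example. In real MapReduce each phase runs as distributed tasks.

def map_phase(document):
    # Map task: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key before reduction.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce task: aggregate all values emitted for one key.
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["the"])  # -> 2
```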
Pig is a platform developed to make it easier to build data cleansing and processing workflows in the Hadoop ecosystem. Unlike MapReduce, where every data problem must be reshaped to fit the map and reduce tasks, Pig offers a high-level language called Pig Latin. Pig Latin allows data processing applications to be written using general operations like joins, filters, and groupings. Pig ultimately translates Pig Latin operations into a series of MapReduce tasks. Pig lets users load data from Hadoop HDFS, perform ad-hoc analysis, and store processed data back to HDFS.
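As a sketch of the style (the file paths and field names here are hypothetical), a Pig Latin script that filters, groups, and aggregates might read:

```pig
-- Hypothetical Pig Latin script: load, filter, group, and aggregate.
-- Pig compiles these operations into a series of MapReduce jobs.
logs    = LOAD 'hdfs:///logs/access.log' AS (user:chararray, bytes:long);
big     = FILTER logs BY bytes > 1024;
by_user = GROUP big BY user;
totals  = FOREACH by_user GENERATE group, SUM(big.bytes);
STORE totals INTO 'hdfs:///logs/totals';
```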
Even though Pig is a powerful, reasonably intuitive, high-level language for defining MapReduce jobs in the Hadoop ecosystem, it is unfamiliar to many existing users and involves a learning curve. Hive solves this problem by presenting a very familiar SQL-like interface on top of Hadoop HDFS, making it possible to run interactive, ad-hoc queries on data stored in HDFS as well as define intricate data processing workflows without the need for complex ETL jobs. The SQL-like language supported by Hive is referred to as HiveQL.
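To sketch what HiveQL looks like (the table and column names here are hypothetical), a query over log data stored in HDFS might read:

```sql
-- Hypothetical HiveQL: familiar SQL syntax that Hive compiles into
-- MapReduce jobs over files stored in HDFS.
CREATE TABLE access_logs (user STRING, bytes BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SELECT user, SUM(bytes) AS total_bytes
FROM access_logs
GROUP BY user
ORDER BY total_bytes DESC
LIMIT 10;
```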
Hive is not a relational database and supports only a subset of ANSI SQL; it is instead optimized for query throughput on very large datasets by executing queries across a cluster of nodes. Hive is not designed for low-latency queries where only a subset of the data needs to be returned with a very short turnaround time, which makes it less suitable for traditional data warehousing applications that expect fast, interactive responses.
Scalding is a Scala extension of Cascading, a framework on top of Hadoop MapReduce that makes it more general purpose. Scala is a high-level functional programming language that runs on the JVM; because of Scala’s support for the functional programming paradigm, along with its type inference, pattern matching, and implicit conversion features, it does not require all the boilerplate that Java code usually does. This makes it much simpler to write code in Scalding for the Hadoop environment.
Scalding was developed at, and is maintained by, Twitter.
Spark is an easy-to-use distributed computing framework designed to be fast and general purpose, meant for large-scale data processing.
Spark leverages in-memory storage and computation, making it fast in terms of processing speed; for certain types of applications Spark can be 100x faster than Hadoop MapReduce. Spark also offers high-level APIs in Java, Scala, Python, and R, making it fast in terms of developing considerably complex applications with far less code.
Spark is also designed to handle a variety of workloads, ranging from batch and interactive to iterative and streaming jobs, within a single framework, making it one of the most general-purpose distributed computing frameworks available.
Spark also offers libraries for machine learning, interactive SQL-like queries, real-time stream processing, and graph processing, all on a single unified platform.
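The flavor of Spark's high-level API can be sketched with plain Python built-ins. This is not actual PySpark; in real Spark code the data would be an RDD or DataFrame distributed across the cluster, but the chain of lazy transformations followed by an action is the same idea:

```python
from functools import reduce

# Plain-Python sketch of the transformation style exposed by Spark's
# high-level APIs: lazy map/filter transformations, then an action.

data = range(1, 11)                            # stand-in for a distributed dataset
squares = map(lambda x: x * x, data)           # transformation (lazy)
evens = filter(lambda x: x % 2 == 0, squares)  # transformation (lazy)
total = reduce(lambda a, b: a + b, evens)      # action: forces evaluation
print(total)  # -> 220
```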
Parquet is a self-describing, columnar, compressed, binary storage format for structured data in the Hadoop ecosystem. Parquet is independent of any particular data processing framework, data model, or programming language, making it a very versatile storage format.
Parquet supports partitioning data vertically as well as horizontally, which not only makes structured data easier to compress and store but also allows only the desired blocks of data to be retrieved very quickly by means of its projection and predicate pushdown mechanisms. The salient features of the Parquet file format are interoperability, space efficiency, and query performance.
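Why the columnar layout helps can be sketched in plain Python (real Parquet adds encoding, compression, row groups, and file metadata on top of this idea):

```python
# Sketch of columnar storage with projection and predicate pushdown.
# The row and column values here are made up for illustration.

rows = [
    {"id": 1, "city": "NYC", "sales": 100},
    {"id": 2, "city": "LA",  "sales": 250},
    {"id": 3, "city": "NYC", "sales": 300},
]

# Columnar layout: one contiguous list per column instead of row records.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# Projection pushdown: read only the columns the query actually needs.
sales = columns["sales"]

# Predicate pushdown: drop values that fail the filter before
# materializing any full rows.
big_sales = [s for s in sales if s > 150]
print(sum(big_sales))  # -> 550
```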
HBase is a distributed, column-oriented NoSQL database that resides on top of Hadoop HDFS. Belonging to the NoSQL class of databases, HBase does not support relational database features such as multi-row transactions, full ACID guarantees, or SQL. Most HBase applications need to be written in Java, though it does support a limited amount of read-only querying via Hive.
HBase does offer high-throughput reads and writes and supports random-access queries and record-level updates, which HDFS does not.
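HBase's data model can be sketched as a map of row key to column family to column to value; the sketch below uses plain Python dictionaries and made-up row keys (real HBase also versions every cell by timestamp and keeps rows sorted by key):

```python
# Conceptual sketch of HBase's row-key / column-family / column model,
# showing random access and record-level updates.

table = {}

def put(row_key, family, column, value):
    # Write or overwrite a single cell in place (a record-level update).
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

def get(row_key, family, column):
    # Random-access read of a single cell by its coordinates.
    return table.get(row_key, {}).get(family, {}).get(column)

put("user#42", "info", "name", "Ada")
put("user#42", "info", "name", "Ada L.")   # update the record in place
print(get("user#42", "info", "name"))      # -> Ada L.
```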
Cassandra is an open source, massively distributed NoSQL database with the ability to span multiple data centers across the globe while still presenting the user with a single, unified logical view of its data. Cassandra has a masterless architecture with no single point of failure and is resilient even to the loss of an entire data center.
Cassandra stores data as key-value pairs and offers high-throughput reads and writes.
DataStax is a certified distribution of open source Apache Cassandra, integrated with analytics on Cassandra using MapReduce, Hive, Pig, Mahout, Sqoop, and Spark. DataStax also comes bundled with search from Apache Solr, automatic cluster management, visual DevOps tools, and expert support.
MongoDB is a NoSQL database that uses a document-oriented data model. MongoDB stores all of its data as JSON-style documents, leveraging the structure inherent in JSON. Related documents are organized into collections, and each document is stored as a set of key-value pairs. MongoDB allows documents to be distributed across several nodes in a cluster, which supports scalability, and it offers a rich JavaScript-like query and scripting language.
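The document model can be sketched with plain Python dictionaries standing in for JSON documents (the collection contents and the `find` helper here are illustrative, not MongoDB's actual client API, though real drivers expose a similar query-by-example interface):

```python
# Sketch of the document model: JSON-like documents grouped into a
# collection, queried by matching key-value pairs.

users = [                                   # a "collection" of documents
    {"_id": 1, "name": "Ada",  "langs": ["python", "scala"]},
    {"_id": 2, "name": "Alan", "langs": ["java"]},
]

def find(collection, query):
    # Return documents whose fields match every key-value pair in the query.
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

print(find(users, {"name": "Ada"})[0]["_id"])  # -> 1
```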
Apache Sqoop is designed to move data between Hadoop and structured data sources such as relational databases, enterprise data warehouses, and NoSQL databases.
Flume is a scalable, fault-tolerant, and highly available service for collecting and moving large amounts of log data, with a simple, flexible architecture based on streaming data flows. It depends on ZooKeeper for its high availability.
Kafka is a distributed message queue that follows the producer-consumer architecture, providing high-throughput, persistent messages that can be produced as well as consumed in parallel. Messages can span multiple nodes, making Kafka highly scalable.
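The persistent, replayable nature of Kafka's queue can be sketched as an append-only log with per-consumer offsets; this single-process sketch is illustrative only (real Kafka spreads partitioned logs across brokers):

```python
# Sketch of Kafka's log abstraction: producers append messages, and each
# consumer tracks its own offset, so messages persist and can be re-read
# independently by multiple consumers.

log = []                       # one partition: an append-only message log
offsets = {}                   # consumer name -> next position to read

def produce(message):
    log.append(message)        # messages are persisted, not removed on read

def consume(consumer):
    pos = offsets.get(consumer, 0)
    batch = log[pos:]
    offsets[consumer] = len(log)
    return batch

produce("event-1")
produce("event-2")
print(consume("analytics"))    # -> ['event-1', 'event-2']
print(consume("audit"))        # -> ['event-1', 'event-2']  (independent offset)
```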
Hadoop MapReduce is specifically meant for batch processing of massive quantities of data, and it is not particularly suited to processing event-based transactions in real time. Apache Storm is an open source, distributed, real-time stream processing engine. Storm makes it easy to process event-based streaming data in a resilient, distributed way and supports a wide choice of programming languages.
Spark Streaming is a library written in Scala on top of Spark’s core API that enables Spark to process streaming data in real time. Being a library on top of Spark Core, it is tightly coupled with Spark’s other libraries, including Spark SQL and MLlib, and it inherits all of Spark’s features such as scalability, fault tolerance, and high throughput. Spark Streaming can also reuse code from Spark batch applications for processing live data in real time.
Spark Streaming provides data ingestion mechanisms for all the popular sources, including Kafka, Twitter, ZeroMQ, Kinesis, TCP sockets, and log files.
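The micro-batch idea behind Spark Streaming, and why batch code can be reused on live data, can be sketched in plain Python (the stream contents and batch size here are made up; real Spark Streaming slices the stream by time interval, not element count):

```python
# Sketch of micro-batching: an incoming stream is sliced into small
# batches, and the same batch-processing function is applied to each.

def count_words(batch):
    # The same function could serve a one-off batch job or a live stream.
    return sum(len(line.split()) for line in batch)

stream = ["spark streaming", "micro batch model", "reuses batch code"]
batch_size = 2
totals = []
for i in range(0, len(stream), batch_size):
    micro_batch = stream[i:i + batch_size]
    totals.append(count_words(micro_batch))
print(totals)  # -> [5, 3]
```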
Apache Mahout is a library of machine learning algorithms for building scalable machine learning applications in the Hadoop environment. Mahout provides data science tools to find meaningful patterns in big data and implement predictive analytics at scale.
Spark’s machine learning library, MLlib, consists of data types, common machine learning algorithms, and utilities for implementing machine learning on top of Spark at scale. It covers the common supervised and unsupervised learning techniques such as classification, regression, clustering, collaborative filtering, and dimensionality reduction. The latest versions of Spark also come bundled with robust feature engineering and feature scaling algorithms.
Like any other Spark library, MLlib integrates tightly with the other Spark libraries such as Spark SQL and Spark Streaming. MLlib also provides utilities for storing predictive models as Parquet files and has built-in support for the Predictive Model Markup Language (PMML), so models can be exported to external tools.
Impala is an open source analytic database for Hadoop created by Cloudera to mitigate the latency issues of Hive. Impala completely bypasses MapReduce and queries HDFS directly, with support for traditional SQL-like syntax and built-in data-level security. It is well suited to data warehousing applications.
Spark SQL is an extension of Spark Core that provides a SQL-like interface for writing data processing applications. Spark SQL also supports Hive and HiveQL, but unlike Hive, Spark SQL can handle both processing of massive amounts of data spread across a cluster and low-latency queries working on only a subset of the data.
Spark SQL can also be intermixed with the Java, Python, and Scala APIs, and starting with recent versions Spark SQL also offers a DataFrame API similar to R’s or Python’s data frames. Spark SQL also comes with an ODBC driver, exposing Spark’s fast data processing engine to third-party apps and BI tools.
Elastic MapReduce is a web service offering from Amazon that makes it easy to quickly establish a Hadoop environment in the cloud in a cost-effective way. It sets up Hadoop across AWS EC2 instances and can dynamically allocate more or fewer resources depending on the workload, minimizing costs by optimizing resource usage.
Databricks is a company established by the creators of Apache Spark, and Databricks Cloud (DBC) is their offering of a fully managed version of Apache Spark in the cloud. Databricks Cloud makes launching a Spark cluster very easy and provides notebooks in Scala, Python, and R (similar to IPython notebooks) that make the data science process, from data ingestion through data wrangling to predictive modeling and analytics, seamless.
Databricks Cloud also offers management tools in the cloud, along with support for production pipelines and third-party app connectivity.
Lucene is an open source search library written in Java by Doug Cutting, the creator of Hadoop. Lucene makes it easy to add search functionality to any website and can index any rich-text file as long as its text can be extracted. Many other search servers later emerged based on the Lucene core.
Solr is a fast, open source enterprise search platform written in Java and based on Apache Lucene. Solr is scalable, reliable, and fault-tolerant, supporting full-text search, real-time indexing, database and NoSQL integration, and REST-style HTTP/XML and JSON APIs. Solr has now been merged into the Lucene project itself.
ElasticSearch is an open source, distributed, multi-tenant search server based on Apache Lucene and developed in Java. It is capable of full-text search, offers fast response times based on Lucene’s inverted index, and supports a RESTful API and schema-less JSON documents.
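The inverted index at the heart of Lucene-based engines can be sketched in a few lines of Python (the documents here are made up for illustration): map each term to the set of documents containing it, so a query becomes a fast lookup instead of a scan over every document.

```python
from collections import defaultdict

# Sketch of an inverted index: term -> set of document ids containing it.

docs = {
    1: "distributed search server",
    2: "full text search engine",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(sorted(index["search"]))  # -> [1, 2]
print(sorted(index["engine"]))  # -> [2]
```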
DataStax, the certified distribution of Apache Cassandra, also delivers the ability to quickly search for any data stored in the underlying Cassandra database through its enterprise search capabilities built on Apache Solr.