In my previous post I shared the origins of the Data Lake pilot within the EMC Open Innovations Lab. Based on those criteria, we decided we needed to build a new analytics environment that would allow for real-time data processing and the ability to compare it to historical data. Add into the mix that data will be ingested in 1-second increments at large volumes.
The picture above shows the Pivotal HD architecture that appears in most presentations and blog posts about Pivotal. There are four main products. Pivotal Hadoop supplies YARN, ZooKeeper, HDFS (in our pilot this is provided by Isilon), HBase, Pig, Hive, Mahout, MapReduce (v1 for legacy; v2 is provided by YARN), and the Hadoop virtualization extensions. HAWQ provides SQL and resides on its own server; in our pilot, HAWQ segments are installed on every node that has a Hadoop YARN role. Command Center uses ICM to deploy all of the software in the architecture and manages the whole environment. GemFireXD handles in-memory data processing with persistent data located in HDFS. Finally, at the bottom of the lake there is Isilon, which runs the NameNode and DataNode services for HDFS.
Spring XD gets its own VM, where we write the adapters that ingest data. We will not be using Sqoop, Data Loader, Flume, or Unified Storage Services.
As we designed the architecture, we decided to favor functionality over performance for phase 1. Phase 2 of this project will look at the performance of the analytics environment, including bringing in data scientists to optimize how the data is processed. For our servers we decided that virtualization was the way to go. I'm a big believer in the virtualize-everything strategy, and with Hadoop I have yet to be shown any real downside. Virtualizing the environment also gives us the added bonus of elasticity. Here is a great wiki page from Apache on virtualizing Hadoop: http://wiki.apache.org/hadoop/Virtual%20Hadoop
As noted in the wiki, virtualizing HDFS has some downsides. To address this, we decided to use EMC Isilon for HDFS. Isilon implements the HDFS NameNode and DataNode interfaces to host the data and metadata, which creates a shared storage model that allows for multi-tenancy, including multiple Hadoop clusters and other HDFS applications accessing the same data set. Accessing HDFS, for Hadoop or any other application, is as simple as pointing the HDFS clients at the DNS name of the Isilon cluster. Using an Isilon shared storage model for Hadoop has many benefits. With traditional HDFS on DAS, the storage-to-compute ratio is fixed: adding more compute means adding more storage. By separating the two, we can elastically scale our compute (using VMware) and our storage (on Isilon) independently.
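To make "point the clients at the Isilon DNS name" concrete, here is a minimal sketch using the standard Hadoop FileSystem API. The hostname and port are placeholders, not the pilot's actual values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IsilonHdfsClient {
    public static void main(String[] args) throws Exception {
        // Point the HDFS client at the Isilon cluster's DNS name.
        // (Hostname and port below are placeholders, not our real values.)
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://isilon.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // List the root of the shared HDFS namespace served by Isilon.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  " + status.getLen());
        }
        fs.close();
    }
}
```

Any HDFS-aware application, not just the Hadoop nodes, can reach the same namespace this way.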
For real-time analytics we went with GemFireXD. GemFireXD is an in-memory data grid that enables real-time analytics with the ability to persist the data to HDFS. Data will be ingested into the grid using adapters created in Spring XD. GemFireXD allows data to be kept in memory for a predefined amount of time; when that time is reached, it is flushed to HDFS via the NameNode. We will use a four-node grid, with each VM having 120 GB of memory. While in memory, the data is also available to be queried by HAWQ.
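As a rough sketch of how an application sees the grid, the following JDBC snippet creates a table that persists to HDFS and writes a reading into it. The locator host, table, and columns are hypothetical, and the HDFSSTORE/PERSISTENT DDL follows the general pattern in the GemFireXD documentation rather than our final schema, so treat the exact clauses as assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class GemFireXDIngestSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical locator host/port; assumes the GemFireXD client JDBC
        // driver is on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:gemfirexd://gemxd-locator:1527/")) {
            try (Statement stmt = conn.createStatement()) {
                // Assumed DDL: an HDFS store pointing at the Isilon-backed
                // namespace, then a table that persists to it. The clause
                // names follow the GemFireXD docs, not our final schema.
                stmt.execute("CREATE HDFSSTORE sensorstore "
                        + "NAMENODE 'hdfs://isilon.example.com:8020' "
                        + "HOMEDIR '/gemxd/sensorstore'");
                stmt.execute("CREATE TABLE sensor_readings ("
                        + "  reading_ts TIMESTAMP, sensor_id VARCHAR(32), value DOUBLE) "
                        + "PARTITION BY COLUMN (sensor_id) "
                        + "PERSISTENT HDFSSTORE (sensorstore)");
            }

            // Ingest one reading (in the pilot this happens via the Spring XD adapters).
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO sensor_readings VALUES (CURRENT_TIMESTAMP, ?, ?)")) {
                ps.setString(1, "sensor-42");
                ps.setDouble(2, 98.6);
                ps.executeUpdate();
            }

            // Query the operational (in-memory) data.
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT sensor_id, value FROM sensor_readings")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " = " + rs.getDouble(2));
                }
            }
        }
    }
}
```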
For Hadoop we went with Pivotal HD (PHD). All Hadoop distributions are based on the same code base from Apache, and since in-memory data processing was a must, the integration between GemFireXD and HAWQ was the key differentiator. HAWQ adds the ability to run SQL queries against data that resides in memory (GemFireXD) and data persisted in HDFS. The PHD cluster will have 18 nodes running the following roles:
Role | Server
PCC | PCC
Name/Data Node | Isilon2
YARN Resource Manager/History Server | PHDnode1
Node Manager | PHDnode3-16
ZooKeeper | PHDnode3,5,7
Hive | PHDnode1
Hive Metastore | PHDnode2
HBase Master | PHDnode2
Region Server | PHDnode3-16
HAWQ Master | Hawq
HAWQ Segment | PHDnode3-16
HAWQ allows SQL queries to be performed on the data. It is installed on its own node and integrates with Hadoop and HDFS through segments installed on the YARN nodes. Using the Pivotal Extension Framework (PXF), SQL queries can also be performed on data in the GemFireXD cluster.
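To illustrate what that access path looks like from a client, here is a hedged sketch that connects to HAWQ over its PostgreSQL-compatible JDBC interface and maps HDFS data in through a PXF external table. The host, credentials, path, and profile are illustrative assumptions, not our pilot's configuration; GemFireXD data would be exposed the same way through PXF's GemFireXD support.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HawqQuerySketch {
    public static void main(String[] args) throws Exception {
        // HAWQ speaks the PostgreSQL wire protocol, so a stock PostgreSQL JDBC
        // driver can connect. Host, port, database, and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://hawq-master:5432/pilot", "gpadmin", "changeme");
             Statement stmt = conn.createStatement()) {

            // Illustrative PXF external table over delimited files in the
            // Isilon-backed HDFS namespace. The pxf:// URI, port, and profile
            // follow the documented PXF pattern but are assumptions; a
            // GemFireXD-specific profile would be wired up the same way.
            stmt.execute("CREATE EXTERNAL TABLE sensor_readings_hdfs ("
                    + "  reading_ts TIMESTAMP, sensor_id VARCHAR(32), value FLOAT8) "
                    + "LOCATION ('pxf://isilon.example.com:50070/data/sensor_readings?PROFILE=HdfsTextSimple') "
                    + "FORMAT 'TEXT' (DELIMITER ',')");

            // Standard SQL from HAWQ across the externally mapped data.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT sensor_id, avg(value) FROM sensor_readings_hdfs GROUP BY sensor_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " avg=" + rs.getDouble(2));
                }
            }
        }
    }
}
```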
So what does this all look like? Here is a Visio diagram of the environment.
Note that we are still deciding on the best way to ingest data and haven't fully settled on the analytic applications to use; I'll discuss this more in future posts. The threading between map jobs and HDFS also has to be tuned, so that is not set in stone either and we will adjust it during testing.
And in the virtual environment it looks like this:
The next blogs will show how to install and integrate all of these components.