So what is VMware Big Data Extensions (BDE)? BDE started its life as Serengeti, an open source project created by VMware to automate the deployment of Hadoop in virtual environments. The open source Project Serengeti is available on GitHub:
https://github.com/vmware-serengeti
or you can download the enterprise version from VMware:
http://www.vmware.com/go/download-bigdataextensions
Although BDE is an enterprise-supported application from VMware, it is a free download.
So what does BDE give you?
First, it allows you to deploy Hadoop clusters within a virtual environment. Cluster deployment is automated using a Chef engine and includes all the major Apache Hadoop components: HDFS, MapReduce, Pig, Hive, ZooKeeper, and YARN. Hadoop is an Apache project (http://hadoop.apache.org/), and BDE can deploy all the major Hadoop distributions, including Apache, Pivotal HD, Cloudera, Hortonworks, and MapR.
Out of the box, BDE is configured to deploy the Apache distribution with no additional setup. Other distributions are added either as tarballs or by creating a local YUM repository with the correct RPMs; future blog posts will show you how to set this up. What is really cool is that support for multiple distributions lets a newcomer to the world of Big Data deploy and test different distributions without knowing anything about Hadoop. It can be leveraged by IT to deliver their first big data projects, or more advanced Hadoop-as-a-Service (HaaS) offerings.
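To make the idea of registering another distribution concrete, here is a minimal sketch of the kind of entry you would describe to BDE: a distro name, version, and where its packages live (tarball URLs or a local YUM repository). The field names below are illustrative only, not the exact BDE manifest schema.

```python
# Hypothetical distro registration entry -- field names are illustrative,
# not the exact schema BDE uses internally.
import json

new_distro = {
    "name": "cdh",                       # hypothetical identifier for the distro
    "vendor": "Cloudera",
    "version": "5.x",
    "packages": [
        {
            "roles": ["hadoop_namenode", "hadoop_datanode"],
            # Point at a local YUM repository (or tarball URLs) hosting the RPMs.
            "package_repos": ["http://yum.example.local/cdh/"],
        }
    ],
}

print(json.dumps(new_distro, indent=2))
```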
BDE automates deployments using Chef (http://www.opscode.com/chef/). Chef is an automation framework designed to easily deploy servers or applications to physical, virtual, or cloud environments. It is built in Ruby and uses a Domain Specific Language (DSL) to write “recipes” and “cookbooks”. BDE ships with the recipes, cookbooks, and JSON files preconfigured for the supported Hadoop distributions. The only information the user has to provide is the install RPMs or tarball files for the Hadoop distribution, plus the cluster variables: node size (CPU/disk/memory) and cluster type (basic, HBase, data/compute separation, or compute only). If you’ve ever wanted to try out automation technologies like Puppet or Chef, BDE gives you a great example of the power of these cloud-based automation tools.
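To give a feel for those cluster variables, here is a sketch of the kind of JSON cluster specification BDE/Serengeti consumes, expressed as a Python dictionary. The exact field names in a real spec file may differ; this only illustrates what the user supplies: node groups, node sizing, and roles.

```python
# Illustrative cluster specification (field names approximate a Serengeti-style
# JSON spec; treat them as assumptions, not the definitive schema).
import json

cluster_spec = {
    "name": "demo-hadoop",
    "distro": "apache",                      # or a locally registered distro name
    "nodeGroups": [
        {
            "name": "master",
            "roles": ["hadoop_namenode", "hadoop_resourcemanager"],
            "instanceNum": 1,
            "cpuNum": 4,                     # vCPUs per node
            "memCapacityMB": 8192,
            "storage": {"type": "SHARED", "sizeGB": 50},
        },
        {
            "name": "worker",
            "roles": ["hadoop_datanode", "hadoop_nodemanager"],
            "instanceNum": 5,
            "cpuNum": 2,
            "memCapacityMB": 4096,
            "storage": {"type": "LOCAL", "sizeGB": 100},
        },
    ],
}

print(json.dumps(cluster_spec, indent=2))
```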
Another great feature is its elasticity: it allows compute and data to be separated without losing locality. One of the things I’ve learned about Hadoop clusters is the imbalance between CPU and storage consumption; it is not uncommon to find clusters running at only 10-15% CPU utilization. By separating these two components you can get better utilization of your environment, scaling either CPU or storage up or down as needed. EMC Isilon has native integration with HDFS and can be leveraged by BDE during cluster deployments for storage. This lets you scale the storage appropriately, with the added bonus of using NFS or CIFS shares to ingest data into Hadoop. Compute nodes are VMs, and BDE has a scale-out feature that lets you add nodes on demand (currently you can’t reduce nodes) or scale up/down to add or remove CPU and memory resources in a Hadoop cluster. EMC has published a white paper on integrating BDE and Isilon:
https://community.emc.com/docs/DOC-26892
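A quick back-of-the-envelope comparison shows why that separation matters. The numbers below are hypothetical and only illustrate the utilization argument above: when storage capacity drives the node count, CPU sits mostly idle; when compute is sized independently, you only run the VMs you actually need.

```python
# Illustrative sizing arithmetic (made-up numbers, not BDE output).

def combined_nodes(data_tb, tb_per_node=12):
    """Nodes needed when storage capacity dictates cluster size."""
    return -(-data_tb // tb_per_node)        # ceiling division

def compute_vms(peak_busy_cores, cores_per_vm=8):
    """Compute-only VMs needed when actual CPU demand dictates sizing."""
    return -(-peak_busy_cores // cores_per_vm)

data_tb = 240          # raw data to store
peak_busy_cores = 64   # cores actually busy at peak (roughly the 10-15% figure)

combined = combined_nodes(data_tb)       # 20 nodes just to hold the data
separated = compute_vms(peak_busy_cores) # 8 compute VMs, storage scaled separately
print(f"combined: {combined} nodes; separated: {separated} compute VMs + shared storage")
```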
The word “commodity” is commonly used in the Hadoop world to describe the hardware clusters should run on. However, VMware has worked with the Apache community to develop Hadoop Virtualization Extensions (HVE). HVE makes Hadoop distributions virtualization-aware, improving failure handling and data locality in a virtual environment. HVE is a key component used by BDE to optimize Hadoop deployments on VMware.
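To illustrate the locality idea, Hadoop can already be told about network topology through a script referenced by the net.topology.script.file.name property; HVE’s contribution is an extra layer so that VMs sharing a physical host are recognized as such. The sketch below is a minimal, hypothetical topology script in that spirit — the mapping values and path layout are assumptions for illustration, not BDE’s actual configuration.

```python
#!/usr/bin/env python3
# Minimal sketch of a virtualization-aware topology script. Hadoop passes one
# or more addresses as arguments and expects one topology path per line.
# The /rack/physical-host layout below illustrates HVE's node-group idea:
# VMs on the same hypervisor resolve to the same path. All values are hypothetical.
import sys

TOPOLOGY = {
    "10.0.1.11": "/rack1/esxi-host-a",
    "10.0.1.12": "/rack1/esxi-host-a",   # same physical host as .11
    "10.0.1.21": "/rack1/esxi-host-b",
    "10.0.2.31": "/rack2/esxi-host-c",
}

DEFAULT = "/default-rack/default-nodegroup"

for addr in sys.argv[1:]:
    print(TOPOLOGY.get(addr, DEFAULT))
```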
My next series of blog posts will go into the details of getting BDE up and running with all the major distributions, and of integrating HDFS with Isilon.