In previous posts I showed how to set up VMware Big Data Extensions (BDE) and integrate it with Isilon for HDFS. This post shows how to use BDE to deploy Cloudera.
One of the benefits of VMware Big Data Extensions is the ability to configure, deploy, and run multiple Hadoop distributions from different vendors. When you deploy the Big Data Extensions vApp, the Apache Hadoop 1.2.1 distribution is included in the OVA that you download and deploy. You can add and configure other Hadoop distributions, like Cloudera's distribution version 4 (CDH4), using Yellowdog Updater, Modified (YUM). YUM is an open-source command-line package-management utility for Linux that provides automatic updates and package and dependency management on RPM-based distributions like CentOS. The Cloudera and Pivotal HD distributions require a YUM repository on the Serengeti vApp management server to host the RPMs for the Hadoop distribution.
This post shows how to set up Cloudera CDH4 for automated deployments by BDE. To set up CDH4 you must first set up a YUM repo. The repo holds the RPMs that are required to install CDH4. These RPMs and the associated GPG key can be found here:
http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/4.2.2
http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
VMware Big Data Extensions supports CDH4 version 2.x on Red Hat and derivatives, version 5.
After downloading the RPMs and creating a repo, a configuration script is used to configure the BDE automation. VMware Big Data Extensions uses a Ruby script called config-distro.rb, located in the /opt/serengeti/sbin directory on the Serengeti vApp management server. This script sets up the Chef manifests that are used to automate Hadoop cluster deployments. We run this utility and give it the correct distro information for the packages we want to deploy.
When the Serengeti vApp is deployed, a template VM is deployed along with the management server VM. The template VM runs CentOS 5 and is used to deploy all the nodes that make up a Hadoop cluster. The management VM uses Chef to deploy the packages to the template and configure it accordingly.
Log in to the management server using either PuTTY or the VMware console.
Install the createrepo utility. Type the following at the command line:
    yum install -y yum-utils createrepo
Change directory:
    cd /etc/yum.repos.d/
Create the repo file:
    touch cloudera-cdh4.repo
Edit the file:
    vim cloudera-cdh4.repo
Enter the following:
    [cloudera-cdh4]
    name=Cloudera's Distribution for Hadoop, Version 4
    baseurl=http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/4.2.2
    gpgkey=http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
    gpgcheck=1
Write the information to the file.
Create a local copy of the repo by syncing the RPMs from the Cloudera repository defined in the repo file:
    reposync -r cloudera-cdh4
This command retrieves the RPMs from the internet.
List the directory to show the newly created cloudera-cdh4 directory containing the RPMs:
    ls
Create a directory for the RPMs:
    mkdir -p /opt/serengeti/www/cdh/4
Move the downloaded RPMs to this directory:
    mv cloudera-cdh4/RPMS/ /opt/serengeti/www/cdh/4/
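The download and staging steps above can be collected into one script. This is only a sketch: the `run` wrapper, the `DRY_RUN` variable, and the function name `stage_cdh4_rpms` are illustrative additions, not part of BDE. By default the commands are printed rather than executed, so nothing changes on the system unless you set `DRY_RUN=0`. The commands and paths themselves are the ones used in the steps above.

```shell
#!/bin/sh
# Sketch: mirror the CDH4 RPMs and stage them under the Serengeti web root.
# DRY_RUN and run() are illustrative additions, not BDE features.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"      # print the command instead of running it
    else
        "$@"             # execute the command as given
    fi
}

stage_cdh4_rpms() {
    # reposync downloads into the current directory, so change into
    # /etc/yum.repos.d first, as in the manual steps.
    run cd /etc/yum.repos.d &&
    run reposync -r cloudera-cdh4 &&
    run mkdir -p /opt/serengeti/www/cdh/4 &&
    run mv cloudera-cdh4/RPMS/ /opt/serengeti/www/cdh/4/
}
```

Calling `stage_cdh4_rpms` as-is prints the four commands for review. Setting `DRY_RUN=0` before the call executes them, which needs root on the management server and the repo file from the earlier step.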
Change directory and look at the contents:
    cd /opt/serengeti/www/cdh/4
    ls
Create a Cloudera repo that is published by the Apache service:
    createrepo .
An ls shows that the repodata directory has been created.
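The same check can be scripted. createrepo writes its index to `repodata/repomd.xml`, so testing for that file confirms that the metadata was generated. A minimal sketch follows; the function name `check_local_repo` is hypothetical, and the default path is the one used in this post.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the directory looks like a YUM repo,
# i.e. createrepo has written repodata/repomd.xml into it.
check_local_repo() {
    repo_dir="${1:-/opt/serengeti/www/cdh/4}"
    if [ -f "$repo_dir/repodata/repomd.xml" ]; then
        echo "OK: metadata found in $repo_dir/repodata"
    else
        echo "MISSING: run 'createrepo .' in $repo_dir" >&2
        return 1
    fi
}
```

Run `check_local_repo` with no arguments to test the default directory, or pass another directory as the first argument.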
Create a repo file:
    vim cloudera-cdh4.repo
Enter the following into the file:
    [cloudera-cdh4]
    name=Cloudera's Distribution for Hadoop, Version 4
    baseurl=http://10.10.81.36/cdh/4/
    enabled=1
    gpgcheck=0
NOTE: The baseurl should use the IP address of the management server. Running ifconfig at the command line will give you this address. Save the file.
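Because the address in baseurl differs per environment, it can help to generate this file instead of typing it. The sketch below assumes a POSIX shell. The function name `print_local_cdh4_repo` is hypothetical, and the host is passed as a parameter so that either an IP address or an FQDN can be used.

```shell
#!/bin/sh
# Hypothetical helper: prints the repo definition served from the
# management server. $1 is the management server address (IP or FQDN).
print_local_cdh4_repo() {
    host="$1"
    if [ -z "$host" ]; then
        echo "usage: print_local_cdh4_repo <host>" >&2
        return 1
    fi
    cat <<EOF
[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://$host/cdh/4/
enabled=1
gpgcheck=0
EOF
}
```

For example, `print_local_cdh4_repo 10.10.81.36 > /opt/serengeti/www/cdh/4/cloudera-cdh4.repo` would write the file shown above.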
Open a browser and enter the URL:
    https://10.10.81.36/cdh/4/cloudera-cdh4.repo
You should see the contents of the repo file created in the last step.
Use the config-distro.rb command to create the correct settings for the Chef manifest:
    config-distro.rb --name Cloudera --vendor cdh --version 4.2.2 --repos http://10.10.81.36/cdh/4/cloudera-cdh4.repo
Change directory and run the cat command on the manifest file to check its contents:
    cd /opt/serengeti/www/distros
    cat manifest
The file should have the same settings as the screenshot on the left. The file will also contain settings for Apache and any other distributions you have set up.
Change directory and edit the map file:
    cd /opt/serengeti/www/specs
    vim map
Scroll through the file until you find the “CDH” section. Verify that the version number matches the version you downloaded and set the repo up with. Close the file without saving.
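Since this step only reads the file, grep can locate the section without opening an editor, which removes any risk of saving an accidental change. A small sketch; the function name `show_cdh_entries` is hypothetical, and it makes no assumption about the map file's format beyond the section mentioning "CDH".

```shell
#!/bin/sh
# Hypothetical helper: list every line that mentions CDH
# (case-insensitive), prefixed with its line number.
show_cdh_entries() {
    file="${1:-/opt/serengeti/www/specs/map}"
    grep -n -i 'cdh' "$file"
}
```

Run `show_cdh_entries` on the management server and compare the version shown with the one used for the repo.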
Restart the tomcat service:
    service tomcat restart
In the VMware web client, go to the Big Data Extensions tab and click on Hadoop Distributions. You should see that the Cloudera distribution version 4.2.2 is now ready. This verifies the contents of the manifest file.
Click on the Big Data Clusters tab and select deploy cluster. Under the Hadoop distribution drop-down, select Cloudera. All deployment types should be available. This verifies the contents of the map file.
To download the EMC Hadoop Starter Kit, which shows how to configure BDE with Isilon for HDFS and how to configure Cloudera, go to the following link: https://community.emc.com/docs/DOC-26892
Thanks for this article! After following the instructions we now have a Hadoop cluster up and running and it only took about 15 minutes to create.
We did run into a few issues, but they were easily fixed.
First, if you use cut and paste to run the commands, the - preceding command line arguments doesn't paste as a -, so we had to watch out for that.
Second, several times the commands use the directory 'Serengeti', with a capital 'S' when the actual directory on the management server has a lower case 's'.
Lastly, in both the /opt/serengeti/www/cdh/4/cloudera-cdh4.repo and /opt/serengeti/www/distros/manifest files, we could not use the IP address of the management node. This caused the cluster creation to fail with a 'hostname does not match the server certificate' SSL error. After adding a DNS entry for it, and updating these two files with the management server's FQDN, everything was created successfully.
Thanks again!
Posted by: Anthony | 04/30/2014 at 05:43 PM
Great article. Thanks.
Has anyone tried this with CDH 5.0? I suspect it doesn't work, but it would be great to know whether it's worth trying, if anyone has had any success deploying CDH 5.0 via BDE.
Posted by: Carl | 05/12/2014 at 05:04 AM
Thanks Carl. I believe that BDE 2.0 will support CDH5. We will have to wait for the official announcement from VMware in June.
Posted by: James | 05/14/2014 at 02:17 PM