Aditya Yadav & Associates

Architecting World Class Engineering Companies

CloudEase - Enterprise Hadoop + Hive + Pig + HBase + Mahout + Cascading + ZooKeeper et al.
 
 
<T.B.D.>
 
 
 

Nugget 1: Standalone Hadoop
 
Being able to start Hadoop in local (standalone) mode allows quick developer sanity checks before deploying to a Hadoop cluster or, worse, to AWS, where running Hadoop incurs high charges.
 
I assume you have downloaded the latest Hadoop distribution and unzipped it.
If you are on Windows, you need to install Cygwin to try this. Linux distributions need nothing additional.
 
Open the Cygwin Bash shell and go to the Hadoop home directory.
  1. Create a directory to hold the input files for the local Hadoop job> mkdir input
  2. Copy some files into the input directory> cp conf/* input
  3. Run the example MapReduce job that ships with the Hadoop distribution> bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
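
The three steps above can be wrapped in a small re-runnable script. This is only a sketch: the HADOOP_HOME variable and the function names are assumptions made for illustration and are not part of the Hadoop distribution. It also clears any previous 'output' directory first, because Hadoop refuses to start a job whose output directory already exists.

```shell
#!/usr/bin/env bash
# Sketch of steps 1-3 as a re-runnable script.
# HADOOP_HOME and the function names are assumptions, not Hadoop conventions
# you must follow. By default the current directory is treated as Hadoop home.
HADOOP_HOME="${HADOOP_HOME:-.}"

prepare_dirs() {
  # usage: prepare_dirs <conf_dir> <input_dir> <output_dir>
  mkdir -p "$2" || return 1          # step 1: create the input directory
  cp "$1"/* "$2"/ || return 1        # step 2: copy some files into it
  # Remove stale results: the job fails if the output directory exists.
  if [ -n "$3" ]; then
    rm -rf -- "$3"
  fi
}

run_grep_example() {
  # usage: run_grep_example <input_dir> <output_dir>
  # The jar glob is left unquoted so the shell expands it to the examples jar.
  "$HADOOP_HOME"/bin/hadoop jar "$HADOOP_HOME"/hadoop-*-examples.jar \
    grep "$1" "$2" 'dfs[a-z.]+'      # step 3: run the bundled grep example
}

if [ -x "$HADOOP_HOME/bin/hadoop" ]; then
  prepare_dirs "$HADOOP_HOME/conf" input output && run_grep_example input output
else
  echo "bin/hadoop not found under $HADOOP_HOME; set HADOOP_HOME first" >&2
fi
```

Run it from the Hadoop home directory, or set HADOOP_HOME to point at the unzipped distribution.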

 

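The Hadoop job runs and logs its progress to the console. The grep example counts how often each string matching the regular expression occurs in the input files.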
 
The job output is stored in the 'output' directory.
 
 
Cat the output file to the screen to see its contents> cat output/*
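
As a sanity check on the job's result, the same counts can be approximated with ordinary Unix tools. The sketch below is not how Hadoop computes them; the count_matches function is a hypothetical helper, and the order of entries with equal counts may differ from the Hadoop output. It also assumes the matched strings contain no whitespace, which holds for 'dfs[a-z.]+'.

```shell
#!/usr/bin/env bash
# Sketch: count each distinct regex match across the given files and print
# "count<TAB>match", highest count first. count_matches is an illustrative
# helper, not part of Hadoop.
count_matches() {
  # usage: count_matches <extended_regex> <file>...
  pattern="$1"
  shift
  grep -ohE -- "$pattern" "$@" |     # -o: print only the matched text, -h: no file names
    sort |                           # group identical matches together
    uniq -c |                        # prefix each distinct match with its count
    sort -k1,1nr -k2 |               # highest count first, ties ordered by match text
    awk '{ print $1 "\t" $2 }'       # normalise to count<TAB>match
}

# Only runs when an 'input' directory from the steps above is present.
if [ -d input ]; then
  count_matches 'dfs[a-z.]+' input/*
fi
```

Comparing this listing with the output of cat output/* should show the same strings and counts.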
 
 

Nugget 2: Standalone Cascading
 
Similar to running Hadoop locally, being able to build and run a Cascading job locally allows for quick sanity checks. We will run a Cascading job that parses Apache web server logs.
 
The source code download contains a 'jobs' directory, which should be placed at the same level as the hadoop and cascading directories. Directory names in the build script and commands may need to be adjusted to match your setup. Run the Ant build by typing 'ant' in the job folder (assuming Ant is on the path). It builds the Cascading job.
 
 
The build produces the Cascading job jar 'LocalCascadingJob.jar'. Launch it over Hadoop locally from the job directory (in the Cygwin Bash shell)> ../hadoop-0.20.2/bin/hadoop jar LocalCascadingJob.jar apachelogs output
 
View the results by cat'ing all the files in the output directory> cat output/*
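
The build, launch and cat steps can be chained in one script that first checks the prerequisites mentioned above. This is a rough sketch under assumptions: the HADOOP_BIN variable and the missing_prereqs function are illustrative names, and the default path simply mirrors the relative path used in the launch command, so adjust it to your directory layout.

```shell
#!/usr/bin/env bash
# Sketch: build and run the Cascading job from inside the job directory.
# HADOOP_BIN and missing_prereqs are illustrative names; the default path
# mirrors the launch command in the text and may differ on your machine.
HADOOP_BIN="${HADOOP_BIN:-../hadoop-0.20.2/bin/hadoop}"

missing_prereqs() {
  # usage: missing_prereqs <hadoop_launcher> <input_path>
  # Prints one line per missing prerequisite, nothing when all are present.
  command -v ant >/dev/null 2>&1 || echo "ant is not on the PATH"
  [ -x "$1" ] || echo "hadoop launcher not found: $1"
  [ -e "$2" ] || echo "input path not found: $2"
}

problems=$(missing_prereqs "$HADOOP_BIN" apachelogs)
if [ -n "$problems" ]; then
  # Report what is missing instead of failing halfway through the build.
  echo "$problems" >&2
else
  rm -rf output                      # Hadoop will not overwrite an existing output directory
  ant &&
    "$HADOOP_BIN" jar LocalCascadingJob.jar apachelogs output &&
    cat output/*
fi
```

Each step runs only if the previous one succeeded, so a failed build never launches a stale jar.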