I'm trying to experiment with Hadoop Streaming, using the Cloudera distribution CDH3 on Ubuntu.
I have valid data in HDFS ready for processing.
I wrote a small streaming mapper in Python.
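For context, the mapper is just a stdin-to-stdout filter along these lines (a simplified sketch, not my exact script; the field handling is only illustrative):

    #!/usr/bin/env python
    # Minimal Hadoop Streaming mapper: read lines from stdin and emit
    # tab-separated key/value pairs on stdout (illustrative sketch only).
    import sys

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        # Use the first whitespace-separated field as the key and count 1.
        key = line.split()[0]
        print("%s\t%d" % (key, 1))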
When I launch a mapper-only job using:
    hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming*.jar -file /usr/src/mystuff/mapper.py -mapper /usr/src/mystuff/mapper.py -input /incoming/STBFlow/* -output testOP
Hadoop duly decides it will use 66 mappers on the cluster to process the data. The testOP directory is created on HDFS and a job_conf.xml file is created. But the JobTracker UI at port 50030 never shows the job moving out of the "pending" state, and nothing else happens; CPU usage stays at zero. (The job is created, though.)
If I give it a single file (instead of the entire directory) as input, I get the same result, except Hadoop decides it needs 2 mappers instead of 66.
I also tried launching jobs with the "dumbo" Python utility: same result, permanently pending.
So I am missing something basic: could someone help me out with what I should look for? The cluster is on Amazon EC2, so maybe firewall issues; ports are enabled explicitly, case by case, in the cluster's security group.
Sorry, I'm an idiot.
The jobs stayed "pending" forever because, although the DataNode processes in the cluster were running, the TaskTracker processes were not.
Due to a setup bug, the account the TaskTrackers ran under did not have write access to the local mapred cache directory, so they all terminated on startup. That left no nodes for the JobTracker to distribute tasks to.
Once this was corrected, jobs ran normally.
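In case it helps anyone else, what exposed the problem was simply checking on a worker node whether the TaskTracker JVM was actually running, and then reading its startup log. Roughly the steps below (the log path, the service name, the mapred user, and the cache directory reflect my CDH3 setup and its default mapred.local.dir; yours may differ):

    # On a worker node: is a TaskTracker process actually running?
    ps aux | grep -i [t]asktracker

    # If not, its startup log says why it died
    less /var/log/hadoop*/hadoop-*-tasktracker-*.log

    # In my case the fix was giving the mapred user write access to the
    # directory named by mapred.local.dir (default CDH3 location shown here)
    sudo chown -R mapred:hadoop /var/lib/hadoop-0.20/cache
    sudo service hadoop-0.20-tasktracker restart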