I am running Hadoop on a project and need a suggestion.
By default, Hadoop has a block size of around 64 MB.
There is also a general recommendation to avoid lots of small files.
I am currently ending up with very, very small files being written into HDFS, due to the application design of Flume.
The problem is that Hadoop <= 0.20 cannot append to files, so I end up with far too many files for my MapReduce jobs to run efficiently.
There must be a straightforward way to simply roll/merge roughly 100 files into one,
so that Hadoop is effectively reading one large file instead of 100 small ones.
Any suggestions?
Media6degrees has come up with a fairly good solution to combine small files in Hadoop. You can use their JAR straight out of the box. http://www.jointhegrid.com/hadoop_filecrush/index.jsp
Have you considered using Hadoop Archives? Think of them as tar files for HDFS. http://hadoop.apache.org/common/docs/r0.20.2/hadoop_archives.html
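If you go down that route, you build the archive with the hadoop archive command. A quick sketch, where the paths are placeholders (assume the small files sit under /user/flume/small-files):

    hadoop archive -archiveName logs.har -p /user/flume small-files /user/flume/archived
    hadoop fs -ls har:///user/flume/archived/logs.har/small-files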
What you need to do is write a trivial concatenator program with an identity mapper and one or just a few identity reducers. This program will allow you to concatenate your small files into a few large files to ease the load on Hadoop.
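A minimal sketch of such a job, using the old (0.20-era) mapred API with its built-in IdentityMapper/IdentityReducer; the class name, key/value types, and reducer count here are just assumptions to adapt to your data:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class Concatenate {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(Concatenate.class);
            conf.setJobName("concatenate-small-files");

            // Records pass through untouched; the reducers just rewrite
            // them into a handful of large output files.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setNumReduceTasks(1); // one reducer => one output file

            // Matches the default TextInputFormat (byte offset, line of text).
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

One caveat: the default TextOutputFormat will prefix each output line with its original byte offset (the map key), so if that matters for your data, swap the identity mapper for one that emits the line itself as the key and NullWritable as the value.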
This can be quite a task to schedule, and it wastes space, but it is necessary due to the design of HDFS. If HDFS were a first-class file system, this would be much easier to deal with.