I am running Hadoop on a project and need a suggestion.
By default, Hadoop has a block size of around 64 MB.
There is also a general recommendation to avoid lots of small files.
I am currently ending up with very, very small files being written into HDFS, due to the application design of Flume.
The problem is that Hadoop <= 0.20 cannot append to files, so I end up with far too many files for my MapReduce jobs to run efficiently.
There must be a straightforward way to simply roll/merge roughly 100 files into one,
so that Hadoop is effectively reading one large file instead of 100 small ones.
Any suggestions?
Media6degrees has come up with a fairly good solution to combine small files in Hadoop. You can use their JAR straight out of the box. http://www.jointhegrid.com/hadoop_filecrush/index.jsp
Have you considered using Hadoop Archives? Think of them as tar files for HDFS. http://hadoop.apache.org/common/docs/r0.20.2/hadoop_archives.html
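If you go down that route, you build the archive with the hadoop archive command. A quick sketch, where the paths are placeholders (assume the small files sit under /user/flume/small-files):

    hadoop archive -archiveName logs.har -p /user/flume small-files /user/flume/archived
    hadoop fs -ls har:///user/flume/archived/logs.har/small-files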
What you need to do is write a trivial concatenator program with an identity mapper and one or just a few identity reducers. This program will allow you to concatenate your small files into a few large files to ease the load on Hadoop.
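A minimal sketch of such a job, using the old (0.20-era) mapred API with its built-in IdentityMapper/IdentityReducer; the class name, key/value types, and reducer count here are just assumptions to adapt to your data:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class Concatenate {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(Concatenate.class);
            conf.setJobName("concatenate-small-files");

            // Records pass through untouched; the reducers just rewrite
            // them into a handful of large output files.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setNumReduceTasks(1); // one reducer => one output file

            // Matches the default TextInputFormat (byte offset, line of text).
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

One caveat: the default TextOutputFormat will prefix each output line with its original byte offset (the map key), so if that matters for your data, swap the identity mapper for one that emits the line itself as the key and NullWritable as the value.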
This can be quite a task to schedule, and it wastes space, but it is necessary due to the design of HDFS. If HDFS were a first-class file system, this would be much easier to deal with.