I'm looking for a way to zgrep hdfs files, something like:
hadoop fs -zcat hdfs://myfile.gz | grep "hi"
or
hadoop fs -cat hdfs://myfile.gz | zgrep "hi"
but neither really works for me. Is there any way to achieve this from the command line?
The following command line automatically finds the right decompressor for any simple text file and prints the uncompressed data to standard output:
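hadoop fs -text hdfs:///myfile.gz

To get the zgrep behavior you asked about, pipe it into grep (myfile.gz and "hi" stand in for your file and pattern):

hadoop fs -text hdfs:///myfile.gz | grep "hi"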
I have used this for .snappy and .gz files; it probably works for .lzo and .bz2 files as well.
This is an important feature because Hadoop uses a custom container format for Snappy files, so this is the only direct way to uncompress a Hadoop-created Snappy file. There is no standalone 'unsnappy' command-line tool like there is for the other compressors. I also don't know of any direct command that creates one; I've only created them as Hive table data.
Note: hadoop fs -text is single-threaded and runs the decompression on the machine where you run the command.

zless/zcat/zgrep are just shell wrappers that make gzip output the decompressed data to stdout. To do what you want, you'll have to write a wrapper around the hadoop fs commands, as sketched below.
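A minimal sketch of such a wrapper, assuming gzip-compressed files (the script name, "hdfs-zgrep", and the argument handling are illustrative, not an existing tool):

#!/bin/sh
# hdfs-zgrep: zgrep-like wrapper for HDFS files -- a minimal sketch.
# Usage: hdfs-zgrep PATTERN FILE...
pattern="$1"
shift
for file in "$@"; do
  # Stream the raw bytes out of HDFS, decompress them locally
  # (gzip -cdf also passes plain text through), then grep.
  hadoop fs -cat "$file" | gzip -cdf | grep -- "$pattern"
done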
Aside: The reason this probably didn't work for you is that you're missing an additional slash in your hdfs URI.
You wrote:
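hadoop fs -cat hdfs://myfile.gz | zgrep "hi"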
This attempts to contact a host or cluster called myfile.gz. What you really want is either hdfs:///myfile.gz or (assuming your config files are set up correctly) just myfile.gz, which the hadoop command resolves against the cluster/namenode defined by fs.defaultFS.
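For example, either of these corrected pipelines should then behave like zgrep (the filename and pattern are from your question; the second form assumes fs.defaultFS points at your cluster):

hadoop fs -cat hdfs:///myfile.gz | zgrep "hi"
hadoop fs -cat myfile.gz | zgrep "hi"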
The following works for me. I usually use HDFS FUSE mounts, so I can use almost any regular Unix command (some commands may not work, since HDFS is not a POSIX-compliant filesystem).
gunzip/zcat work just fine on HDFS FUSE mounts. It's faster to type, too :), and easier to read if, e.g., you want to script it.
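For example, with the cluster mounted at /mnt/hdfs (a hypothetical mount point; adjust the path to your setup), zgrep works directly:

zgrep "hi" /mnt/hdfs/path/to/myfile.gz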
To mount HDFS as a "regular" filesystem: http://www.cloudera.com/content/cloudera/en/documentation/cdh4/latest/CDH4-Installation-Guide/cdh4ig_topic_28.html