I'm looking for a way to zgrep hdfs files, something like:
hadoop fs -zcat hdfs://myfile.gz | grep "hi"
or
hadoop fs -cat hdfs://myfile.gz | zgrep "hi"
but neither really works for me. Is there any way to achieve this from the command line?
The following command line automatically finds the right decompressor for any simple text file and prints the uncompressed data to standard output:
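hadoop fs -text hdfs:///myfile.gz

To get the zgrep behavior you asked about, pipe it into grep (myfile.gz and "hi" stand in for your file and pattern):

hadoop fs -text hdfs:///myfile.gz | grep "hi"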
I have used this for .snappy and .gz files; it probably works for .lzo and .bz2 files as well.
This is an important feature because Hadoop uses a custom container format for Snappy files, so this is the only direct way to uncompress a Hadoop-created Snappy file. There is no standalone 'unsnappy' command-line tool like there is for the other compressors. I also don't know of any direct command that creates one; I've only created them as Hive table data.
Note: hadoop fs -text is single-threaded and runs the decompression on the machine where you run the command.

zless/zcat/zgrep are just shell wrappers that make gzip output the decompressed data to stdout. To do what you want, you'll have to write a wrapper around the hadoop fs commands, as sketched below.
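A minimal sketch of such a wrapper, assuming gzip-compressed files (the script name, "hdfs-zgrep", and the argument handling are illustrative, not an existing tool):

#!/bin/sh
# hdfs-zgrep: zgrep-like wrapper for HDFS files -- a minimal sketch.
# Usage: hdfs-zgrep PATTERN FILE...
pattern="$1"
shift
for file in "$@"; do
  # Stream the raw bytes out of HDFS, decompress them locally
  # (gzip -cdf also passes plain text through), then grep.
  hadoop fs -cat "$file" | gzip -cdf | grep -- "$pattern"
done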
Aside: The reason this probably didn't work for you is that you're missing an additional slash in your hdfs URI.
You wrote:
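hadoop fs -cat hdfs://myfile.gz | zgrep "hi"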
This attempts to contact a host or cluster called myfile.gz. What you really want is either hdfs:///myfile.gz or (assuming your config files are set up correctly) just myfile.gz, which the hadoop command resolves against the cluster/namenode defined by fs.defaultFS.
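For example, either of these corrected pipelines should then behave like zgrep (the filename and pattern are from your question; the second form assumes fs.defaultFS points at your cluster):

hadoop fs -cat hdfs:///myfile.gz | zgrep "hi"
hadoop fs -cat myfile.gz | zgrep "hi"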
The following works for me. I usually use HDFS FUSE mounts, so I can use almost any regular Unix command (some commands may not work, since HDFS is not a POSIX-compliant filesystem).
gunzip/zcat work just fine on HDFS FUSE mounts. It's faster to type, too :), and easier to read if, e.g., you want to script it.
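For example, with the cluster mounted at /mnt/hdfs (a hypothetical mount point; adjust the path to your setup), zgrep works directly:

zgrep "hi" /mnt/hdfs/path/to/myfile.gz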
To mount HDFS as a "regular" filesystem: http://www.cloudera.com/content/cloudera/en/documentation/cdh4/latest/CDH4-Installation-Guide/cdh4ig_topic_28.html