
Question

Van Gale

Asked: 2009-07-15 02:13:54 +0800 CST

What is meant by "streaming data access" in HDFS?


According to the HDFS Architecture page, HDFS was designed for "streaming data access". I'm not sure exactly what that means, but I would guess it means that an operation like seek is either disabled or performs poorly. Is that correct?

I'm interested in using HDFS for storing audio/video files that need to be streamed to browser clients. Most of the streams will be start to finish, but some could have a high number of seeks.

Maybe there is another file system that could do this better?

3 Answers


David Buttler · Answer 1 · 2010-05-28T16:13:24+08:00


HDFS stores data in large blocks, on the order of 64 MB. The idea is that you want your data laid out sequentially on your hard drive, reducing the number of seeks the drive has to do to read it.

In addition, HDFS is a user-space file system with a single central name node that holds an in-memory directory of where all of the blocks (and their replicas) are stored across the cluster. Files are expected to be large (say 1 GB or more) and are split into several blocks. To read a file, the client code asks the name node for the list of blocks and then reads the blocks sequentially.
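The read path described above can be sketched with a toy in-memory model. This is not the HDFS API: `ToyNameNode`, `ToyCluster`, the 4-byte `BLOCK_SIZE`, and the file path are all hypothetical names and values chosen for illustration, and replicas are left out.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # bytes; deliberately tiny (the answer talks about ~64 MB)


@dataclass
class ToyNameNode:
    """Hypothetical stand-in for the name node: path -> ordered block ids."""
    block_map: dict = field(default_factory=dict)

    def get_block_list(self, path):
        return self.block_map[path]


@dataclass
class ToyCluster:
    """Hypothetical stand-in for the data nodes: block id -> bytes."""
    name_node: ToyNameNode = field(default_factory=ToyNameNode)
    blocks: dict = field(default_factory=dict)

    def write(self, path, data):
        # Split the file into fixed-size blocks and register them in order.
        ids = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = f"{path}#{offset // BLOCK_SIZE}"
            self.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            ids.append(block_id)
        self.name_node.block_map[path] = ids

    def read(self, path):
        # Ask the name node for the block list, then read the blocks in order.
        for block_id in self.name_node.get_block_list(path):
            yield self.blocks[block_id]


cluster = ToyCluster()
cluster.write("/audio/track.bin", b"0123456789")
chunks = list(cluster.read("/audio/track.bin"))
print(chunks)
```

The point of the sketch is only the division of labour: the name node answers "which blocks, in what order", and the bulk data is then read block by block.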

The data is "streamed" off the hard drive by maintaining the maximum I/O rate that the drive can sustain for these large blocks of data.
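A rough back-of-the-envelope calculation shows why large blocks help sustain the drive's I/O rate. The 10 ms seek time and 100 MB/s transfer rate below are assumed round numbers, not measurements of any particular drive, and `effective_throughput` is a hypothetical helper.

```python
def effective_throughput(block_mb, seek_ms=10.0, transfer_mb_s=100.0):
    """MB/s achieved when every block costs one seek plus its transfer time."""
    transfer_s = block_mb / transfer_mb_s
    return block_mb / (seek_ms / 1000.0 + transfer_s)


# Compare a 4 KB block, a 1 MB block, and a 64 MB block.
for block_mb in (0.004, 1, 64):
    rate = effective_throughput(block_mb)
    print(f"{block_mb:>7} MB blocks -> {rate:6.2f} MB/s")
```

Under these assumptions, 1 MB blocks reach only half of the drive's sequential rate, while 64 MB blocks get within about 2% of it, because the single seek is amortized over far more data.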

3

towo · Answer 2 · 2009-07-15T03:29:28+08:00

Best Answer


Streaming just implies that it can offer you a constant bitrate above a certain threshold when transferring the data, as opposed to having the data arrive in bursts or waves.

If HDFS is laid out for streaming, it will probably still support seek, with some overhead for the caching it needs to keep the stream constant.

Of course, depending on system and network load, your seeks might take a bit longer.

2

Omer Faruk Celebi · Answer 3 · 2013-01-12T04:21:43+08:00


On streaming data access, from Hadoop: The Definitive Guide, 3rd Edition:

HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
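To put the last sentence of that quote in numbers, here is a small sketch comparing first-record latency with the time for a full scan. The 100 GB dataset, 100 MB/s throughput, and 2 s startup latency are assumed values for illustration, and `scan_profile` is a hypothetical helper.

```python
def scan_profile(dataset_gb, throughput_mb_s=100.0, first_record_latency_s=2.0):
    """Return (total seconds, fraction of that spent waiting for the first record)."""
    transfer_s = dataset_gb * 1024 / throughput_mb_s
    total_s = first_record_latency_s + transfer_s
    return total_s, first_record_latency_s / total_s


total, frac = scan_profile(100)
print(f"full scan: {total:.0f} s, first-record wait: {frac:.2%} of total")
```

With these numbers the wait for the first record is a fraction of a percent of the job, so for whole-dataset analyses sustained throughput dominates. For a browser client seeking around in a video, the balance is the opposite: each seek pays the startup latency again.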

0
