According to the HDFS Architecture page, HDFS was designed for "streaming data access". I'm not sure what that means exactly, but I would guess it means that an operation like seek is either disabled or has sub-optimal performance. Is that correct?
I'm interested in using HDFS for storing audio/video files that need to be streamed to browser clients. Most of the streams will be start to finish, but some could have a high number of seeks.
Maybe there is another file system that could do this better?
HDFS stores data in large blocks, for example 64 MB. The idea is that you want your data laid out sequentially on your hard drive, reducing the number of seeks your hard drive has to do to read data.
In addition, HDFS is a user-space file system with a single central name node that holds an in-memory directory of where all of the blocks (and their replicas) are stored across the cluster. Files are expected to be large (say 1 GB or more) and are split up into several blocks. In order to read a file, the client code asks the name node for a list of blocks and then reads the blocks sequentially.
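To make the block lookup concrete, here is a minimal Python sketch of how a byte offset maps onto a block list. It does not use the Hadoop API at all: `Block`, `split_into_blocks`, and `locate` are hypothetical names made up for illustration, and the 64 MB block size is simply the figure mentioned above.

```python
from __future__ import annotations

from dataclasses import dataclass

BLOCK_SIZE = 64 * 1024 * 1024  # assumed 64 MB block size, as in the text


@dataclass
class Block:
    index: int   # position of the block within the file
    offset: int  # byte offset of the block's first byte in the file
    length: int  # bytes in this block (the last block may be shorter)


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[Block]:
    """Hypothetical stand-in for the block list a name node would return."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append(Block(len(blocks), offset, length))
        offset += length
    return blocks


def locate(blocks: list[Block], position: int) -> tuple[Block, int]:
    """Return the block holding `position` and the offset inside that block."""
    for block in blocks:
        if block.offset <= position < block.offset + block.length:
            return block, position - block.offset
    raise ValueError("position is past the end of the file")


blocks = split_into_blocks(1024 ** 3)            # a 1 GB file
block, inner = locate(blocks, 700 * 1024 ** 2)   # seek to the 700 MB mark
print(len(blocks), block.index, inner // 1024 ** 2)  # prints: 16 10 60
```

Under these assumptions, a seek only has to pick the right block and an offset inside it, so reading from the middle of a file does not require reading the preceding blocks first.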
The data is "streamed" off the hard drive by maintaining the maximum I/O rate that the drive can sustain for these large blocks of data.
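As a rough illustration of why large blocks keep the drive near its maximum rate, the sketch below computes the average throughput when every block costs one disk seek plus one sequential read. The 10 ms seek time and 100 MB/s transfer rate are assumed round numbers, not measurements, and `effective_throughput` is a hypothetical helper.

```python
MB = 1024 * 1024
SEEK_SECONDS = 0.010      # assumption: 10 ms average seek time
TRANSFER_RATE = 100 * MB  # assumption: 100 MB/s sustained sequential read


def effective_throughput(block_bytes, seek_seconds, transfer_bytes_per_second):
    """Average bytes/second when each block needs one seek plus one read."""
    transfer_seconds = block_bytes / transfer_bytes_per_second
    return block_bytes / (seek_seconds + transfer_seconds)


for size in (4 * 1024, 1 * MB, 64 * MB):
    rate = effective_throughput(size, SEEK_SECONDS, TRANSFER_RATE)
    print(f"{size / MB:8.3f} MB blocks -> {rate / MB:6.1f} MB/s")
```

With these assumed figures, 4 KB reads reach well under 1 MB/s, 1 MB blocks reach about half of the drive's rate, and 64 MB blocks get within a couple of percent of it, because the seek is amortized over a much longer read.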
Streaming just implies that it can offer you a constant bitrate above a certain threshold when transferring the data, as opposed to having the data arrive in bursts or waves.
Even if HDFS is laid out for streaming, it will probably still support seek, at the cost of some overhead for the caching it needs to keep the stream constant.
Of course, depending on system and network load, your seeks might take a bit longer.
On streaming data access, from Hadoop: The Definitive Guide, 3rd Edition: