I have a large set of log files that I need to extract data from. Is it possible to use Flume to read these files and dump them into HDFS (or Cassandra, or another data store) that I can then query?
The documentation seems to suggest it's all live, event-based log processing. I'm wondering if I'm missing some obvious way to have Flume simply read and process static log files from a directory.
Yes, this is the standard use case for Flume.
The server with the log files will run a Flume node, and another (or potentially the same) server will run the Flume master. The nodes discover the master, and from the master you can execute commands like the following:
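For example, from the flume shell connected to the master, a one-node setup might be configured along these lines (the node name logserver-1, the file path, and the HDFS directory are all placeholders for illustration):

    exec config logserver-1 'text("/var/log/app/app.log")' 'collectorSink("hdfs://namenode/flume/applogs/", "applog")'

Here text(...) is a source that reads the whole file once, and collectorSink(...) writes the resulting events into an HDFS directory with the given file prefix.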
This creates a configuration that tells Flume how to access the file (it can tail the file or read it in its entirety; other source types are available) and where to put the data.
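If you instead want to follow a growing file, or sweep every file in a directory of static logs, you can swap out the source; a rough sketch using the same placeholder node and paths:

    exec config logserver-1 'tail("/var/log/app/app.log")' 'collectorSink("hdfs://namenode/flume/applogs/", "applog")'
    exec config logserver-1 'tailDir("/var/log/app/")' 'collectorSink("hdfs://namenode/flume/applogs/", "applog")'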
Then it is a matter of mapping the configuration onto a particular server:
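In the flume shell that amounts to mapping the logical node onto the physical host that holds the files, e.g. (the hostname is a placeholder):

    exec map host1.example.com logserver-1

Alternatively, if the node name you used in the config command is the machine's hostname itself, no explicit mapping step should be needed.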
There is more information in the Flume User Guide: http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html