We are using Lustre in a cluster with approximately 200 TB of storage, 12 Object Storage Targets (backed by a DDN storage system over QDR InfiniBand), and roughly 160 quad- and 8-core compute nodes. Most users of this system have no problems at all, but my jobs are I/O intensive. When I run an array job with 250-500 processes simultaneously pounding the file system, typically between 10 and 20 of my processes fail. The log files indicate that the load on the OSTs goes over 2 and that the Lustre client returns either bad data or failed read() calls.
Currently the only way we have of resolving this is to run fewer simultaneous jobs. That is unsatisfactory, because there is no way to know in advance whether my workload will be CPU-heavy or I/O-heavy. Besides, simply turning down the load isn't the way to run a supercomputer: we would like it to run slower under load, not produce incorrect answers.
I'd like to know how to configure Lustre so that clients block when the load on the OSTs goes too high, rather than having the clients get bad data.
How do I configure Lustre to make the clients block?
Have you thought of adding more OSSs and spreading out the OSTs? That should decrease the load. In that vein, what kind of I/O pattern are you using? Do you have many large files, and if so, are they striped? The default stripe count is 1, which means each file resides on only one OST; that can be changed per file (at creation) or per directory (for new files), as in the sketch below.
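For illustration, this is the usual way to inspect and set striping with lfs; the stripe count of 4 and the path /lustre/scratch/myjob are just placeholders for whatever fits your layout:

    # Show the current striping of an existing file or directory
    lfs getstripe /lustre/scratch/myjob

    # New files created in this directory will be striped across 4 OSTs
    lfs setstripe -c 4 /lustre/scratch/myjob

    # Create a new file striped across all available OSTs (-c -1);
    # the file must not already exist
    lfs setstripe -c -1 /lustre/scratch/myjob/bigfile.dat

Striping large shared files across several OSTs spreads the per-OST load, whereas lots of small files are usually better left at a stripe count of 1.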
You could also try increasing the timeouts in Lustre (lctl get_param / lctl set_param), namely:
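As a minimal sketch of that get_param/set_param workflow (the parameters shown, the global obd timeout and the adaptive-timeout bound, are only common examples, and the values are illustrative, not recommendations for your system):

    # Read the current values on a client
    lctl get_param timeout   # global obd timeout, in seconds
    lctl get_param at_max    # upper bound for adaptive timeouts

    # Raise them temporarily (illustrative values only)
    lctl set_param timeout=300
    lctl set_param at_max=600

Keep in mind that lctl set_param changes do not survive a remount; to make a setting persistent you would normally set it on the MGS (e.g. via lctl conf_param) rather than with set_param alone.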