I have a strange slowdown problem with a Firebird database. During the daily use of the database the clients experience significant slowdowns while the system still have lots of resources available. Some information about the environment:
- 64 bit Firebird 2.5.2 server running in SuperServer mode
- the database is running on a 64 bit Windows 2008 R2 server OS
- the server OS is running in a VMware 4.1 VM with 4 CPU cores and 16 GB of RAM
- the database size is around 37 GB and the number of concurrent connections to the database is around 150.
While observing the slowdowns:
- the CPU usage on the machine is between 40-60% without higher spikes and the load is nicely distributed among all 4 cores
- the memory usage of the server is around 4-6 GB and the rest of the memory is used as OS cache
- the disc queue lenghts almost never go over 0.3 with around 2-5 ms latency
- there is almost no network activity on the server.
Still, the slowdown seems to be linked to the general load on the server. During the night when no users are connected to the database/no background jobs are running a test query used for reference executes in 4-5 seconds while during the day when all the users are connected to the database executing the same reference query requires 60+ seconds to finish. It should be added though that the slowdown is general in nature, there are no specific queries that are slower while the server is under load, everything gets slower within the specific Firebird database. The server has other databases with a very low number of transactions executed daily and these other databases show no sign of slowdowns. I even created a copy of the live database experiencing slowdowns and executed the same query against both the originaly and the duplicate database - the original did execute the query slow and hte duplicate fast. The only difference between the original and the duplicate I know is the number of connected users/concurrent transactions.
As I found no evident reasons of all these in the available OS resources so I tried to fetch statistics from Firebird.
The observations:
- at peak times the databse has 30-40 transactions running in parallel according to mon$statements (where mon$state == 1 which according to the archives means the transactions are running or are waiting for a lock)
- fb_lock_print displays the following about the database:
LOCK_HEADER BLOCK
Version: 145, Active owner: 0, Length: 2097152, Used: 1335440 Flags: 0x0001 Enqs: 9993237, Converts: 93191, Rejects: 1417230, Blocks: 2 Deadlock scans: 0, Deadlocks: 0, Scan interval: 10 Acquires: 19972846, Acquire blocks: 0, Spin count: 0 Mutex wait: 0.0% Hash slots: 1009, Hash lengths (min/avg/max): 0/ 2/ 7 Remove node: 0, Insert queue: 0, Insert prior: 0 Owners (38): forward: 20824, backward: 872088 Free owners (126): forward: 973360, backward: 728016 Free locks (370): forward: 852200, backward: 195936 Free requests (12425): forward: 614608, backward: 1230536 Lock Ordering: Enabled
Here I noted that the "rejects" field accounts for ~14% of "enqs" field but unfortunately I don't know the exact meaning of these values. I guess about 14% of the lock requests are rejected for some reason but I might be completely wrong.
So the questions:
- How should the output of fb_lock_print interpreted in this case? Are these numbers "wrong" in some sense? Can they be improved by some parameter tuning?
- What additional steps should be taken to pinpoint what causes the slowdowns?