My application has a SQL statement that takes ~30 minutes to run on the Oracle server in production. It takes about the same on a test Oracle server.
For some reason, in another Oracle server, it runs much faster: only 5 min or so!
These timings are really for the SQL only (no application processing overhead). I got them from Oracle Enterprise Manager. Moreover, they are consistent, i.e. if you rerun the SQL again, you get about the same timings.
Hardware, Oracle version (10g), and data are the same on all three servers. Even the SQL execution plans are exactly the same.
What makes the SQL run so much faster in that test environment?
The SQL is a MERGE:
MERGE /*+ USE_NL(DORMANT_POINTS) */
INTO MKT_CURVE_POINT DORMANT_POINTS
USING (SELECT
  -- big select
) ACTIVE_POINTS
ON (
  -- ..
)
WHEN MATCHED THEN
UPDATE
SET DORMANT_POINTS.ACTIVE_PARENT_PRICE = ACTIVE_POINTS.ACTIVE_PARENT_PRICE
WHERE DORMANT_POINTS.ACTIVE_PARENT_PRICE <>
      ACTIVE_POINTS.ACTIVE_PARENT_PRICE;
I suspect that this is something about caching. I noticed a high number of buffer gets as compared with physical reads in the server that runs the SQL fast. That ratio is lower in the server that runs it slowly.
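For reference, the buffer-get and physical-read figures I compared came from a query along these lines against `v$sql` on each server (a sketch; the `LIKE` filter on `sql_text` is just one illustrative way to locate the cursor):

```sql
-- Locate the cursor and compare logical vs. physical I/O per execution.
SELECT sql_id,
       executions,
       buffer_gets,
       disk_reads,
       ROUND(buffer_gets / NULLIF(executions, 0)) AS gets_per_exec,
       ROUND(disk_reads  / NULLIF(executions, 0)) AS reads_per_exec
FROM   v$sql
WHERE  sql_text LIKE 'MERGE /*+ USE_NL(DORMANT_POINTS) */%';
```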
What can explain this huge performance difference?
Things to consider:
Welcome to the world of performance tuning :-/
"Even the SQL execution plans are exactly the same."
Firstly, how do you know that? In 10g, you should be finding the cursor in v$sql, then using the sql_id in a call to DBMS_XPLAN.DISPLAY_CURSOR. That shows the execution plan actually used, as opposed to an EXPLAIN PLAN statement, which is a prediction about what plan might be used.
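In 10g that check looks something like this (a sketch; substitute the actual sql_id and child number you find in v$sql):

```sql
-- Step 1: find the cursor for the statement.
SELECT sql_id, child_number
FROM   v$sql
WHERE  sql_text LIKE 'MERGE /*+ USE_NL(DORMANT_POINTS) */%';

-- Step 2: display the plan that was actually used for that cursor.
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR('your_sql_id_here', 0));
```

Run this on both servers and diff the output; only then can you say the real plans are identical.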
"I noticed a high number of buffer gets as compared with physical reads in the server that runs the SQL fast"
Daft question, but is the data the same?
If the 'fast box' is doing less buffer gets than the 'slow box', then it is processing less data. If it is doing more buffer gets than the slow box, it is processing more data.
Don't concentrate on physical reads themselves. If the data and query plans are the same, then the same logical blocks are going to be processed in the same order. If they happen to be in the cache and don't require a physical read, it will run faster, but that's not really something you can control.
In theory, this situation may indicate that there is more memory for cache on the fast box than the slow boxes, but if the hardware is the same, that would imply that your prod and test instances have memory available at the hardware level but the database is configured not to use it, which is pretty unlikely. It is more likely that other stuff running on those boxes has forced different data blocks into the cache.
Concurrency, locking and latches may play a part. I assume the production server is doing something other than wait for this particular query?
Is the memory policy set to auto? Perhaps the production server has allocated its SGA pools differently.
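One way to compare is to dump the memory settings and current pool sizes on each server and diff them (a sketch; requires SELECT privileges on the v$ views):

```sql
-- Compare automatic-SGA settings across servers.
SELECT name, value
FROM   v$parameter
WHERE  name IN ('sga_target', 'sga_max_size', 'db_cache_size', 'shared_pool_size');

-- See how the SGA is actually carved up right now.
SELECT component, current_size
FROM   v$sga_dynamic_components;
```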
It's also a question of what data is actually in the shared pool and buffer cache. The test server may have more of the relevant data in the buffer, not being flushed out by other production queries.
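You can get a rough idea of how much of the MERGE target table is currently cached on each box (a sketch; v$bh exposes one row per cached block):

```sql
-- Count buffer-cache blocks belonging to the MERGE target table.
SELECT o.object_name,
       COUNT(*) AS cached_blocks
FROM   v$bh b
       JOIN dba_objects o ON o.data_object_id = b.objd
WHERE  o.object_name = 'MKT_CURVE_POINT'
GROUP  BY o.object_name;
```

A markedly higher block count on the fast box would support the warm-cache theory.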
Then there is hardware configuration. Something simple like write back cache could make a huge difference.
Fun as it is, let's not waste all our time on guesses. Trace the query with full timings and see what's really going on :)
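A minimal tracing session might look like this (a sketch; level 12 captures wait events and bind values, and tkprof is then run on the resulting trace file from the OS prompt):

```sql
-- Tag the trace file so it's easy to find, then enable extended SQL trace.
ALTER SESSION SET tracefile_identifier = 'merge_test';
ALTER SESSION SET events '10046 trace name context forever, level 12';

-- ... run the MERGE here ...

ALTER SESSION SET events '10046 trace name context off';
-- Then, at the OS prompt:
--   tkprof <tracefile>.trc merge_report.txt sys=no sort=exeela
```

Comparing the wait-event profile from the two servers will show where the extra 25 minutes actually go.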
Also, Oracle servers seem to adapt to expected load rather than current load, so if you utilize the servers differently you will get different caching behaviour. Also check the execution plans carefully: a slightly different step in the big select, like a skip scan instead of an index scan, might explain everything.
Your explain plan may be different.
Do this:
set lines 200
explain plan for
<your SQL statement>;
Your SQL will NOT execute; you will get the message
"Explained."
Then do this:
select * from table(dbms_xplan.display);
Do this in both DBs and compare. This is how Oracle accesses the tables; odds are they are different.
Make sure your tables are analyzed in both DBs, and that the indexes AND the data are the same.
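Gathering statistics and checking their freshness can be done along these lines (a sketch; the schema name MYSCHEMA is a placeholder):

```sql
-- Refresh optimizer statistics for the table and its indexes.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'MYSCHEMA',
    tabname => 'MKT_CURVE_POINT',
    cascade => TRUE);   -- cascade also gathers index statistics
END;
/

-- Check when the table was last analyzed on each server.
SELECT table_name, num_rows, last_analyzed
FROM   dba_tables
WHERE  table_name = 'MKT_CURVE_POINT';
```

If `last_analyzed` differs wildly between the servers, stale statistics are a prime suspect.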