This is a software design question
I used to work with the following rule of thumb for speed:
cache memory > memory > disk > network
with each step being 5-10 times slower than the one before it (e.g. cache memory is 10 times faster than main memory).
Now it seems that gigabit Ethernet has lower latency than local disk, so reads from a large remote in-memory DB may be faster than local disk reads. This feels like heresy to an old-timer like me. (I just spent some time building a local on-disk cache to avoid network round trips, hence my question.)
Does anybody have any experience / numbers / advice in this area?
And yes, I know that the only real way to find out is to build and measure, but I was wondering about the general rule.
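For reference, here is the kind of minimal build-and-measure sketch I mean (Python). The file path, host and port are placeholders, and a single buffered read may be served from the OS page cache rather than the disk, so treat it as a starting point, not a benchmark:

```python
import socket
import time

FILE_PATH = "/tmp/testfile"          # hypothetical local file to read
HOST, PORT = "db.example.com", 5432  # hypothetical remote in-memory DB

def time_disk_read(path, size=4096):
    """Time one small read from the local disk (may hit the OS page cache)."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        f.read(size)
    return time.perf_counter() - start

def time_network_round_trip(host, port):
    """Time one TCP connect as a rough proxy for a network round trip."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass
    return time.perf_counter() - start

print(f"disk read:   {time_disk_read(FILE_PATH) * 1e6:.0f} us")
print(f"network RTT: {time_network_round_trip(HOST, PORT) * 1e6:.0f} us")
```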
edit:
This is the interesting data from the top answer:
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
This is a shock to me; my mental model was that a network round trip is inherently slow. And it's not: it's 20x faster than a disk 'round trip' (500,000 ns vs. 10,000,000 ns).
Jeff Atwood posted this very good blog post on the topic: http://blog.codinghorror.com/the-infinite-space-between-words/
Here are some numbers that you are probably looking for, as quoted by Jeff Dean, a Google Fellow.
They're from his presentation titled Designs, Lessons and Advice from Building Large Distributed Systems, given at Large-Scale Distributed Systems and Middleware (LADIS) 2009.
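To see what those figures imply for the remote in-memory DB scenario in the question, here is a back-of-envelope sketch using only the two numbers quoted in the question's edit; the lookup count is arbitrary, and it ignores batching, pipelining and the OS page cache:

```python
ROUND_TRIP_NS = 500_000      # round trip within same datacenter
DISK_SEEK_NS = 10_000_000    # one disk seek

lookups = 1_000  # hypothetical number of random point lookups

remote_ms = lookups * ROUND_TRIP_NS / 1e6
disk_ms = lookups * DISK_SEEK_NS / 1e6
print(f"{lookups} remote in-memory lookups: {remote_ms:.0f} ms")  # 500 ms
print(f"{lookups} random disk seeks:       {disk_ms:.0f} ms")     # 10000 ms
```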
Other Info
It's said that gcc -O4 emails your code to Jeff Dean for a rewrite.
There are a lot of variables when it comes to network vs. disk, but in general, disk is faster.
The SATA 3.0 and SAS buses run at 6 Gbps, vs. a network's 1 Gbps minus protocol overhead. With RAID-10 15k-RPM SAS drives, the network is going to seem dog slow. In addition, you have the disk cache and the possibility of solid-state drives, which, depending on the scenario, could also increase speed. Random vs. sequential data access plays a factor, as does the block size in which data is transferred. That all depends on the application being used to access the disk.
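To put those bus figures side by side, here is a rough back-of-envelope on usable sequential throughput; the overhead factors (8b/10b encoding on SATA, TCP/IP plus Ethernet framing on a 1500-byte MTU) are approximations, not measurements:

```python
SATA3_GBPS = 6.0
GBE_GBPS = 1.0

sata_mb_s = SATA3_GBPS * 1e9 * 0.8 / 8 / 1e6   # 8b/10b encoding -> ~600 MB/s
gbe_mb_s = GBE_GBPS * 1e9 * 0.94 / 8 / 1e6     # framing overhead -> ~117 MB/s

print(f"SATA 3.0 usable:         ~{sata_mb_s:.0f} MB/s")
print(f"Gigabit Ethernet usable: ~{gbe_mb_s:.0f} MB/s")
```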
Now, I have not even touched on the fact that whatever you are transporting over the network is going to, or coming from, a disk anyway. So, again, disk is faster.
Well, that depends on whether the network resource has the data you are requesting readily available (in memory or similar) or whether it would just, in turn, read it from a disk.
In any case, network throughput may be higher in some cases, but I believe network latency will be higher.
My experience with this is that when you're on a 1 Gbit connection and you try to download a file, your hard disk is usually the bottleneck. One thing you have to keep in mind, though, is that you have to set up a connection first, which also takes time. So for sending big chunks of data, the network might actually be faster than the disk.
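To illustrate how that setup cost gets amortized, here is a toy calculation; the setup time and both throughput figures are assumptions picked purely for illustration:

```python
SETUP_S = 0.050    # assumed connection setup time (handshakes, etc.)
NET_MB_S = 117.0   # assumed usable gigabit Ethernet throughput
DISK_MB_S = 80.0   # assumed sustained throughput of an older hard disk

def network_time(size_mb):
    return SETUP_S + size_mb / NET_MB_S

def disk_time(size_mb):
    return size_mb / DISK_MB_S

# Small transfers are dominated by setup; big chunks favour the network.
for size in (1, 100, 10_000):  # MB
    print(f"{size:>6} MB  net {network_time(size):8.2f} s"
          f"  disk {disk_time(size):8.2f} s")
```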
IMX the disk is still faster. The theoretical transfer rate of the network is high, but in practice you don't get close to that.
About two years ago I had hard-drive trouble on my laptop and DMA went out. This made the hard drive dramatically slower, and in particular slower than the network. But when I switched to another computer I was back to my original state of the HDD being faster than the Internet.
My experience with gigabit networks is that, given the right server, you can beat local performance in terms of both throughput and latency. See Network Tests: Are We Getting Gigabit Performance?
For all practical purposes I would recommend treating network & local storage as equivalent and only use memory caches.
The standard caveat you mentioned holds: there are no general rules. Most of the time you should be working with well-configured servers and using metrics to evaluate the best method of data transfer.
If you are using a low end machine with a slow hard drive then it will almost certainly be quicker to use a gigabit network connection to a server with a fast storage array.
Equally, if you are working with two machines of near-identical hardware, then latency and network overhead make local storage quicker; it's common sense, really.
It depends. If your I/O is primarily random access, then the disk's effective throughput is probably not that great compared to the network bandwidth that could be available. However, most network traffic is ultimately generated by processes that involve I/O. If the working set of whatever process is generating the network traffic fits into cache, then it won't be constrained by disk bandwidth. If it thrashes the cache, then the disk will become a bottleneck.
I work on data warehouse systems, and the canonical DW query is a table scan. If your query hits more than a few percent of the rows in the fact table (or partition) then a table or partition scan using sequential I/O will be more efficient than a random access query plan using index lookups and seeks.
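As a sketch of why, here is a break-even calculation under deliberately naive assumptions (one full seek per index lookup, no caching); real plans benefit from caching and clustered reads, which pushes the break-even up toward the few-percent range:

```python
SEEK_S = 0.010      # one random seek (matches the 10,000,000 ns figure above)
SCAN_MB_S = 100.0   # assumed sequential scan throughput

table_mb = 10_000        # hypothetical 10 GB fact table
rows = 100_000_000       # hypothetical row count

scan_s = table_mb / SCAN_MB_S        # full table scan: 100 s
break_even_rows = scan_s / SEEK_S    # seeks you can afford in that time

print(f"full scan: {scan_s:.0f} s")
print(f"index plan loses past {break_even_rows:,.0f} rows "
      f"({break_even_rows / rows:.3%} of the table)")
```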
Networked storage (i.e. SANs) tends not to perform well on streaming workloads unless it is tuned appropriately. If the SAN is being used for a general-purpose consolidation environment, it will almost certainly be tuned quite sub-optimally for a streaming, spiky load like a data warehouse. I have seen a vendor white paper suggest that you need about 3x the number of disks to get the same throughput on a SAN that is not tuned for streaming I/O as on one that is.
My experience tallies with that. In fact, I have never deployed a data warehouse onto a consolidation environment where I could not run the same ETL process significantly quicker on my desktop PC. I've also had sales reps from a major vendor of SAN equipment say off the record that a lot of their customers use direct attach storage for the DW system because SANs aren't fast enough.
Networked storage is at least an order of magnitude more expensive per IOPS than direct attach storage for random access workloads and closer to two orders of magnitude more expensive for streaming.
Yes, in general, networks are now faster than hard drives, but this may change over time.
I think, therefore I am
When an application is running locally, the host machine already has everything it needs, whereas working over a network requires a common protocol, checking for peer availability, channel security... and if the peers use different platforms, it's harder to achieve what you can do on a single machine.
I prefer to look at this in terms of trade-offs rather than who is the strongest...
You have to describe an exact use case for this comparison. Hard drives have seek time + transfer rate and cache. Networks have latency, transfer rate and protocol overhead...
I think your original cache memory > memory > disk > network ordering still holds true in general, though.
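As a sketch of that trade-off, here is a toy cost model in which each side is a fixed latency plus size divided by transfer rate; all four figures are illustrative assumptions (a 10 ms seek, a 0.5 ms datacenter round trip, roughly 100 MB/s either way):

```python
def disk_read_s(size_mb, seek_s=0.010, rate_mb_s=100.0):
    """One disk read: seek, then sequential transfer."""
    return seek_s + size_mb / rate_mb_s

def network_read_s(size_mb, rtt_s=0.0005, rate_mb_s=117.0):
    """One network read: round trip, then transfer."""
    return rtt_s + size_mb / rate_mb_s

# Small random reads favour the network; large reads are close to a wash.
for size in (0.004, 1, 100):  # MB
    print(f"{size:>7} MB  disk {disk_read_s(size) * 1000:9.2f} ms"
          f"  network {network_read_s(size) * 1000:9.2f} ms")
```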
The disk is connected to the CPU via a SCSI, SAS or IDE bus, which is an internal network running a specific protocol such as SCSI or ATAPI. Ethernet is designed to work over longer distances and can be much slower than SAS/SCSI/IDE. So which one is faster depends on which technologies you are comparing. If you compare a 20-year-old laptop HDD with 10 Gbps in-RAM storage, the winner will always be the networking. And when you buy storage, you have to weigh it against price and manageability.