Briefly: if I have 5 Tb of data and want to deploy this on 5 cassandra servers - does each machine need to have 5 Tb of disk space for data (not counting log space)? From the docs it sounds like at times cassandra will need 2x the data size - so 10Tb / server or 10Tb total in the array?
How much RAM should each machine have? Assume that the 5Tb is all in the same column space. I had been planning to max out the RAM on each machine but I'm not sure that's enough. Do I need an array of servers with a total of 5Tb of RAM?
If you spread your 5 TB of data evenly across your 5 servers, each server will host 1 TB of data. Because of compaction (in the worst case, a compaction needs twice as much disk space as the data it compacts), each server will need 2 TB of disk space, which means 10 TB total in your cluster.
The case above assumes you store only a single replica of your data in the cluster. In that case, if a server fails, one fifth of your data becomes unreachable. If you want to store 2 replicas of your data in the cluster, each node will need 4 TB of disk space, which means 20 TB total in your cluster.
The suggested disk space per node for Cassandra performance is about half a terabyte, so unless you want to wait for extremely long compactions and very long map/reduce times, you should rethink how many machines are necessary.
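The arithmetic above can be sketched as a small helper. This is a hypothetical back-of-the-envelope function (the names and the 2x compaction headroom factor are taken from the reasoning above, not from any Cassandra tool):

```python
def disk_needed_per_node(total_data_tb, num_nodes,
                         replication_factor=1, compaction_headroom=2.0):
    """Worst-case disk space (TB) each node should have.

    compaction_headroom=2.0 reflects the worst case where a compaction
    needs twice as much disk space as the data it compacts.
    """
    data_per_node = total_data_tb * replication_factor / num_nodes
    return data_per_node * compaction_headroom

# 5 TB over 5 nodes, single copy: 1 TB of data -> 2 TB of disk per node
print(disk_needed_per_node(5, 5))                        # 2.0
# Two replicas: 2 TB of data -> 4 TB of disk per node (20 TB cluster-wide)
print(disk_needed_per_node(5, 5, replication_factor=2))  # 4.0
```

With the ~0.5 TB/node performance guideline, the same numbers also suggest you would want far more than 5 nodes for 5 TB of data.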
Keeping all of the data in RAM would require huge and expensive machines. Fortunately, in most applications you don't need to keep all your data in memory, only your live/active data.
Cassandra automatically retrieves data from disk into RAM when an entry is missing from memory. Conversely, records that are no longer accessed are evicted from memory ("cold" data). You can think of Cassandra as an application-level cache, where the entries are your rows. Cache hits/misses in this case correspond to records on the Cassandra node's disk which have to be brought back into memory.
So in terms of RAM sizing: you only need to keep enough data in memory to avoid unnecessary re-fetches from disk. How much that is depends heavily on the application. I would suggest running some benchmarks to see how many active sessions you get per day relative to the total number of sessions stored on the system. This works well if your system is read dominated and reads exhibit temporal locality.
Check also this thread for further inspiration https://stackoverflow.com/questions/4924978/cache-design-question
This ratio of live data to total data determines the RAM requirements for your system. Essentially, it's a tradeoff between Cassandra cache misses and RAM costs. Similar considerations apply, at a different level, to CPU cache design.
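The live-vs-total tradeoff above can be sketched the same way. This is a hypothetical helper, not a Cassandra formula: the active fraction must come from your own benchmarks, and the overhead factor is an assumed fudge for JVM, memtable, and index costs:

```python
def ram_needed_tb(total_data_tb, active_fraction, overhead=1.2):
    """RAM (TB) needed to keep the hot working set resident.

    active_fraction: share of data touched in a typical period
                     (measure this with benchmarks; it is app-dependent).
    overhead: assumed fudge factor for per-row memory overhead.
    """
    return total_data_tb * active_fraction * overhead

# If benchmarks show ~2% of 5 TB is active on a given day, roughly
# 0.12 TB of RAM across the cluster would cover the working set.
print(round(ram_needed_tb(5, 0.02), 2))  # 0.12
```

Raising `active_fraction` toward 1.0 recovers the expensive all-in-RAM case; a smaller hot set trades some cache misses for much cheaper machines.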