YouTube, as we know, is massive. It has thousands of concurrent users streaming at least 2 megabytes per video. Obviously, that adds up to a lot of traffic... far too much for any one server.
What networking technologies allow pushing 4 billion videos a day?
Scaling on the backend
In a very simple setup, one DNS entry points to one IP address, which belongs to one server. Everybody the world over goes to that single machine. With enough traffic, that's too much to handle long before you get to be YouTube's size. In a simple scenario, we add a load balancer. The job of the load balancer is to redirect traffic to various back-end servers while appearing to clients as a single server.
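To make that concrete, here is a minimal sketch of what a round-robin load balancer does; the server addresses and the pick_backend helper are hypothetical, and a real load balancer also tracks health, sessions, and load:

```python
import itertools

# Hypothetical pool of back-end servers sitting behind one public name/IP.
BACKENDS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

# Round-robin: each new request goes to the next server in the pool,
# so work is spread out while clients still see a single front end.
_rotation = itertools.cycle(BACKENDS)

def pick_backend() -> str:
    """Return the back-end server that should handle the next request."""
    return next(_rotation)

for request_id in range(5):
    print(f"request {request_id} -> {pick_backend()}")
```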
With as much data as YouTube has, it would be too much to expect all servers to be able to serve all videos, so we have another layer of indirection to add: sharding. In a contrived example, one server is responsible for everything that starts with "A", another owns "B", and so on.
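A toy version of that letter-based scheme might look like the following; the server names are made up, and real systems usually shard by hashing an ID rather than by its first letter:

```python
# Contrived first-letter shard map, matching the example above: one server
# "owns" every video whose ID starts with a given letter. Hostnames are made up.
SHARDS = {
    "A": "video-a.example.com",
    "B": "video-b.example.com",
    # ...one entry per letter in a full map
}

def shard_for(video_id: str) -> str:
    """Return the (hypothetical) server responsible for this video ID."""
    return SHARDS.get(video_id[0].upper(), "video-misc.example.com")

print(shard_for("Awesome_cat_video"))  # -> video-a.example.com
print(shard_for("Boring_unboxing"))    # -> video-b.example.com
```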
Moving the edge closer
Eventually, though, the bandwidth just becomes intense and you're moving a LOT of data into one room. So, now that we're super popular, we move it out of that room. The two technologies that matter here are Content Distribution Networks (CDNs) and anycasting.
Where I've got these big static files being requested all over the world, I stop pointing direct links at my hosting servers. What I do instead is put up a link to my CDN server. When somebody asks to view a video, they ask my CDN server for it. The CDN is responsible for either already having the video, asking the hosting server for a copy, or redirecting the client. That will vary based on the architecture of the network.
How is that CDN helpful? Well, one IP may actually belong to many servers that are in many places all over the world. When your request leaves your computer and goes to your ISP, their router maps the best path (shortest, quickest, least cost... whatever metric) to that IP. Often for a CDN, that will be on or next to your closest Tier 1 network.
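As a rough sketch of that "already have it, or go get it" behaviour, an edge node boils down to a cache sitting in front of the origin; everything named below is hypothetical:

```python
# Toy CDN edge: serve from the local cache if we already have the video,
# otherwise pull a copy from the origin (hosting) server first.
edge_cache: dict[str, bytes] = {}

def fetch_from_origin(video_id: str) -> bytes:
    """Stand-in for pulling the video from the hypothetical origin server."""
    print(f"cache miss: fetching {video_id} from origin.example.com")
    return b"...video bytes..."

def serve_from_edge(video_id: str) -> bytes:
    """Serve from this edge node, filling the cache from the origin on a miss."""
    if video_id not in edge_cache:
        edge_cache[video_id] = fetch_from_origin(video_id)
    return edge_cache[video_id]

serve_from_edge("some_video")  # first request: fetched from the origin
serve_from_edge("some_video")  # second request: answered from the edge cache
```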
So, I requested a video from YouTube. The actual machine it was stored on is at least iad09s12.v12.lscache8.c.youtube.com and tc.v19.cache5.c.youtube.com. Those hostnames show up in the source of the webpage I'm looking at and were provided by some form of indexing server. Now, from Maine I found that tc.v19 server to be in Miami, Florida. From Washington, I found the tc.v19 server to be in San Jose, California.

Several techniques are used for large sites.
www.youtube.com -> any number of IP addresses

Let's look in DNS: www.youtube.com could actually go to several IP addresses.
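You can check that for yourself with nothing more than the standard library; this is just a quick sketch, and the addresses you get back will vary by resolver, location, and time:

```python
import socket

# Ask the local resolver for all the addresses behind www.youtube.com.
# The exact answers differ per resolver and location; the point is only
# that a single name can map to several IP addresses.
infos = socket.getaddrinfo("www.youtube.com", 443, proto=socket.IPPROTO_TCP)
for address in sorted({info[4][0] for info in infos}):
    print(address)
```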
anycasted IP addresses
A single IP could be handled by any number of Autonomous Systems (a network on the Internet) simultaneously. For instance, many of the root DNS servers, as well as Google's 8.8.8.8 DNS server, are anycasted at many points around the globe. The idea is that if you're in the US, you hit the US network, and if you're in the UK, you hit the UK network.
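A toy model of that "closest copy wins" behaviour: the same prefix is announced from several sites, and each router simply prefers the announcement that looks cheapest to it. The sites, paths, and metric below are illustrative only, not real routing data:

```python
# Toy anycast model: one IP prefix, announced from several sites. A router
# prefers the announcement with the shortest AS path (a crude BGP-like metric).
announcements = {
    "us-east":  ["AS15169"],                      # one hop from this viewer
    "eu-west":  ["AS1299", "AS15169"],            # two hops
    "ap-south": ["AS6453", "AS4755", "AS15169"],  # three hops
}

def best_site(paths: dict[str, list[str]]) -> str:
    """Pick the announcing site with the shortest AS path from this viewpoint."""
    return min(paths, key=lambda site: len(paths[site]))

print(best_site(announcements))  # this hypothetical US viewer lands on "us-east"
```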
media coming from different servers

Just because you're on www.youtube.com, that doesn't mean that all the content has to come from the same server. Right on this site, static resources are served from sstatic.net instead of serverfault.com. For instance, if we watch Kaley Cuoco's Slave Leia PSA, we find that the media is served up by v10.lscache5.c.youtube.com.
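If you want to poke at that split yourself, resolving the page host and a media host side by side shows they are different machines; note that the old *.lscache*.c.youtube.com names quoted above may no longer resolve today:

```python
import socket

# The page and the media it embeds can live on entirely different servers.
# The media hostname below is the legacy one from this answer and may be gone.
for host in ("www.youtube.com", "v10.lscache5.c.youtube.com"):
    try:
        print(host, "->", socket.gethostbyname(host))
    except socket.gaierror:
        print(host, "-> (no longer resolves)")
```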
multiple internet connections

I assure you, YouTube has more than one internet connection. Notwithstanding all the other techniques, even if YouTube really were a single site with a single server, it could in theory have connections to every single other network to which it was serving video. In the real world that's not possible, of course, but consider the idea.
Any or all of these ideas (and more!) can be used to support a Content Delivery Network. Read up on that article if you'd like to know more.
You are wrong to imagine that YouTube (aka Google) has only one server; this infographic might help illustrate the scale of the system that backs that service.
Even if you only have one point of presence, you can absolutely have more than one server behind a single name, and even a single IP, using tools like load balancers.
Google, though, has an awful lot of points of presence, and uses tools like anycast - a technique to publish the same IP at multiple places on the Internet and have people routed to the closest server pool owning it - to back the infrastructure.
If you want to know more about large scale systems and the technologies these companies use, the best source now is http://highscalability.com
The biggest companies, like Google or Akamai, always have components they wrote themselves (for example, Akamai developed its own web server for its services).
I'll touch on the network side of things a bit: Google has a Point of Presence (PoP) in 73 unique datacenters around the world (not including their own). They are a member of 69 unique Internet exchanges. Google is in more datacenters and Internet exchange points than any other network listed on PeeringDB.
Google's total Internet exchange capacity is >1.5Tbps, and that 1.5Tbps is reserved for networks with >100Mbps of traffic with Google but less than (I'd guess) around 2-3Gbps. After you have 'sufficient volume', you are moved to private peering (PNI).
In addition to Internet Exchange peering and private peering (with AS15169), YouTube also operates a transit network, AS43515, and another network which I assume is for paid peering/overflow, AS36040. Google also operates Google Global Cache servers, for ISPs to deploy even more locally within their networks. (Data from PeeringDB and bgp.he.net.)
Based on my experience, I believe YouTube uses much more than just IP geolocation or anycast to choose a location to serve video from.
Google runs a huge global backbone network; they own dark fiber and have financed submarine cables. The volume of traffic YouTube generates is huge! I'd guess YouTube has a peak traffic volume of >12Tbps. Google represents at least 7% (and probably >10%) of all inter-domain Internet traffic.
So to actually answer your question, from a network perspective, in order to scale like YouTube you have to make a massive investment in your network - from the fiber in the ground to the WDM gear, and the routers. You have to get the content and the network as close as possible to your users. This usually means peering, IXs, and maybe a bit of transit. You have to be able to intelligently tell users where to get the content from to keep traffic as evenly distributed and cheap as possible. And of course, you have to have the massive server infrastructure to store, process, convert, and deliver 4 billion views a day!
If you are curious about the server side, I wrote a blog post which breaks down some of the recently released datacenter images.