We have a few VMs on a Windows Server 2008 (Hyper-V) host and are having a problem with routing between them.
The setup is that the Hyper-V server runs RRAS and maps IPs on its NIC to internal IPs (192.168.1.X) that the VMs use. The VMs use the Hyper-V server as their gateway for outbound traffic. The reason for this setup is that our ISP assigns IPs by MAC address, so otherwise the VMs couldn't use the external IPs assigned to the server.
The issue is that the VMs cannot talk to each other using their external IP addresses. For example, if Server A is 4.2.2.1 (external IP)/192.168.1.1 (internal IP) and Server B is 4.2.2.2 (external IP)/192.168.1.2 (internal IP), you cannot ping 4.2.2.2 from 4.2.2.1. You CAN ping 192.168.1.1 from 192.168.1.2. We also have a Server C that is 4.2.3.1 (a different subnet), and that machine has no problem pinging Server A or Server B. So essentially, unless the machines are on separate subnets, they can't talk to each other.
The reason we don't just use 192.168.1.X to communicate is that for this particular purpose we are setting up a monitoring server. This monitoring server will use an FQDN (like servera.myservers.net) to try to ping Server B, so that we can tell whether a failure is a DNS failure or something else.
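To make the distinction concrete, here is a minimal sketch of the kind of check meant here: resolve the name first, then ping the resulting address, and report which step failed. The function names (`check_host`, `ping_once`) and the three result strings are made up for illustration and are not taken from any particular monitoring product.

```python
import socket
import subprocess
import sys


def ping_once(address):
    """Send a single ICMP echo using the system ping command.

    Assumption: exit code 0 means a reply was received. On Windows, ping
    can also exit with 0 for some "unreachable" responses, so treat this
    as a rough check only.
    """
    count_flag = "-n" if sys.platform.startswith("win") else "-c"
    result = subprocess.run(
        ["ping", count_flag, "1", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_host(fqdn, resolve=socket.gethostbyname, ping=ping_once):
    """Return 'dns_failure', 'unreachable', or 'ok' for the given name."""
    try:
        address = resolve(fqdn)
    except OSError:  # socket.gaierror is a subclass of OSError
        return "dns_failure"
    return "ok" if ping(address) else "unreachable"
```

The resolver and the ping function are passed in as parameters so that each failure mode can be exercised without touching the network.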
One weird thing is that if you do a tracert from Server A to Server C, you get a timeout for the first two attempts, and then a connection, but you don't see it going through a gateway.
I believe that the Microsoft NAT implementation suffers from an artifact that many NAT implementations do (older Cisco PIX OS, Linux ipchains, the precursor to iptables) in that NAT only occurs on traffic arriving on the "public" interface. Traffic that arrives on the internal interface bound for one of the public addresses is not translated. The Cisco-ism for handling that case is "hairpinning" (I guess because the packet makes a "hairpin turn" and leaves through the interface it entered on).
Here's an analogous problem:
A customer has a Cisco PIX at the edge of their network doing NAT between a public static IP address and their LAN. They've got an HTTP server on the LAN at 192.168.1.1, and their public IP is 172.18.9.1. A request from a browser running on a PC on the LAN to "http://172.18.9.1" returns "The page could not be displayed" because the PIX NAT implementation does not NAT the traffic arriving on the internal interface bound for 172.18.9.1 to 192.168.1.1.
Here's a Server Fault question that also describes the behaviour I'm talking about (albeit, again, not specifically citing Microsoft's NAT implementation): "Unable to connect on natted server from a host computer on the same LAN using public IP address"
I believe you're seeing a similar behaviour with Microsoft's NAT implementation, but I don't have hard evidence (i.e. documentation from Microsoft). I don't have the resources at hand to spin up a test machine, and Microsoft doesn't seem to use the "hairpin" keyword in their documentation, so I can't confirm it one way or the other.
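To illustrate what I suspect is happening, here is a toy model in Python of a static NAT box, using the addresses from your question. It is only a sketch of the suspected behaviour, not a description of how RRAS works internally: the `translate` function, the `hairpin` switch, and the interface names are all invented for the example.

```python
from dataclasses import dataclass

# Static one-to-one mappings from the question (public IP -> private IP).
STATIC_NAT = {"4.2.2.1": "192.168.1.1", "4.2.2.2": "192.168.1.2"}


@dataclass
class Packet:
    src: str
    dst: str


def translate(packet, in_interface, hairpin=False):
    """Return the packet as the NAT box would forward it, or None if it
    is not delivered. Hypothetical model, not RRAS internals."""
    private_to_public = {v: k for k, v in STATIC_NAT.items()}
    if in_interface == "public":
        # Inbound traffic: rewrite the public destination to the private one.
        private_dst = STATIC_NAT.get(packet.dst)
        return Packet(packet.src, private_dst) if private_dst else None
    # Traffic arriving on the private interface.
    public_src = private_to_public.get(packet.src, packet.src)
    if packet.dst in STATIC_NAT:
        if not hairpin:
            # No destination NAT on the private side: the packet never
            # reaches the VM that owns the public address.
            return None
        # Hairpin: rewrite both ends and send it back out the same interface.
        return Packet(public_src, STATIC_NAT[packet.dst])
    # Ordinary outbound traffic: rewrite the source only.
    return Packet(public_src, packet.dst)
```

In this model, Server C (4.2.3.1) reaches Server B because its packet arrives on the public interface, while Server A's packet to 4.2.2.2 arrives on the private interface and is only delivered when `hairpin=True`.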
(I actually find it rather funny that, in the Server Fault question I reference above, people consider a lack of "hairpinning" to be "normal". Linux iptables would handle what you're doing with no problem. I've always considered NAT implementations that can't handle this "hairpinning" to be inferior.)