I am configuring database mirroring over a WAN link between two SQL 2008 named instances who's host servers are not domain members, using certificates for authentication. After many attempts to get this working by myself I have started from scratch and went step-by-step according to BOL http://technet.microsoft.com/en-us/library/ms191140.aspx, however the issue I was trying to resolve is still present.
The issue is on the set of final steps which sets the partner status on each server, when I perform step #2 to set the partner status on "HOST_A", I get the following error:
Msg 1418, Level 16, State 1, Line 2
The server network address "TCP://server-b.our-domain.com:5022" can not be reached or does not exist. Check the network address name and that the ports for the local and remote endpoints are operational.
The interesting thing however is that I can see traffic on the firewall (TCPDUMP) going back and forth between the two servers for about 15 seconds before that error gets spit back at me.
At this point I am not sure how to proceed because I can connect to the SERVER-A\BLUE instance from SSMS on SERVER-B and I can connect to the SERVER-B\RED instance from SSMS on SERVER-A without a problem. I am very confused why I am getting the error at this point in time. The endpoints on both sides are listed as started in sys.tcp_endpoints and sys.endpoints.
Another interesting note is that before attempting step 2, I can telnet from SERVER-A to SERVER-B over 5022 and from SERVER-B to SERVER-A over 5022, however after step 2 fails, I can no longer telnet either direction. TCPDUMP will show traffic going from either to the other, but there is no return traffic after step 2 fails.
The main issue for me is that this error seems to have the wrong description for whatever is actually happening since clearly the network address exists and can be reached and the endpoints are operational as well (at least until the operation fails [Rolleyes] ) I have also tried doing the config in the opposite direction (doing a full backup/restore with no recovery etc. going the other direction) and it fails the exact same way providing the same errors, but again with all of the traffic showing on the firewall.
Lastly, in SQL logs I also get the error "Error: 1443, Severity: 16, State: 2." Which seems to be directly related, and some of what I have found online suggests an issue with windows authentication, however that should not be the case since my endpoints are configured with certificates.
Any help with this would be greatly appreciated.
Here is the actual T-SQL used for setting this up, which follows what is in the BOL article.
--ON SERVER-A\BLUE
use master
go
create master key encryption by password = 'password123!'
go
create certificate CA_cert
With subject = 'CA_cert Certificate'
go
create endpoint Mirroring
STATE = STARTED
AS TCP (
LISTENER_PORT=5022
, LISTENER_IP = ALL
)
FOR DATABASE_MIRRORING (
AUTHENTICATION = CERTIFICATE CA_cert
, ENCRYPTION = REQUIRED ALGORITHM AES
, ROLE = ALL
)
go
BACKUP CERTIFICATE CA_cert TO FILE = 'c:\sql\CA_cert.cer'
go
--ON SERVER-B\RED
use master
go
create master key encryption by password = 'password123!'
go
create certificate NJ_cert
With subject = 'NJ_cert Certificate'
go
create endpoint Mirroring
STATE = STARTED
AS TCP (
LISTENER_PORT=5022
, LISTENER_IP = ALL
)
FOR DATABASE_MIRRORING (
AUTHENTICATION = CERTIFICATE NJ_cert
, ENCRYPTION = REQUIRED ALGORITHM AES
, ROLE = ALL
)
go
BACKUP CERTIFICATE NJ_cert TO FILE = 'c:\sql\NJ_cert.cer'
go
--ON SERVER-A\BLUE
create login NJ_login WITH PASSWORD = 'password123!'
go
CREATE USER NJ_user FOR LOGIN NJ_login
go
CREATE CERTIFICATE NJ_cert
AUTHORIZATION NJ_user
FROM FILE = 'C:\sql\NJ_cert.cer'
go
GRANT CONNECT ON ENDPOINT::Mirroring TO NJ_login
go
--ON SERVER-B\RED
create login CA_login WITH PASSWORD = 'password123!'
go
CREATE USER CA_user FOR LOGIN CA_login
go
CREATE CERTIFICATE CA_cert
AUTHORIZATION CA_user
FROM FILE = 'C:\sql\CA_cert.cer'
go
GRANT CONNECT ON ENDPOINT::Mirroring TO CA_login
go
--ON SERVER-B\RED
alter database testdb
set partner = 'TCP://server-a.our-domain.com:5022'
go
--ON SERVER-A\BLUE
alter database testdb
set partner = 'TCP://server-b.our-domain.com:5022'
go
-- Everything works fine up until this point at which time I get the previously mentioned errors
Attach Profiler to both instances (all three if there is a witness) and monitor the events Audit Database Mirroring Login Event Class and Broker:Connection Event Class.
The error 1418 simply tells that within a specific timeout the mirroirng session was not up and running, for whatever reason. When you issue the ALTER DATABASE ... SET PARTNER = 'tcp://..' on the principal the principal will connect to the mirror and the mirror will connect to the principal in response. Which means that both the principal 'partner' value and the mirror 'partner' value, set previously, come into picture, and they both have to be correct and the the underlying infrastucture (routing, DNS, IPSEC, Firewalls) has to allw connection to the desired address:port from both partners. Throw in the witness if you have one and you got yourself a pretty complex hairball of TCP connectivity that has to be verified.
If the issue is certificate security, then the Audit Database Mirroring Login event will clearly state the cause and problem (certificate not valid, expired, bad certificate used etc etc). If the issue is the udnerlying TCP fabric (routing, DNS, IPSEC, firewall etc) then the Broker:Connection event will actually show the problem.
If you want to understand exactly how does the certificate based authentication works, read on at How does Certificate based Authentication work.