I'm having trouble with freeipa on a few machines. It's been very frustrating to debug so far. Here's the details of the issue;
How it manifests:
The user can login just fine to any host, but on some hosts they can't run sudo commands.
What i know:
There's an IPA sudo policy which is 'allow this user to run any command on any host', and also a HBAC policy of 'allow this user to use any service on any host' so i think i can rule out IPA policy being a problem.
This only seems to affect machines when they contact one particular ipa server (via the dns srv record), according to tcpdump, which i determined by flushing the sss_cache and doing sudo -k. One of the machines in question is actually that ipa server itself, so I've ruled out networking/firewalls being the cause. I'm pretty certain it's limited to just that ipa server, and clients using that particular ipa server.
Focusing on just that ipa server itsself, and comparing it to one of my other ipa servers the sudo.conf, sudoers, sssd.conf are identical (minus debugging being added to the broken one). Both have their LAN ip in /etc/hosts, and both use ntpd (which i think rules out kerberos timing issues). Aside from turning on debugging the sssd.conf and sudo.conf files are untouched from clean install.
The broken ipa server was the first one installed, so it's the master ca etc.
Sudo on the machine in question (i'm focusing on the broken ipa server itsself for simplicity) does work for users which are locally defined in the /etc/sudoers file, /etc/passwd, etc.
Details:
All the machines are using centos 7, and ipa 4.2.0
Logs: (domain name and user sanitized)
=-=-=- from end of sssd logs on server1 =-=-=-
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [set_server_common_status] (0x0100): Marking server 'server1.domain.com' as 'working'
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [be_pam_handler_callback] (0x0100): Backend returned: (0, 0, <NULL>) [Success (Success)]
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [be_pam_handler_callback] (0x0100): Sending result [0][domain.com]
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [be_pam_handler_callback] (0x0100): Sent result [0][domain.com]
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [be_pam_handler] (0x0100): Got request with the following data
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [pam_print_data] (0x0100): command: PAM_ACCT_MGMT
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [pam_print_data] (0x0100): domain: domain.com
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [pam_print_data] (0x0100): user: sirex
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [pam_print_data] (0x0100): service: sudo
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [pam_print_data] (0x0100): tty: /dev/pts/2
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [pam_print_data] (0x0100): ruser: sirex
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [pam_print_data] (0x0100): rhost:
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [pam_print_data] (0x0100): authtok type: 0
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [pam_print_data] (0x0100): newauthtok type: 0
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [pam_print_data] (0x0100): priv: 0
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [pam_print_data] (0x0100): cli_pid: 13960
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [pam_print_data] (0x0100): logon name: not set
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [ipa_hostgroup_info_done] (0x0200): Dereferenced host group: ipa-servers
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [hbac_get_category] (0x0200): Category is set to 'all'.
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [hbac_get_category] (0x0200): Category is set to 'all'.
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [hbac_get_category] (0x0200): Category is set to 'all'.
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [ipa_hbac_evaluate_rules] (0x0080): Access granted by HBAC rule [allow ops to anything]
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [be_pam_handler_callback] (0x0100): Backend returned: (0, 0, <NULL>) [Success (Success)]
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [child_sig_handler] (0x0100): child [13965] finished successfully.
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [be_pam_handler_callback] (0x0100): Backend returned: (0, 0, Success) [Success (Success)]
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [be_pam_handler_callback] (0x0100): Sending result [0][domain.com]
(Tue Jun 28 23:21:33 2016) [sssd[be[domain.com]]] [be_pam_handler_callback] (0x0100): Sent result [0][domain.com]
=-=-=- sudo debugging output on server1 from password prompt to failure =-=-=-
Jun 28 21:59:07 sudo[9738] <- expand_prompt @ ./check.c:398 := [sudo] password for sirex:
Jun 28 21:59:07 sudo[9738] -> verify_user @ ./auth/sudo_auth.c:193
Jun 28 21:59:07 sudo[9738] -> sudo_pam_verify @ ./auth/pam.c:131
Jun 28 21:59:07 sudo[9738] -> converse @ ./auth/pam.c:305
Jun 28 21:59:07 sudo[9738] -> auth_getpass @ ./auth/sudo_auth.c:347
Jun 28 21:59:07 sudo[9738] -> tgetpass @ ./tgetpass.c:76
Jun 28 21:59:07 sudo[9738] -> tty_present @ ./tgetpass.c:329
Jun 28 21:59:07 sudo[9738] <- tty_present @ ./tgetpass.c:333 := true
Jun 28 21:59:07 sudo[9738] -> term_noecho @ ./term.c:88
Jun 28 21:59:07 sudo[9738] <- term_noecho @ ./term.c:99 := 1
Jun 28 21:59:07 sudo[9738] -> getln @ ./tgetpass.c:272
Jun 28 21:59:09 sudo[9738] <- getln @ ./tgetpass.c:315 := **********
Jun 28 21:59:09 sudo[9738] -> term_restore @ ./term.c:73
Jun 28 21:59:09 sudo[9738] <- term_restore @ ./term.c:82 := 1
Jun 28 21:59:09 sudo[9738] <- tgetpass @ ./tgetpass.c:202 := **********
Jun 28 21:59:09 sudo[9738] <- auth_getpass @ ./auth/sudo_auth.c:365 := **********
Jun 28 21:59:09 sudo[9738] <- converse @ ./auth/pam.c:387 := 0
Jun 28 21:59:09 sudo[9738] <- sudo_pam_verify @ ./auth/pam.c:142 := 0
Jun 28 21:59:09 sudo[9738] <- verify_user @ ./auth/sudo_auth.c:282 := 1
Jun 28 21:59:09 sudo[9738] -> sudo_auth_cleanup @ ./auth/sudo_auth.c:160
Jun 28 21:59:09 sudo[9738] -> sudo_pam_cleanup @ ./auth/pam.c:189
Jun 28 21:59:09 sudo[9738] <- sudo_pam_cleanup @ ./auth/pam.c:193 := 0
Jun 28 21:59:09 sudo[9738] <- sudo_auth_cleanup @ ./auth/sudo_auth.c:177 := 0
Jun 28 21:59:09 sudo[9738] -> sudo_pw_delref @ ./pwutil.c:249
Jun 28 21:59:09 sudo[9738] -> sudo_pw_delref_item @ ./pwutil.c:238
Jun 28 21:59:09 sudo[9738] <- sudo_pw_delref_item @ ./pwutil.c:243
Jun 28 21:59:09 sudo[9738] <- sudo_pw_delref @ ./pwutil.c:251
Jun 28 21:59:09 sudo[9738] <- check_user @ ./check.c:189 := true
Jun 28 21:59:09 sudo[9738] -> log_failure @ ./logging.c:318
Jun 28 21:59:09 sudo[9738] -> log_denial @ ./logging.c:256
Jun 28 21:59:09 sudo[9738] -> audit_failure @ ./audit.c:68
Jun 28 21:59:09 sudo[9738] -> linux_audit_command @ ./linux_audit.c:70
Jun 28 21:59:09 sudo[9738] -> linux_audit_open @ ./linux_audit.c:49
Jun 28 21:59:09 sudo[9738] <- linux_audit_open @ ./linux_audit.c:61 := 13
Jun 28 21:59:09 sudo[9738] <- linux_audit_command @ ./linux_audit.c:97 := 3
Jun 28 21:59:09 sudo[9738] <- audit_failure @ ./audit.c:81
Jun 28 21:59:09 sudo[9738] -> new_logline @ ./logging.c:746
Jun 28 21:59:09 sudo[9738] <- new_logline @ ./logging.c:867 := user NOT authorized on host ; TTY=pts/2 ; PWD=/home/sirex ; USER=root ; COMMAND=/bin/bash -l
Unless i'm reading this wrongly, it looks to me like sudo is speaking with sssd, which talks to IPA via kerberos (on the same machine). That says OK, then sudo... for some reason, rejects that and says no anyway ?
Configs: (of broken ipa server)
=-=-=- sudo.conf (comment lines removed) =-=-=-=-
Debug sudo /var/log/sudo_debug all@debug
Debug sudoers.so /var/log/sudo_debug all@debug
Plugin sudoers_policy sudoers.so
Plugin sudoers_io sudoers.so
Set disable_coredump false
=-=-=- sssd.conf (whitespace removed) =-=-=-=-
[domain/domain.com]
debug_level = 5
cache_credentials = True
krb5_store_password_if_offline = True
ipa_domain = domain.com
id_provider = ipa
auth_provider = ipa
access_provider = ipa
ipa_hostname = server1.domain.com
chpass_provider = ipa
ipa_server = server1.domain.com
ipa_server_mode = True
ldap_tls_cacert = /etc/ipa/ca.crt
[sssd]
services = nss, sudo, pam, ssh
config_file_version = 2
domains = domain.com
[nss]
memcache_timeout = 600
homedir_substring = /home
[pam]
[sudo]
[autofs]
[ssh]
[pac]
[ifp]
=-=-==- /etc/sudoers (comments removed) =-=-=-=-=-
Defaults !visiblepw
Defaults always_set_home
Defaults env_reset
Defaults env_keep = "COLORS DISPLAY HOSTNAME HISTSIZE INPUTRC KDEDIR LS_COLORS"
Defaults env_keep += "MAIL PS1 PS2 QTDIR USERNAME LANG LC_ADDRESS LC_CTYPE"
Defaults env_keep += "LC_COLLATE LC_IDENTIFICATION LC_MEASUREMENT LC_MESSAGES"
Defaults env_keep += "LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE"
Defaults env_keep += "LC_TIME LC_ALL LANGUAGE LINGUAS _XKB_CHARSET XAUTHORITY"
Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin
root ALL=(ALL) ALL
%wheel ALL=(ALL) ALL
Edit1: Ok, on suggest from jhrozek i also enabled debugging on the [sudo] section, which gives this in the logs:
(Wed Jun 29 21:08:27 2016) [sssd[sudo]] [sudosrv_get_rules] (0x0400): Retrieving default options for [sirex] from [domain.com]
(Wed Jun 29 21:08:27 2016) [sssd[sudo]] [sudosrv_get_sudorules_query_cache] (0x0200): Searching sysdb with [(&(objectClass=sudoRule)(|(sudoUser=ALL)(name=defaults)(sudoUser=sirex)(sudoUser=#123607)(sudoUser=%confluence-administrators)(sudoUser=%jira-administrators)(sudoUser=%build_system_shell)(sudoUser=%jira-developers)(sudoUser=%publictracker)(sudoUser=%staff)(sudoUser=%wikiprivate)(sudoUser=%jira-users)(sudoUser=%vpn_users)(sudoUser=%ipausers)(sudoUser=%admins)(sudoUser=%gerrit)(sudoUser=%sirex)(sudoUser=%wiki)(sudoUser=%ops)(sudoUser=%gerrit-submit)(sudoUser=%sirex)(sudoUser=+*))(&(dataExpireTimestamp<=1467234507)))]
(Wed Jun 29 21:08:27 2016) [sssd[sudo]] [sudosrv_get_rules] (0x2000): About to get sudo rules from cache
(Wed Jun 29 21:08:27 2016) [sssd[sudo]] [sudosrv_get_sudorules_query_cache] (0x0200): Searching sysdb with [(&(objectClass=sudoRule)(|(name=defaults)))]
(Wed Jun 29 21:08:27 2016) [sssd[sudo]] [sudosrv_get_sudorules_from_cache] (0x0400): Returning 0 rules for [<default options>@domain.com]
ldbsearch gives 0 results, but does also show 'asq: Unable to register control with rootdse!' (though it says that on the other servers also)
on the broken server
ldbsearch -H /var/lib/sss/db/cache_domain.com.ldb -b cn=sysdb '(objectClass=sudoRule)'
gives 0, whereas it gives 3 on the other auth servers, so i guess it's somehow replication related, but the question now is how to fix that.
Edit2: Oddly, on the broken server
ipa sudorule-find All
returns the 3 rules !? I've tried removing the sssd cache file on the broken server and restarting sssd, but ldbsearch still gives 0 rules found.
If i do ldbsearch with no filter i get 48 records on the broken server and 51 on the others. Only the 3 sudo rule entries are missing.
I made these sudo rules on one of the working ipa servers, so i'm led to believe that either replication isnt working for the sysdb table, or it's not working just for the sudo rules of it. There is a firewall between them but user creation on a working ipa server IS replicated over to the broken ipa server, so i dont know if i can rule the firewall out. It's worth noting however that while i think all the ports are allowed between them, the broken server is in a DMZ subnet so i don't allow port 22 (ssh) back from that ipa server to the others. I don't know if that matters ? I have however done the conncheck script and it said everything was either OK or warning for the two ports that were in use by ipa itself
Edit3: OK, so making a sudo rule on the broken server which affects all servers (so it should be cached in sssd) makes the new rule show up in the UI (as well as the 3 others) but it does not show up in sssd. So it looks like sssd isn't caching the rules properly.
I just found a file ~/.ipa/log/cli.log, which on the broken server (only) has
2016-05-29T22:59:23Z 6583 MainThread ipa ERROR Certificate operation cannot be completed: Unable to communicate with CMS (Internal Server Error)
I don't know if that's a red herring or a smoking gun ?
Edit4: From the comments of Danila Ladner and subsequent testing, this seems to be occur in 4.2.0-15.0.1.el7.centos.17 but not in 4.2.0-15.0.1.el7.centos.6.1 which i believe was due to the corresponding upgrade of libsss to 1.13.0.40.el7_2.9
I believe it's related to: https://fedorahosted.org/sssd/ticket/1108
and
https://bugzilla.redhat.com/show_bug.cgi?id=1256849
but now i just need to work out a fix for it. the ipa-compat tree wasn't enabled on the 'broken' ipa server, now it is but still no joy.
This seems to be related to the following bug: https://fedorahosted.org/sssd/ticket/1108
and
https://bugzilla.redhat.com/show_bug.cgi?id=1256849
the suggestion there to enable the compat tree seems to have done the trick, after allowing about 30 minutes for the sssd caching to expire across the network.
In short, you need to ensure the compat tree is enabled in ipa for sssd to cache the sudo rules correctly. I had the compat tree off on the one 'broken' ipa server, and when clients were talking to that particular ipa server (via DNS SRV record) they were not caching any sudo rules. This manifested by machines sometimes being able to let users sudo, and sometimes not. Because the ipa server itsself doesn't use the SRV record but instead uses itsself, sudo on the ipa server was always broken.
Running 'ipa-compat-manage enable' on the ipa server and waiting for the sssd cache to expire seems to have fixed the problem.