AzureAD federation failing. Need help finding root cause

Greetings all. We have an existing Grafana 6.6.2 install on CentOS 7.8 using nginx as the user-facing SSL proxy. It’s configured with AzureAD authentication, which was working properly until last week. Nothing appears to have changed, and static logins are still fine. Behaviorally, new logins are correctly forwarded to the company’s microsoftonline login dialog, two-factor confirmations are sent via SMS, etc. The redirect back to Grafana after authentication then hangs. Here’s the relevant setup (anonymized a bit).

From grafana.ini
##################################### Azure Oauth ####################
[auth.generic_oauth]
name = AzureAD SSO
enabled = true
allow_sign_up = true
client_id = c5a49c03-a339-46d0-af6f-*************
client_secret = 6zQ?h8FQiv@1vI7Il?a***************
scopes = openid email profile
auth_url = h44ps://login.microsoftonline.com/9feebc97-ff04-42c9-a152-76********/oauth2/v2.0/authorize
token_url = h44ps://login.microsoftonline.com/9feebc97-ff04-42c9-a152-76********/oauth2/v2.0/token
allowed_domains = company.com
;allowed_domains =
;allowed_groups =

Nginx error message:
2021/02/17 17:47:54 [error] 99280#0: *3185 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.63.144.220, server: inet-dashboard.services.company.net, request: “GET /login/generic_oauth?code=0.AAAAl7zunwT_yUKhUnZwc4chGAOcpMU5o9BGr28S4NX1e2MOAKs.AQABAAIAAAD–DLA3VO7QrddgJg7Wevr1wwm2sea6b8Y4VgQlCRiwGKMcmRGH6oKDitxnCFVexv5nnhSF3p64GW3l9S8qrb2Xa7rgDsBXTjzLuYViP0xp20QyG4lonfj6ZOv4eIVqt_IiSVBJP8IeRGOyXf5_nEBk_qcwuRMjGVhE-uugI2mjj3fl8T-EUpQuyHMk04wBUe-IHcTT_stupyag3Y43WUQdLz28NcvKTFjbvhdxiwz4nvZdV7DrtYuvTMBxWeMUEZu6JVRIbXppNYp8v3iM-0P8dCW63-WHqe7X9sckLn_Q7CQYzuciiv4Csj4NGnnE_kxX1kAMfmZxA0UfTH06jZziDOQBRuThEvc6bPNd1rIKhuhknsJ-l2ZFmn_zbTBem9VnjwFFEz4IslyLfiINhbtnPtFqFje_At0cGPx8vaXijY8zJNLeNKM97ly0S1hxrBpe3HszsD5-yF0dtPcPrXjnk2V9JSW_dduSLewZ2u79jHL6y-uCg2jkWKKIOM7cDqBsbNCINeF4Tk8grBigalmJ0kvnoQYxSfxNOz_3pRwzQxkMsQe-5WXAxpp9mB8CfiZPX-rkhieDC9f64SeXmkV9q5cd2-GVhptaHMS1SgtyDRh9KiH5lamgGE3qSqfmFVXGbsKw9OQQpMg5zFV3shvG6zXpqJNQWEO4UdjSpdj3cCFicitbaxyOiAjx_jh-dTSFAJ49td-S7hkNnX0N3rA1ujxQZmgXTDeoh5u1_hSokEFa4JWF1AhdWIw077z-D1CVh7j9E6QocTC3Szk0_AmFCUvN5E14YZiXE6D5GlN-t98EDykRVct2_43qTp8uasHMPLB1I-L7efuQ2dCSir2WLytbtH7dsr1rM147jOdxi-KL5RZa9SQoG1ZpfmQEFUgAA&state=bEwGbZOA-luf2g8NM-mMCRztKurDpdcP8tIHQ6MdNXo%3d&session_state=9527d858-bb4a-4cae-a4f7-89b53c5eaae7 HTTP/2.0”, upstream: 
"h44p://127.0.0.1:3000/login/generic_oauth?code=0.AAAAl7zunwT_yUKhUnZwc4chGAOcpMU5o9BGr28S4NX1e2MOAKs.AQABAAIAAAD–DLA3VO7QrddgJg7Wevr1wwm2sea6b8Y4VgQlCRiwGKMcmRGH6oKDitxnCFVexv5nnhSF3p64GW3l9S8qrb2Xa7rgDsBXTjzLuYViP0xp20QyG4lonfj6ZOv4eIVqt_IiSVBJP8IeRGOyXf5_nEBk_qcwuRMjGVhE-uugI2mjj3fl8T-EUpQuyHMk04wBUe-IHcTT_stupyag3Y43WUQdLz28NcvKTFjbvhdxiwz4nvZdV7DrtYuvTMBxWeMUEZu6JVRIbXppNYp8v3iM-0P8dCW63-WHqe7X9sckLn_Q7CQYzuciiv4Csj4NGnnE_kxX1kAMfmZxA0UfTH06jZziDOQBRuThEvc6bPNd1rIKhuhknsJ-l2ZFmn_zbTBem9VnjwFFEz4IslyLfiINhbtnPtFqFje_At0cGPx8vaXijY8zJNLeNKM97ly0S1hxrBpe3HszsD5-yF0dtPcPrXjnk2V9JSW_dduSLewZ2u79jHL6y-uCg2jkWKKIOM7cDqBsbNCINeF4Tk8grBigalmJ0kvnoQYxSfxNOz_3pRwzQxkMsQe-5WXAxpp9mB8CfiZPX-rkhieDC9f64SeXmkV9q5cd2-G

and the nginx snippet:
location / {
    rewrite /(.*) /$1 break;
    proxy_pass h44p://inet-dashboard.services.company.net:3000;
    proxy_redirect off;
    proxy_set_header Host $http_host;
}

Any thoughts on what might have broken here? (note: web URLs replaced with “h44p” to allow me to post the contents without hitting the noob filter).

Thanks.
Dan

You should enable the debug log level in Grafana and check the Grafana logs, not only the Nginx logs. Blind guess: Grafana has no access to https://login.microsoftonline.com (because of a firewall, security group, NAT, …)

The Grafana server is also in debug mode, but unfortunately that’s not providing much more information. We suspected that the server host could not reach the MS service, but opening firewall policies is a long process (potentially weeks). This Grafana event, along with lots of outgoing SYN packets, seems to confirm it:
t=2021-02-25T08:37:03-0500 lvl=eror msg=login.OAuthLogin(NewTransportWithCode) logger=context userId=0 orgId=0 uname= error="Post https://login.microsoftonline.com/9feebc97-ff04-42c9-a152-767073872118/oauth2/v2.0/token: dial tcp 20.190.154.16:443: connect: connection timed out"

What is confusing about this is that the platform has been running for nearly a year without that reachability. Did AzureAD change something two weeks ago that requires a callback from the hosting app where it did not previously?

Dan

I don’t own or know your/Azure infrastructure, so this is just a blind guess based on my experience: someone opened the firewall for the IPs that the login.microsoftonline.com DNS record pointed to at that time. Azure has probably since changed the IPs behind that DNS record, so the new IPs (e.g. 20.190.154.16) are not whitelisted in your firewall and Grafana can’t exchange the code for the token. It’s really not a Grafana issue, but a problem with your infrastructure setup.
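To test that guess from the Grafana host itself, something like the following Python sketch may help. It resolves the hostname and tries a plain TCP connect to each address currently returned by DNS, which is roughly what Grafana has to do before it can POST to the token URL. The function name `check_endpoints` and the 5-second timeout are my own choices for illustration, not anything from Grafana or Azure.

```python
import socket


def check_endpoints(host, port=443, timeout=5.0):
    """Resolve host and attempt a TCP connect to each resolved address.

    Returns a list of (ip, ok, detail) tuples. Hypothetical helper for
    illustration only; it checks TCP reachability, not TLS or OAuth.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [(None, False, f"DNS lookup failed: {exc}")]

    results = []
    seen = set()
    for family, socktype, proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip in seen:  # getaddrinfo can return the same address more than once
            continue
        seen.add(ip)
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            results.append((ip, True, "connected"))
        except OSError as exc:  # includes timeouts and connection refused
            results.append((ip, False, str(exc)))
        finally:
            sock.close()
    return results
```

Calling `check_endpoints("login.microsoftonline.com")` on the Grafana server and printing the result should show which of the currently published IPs are reachable. If every entry times out, that points at outbound filtering rather than at Grafana. Since the IPs behind the record can change, a single run only reflects the DNS answer at that moment.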

@jangaraj, thanks. I had noticed this early on but dismissed it, since I’d never requested outbound connectivity. It turns out we had been matching a filtering policy for another application and reaching the cloud service that way for nearly a year; then Azure moved their endpoints to Akamai CDN hosting a few weeks ago and it broke. After an outbound HTTPS permit and source NAT were put in place, it’s all functioning as expected.