The race condition explained
XSUAA tokens expire after a configurable period (default: 12 hours, often set to 1 to 2 hours in production). Most applications cache the token and refresh it when it's about to expire. The race condition happens when:
- Service instance A checks: "is my token expired?" → No (5 seconds left)
- Service instance B checks: "is my token expired?" → No (5 seconds left)
- Token expires
- Service instance A makes an API call with the expired token → 401 Unauthorized
- Service instance A starts refreshing the token
- Service instance B makes an API call with the expired token → 401 Unauthorized
- Service instance B also starts refreshing the token
- Both instances make simultaneous OAuth2 requests to XSUAA
XSUAA rate-limits token requests. Under load, with many service instances, this cascade of simultaneous token refreshes causes some requests to be throttled, resulting in 429 responses from XSUAA, which your services interpret as authentication failures and retry, making the problem worse.
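A minimal sketch of the unguarded pattern that produces this cascade (the names here are illustrative, not taken from a real codebase):

// Naive check-then-refresh with no lock: every caller that sees a stale
// cache fires its own refresh request at XSUAA.
let cached: { token: string; expiresAt: number } | null = null;

async function getTokenUnsafe(
  fetchFreshToken: () => Promise<{ token: string; expiresAt: number }>
): Promise<string> {
  if (cached && Date.now() < cached.expiresAt) {
    return cached.token;              // check...
  }
  cached = await fetchFreshToken();   // ...then act: N concurrent callers = N refreshes
  return cached.token;
}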
Why it's hard to reproduce
This never happens in development because:
- You have one service instance, not many
- Load is low, so the 5-second expiry window is rarely hit simultaneously
- Dev tokens often have longer expiry
In production it appears as intermittent 401/429 errors that resolve themselves: exactly the kind of flaky bug that's hard to pin down. The fix gets ignored because "it only happens occasionally" until the day load spikes and it becomes a cascading failure.
The fix: mutex locking
The solution is to serialize the token refresh with a lock: within each service instance, only one caller refreshes at a time, while the others wait and reuse the freshly cached token. That collapses the refresh storm from one OAuth2 request per concurrent caller to at most one per instance.
For Node.js/TypeScript services on BTP (Cloud Foundry):
import { Mutex } from 'async-mutex';

// The pieces of the XSUAA binding we need (token endpoint URL and client credentials).
interface XsuaaConfig {
  xsuaaUrl: string;
  clientId: string;
  clientSecret: string;
}

// Build the HTTP Basic Authorization header for the client-credentials request.
function basicAuth(clientId: string, clientSecret: string): string {
  return 'Basic ' + Buffer.from(`${clientId}:${clientSecret}`).toString('base64');
}

const tokenMutex = new Mutex();
let tokenCache: { token: string; expiresAt: number } | null = null;

export async function getXsuaaToken(config: XsuaaConfig): Promise<string> {
  // Fast path: valid cached token (with a 30s safety buffer), no lock needed
  if (tokenCache && Date.now() < tokenCache.expiresAt - 30_000) {
    return tokenCache.token;
  }

  // Slow path: token is missing or about to expire, acquire the mutex
  return tokenMutex.runExclusive(async () => {
    // Re-check inside the lock: another caller may have refreshed already
    if (tokenCache && Date.now() < tokenCache.expiresAt - 30_000) {
      return tokenCache.token;
    }

    // Client-credentials grant against the XSUAA token endpoint
    const response = await fetch(`${config.xsuaaUrl}/oauth/token`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        Authorization: basicAuth(config.clientId, config.clientSecret),
      },
      body: 'grant_type=client_credentials',
    });

    if (!response.ok) {
      throw new Error(`Token refresh failed: ${response.status} ${response.statusText}`);
    }

    const data: { access_token: string; expires_in: number } = await response.json();
    tokenCache = {
      token: data.access_token,
      expiresAt: Date.now() + data.expires_in * 1000,
    };
    return tokenCache.token;
  });
}
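On Cloud Foundry, the values for XsuaaConfig come from the bound xsuaa service instance. A minimal sketch, assuming a standard xsuaa binding exposed through VCAP_SERVICES (the credentials keys url, clientid, and clientsecret come from the binding; the loader function itself is illustrative, and production code typically uses @sap/xsenv instead):

// Read the xsuaa binding from VCAP_SERVICES (Cloud Foundry service bindings).
function loadXsuaaConfig(): XsuaaConfig {
  const vcap = JSON.parse(process.env.VCAP_SERVICES ?? '{}');
  const credentials = vcap.xsuaa?.[0]?.credentials;
  if (!credentials) {
    throw new Error('No xsuaa service binding found in VCAP_SERVICES');
  }
  return {
    xsuaaUrl: credentials.url,
    clientId: credentials.clientid,
    clientSecret: credentials.clientsecret,
  };
}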
The critical detail: re-check the cache inside the lock. By the time a waiting caller acquires the mutex, the caller that held it has already refreshed the token and updated the cache. Without the re-check, every waiting caller would trigger another refresh immediately after acquiring the lock.
Use a 30-second buffer on the expiry check (expiresAt - 30_000), not zero. Tokens expire at the XSUAA server, not on your clock; network latency plus clock skew can cause a token that "hasn't expired yet" locally to be rejected by the server.
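If you prefer to keep that margin in one place, a small variant (the constant and helper names are illustrative, not part of the code above) bakes the buffer into the stored deadline instead of subtracting it on every read:

// Safety margin for network latency and clock skew between this process and XSUAA.
const EXPIRY_BUFFER_MS = 30_000;

// Convert XSUAA's relative expires_in (seconds) into an absolute local deadline
// that already accounts for the buffer, so reads can use a plain comparison.
function computeExpiresAt(expiresInSeconds: number, now: number = Date.now()): number {
  return now + expiresInSeconds * 1000 - EXPIRY_BUFFER_MS;
}

// Example: a token with expires_in = 3600 is treated as expired after 59.5 minutes locally.
const expiresAt = computeExpiresAt(3600);
const stillValid = Date.now() < expiresAt;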
Testing the fix
Write a test that simulates the race condition:
it('should only refresh token once under concurrent load', async () => {
  const fetchSpy = jest.spyOn(global, 'fetch');
  let callCount = 0;

  fetchSpy.mockImplementation(async () => {
    callCount++;
    await new Promise(r => setTimeout(r, 50)); // simulate network latency
    return {
      ok: true,
      json: async () => ({ access_token: 'new-token', expires_in: 3600 }),
    } as Response;
  });

  // Start from an expired/empty cache. This assumes the test can reset the
  // module-level tokenCache (e.g. via an exported reset helper).
  tokenCache = null;

  // Simulate 10 concurrent token requests hitting the expired cache at once.
  // `config` is an XsuaaConfig fixture defined elsewhere in the test setup.
  await Promise.all(Array(10).fill(null).map(() => getXsuaaToken(config)));

  expect(callCount).toBe(1); // only ONE network call despite 10 concurrent requests

  fetchSpy.mockRestore();
});
If your mutex implementation is correct, callCount will be exactly 1. Without the mutex, it will be 10: one refresh per concurrent caller.
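A complementary check, sketched under the same assumptions (the test can seed tokenCache and a config fixture exists), is that a still-valid cache never touches the network at all:

it('should serve a valid cached token without calling XSUAA', async () => {
  const fetchSpy = jest.spyOn(global, 'fetch');

  // Pre-populate the cache with a token that is comfortably inside the 30s buffer.
  tokenCache = { token: 'cached-token', expiresAt: Date.now() + 60 * 60 * 1000 };

  const token = await getXsuaaToken(config);

  expect(token).toBe('cached-token');
  expect(fetchSpy).not.toHaveBeenCalled(); // fast path: no network call

  fetchSpy.mockRestore();
});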