The race condition explained
XSUAA tokens expire after a configurable period (default: 12 hours, often set to 1 to 2 hours in production). Most applications cache the token and refresh it when it's about to expire. The race condition happens when:
- Service instance A checks: "is my token expired?" → No (5 seconds left)
- Service instance B checks: "is my token expired?" → No (5 seconds left)
- Token expires
- Service instance A makes an API call with the expired token → 401 Unauthorized
- Service instance A starts refreshing the token
- Service instance B makes an API call with the expired token → 401 Unauthorized
- Service instance B also starts refreshing the token
- Both instances make simultaneous OAuth2 requests to XSUAA
XSUAA rate-limits token requests. Under load, with many service instances, this cascade of simultaneous token refreshes causes some requests to be throttled, resulting in 429 responses from XSUAA, which your services interpret as authentication failures and retry, making the problem worse.
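A minimal sketch of the unguarded pattern that produces this cascade (the names here are illustrative, not taken from a real codebase):

// Naive check-then-refresh with no lock: every caller that sees a stale
// cache fires its own refresh request at XSUAA.
let cached: { token: string; expiresAt: number } | null = null;

async function getTokenUnsafe(
  fetchFreshToken: () => Promise<{ token: string; expiresAt: number }>
): Promise<string> {
  if (cached && Date.now() < cached.expiresAt) {
    return cached.token;              // check...
  }
  cached = await fetchFreshToken();   // ...then act: N concurrent callers = N refreshes
  return cached.token;
}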
Why it's hard to reproduce
This never happens in development because:
- You have one service instance, not many
- Load is low, so the 5-second expiry window is rarely hit simultaneously
- Dev tokens often have longer expiry
In production it appears as intermittent 401/429 errors that resolve themselves: exactly the kind of flaky bug that's hard to pin down. The fix gets ignored because "it only happens occasionally" until the day load spikes and it becomes a cascading failure.
The fix: mutex locking
The solution is to serialize the token refresh with a lock: within each service instance, only one caller refreshes at a time, while the others wait and reuse the freshly cached token. That collapses the refresh storm from one OAuth2 request per concurrent caller to at most one per instance.
For Node.js/TypeScript services on BTP (Cloud Foundry):
import { Mutex } from 'async-mutex';

// The pieces of the XSUAA binding we need (token endpoint URL and client credentials).
interface XsuaaConfig {
  xsuaaUrl: string;
  clientId: string;
  clientSecret: string;
}

// Build the HTTP Basic Authorization header for the client-credentials request.
function basicAuth(clientId: string, clientSecret: string): string {
  return 'Basic ' + Buffer.from(`${clientId}:${clientSecret}`).toString('base64');
}

const tokenMutex = new Mutex();
let tokenCache: { token: string; expiresAt: number } | null = null;

export async function getXsuaaToken(config: XsuaaConfig): Promise<string> {
  // Fast path: valid cached token (with a 30s safety buffer), no lock needed
  if (tokenCache && Date.now() < tokenCache.expiresAt - 30_000) {
    return tokenCache.token;
  }

  // Slow path: token is missing or about to expire, acquire the mutex
  return tokenMutex.runExclusive(async () => {
    // Re-check inside the lock: another caller may have refreshed already
    if (tokenCache && Date.now() < tokenCache.expiresAt - 30_000) {
      return tokenCache.token;
    }

    // Client-credentials grant against the XSUAA token endpoint
    const response = await fetch(`${config.xsuaaUrl}/oauth/token`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        Authorization: basicAuth(config.clientId, config.clientSecret),
      },
      body: 'grant_type=client_credentials',
    });

    if (!response.ok) {
      throw new Error(`Token refresh failed: ${response.status} ${response.statusText}`);
    }

    const data: { access_token: string; expires_in: number } = await response.json();
    tokenCache = {
      token: data.access_token,
      expiresAt: Date.now() + data.expires_in * 1000,
    };
    return tokenCache.token;
  });
}
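On Cloud Foundry, the values for XsuaaConfig come from the bound xsuaa service instance. A minimal sketch, assuming a standard xsuaa binding exposed through VCAP_SERVICES (the credentials keys url, clientid, and clientsecret come from the binding; the loader function itself is illustrative, and production code typically uses @sap/xsenv instead):

// Read the xsuaa binding from VCAP_SERVICES (Cloud Foundry service bindings).
function loadXsuaaConfig(): XsuaaConfig {
  const vcap = JSON.parse(process.env.VCAP_SERVICES ?? '{}');
  const credentials = vcap.xsuaa?.[0]?.credentials;
  if (!credentials) {
    throw new Error('No xsuaa service binding found in VCAP_SERVICES');
  }
  return {
    xsuaaUrl: credentials.url,
    clientId: credentials.clientid,
    clientSecret: credentials.clientsecret,
  };
}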
The critical detail: re-check the cache inside the lock. By the time a waiting caller acquires the mutex, the caller that held it has already refreshed the token and updated the cache. Without the re-check, every waiting caller would trigger another refresh immediately after acquiring the lock.
Use a 30-second buffer on the expiry check (expiresAt - 30_000), not zero. Tokens expire at the XSUAA server, not on your clock; network latency plus clock skew can cause a token that "hasn't expired yet" locally to be rejected by the server.
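If you prefer to keep that margin in one place, a small variant (the constant and helper names are illustrative, not part of the code above) bakes the buffer into the stored deadline instead of subtracting it on every read:

// Safety margin for network latency and clock skew between this process and XSUAA.
const EXPIRY_BUFFER_MS = 30_000;

// Convert XSUAA's relative expires_in (seconds) into an absolute local deadline
// that already accounts for the buffer, so reads can use a plain comparison.
function computeExpiresAt(expiresInSeconds: number, now: number = Date.now()): number {
  return now + expiresInSeconds * 1000 - EXPIRY_BUFFER_MS;
}

// Example: a token with expires_in = 3600 is treated as expired after 59.5 minutes locally.
const expiresAt = computeExpiresAt(3600);
const stillValid = Date.now() < expiresAt;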
Testing the fix
Write a test that simulates the race condition:
it('should only refresh token once under concurrent load', async () => {
  const fetchSpy = jest.spyOn(global, 'fetch');
  let callCount = 0;

  fetchSpy.mockImplementation(async () => {
    callCount++;
    await new Promise(r => setTimeout(r, 50)); // simulate network latency
    return {
      ok: true,
      json: async () => ({ access_token: 'new-token', expires_in: 3600 }),
    } as Response;
  });

  // Start from an expired/empty cache. This assumes the test can reset the
  // module-level tokenCache (e.g. via an exported reset helper).
  tokenCache = null;

  // Simulate 10 concurrent token requests hitting the expired cache at once.
  // `config` is an XsuaaConfig fixture defined elsewhere in the test setup.
  await Promise.all(Array(10).fill(null).map(() => getXsuaaToken(config)));

  expect(callCount).toBe(1); // only ONE network call despite 10 concurrent requests

  fetchSpy.mockRestore();
});
If your mutex implementation is correct, callCount will be exactly 1. Without the mutex, it will be 10: one refresh per concurrent caller.
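A complementary check, sketched under the same assumptions (the test can seed tokenCache and a config fixture exists), is that a still-valid cache never touches the network at all:

it('should serve a valid cached token without calling XSUAA', async () => {
  const fetchSpy = jest.spyOn(global, 'fetch');

  // Pre-populate the cache with a token that is comfortably inside the 30s buffer.
  tokenCache = { token: 'cached-token', expiresAt: Date.now() + 60 * 60 * 1000 };

  const token = await getXsuaaToken(config);

  expect(token).toBe('cached-token');
  expect(fetchSpy).not.toHaveBeenCalled(); // fast path: no network call

  fetchSpy.mockRestore();
});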