Events from the future Vs strict API validation
I present to you, the most delicious problem my team solved this month!
Picture the scene
A distributed system for managing the delivery of appointments in the adult social care sector. It consists of the following parts:
- The origin of data is a mobile app with offline capabilities
- A care rostering system receives updates about completed care from the mobile app, which it then emits notifications for
- An event-driven integration listens to those notifications and copies data to....
- A destination system with a timeline of care interface
The problem
A small percentage of events were failing to be processed. Even stranger was the fact future events for the exact same entity would succeed.
The cause
Here's the root cause, in reverse order....
- The destination system's API blocks the creation of timeline events if they have a status of "completed" but a time in the future. Very logical.
- The integration processes events immediately, often within a couple of seconds. Speedy!
- The rostering system that emits those events, trusts the event time provided by the mobile app. Naturally.
- In order to support offline authoring, the mobile app records event time as the user's device clock. Of course!
- Some Carer's mobile devices have their clocks set ~20 seconds in the future. Oh dear.
Do you see the issue? It's easy when you know the cause.
The investigation
When we first investigated the steady trickle of 400 Bad Request responses being logged in Grafana, it was a real head-scratcher. The validation error message did not make sense, "Start time must be in the past for a completed event", but these are in the past, we thought. Look, I can trigger them to re-process and they go through fine.
Working backwards we carefully reread the Python API's validation code and tested the endpoint. No problems there.
I then did a deep dive into the PHP integration, tracing exactly which property on the event is used as the source when transforming the event time. I noticed that particular property was not actually included in our logs. It wouldn't have made sense to given the log already includes the notification time. I whipped up a quick patch to include the property and hotfixed production. Now we wait.
I hate having to wait for a bug to happen again in production. It feels wrong, like we're using real people as guinea pigs. I want to fix it right then.
Eventually, the alert went off and looking at the new log I felt relief and confusion in equal measure, the event time was a few seconds ahead of the notification time. I jumped into the care rostering system's PHP Symfony code and traced the source back to the database and eventually the mobile app.
The solution
At first I approached the lead mobile dev, who confirmed we don't have control over the user's device clock and it will occasionally happen. We can't fix that.
I then asked the Lead Developer for the team that maintains the rostering system's API: Can we change the code so it corrects the data by setting the time to be now if we detect it's in the future? This was deemed high risk given it's a legacy system with known issues relating to timezones and clocks changing. That team did not want to take the risk of adding code that modifies customer data.
I then consulted the maintainer of the destination system and asked if we loosened the validation, would the timeline interface handle completed events in the future? Why would anyone want that? Was the response. Fair enough, I guess we're not fixing that either.
I dialled-in a quick meeting with all involved, just to confirm the decision. In the end it would be the integration that my team maintain which would implement the solution. If a future event time was detected, it would not process it but instead add it to another queue with a "reattempt after" datetime set. Then a continuous worker would process items from this queue when they're ready.
Summary
I loved solving this "bug" because at no single stage had anyone done anything stupid. Is it even fair to call it a bug given the negative implications of that word?
This problem is simply real world behaviour and another example of how computers start to behave like unpredictable organic systems when distributed across networks.