Starting Jan 3, 2017, some of our customers and users may have noticed that AOL address books were not connecting reliably. In fact, most connection attempts were failing. We researched the issue and raised it with AOL’s development team, who quickly resolved it.
How We Found Out
We continually monitor our upstream address book sources for availability and accuracy. We publish the current status on our table of supported sources, where anyone can check on the health of our integrations at any time. When one of our sources goes down, we get a stream of nagging messages in our private #alerts Slack channel.
Members of our Ops and development teams are tuned in to this channel at all times. This lets us all know when any of our sources is in trouble, so we can jump into action to resolve it.
Occasionally, our AOL monitoring reports a false positive. This typically happens because the refresh token for our test account has expired and we have to manually re-authorize CloudSponge’s access to the contacts. Sometimes the address book data in our AOL test account changes and we have to sign in to the account, again manually, and restore it. We did the usual dance and found that it didn’t help.
Upon deeper investigation, we discovered that our attempts to fetch the contact data from AOL’s API were returning a 500 error. It’s not unusual for us to see issues with many of our providers, but this is the first time we’ve seen anything affecting AOL since we started to use their APIs three years ago.
What We Did About It
Step 1: Temporary Workaround
We observed several facts about these AOL failures:
- The OAuth flow was returning a valid access token and we were able to access the user’s basic information.
- The failures were always occurring when attempting to get the user’s contacts.
- About 10% of the time, the call to get the user’s contacts would succeed.
So we devised a workaround to invert the failure ratio. By retrying the contacts download up to 20 times, we increased the probability of a connection succeeding to about 90%. This brought our failure rate down to roughly 10%. Not ideal, but much better than before.
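The retry idea can be sketched roughly as follows. This is a minimal illustration, not our production code: `ContactsFetchError`, `fetch_with_retries`, and `overall_success_probability` are hypothetical names, and the arithmetic assumes each attempt succeeds independently about 10% of the time.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("aol_contacts")


class ContactsFetchError(Exception):
    """Hypothetical error standing in for a 500 response from the contacts API."""


def fetch_with_retries(fetch: Callable[[], T], max_attempts: int = 20) -> T:
    """Call fetch() until it succeeds or max_attempts tries have failed."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ContactsFetchError as error:
            # Log every failure so the failing requests can be searched later.
            logger.warning("contacts fetch failed (attempt %d/%d): %s",
                           attempt, max_attempts, error)
            last_error = error
    raise ContactsFetchError(f"gave up after {max_attempts} attempts") from last_error


def overall_success_probability(per_attempt: float, max_attempts: int) -> float:
    """Chance that at least one of max_attempts independent tries succeeds."""
    return 1 - (1 - per_attempt) ** max_attempts
```

Under those assumptions, `overall_success_probability(0.1, 20)` works out to 1 - 0.9^20, or about 0.88, which is in line with the roughly 90% success rate we saw.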
In making this change, we also ended up piping some interesting log data to our Papertrail account. We were able to search for specific log data showing exactly which API requests were failing and at what time. This information proved to be invaluable for communicating the issue to AOL.
Step 2: Contact AOL
While we looked into the details of the issue on our end, we also reached out to AOL’s development team. They responded very quickly and we shared the log results of our workaround with them.
Once their team had the information, they were able to track down the root cause and promptly fix the issue on their end.
The explanation for the root cause that we received was non-specific yet still satisfying, given how quickly they addressed the issue.
Ends Well
In the end, our AOL integration was hampered for 5 days. Thanks to our thorough understanding of the issue on our side and to AOL’s tenacity, the issue was resolved the same day we raised it with them.