App DevelopmentAPI Development

Lessons Learned from a Push Notification Refactor

PushNotification-13

Over the last few months, I was tasked with reworking a notifications system and making it more consistent and developer-friendly. Since it was an inherited codebase, there were many times when I would run into issues and not know enough to figure out why something went wrong. Over time, however, it became apparent that the issues in the codebase were not related to a lack of understanding in regards to the code, but to how the code was written. The previously implemented processes were difficult to reason with and just as difficult to debug. By the time I was done with refactoring, however, the system had reached a point where:

  • Notifications had a high success rate
  • Notification failures were more transparent and took less time to fix
  • Logging and analytics on notifications were easy to parse and useful to read
  • Test Engineers were able to successfully and confidently pass notifications through a comprehensive regression test

Here are some lessons I learned that helped me accomplish this:

  1. Notification responsibilities should be simple and concise
  2. Notification processing should continue when there is no fatal error
  3. Notification processing should be testable, debuggable, and self-documenting
Before We Go Further…

This article assumes a working knowledge of Amazon Web Services (AWS) and some of its services (DynamoDB, Simple Notification Services, Lambdas). Code examples are a mixture of actual AWS syntax and pseudocode for simplicity’s sake.

Notification Responsibilities Should Be Simple & Concise

The original codebase for notifications was messy. It was rife with disorganized code and anti-patterns. This made bug fixes and feature improvements slow-going processes. To fix this, a few changes had to be made: we normalized database calls and simplified module responsibility.

Database Call Normalization

In the initial codebase, DynamoDB calls were extensive. Functionality was spread out over multiple functions and a developer might have to traverse multiple files to fully understand all the use cases of a single table . All of this redundant functionality needed to be reduced to a single simple and flexible interface:

// BEFORE FIX

//”notificationsTable” is a wrapper class used to simplify DynamoDB

const getNotifications = async (notificationsTable) => {
   return await notificationsTable.getNotifications()
}

const getNewNotifications = async (notificationsTable) => {
   return await notificationsTable.getNewNotifications()
}

const getPlatformNotifications = async (notificationsTable) => {
   return await notificationsTable.getPlatformNotifications()
}

const getNewPlatformNotifications = async (notificationsTable) => {
   return await notificationsTable.getNewPlatformNotifications()
}

After the update, we simplified getting notifications to a single function with a configuration object to be passed in by the caller. Then we updated the database table module to handle multiple types of queries based on those configuration decisions. As the DynamoDB has the most context into how notifications are stored and retrieved, it made sense to place the burden of retrieving different types of notifications on the DB itself instead of an outside caller.

The benefits here are that this same function can be reused anywhere that notifications are needed in the application, so long as the configuration is set up correctly. What’s more, this function could easily be mocked for dependency injection. One function mock that has different return results based on the configuration would be trivial to create and reusable across many unit tests.

// AFTER FIX

const getNotifications = (config, notificationsTable) => {
    const { type, isNew } = config
    return await notificationsTable.getNotifications(type, isNew)
}
Module Simplification

We began module simplification with notification processing. In the initial process, notifications were retrieved and looped over. Each notification, depending on its properties, would be pushed to users in a diverse set of ways. This process added complexity for multiple reasons. Firstly, multiple notifications were being processed simultaneously in different ways, making notification logging quite difficult to track. Secondly, debugging became a hassle, as tracing a specific type of notification required stepping over many lines of irrelevant code related to entirely different types of notifications.

// BEFORE FIX
export const publish = async (notification) => {
   const {userIDs, targetEndpoint} = notification
   if (userIDs) {
       publishToUsers(notification)
   } else if (targetEndpoint) {
       publishToSubscriptionEndpoint(notification)
   } else {
       publishToGenericEndpoint(notification)
   }
}

To solve for this complexity, notifications were filtered before processing. What this meant is that a set of notifications will all be processed the same way. If no notification matches the criteria for a specific processing type, nothing happens for that notification processing iteration. This, in turn, simplifies processing to have one strategy per processing system instead of having one lambda with multiple processing strategies. This makes debugging faster because a developer only has to deal with a single system at a time. Pinpointing where the problem exists is as simple as checking which system fails. In addition, adding and removing notification types and features becomes easier and more trivial with this strategy, as it involved adding and deleting independent notification processors instead of adding and removing conditional statements from a sprawling if-else logic tree.

// AFTER FIX
export const publishToUsers = async (notifications, users) => {
   const isUserNotification = (notification) => !!notification.userIDs
   const userNotifications = notifications.filter(isUserNotification)
   for (const notification of userNotifications) {
     const hasUserID = (user) = notification.userIDs.includes(user.id)     
     const targetUsers = users.filter(hasUserID)
     for (const targetUser of targetUsers) {
         const endpoint = targetUser.notificationEndpoint
         await publishNotification(notification, endpoint)
     }
   }
}

The initial process of publishing notifications was also far too complex for what it was trying to do. Instead of directly calling a function to publish the notification, a lambda was called. Invoking a new lambda causes unneeded complexity by making an entirely new execution environment. This can make debugging a very complicated endeavor, especially if a new developer joins the team and is assigned to fix a notification issue. There are a few hacky solutions for tracking a notification from one lambda to another invocation, such as mocking the environment, but it is out of the scope of this article and such a solution would make things quite complex.

// BEFORE FIX
const publishNotification = (notification, endpoint) => {
   const Message = buildNotificationMessage(notification)
   const MessageStructure = 'json'
   const TargetArn = getNotificationEndpointARN(notification)
   const payload = { Message, MessageStructure, TargetArn: endpoint }
   await notificationService.invokePublisherLambda(payload) // Hard to debug
}

Instead of making a call to another lambda, we updated the code to make the call to another module within the same execution of code. This change made reasoning about the codebase much easier. If a notification failed to publish, the issue could be found by checking the `publishNotification` function, instead of tracing the publisher *and* another publishing lambda.

// AFTER FIX
const publishNotification = (notification) => {
   const Message = buildNotificationMessage(notification)
   const MessageStructure = 'json'
   const TargetArn = getNotificationEndpointARN(notification)
   const payload = { Message, MessageStructure, TargetArn }
   await notificationService.publish(payload) 
}

Notification Processing Should Continue When There Is No Fatal Error

Once processing notifications became a simpler process, the next goal was to ensure errors created predictable results and still allowed for as many notifications to send as possible. A few steps that allowed for this were that common functions handled errors and all notifications were processed to completion, even if a non-fatal error occurred.

Common Error Handling

Before this addition, errors with notifications failed silently and gave very little explicit information. At best, a notification error might show up in the logs as a status code error with no meaning attached to it. A quick look into the SNS Documentation from Amazon shows a very distinct set of possible errors and status codes, so we created an error handler to account for all of those issues predictably as well as log out informative data.

We felt the benefits of this approach immediately. The system handled errors predictably and the logs for them were rich in information. This allowed us to investigate easily when issues did crop up.

// SOLUTION
const handleError = (error, service, resource = 'Not Provided') => {
   const message = errorMessage(service, error, resource)
   console.error(message) // Log all errors

   switch (error.code) {
       case 'EndpointDisabled':
       case 'InvalidParameter':
       case 'ParameterValueInvalid':
       case 'PlatformApplicationDisabled':
           return errorTemplate(400, message)
       case 'AuthorizationError':
           return errorTemplate(401, message)
       case 'SubscriptionLimitExceeded':
           return errorTemplate(403, message)
       case 'NotFound':
           return errorTemplate(404, message)
       default:
           return errorTemplate(500, message)
   }
}

const errorTemplate = (errorCode, message) => {
   const error = new Error(message)
   error.code = errorCode
   error.statusCode = errorCode
   return error
}

const errorMessage = (service, error, resource) => {
   return `Service: ${service} -- ` + 
          `Error: ${error.code} -- ` + 
          `Message: ${error.message} -- ` + 
          `Resource: ${resource}`
}
Graceful Error Handling

As mentioned before, the old notification system wasn’t catching errors effectively. To mitigate this, we added a new system that allowed for the errors to be handled gracefully and fail without stopping the processing of the remaining notifications. To prevent errors from stopping the process entirely, we inserted a try catch block. This allowed for errors to still be logged, but didn’t stop the notification execution. This was useful, as oftentimes notifications can fail for a number of reasons that should not stop remaining notifications from sending. What is more, sending all notifications in this way allows for a better sense of what errors are happening consistently.

// SOLUTION
export const publishToGeneralEndpoint = (notifications) => {
   for (const notification of notifications) {
       try {
           const endpoint = getGeneralEndpoint(notification)
           await publishNotification(notification, endpoint) 
           console.log("Success! Notification sent to the platforms")
       } catch (error) {
           handleError(error) // no action besides logging
       }
   }
}

Notifications Should Be Testable, Debuggable, and Self-Documenting

As I alluded to earlier, we added predictable error handling and ample logging to the notifications as they were processed. One benefit of this was that notifications became much more debuggable and testable. When an error occurred, test engineers knew about it and developers could easily troubleshoot notification errors and check success rates.

Error Visibility

This form of logging does a few things. It allows developers to know which notifications succeeded by id and then do a search on them. It also gives a verbose set of information regarding errors so that developers know exactly where to go when an issue arises. Whenever a test engineer notices as suspicious failure in notifications sending, this cron job will usually have some great initial information so that they can decide whether or not the issue needs developer work or not.

… Begin Cron Job … 

 … 15 min wait … 

…  calling processSubscriptionNotifications … 
 -- Service: Publish 
 -- Error: 403
 -- Message: User Unauthorized to Publish to this SNS Endpoint 
 -- Resource: PublishPlatformNotification

 -- Notification Sent: d7f713c8-107a 
 -- Endpoint: arn:aws:sns:us-east-1:2345:ios-push-notifications

 -- Notification Sent: d7f713c8-107a 
 -- Endpoint: arn:aws:sns:us-east-1:2345:ios-push-notifications

… calling processUserNotifications … 

 -- Service: Publish 
 -- Error: 403
 -- Message: User Unauthorized to Publish to this SNS Endpoint 
 -- Resource: PublishUserNotification

 -- Notification Sent: 52f8862d-bc96 
 -- Endpoint: arn:aws:sns:us-east-1:35345:ios-push-notifications

 -- Notification Sent: 52f8862d-bc96
 -- Endpoint: arn:aws:sns:us-east-1:23434:ios-push-notifications

 … 15 min wait … 
Error Traceability

One notification improvement that is very AWS-centric is the value of capturing notification metrics. For each notification, there are metrics you can hook up to Cloudwatch which will log out specific errors from the notification provider instead of just the AWS-specific error types. One example is getting success and failure logs straight from the notification service provider. These logs are great for deeper dives into what went wrong with specific notifications when they are sent. Information such as the provider response, endpoint arn, and message id allow for quick investigation resolution when all other debugging investigations fail. Usually these logs are more supplemental than anything, but more evidence is always a benefit when debugging.

{
   "notification":{
  	"messageMD5Sum":"c9bf5424f85d-4d4ba3c066119bd1490d",
  	"messageId":"44e969f5-ece9-48ca-9cd2-208ed11b398d",
  	"topicArn":"arn:aws:sns:us-east-1:23634563:ios-push-notifications",
  	"timestamp":"2019-02-20 16:54:29.695"
   },
   "delivery":{
  	"deliveryId":"1fa29adb-f741-57c0-a8c1-ce3c7fc70b5a",
      "destination":"arn:aws:sns:us-east-1:3463453:endpoint/APNS/
      ios-general-enpoint/f03ab462-4f2e-49asf03ab462-4f2e-49as",
  	"providerResponse":"{\"reason\":\"BadDeviceToken\"}",
  	"dwellTimeMs":66,
  	"attempts":1,
  	"token":"03298f1f580d4017ac8b7314047070b8",
  	"statusCode":400
   },
   "status":"FAILURE"
}

Key Takeaways

Processing notifications should be easy. When things get to complex or hard to debug, thinking about making simple interfaces with minimal redundancy is usually a good move. For me, simplifying this notification system allowed for quicker developer work when updating features as well as debugging issues.

Join our team to work with Fortune 500 companies in solving real-world product strategy, design, and technical problems.

Find Your Role