Lessons Learnt from Maintaining Serverless Applications
Thanks for coming along to my talk today. I hope you were able to take something away from how we have embraced the next stage of our serverless journey here at Instil.
This talk was based around lessons we have learned while maintaining Stroll, a serverless-first car insurance platform that has now been live in production for several years. Previously at NIDC, Matthew Wilson shared this platform's development story, so if you are just starting out with serverless, I would recommend you go and watch that. These insights have also been shared on the Instil Blog (Part 1, Part 2, Part 3), so you can get them over there if written form is more your thing.
Here's a quick recap of the key lessons we've learnt (just in case you weren't paying attention…):
1. The Benefits of Serverless Extend to Maintenance Mode
With serverless we are relieved of the burden of managing servers, especially when it comes to scaling or patching for new vulnerabilities such as CVE-2024-3094, which we mentioned. Instead, the cloud provider (in our case AWS) will patch the runtime (for Lambda) and we will automatically use the new version as soon as it becomes available.
However, this doesn't mean there is nothing left to do once a Lambda is deployed: we still have responsibilities around runtime versions.
2. Plan For Runtime Updates
While we don't need to worry about patching the servers themselves, we do need to be aware of the runtime versions we are using and ensure we have adequate time in our application timeline to address any necessary updates.
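One lightweight way to stay on top of this is a check, run in CI, that flags any runtime nearing its deprecation date. The sketch below assumes a hand-maintained map of dates; the dates shown are illustrative placeholders, not AWS's published schedule:

```typescript
// Sketch: flag runtimes approaching deprecation so updates can be planned
// into the application timeline. Dates here are illustrative, not official.
const deprecationDates: Record<string, string> = {
  "nodejs16.x": "2024-06-12",
  "nodejs18.x": "2025-09-01",
};

function runtimesNeedingAttention(
  inUse: string[],
  today: Date,
  warningDays = 180
): string[] {
  const msPerDay = 24 * 60 * 60 * 1000;
  return inUse.filter((runtime) => {
    const deprecation = deprecationDates[runtime];
    if (!deprecation) return false; // no date on record for this runtime
    const daysLeft =
      (new Date(deprecation).getTime() - today.getTime()) / msPerDay;
    return daysLeft < warningDays;
  });
}

// e.g. run against the runtimes your stacks declare:
// runtimesNeedingAttention(["nodejs16.x", "nodejs18.x"], new Date("2024-03-01"))
```

Failing the build when this list is non-empty turns a surprise deprecation email into a planned piece of work.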

AWS Lambda Runtime Deprecation Policy
AWS deprecates Lambda runtimes when support for the upstream language version ends.
This can be quite burdensome if you are using the AWS SDK version provided in the runtime, as you will have to perform the SDK update in parallel. These updates can include breaking changes, as we discussed; however, AWS released a tool during the v2 to v3 migration to help with them.
There is a range of views on whether or not you should include the AWS SDK package in your deployment. Even if you are happy to bundle the AWS SDK, it's still important to keep your SDK version in line with the SDK maintenance policy.

AWS SDK Maintenance Policy
Similarly to the runtime policy, the AWS SDKs have a maintenance policy that must be considered in your application's timeline.
3. Consider Service Quotas
AWS sets limits on the maximum values for the resources, actions, and items in the account your application is hosted in. As we saw with the Step Function example on Stroll, it's vital that we consider these when developing a serverless application, not just with current traffic in mind but also ensuring the application can handle any future traffic spikes, as these limits are often much smaller than expected.
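When a quota is hit, calls are typically rejected with a throttling error rather than failing outright, so it's worth handling that case gracefully. Below is a minimal sketch of retrying with exponential backoff; `ThrottledError` is a stand-in for whatever throttling exception the real SDK call raises:

```typescript
// Sketch: retry a throttled call with exponential backoff instead of
// failing the request the moment a service quota is exceeded.
class ThrottledError extends Error {}

async function withBackoff<T>(
  call: () => Promise<T>,
  maxAttempts = 5
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Rethrow anything that isn't throttling, or once attempts run out.
      if (!(err instanceof ThrottledError) || attempt + 1 >= maxAttempts) {
        throw err;
      }
      const delayMs = 100 * 2 ** attempt; // 100ms, 200ms, 400ms, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Backoff buys headroom for transient spikes, but it is no substitute for sizing the quota itself: if steady-state traffic sits near a limit, request an increase.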

AWS Service Quotas
AWS provides documentation on the limits it sets on resources, actions, and items provided through its services.
4. Ease Of Observability
When maintaining serverless applications, the ability to quickly debug issues as they arise is vital. For this we must ensure there is appropriate observability in place. As we saw, serverless makes this slightly more complicated because of the distributed nature of the system. However, there are some services AWS provides that we can use to make this easier:
Step Functions
Step Functions is a visual workflow orchestration service provided by AWS that makes it really easy to see what's happening within the orchestration of events in a workflow, compared to trying to manage orchestration within a Lambda. When something goes wrong, the failing step is clearly marked in an error state, with details of the error that was thrown. When the issue is fixed, you can use the console either to rerun the entire workflow or, as in our example, to use redrive so that the steps that had already succeeded aren't run again.
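The difference between a full rerun and a redrive can be sketched in a few lines. The step names and the `succeeded` set below are hypothetical; in reality Step Functions tracks this state for you:

```typescript
// Sketch: redrive reruns only the steps that have not yet succeeded,
// instead of restarting the whole workflow from the top.
type Step = { name: string; run: () => void };

function redrive(steps: Step[], succeeded: Set<string>): string[] {
  const executed: string[] = [];
  for (const step of steps) {
    if (succeeded.has(step.name)) continue; // already completed: skip it
    step.run();
    succeeded.add(step.name);
    executed.push(step.name);
  }
  return executed;
}

// Hypothetical workflow where the first two steps succeeded before a failure.
const steps: Step[] = ["validateQuote", "createPolicy", "sendEmail"].map(
  (name) => ({ name, run: () => {} })
);
const rerun = redrive(steps, new Set(["validateQuote", "createPolicy"]));
```

This matters when earlier steps have side effects, such as creating a policy or charging a customer, that must not happen twice.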
X-Ray
AWS X-Ray provides a similar level of observability for your application as Step Functions does for your orchestration, producing a trace map of the services used to fulfil a request and consolidating all logs written to CloudWatch using the associated trace ID. These trace maps can also be used to identify latency within the application, which we can then work to resolve.
5. Deploy Efficient Lambdas
Often the latency we identify with X-Ray is due to the cold start that occurs when we trigger a Lambda. However, there are a few things you can do to fight back against it:
1. Consider an alternative Runtime
In scenarios where latency is critical, there are alternative runtimes such as the AWS Low Latency Runtime (LLRT), which is much more lightweight and can reduce the cold start as a result. Another option is Lambda SnapStart, which instead carries out the init phase during deployment and caches the environment for reuse. Initially SnapStart was available only for Java, but support for Python and .NET runtimes has recently been added.
2. Reduce the bundled dependencies in the package
This could mean importing only the packages you actually use, or utilising the built-in version of the AWS SDK in the runtime, as mentioned earlier, instead of bundling your own.
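A related habit: do any expensive setup once at module scope rather than inside the handler, so warm invocations skip it. This is a sketch of the pattern, with `FakeClient` standing in for a real, expensive-to-build SDK client:

```typescript
// Sketch: code at module scope runs once per execution environment (during
// the init phase); code inside the handler runs on every invocation.
let constructions = 0;

class FakeClient {
  constructor() {
    constructions++; // pretend this is slow connection/setup work
  }
  query(): string {
    return "data";
  }
}

const client = new FakeClient(); // init phase: built once, reused while warm

const handler = async (): Promise<string> => client.query();
```

Moving setup like this into the init phase also means it is exactly the work SnapStart can snapshot and cache.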
3. Select the correct memory configuration
Sometimes the best thing to do to ensure your Lambda runs efficiently is to increase its memory allocation. This reduces latency, as the Lambda will respond faster, and it can also be efficient in terms of cost, because the shorter execution time offsets the higher per-millisecond price.
Working out the best configuration can be complex, but the AWS Lambda Power Tuning tool helps by testing your Lambda with several memory configurations and presenting a graph, allowing you to determine the best configuration to deploy with.
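The trade-off the tool explores follows from Lambda's pricing model: cost per invocation is roughly allocated memory (in GB) × duration (in seconds) × a per-GB-second price. The durations below are made-up numbers, used only to show how more memory can come out cheaper:

```typescript
// Sketch of the comparison Power Tuning performs across memory settings.
// The price is approximate and the measured durations are illustrative.
const pricePerGbSecond = 0.0000166667;

type Measurement = { memoryMb: number; avgDurationMs: number };

function costPerInvocation({ memoryMb, avgDurationMs }: Measurement): number {
  return (memoryMb / 1024) * (avgDurationMs / 1000) * pricePerGbSecond;
}

function cheapest(measurements: Measurement[]): Measurement {
  return measurements.reduce((best, m) =>
    costPerInvocation(m) < costPerInvocation(best) ? m : best
  );
}

const results: Measurement[] = [
  { memoryMb: 128, avgDurationMs: 2400 },
  { memoryMb: 512, avgDurationMs: 500 }, // 4x the memory but ~5x faster
  { memoryMb: 1024, avgDurationMs: 300 },
];
```

Here the 512 MB configuration wins on cost despite the larger allocation, which is exactly the kind of counter-intuitive result the graph makes obvious.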
6. Alerting For Cost
As we discussed, the cost of serverless can be difficult to predict and can be a shock when the bill comes in. To help with this, we can use services provided by AWS to give us observability at two levels:
Resource Level Observability
We use this kind of observability to ensure that the average duration of a Lambda doesn't exceed an expected threshold, set by us as developers, because when it does we'll want to consider a different memory configuration, as we saw above. We use a CloudWatch alarm for this, which triggers when the average duration of the Lambda exceeds the threshold we have set, notifying the team over email or over Slack through AWS Chatbot.
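Conceptually, the alarm is just this comparison, evaluated by CloudWatch over a window of datapoints; the threshold and sample durations below are illustrative:

```typescript
// Sketch of the check a CloudWatch alarm performs for us: compare the
// average duration over a window against a team-chosen threshold.
function averageDurationExceeds(
  durationsMs: number[],
  thresholdMs: number
): boolean {
  if (durationsMs.length === 0) return false; // no data, nothing to alarm on
  const avg = durationsMs.reduce((a, b) => a + b, 0) / durationsMs.length;
  return avg > thresholdMs;
}
```

Keeping the check on the average rather than individual invocations stops one slow cold start from paging the team.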
Account Level Observability
Account-level monitoring becomes more important when we consider the concept of individual developer accounts, as we can set a zero-spend budget using AWS Budgets so that, as soon as the infrastructure begins to cost money, that developer along with an admin group is notified. Beyond the zero-spend budget, we can add a tiered budget so that the developer is notified at each stage as the budget is exceeded.
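The tiering logic amounts to the following; the dollar thresholds are hypothetical, and in practice AWS Budgets evaluates them for you:

```typescript
// Sketch: given current spend and ascending budget tiers, return the
// tiers that have been crossed and should trigger a notification.
function tiersExceeded(spendUsd: number, tiersUsd: number[]): number[] {
  return tiersUsd.filter((tier) => spendUsd > tier);
}

// A zero-spend budget is just a tier at 0: any cost at all crosses it.
const crossed = tiersExceeded(12, [0, 5, 10, 50]);
// crossed -> [0, 5, 10]
```

Treating the zero-spend budget as the first tier keeps the whole scheme in one place: the developer hears about the first cent and about each escalation after it.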