ECS publishes a stream of events to CloudWatch Events, and within CloudWatch we’re filtering for the “unable to consistently start tasks successfully” events. In particular, we’ve got an event pattern that looks for the SERVICE_TASK_START_IMPAIRED event name.
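To make that concrete, here is a rough sketch of what such an event pattern could look like, written as a Python dict. The field names (`source`, `detail-type`, `detail.eventName`) follow the documented shape of ECS service action events, but this is an illustration rather than a copy of the pattern in our repo. The `matches` helper is hypothetical: CloudWatch does the matching for you, and this function only mimics a small subset of its rules (exact-value matching on nested keys) so you can see which events get through.

```python
# Illustrative pattern: match ECS service events named SERVICE_TASK_START_IMPAIRED.
EVENT_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Service Action"],
    "detail": {"eventName": ["SERVICE_TASK_START_IMPAIRED"]},
}


def matches(pattern, event):
    """Simplified stand-in for event pattern matching (exact values only).

    A list in the pattern means "the event's value must be one of these";
    a dict means "recurse into the nested object".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# A trimmed-down example event with made-up account and resource names.
impaired_event = {
    "source": "aws.ecs",
    "detail-type": "ECS Service Action",
    "resources": ["arn:aws:ecs:eu-west-1:111122223333:service/example-cluster/example-service"],
    "detail": {
        "eventType": "WARN",
        "eventName": "SERVICE_TASK_START_IMPAIRED",
        "clusterArn": "arn:aws:ecs:eu-west-1:111122223333:cluster/example-cluster",
    },
}

# Same envelope, but a routine event that should be ignored.
steady_event = {
    **impaired_event,
    "detail": {**impaired_event["detail"], "eventName": "SERVICE_STEADY_STATE"},
}

print(matches(EVENT_PATTERN, impaired_event))
print(matches(EVENT_PATTERN, steady_event))
```

Filtering on the event name rather than on the broader WARN event type keeps the alert narrow: only the "can't start tasks" case reaches Slack.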
Any matching events trigger a Lambda function, which extracts the key information from the event, gets a Slack webhook URL from Secrets Manager, then posts a message to Slack. This is a standard pattern for alerting Lambdas that we’ve used multiple times.
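For a flavour of what that Lambda does, here is a minimal sketch in Python. It is not the code from our repo: the function names, the message wording, and the secret name `example/slack-webhook-url` are all placeholders, and it assumes the secret stores the webhook URL as a plain string. The boto3 call (`get_secret_value`) and the Slack incoming-webhook payload (a JSON body with a `text` field) are the standard ones.

```python
import json
import urllib.request


def build_message(event):
    """Extract the cluster, service and event name from an ECS service event."""
    detail = event["detail"]
    # ARNs end in .../<name>, so the last path segment is the readable name.
    cluster = detail["clusterArn"].split("/")[-1]
    # The service ARN is the first entry in the top-level "resources" list.
    service = event["resources"][0].split("/")[-1]
    return {
        "text": (
            f":warning: {detail['eventName']}: service {service} in cluster "
            f"{cluster} is unable to consistently start tasks successfully"
        )
    }


def get_webhook_url(secret_id):
    """Fetch the Slack webhook URL from Secrets Manager."""
    # Imported here so the rest of the module can be exercised without boto3;
    # the Lambda Python runtime ships with it.
    import boto3

    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]


def post_to_slack(webhook_url, message):
    """POST a JSON payload to a Slack incoming webhook."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def lambda_handler(event, context):
    # Hypothetical secret name; substitute whatever your setup uses.
    webhook_url = get_webhook_url("example/slack-webhook-url")
    post_to_slack(webhook_url, build_message(event))
```

Keeping `build_message` free of any AWS or network calls means the formatting logic can be checked locally against a saved event, without touching Secrets Manager or Slack.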
If you’re interested, all the code for this setup is publicly available under an MIT licence. Both the Terraform definitions and Lambda source code are in our platform-infrastructure repo.
And now if you’ll excuse me, there’s an ECS task that needs my attention…