We have a number of robots installed at various locations, servicing customers. All robots get their instructions from a central cloud database with customer data. Each robot has an SQS queue that delivers the commands it has to execute, and the robots broadcast events using SNS. Some Lambdas are triggered by these SNS messages and handle them.
Now we want better handling and a better overview of errors occurring on the robots, and in general better statistics.
What we need is:
- Get an alarm when an error happens that requires manual action to recover.
- An overview of which types of errors happen most often.
- Which errors happen before others (i.e. which error led to a recovery_error that needs manual maintenance)
- Overall stats of the performance for a given period:
- Number of successful sessions
- Failed sessions caused by user error
- Failed sessions caused by technical errors
- Errors where the robot cannot automatically recover and go back to initial position.
All messages have a type attribute, which can be status, warning, error or recovery_error, and a value attribute, which describes the type of status, error etc.
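To make the error overview and the "which error came before a recovery_error" questions concrete, here is a minimal sketch in plain Python. It assumes the messages for one robot are available as dictionaries in chronological order; the value strings (gripper_jam, sensor_timeout, homing_failed) are made up for illustration and are not taken from our system.

```python
from collections import Counter


def summarize(messages):
    """Count errors by value and record which error preceded each recovery_error.

    `messages` is assumed to be a chronologically ordered list of dicts
    with the "type" and "value" attributes described above.
    """
    error_counts = Counter()   # how often each error value occurs
    preceding = Counter()      # last error seen before each recovery_error
    last_error = None
    for msg in messages:
        kind, value = msg["type"], msg["value"]
        if kind == "error":
            error_counts[value] += 1
            last_error = value
        elif kind == "recovery_error":
            # Attribute the recovery_error to the most recent error, if any.
            preceding[last_error if last_error is not None else "(none)"] += 1
            last_error = None
    return error_counts, preceding


# Hypothetical sample data, for illustration only.
sample = [
    {"type": "status", "value": "session_started"},
    {"type": "error", "value": "gripper_jam"},
    {"type": "recovery_error", "value": "homing_failed"},
    {"type": "error", "value": "sensor_timeout"},
    {"type": "error", "value": "gripper_jam"},
    {"type": "recovery_error", "value": "homing_failed"},
]

error_counts, preceding = summarize(sample)
print(error_counts.most_common())
print(preceding.most_common())
```

This is the kind of query we would otherwise have to write and maintain ourselves, which is why an off-the-shelf tool that can do the same aggregation is attractive.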
My thought is to have a Lambda that is subscribed to all SNS messages and uploads them to another system, which would then collect everything and let us extract the data mentioned above.
Which AWS products would you recommend for this? I already looked a little at CloudWatch, but I'm not sure if it can cover our needs.
We have also considered just dumping all SNS messages into a database and running custom queries on the tables. But that sounds like a solution that could quickly require a lot of work on our side as our needs grow.
We'd prefer an off-the-shelf solution and would adjust our workflow to it.
Thanks in advance for any tips.
CloudWatch provides out-of-the-box time-based metrics, log ingestion, querying, and dashboards. It also provides alarms based on those metrics. In general it satisfies your requirements: collecting metrics from your devices, alarming when an error happens, and showing a stats dashboard for a given period. You can even use the CloudWatch agent or API to send the data directly from the devices.
Managed Elasticsearch with Kibana also provides strong data aggregation capabilities and a better dashboard user experience.
Another approach is to use the AWS IoT services, which probably fit your requirements better.
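As a rough illustration of the CloudWatch route, the sketch below shows an SNS-triggered Lambda handler that writes each robot message to its log in CloudWatch embedded metric format, so that CloudWatch extracts a count metric from the log line. It assumes the SNS message body is JSON carrying the type and value attributes from the question; the namespace Robots/Events and the metric name EventCount are hypothetical names chosen for this example.

```python
import json
import time

NAMESPACE = "Robots/Events"  # hypothetical namespace, pick your own


def to_emf(message, timestamp_ms):
    """Build an embedded-metric-format record for one robot message."""
    return {
        "_aws": {
            "Timestamp": timestamp_ms,  # milliseconds since the epoch
            "CloudWatchMetrics": [{
                "Namespace": NAMESPACE,
                # Two dimension sets: per type, and per type + value.
                "Dimensions": [["Type"], ["Type", "Value"]],
                "Metrics": [{"Name": "EventCount", "Unit": "Count"}],
            }],
        },
        "Type": message["type"],
        "Value": message["value"],
        "EventCount": 1,
    }


def handler(event, context):
    """Lambda entry point for SNS-triggered invocations."""
    now_ms = int(time.time() * 1000)
    for record in event["Records"]:
        # Assumption: the SNS message body is a JSON object.
        message = json.loads(record["Sns"]["Message"])
        # Lambda stdout is sent to CloudWatch Logs.
        print(json.dumps(to_emf(message, now_ms)))
```

With something like this in place, an alarm on EventCount with the dimension Type=recovery_error would be one way to cover the "manual action required" case, and the same log lines remain queryable with CloudWatch Logs Insights for the ad hoc statistics.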