Recently we faced an issue while reading messages from SQS in the AWS cloud: we were processing the same message multiple times. We identified the issue using the message identifier (mid) of each message, which we store in a column of an AWS Redshift table.
How our system works:
- Post messages to AWS SQS using one task
- Read a batch of messages from SQS and start processing the JSON messages
- Validate each message in the batch and put it in an AWS S3 bucket
- Load the data into the AWS Redshift database
- Delete the batch of messages from SQS after successfully processing them
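The read, process, and delete steps above can be sketched roughly as follows. This is a minimal sketch, not our actual task code: it assumes `sqs` is a boto3 SQS client (for example `boto3.client("sqs")`), and `process_batch` and `handle_message` are hypothetical names, with `handle_message` standing in for the validate, S3, and Redshift steps.

```python
import json


def process_batch(sqs, queue_url, handle_message, max_messages=10):
    """Receive one batch, process each message, then delete the batch.

    `sqs` is assumed to be a boto3 SQS client.
    `handle_message` is a hypothetical callback for validate -> S3 -> Redshift.
    Returns the IDs of the messages that SQS confirmed as deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows at most 10 per call
        WaitTimeSeconds=5,                 # long polling
    )
    # The "Messages" key is absent when nothing was received.
    messages = response.get("Messages", [])
    if not messages:
        return []

    to_delete = []
    for msg in messages:
        payload = json.loads(msg["Body"])
        handle_message(payload)
        to_delete.append(
            {"Id": msg["MessageId"], "ReceiptHandle": msg["ReceiptHandle"]}
        )

    # Everything between receive_message and this call has to finish
    # before the visibility timeout of the queue runs out.
    result = sqs.delete_message_batch(QueueUrl=queue_url, Entries=to_delete)
    return [entry["Id"] for entry in result.get("Successful", [])]
```

Note that the delete happens only at the end of the batch, so the time a message stays undeleted is the processing time of the whole batch, not of a single message.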
As mentioned earlier, some messages were being processed multiple times, and this was happening in only one environment, not in all of them. It took almost three days to find the root cause (RCA), which is described below. There are two possible reasons for this to occur:
- Reading a message from SQS and not deleting it. If this were the case, messages would never be deleted from SQS and every message would be processed multiple times. That was not happening: messages were being deleted, but some of them were processed multiple times.
- Multiple tasks reading the same message for processing. This is what was happening in our scenario.
Task1 reads a batch of messages from SQS, and before it deletes all of these messages, some other task picks up some of the messages that Task1 has already read. This is due to the visibility timeout of messages in SQS. Let's see how this happens.
When a task reads a batch of messages from SQS, these messages move to in-flight mode and are not visible to other tasks, because they are being processed. SQS has a property called visibility timeout: messages in in-flight mode stay invisible to other tasks until this timeout expires.
Before this timeout expires, we need to complete message validation, loading to S3, storing to Redshift, and deletion. In our case, some messages were not completing this process (or were not deleted) within the visibility timeout (10 seconds in this case), so they became visible again once the visibility timeout expired.
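To make the timing concrete, here is a toy in-memory model of the visibility timeout. It is only an illustration and not how SQS is implemented: `ToyQueue` and `count_deliveries` are hypothetical names, and the 15-second processing time is an assumed figure chosen to fall between the two timeouts.

```python
class ToyQueue:
    """In-memory model of a visibility timeout (not the real service)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.invisible_until = {}  # message id -> time it becomes visible again

    def send(self, mid):
        self.invisible_until[mid] = 0  # visible immediately

    def receive(self, now):
        """Return all visible messages and hide them for the timeout."""
        visible = [m for m, t in self.invisible_until.items() if t <= now]
        for m in visible:
            self.invisible_until[m] = now + self.visibility_timeout
        return visible

    def delete(self, mid):
        self.invisible_until.pop(mid, None)


def count_deliveries(visibility_timeout, processing_seconds):
    """Task 1 receives at t=0 and deletes at t=processing_seconds.

    A second task polls once per second while task 1 is still working.
    Returns how many times the single message was handed out.
    """
    queue = ToyQueue(visibility_timeout)
    queue.send("mid-1")
    deliveries = len(queue.receive(now=0))       # task 1 picks it up
    for now in range(1, processing_seconds):     # task 2 polls meanwhile
        deliveries += len(queue.receive(now=now))
    queue.delete("mid-1")                         # task 1 finally deletes
    return deliveries


print(count_deliveries(visibility_timeout=10, processing_seconds=15))  # 2
print(count_deliveries(visibility_timeout=20, processing_seconds=15))  # 1
```

With a 10-second timeout the message is handed out a second time at t=10, while with a 20-second timeout it is delivered only once.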
As mentioned, this was happening in only one environment because only that environment had a visibility timeout of 10 seconds; all other environments had 20 seconds. Our task was not completing within 10 seconds, which caused the duplicate message processing. To solve the issue, we simply increased the visibility timeout to 20 seconds.
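A check like the following could be used to compare the setting across environments and raise it where needed. This is a sketch under assumptions: `sqs` is taken to be a boto3 SQS client, and `get_visibility_timeout` and `ensure_visibility_timeout` are hypothetical helper names. The post does not say how we actually changed the value; it can also be changed from the AWS console.

```python
def get_visibility_timeout(sqs, queue_url):
    """Read the queue-level visibility timeout, in seconds."""
    response = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["VisibilityTimeout"],
    )
    # SQS returns attribute values as strings.
    return int(response["Attributes"]["VisibilityTimeout"])


def ensure_visibility_timeout(sqs, queue_url, minimum_seconds=20):
    """Raise the timeout to `minimum_seconds` if it is currently lower.

    Returns the timeout in effect after the call.
    """
    current = get_visibility_timeout(sqs, queue_url)
    if current < minimum_seconds:
        sqs.set_queue_attributes(
            QueueUrl=queue_url,
            # Attribute values must be passed as strings.
            Attributes={"VisibilityTimeout": str(minimum_seconds)},
        )
        return minimum_seconds
    return current
```

Calling `ensure_visibility_timeout(boto3.client("sqs"), queue_url)` for each environment's queue would have flagged the one queue that was still at 10 seconds.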
I hope this helps.
Happy Reading!!!