Our software uses sequence-based recovery, just like TCP. Each write to the database records the sequence number, and when we start recovery we ask the other end of the connection for the next sequence number that wasn't written to the database.
However our system processes messages in several stages between several different applications. The first stage can process messages far faster than the part that writes to the database.
If there are lots of messages cued up on the other end, the database part can get overwhelmed. So if we go down for a few minutes at a busy time, we can never catch up. We try to read in too many messages, the database writes a few then crashes, and then we start again.
The standard solution to this problem is back pressure. If the database part refuses to accept too many messages, then the reader part has to wait for the database to be ready before sending more. This means that we only request messages from the other end when we have space in the queues.
Also the database part get slower if we send it too many messages at once. Its memory fills up, the processor and eventually disk caches get less efficient, and after a while it runs out of memory completely.
So back pressure actually makes the system considerably faster.
And also means it doesn't crash under load.
No comments:
Post a Comment