https://segment.com/blog/introducing-centrifuge/
https://news.ycombinator.com/item?id=17134911
As it turns out, 'scheduling' that many requests in a faulty environment is a complex problem. You have to think hard about fairness (what data should you prioritize?), buffering semantics (how should you enqueue data?), and retry behavior (does retrying now add unwanted load to the system?).
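To make the retry concern concrete: a common mitigation is capped exponential backoff with full jitter, so a burst of failures doesn't re-hit a struggling destination in lockstep. A minimal sketch in Go (illustrative only, not Centrifuge's actual policy):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoff returns a capped, jittered exponential delay for the given attempt.
// Full jitter (a random delay in [0, cap)) spreads retries out so thousands of
// failed messages don't all retry at the same instant and add load in a spike.
func backoff(attempt int) time.Duration {
	base := 100 * time.Millisecond
	max := 30 * time.Second

	d := base << uint(attempt) // 100ms, 200ms, 400ms, ...
	if d <= 0 || d > max {     // d <= 0 guards against shift overflow
		d = max
	}
	return time.Duration(rand.Int63n(int64(d)))
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("attempt %d: waiting %v before retry\n", attempt, backoff(attempt))
	}
}
```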
Across all of the literature, we couldn't find a lot of good 'prior art' for delivering messages reliably in high-failure environments. The closest thing is network scheduling and routing, but that discipline has very different strategies concerning buffer allocation (very small) and backpressure strategies (adaptive, and usually routing to a single place).
The problem with using any sort of queue is that you are fundamentally limited in terms of how you access data. After all, a queue only supports two operations (push and pop).
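A minimal sketch of why that interface is limiting (the types here are illustrative): the head of the queue dictates what you process next, so there is no operation to reach one customer's messages past another's.

```go
package main

import "fmt"

// Queue exposes only push and pop: you take whatever is at the head,
// so you cannot skip past one customer's stuck messages to reach another's.
type Queue[T any] struct {
	items []T
}

func (q *Queue[T]) Push(v T) { q.items = append(q.items, v) }

func (q *Queue[T]) Pop() (T, bool) {
	var zero T
	if len(q.items) == 0 {
		return zero, false
	}
	v := q.items[0]
	q.items = q.items[1:]
	return v, true
}

func main() {
	var q Queue[string]
	q.Push("customer-A: event 1") // destination failing
	q.Push("customer-B: event 1") // destination healthy
	// Pop returns A's message first; there is no way to fetch
	// B's message without first consuming A's.
	fmt.Println(q.Pop())
}
```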
We can detect that we are nearing a rate-limit after sending 1,000 messages for customer A in the first second, so we can then copy the next 49,000 messages for customer A into a dead-letter queue, and allow the traffic for B and C to proceed.
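A rough sketch of that diversion, assuming a simple per-customer counter stands in for real rate-limit detection (the names and threshold are illustrative, not Centrifuge's implementation):

```go
package main

import "fmt"

// perSecondLimit is a hypothetical per-customer allowance for one window.
const perSecondLimit = 1000

// route delivers messages until the customer's allowance is exhausted, then
// diverts the remainder to a dead-letter queue instead of blocking the pipe.
func route(messages []string, customer string, sent map[string]int,
	deliver func(string), deadLetter func(string)) {
	for _, m := range messages {
		if sent[customer] >= perSecondLimit {
			deadLetter(m) // retried later, out of the hot path
			continue
		}
		sent[customer]++
		deliver(m)
	}
}

func main() {
	sent := map[string]int{}
	deliver := func(string) {}
	var dlq []string
	deadLetter := func(m string) { dlq = append(dlq, m) }

	// 50,000 messages arrive for customer A within one second.
	msgs := make([]string, 50000)
	route(msgs, "A", sent, deliver, deadLetter)
	fmt.Printf("delivered %d, dead-lettered %d\n", sent["A"], len(dlq))
	// delivered 1000, dead-lettered 49000
}
```

Diverting the overflow keeps the hot path moving for customers B and C; the dead-lettered messages can be re-driven once customer A's rate-limit window resets.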
Traditionally there are two ways to handle a large set of bad messages. The first is to stop your consumers and retry the same set of messages after a backoff period. This is clearly unacceptable in a multi-tenant architecture, where valid messages should still be delivered.
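The head-of-line blocking that makes stop-and-retry unacceptable shows up in a few lines (a hypothetical consumer loop, not Segment's code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type msg struct{ customer, body string }

// deliver simulates one failing tenant among healthy ones.
func deliver(m msg) error {
	if m.customer == "A" {
		return errors.New("endpoint down")
	}
	return nil
}

func main() {
	queue := []msg{
		{"A", "event 1"}, // A's destination is failing
		{"B", "event 1"}, // B and C are healthy but must wait
		{"C", "event 1"},
	}
	for _, m := range queue {
		for attempt := 0; attempt < 3; attempt++ {
			if err := deliver(m); err == nil {
				fmt.Println("delivered for customer", m.customer)
				break
			}
			// Backing off here stalls the whole queue: B's and C's
			// valid messages sit behind A's failing one.
			time.Sleep(10 * time.Millisecond)
		}
	}
}
```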
The jobs table primary key is a KSUID, which means that our IDs are both k-sortable by timestamp and globally unique. This allows us to effectively kill two birds with one stone: we can query by a single job ID, as well as sort by the time the job was created, with a single index.
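Segment open-sourced its KSUID implementation (github.com/segmentio/ksuid); a short sketch of both properties using that package's New, Parse, and Time functions:

```go
package main

import (
	"fmt"
	"sort"
	"time"

	"github.com/segmentio/ksuid"
)

func main() {
	// Generate a few IDs in order. KSUIDs carry a timestamp prefix followed
	// by random payload, so their string forms sort in creation order
	// (k-sortable) while remaining globally unique.
	var ids []string
	for i := 0; i < 3; i++ {
		ids = append(ids, ksuid.New().String())
		time.Sleep(time.Second) // KSUID timestamps have 1-second resolution
	}

	sort.Strings(ids) // lexicographic order == chronological order
	for _, id := range ids {
		k, _ := ksuid.Parse(id)
		fmt.Println(id, "created at", k.Time())
	}
}
```

Because the string and byte encodings sort the same way as the embedded timestamps, one primary-key index serves both point lookups by job ID and time-ordered scans.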