We need a reliable way to distribute a variety of update events emitted from MediaWiki core (and other services) to various consumers. Currently we use the job queue for this (e.g. the Parsoid extension), but it is fairly complex, not very reliable, and does not support multiple consumers without setting up separate job types.
We are looking for a solution that decouples producers from consumers, and gives us better reliability than the current job queue.
Benefits
- Simplification: Avoid the need to write and maintain a separate MediaWiki extension for each event consumer. Reduce maintenance by focusing on one standard queuing solution.
- Single points of failure: A failure of the job queue's Redis instance or the EventLogging database currently causes update jobs to fail immediately and events / jobs to be lost. A robust event queue eliminates these single points of failure.
- Robust updates at scale: Updates like cache purges are currently propagated on a best-effort basis. If a node is down when the event is sent, there is no way to catch up. With more services and more aggressive caching we'll need more reliability at scale. Currently the only way to achieve this would be to create one job per consumer, which does not scale to many consumers.
- Performance and scalability: Job queue overload has in the past significantly increased edit latency. Both the job queue and EventLogging are hitting scalability limits.
- SOA integration: The job queue is a MediaWiki-specific solution that cannot be used by other services. The event queue should provide a clearly defined service interface, so that both MediaWiki and other services can produce and consume events using it.
Event type candidates
Moved to T116247: Define edit related events for change propagation.
Requirements for an implementation
- persistent: state does not disappear on power failure & can support large delays (order of days) for individual consumers
- no single point of failure
- supports pub/sub consumers with varying speed
- ideally, lets various producers enqueue new events (not just MW core); see the event envelope sketch after this list
- example use case: RESTBase scheduling dependent updates for content variants after the HTML was updated
- can run publicly: consumers may be anyone on the public Internet (think of a random MediaWiki installation with instant Commons or instant Wikidata) rather than only selected consumers with special permissions
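As a rough illustration of the kind of payload producers might enqueue, a minimal self-describing event envelope could look like the sketch below. The field names and topic naming scheme are assumptions for illustration only; the actual event schemas are being defined in T116247.

```
# Hypothetical event envelope (Python literal); field names and the topic
# naming scheme are illustrative only. Real schemas are defined in T116247.
example_event = {
    "meta": {
        "topic": "mediawiki.page_edit",       # assumed topic name
        "domain": "en.wikipedia.org",         # wiki the event originates from
        "dt": "2015-10-21T16:29:00Z",         # event time, ISO 8601
        "request_id": "0123456789abcdef",     # for tracing across services
    },
    "page_title": "Example_page",
    "rev_id": 123456,
}
```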
Option 1: Kafka
Kafka is a persistent and replicated queue with support for both pub/sub and job queue use cases. We already use it at high volume for request log queueing, so we have operational experience and a working puppetization. This makes it a promising candidate.
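To make the pub/sub and job queue semantics concrete, here is a minimal consumer sketch using the kafka-python client; the topic name, consumer group name and broker address are assumptions. Consumers sharing a group_id split the partitions between them (job queue semantics), while consumers with distinct group_ids each see the full stream at their own pace (pub/sub semantics). Kafka retains messages on disk, so a slow or temporarily offline consumer can resume from its last committed offset.

```
from kafka import KafkaConsumer   # pip install kafka-python
import json

# Consumers that share a group_id divide the partitions between them (job
# queue semantics); a consumer with a different group_id independently sees
# every event (pub/sub semantics). Topic, group and broker are assumptions.
consumer = KafkaConsumer(
    "mediawiki.page_edit",
    group_id="restbase-updater",
    bootstrap_servers=["kafka1001.example.org:9092"],
    enable_auto_commit=True,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # ... trigger the dependent update for this page / revision ...
    print(message.topic, message.partition, message.offset, event)
```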
Rough tasks for an implementation:
- Set up a Kafka instance
- Define events & relative order requirements
- Hook up a synchronous producer to the relevant MediaWiki hooks (see the producer sketch after this list)
- Figure out good producer & consumer interfaces
- Raw Kafka creates a fairly high barrier to entry
- API inspiration: Amazon, Google TaskQueue, Google PubSub, Azure
- Kafka REST proxy by confluent.io
- custom HTTP / websockets? (e.g. T88459: Implementing the reliable event bus using Kafka)
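A minimal sketch of the produce side follows. MediaWiki itself would do this from a PHP hook handler (or via an HTTP proxy such as the REST proxy mentioned above), so the Python code, topic name, broker address and payload fields are all assumptions used only to illustrate the shape of a synchronous produce call.

```
from kafka import KafkaProducer   # pip install kafka-python
import json

# Synchronous produce: block until the broker acknowledges the event, so the
# edit only completes once the change event is durably queued.
producer = KafkaProducer(
    bootstrap_servers=["kafka1001.example.org:9092"],  # assumed broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",                                        # wait for replication
)

def emit_page_edit(title, rev_id):
    # Hypothetical payload; real schemas are defined in T116247.
    event = {"page_title": title, "rev_id": rev_id}
    future = producer.send("mediawiki.page_edit", value=event)
    future.get(timeout=10)   # raise if the broker did not acknowledge

emit_page_edit("Example_page", 123456)
```

Whether producers should block like this, buffer locally, or go through an HTTP proxy is exactly the producer-interface question listed above.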
Open questions
- Where / how should we expand link table jobs? Perhaps with a consumer of the primary event that enqueues individual updates to another queue (see the fan-out sketch after this list)? Also see T102476: RFC: Requirements for change propagation
- How can we scale this down for third-party users?
- Can we build on the existing job queue fall-back?
- T110927: Considerations for supporting job queue use cases with the unified event bus
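For the link table question, one possible shape (sketched below under assumed topic and function names) is a dedicated consumer of the primary edit event that expands it into per-page updates and enqueues those into a second topic, keeping the primary event stream small.

```
from kafka import KafkaConsumer, KafkaProducer   # pip install kafka-python
import json

# Fan-out sketch: read primary edit events, expand them into per-page derived
# updates, and enqueue those into a second topic. Topic names, broker address
# and the lookup function are assumptions.
consumer = KafkaConsumer(
    "mediawiki.page_edit",
    group_id="link-update-expander",
    bootstrap_servers=["kafka1001.example.org:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=["kafka1001.example.org:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def pages_linking_to(title):
    # Placeholder: in reality this would query the link tables.
    return []

for message in consumer:
    edited = message.value["page_title"]
    for page in pages_linking_to(edited):
        producer.send("mediawiki.derived_page_update",
                      value={"page_title": page, "triggered_by": edited})
```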