Messaging Apps (WhatsApp/Messenger)
Requirements
Functional Requirements
- System should support 1:1 chat
- System should support group chat (max limit on how many users can be in a group? )
- Do we just support text or media as well?
- Is there a message size limit? ~ 100,000 characters
- System should keep track of status of users (online / offline)
- Support persistent Chat history (for how long?) ~ forever
- Multiple device support
- Push notification - notify users of new messages when they are offline (integrate with third party API)
Non Functional Requirements
- Low latency - real time chatting experience
-
System should be highly consistent, same chat history across all the devices
Should the service support web app, mobile app or both?
- Highly available, we can tolerate low availability in the interest of consistency
Capacity Estimation and Constraints
Daily Active users = 500 million
Average messages per user / per day = 40 messages
Total messages per day = 40 x 500 million = 20 billion messages / day
Storage: Avg message size = 100 bytes
Total messages size = 20 billion x 100 bytes = 2 TB / day
5 years chat history = 2 TB x 365 x 5 = ~3.6 PB
Bandwidth:
2 TB / day = 2 TB / 86400 sec ~= 25MB / second (Egress and Ingress)
High Level Design
- UserA connects to chat service and sends a message to UserB through the chat service
- The chat service receives the messages → sends acknowledgement to UserA → stores the message → sends message to UserB (or to the server that holds the connection to UserB [map of connections])
-
To start, client uses
HTTP
connection to connect with Chat service.keep-alive
header helps maintain the connection for long time and reduces number of TCP handshakes** Since
Http
is client initiated, server cannot initiate a connection to a client to send a message**
Message protocols
Polling
- client periodically asks server if there are new messages
- depending on the frequency of polling, it could be expensive - TCP connection handshake
- Also consumes server resources that could be used for other tasks as the answer is no most of the time
- Server would also have to store undelivered messages in the meantime.
Long Polling
- Client maintains connection with server until it has new messages or it times out or threshold has been reached
- Once the client receive new messages, it immediately sends another request to the server restarting the process
Cons:
- Sender and receiver may not connect to the same chat server. HTTP based servers are usually stateless. If you use round robin for load balancing, the server that receives the message might not have a long-polling connection with the client who receives the message.
- A server has no good way to tell if a client is disconnected.
- It is inefficient. If a user does not chat much, long polling still makes periodic connections after timeouts.
How can a server keep track of all the opened connections to redirect
message to the users ?
The server can maintain a hash table, where “key” would be the UserID and
“value” would be the connection object. So whenever the server receives
a message for a user, it looks up that user in the hash table to
find the connection object and sends the message on the open request.
Websockets
- Asynchronous updates from server to client
- Bidirectional and persistent connection
- Starts its life as a HTTP connection and could be “upgraded” via some well-defined handshake to a WebSocket connection
- Works well even if firewall is in place (bc it uses port 80 or 443 which are also used by HTTP/HTTPS connections)
- Can be used for both sending and receiving
How many chat servers do we need? Let’s plan for 500 million connections at any time
. Assuming a modern server can handle 50K concurrent
connections at any time, we would need 10K
such servers.
How do we know which server holds the connection to which user?
We can introduce a software load balancer
in front of our chat servers; that can map each UserID to a server to redirect the request
.
Delivered flow:
Server receives a message from user A → stores in a database (could happen asynchronously) → sends message to user B → sends acknowledgement to the sender
Discuss how we don’t need to store first to send acknowledgement to sender. We can pass the message along and store in the background to reduce latency.
Note: Chat applications mostly use WebSockets
Maintaining message sequencing:
- Keep sequence number with every message for each client.
Storing and Retrieving messages from the database
- How to efficiently store and retrieve messages from the dB?
Two options:
- Start a separate thread to store the message
- Send an async request to the database to store the message
We have to keep certain things in mind while designing our database:
- How to efficiently work with the database connection pool.
- How to retry failed requests.
- Where to log those requests that failed even after some retries.
- How to retry these logged requests (that failed after the retry) when all the issues have been resolved.
What type of database to use ? SQL vs NoSQL
- High number of updates / writes
- Same write : read ratio
- Need to fetch a range of records quickly
- While querying, a user is mostly interested in sequentially accessing the messages
We cannot use RDBMS like MySQL or NoSQL like MongoDB because we cannot afford to read/write a row from the database every time a user receives/sends a message
- It will increase latency and create a huge load on the databases
Need a wide column database like HBase (used by Facebook)
- Column oriented key-value based NoSQL database
- Can store multiple values against one key into multiple columns
- Runs on top of Hadoop Distributed File System HDFS
- Groups data together to store new data in a memory buffer and once the buffer is full, it dumps the data to the disk.
- This helps to store a lot of small data quickly but also fetching rows by the key or scanning ranges of rows
- Hbase is also efficient to store variable sized data which is required by our service.
Pagination:
Clients should paginate when fetching data from the server. Page size could vary depending on client viewport etc.
Managing user’s status
- Since we are maintaining a connection object on the server for all active users, we can figure out user’s status from this.
- Since we cannot broadcast a user’s status to all active users (500 M connections)
- When a client starts, it can pull current status of all users in their friend list
- Need to know basis → whenever a user goes offline, the message delivery fails and we can update the sender with receiver’s message.
- Client can pull the status of a user whenever they start a new chat.
Data Partitioning
Shard size = 4 TB
Total shards = 3.6 PB / 4 TB ~= 900 shards
UserID
- Hash(UserID) % numShards
- Keep messages of a user on the same database
- Quick to fetch users message history
MessageID
- If we store different messages of a user on separate database shards, fetching a range of messages of a chat would be
very slow
, so weshould not adopt this scheme
.
Cache
- B/w clients and chat servers (for history)
Load Balancing
- B/w clients and Chat servers (which maps userId to server holding the connection for that user)
Fault Tolerance and Replication
- Multiple replicas of chat server
- Multiple replicas of database servers (Reed-Solomon encoding to distribute and replicate it)
- if a chat server fails, transferring connections to another server could be expensive and complex. Instead, we can have clients reconnect to a new server if the connection is lost.
Group Chat
- We can have separate group chat object in chat server.
- GroupChatId → mapping with users that are part of this group
- Users send messages to a group using groupChatId and the servers finds all the users to send the messages to.
Push Notifications
- In our current design, users can only send messages to online users; if the receiving user is offline, we send a failure to the sending user. Push notifications will enable our system to send messages to offline users.
- User can opt in if they want to receive notifications when they are offline
- Setup notification service that takes the messages for offline users and send them to the manufacturer’s push notification server, which will send them to user’s device.