If you don’t read all of this admittedly long post, please do skip to the end and check out the BarCampBoston info. I’ll be holding a session there on the topic of Distributed Microblogging.
Ok, so let’s talk about the hard bits of doing Distributed Microblogging. It’s easy to envision a multitude of servers exchanging microblog posts, and a UI that simply arranges the posts in chronological order. By the way, chronological order is easy to do if all the servers are synched with a time standard like nist.gov, and most are. But the hard bit is making it work in a way that performs well and scales to large populations of both microblog posters and readers. I’ve been thinking of some different alternatives for this, which I’ll lay out here. As always, your thoughts are welcome.
Performance
So, what do I mean by “performs well”? Well, microblog (e.g., Twitter) updates happen much more frequently than what we’d consider traditional blog posts, but not quite as fast as instant messenger or chat updates. And microblogs don’t have the notion of presence attached to them. You don’t worry about whether a Twitter poster is “on line” at any particular time, although you may deduce that from the rapidity of their responses to you.
To put a number on it, I’d say that a microblog post notification should be transmitted in less than a minute, ideally in a few seconds, although a delay of up to 5 minutes is not that objectionable. I frequently will ignore my Twitter feed for many minutes, sometimes hours. I have no expectation that I will see someone’s posts immediately (i.e., less than a second) nor do I care to. Microblogging is a river of updates. You don’t expect to see every single one.
Size
Now let’s talk numbers of followers and following. A typical Twitter user has 100 or fewer people that they follow, and a similar number that are following them. But the edge cases are far bigger. Someone like Robert Scoble (@scobleizer) or Chris Brogan (@ChrisBrogan) follows literally thousands of other users, and has even more following them.
Ok, so we need to be able to update thousands of users in nominally less than a minute, but at the extreme less than five minutes. If you don’t buy that, please leave a comment explaining why these limits are not realistic.
Alternatives
RSS feed polling
The simplest solution would be polling, which is reader initiated. Followers would poll the feeds of the microblogs they were interested in. It seems to me that polling has to be discarded as a solution except for occasional or retrospective uses. I think a solution should include RSS feeds, but it seems obvious to me that for someone with 100 friends to poll all of those RSS feeds every minute (or ideally faster), just because one of them might have posted something, is a stupendous waste of bandwidth.
On the sending side, if that same person had 100 followers, each of those followers would be polling the person’s RSS feed every few seconds causing a very high server load. Now multiply that by however many people’s accounts are hosted on that server and the problem quickly blows up.
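To put a rough number on how quickly this blows up, here is a back-of-the-envelope sketch in Python. The figures (10,000 hosted accounts, 100 followers each, one poll per minute) are assumptions picked for illustration, not measurements from any real service.

```python
def polling_requests_per_second(accounts: int,
                                followers_per_account: int,
                                poll_interval_s: float) -> float:
    """Requests per second a server sees if every follower polls every hosted feed."""
    return accounts * followers_per_account / poll_interval_s


# Hypothetical figures: 10,000 hosted accounts, 100 followers each,
# every follower polling once a minute.
load = polling_requests_per_second(10_000, 100, 60)
print(f"{load:,.0f} requests/second")
```

Nearly all of those requests would come back with nothing new, which is the waste that notification is meant to avoid.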
Notification
A much more efficient solution would be for senders (microblog posters) to notify followers when a new post has been made, and perhaps to proactively send the new content in the notification message. Then, resources are used only when there is actual traffic to send. Both senders and receivers are quiescent when no one is posting. So what are the possibilities for implementing notification? I see the following.
- RSS cloud API - a little-known part of the RSS specification, the cloud element allows a feed to publish a web-service address that readers of the feed can register with to be notified of changes using a SOAP or xml-rpc call.
- Jabber (XMPP) channels between DMB servers to carry notifications and content.
- UDP notification with http callback. UDP is lightweight for both senders and receivers. No open connections are required between senders and receivers. It’s sort of like RSS cloud, but narrowly and specifically designed for DMB, as opposed to generalized RSS.
RSS cloud API
The cloud API was designed specifically for this purpose of notifying readers of content updates. Its original intent, judging from the RSS 2.0 spec, was to allow actual client feed readers to register with the cloud. In the case of DMB, it would be cooperating servers that would register for notification with each other.
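For reference, the cloud element carries five attributes in the RSS 2.0 spec: domain, port, path, registerProcedure and protocol. A minimal sketch of how a follower's server might discover where to register, using Python's standard XML parser, could look like this. The feed contents, host name and procedure name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; only the cloud attribute names come from the RSS 2.0 spec.
FEED = """<rss version="2.0"><channel>
  <title>Example microblog</title>
  <cloud domain="dmb.example.com" port="80" path="/RPC2"
         registerProcedure="dmb.pleaseNotify" protocol="xml-rpc"/>
</channel></rss>"""


def cloud_endpoint(feed_xml: str):
    """Return the cloud element's attributes as a dict, or None if the feed has none."""
    cloud = ET.fromstring(feed_xml).find("./channel/cloud")
    return dict(cloud.attrib) if cloud is not None else None


endpoint = cloud_endpoint(FEED)
print(endpoint)
```

The registration call itself is left out here, since it depends on which protocol the feed advertises.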
The problem with RSS cloud is overhead. Microblog entries are tiny and frequent compared with blog entries or traditional site updates. To require a follower server to read an entire RSS document to get 140 characters of content, and have this happen every few minutes when the poster updates would be inefficient to say the least. In addition, there is experience with the cloud api that indicates just the HTTP session overhead for notifying many users becomes intolerable, although this was from the perspective of actual clients being notified as opposed to clients’ servers being notified.
Jabber (XMPP)
Jabber is a very tempting candidate for this application, and has been getting quite a bit of discussion in the development community of late. Here’s an example. The advantage to Jabber is that it maintains open sessions between servers, which eliminates the session setup/teardown overhead, and allows for almost instantaneous notification of all “following” parties.
But this may also be a disadvantage in situations where there are hundreds or thousands of “followers” for a single sender. IM or chat is typically one-to-one, or one to a few, but microblogging is frequently one to hundreds or one to thousands. I am not familiar enough with Jabber servers in actual practice to know what their performance or connection limitations are. Anyone with Jabber implementation or operational experience is strongly encouraged to comment.
Jabber’s concept of presence could be used to keep the number of messages down by only requesting message updates when a user is actually logged into his/her microblog system. What’s interesting about this notion is what “logged in” means. Microblogging, at least the way Twitter works, does not really require the concept of presence. For instance, you can be “logged in” to Twitter, but dormant for hours, not getting any updates until you request a page refresh.
In fact, one key aspect of microblogs that differentiates them from IM or chat is that they don’t typically “auto-update”. And rather than this being a disadvantage, I (and I think many others) find this on-demand update to be much more useful than a streaming IM or chat window. It’s really more like reading small blog posts. I read when I want to, not when someone else decides to say something. So, using the presence capability without auto-updating will require a little clever UI design to, in a sense, auto-logoff the user when their web page hasn’t been refreshed in a certain period of time. This, then, will also require the ability of followers to query the senders they follow for “back posts”, so they can see what happened in the past without keeping a client logged in all the time to save all the posts.
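The auto-logoff idea can be sketched as a small tracker that treats a user as present only if their page was refreshed recently. This is just an illustration in Python. The class name, the 15-minute timeout and the injectable clock are all my assumptions, and nothing here talks to a real Jabber server.

```python
import time


class PresenceTracker:
    """Treats a user as present only while their last page refresh is recent."""

    def __init__(self, idle_timeout_s: float = 900.0, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s  # assumed timeout: 15 minutes
        self.clock = clock                    # injectable so it can be tested
        self.last_refresh = {}

    def page_refreshed(self, user: str) -> None:
        """Record that the user's web page just asked for updates."""
        self.last_refresh[user] = self.clock()

    def is_present(self, user: str) -> bool:
        """True if the user refreshed within the idle timeout."""
        seen = self.last_refresh.get(user)
        return seen is not None and self.clock() - seen < self.idle_timeout_s
```

A server would call page_refreshed on every page load and stop requesting updates for anyone whose is_present has gone false. The “back posts” query covers whatever they missed in the meantime.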
UDP for notification
In the past year, I did some work on a revamped email protocol I called IMTP that uses small UDP messages to notify receiving servers that the sender has some traffic for them. The receiver then calls back to the sender using TCP to get the message body. This was based on Prof. Daniel Bernstein’s Internet Mail 2000 proposal several years ago. He proposed, quite rightly in my opinion, that mail senders should bear the burden of storing the message contents, not the receivers, and that mail content should only be sent when a receiver actually wants to read it.
The advantage to UDP is that it is very lightweight for both the sender and the receiver, not requiring any session overhead or setup/teardown. If used for microblogging, UDP notification messages could take the place of the continuously open TCP sessions that Jabber employs, thus reducing the session resource allocation load on both ends.
Now, of course, the issue with UDP is that it is not guaranteed to be delivered. The IMTP service we built would retry the UDP message at some frequency until the receiver called back to either fetch or reject the message.
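The retry behavior can be sketched roughly as below. This is not the IMTP code, just an illustration in Python. The function name, the 30-second interval and the attempt limit are assumptions, and the actual sending and the callback check are passed in as plain functions.

```python
import time


def notify_with_retry(send, acknowledged,
                      retry_interval_s: float = 30.0,
                      max_attempts: int = 10,
                      sleep=time.sleep):
    """Resend a notification until the receiver calls back or attempts run out.

    send():         transmits one UDP notification (supplied by the caller).
    acknowledged(): True once the receiver has called back to fetch or reject.
    Returns the number of attempts used, or None if the sender gave up.
    """
    for attempt in range(1, max_attempts + 1):
        send()
        sleep(retry_interval_s)  # give the receiver time to call back
        if acknowledged():
            return attempt
    return None
```

In practice send could wrap a socket.sendto call on a UDP socket, and acknowledged could check whether the receiver's TCP callback has arrived.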
Conceptually, using UDP messages with a 140 character payload and a message number would fit a microblogging application very well. Because microblogging has no expectation or requirement of presence or real-time delivery like chat and IM do, a dropped UDP message is not a tragedy. If the sender retried even once or twice per minute, that would be plenty of timeliness for microblogging. Plus, if the UDP messages carry a monotonically increasing message number, the receiver can know if they’ve missed a message and simply call back to the sender to get it. The microblogging UI can reconstruct the sequence easily when the missing pieces, if any, finally come through.
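Here is a minimal sketch in Python of what such a datagram and the gap check might look like. The wire layout (an 8-byte message number and a 2-byte payload length, followed by UTF-8 text) is purely an assumption for illustration, not a proposed standard.

```python
import struct

# Assumed layout: message number (8 bytes), payload length (2 bytes), network byte order.
HEADER = struct.Struct("!QH")


def pack_notification(msg_number: int, text: str) -> bytes:
    """Build one notification datagram from a message number and post text."""
    payload = text.encode("utf-8")
    return HEADER.pack(msg_number, len(payload)) + payload


def unpack_notification(datagram: bytes):
    """Return (message number, post text) from a received datagram."""
    msg_number, length = HEADER.unpack_from(datagram)
    payload = datagram[HEADER.size:HEADER.size + length]
    return msg_number, payload.decode("utf-8")


def missing_numbers(last_seen: int, received: int):
    """Message numbers skipped between the last one seen and the one just received."""
    return list(range(last_seen + 1, received))
```

If a receiver last saw message 4 and then gets message 7, missing_numbers(4, 7) gives [5, 6], which are the posts to call back for.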
A notification system using UDP would seem to minimize the resource requirements on both senders and receivers, and can perform the same kind of message fan-in/fan-out optimization that Jabber could. In other words, a given microblog update body needs to be sent only once to any server, regardless of how many followers there may be on that destination server.
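The fan-out side of that optimization comes down to grouping followers by their home server before sending anything. A rough Python sketch, assuming followers are addressed as user@server (my assumption, borrowed from how Jabber and email address people):

```python
from collections import defaultdict


def group_followers_by_server(followers):
    """Map each destination server to the list of local users following the poster."""
    by_server = defaultdict(list)
    for address in followers:
        user, _, server = address.partition("@")
        by_server[server].append(user)
    return dict(by_server)


# Hypothetical follower list: three followers, but only two servers to notify.
targets = group_followers_by_server(
    ["alice@a.example", "bob@a.example", "carol@b.example"]
)
print(targets)
```

The sender then notifies each server once, and the receiving server fans the post out to its own local followers.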
Comments, please!
I am very interested in what others think and have to say about these issues. This problem of efficient and timely-enough notification, it seems to me, is the tough nut to crack for a good solution to microblogging.
By the way, I am going to do a session on Distributed Twitter (or Microblogging) at BarCampBoston3 on May 17 and 18. We are going to try to bring in some mobile technology folks to discuss the other really interesting issue with Distributed Twitter, the SMS connection.
Also, at BarCampBoston, I am going to try to organize some kind of group to try implementing a Distributed Microblogging application going forward.
Tags: distributed, microblogging, twitter