Robert Scoble's done some calculations that illustrate why RSS is becoming a bandwidth nightmare for large sites.
Having aggregators constantly ping an RSS feed for updates clearly doesn't scale. This certainly isn't news, but MSDN pulling its full-text feeds is the first high-profile example.
The problem can be eased by reducing aggregators' polling frequency - once every couple of hours, or even once a day - but surely that's a band-aid? Scale up the (currently tiny) number of RSS readers a few thousand times and the stickiness of RSS means you've still got a problem, even polling once a day.
Besides which, limiting polling to once per day destroys a large chunk of the appeal of RSS: I want to know whenever stavros updates, and I want to know NOW, dammit.
So what's the solution? A push technology is never going to provide what RSS does - we're immediately back in the world of email and spam and filtering and... ugh. And any pull technology is always going to poll for changes. So the solution must lie in limiting the bandwidth required by the poll.
I have no wish to descend into the depths of the RSS/RDF/Atom/Semantic Web debates, but the solution lies there, in the specification. RSS providers need a way to tell aggregators "go away, there's nothing to see here for another 12 hours". And it needs to happen in less than a couple of KB.
There's already a ttl element to indicate the "number of minutes that indicates how long a channel can be cached before refreshing from the source", and the less useful skipHours and skipDays elements.
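For what it's worth, here is a rough sketch of how an aggregator might honour those three elements. It assumes the RSS 2.0 reading of them (ttl in minutes, skipHours as GMT hours 0-23, skipDays as English day names); the `should_poll` function and the sample feed are made up for illustration, not taken from any real aggregator.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

# Hypothetical feed: cache for 12 hours, skip 01:00-03:59 GMT and weekends.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example</title>
  <ttl>720</ttl>
  <skipHours><hour>1</hour><hour>2</hour><hour>3</hour></skipHours>
  <skipDays><day>Saturday</day><day>Sunday</day></skipDays>
</channel></rss>"""

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def should_poll(feed_xml, last_fetch, now):
    """Return True if the publisher's hints allow a fetch at `now`.

    Both datetimes must be timezone-aware; they are compared in UTC.
    """
    channel = ET.fromstring(feed_xml).find("channel")
    now_utc = now.astimezone(timezone.utc)

    # ttl: minutes the cached copy may be kept before refreshing.
    ttl_text = channel.findtext("ttl")
    if ttl_text and now_utc - last_fetch < timedelta(minutes=int(ttl_text)):
        return False

    # skipHours: hours of the day (GMT) when the feed should not be read.
    skip_hours = {int(h.text) for h in channel.findall("skipHours/hour")}
    if now_utc.hour in skip_hours:
        return False

    # skipDays: days of the week when the feed should not be read.
    skip_days = {d.text.strip() for d in channel.findall("skipDays/day")}
    if DAY_NAMES[now_utc.weekday()] in skip_days:
        return False

    return True
```

The catch is that the aggregator needs a cached copy of the feed to read the hints from, so this only saves the polls after the first one.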
Posted by: Gordon Weakliem | September 10, 2004 at 01:26 PM
Actually, despite the fun of slamming skipHours and skipDays, they are more useful for most uses of RSS than ttl. Say (from the numbers Bloglines is showing me) Stuart posted this at 7:10pm. We know from his posting history (Dude! You make me look good!) that a reasonable ttl is around 10000, but to be on the safe side, set it to 5000. I won't check again for 83 hours, whether or not he posts again, at which point I'll pick up whatever edit he made within the first five minutes, along with a possible out-of-character six more posts in the next hour.
When you post to your blog, you have absolutely no idea when you will next post. ttl is just a religious preference for how often *you* think someone should want to set their aggregator to update. But say you blog at work, so you know you will only post between 9 and 5 localtime, Monday through Friday. Why get tens of thousands of hits, 304s or not, when the office is closed? skipHours and skipDays are perfect for you. Me, I'll never post between 1 and 7 am localtime, and I see by my logs that a copy of Radio Userland checked my feed at 00:55 last night, and then that same person checked again at 07:54 this morning, saving us both six absolutely fruitless checks. Just to pick on Luke at random, during that time various copies of SharpReader hit me 142 times, despite the fact that both I and most of their users were asleep.
But, less than a couple of KB? Total bandwidth for returning a 304 Not Modified ought to amount to a few hundred bytes. Apache ought to be able to handle several hundred at once. The problem MSDN is having isn't telling people "nothing new, move along" so much as it is having something new: aggregate enough weblogs, and you've always got a partially new feed, so you send the new stuff and the old stuff, too. What we need, but don't seem to want to design, is either a way of asking for anything new after 20040909T17:34:00Z, or a way of delivering a feed that only tells where to get the actual items, so that if there are three new items, and 17 old ones, you only get the three new ones, rather than fetching yet another copy of the stuff you've seen before.
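To make the "anything new after a timestamp" idea concrete, here is a minimal sketch of what the server side could look like. The `since` parameter, the `respond` function and the item layout are all hypothetical; nothing like this is in the RSS spec, and a plain HTTP conditional GET only answers "changed or not" for the whole feed.

```python
from datetime import datetime, timezone

# Hypothetical store of feed items, each stamped in the same format
# as the cutoff used in the comment above.
FEED_ITEMS = [
    {"guid": "a", "updated": "20040909T09:00:00Z"},
    {"guid": "b", "updated": "20040909T18:00:00Z"},
    {"guid": "c", "updated": "20040910T01:15:00Z"},
]


def parse_stamp(stamp):
    """Parse a stamp such as 20040909T17:34:00Z into an aware UTC datetime."""
    parsed = datetime.strptime(stamp, "%Y%m%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def respond(items, since=None):
    """Return (status, items) for a poll.

    With no cutoff the whole feed is sent. With a cutoff, only items
    updated after it are sent, or a bare 304 if there are none.
    """
    if since is None:
        return 200, list(items)
    cutoff = parse_stamp(since)
    fresh = [item for item in items if parse_stamp(item["updated"]) > cutoff]
    if not fresh:
        return 304, []
    return 200, fresh
```

Called as `respond(FEED_ITEMS, "20040909T17:34:00Z")`, this would send back only the two items stamped after the cutoff instead of all three.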
Posted by: Phil Ringnalda | September 10, 2004 at 03:35 PM
Yeah yeah, so I should post more often. If I always got educational responses like those then I might do so! ;)
skipHours and skipDays are actually the sort of thing I had in mind. I know I'm unlikely to post after 9 p.m. and before 9 a.m., Sydney time. That's most of the working day on the West Coast of the US, which equates to a whole bunch of unnecessary hits.
I get the point about getting only 'New' items. Separating the notification from the content would be a nice way to go about it. With that method, the BBC (for example) could have a single sports 'notification' feed, stating the item guid and its associated 'categories'. Then, rather than getting the item about Manchester United from the 'Sports Headlines', 'Football Headlines' and 'English Premiership' feeds, you'd just pull it down once.
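Roughly what I have in mind on the reader's side, as a sketch only: the notification entries, their field names and the `guids_to_fetch` function are all invented for illustration, and no such feed exists.

```python
def guids_to_fetch(notifications, subscribed, already_seen):
    """Pick the item guids a reader still needs to download.

    notifications: list of {"guid": str, "categories": [str, ...]}
    subscribed:    set of category names the reader follows
    already_seen:  set of guids fetched earlier (updated in place)
    """
    wanted = []
    for note in notifications:
        if note["guid"] in already_seen:
            continue  # already pulled via another category
        if subscribed & set(note["categories"]):
            wanted.append(note["guid"])
            already_seen.add(note["guid"])
    return wanted


# Hypothetical notification feed: one entry per item, however many
# categories it belongs to.
NOTIFICATIONS = [
    {"guid": "man-utd-result",
     "categories": ["Sports Headlines", "Football Headlines",
                    "English Premiership"]},
    {"guid": "cricket-report", "categories": ["Cricket"]},
]
```

A reader following both 'Football Headlines' and 'English Premiership' would get the Manchester United guid back once, not twice.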
So who's going to fix it? Fancy facing the RSS/Atom mob?!
Posted by: stuandgravy | September 10, 2004 at 04:40 PM