The video, which has been removed from Google Video, apparently comes from an in-house “tech talk” presentation by Google’s Ben Darnell to new Google employees. He describes the functionality of Reactor, the back end for managing RSS and Atom feeds, and how it will be used to power “activity streams” for a project code-named Makamaka (or Macamaca?), Google’s major effort to build infrastructure for social networking across all of its applications.
He also said that Google Reader will support more languages, and that a feature that clusters users by search history and suggests feeds might be launching relatively soon.
Google’s Orkut already feeds and displays changes to profiles, photo albums, video favorites and testimonials.
Here are some notes I wrote up from the audio of the talk, which is just a portion of a 52-minute presentation:
Reactor is the back end for user-driven RSS or Atom feeds. Google Reader, iGoogle, Orkut and Blogger are the biggest users of Reactor. AJAX APIs and Spreadsheets, which just launched a feed feature, also use it. It is built around the model of streams, assuming any feed is a sliding window on a potentially unbounded stream of items, which are indexed and now searchable in Google Reader.
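To make the stream model concrete, here is a minimal sketch of the idea, not Reactor's actual code. It assumes each crawl returns the feed's current window as a list of (item id, title) pairs; the `Stream` class and its `ingest` method are hypothetical names invented for illustration. Each crawl is merged into an append-only history, so items that later slide out of the feed are still kept.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Hypothetical append-only history of every item seen for one feed."""
    items: list = field(default_factory=list)
    seen_ids: set = field(default_factory=set)

    def ingest(self, feed_window):
        """Merge one crawl (the feed's current window) into the history.

        Returns only the items that had not been seen before.
        """
        new_items = []
        for item_id, title in feed_window:
            if item_id not in self.seen_ids:
                self.seen_ids.add(item_id)
                self.items.append((item_id, title))
                new_items.append((item_id, title))
        return new_items


# Two crawls of the same feed: the window has slid forward by one item.
stream = Stream()
first_crawl = stream.ingest([("1", "a"), ("2", "b"), ("3", "c")])
second_crawl = stream.ingest([("2", "b"), ("3", "c"), ("4", "d")])
```

After the second crawl the history holds all four items, even though no single crawl of the feed contained more than three.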
Users can tag items and users can subscribe to the tagged items or feeds by reference. As the list is updated it is reflected in other people's views.
The flexibility of Reactor comes at the expense of some performance. In maintaining a history, Reactor can’t tell the difference between a feed that has been deleted and a bug on a site. For now, it exposes the fact when the first person subscribed to the feed, and stops crawling when no subscribers are left, which is uncommon. People don’t always remove feeds.
iGoogle and MyYahoo are the biggest readers…breaking new territory by keeping full history and have to address it at some point, but no suitable standard to adopt already. Once we talk that we can work with Blogger and Movable Type to get it adopted fairly quickly.
Two-thirds of data comes from feeds with only one subscriber. Thought it was an effect from early growth but not the case. The number of feeds has grown as fast as the number of users.
10 terabytes of raw data for 8 million feeds.
Index growing at four percent per week.
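As a rough back-of-the-envelope reading of the figures above, the following sketch works out the average data per feed and what four percent weekly growth implies. It assumes decimal terabytes and a constant compound growth rate, neither of which was stated in the talk; the variable names are illustrative.

```python
import math

RAW_BYTES = 10 * 10**12      # 10 terabytes, assuming decimal units
FEED_COUNT = 8 * 10**6       # 8 million feeds
WEEKLY_GROWTH = 0.04         # index growth of four percent per week

# Average raw data per feed, in bytes.
avg_bytes_per_feed = RAW_BYTES / FEED_COUNT

# Weeks for the index to double if the weekly rate stays constant.
doubling_weeks = math.log(2) / math.log(1 + WEEKLY_GROWTH)

# Growth factor over 52 weeks at the same compound rate.
yearly_factor = (1 + WEEKLY_GROWTH) ** 52

print(f"{avg_bytes_per_feed / 1e6:.2f} MB per feed on average")
print(f"doubles in about {doubling_weeks:.1f} weeks")
print(f"grows about {yearly_factor:.1f}x in a year")
```

Under those assumptions the average works out to about 1.25 MB per feed, and the index would double roughly every 18 weeks.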
Some feeds are very spammy and we try to garbage collect those.
Page views per user higher except for Gmail and Orkut.
Seventy percent of Google Reader traffic from Firefox. Still a niche, given Internet Explorer is more dominant.
RSS experience not where it needs to be for new users.
Monetization of feeds via FeedBurner. Have a bit of the Google News problem: it’s other people's content. Could possibly identify feeds from AdSense publishers and share the revenue.
Ads in feeds may turn out to be better than ads on Web sites…people selected the feeds, so you know a little bit more about them and they have shown commitment to the feed…some potential we haven’t looked into yet.
A lot of feeds are now dynamic, generated by search results on blog search or Google News, dynamically generated feeds for a single user, such as Flickr feeds for all the comments on your photos.
Crawl scheduling is prioritized on number of users. The crawl interval is one hour for any feed with more than one subscriber and three hours for a feed with only one subscriber. Coming soon: if feeds send pings when they are updated, we will work with blog search to accept those pings, and then we will be able to get the data more quickly, a sort of push-based notification.
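The crawl intervals described above can be sketched as a small function. This is only an illustration of the stated rules (one hour for more than one subscriber, three hours for exactly one, no crawling with none); the function names `crawl_interval` and `next_crawl` are hypothetical and say nothing about how Reactor's scheduler is really built.

```python
from datetime import datetime, timedelta
from typing import Optional


def crawl_interval(subscribers: int) -> Optional[timedelta]:
    """Return how long to wait between crawls, or None to stop crawling."""
    if subscribers <= 0:
        return None                 # no subscribers left: stop crawling
    if subscribers == 1:
        return timedelta(hours=3)   # single-subscriber feeds: every 3 hours
    return timedelta(hours=1)       # more than one subscriber: every hour


def next_crawl(last_crawl: datetime, subscribers: int) -> Optional[datetime]:
    """Return the time of the next scheduled crawl, or None if there is none."""
    interval = crawl_interval(subscribers)
    return None if interval is None else last_crawl + interval


last = datetime(2007, 1, 1, 12, 0)
popular_next = next_crawl(last, subscribers=250)
single_next = next_crawl(last, subscribers=1)
abandoned_next = next_crawl(last, subscribers=0)
```

A ping-based scheme like the one mentioned would sit alongside this, triggering a crawl as soon as a feed announces an update instead of waiting for the interval to elapse.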
First to serve feeds out of a disk-based Big Table…latency is something we are happy with. Now a full Mustang Tree. Big Table closest to a standard distributed database at Google…different from SQL or MySQL: no sophisticated queries…limited to what you can do efficiently in a distributed environment.
Mustang takes the data, breaks it into chunks and distributes them across servers.
Reader has a Mustang Tree for search, and searches the most recent data and what has been read. It keeps all the data in memory…means that it has some poor tradeoffs …designed for thousands of QPS (queries per second). Will have difficulty scaling as it accumulates more data.
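The talk did not say how Mustang decides which chunk goes to which server. As a generic illustration of splitting data across servers, here is a sketch that assumes simple hash partitioning by document id; `shard_for` and `partition` are hypothetical helpers, not Mustang functions.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard index using a stable hash (assumed scheme)."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(doc_ids, num_shards: int):
    """Split document ids into num_shards lists, one per server."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


docs = [f"item-{n}" for n in range(1000)]
shards = partition(docs, num_shards=4)
```

A query would then be sent to every shard and the partial results merged, which is one common way a search system stays responsive as a single index grows too large for one machine.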
Looking at ways to do clustering for search, but don’t think will use Mustang.
News has thousands of news sources and millions of feeds. Recently rewrote clustering system. For immediate future do simpler form of clustering…instead of keywords cluster links
Mustang is a library for creating search engines. The major exception is Gmail, which has a separate index for every user and provides instant updates.
Upcoming Makamaka activity streams via Reactor.
Makamaka is Google’s current big effort to build infrastructure for social stuff across all of our applications.
Most familiar with activity streams like Facebook feeds…will use Reactor to get all of the events from friends into the system.
Various Makamaka plans involve us getting calls to Reader from Gmail and Orkut so have to scale up a couple orders of magnitude in queries.
Access to address book and contact list and figure out ways to use that information in Reader.
Beyond just sharing this, including a comment in it is something we are interested in.
Clustering of related items and including comments when share items.