Yahoo: Boosting Value with New Infrastructure

Includes: AABA, GOOG
by: Larry Dignan

Yahoo (YHOO) had a lot of “technical debt” (systems that most companies would call legacy infrastructure) and has just about paid it off.

Sam Pullara, chief technologist at Yahoo, walked developers through its infrastructure changes over the last three years. The talk could have come from any enterprise that finds itself with a rat’s nest of systems, no architecture vision and siloed divisions. “One of the things you have to understand. Yahoo’s IT wasn’t built on one uniform infrastructure. It was built by acquisition and building up properties,” says Pullara. “We had a lot of technical debt.”

Sound familiar? Yahoo went through what every company does. Simply put, it had too many legacy systems that were strangling the company. The biggest hurdle: resisting the urge to build your own stuff. Pullara says:

When building your own systems should you be leveraging the open source community? Are these the kinds of things you do as opposed to reinventing the wheel? What we’re finding is not reinventing the wheel has a great benefit.

Here are some of the key parts of Yahoo’s IT overhaul and my notes.

Cloud computing at Yahoo

Pullara said that the key item for cloud computing is the definition. Yahoo says cloud computing is “shared infrastructure and platforms that you don’t have to install, deploy or operate.” “You’re putting your faith in a third party and saying they can do it better than you can,” he says.

Yahoo’s cloud computing team provides internal infrastructure to developers. Services include edge caching, storage, data pipelines, servers and batch processing. These services are stacked into one cloud that’s delivered as a service through Yahoo. Here’s the breakdown:

Edge servicing: Yahoo has its own traffic server via Inktomi. There’s caching with the Yahoo CDN and a proxy cache, high-scale routing with software load balancing, and a proposal to add the technology to Apache as an incubated project. Pullara said the source code for its edge services will be contributed to Apache as a first step toward becoming a top sponsored project.
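To make the edge tier concrete, here is a minimal sketch of a proxy cache sitting in front of round-robin load balancing. It is a toy illustration, not Yahoo's traffic server: the `EdgeProxy` class, the origin names and the LRU eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict
from itertools import cycle
from typing import Callable, Dict


class EdgeProxy:
    """Toy edge tier: an LRU cache in front of round-robin origin servers.

    Hypothetical sketch; names and policies are assumptions, not Yahoo's design.
    """

    def __init__(self, origins: Dict[str, Callable[[str], str]], capacity: int = 2):
        self._origins = origins
        self._next_origin = cycle(origins)  # round-robin over origin names
        self._cache: "OrderedDict[str, str]" = OrderedDict()
        self._capacity = capacity
        self.origin_hits = 0  # how many requests had to reach an origin

    def get(self, path: str) -> str:
        # Cache hit: serve from the edge and mark the entry as recently used.
        if path in self._cache:
            self._cache.move_to_end(path)
            return self._cache[path]
        # Cache miss: route to the next origin in rotation.
        name = next(self._next_origin)
        body = self._origins[name](path)
        self.origin_hits += 1
        self._cache[path] = body
        # Evict the least recently used entry once over capacity.
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)
        return body


origins = {
    "origin-a": lambda path: f"origin-a served {path}",
    "origin-b": lambda path: f"origin-b served {path}",
}
proxy = EdgeProxy(origins)
first = proxy.get("/index.html")   # miss, goes to origin-a
second = proxy.get("/index.html")  # hit, served from cache
third = proxy.get("/news.html")    # miss, goes to origin-b
```

The repeated request never reaches an origin, which is the point of putting a cache at the edge.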

Structured storage: Yahoo has a multi-colo key-value store dubbed Sherpa. The service isn’t open source, but it is available to third parties. Pullara noted there’s a move away from structured databases like Oracle (NYSE:ORCL) and SQL toward key-value stores. The decision point is determining when you need MySQL vs. a key-value store. In 95 percent of cases, Yahoo finds it doesn’t need MySQL and can use a key-value store, says Pullara.
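The MySQL vs. key-value decision can be sketched roughly as follows. This is an illustration under stated assumptions, not Sherpa's API: `KeyValueStore` is a hypothetical in-memory stand-in, the sample records are made up, and SQLite from the Python standard library stands in for MySQL.

```python
import json
import sqlite3
from typing import Dict, Optional


class KeyValueStore:
    """Hypothetical in-memory stand-in for a key-value store (not Sherpa's API)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: dict) -> None:
        # Values are opaque blobs to the store; it cannot query inside them.
        self._data[key] = json.dumps(value)

    def get(self, key: str) -> Optional[dict]:
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None


# Common case: fetch one record by its key. A key-value store is enough.
kv = KeyValueStore()
kv.put("user:42", {"name": "Ada", "city": "Sunnyvale"})
profile = kv.get("user:42")

# Less common case: an ad hoc query across records, where SQL earns its keep.
# SQLite is used here only as a convenient stand-in for MySQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
db.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(42, "Ada", "Sunnyvale"), (43, "Grace", "Sunnyvale"), (44, "Linus", "Portland")],
)
per_city = dict(db.execute("SELECT city, COUNT(*) FROM users GROUP BY city"))
```

If every access looks like the first case, lookup by a known key, the relational machinery in the second case goes unused.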

Unstructured storage: Yahoo has a “blob store” called Mobstor that isn’t available externally.

Web and data serving

Yahoo uses virtualization based on Xen. It also has declarative application definitions and components.

For batch processing, it’s widely known that Yahoo uses Hadoop MapReduce-based computing. It was deployed in 2006 with dozens of grids and tens of thousands of nodes. “It’s a big stretch to teach people the MapReduce programming paradigm,” says Pullara. That’s why Yahoo created the PIG language.
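For readers who haven’t seen the paradigm, here is a single-process sketch of the three MapReduce steps (map, shuffle, reduce) applied to a word count. It only illustrates the programming model; it does not use Hadoop, and the function names and sample input are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["yahoo uses hadoop", "hadoop runs mapreduce", "yahoo created pig"])
```

In a real cluster the map and reduce calls run on different machines and the framework handles the shuffle. The developer still has to recast the problem as key-value pairs, which is the stretch Pullara describes.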

Pullara went into details about Hadoop, which he called a love story. Yahoo started building its own MapReduce platform, but decided to go open source. This move has generated other projects including:

  • PIG, a MapReduce language;
  • Owl, a workflow for batch jobs;
  • ZooKeeper, a distributed in-memory transactional database;
  • HBase, a BigTable clone. HBase is being investigated as a long-term solution for Yahoo, says Pullara.

The biggest perk in focusing on Hadoop is that Yahoo now has a generation of developers that know MapReduce well. Google (NASDAQ:GOOG), Twitter, Facebook and a bunch of others use Hadoop. “You can go to universities and find people that know MapReduce now,” says Pullara.

Yahoo’s Open Platform Services

Pullara said that Yahoo defines open platform services as services that are generally available both internally and externally and that provide high-level interfaces to users, data and computational resources. Pullara acknowledges that the definition differs only in nuance from cloud computing, but he says open platform services add more value.

Yahoo’s Open Platform Services are offered by the Yahoo OS and Yahoo properties, and are supported by its developers. They include social APIs.

Pullara said that, overall, this is what makes Yahoo available to third party applications. Yahoo can read from and write to APIs through YQL, the Yahoo Query Language. YQL, which acts as a smart proxy to Web services, now powers Yahoo Messenger’s Insider page, which is pretty handy and loads nicely.

“Yahoo and third party developers are so much more agile when we have these types of things,” says Pullara, who noted the quality, operations and network effects of Yahoo’s IT overhaul. And oh by the way, it’s cheaper. “We didn’t have hard ROI metrics, but we did a lot better work and weren’t as inefficient as before.”