Jason Taylor – Director of Infrastructure
Stephen Ju – Credit Suisse
Facebook Inc. (FB) Credit Suisse Technology Conference December 4, 2013 11:30 PM ET
Stephen Ju – Credit Suisse
Good morning, everybody. I am Stephen Ju, internet equity research analyst here at Credit Suisse. I am joined on the stage by Jason Taylor who is the Director of Infrastructure at Facebook. Jason leads a group that manages server budget and allocation, designs hardware, performs architecture reviews and curates the long-term infrastructure plan for the company. Jason holds a PhD from MIT in Ultrafast Lasers and Quantum Computing and a BE from Vanderbilt in Physics, Electrical Engineering and Math.
So without further ado, Jason, take it away.
Jason Taylor – Director of Infrastructure
Thank you. Okay, so again I am Jason Taylor and today I want to talk to you about a few things that we are doing with our infrastructure, give you a view of our current infrastructure and then talk about a few developing technologies that we think could provide efficiency wins industry wide over the next two to five years and we'll get into it.
So, we will do a review of the Facebook scale, talk about efficiency, talk about an idea that we are working on right now called disaggregated rack and then get into some new components that we think would be very interesting.
So 84% of monthly active users are outside the United States. We have data centers in five regions. We have a very sizable infrastructure, and also a very busy infrastructure. So 1.19 billion monthly active users, 728 million daily active users. Those users upload 350 million photos per day, and we have over 240 billion photos.
In terms of activity, 4.5 billion likes, posts and comments. So it's very busy. In 2012, we spent $1.24 billion on capital expenditures related to the purchase of servers, networking equipment and storage and the construction of data centers. So with that scale, efficiency has been a high priority for us for several years.
Now I want to talk a little bit about our infrastructure. Here, we have three different sections: the front-end cluster, the service cluster and the back-end cluster. Cluster here refers to a cluster of servers, and that's the same as the network cluster. We think a lot about the network; we do tons of bandwidth inside of our data centers, and so we always have to be mindful of that.
When I talk about a rack, it's a rack of computers. It's about eight feet tall and has about 40 servers per rack. This front-end cluster is really our window to the world. So everybody who comes to Facebook, all of the hits that come through Facebook, they all go through a front-end cluster, and then the other clusters support that front-end cluster, that front-end web server.
So to translate that to servers, there are about 10,000 web servers per front-end cluster and about 144 terabytes of cache spread out over around 1,000 servers. We also have ads racks in there, so these are servers that are dedicated to serving ads, and Multifeed, which serves that center news column of Facebook, kind of the main page of Facebook. And all of these work together to generate a page. This is what they look like.
So we talk about vanity free server. And when we say vanity free, what we mean is that we've removed the components that are really not necessary for operation and scale. So no VGA ports, USB, all of that kind of stuff, it's really not necessary. So we cleaned all of that out and in looking at these slides this morning, I realized that they are not entirely vanity free. You will notice that they are all blue, so we did a lot of that.
So this newsfeed rack, I am going to get into some of the technical details because they're really necessary to understand what we propose to do in the future. So a rack is our unit of capacity. Now when we say it's our unit of capacity, it means that all 40 servers inside that rack work together to produce the service that they're providing for the company. So a newsfeed rack has a lot of RAM and a lot of CPU, and the entire history of the last three days of activity, in a very compact form, sits on a single rack of servers at Facebook. It just indexes into the recent activity.
So when I'm using Facebook and I pull up my newsfeed, what actually happens on the back end is a query goes from the web servers to a newsfeed rack, and it has a list of all of my friends. The aggregator, which receives the query, then contacts all of the leaves on all of the other servers and asks what's happened with Jason's friends recently. It gets all of that data together, ranks it via a ranking algorithm and then sends off the top ten stories. So there are actually many more options, but it just ranks them and gives you the top 10.
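The aggregator and leaf flow described here is a scatter-gather query. A minimal Python sketch of that pattern follows. The `Story`, `Leaf` and `aggregate` names are hypothetical, and ranking by recency is only a placeholder, since the talk does not describe the actual ranking algorithm.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Story:
    story_id: int
    author_id: int
    timestamp: int  # larger means more recent


class Leaf:
    """One server's shard of the recent-activity index (hypothetical layout)."""

    def __init__(self, stories: Iterable[Story]) -> None:
        self._by_author: Dict[int, List[Story]] = {}
        for story in stories:
            self._by_author.setdefault(story.author_id, []).append(story)

    def recent_for(self, friend_ids: Iterable[int]) -> List[Story]:
        """Return every story this leaf holds for the given friends."""
        found: List[Story] = []
        for friend_id in friend_ids:
            found.extend(self._by_author.get(friend_id, []))
        return found


def aggregate(friend_ids: List[int], leaves: List[Leaf], top_n: int = 10) -> List[Story]:
    """Fan the query out to every leaf, gather the candidates, rank, keep top_n."""
    candidates: List[Story] = []
    for leaf in leaves:  # a production system would issue these calls in parallel
        candidates.extend(leaf.recent_for(friend_ids))
    # Placeholder ranking (assumption): most recent first.
    return heapq.nlargest(top_n, candidates, key=lambda s: s.timestamp)
```

Only the top ten stories leave the rack, so the bulk of the candidate data never has to cross the network back to the web server.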
Now if you back up a little bit and think about the life of a hit on Facebook, so a hit on a single cluster at Facebook. What I have here, drawn with time going down, is a request that starts, and that web server contacts memcache. So it talks to our caching servers, it then authenticates the hit and makes sure I am who I say I am, and it might hit a newsfeed server, and the newsfeed server returns a bunch of IDs. Those IDs correspond to stories, content that we want to show the user. All of that content is actually kept in the caching tier. So most of that is pulled from there; maybe some of it isn't there, so we might hit a database, stuff some stuff into cache, and then grab some ads and then ship the page.
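The cache-then-database step in this request flow is commonly called a cache-aside read. The following is a minimal sketch of that pattern, not Facebook's implementation: `fetch_stories` and `load_from_db` are hypothetical names, and a plain dictionary stands in for the caching tier.

```python
from typing import Callable, Dict, Iterable, List


def fetch_stories(
    story_ids: Iterable[int],
    cache: Dict[int, str],
    load_from_db: Callable[[int], str],
) -> List[str]:
    """Return content for each ID, reading the cache first and the database on a miss."""
    contents: List[str] = []
    for story_id in story_ids:
        content = cache.get(story_id)
        if content is None:            # cache miss
            content = load_from_db(story_id)
            cache[story_id] = content  # populate the cache for the next hit
        contents.append(content)
    return contents
```

When most IDs are already cached, the database only sees the occasional miss, which is why the caching tier carries most of the read load in this design.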
So this is a simplified version of what actually happens. In fact, we actually are shipping the page all along, so we actually ship the page in about eight stages and that's so that the browser can start working on rendering in parallel with page generation.
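Shipping a page in stages can be illustrated with a generator that hands over each chunk as soon as it is built. This is only a sketch of the general idea; `ship_page` and the stage callables are hypothetical names, and the talk does not say how the stages are actually defined.

```python
from typing import Callable, Iterator, List


def ship_page(stages: List[Callable[[], str]]) -> Iterator[str]:
    """Yield each chunk of the page as soon as its stage has been generated."""
    for build_stage in stages:
        # Later stages are not built until the caller asks for them, so an
        # early chunk can be flushed to the browser while the rest is pending.
        yield build_stage()
```

A caller would write each yielded chunk to the response as it arrives, which lets the browser begin rendering before the final stage has been generated.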
So all of Facebook is made up of five standard servers and so we have -- we've numbered them one through five and each one focuses on a single major service at Facebook. So the web servers are all about lots of CPU, you need a lot of CPU to generate our website. The database servers, it's everything to do with IOPS. So we use flash for that. Hadoop uses a lot of CPU and lots of drives. And then photos, it's all about the lowest dollars per gig. So we store lots of photos, so we do a lot of optimization work there.
Now feed, that newsfeed service that I just talked about, uses both a lot of CPU and a lot of RAM, so we have copious amounts of both. Any new project, any other services at Facebook that want servers, they can have anything they want as long as it's one of these five servers. In fact, the capacity engineering team has t-shirts that say no on them. You really have to conform to one of these standard servers.
So what you get from there is volume pricing, which is huge when you are operating at scale. You also get repurposing. So if we have ten services that all have a forecast and there is some uncertainty in the forecast, being able to take unused servers from one service and reallocate them to another is a huge efficiency win. If we had a different kind of server for each service, then repurposing wouldn't be possible and we'd always be dealing with the worst case of all forecasts. So it's a surprising win, but it's kind of analogous to, I guess, managing a mutual fund: some things are up, some things are down, but across the board you do very well.
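The repurposing argument can be made concrete with a small calculation. With a dedicated server type per service, each service must be provisioned for its own worst case; with interchangeable servers, only the worst total has to be covered, and the worst total can never exceed the sum of the individual worst cases. The sketch below uses hypothetical function names and made-up scenario numbers purely for illustration.

```python
from typing import Dict, List

Scenario = Dict[str, int]  # service name -> servers needed in that scenario


def dedicated_capacity(scenarios: List[Scenario]) -> int:
    """Servers to buy when every service has its own hardware type:
    each service is provisioned for its own worst case."""
    services = scenarios[0].keys()
    return sum(max(s[name] for s in scenarios) for name in services)


def pooled_capacity(scenarios: List[Scenario]) -> int:
    """Servers to buy when hardware can be repurposed between services:
    only the worst total across scenarios has to be covered."""
    return max(sum(s.values()) for s in scenarios)


# Hypothetical forecasts: either web runs hot or feed runs hot, not both.
FORECASTS = [
    {"web": 100, "feed": 40},
    {"web": 80, "feed": 60},
]
```

With these illustrative inputs, dedicated hardware needs 160 servers while a shared pool needs 140.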
So you also get easier operations. In other facilities, not our facilities, you might have a server-to-technician ratio of about 450 servers to 1 technician. Because all of our servers are the same, we're able to attain about a 20,000 to 1 server-to-technician ratio. So these servers are all very easy to maintain, very easy to operate.
Some of the drawbacks. So we do have five major services. The other 35 services need to conform to that standard, so they may not fit quite perfectly, and there are also 200 minor services that are all important for Facebook. The other thing that happens is the needs of the services change over time, and we'll talk about how we have an idea to fix that.
So, in terms of efficiency, our data centers are extremely efficient at managing heat. In a poorly designed facility, for every watt of power that a server consumes, you are going to burn another watt just on air conditioning. If you do some nice optimizations, you can get to about 0.5 watts of wastage. At Facebook, we have what's called a PUE of 1.07, which is the ratio of the total power consumed at the street divided by the total power consumed by all the servers, which means that for every watt of power the servers consume, we are only burning another 7% on cooling. We do that by essentially eliminating all air conditioning. We've talked about this project a lot, and it's really a huge efficiency win for us.
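The PUE arithmetic quoted here is simple enough to write down. The sketch below follows the definition given in the talk (total facility power divided by server power); the function names are hypothetical.

```python
def pue(total_facility_watts: float, server_watts: float) -> float:
    """Power usage effectiveness: total power at the street over server power."""
    return total_facility_watts / server_watts


def overhead_per_server_watt(pue_value: float) -> float:
    """Extra watts spent on cooling and other overhead for each watt a server draws."""
    return pue_value - 1.0
```

One extra watt per server watt corresponds to a PUE of 2.0, half a watt of wastage to 1.5, and a PUE of 1.07 to about 0.07 watts of overhead per server watt.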
In terms of servers, the vanity-free design helps, but what also helps us is really running a good supply chain. So when you have lots of volume and you have very predictable purchases, you can really get a lot of cost out of those servers. And probably some of the largest efficiency wins at Facebook come from our software. So we have a tremendous software efficiency effort; one that I would highlight would be HHVM, or HPHP, which we've talked about a lot. This is the core piece of software that replaces Apache and PHP in our infrastructure.
Now if we were to go back and use Apache PHP which is very common, we would have to buy four times as many web servers as we have today. And we have added more web servers than anything, so this is a massive efficiency win for us. It’s all open source, we talk about it a lot but software efficiency is huge for Facebook.
So in terms of next opportunities, I am going to talk about an idea called disaggregated rack and then we will get into some new components that we think could be interesting. So when you think about that rack of newsfeed servers, yes, there are 40 servers and they are all identical, but think about what they are actually putting online: you have about 80 processors of compute, about 6 terabytes of RAM, 80 terabytes of storage and up to terabytes of flash. Now the application lives on the rack of equipment, not a single server, so we are not server-centric, we are rack-centric.
And so what we think we can do is take these components, stack them in a different way and put them online in that rack in a different way that can provide some really nice efficiency wins. So we think of our building blocks as compute, which is really just the standard server; RAM, so a RAM sled, a server with a lot of RAM on it; a storage sled, which is again just one of our Knox servers of 15 drives, where you replace that SAS expander with a small server in the back; and then flash, where we have a flash appliance, or flash sled, rather than individual PCIe flash cards.
And the three wins of disaggregated rack are server/service fit, so when you have only five servers you really want to be able to fit those servers to the service, you really want to be able to provide the exact right resource, and we will talk a little bit about that; then server/service fit over time; and then a longer useful life. Now for that server/service fit, what we have on the left here is a type six server. A type six server provides so much CPU and so much RAM, so the dotted line is the server: it provides this much RAM and this much CPU. The consumption is that arrow, and because we designed the type six server for Multifeed, or newsfeed, it fits perfectly: the exact right amount of RAM and the exact right amount of CPU.
Now if we give it to another service, say search: search doesn't need as much CPU, it's actually very RAM-hungry. And so we actually only need about half as much CPU as what that server provides. Now if we differentiate the SKU, if we give them a different type of server, then we start having problems with our volume, we start having problems with repurposing. And so effectively that's a wasted CPU resource. It's not the worst thing, but it is an area of waste.
Now the other thing that can happen is that a service's needs could change over time. So in the beginning they might be a perfect fit, but then a year or two in they might need more RAM. So their CPU is fine, but they need more RAM. Well, every company faces this problem: if you have one thing that does the job and you run into one bottleneck, you buy a whole other thing. So if I just bought 40 servers and they need twice as much RAM, I am going to buy another 40 servers, or if it's Facebook, you have 2,000 servers and then they need twice as much RAM, so you buy another 2,000 servers. So being able to grow RAM independently of CPU could be very important.
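Both the fit problem and the growth problem come from buying CPU and RAM as one bundle. The sketch below works through that arithmetic under simplifying assumptions: resources are measured in hypothetical units where one server supplies one unit of CPU and one unit of RAM, and the function names are made up for illustration.

```python
import math


def coupled_servers(cpu_needed: float, ram_needed: float,
                    cpu_per_server: float, ram_per_server: float) -> int:
    """Identical servers required when CPU and RAM can only be bought together:
    the scarcer resource decides the count."""
    return max(math.ceil(cpu_needed / cpu_per_server),
               math.ceil(ram_needed / ram_per_server))


def stranded_cpu_fraction(cpu_needed: float, servers: int,
                          cpu_per_server: float) -> float:
    """Share of the purchased CPU that the service never uses."""
    return 1.0 - cpu_needed / (servers * cpu_per_server)
```

For example, a service that fits 40 servers perfectly and then needs twice the RAM ends up on 80 servers with half of the CPU idle, and a RAM-hungry service that needs only half the CPU of its 40 servers strands the same fraction. If RAM could be added on its own, the compute count would stay at 40 in the first case.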
The other thing that you can get out of this disaggregated rack is a longer useful life. So everybody everywhere gets rid of a computer when they think that it no longer fits what they need: I don't have enough CPU, I don't have enough RAM, I don't have enough flash or disk space. In general, most people refresh their servers every three years. But with disaggregated rack, because it's just one resource, we think we can keep compute for more like 3 to 6 years. RAM doesn't go bad, right, it's just RAM; solid-state devices last forever. And so those could be five years or more; the disk sled, easily four to five years; and the flash sled could be easily six years or even 10.
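The effect of per-component lifetimes on yearly hardware cost can be sketched with straight-line depreciation. The prices below are hypothetical relative units, not figures from the talk, and the lifetimes take the low end of the ranges quoted above; treating the four-to-five-year figure as the storage component is also an assumption.

```python
from typing import Dict


def yearly_hardware_cost(prices: Dict[str, float],
                         lifetime_years: Dict[str, float]) -> float:
    """Straight-line yearly cost when each component is replaced on its own schedule."""
    return sum(price / lifetime_years[name] for name, price in prices.items())


# Hypothetical relative prices for one rack's worth of each building block.
PRICES = {"compute": 30.0, "ram": 30.0, "storage": 20.0, "flash": 20.0}

# Everything refreshed together every three years.
UNIFORM = {name: 3.0 for name in PRICES}

# Low end of the quoted lifetimes (the storage mapping is an assumption).
PER_COMPONENT = {"compute": 3.0, "ram": 5.0, "storage": 4.0, "flash": 6.0}
```

Under these illustrative inputs the yearly cost falls from about 33.3 units to about 24.3 units, even with compute still refreshed every three years.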
And so if you think of disaggregated rack for graph search, rather than have 40 identical servers, we could have a mix of these building blocks: so many compute units, maybe one flash sled, a couple of RAM sleds and a storage sled. And the big thing that's different here comes later, when graph search has a CPU win, when they are more efficient in CPU. What that means is that they can handle more traffic with the same amount of flash and RAM. Well, if they are more efficient, then we want to add more flash or add more RAM and get more traffic going to that rack, and the way to do that is to just add another flash sled.
So in other words, looking in the other direction, if at some point this particular service needs twice as much flash, it is far more cost-effective to just add in another sled of flash to double the amount of flash, so you have only paid for that incremental cost of flash as opposed to buying a whole separate rack. So you are just adding that marginally needed component.
The strength of disaggregated rack is that we maintain volume pricing, serviceability, all of that, but then we are able to do custom configuration, so we better fit the services specifically. We can also change hardware easily over time and do smarter technology refreshes. This will also help with just speed of innovation. So if you only have to build a new component, and it just has to support that new thing, whatever that new greatness is, you can just build that one thing and then slam it into a rack of older, well-established SKUs.
Now the potential issues there are that physical changes are required, so those data center technicians are going to have to do a little bit more work. That's okay, we can hire more of those guys. And there is some interface overhead.
Now the approximate win estimates. When I say OpEx, I really mean depreciation and power. Conservative assumptions show a 12% to 20% OpEx savings, and more aggressive assumptions show between 14% and 30% OpEx savings, and these are just reasonable savings. So this kind of approach of disaggregating the components is something that of course we can do, but pretty much anybody in the industry can do it; it just really requires being able to have your software adjust to this different model of compute.
So what are we really talking about here? Over the last 20 years, so up in the upper left-hand corner is a large tower case, it holds a 386 computer from 1992, and down at the bottom is Mere [ph] Michaels, and he is holding a Facebook server from 2012.
Architecturally, nothing has changed between the 386 and what we're serving today. It's still a computer, a processor that assumes that all of the RAM is local, all the drives are local, all of those peripherals are still on a PCIe bus, the RAM bus is all pretty much the same. Architecturally, the servers that we install in a data center are no different than a desktop from 20 years ago.
There have been a lot of innovations that have been huge. So the math coprocessor, that was great. It turns out if you have a dedicated chip to do math, it's much faster than a general processor. SMP allowed two processors per server, and having two processors is better than having one because you can amortize the rest of the server cost over twice as much compute. Multi-core processors, that was key in scaling Moore's Law. GPUs do vector math very well, and flash memory is the most recent game changer.
Fundamentally, though, it's all the same thing: it is a processor running assuming everything is local. And so while we have had exponential improvements in CPU, RAM, disk and NIC, and all of this has been amazing progress, we're still operating on the same model. And so one of the things that we are hoping to accomplish with disaggregated rack and ideas like this is to break this model up. Think about the rack, rather than a server, as the unit of compute in a data center. For any of you that have been doing this for a long time, this is no different than a mainframe. Computers repeat themselves over and over and over again; all technologies go through these cycles.
The big thing that's changed is that network bandwidth is now amazingly high. And whenever you have a lot of network bandwidth, whenever you have a really high-bandwidth backplane, you disaggregate your components. This happens again and again, and it's now time. We have 10 Gb NICs in all of our servers, and it is completely conceivable that we can go to 25, 40 or even 100 Gb in the next few years. So network is not a problem, which means that this kind of approach is perfectly reasonable.
Talking about some of these components, we also see ways in which these components can evolve. Now with CPU, server CPUs are really just compute-dense versions of desktop processors. So the desktop world consumes lots and lots of processors, maybe 4-core processors; if you double the cores, then you've got a server processor. So you've got all of this nice volume in the desktop world, and servers are just hanging off of that.
One of the things that's been a very popular idea in the mobile world and with other embedded processors is this idea of a system on a chip. So when you build servers and you build them over and over again, you notice that the CPU, the PCH, the NIC, they are all the same. So there are some components that you always buy together when you build a server. Fundamentally these are all just standard IC chips. And so what system on a chip is, and this is a popular idea that's gaining some traction, is when you're building a processor, you build a regular processor but then leave a little bit more silicon, license a NIC design, license a PCH design and essentially put all of that onto the same processor.
If you're buying three components all at the same time, why not just buy one component? And so system on a chip does that, and the non-obvious win with system on a chip is that it simplifies the rest of the computer. So when you're building a motherboard, when you're building pretty much any modern server, you might have an eight-layer board, or even a 12-layer board, that has a lot of complexity. But if you pull a lot of the components into a single chip, then all of this becomes a much simpler design, and simpler means higher yield, which means much cheaper. So pulling these things together, the stuff we buy anyway all the time, the system on a chip design is a really nice win.
So really, SoC processors developed for servers that are derived from either the mobile or desktop market can be very power- and cost-efficient. It's really a nice win, lots of people see it, and we expect this to be very popular over the next few years.
In terms of RAM, there is something called the DDR standard. The DDR standard talks about how RAM is designed. DDR allowed for the commoditization of RAM. So you have multiple vendors and they are all hitting a single standard, that DDR standard, and it's difficult to build a completely different kind of memory and hit the performance of standard RAM. And so there is no space for that in computers right now. You either have standard RAM, which is built on capacitors and consumes lots of power, or there is nothing. And then if you want something interesting you have to put it on the PCIe bus, which is why you see PCIe flash cards.
And so we think that there is space, and there is a lot of interesting opportunities for a separate kind, a separate class of RAM and so what we expect is that for lower latency memory technology there's an idea of near RAM, which is your standard RAM and far RAM and where far RAM could be slower RAM.
It's like this: the stuff you are going to work on very frequently, the stuff that you need all of the time, put that in regular RAM. The stuff that you need once every hour or once every 15 minutes, put that on the slower stuff. You spend more money on the fast stuff, you spend less money on the slower stuff, everybody wins. It's a nice analogy to how we have been using flash in our data centers. We use flash kind of as a RAM substitute. So we have a large data set that we need to keep, and we can't keep all of it in RAM because RAM is really pretty expensive; flash, on a dollars-per-gig basis, is more dense, more power-efficient and cheaper. And so we put a lot of the data in flash and then keep only the more recently accessed stuff in RAM. That can be expanded to regular computers.
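The near and far split described here behaves like a small fast tier that keeps recently accessed items in front of a larger, slower tier. The following is a minimal sketch of that idea using least-recently-used eviction; the `TieredStore` class is hypothetical, LRU is an assumed policy, and `near_capacity` is assumed to be at least 1.

```python
from collections import OrderedDict
from typing import Dict, Hashable


class TieredStore:
    """Small fast 'near' tier in front of a large, slower 'far' tier."""

    def __init__(self, far: Dict[Hashable, bytes], near_capacity: int) -> None:
        self.far = far                     # holds the full data set
        self.near = OrderedDict()          # recently accessed items only
        self.near_capacity = near_capacity
        self.near_hits = 0
        self.far_reads = 0

    def get(self, key: Hashable) -> bytes:
        if key in self.near:
            self.near.move_to_end(key)     # mark as most recently used
            self.near_hits += 1
            return self.near[key]
        value = self.far[key]              # slower path
        self.far_reads += 1
        self.near[key] = value             # promote into the fast tier
        if len(self.near) > self.near_capacity:
            self.near.popitem(last=False)  # evict the least recently used key
        return value
```

The hit counters make it easy to check how much of a workload the fast tier absorbs for a given capacity, which is the trade-off between spending on fast memory and spending on slow memory.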
Flash. So SSD drives are commonly used in databases and applications that need low-latency, high-throughput storage. And the flash industry has been focused on driving higher and higher write endurance and performance. So they have been going for better and better flash cards and better and better flash devices. We actually think there is an opportunity in looking in the opposite direction, towards very poor quality flash: low write endurance, highly dense flash, low IOPS performance. So the opposite of what computers normally do, which is bigger, better, faster; we want to go the other way.
We think that a cold flash storage option is possible. So some of the workloads that we see, and some of the workloads we think pretty much everyone will see, are ones where you want to write it once and read it all the time, but you don't have to update it. So for us that's photos. When somebody uploads a photo, we want to keep that photo forever. We don't want to change it; we are not doing editing on the photo, and even if we did, only a few people would do it.
The fact is that write-once, read-many flash or an alternative solid-state technology could provide extremely high-density storage at a reasonable cost. And so just to run some very basic numbers on this: if you look at a rack of Knox drives, this is the system that we use for our cold storage, archival storage, you have about 2 petabytes of data in a rack. It draws about 1.5 kW of power, and that's because many of the drives are spun down, so they are not consuming any power. It weighs about 2,500 pounds and consumes about 0.8 W per terabyte. Now compare that to a rack of SSD drives, where we have done nothing special, just put a ton of laptop drives in a rack. We didn't actually build it. We get about 4 petabytes of data, it draws about 1.9 kW of power, and that's without any optimization, it weighs a little bit less and it consumes about half a watt per terabyte. So the power density of a rack of drives in cold storage is almost two times that of a solid-state option that's built from currently available laptop drives.
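The watts-per-terabyte comparison can be checked with a one-line calculation. In the sketch below the disk rack is taken as roughly 2 PB at about 1.5 kW, which is an assumption chosen to be consistent with the quoted 0.8 W per terabyte; the solid-state rack uses the quoted 4 PB and 1.9 kW. The function name is hypothetical and 1 PB is treated as 1,000 TB.

```python
def watts_per_terabyte(rack_kilowatts: float, rack_petabytes: float) -> float:
    """Power drawn per terabyte stored, with 1 kW = 1000 W and 1 PB = 1000 TB."""
    return (rack_kilowatts * 1000.0) / (rack_petabytes * 1000.0)


# Assumed disk rack figures (see the note above) and quoted solid-state figures.
DISK_RACK = watts_per_terabyte(rack_kilowatts=1.5, rack_petabytes=2.0)
SSD_RACK = watts_per_terabyte(rack_kilowatts=1.9, rack_petabytes=4.0)
```

Under those inputs the disk rack comes out at 0.75 W per terabyte and the solid-state rack at just under 0.5 W per terabyte, a ratio of roughly 1.6.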
No optimization yet, just buying stuff off the street. And we think that a focused effort on a WORM solid-state option could yield much higher density and longer hardware lifetime at a reasonable cost. So with all of this massive data growth, today the industry produces 3 ZB of hard drives, going to about 20 ZB of hard drives in 2020. A large portion of the data that's going to be stored on those drives is really write once. Quite frankly, it's probably write once, read never. But let's imagine that we do read it; WORM could be just fine. And so we really think that solid state for permanent storage would be very, very interesting.
That’s it from me.
Stephen Ju – Credit Suisse
Jason, we probably have time for one short question, how is that materially different versus the other technologies that are already widely available?
Jason Taylor – Director of Infrastructure
Sure. When we ask how cold flash is materially different than the current flash: whenever you think about flash today, you think about NAND-based flash, and NAND-based flash is great, that's the latest incarnation of flash. In fact, we had EPROMs and EEPROMs long before that, and those were designed with completely different silicon. What we're saying here is, if we design a solid state device from the silicon up that's focused on very high-density data storage, but it's not mutable, it doesn't change, then we can get a much higher bit density at very low power. So it is an alternative approach to the whole silicon, but from what we have looked at, it looks very possible, and it feels like we can probably get that to market in the next three to five years.
Stephen Ju – Credit Suisse
I think with that, we are actually out of time. So thanks very much, Jason.