Distributed Caching with Memcached

Cut the load on your Web site's database by adding a scalable object caching layer to your application.

Memcached is a high-performance, distributed caching system. Although application-neutral, it's most commonly used to speed up dynamic Web applications by alleviating database load. Memcached is used on LiveJournal, Slashdot, Wikipedia and other high-traffic sites.

Motivation

For the past eight years I've been creating large, interactive, database-backed Web sites spanning multiple servers. Approximately 70 machines currently run LiveJournal.com, a blogging and social networking system with 2.5 million accounts. In addition to the typical blogging and friend/interest/profile declaration features, LiveJournal also sports forums, polls, a per-user news aggregator, audio posts by phone and other features useful for bringing people together.

Optimizing the speed of dynamic Web sites is always a challenge, and LiveJournal is no different. The task is made all the more challenging because nearly any content item in the system can have an associated security level and be aggregated into many different views. From prior projects with dynamic, context-aware content, I knew from the beginning of LiveJournal's development that pregenerating static pages wasn't a viable optimization technique. The constituent objects' cacheability and lifetimes differ so widely that you end up making a bunch of sacrifices and wasting a lot of time precomputing pages more often than they're requested.

This isn't to say caching is a bad thing. On the contrary, one of the core factors of a computer's performance is the speed, size and depth of its memory hierarchy. Caching definitely is necessary, but only if you do it on the right medium and at the right granularity. I find it best to cache each object on a page separately, rather than caching the entire page as a whole. That way you don't end up wasting space by redundantly caching objects and template elements that appear on more than one page.
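The per-object approach can be sketched in a few lines. This is an illustrative toy, not LiveJournal's actual code; the `ObjectCache` class, `get_or_compute` method and `profile:brad` key are all made-up names:

```python
import time

class ObjectCache:
    """A toy in-process cache. Each page component is cached under its
    own key, so an object shared by several pages is stored only once."""

    def __init__(self):
        self._store = {}  # key -> (value, expiry timestamp)

    def get_or_compute(self, key, compute, ttl=60):
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.time():
            return entry[0]          # cache hit: skip the expensive work
        value = compute()            # cache miss: do the work (e.g., a DB query)
        self._store[key] = (value, time.time() + ttl)
        return value

db_queries = []

def load_profile():
    db_queries.append("SELECT ...")  # stand-in for a real database hit
    return {"user": "brad", "interests": ["perl", "caching"]}

cache = ObjectCache()
# Two different pages both need the same profile object; only the
# first request pays for the database query.
page_a = cache.get_or_compute("profile:brad", load_profile)
page_b = cache.get_or_compute("profile:brad", load_profile)
```

Caching the profile object once, instead of inside each rendered page, means any number of views can reuse it without duplicating it in the cache.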

In the end, though, it's all a series of trade-offs. Because processors keep getting faster, I find it preferable to burn CPU cycles rather than wait for disks. Modern disks keep growing larger and cheaper, but they aren't getting much faster. Considering how slow and crash-prone they are, I try to avoid disks as much as possible. LiveJournal's Web nodes are all diskless, netbooting off a common yet redundant NFS root image. Not only is this cheaper, it also requires significantly less maintenance.

Of course, disks are necessary for our database servers, but there we stick to fast disks with fast RAID setups. We actually have ten different database clusters, each with two or more machines. Nine of the clusters are user clusters, containing data specific to the users partitioned among them. One is our global cluster, with non-user data and the table that maps users to their user clusters. The rationale for independent clusters is to spread writes. The alternative is one big cluster with hundreds of slaves, but such a monolithic cluster spreads only reads. Diminishing returns set in as each new slave is added, because an increasing share of each slave's capacity is consumed by the writes necessary to stay up to date.
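The partitioning scheme above amounts to one lookup in the global cluster, then a write that touches a single user cluster. The sketch below is illustrative only; the cluster names and the user-to-cluster map are invented, and the real mapping lives in a database table on the global cluster:

```python
class UserCluster:
    """Stand-in for one user database cluster."""
    def __init__(self, name):
        self.name = name
        self.writes = 0

    def write(self, row):
        self.writes += 1  # a real cluster would run an INSERT/UPDATE here

# Nine user clusters, as described in the article.
clusters = {"cluster%d" % i: UserCluster("cluster%d" % i) for i in range(1, 10)}

# The global cluster's user->cluster mapping (illustrative data).
user_cluster_map = {"brad": "cluster3", "whitaker": "cluster7"}

def write_user_data(user, row):
    """One lookup against the global map, then the write lands on a single
    cluster, so write load is spread rather than replicated everywhere."""
    clusters[user_cluster_map[user]].write(row)

write_user_data("brad", {"entry": "hello world"})
write_user_data("whitaker", {"entry": "hi"})
```

Contrast this with a master plus hundreds of slaves, where every one of those writes would have to be replayed on every slave.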

At this point you can see LiveJournal's back-end philosophy:

  1. Avoid disks: they're a pain. When necessary, use only fast, redundant I/O systems.

  2. Scale out, not up: many little machines, not big machines.

My definition of a little machine is more about re-usability than cost. I want a machine I can keep using as long as it's worth its space and heat output. I don't want to scale by throwing out machines every six months, replacing them with bigger machines.

Where to Cache?

Prior to Memcached, our Web nodes unconditionally hit our databases. This worked, but it wasn't as efficient as it could have been. I realized that even with 4GB or 8GB of memory, our database server caches were limited, both in raw memory size and by the address space available to our database server processes running on 32-bit machines. Yes, I could've replaced all our databases with 64-bit machines with much more memory, but recall that I'm stubborn and frugal.

I wanted to cache more on our Web nodes. Unfortunately, because we're using mod_perl 1.x, caching is a pain. Each process, and thus each Web request, is in its own address space and can't share data with the others. Each of the 30–50 processes could cache on its own, but doing so would be wasteful.

System V shared memory has too many weird limitations and isn't portable. It also works only on a single machine, not across 40+ Web nodes. These issues reflect what I saw as the main problem with most caching solutions: even if our application platform were multithreaded, with data easily shared between processes, we still could cache on only a single machine. I didn't want all 40+ machines caching independently, duplicating information.
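What makes a shared, cross-machine cache possible is client-side hashing: every client hashes each key to the same server in a common pool, so all web nodes agree on which server owns a key and nothing is duplicated. A minimal sketch of the idea, with placeholder server addresses and a simple modulo scheme rather than any particular client library's algorithm:

```python
import hashlib

# Placeholder server pool; any machine's spare RAM can join it.
SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def server_for(key):
    """Hash the key and map it onto one server in the pool. Because every
    web node runs the same function over the same server list, they all
    agree on which server owns a given key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# The same key always lands on the same server, from any client.
owner = server_for("user:profile:1234")
other = server_for("user:profile:1234")
```

The net effect is that the combined RAM of all the servers behaves like one big shared cache, instead of 40+ small independent ones.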

______________________

Comments


Very nice and informative

anand sarwade

Very nice and informative article.

iehsan


Can PHP and Java set and get data from the same Memcached?

JKnight

My system uses both PHP and Java. Can PHP and Java set and get data from the same Memcached without errors?

In my case, when PHP sets data in Memcached with compression, Java cannot decompress that data.

Could you help me?
Thanks a lot for your support.

How to configure memcache on CentOS 5.4?

Anonymous

NCache-Memcached

Anonymous

Memcached is an open-source distributed cache that speeds up applications by caching application data and reducing database load. Memcached is free, but it also has limitations, in cache reliability and high availability, which can cause serious problems for mission-critical applications.

NCache-Memcached removes these limitations without requiring you to abandon the Memcached API or your investment in your existing code.

NCache-Memcached is 100% Memcached-compatible for .NET and Java and gives you reliability through scalable cache replication.

Does NCache have a client for Memcached?

Anonymous

Hi, I was just curious whether NCache has a client for Memcached... it had something for NHibernate... if it does, this would be a great product to use for distributed caching.

Does NCache have a client for Memcached?

Anonymous

Yes, NCache does have a client for Memcached.

Josh.

Can you provide the distributed file system used in Facebook?

DIMEcyborgs

You have mentioned the caching mechanism in Facebook. Is there any way I can find out which distributed file system is being used in the Facebook scenario?

Gear6 Web Cache?

SFcacheguy

Can anyone with experience comment on the memcached distribution from Gear6? Thanks.

On the Gear6 memcached distribution

Mark Atwood

Hi! I'm the Director of Community Development for Gear6.

The Gear6 distribution of Memcached is a heavily modified fork of version 1.2 of the BSD-licensed open-source implementation of Memcached.

On a protocol level, it speaks the memcached text protocol. All existing memcached clients should Just Work. Implementation of the binary protocol is on our roadmap.

Unlike the community version of memcached, the Gear6 version is not intended to be run on whatever servers you have handy. Instead, it is bundled into a rack-mount appliance and uses flash memory as a fast, high-density secondary cache as the available RAM fills up.

There are a couple of other features of the Gear6 implementation, including a web-based GUI, a REST-based management API and support for setting up high-availability pairs, so that a hardware failure does not cause a node failure in your memcached fleet.

If you want to play around with the Gear6 implementation, you can go to our website and download a VM image (this does not use flash, of course), or you can go to Amazon Web Services and start up specially bundled EC2 AMIs. Details on doing this are, of course, on our website.

Please do play with our product, and feel free to publicly post what you think of it.

Thanks!

.. Mark Atwood

If I have 3 servers, each with 1GB

Anton Ongsono

If I have 3 servers, and each server has 1GB of memory allocated for memcache, and I use distributed memcache, does that mean I have 3GB of memcache allocated in total?

Distributed Caching using NCache

Jim Proint

NCache has been the caching solution of choice for various mission critical applications throughout the world since August 2005. Its scalability, high availability and performance are the reasons it has earned the trust of developers, senior IT and management personnel within high profile companies. These companies are involved in a range of endeavors including e-commerce, financial services, health services and airline services.

The range of features that NCache currently showcases is highly competitive and revolutionary. It supports dynamic clustering, local and remote clients, advanced caching topologies like replicated and mirrored, partitioned and replicated partitioned. It also provides an overflow cache, eviction strategies, read-through and write-through capabilities, cache dependencies, event notifications and object query language facilities. For a complete list of features and details please visit http://www.alachisoft.com/ncache/index.html.

Download a 60-day trial enterprise/developer version or the totally free NCache Express from www.alachisoft.com/download.html

Team NCache

How to do distributed memcache?

Anton Ongsono

How do I do distributed memcache in PHP? Using Memcache::addServer? I have tried it but can't get it to work :( Please advise, thanks.

It's a very nice article, thanks

Wesley23

It's a very nice article.
Thanks for sharing this very good and helpful topic and comments.

Hi, whenever I am storing

Anonymous

Hi,

Whenever I store any data in the memcache server using the PHP Memcache client or MySQL UDFs, I am not able to retrieve the same value using a Java client.
I am using Danga's memcache client.
Can you tell me where the problem could be?

great article

Praful Todkar

Great article on memcached.

There is a small point of confusion in my mind.

"Once the bucket number has been calculated, the list of nodes for that bucket is searched, looking for the node with the given key."

When you say "node" here, you mean the actual element entries in the hash bucket, right? "Node" was used earlier to refer to a machine on the network, as in a web node; hence the confusion. Thanks!

memcached is great!

mosh

Hello ;]
It's a very nice article.
memcached is a very powerful module for websites with high traffic.

A few days ago I installed it on my Counter-Strike forum.
Before memcached, I sometimes had a server load over 200 :( and the server was crashing.
Now I have a load of 1-20.
Thanks for memcached :)
Thanks for the great article :)

Greetings, mosh

thanks a lot!

seanlin

I've been looking for this kind of solution everywhere!
Thanks a lot.

Deployed a system using

rkenshin

Deployed a system using memcached through Hibernate. Pretty decent.
I'm still looking for a way to probe the memcached server. Is there a way you know of?
I'm also looking into mixing MySQL for read-only tables with memcached.
Thanks for the article.

Cheers

memcached on a single server

Micheal

Hi folks,

Is there a good PHP class that you can suggest for memcached operations, like:

class MemcacheOperations {
    function insert($q) {
        // insert into memcache and insert into MySQL, etc.
    }
}

I know I can roll my own, but I'm too lazy these days.

Memcache class available in PECL

Adam Nelson

http://us2.php.net/manual/en/ref.memcache.php is what you're looking for.

Alternatives

Nati Shalom

"If you can get away with running your threaded application on a single machine and have no use for a global cache, you probably don't need Memcached. Likewise, SysV shared memory may work for you, if you're sitting on a single machine."

Beyond the alternatives mentioned above, there are several others commonly used in the Java world, such as GigaSpaces, Terracotta and Tangosol, to name a few. Some of those alternatives provide .NET and C++ support as well.

I happen to represent GigaSpaces, so I can speak on its behalf.

GigaSpaces is used in many transactional systems today, in very large clusters, to store terabytes of data. It is used in mission-critical systems such as trading and pre-paid applications; that is, it is tuned to support extensive write operations, not just mostly-read workloads. It is also used for session replication and scaling of large websites.

A free version is available through our Community Edition; for more information, refer to this page.

Beyond that, we provide a complete platform that addresses the scalability of the entire application using a scale-out server model.

You can read more on our site:
http://www.gigaspaces.com

Nati S.

Can a Java client be used with the 'C' memcached server?

Jason

Can a Java client be used against the 'C' version of the memcached server, or do I have to run a Java version of the memcached server?

If you are using Java you

Nati Shalom

If you are using Java, you should probably look at one of the pure-Java alternatives that I mentioned above.

Nati S.

I'm currently working for a

Yaron A.

I'm currently working for a very big company with 200,000,000 accounts in the DB, and 6,000,000 hits per hour on special days.

A Java client with the C memcached server works very well.

You can use Prevayler for Java

Anonymous

You can use Prevayler for Java.
It's persistent, auto-recoverable, etc.

Re: Distributed Caching with Memcached

obiwantcp

Speaking to Tim's concern, Linux Virtual Server contributor Li Wang has kindly implemented an open-source TCP handoff implementation for the Linux kernel. If you take such source code further, you could conceivably ensure web requests transparently go to the right servers in the first place (to retrieve the content). There are papers detailing such tactics; one, by Eric Van Hensbergen, is located here:

http://citeseer.ist.psu.edu/vanhensbergen02knits.html

question

shanx.lu

I use memcached to store my count data, starting the daemon with:

memcached -d -m 2048 -p 9876

After this, I use the PHP API to store my data. Here is the code:

<?php
include_once "MemCachedClient.inc.php";

$options["servers"] = array("*.*.*.*:9876");
$options["debug"] = false;
$memc = new MemCachedClient($options);
$path = "dongdong";

for ($i = 0; $i < 10000; $i++)
{
    $get = $memc->get($path);
    if (!$get)
    {
        if ($i != 0)
        {
            echo "error ".$i."\n";
            //continue;
            break;
        }
        $memc->set($path, "1");
        $show = 1;
    }
    else
    {
        echo $get." correct ".$i."\n";
        $show = $get + 1;
        $memc->replace($path, $show);
    }
}
?>

All is OK, but when it reaches around iteration 1988, the loop breaks. I tried several more times; all failed. Why?
Here is the error output:
MemCache: replace dongdong = 10952
sock_to_host(): Failed to connect to 192.168.241.109:9876
sock_to_host(): Host 192.168.241.109:9876 is not available.
sock_to_host(): Host 192.168.241.109:9876 is not available.
sock_to_host(): Host 192.168.241.109:9876 is not available.

Re: Distributed Caching with Memcached

Anonymous

Interesting, but could someone please fix the missing image?
http://www.linuxjournal.com/7451f1.png does not exist!

Thank You

distributed HASHing

Anonymous

You might be interested in the distributed hash stuff in
Chord. I think it's related to what you're doing with memcached. You might be able to use those ideas to improve the two level hash.

Re: Distributed Caching with Memcached

timstarling

Memcached's big claim is that it's faster than a database, which may well be true. But with no local caching, it certainly can't compete with a true distributed memory system like those commonly used in supercomputers. With memcached, if you have a data item which is needed on every web request, then that data item will be sent across the network from the same server on every request.

Memcached also has a lingering problem with slab reassignment. If your application uses one particular size class heavily and doesn't use another, then writes to the unused size class (when they eventually occur) will fail. The daemon can't automatically recover memory from the other slabs for use in an empty slab. Similarly, it's not a true LRU cache; the item dropped will always be an item from the same slab. The lifetime of an item in the cache is skewed by the differing amounts of memory allocated to each slab after a restart.

At Wikipedia, we've also had perennial problems with writes failing, probably due to high load from other processes on the server leading to a connection timeout. This is unfortunate when the write is an important cache invalidation operation.

Tim Starling (Wikipedia developer)

At Wikipedia, we've also

Nati Shalom

"At Wikipedia, we've also had perennial problems with writes failing, probably due to high load from other processes on the server leading to a connection timeout. This is unfortunate when the write is an important cache invalidation operation."

Tim, you can use an in-memory data grid for exactly that purpose: in-memory data grids can be used as the system of record and therefore handle writes as well as reads, synchronizing with your database as a background process. I refer to this as Persistence as a Service (PaaS).

Nati S.

GigaSpaces
Write Once Scale Anywhere

in regards to the comments by Tim Starling

George Daswani

That's why you use something like Tangosol Coherence instead. It has various caching topologies like near-cache, replicated and partitioned, and it has distributed locking when needed. Moreover, it implements a cache-loader mechanism and can be used as a write-back cache.

It's not free, and it's a Java-only product. It would be nice if the memcached developers looked at its feature set and used it as a roadmap.
