Caching the Web, Part 2

This month Mr. Guerrero tells us about the definitive proxy-cache server, Squid.
Joining a Hierarchy

If your cache is to be part of a cache mesh, or your proxy server is to be connected to another proxy that will be its parent, you must use the cache_peer directive (known as cache_host in older Squid releases). You must include one line for each of your neighbors. The syntax for this line is:

cache_peer hostname type http_port icp_port [options]

where:

  • hostname is the name of your neighbor.

  • type is one of parent or sibling.

  • http_port is the neighbor's port from which to fetch objects.

  • icp_port is the port to which ICP queries are sent. Use a value of 0 if your neighbor does not run ICP, or 7 if your neighbor runs the UDP echo service. This can help Squid to detect if the host is alive.

You can specify the default option to use this host as a last resort when you can't speak ICP with your parent cache. Another option, weight=N, favors a specific parent or sibling in the neighbor selection algorithm; larger values give higher weights.
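For instance, a cache with two parents might be configured like this (the host names are hypothetical):

cache_peer proxy1.example.com parent 3128 3130 weight=2
cache_peer proxy2.example.com parent 3128 3130 default

Here proxy1.example.com is favored by the neighbor selection algorithm, while proxy2.example.com is used only as a last resort when ICP gets no answer.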

If you have a stand-alone cache, you should not include any of these directives. If you have one parent that runs its HTTP port on 3128 and its ICP port on 3130, the line to include in the squid.conf file is (with your parent's real host name in place of parent.example.com):

cache_peer parent.example.com parent 3128 3130

With the cache_peer_domain directive, you can limit which neighbors are queried for specific domains. For example (again with hypothetical neighbor names):

cache_peer_domain cache1.example.com .com .edu
cache_peer_domain cache2.example.com .es .fr .uk .de

will query the first cache only for the .COM and .EDU domains, and the second only for the listed European domains.

If you have only one parent cache, the overhead of the ICP protocol is unnecessary. Since you are going to fetch all objects (HITs and MISSes) from that parent anyway, you can use the no_query option of the cache_peer directive to skip ICP entirely and send only HTTP requests to that cache.
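With a single parent the line might then read (the host name is again a placeholder, and recent Squid releases spell the option no-query):

cache_peer parent.example.com parent 3128 3130 no_query default

The ICP port is still declared, but no queries are ever sent to it.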

Also, there are some domains you will always want to fetch directly rather than through your neighbors. Your own domain is a good example: fetching objects belonging to your local web servers from a faraway cache is not efficient. In this case, use the always_direct directive with an ACL. For example, in our organization we use:

acl intranet dstdomain mec.es
always_direct allow intranet

to avoid getting our own objects from the national cache server.

The Cache Manager

Squid includes a simple, web-based interface called cachemgr.cgi to monitor the cache performance and provide useful statistics, such as:

  • The amount of memory being used and how it is distributed

  • The number of file descriptors

  • The contents of the distinct caches it maintains (objects, DNS lookups, etc.)

  • Traffic statistics for each client and each neighbor

  • The “Utilization” page, where you can check the percentage of HITs your cache is registering (and thus the bandwidth you are saving)

Be sure to copy the cachemgr.cgi program installed in /usr/local/squid/bin (or wherever you chose) to your standard CGI directory, and point your browser to http://your.cache.host/cgi-bin/cachemgr.cgi. There, type your cache host name, usually “localhost” or the name of your system, and the port your cache is listening on, usually 3128, and check all the options.
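The copy itself is a one-liner; the Apache cgi-bin path below is only an example, so substitute your web server's actual CGI directory:

cp /usr/local/squid/bin/cachemgr.cgi /usr/local/apache/cgi-bin/
chmod 755 /usr/local/apache/cgi-bin/cachemgr.cgi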

Conclusions and Tips

A proxy-cache server is a necessary service for almost any organization connected to the Internet. In this article, we have tried to show the whys and hows of implementing this technology, along with a brief tutorial on Squid, the most advanced and powerful tool for this purpose. Don't forget to read all the comments in the example configuration file; they are complete and useful, and they describe many features not mentioned in this article.

Perhaps in a few years, with the growth of PUSH technology and the use of dynamic content on the Web, caching won't be a solution to the bandwidth crisis. Today, it's the best we have.

One problem proxy caches can't solve is making certain your users configure their browsers to use the cache; users can always bypass the proxy by leaving it out of their browser settings. Some organizations have chosen to block outgoing port 80 at their routers for every system except the one running the proxy-cache server. It's a radical solution, but very effective.
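On a Linux box acting as the router, the idea might be sketched like this (iptables syntax is more recent than the tools of this article's era, and 192.168.1.10 stands in for your proxy host):

iptables -A FORWARD -p tcp --dport 80 -s 192.168.1.10 -j ACCEPT
iptables -A FORWARD -p tcp --dport 80 -j REJECT

Only the proxy host can then open outbound HTTP connections; every other client is forced through the cache.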

Another thing you can do to improve the speed of your users' browsers is pre-fetching the most accessed web sites into your cache. Recursive web-fetching tools that support proxy connections (e.g., url_get or webcopy) can do this task during off-peak hours. Launching one of these retrieval tools with its standard output redirected to /dev/null updates the cache with fresh objects.
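GNU wget can do the same job through the cache; the site and recursion depth below are arbitrary, and the --delete-after option discards each local copy once the proxy has stored the object:

http_proxy=http://localhost:3128/ wget -q -r -l 2 --delete-after http://www.example.com/

Run from cron during off-peak hours, this keeps popular pages fresh in the cache before your users arrive.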


David Guerrero is a system and network manager for the Boletin Oficial del Estado. He has been using Linux since the .98plNN days and is now playing with some Alpha-Linux boxes. When not working or studying, he likes to spend time with his love Yolanda, travel, play guitar and synths, or go out with his “colegas” (buddies). He can be reached at david@boe.es.


