An Internet Object is a file, document, or response to a query for an Internet service such as FTP, HTTP, or gopher. A client requests an Internet object from a caching proxy; the proxy server fetches the object (either from the host specified in the URL or from a parent or sibling cache) and delivers it to the client.
ICP is a protocol used for communication among squid caches. The ICP protocol is defined in two Internet Drafts (soon to be RFC's). One document describes version 2 of the protocol itself, and another describes the application of ICP to hierarchical Web caching.
ICP is primarily used within a cache hierarchy to locate specific objects in sibling caches. If a squid cache does not have a requested document, it sends an ICP query to its siblings, and the siblings respond with ICP replies indicating a ``HIT'' or a ``MISS.'' The cache then uses the replies to choose from which cache to resolve its own MISS.
ICP also supports multiplexed transmission of multiple object streams over a single TCP connection. ICP is currently implemented on top of UDP. Current versions of Squid also support ICP via multicast.
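The HIT/MISS exchange described above can be sketched in a few lines of Python. This is only an illustration, not Squid's actual code: the `IcpReply` record, the `choose_source` function, and the rule of preferring the reply with the lowest round-trip time are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IcpReply:
    """One ICP reply as seen by the querying cache (hypothetical record)."""
    peer: str         # name of the neighbor that answered
    opcode: str       # "HIT" or "MISS"
    is_parent: bool   # True if the neighbor is a parent, False if a sibling
    rtt_ms: float     # measured round-trip time of the ICP exchange


def choose_source(replies: List[IcpReply]) -> Optional[str]:
    """Pick the neighbor to fetch from, or None to go to the origin server.

    Assumed policy: any HIT beats any MISS; a MISS is only useful when it
    comes from a parent, because siblings do not fetch on our behalf.
    """
    hits = [r for r in replies if r.opcode == "HIT"]
    if hits:
        return min(hits, key=lambda r: r.rtt_ms).peer
    parent_misses = [r for r in replies if r.opcode == "MISS" and r.is_parent]
    if parent_misses:
        return min(parent_misses, key=lambda r: r.rtt_ms).peer
    return None
```

With one sibling HIT among the replies, the sketch selects that sibling even if a parent answered faster with a MISS; with no HIT at all, it falls back to the fastest parent, and with no parent it returns None so the caller can contact the origin server directly.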
The dnsserver is a process forked by squid to resolve IP addresses from domain names. This is necessary because the gethostbyname(3) function blocks the calling process until the DNS query is completed.
Squid must use non-blocking I/O at all times, so DNS lookups are implemented external to the main process. The dnsserver processes do not cache DNS lookups; that is implemented inside the squid process.
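The division of labor described above (blocking lookups pushed out of the main loop, caching kept in the main process) can be illustrated with a small Python sketch. Note the substitution: it uses a thread pool as a stand-in for the forked dnsserver processes, and the `CachingResolver` class and its methods are hypothetical names, not part of Squid.

```python
import socket
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class CachingResolver:
    """Hand blocking lookups to workers; keep the cache on the caller's side."""

    def __init__(self, lookup: Callable[[str], str] = socket.gethostbyname,
                 workers: int = 5):
        self._lookup = lookup                      # blocking call, like gethostbyname(3)
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._cache: Dict[str, str] = {}           # owned by the resolver, not the workers

    def _lookup_and_store(self, hostname: str) -> str:
        address = self._lookup(hostname)           # blocks only the worker
        self._cache[hostname] = address
        return address

    def resolve(self, hostname: str) -> "Future[str]":
        """Return immediately with a Future; the caller never blocks here."""
        if hostname in self._cache:
            done: "Future[str]" = Future()
            done.set_result(self._cache[hostname])
            return done
        return self._pool.submit(self._lookup_and_store, hostname)
```

A caller would poll or attach a callback to the returned Future instead of waiting on it, which is the property the main process needs. The sketch omits cache expiry and negative caching.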
The ftpget program is an FTP client used for retrieving files from FTP servers. Because the FTP protocol is complicated, it is easier to implement it separately from the main squid code.
It seems that FTP puts don't work through squid. Is there a fix and/or work-in-progress for this?
Not at the moment; supporting this would require an ftpput program.
A cache hierarchy is a collection of caching proxy servers organized in a logical parent/child and sibling arrangement so that caches closest to Internet gateways (closest to the backbone transit entry-points) act as parents to caches at locations farther from the backbone. The parent caches resolve ``misses'' for their children. In other words, when a cache requests an object from its parent, and the parent does not have the object in its cache, the parent fetches the object, caches it, and delivers it to the child. This ensures that the hierarchy achieves the maximum reduction in bandwidth utilization on the backbone transit links, helps reduce load on Internet information servers outside the network served by the hierarchy, and builds a rich cache on the parents so that the other child caches in the hierarchy will obtain better ``hit'' rates against their parents.
In addition to the parent-child relationships, squid supports the notion of siblings: caches at the same level in the hierarchy, provided to distribute cache server load. Each cache in the hierarchy independently decides whether to fetch the reference from the object's home site or from parent or sibling caches, using a simple resolution protocol. Siblings will not fetch an object for another sibling to resolve a cache ``miss.''
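A toy model may make the parent and sibling roles concrete. The following Python sketch is an assumption-laden illustration, not Squid's resolution algorithm: the `Cache` class, its `peek` and `get` methods, and the `fetch_origin` callback are invented for the example, and ICP is replaced by direct method calls.

```python
from typing import Callable, Dict, List, Optional


class Cache:
    """A caching proxy with an optional parent and a list of siblings."""

    def __init__(self, name: str, fetch_origin: Callable[[str], str],
                 parent: Optional["Cache"] = None):
        self.name = name
        self.fetch_origin = fetch_origin   # fetches from the object's home site
        self.parent = parent
        self.siblings: List["Cache"] = []
        self.store: Dict[str, str] = {}

    def peek(self, url: str) -> Optional[str]:
        """Sibling request: answer from local storage only, never fetch."""
        return self.store.get(url)

    def get(self, url: str) -> str:
        """Client or child request: resolve the miss if necessary."""
        if url in self.store:
            return self.store[url]
        body = None
        for sibling in self.siblings:
            body = sibling.peek(url)
            if body is not None:
                break
        if body is None:
            # A parent fetches, caches, and delivers; with no parent, go direct.
            body = self.parent.get(url) if self.parent else self.fetch_origin(url)
        self.store[url] = body
        return body
```

In this model, when two children of the same parent request the same URL, the origin server is contacted only once: the first request fills the parent (and the first child), and the second is served by a sibling or by the parent.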
The algorithm is somewhat more complicated when firewalls are involved.
The single_parent_bypass directive can be used to skip the ICP queries if the only appropriate sibling is a parent cache (i.e., if there's only one place you'd fetch the object from, why bother querying?).
There are several open issues for the caching project, namely: more automatic load balancing and selection of parents (both configured and dynamic), routing, multicast cache-to-cache communication, and better recognition of URLs that are not worth caching.
The current Squid Developers to-do list is available for your reading enjoyment.
Prospective developers should review the resources available at the Squid developers corner.
Workload can be characterized as the burden a client or group of clients imposes on a system. Understanding the nature of workloads is important to managing system capacity.
If you are interested in Internet traffic workloads, then NLANR's Network Analysis activities are a good place to start.
The NLANR root caches are at the NSF supercomputer centers (SCCs), which are interconnected via NSF's high speed backbone service (vBNS). So inter-cache communication between the NLANR root caches does not cross the Internet.
The benefits of hierarchical caching (namely, reduced network bandwidth consumption, reduced access latency, and improved resiliency) come at a price. Caches higher in the hierarchy must field the misses of their descendants. If the equilibrium hit rate of a leaf cache is 50%, half of all leaf references have to be resolved through a second-level cache rather than directly from the object's source. If this second-level cache has most of the documents, it is usually still a win, but if higher-level caches often don't have the document, or become overloaded, then they could actually increase access latency rather than reduce it.
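The trade-off can be put into rough numbers. The Python sketch below computes an expected access latency for a leaf cache with and without a second-level cache. The model and every figure used with it are assumptions for illustration only: a parent miss is charged the parent round trip plus the origin fetch, and overload effects are ignored.

```python
def expected_latency_ms(leaf_hit_rate: float, parent_hit_rate: float,
                        leaf_ms: float, parent_ms: float, origin_ms: float,
                        use_parent: bool = True) -> float:
    """Average latency seen by a client of the leaf cache (simplified model)."""
    if use_parent:
        # Parent hit: one trip to the parent.
        # Parent miss: trip to the parent plus the parent's fetch from the origin.
        miss_ms = (parent_hit_rate * parent_ms
                   + (1.0 - parent_hit_rate) * (parent_ms + origin_ms))
    else:
        miss_ms = origin_ms   # leaf goes straight to the object's source
    return leaf_hit_rate * leaf_ms + (1.0 - leaf_hit_rate) * miss_ms
```

Taking a 50% leaf hit rate and hypothetical costs of 10 ms for a leaf hit, 50 ms for a parent round trip, and 300 ms for an origin fetch, the model gives about 155 ms without a parent, about 90 ms with a parent that hits 60% of the time, and about 165 ms with a parent that hits only 10% of the time. In the last case the extra hop costs more than it saves.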
Please see the Firewalls mailing list and FAQ information site.
For example:
Storage LRU Expiration Age: 4.31 days
The LRU expiration age is a dynamically-calculated value. Any objects which have not been accessed for this amount of time will be removed from the cache to make room for new, incoming objects. Another way of looking at this is that it would take your cache approximately this many days to go from empty to full at your current traffic levels.
As your cache becomes busier, the LRU age becomes lower so that more objects will be removed to make room for the new ones. Ideally, your cache will have an LRU age of at least 3 days. If the LRU age is lower than 3 days, then your cache is probably not big enough to handle the volume of requests it receives. By adding more disk space you could increase your cache hit ratio.
The configuration parameter reference_age places an upper limit on your cache's LRU expiration age.
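The "empty to full" reading of the LRU age lends itself to a back-of-the-envelope calculation. The Python sketch below is an approximation under stated assumptions, not the formula Squid uses internally: it treats the age as cache size divided by the daily volume of newly stored data, and the function names and the `reference_age_days` parameter are hypothetical.

```python
from typing import Optional


def estimated_lru_age_days(cache_size_gb: float, new_data_gb_per_day: float,
                           reference_age_days: Optional[float] = None) -> float:
    """Days needed to fill the cache from empty at the current storage rate."""
    if new_data_gb_per_day <= 0:
        raise ValueError("new_data_gb_per_day must be positive")
    age = cache_size_gb / new_data_gb_per_day
    if reference_age_days is not None:
        age = min(age, reference_age_days)   # reference_age acts as an upper limit
    return age


def extra_disk_needed_gb(cache_size_gb: float, new_data_gb_per_day: float,
                         target_days: float = 3.0) -> float:
    """Additional disk needed to reach the target LRU age (0 if already there)."""
    return max(0.0, target_days * new_data_gb_per_day - cache_size_gb)
```

For instance, a hypothetical 8 GB cache that stores 4 GB of new objects per day would have an estimated age of 2 days, and would need roughly 4 GB more disk to reach the 3-day guideline.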
Consider a pair of caches named A and B. It may be the case that A can reach B, and vice-versa, but B has poor reachability to the rest of the Internet. In this case, we would like B to recognize that it has poor reachability and somehow convey this fact to its neighbor caches.
Squid will track the ratio of failed-to-successful requests over short time periods. A failed request is one which is logged as ERR_DNS_FAIL, ERR_CONNECT_FAIL, or ERR_READ_ERROR. When the failed-to-successful ratio exceeds 1.0, Squid will return ICP_MISS_NOFETCH instead of ICP_MISS to neighbors. Note that Squid will still return ICP_HIT for cache hits.
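The rule above can be expressed as a small counter. This Python sketch is a simplified model, not Squid's implementation: the `ReachabilityTracker` class is hypothetical, any log tag outside the three listed error codes is assumed to count as a success, and the "short time period" is modeled by calling `reset()` at the start of each window.

```python
FAILURE_TAGS = {"ERR_DNS_FAIL", "ERR_CONNECT_FAIL", "ERR_READ_ERROR"}


class ReachabilityTracker:
    """Counts failed and successful requests within one time window."""

    def __init__(self) -> None:
        self.failed = 0
        self.successful = 0

    def reset(self) -> None:
        """Start a new measurement window."""
        self.failed = 0
        self.successful = 0

    def record(self, log_tag: str) -> None:
        if log_tag in FAILURE_TAGS:
            self.failed += 1
        else:
            self.successful += 1   # assumption: everything else is a success

    def ratio(self) -> float:
        if self.successful == 0:
            return float("inf") if self.failed else 0.0
        return self.failed / self.successful

    def icp_reply(self, have_object: bool) -> str:
        """Opcode to send to a neighbor that queried us."""
        if have_object:
            return "ICP_HIT"        # hits are always reported as hits
        if self.ratio() > 1.0:
            return "ICP_MISS_NOFETCH"
        return "ICP_MISS"
```

A neighbor that receives ICP_MISS_NOFETCH can then avoid forwarding its request through the poorly connected cache.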