Tuesday, October 17, 2006

Reverse Proxy using squid

What is proxy server caching? Let's start with what Apache, or any other web server, does. Whenever you send a request to Apache, the request (whether static or dynamic) is processed: the appropriate file is located in the file system, its content type is identified, and the data is streamed from the file over the network to the browser. That is what Apache does.

So where does a proxy server come into the picture? Proxy servers like squid are typically set up at the gateway level in cyber cafes or large companies. The proxy keeps web content in its own cache. Whenever a request reaches the proxy, it compares the modification time of its cached copy with that of the content at the origin server. If the times match, the content is served from the proxy cache; otherwise it is fetched from the remote server, served, and cached for future requests.
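The freshness check described above can be sketched roughly as follows. This is a toy model for illustration, not how squid is actually implemented: `CachedEntry`, `serve`, and the two callables `origin_last_modified` and `fetch_origin` are hypothetical names, and the only rule modelled is "same modification time means serve from cache".

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class CachedEntry:
    body: bytes
    last_modified: str  # modification time recorded when the object was cached


def serve(url: str,
          cache: Dict[str, CachedEntry],
          origin_last_modified: Callable[[str], str],
          fetch_origin: Callable[[str], Tuple[bytes, str]]) -> Tuple[bytes, str]:
    """Return (body, "HIT" or "MISS") for a request passing through the proxy."""
    entry = cache.get(url)
    # Serve from cache only if the origin's modification time is unchanged.
    if entry is not None and entry.last_modified == origin_last_modified(url):
        return entry.body, "HIT"
    # Otherwise fetch from the origin, store the copy, and serve it.
    body, last_modified = fetch_origin(url)
    cache[url] = CachedEntry(body, last_modified)
    return body, "MISS"
```

The first request for a URL is a miss and fills the cache; later requests are hits until the origin reports a different modification time.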

So, now, what is reverse proxy caching? A reverse proxy is installed in the neighbourhood of a web server, and all incoming traffic to the web server passes through it. Placing a web cache between the server and the internet in this way reduces the load on a busy server.

The following benefits come from deploying reverse proxy servers alongside web servers:

1. Increased capacity of existing servers, which can avoid the purchase of new ones.
2. Static content is easily cached at the proxy server level, leaving the web server to handle dynamic content.
3. Reduced response time for web requests.
4. The proxy server acts as an additional layer of defence against hacking.
5. Load balancing: a reverse proxy can be used to distribute load across several web servers.
6. Compression: web content can be optimized and compressed to speed up download times.

A reverse proxy server intercepts requests to the web server and, where it can, answers them out of its store of cached pages. Dynamic web content is generally not cached unless the headers sent by the origin server allow it. The reverse proxy caches static pages and images based on the HTTP headers returned with the response. The important HTTP headers are:

Last-Modified -> when the page was last modified
Expires -> when the page expires, so that it can be removed from the proxy server cache
Cache-Control -> whether, and for how long, the page may be cached
Pragma -> similar to Cache-Control (its HTTP/1.0 predecessor); a deciding factor in whether the page should be cached or not
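As a rough illustration of how these headers feed into a caching decision, here is a deliberately simplified sketch. The function name `is_cacheable` and its rules are assumptions made for this example; the real HTTP caching rules (and squid's own logic) are considerably more detailed. For instance, `no-cache` strictly means "revalidate before reuse", which is treated here simply as "do not cache".

```python
from typing import Mapping


def is_cacheable(headers: Mapping[str, str]) -> bool:
    """Simplified guess at whether a response may be stored by a shared cache."""
    h = {name.lower(): value for name, value in headers.items()}

    # Split "public, max-age=3600" into directive names: {"public", "max-age"}.
    directives = {
        part.strip().lower().split("=", 1)[0]
        for part in h.get("cache-control", "").split(",")
        if part.strip()
    }

    # Explicit refusals win over everything else.
    if directives & {"no-store", "private", "no-cache"}:
        return False
    if "no-cache" in h.get("pragma", "").lower():
        return False

    # Otherwise require some hint about freshness or modification time.
    return "max-age" in directives or "expires" in h or "last-modified" in h
```

A static image served with a Last-Modified header would pass this check, while a page sent with "Cache-Control: private" would not.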

Here is what I did to install squid:

>> download the squid source gz file.
>> tar -xvzf squid-2.6.STABLE4.tar.gz - creates a directory squid-2.6.STABLE4
>> cd squid-2.6.STABLE4
>> ./configure --disable-internal-dns - makes squid do DNS lookups through its external dnsserver helpers, which use the system resolver and therefore consult the /etc/hosts file.
>> make - compile the files
>> make install (as root) - copies the compiled files to /usr/local/squid
>> vi /usr/local/squid/etc/squid.conf
>> Make the following changes. The configuration file format changed in squid 2.6, so I am putting the new configuration settings here. The older httpd_accel_* directives have been deprecated.

http_port 80 vhost
The socket address where squid listens for HTTP client requests. The default is 3128 (for a proxy server). Various options can be put after the port number, such as:
transparent : support for transparent proxies
vhost : accelerator mode using the Host header for virtual domain support
vport : accelerator with IP-based virtual host support
defaultsite= : main website name for accelerators
protocol= : protocol to reconstruct accelerated requests with. Default is http.
no-connection-auth : prevent forwarding of Microsoft connection-oriented authentication
tproxy : support for Linux TPROXY, for spoofing outgoing connections using the client IP address

cache_peer hostname type proxy-port icp-port [options]

For Apache running on localhost on port 81, the cache_peer directive for the reverse proxy would be:

cache_peer localhost parent 81 0 originserver
hostname : the cache peer to which the connection has to be established
type : how the cache peer is treated (as parent, sibling or multicast)
parent -> the child cache forwards requests to its parent cache. If the parent does not hold the requested object, it fetches the object on behalf of the child.
sibling -> a peer may only request objects already held in the sibling's cache. A sibling does not forward cache misses on behalf of the peer.
multicast -> queries are sent as multicast packets, which go from one machine to one or more peers.
proxy port : the port on which the peer listens for requests (81 in the example above)
icp port : used for querying neighbour caches about objects (0 in the example above, since Apache does not speak ICP)
options : lots of options are available, such as
proxy-only -> objects fetched from this peer should not be saved locally
weight=n -> specify the weight of a parent
round-robin -> define a set of parents to be used in a round-robin way
weighted-round-robin -> define a set of parents to be used in a round-robin way, with the frequency of each parent based on its round trip time
originserver -> contact this parent as an origin server. Used for accelerator setups.
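To make the field order of the cache_peer line concrete, here is a small sketch that splits such a line into its parts. The `CachePeer` class and `parse_cache_peer` function are hypothetical helpers written for this post, not part of squid, and they only understand the simple whitespace-separated form shown above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CachePeer:
    hostname: str
    peer_type: str   # parent, sibling or multicast
    proxy_port: int  # port the peer accepts requests on
    icp_port: int    # port for ICP queries
    options: List[str] = field(default_factory=list)


def parse_cache_peer(line: str) -> CachePeer:
    """Split a 'cache_peer hostname type proxy-port icp-port [options]' line."""
    parts = line.split()
    if len(parts) < 5 or parts[0] != "cache_peer":
        raise ValueError("not a cache_peer line: %r" % line)
    hostname, peer_type, proxy_port, icp_port = parts[1:5]
    if peer_type not in ("parent", "sibling", "multicast"):
        raise ValueError("unknown peer type: %r" % peer_type)
    # Anything after the fourth field is treated as an option.
    return CachePeer(hostname, peer_type, int(proxy_port), int(icp_port), parts[5:])


peer = parse_cache_peer("cache_peer localhost parent 81 0 originserver")
```

For the line used in this post, `peer.proxy_port` is 81, `peer.icp_port` is 0 and `peer.options` contains only `originserver`.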

That's it. Just start Apache (listening on port 81, as in the cache_peer line) and squid, and everything should run fine.
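One way to check whether squid is actually answering from its cache is to look at the X-Cache response header, which squid normally adds with a value such as "HIT from hostname" or "MISS from hostname". The sketch below assumes that header is present and has not been stripped by your configuration; `cache_status` and `fetch_cache_status` are hypothetical helper names, and the URL you pass in is whatever your own setup serves.

```python
import urllib.request
from typing import Mapping, Optional


def cache_status(headers: Mapping[str, str]) -> Optional[str]:
    """Return "HIT" or "MISS" from an X-Cache header, or None if unavailable."""
    value = headers.get("X-Cache")
    if not value:
        return None
    parts = value.split()
    if not parts:
        return None
    first = parts[0].upper()
    return first if first in ("HIT", "MISS") else None


def fetch_cache_status(url: str) -> Optional[str]:
    """Request a URL through the reverse proxy and report its cache status."""
    with urllib.request.urlopen(url) as response:
        return cache_status(response.headers)
```

Calling `fetch_cache_status` twice for the same static file should, if caching is working and the response headers permit it, report a MISS followed by a HIT.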
Hope this helps...

Source : http://www.visolve.com/squid/whitepapers/reverseproxy.php with some customizations to upgrade it for the new version.