To test your webproxy in normal mode (i.e. having the test-webproxy program automatically start your webproxy), type
% ./test-webproxy (path of your webproxy program) (test#)
There are 3 available tests; the default is to run all of them.
To test your webproxy in debug mode, edit the test-webproxy.C file and change
the line
#define DEBUG_WEBPROXY 0
to
#define DEBUG_WEBPROXY 1
You can now run your webproxy manually under a debugger (such as gdb). When you
run the test-webproxy program this time, you need to tell it the port the proxy
is running on:
% ./test-webproxy (path of your webproxy) (proxy port) (test#)
As in the previous lab, we will maintain a FAQ list to help you debug your webproxy.
Students who have done the 6.033 lab before should refer to this document for what to do in this lab.
Now that you have programmed an asynchronous TCP proxy, you will enjoy creating an asynchronous caching web proxy. Unlike the TCP proxy, your web proxy will involve interactions between different connections (through the shared cache) as well as more involved treatment of individual connections.
In this handout, we use client to mean an application program that establishes connections for the purpose of sending requests[3]. Typically the client is a web browser (e.g., lynx or Netscape). We use server to mean an application program that accepts connections in order to service requests by sending back responses (e.g., the Apache web server)[1]. Note that a proxy can act as both a client and server. Moreover, a proxy could communicate with other proxies (e.g., a cache hierarchy).
The HTTP/1.0 spec, RFC 1945, defines a web proxy as a transparent, trusted intermediary between web clients and web servers for the purpose of making requests on behalf of clients. Requests are serviced internally or by passing them, with possible translation, on to other servers. A proxy must interpret and, if necessary, rewrite a request message before forwarding it. In particular, your proxy must address:
Your proxy should function correctly for any HTTP/1.0 GET, POST, or HEAD request. However, you may ignore any references to cookies and authentication.
Your web proxy should tolerate many simultaneous requests. The web proxy will accept connections from multiple clients and forward them using multiple connections to the appropriate servers. No client or server should be able to hang the web proxy by refusing to read or write data on its connection.
You should ensure that your proxy serves cached pages to clients when RFC 1945 allows, and only contacts a server when it has to. RFC 1945 specifies headers that clients and servers may provide to help control caching: Expires, If-Modified-Since, Last-Modified, and Pragma: no-cache. You should make sure that your software obeys these headers. However, you'll find you have a certain amount of freedom in exactly how you decide whether you can serve a cached page to a client, or whether you must re-fetch it from the server.
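The caching decision described above can be sketched as a single predicate. This is only a minimal illustration: the CacheEntry struct, the can_serve_from_cache name, and the 60-second default lifetime are assumptions made for this example, not part of the provided code or a policy required by RFC 1945.

```cpp
#include <ctime>
#include <string>

// Hypothetical cache entry; the field names are illustrative only.
struct CacheEntry {
    std::string body;      // cached response (headers and entity)
    bool has_expires;      // true if the server sent an Expires header
    std::time_t expires;   // parsed Expires value (valid if has_expires)
    std::time_t fetched;   // when the proxy stored this entry
};

// Assumed policy knob: how long to trust a page that has no Expires header.
const std::time_t kDefaultLifetime = 60;  // seconds, an arbitrary choice

// Returns true if the entry may be served without contacting the server.
bool can_serve_from_cache(const CacheEntry &e, bool client_sent_no_cache,
                          std::time_t now) {
    if (client_sent_no_cache)   // Pragma: no-cache asks for a re-fetch
        return false;
    if (e.has_expires)          // the server said when the page goes stale
        return now < e.expires;
    return now - e.fetched < kDefaultLifetime;  // heuristic fallback
}
```

When the predicate returns false and the entry carries a Last-Modified date, one option is to send the server a GET with If-Modified-Since rather than re-fetching the page unconditionally.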
You'll want to search RFC 1945 for any warnings about "proxy" behavior. The lab TAs will test that your proxy handles requests as stated in RFC 1945.
The Hypertext Transfer Protocol (HTTP) is the most commonly used protocol on the web today. For this lab, you will use the somewhat out-of-date version 1.0 of HTTP.
The HTTP protocol assumes a reliable connection and, in current practice, uses the TCP protocol to provide this reliable connection. The TCP protocol provides the reliable transport of bytes between programs on two separate machines, even over an unreliable network. Luckily for us, the TCP protocol is built into the UNIX operating system.
The HTTP protocol is a request/response protocol. When a client opens a connection, it immediately sends its request for a file. A web server then responds with the file or an error message. You can try out the protocol yourself. For example, try:
(~/)% telnet web.mit.edu 80
Then type
GET /6.033/www/ HTTP/1.0
followed by two carriage returns. See what you get.
To form the path to the file to be retrieved on a server, the client takes everything after the machine name and port number. For example, http://www.mit.edu/original/ means we should ask for the file /original/. If you see a URL with nothing after the machine name and port, then / is assumed. (The server determines what page to return when given just /; typically this default page is index.html or home.html.)
On most servers, the HTTP protocol lives on port 80. However, ports below 1024, including port 80, are reserved for the superuser on most UNIX systems, so we will have to run our web proxy on a higher port (> 1023). To use other ports, we need to modify our URLs a bit, adding the port number after the machine name. For example, entering http://www.mit.edu:8008/ into your favorite web browser connects to the machine www.mit.edu on port 8008 using the HTTP protocol.
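The URL rules above (path defaults to /, port defaults to 80) could be sketched as follows. The UrlParts struct and the split_http_url function are hypothetical names chosen for this example and are not part of the parser supplied with the lab. The sketch also assumes a lowercase http:// prefix.

```cpp
#include <cstdlib>
#include <string>

// Illustrative result type; not part of the provided parser.
struct UrlParts {
    std::string host;
    int port;
    std::string path;
};

// Splits "http://host[:port][/path]" into its parts.
// Returns false if the URL does not start with "http://", has no host,
// or carries a port outside 1..65535.
bool split_http_url(const std::string &url, UrlParts &out) {
    const std::string scheme = "http://";
    if (url.compare(0, scheme.size(), scheme) != 0)
        return false;
    const std::string::size_type host_start = scheme.size();
    const std::string::size_type path_start = url.find('/', host_start);
    std::string hostport;
    if (path_start == std::string::npos) {
        hostport = url.substr(host_start);
        out.path = "/";                      // nothing after host: assume /
    } else {
        hostport = url.substr(host_start, path_start - host_start);
        out.path = url.substr(path_start);   // everything after host[:port]
    }
    const std::string::size_type colon = hostport.find(':');
    if (colon == std::string::npos) {
        out.host = hostport;
        out.port = 80;                       // default HTTP port
    } else {
        out.host = hostport.substr(0, colon);
        out.port = std::atoi(hostport.c_str() + colon + 1);
    }
    return !out.host.empty() && out.port > 0 && out.port < 65536;
}
```

For http://www.mit.edu:8008/ this yields host www.mit.edu, port 8008, and path /.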
The format of the request for HTTP is quite simple. A request consists of a method followed by arguments, each separated by a space and terminated by a carriage return/linefeed pair. Your web proxy should support three methods: GET, POST, and HEAD[3]. Methods take two arguments: the file to be retrieved and the HTTP version. Additional headers can follow the request. The web proxy will especially care about the following headers: Allow, Date, Expires, From, If-Modified-Since, Pragma: no-cache, Server. However, your proxy must handle the other HTTP/1.0 headers[3]. Fortunately, the web proxy can forward most headers verbatim to the appropriate server. Only a handful of headers require proxy intervention.
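To make the request-line format concrete, here is a small sketch that splits a line into its method and two arguments and accepts only the three methods named above. The RequestLine struct and parse_request_line function are hypothetical names for illustration; the parser handed out with the lab may work quite differently. As written, the sketch rejects HTTP/0.9-style requests that omit the version.

```cpp
#include <sstream>
#include <string>

// Illustrative type; not the structure used by the provided parser.
struct RequestLine {
    std::string method;
    std::string uri;
    std::string version;
};

// Parses "METHOD SP URI SP VERSION [CRLF]".
// Returns true only for GET, POST, or HEAD with an "HTTP/" version token.
bool parse_request_line(const std::string &line, RequestLine &out) {
    std::string trimmed = line;
    // Drop the trailing carriage return/linefeed pair, if present.
    while (!trimmed.empty() &&
           (trimmed.back() == '\r' || trimmed.back() == '\n'))
        trimmed.pop_back();

    std::istringstream in(trimmed);
    if (!(in >> out.method >> out.uri >> out.version))
        return false;                 // fewer than three fields
    std::string extra;
    if (in >> extra)
        return false;                 // more than three fields
    if (out.version.compare(0, 5, "HTTP/") != 0)
        return false;
    return out.method == "GET" || out.method == "POST" ||
           out.method == "HEAD";
}
```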
Once the request line is received, the web proxy should continue reading the input from the client until it encounters a blank line. The proxy should then fetch the appropriate file and send back a response (usually the file contents) and close the connection.
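Because data arrives asynchronously, the blank line may show up only after several reads. One way to detect it, sketched here under the assumption that the proxy appends each chunk it reads to a per-connection std::string, is to search the accumulated buffer. The function name header_block_length is hypothetical.

```cpp
#include <algorithm>
#include <string>

// Returns the number of bytes up to and including the blank line that ends
// the request headers, or std::string::npos if it has not arrived yet.
// Bare LF line endings are tolerated in addition to CRLF.
std::string::size_type header_block_length(const std::string &buf) {
    const std::string::size_type npos = std::string::npos;
    const std::string::size_type crlf = buf.find("\r\n\r\n");
    const std::string::size_type lf = buf.find("\n\n");
    const std::string::size_type end_crlf = (crlf == npos) ? npos : crlf + 4;
    const std::string::size_type end_lf = (lf == npos) ? npos : lf + 2;
    // npos is the largest size_type value, so min picks the earlier match.
    return std::min(end_crlf, end_lf);
}
```

Note that for a POST request the entity body follows the blank line, so the bytes past the returned length belong to the body (its size is given by Content-Length) and still have to be forwarded.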
To use a web proxy, you must configure your web browser. For Lynx, wget, or Mosaic, you must set an environment variable. The following sets your proxy to squid.lcs.mit.edu in csh.
(~/)% setenv http_proxy http://squid.lcs.mit.edu:3128/
In Netscape, find the Network Preferences and manually set up a proxy. For instance, you can set the HTTP proxy to squid.lcs.mit.edu and the port to 3128. Remember to revert your changes. Not all requests will work transparently through the squid.lcs.mit.edu proxy.
How does one watch an HTTP request in action? To make a simple HTTP request, most people will use telnet. However, telnet does not let you watch incoming HTTP requests. For a more sophisticated connection, use nc (NetCat). nc lets you read and write data across network connections using UDP or TCP[10].
If you use Athena, type the following to get nc:
(~/)% add sipb
A standard Linux installation usually comes with nc. If you don't have nc on your machine, go to this site to download and install it.
To use nc, start it in listening mode. For instance, this listens to the network on port 8000:
(~/)% add sipb
(~/)% nc -p 8000 -l -v
listening on [any] 8000 ...
Now point your favorite web browser to http://localhost:8000/ (no proxy). My version of Netscape generates:
connect to [127.0.0.1] from localhost [127.0.0.1] 5854
GET / HTTP/1.0
Connection: Keep-Alive
User-Agent: Mozilla/3.01 (X11; U; Linux 2.0.30 i586)
Host: localhost:8000
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
The first line asks for a file called / using HTTP version 1.0. Look in RFC 1945 for details on the remaining lines. Lynx produces a similar request:
connect to [127.0.0.1] from localhost [127.0.0.1] 5917
GET / HTTP/1.0
Host: localhost:8000
Accept: application/postscript, image/gif, application/postscript, */*;q=0.001
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.6 libwww-FM/2.14
Set your browser to use port 8000 of localhost as a proxy, and retrieve http://c0re.l0pht.com/~weld/netcat/readme.html; this will produce something like:
connect to [127.0.0.1] from localhost.mit.edu [127.0.0.1] 2328
GET http://c0re.l0pht.com/~weld/netcat/readme.html HTTP/1.0
If-Modified-Since: Thursday, 12-Sep-96 02:25:13 GMT; length=63340
Referer: http://c0re.l0pht.com/~weld/netcat/
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/3.01Gold (X11; U; OpenBSD 2.2 i386)
Pragma: no-cache
Host: c0re.l0pht.com
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
The above shows what a web browser sends to a web proxy. Now we'll try to obtain sample data from a real web proxy (squid.lcs.mit.edu, port 3128). Set your browser's proxy to http://squid.lcs.mit.edu:3128/ and run the following command on your local machine, say abc.mit.edu (you can obtain the name of the local machine using the hostname command):
nc -p 8000 -v -l
listening on [127.0.0.1] 8000 ...
When I ask my web browser for http://abc.mit.edu:8000/, nc reports:
connect to [18.26.4.118] from xyz.lcs.mit.edu [18.24.10.20] 3037
GET / HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: en
User-Agent: ANONYM/0.0 (ITS; KL-10)
Host: abc.mit.edu:8000
Cache-Control: max-age=259200
Connection: keep-alive
Try this on your machine. Look for differences between the web browser's request and the corresponding proxy request.
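One difference visible in the dumps above is the request line itself: the browser sends the proxy a full http:// URL, while the request that reaches the origin server names only the path. A minimal sketch of that rewrite follows. The function name to_origin_request_line is hypothetical, and a real proxy would also need to extract the host and port from the URL before discarding them, so that it knows where to connect.

```cpp
#include <string>

// Rewrites "METHOD http://host[:port]/path VERSION" to "METHOD /path VERSION".
// Lines whose target is not an absolute http:// URL are returned unchanged.
std::string to_origin_request_line(const std::string &line) {
    const std::string scheme = "http://";
    const std::string::size_type npos = std::string::npos;

    const std::string::size_type sp = line.find(' ');
    if (sp == npos)
        return line;
    const std::string::size_type uri = sp + 1;           // start of the URL
    if (line.compare(uri, scheme.size(), scheme) != 0)
        return line;                                     // already a path

    const std::string::size_type uri_end = line.find(' ', uri);
    const std::string::size_type path = line.find('/', uri + scheme.size());
    if (path == npos || (uri_end != npos && path > uri_end)) {
        // The URL has no path component, so / is assumed.
        const std::string rest =
            (uri_end == npos) ? std::string() : line.substr(uri_end);
        return line.substr(0, uri) + "/" + rest;
    }
    return line.substr(0, uri) + line.substr(path);
}
```

Applied to the Netscape request shown earlier, this turns GET http://c0re.l0pht.com/~weld/netcat/readme.html HTTP/1.0 into GET /~weld/netcat/readme.html HTTP/1.0.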
Read over some of the suggested literature at the end of this document. (If you use Athena, there is no need for you to download the RFCs from this page. All RFCs are in the rfc locker.)
After you have a general understanding of the problem, play with nc and begin your web proxy design. Once you have convinced yourself of the correctness of your design, you should implement the web proxy. Likely you will discover new, fascinating problems and will need to modify your design appropriately.
For this lab, we will provide you with a simple HTTP/1.0 parser to save you the pain of parsing. Download it from here. The tar file also contains a sample Makefile which you could modify to suit your needs for this assignment.
% make
% ./webproxy 8088
Hand in your lab by creating a tar file with all your source files (and Makefile), uuencoding it, and e-mailing it to 6.894-submit@pdos.lcs.mit.edu. For example:
% tar cf lab3.tar Makefile http.h http.il http.C webproxy.C (and other files you created)
% uuencode < lab3.tar lab3.tar | Mail -s '6.894 lab3.tar' 6.894-submit@pdos.lcs.mit.edu
For students working on Athena, type the following instead:
% tar cf lab3.tar Makefile http.h http.il http.C webproxy.C (and other files you created)
% uuencode < lab3.tar lab3.tar | mhmail -subject '6.894 lab3.tar' 6.894-submit@pdos.lcs.mit.edu
Please don't send attachments in your email submission. If you cannot submit, email jinyang@lcs.mit.edu for help. :-)
We must be able to compile your software with our standard async library, so don't modify the async library.
The lab is due by the beginning of class on Thursday, October 12th.
Like the TCP Proxy, this is an individual project. But you are otherwise free (and encouraged) to discuss the design and implementation details with other 6.894 lab students. Include a README file in the tar archive you submit giving credit where it is due.
Your proxy should satisfy the following minimum criteria: