Version 1.0 of the HTTP protocol is specified by RFC1945. You can obtain a copy from the World Wide Web Consortium (see [WWWC]). Version 1.1 of the protocol is specified in RFC2616. Also see RFC2396 for the latest general URI syntax.
The following restrictions on the protocol will be made:
Only HTTP/1.0 will be implemented. The RFC also describes HTTP/0.9 but who uses that anymore?
Only the "http" scheme will be implemented, no FTP etc.
Only the GET, HEAD and POST methods will be implemented. GET is the basic page-fetch operation. HEAD is a variant of GET that only returns the headers. POST will only be used to submit form data.
The full monty for the URL syntax, from the RFC, is
    URL            = ( absoluteURL | relativeURL ) [ "#" fragment ]
    absoluteURL    = scheme ":" *( uchar | reserved )
    relativeURL    = net_path | abs_path | rel_path
    net_path       = "//" net_loc [ abs_path ]
    abs_path       = "/" rel_path
    rel_path       = [ path ] [ ";" params ] [ "?" query ]
    path           = fsegment *( "/" segment )
    fsegment       = 1*pchar
    segment        = *pchar
    params         = param *( ";" param )
    param          = *( pchar | "/" )
    scheme         = 1*( ALPHA | DIGIT | "+" | "-" | "." )
    net_loc        = *( pchar | ";" | "?" )
    query          = *( uchar | reserved )
    fragment       = *( uchar | reserved )
    pchar          = uchar | ":" | "@" | "&" | "=" | "+"
    uchar          = unreserved | escape
    unreserved     = ALPHA | DIGIT | safe | extra | national
    escape         = "%" HEX HEX
    reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+"
    extra          = "!" | "*" | "'" | "(" | ")" | ","
    safe           = "$" | "-" | "_" | "."
    unsafe         = CTL | SP | <"> | "#" | "%" | "<" | ">"
    national       = <any OCTET excluding ALPHA, DIGIT,
In this syntax "*()" means zero or more repetitions and "1*()" means one or more. The URL syntax allows national characters such as accented letters as long as the character set is an 8-bit one that includes the ASCII character set. For example ISO-8859-1 "Latin 1" would be fine. This doesn't restrict the characters allowed in pages though. They are only constrained by the MIME type for the page.
Since I am only implementing the "http" scheme I will actually be implementing this syntax:
    http_URL       = "http:" "//" host [ ":" port ] [ abs_path ]
    host           = <A legal Internet host domain name
                     or IP address (in dotted-decimal form),
                     as defined by Section 2.1 of RFC 1123>
    port           = *DIGIT
The host and scheme names are case-insensitive. If the port is empty or not given, port 80 is assumed. Only TCP connections will be used. Only absolute paths are allowed and they are case-sensitive.
The canonical form for "http" URLs is obtained by converting any uppercase alphabetic characters in the host name to their lowercase equivalent (host names are case-insensitive), eliding the [":" port] if the port is 80, and replacing an empty abs_path with "/".
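The canonicalisation rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name canonical_http_url and the regular expression are hypothetical, and the host is not validated against RFC 1123.

```python
import re

# Hypothetical pattern for the restricted "http" syntax above:
# "http:" "//" host [ ":" port ] [ abs_path ]
_HTTP_URL = re.compile(r"^http://([^/:]+)(?::(\d*))?(/.*)?$",
                       re.IGNORECASE | re.DOTALL)

def canonical_http_url(url):
    """Return the canonical form of an absolute "http" URL."""
    match = _HTTP_URL.match(url)
    if match is None:
        raise ValueError("not an absolute http URL: %r" % url)
    host, port, abs_path = match.groups()
    host = host.lower()                      # host names are case-insensitive
    if port is None or port == "" or int(port) == 80:
        netloc = host                        # elide an empty or default port
    else:
        netloc = "%s:%d" % (host, int(port))
    return "http://%s%s" % (netloc, abs_path or "/")  # empty path becomes "/"
```

For instance, canonical_http_url("HTTP://WWW.W3.ORG:80") gives "http://www.w3.org/". The path is left untouched because paths are case-sensitive.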
Characters may be encoded by the "%" escape sequence if they are unsafe or reserved. When parsing a URL the path will be split up according to the reserved characters before escapes are interpreted. So the path /%2Fabc/def has /abc as the name of its first segment and def as the name of the second segment. I will reject URLs having a forward slash or a NUL character in a segment so that they can be directly mapped to file names.
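The order of operations matters here: split first, decode second. A minimal Python sketch of that rule follows. The function names are hypothetical, the input is assumed to be the path alone (params and query already removed), and Latin-1 is assumed as the 8-bit character set.

```python
from urllib.parse import unquote

def decode_segments(abs_path):
    """Split an absolute path on "/" first, then decode "%" escapes."""
    if not abs_path.startswith("/"):
        raise ValueError("only absolute paths are allowed")
    raw_segments = abs_path[1:].split("/")
    # Decoding after the split keeps an escaped "/" inside its segment.
    return [unquote(raw, encoding="latin-1") for raw in raw_segments]

def safe_segments(abs_path):
    """As decode_segments, but reject names unusable as file names."""
    segments = decode_segments(abs_path)
    for name in segments:
        if "/" in name or "\x00" in name:
            raise ValueError("segment not usable as a file name: %r" % name)
    return segments
```

With these definitions decode_segments("/%2Fabc/def") yields the segments "/abc" and "def", while safe_segments raises ValueError for the same path.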
Each request is done using a separate TCP connection to the server. (Version 1.1 of the protocol allows more than one request per connection, which is a lot more efficient.) The RFC says:
... current practice requires that the connection be established
by the client prior to each request and closed by the server
after sending the response. Both clients and servers should be
aware that either party may close the connection prematurely,
due to user action, automated time-out, or program failure, and
should handle such closing in a predictable fashion. In any case,
the closing of the connection by either or both parties always
terminates the current request, regardless of its status.
All lines in the message are supposed to be terminated with a CR-LF character pair but applications must also accept a single CR or LF character. In the body of the page the line termination will depend on the MIME type but CRLF should be used for text types.
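A tolerant line splitter for the header part of a message might look like the following Python sketch. The name split_lines is hypothetical, and the sketch assumes it is only applied to the request line and headers, not to the entity body.

```python
import re

# CR-LF must be tried first so that it is not seen as two line ends.
_LINE_END = re.compile(rb"\r\n|\r|\n")

def split_lines(data):
    """Split raw bytes into lines, accepting CR-LF, a bare CR or a bare LF."""
    return _LINE_END.split(data)
```

An empty element in the result marks the blank line that ends the headers.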
A request message looks like:
    Full-Request   = Request-Line
                     *( General-Header
                      | Request-Header
                      | Entity-Header )
                     CRLF
                     [ Entity-Body ]
    Request-Line   = Method SP Request-URL SP HTTP-Version CRLF
    Method         = "GET" | "HEAD" | "POST"
    General-Header = Date | Pragma
    Request-Header = Authorization | From | If-Modified-Since
                   | Referer | User-Agent
    Entity-Header  = Allow | Content-Encoding | Content-Length
                   | Content-Type | Expires | Last-Modified
                   | extension-header
An example request line is
    GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.0
After the request line come zero or more headers and then a blank line to terminate the headers. The entity body is only used to supply data for the POST method.
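Parsing the request line can be sketched as below. This is an illustration only: parse_request_line is a hypothetical name, and rejecting every version other than HTTP/1.0 is an assumption drawn from the restrictions listed earlier.

```python
METHODS = ("GET", "HEAD", "POST")   # the only methods to be implemented

def parse_request_line(line):
    """Split a request line into (method, url, version)."""
    parts = line.rstrip("\r\n").split(" ")   # fields are separated by one SP
    if len(parts) != 3:
        raise ValueError("malformed request line: %r" % line)
    method, url, version = parts
    if method not in METHODS:
        raise ValueError("unsupported method: %r" % method)
    if version != "HTTP/1.0":                # assumption: 1.0 only
        raise ValueError("unsupported version: %r" % version)
    return method, url, version
```

Applied to the example request line it returns the method "GET", the URL unchanged, and the version "HTTP/1.0".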
Each header consists of a name followed immediately by a colon (":"), a single space (SP) character, and the field value. Field names are case-insensitive. Header fields can be extended over multiple lines by preceding each extra line with at least one SP or HT (horizontal tab), though this is not recommended.
    HTTP-header    = field-name ":" [ field-value ] CRLF
    field-name     = token
    field-value    = *( field-content | LWS )
    field-content  = <the OCTETs making up the field-value
                     and consisting of either *TEXT or combinations
                     of token, tspecials, and quoted-string>
In this syntax the following definitions are used.
    token          = 1*<any character except CTLs or tspecials>
    tspecials      = "(" | ")" | "<" | ">" | "@"
                   | "," | ";" | ":" | "\" | <">
                   | "/" | "[" | "]" | "?" | "="
                   | "{" | "}" | SP | HT
    TEXT           = <any OCTET except CTLs, but including LWS>
    CTL            = a control character or DEL (ASCII 127)
    LWS            = [CRLF] 1*( SP | HT )
    quoted-string  = Any sequence of characters except double-quote and
                     CTLs, but including LWS, enclosed in double-quote
                     characters. There is no backslash quoting of
                     characters within strings.
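The header syntax above, including continuation lines, can be handled by a short loop. The Python sketch below is hypothetical (parse_headers is an invented name) and assumes the lines have already been split and stripped of their terminators. It lower-cases field names, since they are case-insensitive, and does not check that a name is a valid token.

```python
def parse_headers(lines):
    """Collect (name, value) pairs up to the blank line ending the headers."""
    headers = []
    for line in lines:
        if line == "":
            break                               # blank line: end of headers
        if line[0] in " \t":
            # Continuation line: append to the previous header's value.
            if not headers:
                raise ValueError("continuation line before any header")
            name, value = headers[-1]
            headers[-1] = (name, (value + " " + line.strip()).strip())
        else:
            name, colon, value = line.partition(":")
            if not colon or not name:
                raise ValueError("malformed header: %r" % line)
            headers.append((name.lower(), value.strip()))
    return headers
```

A list of pairs is used rather than a dictionary so that repeated headers keep their order.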
The general headers are applicable to both requests and responses. They pertain to the message itself rather than the entity being transferred. The request headers provide extra information about the request. The entity headers provide information about the entity itself. I will use them only in the response. The next sections describe the headers.
The Date header provides the date and time that the message was originated. The preferred format of the date is the RFC822 format used in e-mail. For example:
    Date: Tue, 15 Nov 1994 08:12:31 GMT
A well behaved server should accept all of the following date formats:
    Sun, 06 Nov 1994 08:49:37 GMT    ; RFC 822, updated by RFC 1123
    Sunday, 06-Nov-94 08:49:37 GMT   ; RFC 850, obsoleted by RFC 1036
    Sun Nov  6 08:49:37 1994         ; ANSI C's asctime() format
All times are GMT (UTC). The following syntax describes all of the allowed date formats.
    HTTP-date      = rfc1123-date | rfc850-date | asctime-date
    rfc1123-date   = wkday "," SP date1 SP time SP "GMT"
    rfc850-date    = weekday "," SP date2 SP time SP "GMT"
    asctime-date   = wkday SP date3 SP time SP 4DIGIT
    date1          = 2DIGIT SP month SP 4DIGIT           ; day month year
    date2          = 2DIGIT "-" month "-" 2DIGIT         ; day-month-year
    date3          = month SP ( 2DIGIT | ( SP 1DIGIT ))  ; month day
    time           = 2DIGIT ":" 2DIGIT ":" 2DIGIT        ; 00:00:00 - 23:59:59
    wkday          = "Mon" | "Tue" | "Wed" | "Thu" | "Fri" | "Sat" | "Sun"
    weekday        = "Monday" | "Tuesday" | "Wednesday" | "Thursday"
                   | "Friday" | "Saturday" | "Sunday"
    month          = "Jan" | "Feb" | "Mar" | "Apr" | "May" | "Jun"
                   | "Jul" | "Aug" | "Sep" | "Oct" | "Nov" | "Dec"
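One way to accept all three formats is to try each in turn. The Python sketch below does this with datetime.strptime. The name parse_http_date is hypothetical, and the sketch assumes the process runs in the default C locale so that English day and month names are recognised.

```python
from datetime import datetime, timezone

# strptime patterns for the three date syntaxes, in order of preference.
_HTTP_DATE_FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",    # rfc1123-date
    "%A, %d-%b-%y %H:%M:%S GMT",    # rfc850-date (two-digit year)
    "%a %b %d %H:%M:%S %Y",         # asctime-date
)

def parse_http_date(text):
    """Parse any allowed HTTP date into a timezone-aware UTC datetime."""
    for fmt in _HTTP_DATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue                # try the next format
        return parsed.replace(tzinfo=timezone.utc)   # all times are GMT
    raise ValueError("unrecognised HTTP date: %r" % text)
```

All three example dates above parse to the same instant, 08:49:37 UTC on 6 November 1994.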
The Pragma header is usually "Pragma: no-cache" to tell the recipient not to cache the entity. I won't generate it.
The Authorization header provides information such as a password to access secure information. I will support basic password protection. Typically what happens is that after a request has been received, if a password is needed, the server returns a status code of 401 along with a challenge header looking like:
    WWW-Authenticate: Basic realm="WallyWorld"
The client must resend the request with an Authorization header such as
    Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
which contains a user id and password encoded as Base64[1]. (It decodes to "Aladdin:open sesame"). The syntax for the Authorization is:
    basic-credentials = "Basic" SP basic-cookie
    basic-cookie      = <base64 encoding of userid-password,
                        except not limited to 76 char/line>
    userid-password   = [ token ] ":" *TEXT
The client can send the Authorization with the initial request if it has already prompted the user for a password. See RFC1945 for more details for HTTP 1.0 or RFC2617 for HTTP 1.1.
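Decoding the credentials is a matter of undoing the Base64 encoding and splitting at the first colon. A minimal Python sketch, with the hypothetical name parse_basic_authorization and Latin-1 assumed for the decoded text:

```python
import base64

def parse_basic_authorization(value):
    """Decode the value of an Authorization header into (userid, password)."""
    scheme, _, cookie = value.partition(" ")
    if scheme.lower() != "basic" or not cookie:
        raise ValueError("not basic credentials: %r" % value)
    # validate=True rejects characters outside the Base64 alphabet;
    # the error it raises is a subclass of ValueError.
    decoded = base64.b64decode(cookie.strip(), validate=True).decode("latin-1")
    userid, colon, password = decoded.partition(":")
    if not colon:
        raise ValueError("missing ':' in credentials")
    return userid, password
```

The password may itself contain colons, which is why only the first one is used as the separator.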
The From header identifies the person sending the request.
    From: webmaster@w3.org
It's not normally used but I'll recognise it and pass it on.
The If-Modified-Since request-header field is used with the GET method to make it conditional: if the requested resource has not been modified since the time specified in this field, a copy of the resource will not be returned from the server; instead, a 304 (not modified) response will be returned without any Entity-Body.
An example of the field is:
    If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
I'll recognise but ignore this header.
The Referer header provides the URL from which the request originated, if appropriate. For example, if a user clicks on a link in a web page then the referer URL is the URL of that page. This is sometimes used to control access to pages, for example to prevent a page from being accessed unless the user has passed through a sign-on page.
An example is
    Referer: http://www.w3.org/hypertext/DataSources/Overview.html
I'll recognise the header and pass it on.
The User-Agent header identifies the kind of browser, or other program, that generated the request. I'll recognise the header and pass it on.
The Allow header is used in responses. I won't generate it. See the RFC for more details.
The Content-Encoding header is used to indicate if the entity is compressed or otherwise encoded. I won't generate it in responses. An example is:
    Content-Encoding: x-gzip
The Content-Length header provides the size of the entity in bytes, starting at the first byte after the CR-LF that terminates the headers. I will always generate a content length. An example is:
    Content-Length: 3495
The Content-Type header provides the MIME type for the entity. An example is:
    Content-Type: text/html
I will generate: text/plain, text/directory, text/html, image/jpeg, image/gif, image/png where appropriate.
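Choosing the type could be as simple as a lookup on the file extension. The table below is purely an assumption for illustration: the text does not say how types are chosen, the extensions are hypothetical, and text/directory is left out because the text does not say when it applies. Falling back to text/plain is also an assumption.

```python
import os

# Hypothetical extension table covering the types listed above.
MIME_TYPES = {
    ".txt": "text/plain",
    ".html": "text/html",
    ".htm": "text/html",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".png": "image/png",
}

def content_type_for(path):
    """Guess a Content-Type value from a file name's extension."""
    extension = os.path.splitext(path)[1].lower()
    return MIME_TYPES.get(extension, "text/plain")   # assumed default
```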
The Expires header is used in a response to tell the client how long to cache the document. I won't generate this.
The Last-Modified header provides the date and time when the entity was last modified. I'll generate this. An example is:
    Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT
Any other headers are allowed as long as their syntax is valid. I'll just ignore them.
A response looks a lot like a request.
    Full-Response   = Status-Line
                      *( General-Header
                       | Response-Header
                       | Entity-Header )
                      CRLF
                      [ Entity-Body ]
    Status-Line     = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
    Response-Header = Location | Server | WWW-Authenticate
The first line returns the status of the request: success, failure etc. A numeric status code is provided for programs to read. A textual version is provided for (some kind of) human readability in error messages.
The standard status codes are:
    Status-Code    = "200"   ; OK
                   | "201"   ; Created
                   | "202"   ; Accepted
                   | "204"   ; No Content
                   | "301"   ; Moved Permanently
                   | "302"   ; Moved Temporarily
                   | "304"   ; Not Modified
                   | "400"   ; Bad Request
                   | "401"   ; Unauthorized
                   | "403"   ; Forbidden
                   | "404"   ; Not Found
                   | "500"   ; Internal Server Error
                   | "501"   ; Not Implemented
                   | "502"   ; Bad Gateway
                   | "503"   ; Service Unavailable
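Putting the status line, headers and entity together might look like the Python sketch below. It is an illustration only: build_response is a hypothetical name, the reason table holds just a few of the codes listed above, and always appending Content-Length follows the stated intention to generate one for every response.

```python
# A subset of the standard codes, with reason phrases from the list above.
REASONS = {
    200: "OK",
    400: "Bad Request",
    401: "Unauthorized",
    404: "Not Found",
    500: "Internal Server Error",
    501: "Not Implemented",
}

def build_response(status, headers, body=b""):
    """Assemble a complete HTTP/1.0 response as bytes."""
    lines = ["HTTP/1.0 %d %s" % (status, REASONS[status])]
    lines.extend("%s: %s" % (name, value) for name, value in headers)
    lines.append("Content-Length: %d" % len(body))
    # Each line ends with CR-LF; an extra CR-LF ends the headers.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("latin-1") + body
```

A 401 response would pass the challenge as an ordinary header pair, for example ("WWW-Authenticate", 'Basic realm="WallyWorld"').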
Full details of the status codes can be found in the RFC. I'll just describe the few that the server will use.
200 OK: The entity follows in the Entity-Body section.
400 Bad Request: Something went wrong. The Entity-Body section is omitted.
401 Unauthorized: The client must supply a password to get the URL.
404 Not Found: You know what this means.
500 Internal Server Error: General cop-out.
501 Not Implemented: I'll have a lot of this.
The response headers provide extra details for the response itself such as elaborating on the status code. They are described in the following sections. I will only use the WWW-Authenticate header.
The Location header provides the new location for status codes that redirect the client elsewhere, such as the 30x codes. An example is:
    Location: http://www.w3.org/hypertext/WWW/NewLocation.html
The Server header provides identification for the server, e.g. its name and version. I won't be using this.
The WWW-Authenticate header is returned along with a 401 status code to request the client to authenticate itself. More details can be found in the section called The Authorization Header.
[1] See RFC1521 for a description of Base64 encoding.