Gnutella HTTP file-transfer, Recommendation, Jukka Santala, donwulff@nic.fi, 11 May 2000
Revised May 15th: HTML:ized, corrected some minor phrasing problems.
Revised May 17th: Oops, my bad, forgot < and > - hope I caught all.

FOREWORD

There's lots of confusion, and subsequent trouble, with the Gnutella file-transfer protocol currently. The implementation is currently perhaps best described as HTTP in name only; but from a practical point of view it would probably be best if it came closer to the HTTP spec, so that ready-written HTTP libraries and, to some degree, utilities could be used. For this reason the Gnutella HTTP file-transfer protocol is defined essentially as a subset of HTTP/1.1. In addition, the current chaos with regard to the implementation has caused several undesirable effects, and things don't seem to be getting better on their own, so this document outlines some recommendations for solving these problems as well.

Be aware that exactly implementing all the recommendations in this document (Or the HTTP specification) at the moment will render your servent almost inoperable in the current GnutellaNet. However, trust me when I say that standards are the only way to go: you can easily see many problems already being caused by deviations from the standards, and care has been taken in this document to outline the minimum changes that attain maximum standards compliance with minimum compatibility issues with current implementations.

The general protocol implementation rule, to be strict in what you send and lenient in what you interpret, is not fully useful if you want to maintain compatibility; instead, the guideline to apply is to be strict in interpreting requests and in anything you send, while being lenient (with extra sanity-checks) in interpreting responses. This guideline has been followed, as far as possible, in the recommendations herein. The reason is that the ability to reliably share files is central to the system, while failure to request files forces the failing servents to upgrade to a more standards-compliant version.

PROTOCOL

It is intended for the Gnutella file-transfer protocol to be a working subset of the HTTP 1.1 spec as specified in RFC 2616. Where this brief is in clear conflict with what is said in there, or there's a question of interpretation, what's said in the RFC overrules what's said in here. However, the Gnutella file-transfer protocol is intended to be only a subset of HTTP 1.1, and thus full conformance is not required. What is noted in this brief as required (Or described with the word SHOULD) should be the minimum required level of conformance.

This is not intended as a replacement for RFC 2616, but is written with the full knowledge that's how many people will probably use it. Please note that the header-separator used is "\r\n", or CR+LF, or 0x0d and 0x0a. Two consecutive pairs of these characters signify the end of the headers. Header names aren't case-sensitive, and should not be parsed as if they were, but be aware that some implementations may be picky about the case.
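As an illustration of the framing rules above, here is a minimal Python sketch that splits a received buffer at the first pair of CR+LF pairs and stores header names in lower case so that lookups ignore case. The function name parse_headers and the returned tuple layout are assumptions made for this example only; they are not part of any existing servent.

```python
def parse_headers(raw: bytes):
    """Split raw bytes into (first_line, headers, remainder).

    Header names are lower-cased so lookups are case-insensitive.
    Raises ValueError if the terminating CR+LF CR+LF is missing.
    """
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("headers not terminated by two CR+LF pairs")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if colon:  # silently skip lines that are not "Name: value"
            headers[name.strip().lower()] = value.strip()
    return lines[0], headers, rest
```

Whatever follows the blank line is returned untouched, so the caller can treat it as the start of the content.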

IDENTIFICATION

To be compliant with HTTP requirements, a Gnutella servent should use the following conventions as the first line of the headers of each individual reply or request:

Request:

GET <file-ID-string> HTTP/1.1\r\n

This is the form expected by well-behaved HTTP servers. Unfortunately, many Gnutella clients are hard-coded to expect (and send) the HTTP/1.0 identifier instead. This makes no sense, as many of the features already in use aren't covered by the HTTP/1.0 protocol, and the results are thus undefined. With the "Be conservative with what you send, and more allowing with what you receive" rule in mind, adjusted for this situation, Gnutella servents SHOULD accept any protocol identification beginning with "HTTP", including HTTP alone. For the time being, though, they MAY send HTTP/1.1 as the request identifier, but such requests are unfortunately likely to get rejected by some clients currently. It is recommended that if the received request begins with something else, the connection be immediately dropped with no further action or reply, to avoid protocol loops or interference.

Response:

HTTP/1.1 <Response-code>\r\n

The response-format is even more crucial than the request, because the Gnutella network's ability to offer files at all depends on it. A broken request-sequence just means people using it can't get what they want; a broken response messes things up for all of us, so pay attention here. For some reason, this is also the most grossly abused part of the HTTP protocol in Gnutella implementations. The response-headers SHOULD begin with HTTP/1.1 followed by a space, the three-digit response code and its textual equivalent. For parsing, a servent SHOULD accept anything beginning with HTTP to be compatible with earlier versions. Unfortunately many servents currently expect the reply to begin with exactly "HTTP 200 " and otherwise fail the download; their inability to parse HTTP shouldn't be your concern, and at least they will have an incentive to fix and upgrade their servents, but during the transition period it's possible to still send "HTTP <Response-code>" instead of "HTTP/1.1 <Response-code>". It is recommended that if the received response begins with something else, the connection be immediately dropped with no further action or reply, to avoid DoS attacks and protocol loops/interference.
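The lenient parsing rule for responses could be sketched in Python as below. The regular expression and the function name parse_status_line are assumptions for illustration; the only behaviour taken from the text is that any first line beginning with "HTTP" and carrying a three-digit code is accepted, and anything else is rejected so the caller can drop the connection.

```python
import re

# "HTTP", optionally followed by a version such as "/1.1", then a
# three-digit code and an optional textual reason.
_STATUS = re.compile(r"^HTTP\S*\s+(\d{3})(?:\s+(.*))?$")


def parse_status_line(line: str):
    """Return (code, reason), or None if the line is not an HTTP reply."""
    match = _STATUS.match(line.strip())
    if match is None:
        return None  # caller should drop the connection without replying
    return int(match.group(1)), match.group(2) or ""
```

Both "HTTP/1.1 200 OK" and the legacy "HTTP 200 OK" yield the same result, so a downloader written this way works against old and new servents alike.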

About response-code

Response-code is HTTP's way of indicating the result of the query made by the client. Contrary to popular belief, the response-code is NOT always 200. 200 means "OK", or "Request successful, go ahead". It should ONLY be used in that case. For the full list of possible responses, see the HTTP RFC; however, the most useful ones are:

200 OK - Request okay, client should proceed
206 Partial Content - Partial content request understood and accepted, following data is only requested section, not whole document.
400 Bad Request - Request was badly formulated; there was a syntax error. Fix the program, then try again.
404 Not Found - The requested URL (Index/File name pair) could not be located. The file has likely been re-indexed, repeat search and send new query with current data.
503 Service Unavailable - The servent is currently busy serving other requests; the condition is not likely to be permanent, and the client should try again in a few minutes.

These response-codes should always be used when sending. For parsing response-codes, 2XX generally means successful request, 4XX check your bearings, and 5XX try again after a while. Unfortunately many servents currently send 200 as the response in almost every situation; the data in such a response SHOULD be used as the whole file-content (Perhaps the file has changed, or the programmer has been too lazy to implement ranges?) instead of as a partial response, but it is up to the implementor what to do if other headers in the response indicate a partial response. In any case, during the transition period 200 could be accepted as file-continuation. If there's no Content-Length header, or the content returned is much shorter than expected, the client SHOULD assume this was actually an error-message sent by a broken servent and discard the read data.
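A rough Python sketch of the parsing advice above follows. The function names, the class labels and in particular the min_fraction threshold are assumptions for this example; the text only says "much shorter than expected" and does not give a number.

```python
def classify(code: int) -> str:
    """Map a three-digit response code onto the coarse classes in the text."""
    if 200 <= code <= 299:
        return "success"
    if 400 <= code <= 499:
        return "check-request"
    if 500 <= code <= 599:
        return "retry-later"
    return "unknown"


def probably_error_page(headers: dict, received: int, expected: int,
                        min_fraction: float = 0.5) -> bool:
    """Guess whether a '200' reply is really an error message.

    headers maps lower-cased header names to values.
    min_fraction is an assumed threshold, not a value from the text.
    """
    if "content-length" not in headers:
        return True
    return received < expected * min_fraction
```

A client would call probably_error_page only for replies classified as "success", and discard the data when it returns True.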

URI or file-ID-code

The file-identification URI for Gnutella hasn't caused problems in most implementations so far, but just for completeness' sake, the Gnutella protocol expects the URI to be "/get/<File-Index-From-Query-Response>/<File-Name-From-Query-Response>". An implementation SHOULD expect both parts to match in a request before starting a transfer. The index-number is necessary because nothing in the Gnutella protocol prevents files with the same name on a single host, and because not requiring the index-number could open the door to access to nonshared files. The file-name is required in case the shared files have been rescanned and the index-number now refers to a different file. Sending the wrong file when the receiving end expects a different one could lead to serious trouble.

To support download by normal HTTP clients and utilities, as well as maintain compatibility with HTTP libraries, an implementation SHOULD allow the usual HTTP URI escape-characters; + is equivalent to space, and % means the following two characters form the hex (Base-16) character-code of the intended character in the standard ISO-8859-1 character set. Because a lot of the old clients don't follow the + escaping convention, an implementation MAY consider + and space equivalent for purposes of matching the request-file-name.
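The URI rules above might be handled along these lines in Python. The names parse_get_uri and names_match are made up for this sketch; the %-unescaping uses the standard library's urllib.parse.unquote with ISO-8859-1, and the + handling follows the MAY clause by treating + and space as the same character when comparing names.

```python
from urllib.parse import unquote


def parse_get_uri(uri: str):
    """Return (index, filename) from '/get/<index>/<name>', else None."""
    parts = uri.split("/", 3)
    if len(parts) != 4 or parts[0] != "" or parts[1] != "get":
        return None
    if not (parts[2].isascii() and parts[2].isdigit()):
        return None
    # %XX escapes are decoded as ISO-8859-1 character codes.
    return int(parts[2]), unquote(parts[3], encoding="iso-8859-1")


def names_match(requested: str, shared: str) -> bool:
    """Compare file names, treating '+' and space as equivalent."""
    return requested.replace("+", " ") == shared.replace("+", " ")
```

A servent would look up the file by the returned index and start the transfer only if names_match agrees with the name stored for that index.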

SUPPORTED HEADERS

As a subset of HTTP 1.1, Gnutella file transfer protocol SHOULD support both GET and HEAD methods, while support for other methods on the data-serving end is voluntary (And makes little sense). The default and minimum required content-encoding support SHOULD be "identity" when nothing else is specified with headers.

This is why Accept-Encoding is listed in the required headers, although it never actually needs to be transmitted, as "identity" is the default, just like "Connection: Keep-Alive" is the default. User-Agent: and Server: headers are strictly speaking voluntary for sending as well, but make sense for getting around version-specific problems or extensions. In other words, it is NOT required to send every header. You should refer to the RFC to find out which headers are mandatory, but Content-Type: and Content-Length: are the bare minimum for a response, and for a request it depends on the response you want to get.

Gnutella file-transfer implementation SHOULD also expect to receive any HTTP 1.1 headers, however it need not actually process or conform to them, other than for the following set of required headers that are considered minimum requirements:

For request:

User-Agent: gnut/0.3.29 Gnutella/0.56	(No action required)
Connection: Close			(See discussion on persistence)
Range: bytes=<start>-			(See discussion on partial downloads)
Accept-Encoding: identity		(No need to send; no action required)

For response:

Server: gnut/0.3.29 Gnutella/0.56	(No action required)
Connection: Close			(See discussion on persistence)
Content-Type: application/binary	(May be file-type dependent)
Content-Length: <size>			(See discussion on content length)

In addition, the support of the following headers is very strongly recommended:

For request:

Range: bytes=<start>-<end>		(See discussion on partial downloads)
Range: bytes=-<end>			(See discussion on partial downloads)
Connection: Keep-Alive			(See discussion on persistence)

For response:

Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT	(SHOULD be made date of the received file)
Content-Range: bytes <start>-<end>/<size>	(See discussion on partial downloads)
Content-MD5: <Base64 MD5-128>		(See discussion on MD5)
Connection: Keep-Alive			(See discussion on persistence)

Other headers that might come useful or have been requested and could be considered for inclusion (See the HTTP 1.1 RFC for details):

Request:

Accept-Encoding: compress, deflate, gzip
If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
If-Unmodified-Since: Sat, 29 Oct 1994 19:43:31 GMT
If-Range: Sat, 29 Oct 1994 19:43:31 GMT

Response:

Content-Encoding: compress, deflate, gzip
Content-Location: http://www.myhost.com/mirror
Location: http://www.myhost.com/getithereinstead

PERSISTENCE

This is a brief description of the implications of the "Connection: Keep-Alive" header. It is voluntary for a request, but if it's used, the responding servent should either reply with a "Connection: Close" header or wait for further requests over the same connection after finishing the first. Either party can signal closing of the link after the associated data with a "Connection: Close" header. HTTP/1.1 defaults to Keep-Alive, so if no Connection: header is given in the request, response headers SHOULD include "Connection: Close".
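One way to read the rules above, sketched in Python: the responder answers with Keep-Alive only when the request explicitly asked for it and the servent actually supports persistence; in every other case, including a request with no Connection: header, it sends Close. The function name and the keep_alive_supported flag are assumptions for illustration.

```python
def response_connection(request_headers: dict,
                        keep_alive_supported: bool = False) -> str:
    """Pick the Connection: header line for a response.

    request_headers maps lower-cased header names to values.
    """
    asked = request_headers.get("connection", "").strip().lower()
    if keep_alive_supported and asked == "keep-alive":
        return "Connection: Keep-Alive"
    return "Connection: Close"
```

With the default keep_alive_supported=False this always produces "Connection: Close".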

It is generally recommended that "Connection: Close" always be sent in response-headers to model current standard servent behaviour, but perhaps at a later date, when clients better obey the Connection: header, persistence could be implemented, especially for the benefit of the MD5 queries.

CONTENT LENGTH

This is pretty self-evident; the "Content-Length:" header is used by the responding servent to indicate how long the content following the headers is. The length is expressed in bytes, and the content is considered to begin from the byte immediately following two consecutive \r\n pairs. After receiving that many bytes the client should disconnect, unless it has requested a persistent connection and has more requests to make. Content-Length also applies to the HTTP-format entries describing the error-codes returned, and in case of file-continuation, should be the number of remaining bytes (Intended to be sent) and not the size of the full file. In summary, the "Content-Length:" header SHOULD exist in every response, and specify the exact number of bytes following the headers and their termination characters before the next set of headers, or before the disconnect, if there are none.
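A small Python sketch of the receiving side: read exactly Content-Length bytes and stop, leaving anything after that for the next response on a persistent connection. The function name read_body is an assumption; stream can be any object with a read(n) method, such as the file object returned by socket.makefile("rb").

```python
import io


def read_body(stream, headers: dict) -> bytes:
    """Read exactly Content-Length bytes of content from stream.

    headers maps lower-cased header names to values.
    Raises EOFError if the peer disconnects early.
    """
    remaining = int(headers["content-length"])
    chunks = []
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("connection closed before full content arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Usage with an in-memory stream standing in for a socket.
demo = io.BytesIO(b"helloNEXT")
body = read_body(demo, {"content-length": "5"})
```

After the call, body holds the five content bytes and the bytes "NEXT" are still unread in the stream.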

PARTIAL DOWNLOADS

This has already been touched on briefly. Essentially, the request will include a header "Range: bytes=x-y", where y may be left out to mean "through the end of the file". Be aware that in RFC 2616 the form "bytes=-y" requests the last y bytes of the file, not the bytes from the beginning up to y; to start from the beginning of the file, send "bytes=0-y". The HTTP 1.1 protocol specifies support for multiple ranges, but as this requires MIME encoding and parsing, which I'm sure most Gnutella clients aren't willing to deal with, it is recommended that requests with multiple ranges (separated by commas) be responded to with "400 Bad Request", or with "200 OK" and the entire content once proper HTTP conformance becomes common. If the server successfully parses the partial request and is ready to fulfill it, it SHOULD reply with "206 Partial Content".

The responding servent MAY also include a "Content-Range:" header to confirm that the partial request has been honored, and to remind the receiving party of the range being sent. Following RFC 2616, the format is "Content-Range: bytes <first byte>-<last byte>/<size>", where <size> is the full size of the file. Note that the number of bytes actually sent, which is what Content-Length reports, is <last byte> minus <first byte>, plus one!

Many servents currently support only specifying starting byte, and will always send to the end of the file. In such cases it's okay to drop the connection once wanted bytes have been received, but all servents SHOULD eventually support specifying both beginning and end of the range.
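A serving servent's Range: handling might look like the Python sketch below. It is an illustration under stated assumptions: byte positions are zero-based as in RFC 2616, the "bytes=-N" form is read the RFC way (the last N bytes of the file), multiple ranges are refused by returning None so the caller can answer 400 or 200, and the function name parse_range is made up for this example.

```python
def parse_range(value: str, size: int):
    """Parse the value of a Range: header for a file of `size` bytes.

    Returns (first, last) as inclusive zero-based offsets, or None when
    the range is malformed, unsatisfiable, or lists multiple ranges.
    """
    unit, eq, spec = value.strip().partition("=")
    if unit.strip().lower() != "bytes" or not eq or "," in spec:
        return None
    start, dash, end = spec.strip().partition("-")
    if not dash:
        return None
    try:
        if start == "":
            # Suffix form "bytes=-N": the last N bytes (RFC 2616 reading).
            count = int(end)
            if count <= 0:
                return None
            first, last = max(size - count, 0), size - 1
        else:
            first = int(start)
            last = size - 1 if end == "" else min(int(end), size - 1)
    except ValueError:
        return None
    if first < 0 or first > last:
        return None
    return first, last
```

The number of bytes to send for a returned pair is last - first + 1, which is also the value to put in Content-Length.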

CHECKSUMS

Here we come to the tricky part... As the GnutellaNet grows, more and more files with the same name are offered, or even files with slightly differing content or outright errors, often due to deviations from the HTTP transfer protocol standards or outright transfer errors. With these challenges in mind, the task became to design an extension to the Gnutella protocol that allows ensuring transfer integrity, detecting errors, and fixing old transfer errors against originals. Standards compliance remained an important consideration in all this, and here's an outline of the recommendations. First note that the Gnutella standard for partial downloads has been extended to a fully standards-compliant begin and end byte definition earlier in this document; this is required for checksum operation and transfer-fixing, and must be implemented before embarking on checksum work.

Content-MD5

"Content-MD5:" is the standards-specified way of communicating the expected checksum for the data being transmitted. The actual checksum is a 128-bit MD5 checksum coded into Base-64. See RFC 1864 for general information on this header and references to more details. You should look elsewhere for detailed explanations of both the MD5 algorithm and Base64 encoding, but suffice it to say that MD5 is a basic 128-bit block-hash algorithm that is intended to make it computationally infeasible to produce the same hash for different sets of data.

Base64, on the other hand, is a way of encoding binary data with printable characters from A to Z, a to z, 0 to 9, + and / (64 alternatives) to signify each 6-bit part of the binary input, and = to pad the final string to a multiple of 4 characters. This format is familiar from uuencode and MIME. GNU implementations can be found in the GNU TextUtils package, while a more commercially viable implementation of MD5 can be found on RSA Inc.'s pages. The Content-MD5: header, if present, SHOULD give the MD5 checksum for only the actually requested data, not the whole file.
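For illustration, the header value can be produced with Python's standard hashlib and base64 modules as sketched below. The function name content_md5 is an assumption; the data passed in should be only the bytes actually being sent, per the recommendation above.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Return the Base64 form of the 128-bit MD5 digest of `data`."""
    digest = hashlib.md5(data).digest()  # 16 raw bytes
    return base64.b64encode(digest).decode("ascii")
```

The result is always 24 characters long, since 16 bytes encode to 22 Base64 characters plus two = padding characters. A receiver can run the same function over the bytes it got and compare strings.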

Giving HEAD

To make the above discussion useful, there needs to be a way to get the MD5 checksum for potential candidate files. Sure enough, you can compare your MD5 hash against the MD5 hash in the header after a successful transfer, but that only takes you so far. Also, as an intermediate fix to the problem, you can formulate a GET request for the range you want to check, and then close the connection if the MD5 hashes agree. However, this way of doing things is a bit dirty and SHOULD NOT be used, as it will cause lots of extra book-keeping for the servents, and extra traffic, as some of the data will already be queued and sent over the link.

To fix this problem, all servents should support a file-request with the HEAD method in place of the standard GET method. With this method, the behaviour is otherwise exactly the same, but sending the actual data is skipped. Instead, only headers are sent. Confusingly enough, these headers SHOULD include the real Content-Length, but the servent on the other end SHOULD NOT try to read that content; it should instead keep track that it made a HEAD request, which has no content.
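A request builder along these lines shows that HEAD differs from GET only in the method word. This is a Python sketch, not a complete client: the function name build_request is an assumption, and it emits only the Range: and Connection: headers discussed in this document.

```python
from urllib.parse import quote


def build_request(method: str, index: int, name: str,
                  first=None, last=None) -> bytes:
    """Build a GET or HEAD request for /get/<index>/<name>."""
    # Escape the file name with %XX codes in ISO-8859-1.
    path = "/get/%d/%s" % (index, quote(name, safe="", encoding="iso-8859-1"))
    lines = ["%s %s HTTP/1.1" % (method, path)]
    if first is not None:
        lines.append("Range: bytes=%d-%s" % (first, "" if last is None else last))
    lines.append("Connection: Close")
    # Headers end with two consecutive CR+LF pairs.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")
```

After sending build_request("HEAD", ...), the client reads the response headers only and must not wait for Content-Length bytes of content.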

Extended checksum protocol

For heavy-duty error detection an improved checksum scheme is needed. The one above is described mostly because it's the HTTP-standard way, and arguably implementable and supportable using standard HTTP tools, servers and clients. But since with Gnutella we can extend our own protocol, within the standards, it makes sense to allow for a more powerful version. The extension is specified in the request by using a different form of URI, but keeping the GET method. In other words, the request will become:

GET /md5/<index>/<filename> HTTP/1.1\r\n

And the response will be a block of data consisting of an arbitrary number of 128-bit binary MD5 checksums for the range specified for the file in the request (Or whole file, if no range specified). The recommended number of checksums per request is 16, and the reply works by dividing the requested data into that many roughly even-sized blocks and computing a separate MD5 checksum for each. The block division is done at byte boundaries, and when not evenly divisible, rounded down. Thus for a file of 55000 bytes, 55000 divided by 16 is 3437.5, and the first returned ranges would be:

1 - 3437
3438 - 6875
6876 - 10312
10313 - 13750

In other words, to get the end of a range, take (size/16)*range rounded down. The beginning of each range is ((size/16)*(range-1)) rounded down, plus one, which even works for range #1, where (X*0)+1 is 1. (Although, in your system, ranges and/or bytes may be numbered starting from 0, in which case you will have to work out the translation). MD5 checksums are computed for each range separately, and then appended one after another in the HTTP reply. If a number of ranges other than 16 is used, the count can be found by dividing Content-Length by 16 (the size in bytes of a 128-bit MD5).
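The range formulas can be written with integer arithmetic, as in this Python sketch. The names block_ranges and block_md5s are assumptions for illustration; ranges are 1-based and inclusive, matching the numbering used in the text, and floor((size/16)*n) is computed as (size*n)//16 to avoid floating point.

```python
import hashlib


def block_ranges(size: int, blocks: int = 16):
    """Return 1-based inclusive (start, end) byte ranges for each block."""
    return [((size * (n - 1)) // blocks + 1, (size * n) // blocks)
            for n in range(1, blocks + 1)]


def block_md5s(data: bytes, blocks: int = 16) -> bytes:
    """Concatenate the raw 16-byte MD5 digest of every block of `data`."""
    out = b""
    for start, end in block_ranges(len(data), blocks):
        # Convert the 1-based inclusive range to a Python slice.
        out += hashlib.md5(data[start - 1:end]).digest()
    return out
```

For 55000 bytes, block_ranges starts with (1, 3437) and (3438, 6875) and the last range ends at byte 55000. The reply body from block_md5s is 16 * 16 = 256 bytes, so dividing its length by 16 recovers the number of ranges. When the data is shorter than the number of blocks, some ranges come out empty and hash as the empty string.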

After you've located the sub-range(s) the error/difference is in, you can send another request for MD5 sums inside that range. With 16 ranges, a reply carries 16 checksums of 16 bytes each, or 256 bytes (2048 bits), so subdividing a block smaller than that makes no sense: the checksums would be larger than the data that would be transferred, and you're better off just transmitting the data. Of course, in practice you should stop well before that.
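Locating the sub-ranges that differ comes down to comparing the received checksum block against one computed locally, 16 bytes at a time. A Python sketch, with the function name differing_blocks assumed for illustration:

```python
def differing_blocks(local_sums: bytes, remote_sums: bytes,
                     digest_len: int = 16):
    """Return the 1-based numbers of the blocks whose checksums differ."""
    if len(local_sums) != len(remote_sums) or len(remote_sums) % digest_len:
        raise ValueError("checksum lists are not comparable")
    count = len(remote_sums) // digest_len
    return [n + 1 for n in range(count)
            if local_sums[n * digest_len:(n + 1) * digest_len]
            != remote_sums[n * digest_len:(n + 1) * digest_len]]
```

Each returned block number identifies a range that can be requested again, either as MD5 sums for a finer subdivision or as plain data.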

When using this comparing-protocol, be aware that an error-segment (Such as continuation-data accidentally downloaded from a different file with the same name) can span several sub-blocks in the MD5 sums. In such a case, it could make sense to save the overwritten data into a separate file (With nul-padding at the beginning) in case the "wrong" file is ever found and the user wants to download it. This will of course depend on the size of the incorrect block and user preference.

A trickier but common situation is when a file contains more than one error location; too coarse a subdivision could conclude that everything between them is in error, while the real errors could be a one-bit difference at each end. To check for this, you should always fit at least a few MD5 calculation blocks between the ends of a found error location. This can be done by sending a request for MD5 subdivision of the range between the known beginning and end of the error section.

A one-bit error in a 100-megabyte file can be narrowed down to a block of less than 2K in four MD5 subdivision requests. However, while bandwidth-saving, the MD5 requests are relatively CPU-intensive, so some DoS-attack protection should be added, in recognition that it doesn't take many requests to properly locate a transfer error.
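The "four requests" figure can be checked with a few lines of Python. The function name rounds_needed is made up for this sketch, and it assumes a single error location, so each request narrows the suspect region to one sixteenth (rounded up) of its previous size.

```python
def rounds_needed(size: int, target: int = 2048, blocks: int = 16) -> int:
    """Count subdivision requests until the suspect block is under target."""
    rounds = 0
    while size >= target:
        size = -(-size // blocks)  # ceiling division
        rounds += 1
    return rounds
```

For 100 megabytes the block sizes go 6553600, 409600, 25600 and then 1600 bytes, so a servent that allows only a handful of MD5 requests per file and client would still let an honest client finish.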