The recent introduction of Gnutella to the Internet community has given us a glimpse of what is to come in real-time search technology. Many problems plague its current incarnation, but since it is in practice a technology demonstrator, it gives us a jumping-off point to improve on the concept and simultaneously provide a solid real-time search infrastructure. The key to making this work is a robust, reliable, and extensible way of providing results and access to the data.
The main objective of this design is to make the process of searching for data and documents as transparent and consistent to the end user as possible, regardless of what type of indexing system the target host uses. The secondary objective is to make the back-end as flexible as possible, so that content providers may use the most efficient method of providing data to the end-user, while protecting data that should not be made public.
The first key component of the system is the database itself. An indexer, which actually goes out and digs through the files, populates this database with pointers and descriptors. This indexer should be configurable to overlook specific files or directory structures, as well as to flag certain entities as requiring authentication and, if so, which method and ACL to use. In populating the database, it assigns each document a locally unique identifier as well as some sort of hash. The main objective of indexing documents into the database is to avoid touching the filesystem, and thereby reduce system load, every time a search comes in.
It should be noted here that the indexer/database combination could also be something OS-based, such as the indexer in Windows 2000 or Sherlock under Mac OS.
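A rough sketch of what such an indexer might look like; the exclusion list, the choice of SHA-1, and the record layout here are illustrative assumptions, not part of the design:

    import hashlib
    import os
    import uuid

    EXCLUDED_DIRS = {".private", "tmp"}  # directory structures the indexer should overlook

    def index_tree(root):
        """Walk a directory tree and build an index record for each document."""
        records = []
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune excluded directory structures in place so os.walk skips them.
            dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    digest = hashlib.sha1(f.read()).hexdigest()  # per-document hash
                records.append({
                    "id": uuid.uuid4().hex,  # locally unique document identifier
                    "path": path,            # pointer back to the document
                    "descriptor": name,      # a real indexer would extract richer descriptors
                    "hash": digest,
                    "auth": None,            # authentication method and ACL, if required
                })
        return records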
Once populated, the database is accessible to the actual search engine. At this point, using something standards-based such as LDAP seems like the best course of action, but with proper design this could realistically be any database, and could evolve over time. Internal to the search engine are multiple parsers for various types of queries. Examples could include simple keyword matching, boolean expressions such as boolean(rhubarb AND strawberry), regular expressions, or natural-language queries.
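A sketch of how the engine might split a typed query and hand it to the right parser; the type(body) form is borrowed from the Gnutella-encapsulation example near the end of this document, and the parser table is illustrative:

    import re

    def parse_query(raw):
        """Split a query like 'boolean(rhubarb AND strawberry)' into (type, body)."""
        match = re.fullmatch(r"(\w+)\((.*)\)", raw.strip())
        if match is None:
            return ("keyword", raw.strip())  # assumption: untyped queries are keyword searches
        return (match.group(1).lower(), match.group(2))

    # Each entry turns a query body into a database lookup; the names are illustrative.
    PARSERS = {
        "keyword": lambda body: body.split(),
        "boolean": lambda body: body,  # hand off to a boolean-expression evaluator
    }

    qtype, body = parse_query("boolean(rhubarb AND strawberry)")
    terms = PARSERS[qtype](body)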
The communications layer handles all traffic between the various hosts on the network; it receives search queries and passes them on to the search engine layer. The actual mechanics of the communication protocol are beyond the scope of this document and will not be discussed.
The internal HTTP server handles all of the document exchange for the
system, from serving up query results, to supplying files and documents to
the requestor.
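A minimal sketch of such a server using Python's standard library; the /results/ and /get/ paths are assumptions carried through the examples below, not a defined interface:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    RESULTS = {}  # search id -> XML result document, filled in by the search engine
    FILES = {}    # document id -> local path, filled in by the indexer

    class ExchangeHandler(BaseHTTPRequestHandler):
        """Serve query results as XML and documents as plain files."""

        def do_GET(self):
            if self.path.startswith("/results/"):
                key = self.path.split("/")[2].replace(".xml", "")
                self._reply(RESULTS.get(key, "").encode(), "application/xml")
            elif self.path.startswith("/get/"):
                path = FILES.get(self.path.split("/")[2])
                data = open(path, "rb").read() if path else b""
                self._reply(data, "application/octet-stream")
            else:
                self._reply(b"", "text/plain")

        def _reply(self, body, ctype):
            self.send_response(200 if body else 404)
            self.send_header("Content-Type", ctype)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), ExchangeHandler).serve_forever()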
The requesting client's user is searching for a document, such as a paper on ecological pest management. They enter a search query and select a query type (one of those mentioned above). The client assigns a locally unique identifier to the search and sends it out through the network, attached to a globally unique identifier for the requesting system. When a system receives this search, it passes it on to the search engine layer, and the search is performed.
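On the client side, a sketch of assembling such a query; using a UUID for the system's global identifier, a simple counter for search identifiers, and JSON as the wire format are all illustrative choices:

    import itertools
    import json
    import uuid

    SYSTEM_GUID = uuid.uuid4().hex    # globally unique identifier for this system
    _search_ids = itertools.count(1)  # locally unique search identifiers

    def build_query(qtype, text):
        """Attach local and global identifiers to an outgoing search."""
        search_id = next(_search_ids)
        message = {
            "system": SYSTEM_GUID,
            "search": search_id,
            "query": "%s(%s)" % (qtype, text),
        }
        return search_id, json.dumps(message)

    sid, packet = build_query("boolean", "ecological AND pest AND management")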
Upon finding a match, the search engine dumps all matches, with their descriptors, hashes, and identifiers, into an XML document that is then served up by the HTTP server. A packet is then sent back to the requesting client, telling it that the search was successful and that the results are at:
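    http://remote.host.example:8080/results/4a7bc90e.xml

(The host name and path shown are illustrative; the address simply points at the XML document the responding host generated for this search.)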
Since the requestor's client assigns a unique search ID to each query, the XML documents that result from it can be cached by the client for recall at a later date. As the client retrieves these XML documents from the responding systems, it parses them and assembles them into a page that can be displayed to the user. The benefit of using hashes here is that this page can group identical hashes together and present the user with a list of the multiple locations from which a particular file or document can be obtained. Alternatively, these can be kept hidden, and the client will attempt to get the file from the best available source. The client may also group files by type for ease of use.
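A sketch of that grouping step, assuming each parsed result carries the hash, a descriptor, and the URL it was retrieved from:

    from collections import defaultdict

    def group_results(results):
        """Collapse results with identical hashes into one entry with many sources."""
        groups = defaultdict(lambda: {"descriptor": None, "sources": []})
        for r in results:  # r is a dict with "hash", "descriptor", and "url" keys
            group = groups[r["hash"]]
            group["descriptor"] = group["descriptor"] or r["descriptor"]
            group["sources"].append(r["url"])  # every location this exact file was seen
        return groups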
When the user selects a file, the request is passed on to the system that hosts the file via HTTP or HTTPS, in a form resembling:
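    http://remote.host.example:8080/get/4f2a90d1

(Again, the path scheme is illustrative; the identifier is the one the remote host's indexer assigned to the document, and the transfer itself is ordinary HTTP.)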
Having locally unique search identifiers also allows the client to cache results if the user wishes to revisit those searches at a later point in time.
In response to some questions I've received: it is theoretically possible to make this work over the existing Gnutella protocol with a few slightly unorthodox methods. The first is that the search can be encapsulated in the existing search query field in a form such as boolean(rhubarb AND strawberry); the XML document containing the search results is then returned as a single result field.
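A sketch of that encapsulation; the assumptions are simply that the typed query travels verbatim in the standard query field and that the XML document comes back verbatim as a single result:

    def encapsulate_query(qtype, text):
        """Pack a typed query into the ordinary Gnutella search-query field."""
        return "%s(%s)" % (qtype, text)

    def decapsulate_result(result_field):
        """Recognize an XML result document returned as a single result field."""
        if result_field.lstrip().startswith("<?xml"):
            return result_field  # hand off to the normal XML result parser
        return None              # an ordinary Gnutella result; handle as usual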
This design is intended to fulfill the following objectives:

- Scalability: the system can represent anything from a lightweight server within a standard client up to large database clusters.
- Flexibility: it tries to maximize users' ability to find what they're looking for with a variety of query types, satisfying everyone from the casual user to the power user.
- Efficiency: by sending only a small result packet, overall load on the intercommunication layer is greatly reduced; once a result is returned, all further communications are via standard HTTP.
- Security: the design allows for the introduction of a secure communications layer for dealing with the actual search results.