Databases and Search Tools on the World Wide Web

Communication information
Allen S. Firstenberg
Graduate
1 Dec 1995

PREFACE

In this paper, I examine several indexing and searching tools currently in use on the World Wide Web to determine the best and most important features each has to offer, and then develop a profile of an improved tool, including an outline of possible methods of implementation.

OVERVIEW

The paper is divided into two major components. In the first part, I present a comprehensive review of search tools available on the World Wide Web (WWW), examining the features they provide to the public that uses them, as well as several indexing tools available for building a searchable database. Based on the results of these surveys, the second part contains a proposal for a new searching tool that provides the best features of existing tools and extends their abilities to give feedback that current tools do not.

Introduction

Two of the primary obstacles to any information sharing scheme, from early libraries to documentation on the Internet, are the conflicting requirements of comprehensive information location and information overload reduction. In short, information accessors need to be able to find exactly what they need -- no more and no less.

To solve these problems, indexes and categorizations have been designed over time to narrow searches down to a manageable set of information, but to also allow for "browsing" to the specific content required. This concept has been translated to the WWW in various forms -- but each lacks certain features that could further enhance searching capabilities. I explore current search tools, and propose a tool that I feel further empowers people while they are in the browsing phase of their search.

Search and Indexing Tools

We first look at existing methods to search for information, and briefly examine the database or indexing facilities that are required to support such searches. These will give us features, most notably user interaction requirements, that will be necessary in designing an updated tool.

Pre-Internet Searching and Indexing

Libraries, references, and card-catalog systems existed long before the Internet created its own requirements for information searching. Search methods were developed around the manual searching and sorting tools that have existed since the 1800s, when modern scholarly publications became popular.

Books

For books, two popular cataloging and indexing methods were created -- the Dewey decimal system and the Library of Congress system (LOC). Both use hierarchical methods that pick broad categories and then break them down into more specific sub-topics. While both are indefinitely extendable, the LOC method is generally regarded as more capable of expansion as new categories are created.

For indexing, the traditional method has been to create "cards" for each book in a collection. Typically a title card, an author card, and multiple subject cards are created for each book, indicating the section in which the book may be located. Additionally, cards contain "tracings" which indicate other possible references to investigate. Typically, these cards (or at least the information for each card) and their categorizations are established by the Library of Congress when a book is copyrighted.

Journals and Magazines

Scholarly journals and papers are indexed through slightly different means. Most papers are collected in individual journals which are organized by the methods above and become a first cut for people wishing to stay abreast of a field. Locating information in a particular journal allows a researcher to read similar journals to find similar papers.

Papers also traditionally have keywords associated with them that are assigned by the author. These keywords may then be used as part of an index that is then compiled for one or a series of publications.

Historic Internet Tools

The Internet has contained information repositories since early in its existence, and has always required methods for people to locate this information.

NLS/Augment

As early as 1973, the ARPAnet was being used to exchange and index documentation about itself. Researchers at the Stanford Research Institute under Douglas Engelbart created a network-based "journal" containing all correspondence that occurred on the network.

Messages and all documents were linked to each other, and automatically indexed. Indexes were generated by author, subject keyword, and other keywords that the author could enter for a document. Searching tools provided linked references to documents.

FTP and Archie

While NLS was a somewhat proprietary format, the File Transfer Protocol (FTP) was one of the core application protocols provided in the early days of the Internet. With the adoption of an informal method to access file space without authentication, FTP became a frequently used way to exchange a wide variety of information and software.

Indexing, however, was practically non-existent for many years. Informal theme-based indices were distributed (usually from other FTP sites) containing lists of "well known" sites and their contents, but no effort was made to formalize these lists or the contents of files until the Archie service was unveiled. Archie was created as a resource discovery and indexing system that surveyed anonymous FTP sites and cataloged their files.

What Archie is unable to do, however, is analyze files for content. Since only file and directory names are known, queries are done against names of files or programs. Searches return all matches, placing "best matching" and most recent files at the beginning of the list.

Usenet

Usenet has served as an interactive information forum, where questions and answers can be posted and discussed in a wide variety of categories. Even in its early days, it became an information repository that required extensive filtering techniques so readers didn't suffer from information overload.

The first of these was the hierarchical grouping system that allowed users and sites to select the categories of groups they were interested in. Under each hierarchy, words were used to further break down the categories of discussion, with clear procedures used to establish new groups in the hierarchy. Several research communities, such as those in computer science and biology, established their own hierarchies early in Usenet's history, with later groups establishing themselves under broader categories. Either way, groups and hierarchies provide a very rough first cut for information filtering.

Users can then apply additional filters to the articles in their selected groups. These filters are traditionally known as "killfiles" because they automatically kill undesired articles based on information contained in the article header. Traditionally, articles include keyword and subject headers, which the author either specifies or inherits from referenced articles.

Gopher and Veronica

The Gopher protocol was designed as a way for users to "tunnel" between Internet servers containing information. It was built around a Campus Wide Information System (CWIS), but was intended to be scalable across the entire Internet. It makes the claim of "combin[ing] the best features of browsing through collections of information and fully indexed databases". [GPH]

Although Gopher servers were supposedly all registered, it was still not easy to navigate through a hierarchy to the centralized list of servers, and this still required knowledge of which servers were "well known" for which resources. The Veronica system was designed to provide a keyword search of menu titles at all the Gopher servers. Like Archie, however, Veronica does not index the content of most pages - only the titles.

Summary of Some Current Internet Search Tools

With the popularity of the World Wide Web, a number of sites have popped up to provide a quick reference to information that is available on the web. Each site is notable because of the interface features provided to the user. The sites examined are not comprehensive, but are indicative of the interfaces and features available.

Early Web Indexes

The earliest web indexes were all compiled by hand by organizations who were interested in promoting use of the web. Most notable among them were the list of web servers organized at the home of the WWW, and the "What's New" index that was organized at the home of the Mosaic browser.

There was no searching or indexing performed on these resources, and as the web grew, both became impossible to maintain and use because of the volume of new web sites and lack of filtering.

Yahoo

Yahoo provides a user-contributed or administrator-researched list of web sites, features, and Internet resources. Each item has a text description of the link. Users can either enter keywords to search for in the title or descriptive text, or navigate a category based hierarchy. Each page contains a list of links and descriptive text that belongs in one of the hierarchical categories.

The hierarchy is a text-based subject pseudo-tree: users can follow links down, but some branches actually link to other branches under a hierarchy that was deemed "more appropriate". It is easy to move up to any level of the hierarchy, since the title of the current page points to every previous page in the hierarchy.

The search system looks for substrings in the text and title and returns a list of matches, their category, and the descriptive text. A user-settable number of items may be returned, and items beyond this number are inaccessible except by re-running the search. From the search results, users can go directly to the site or to the categories page to investigate related sites.

Lycos

At its heart, Lycos provides searchers with a keyword-based search on full pages. Collection is done by a web-wandering "spider" that stores and indexes the full text of a page, but there is a form for users to suggest pages that the spider should visit. As a side benefit, the spider also determines the 250 sites most linked to and makes this list available.

Searches are performed by keyword, and the user can indicate how "loosely" matches are performed. Results are normally presented ten at a time, though the user may request ten to forty results per page, with links to the next and previous sets. Users may also control how much information is presented for each item in the results, including an abstract, summary, fragment, and/or size of the page, the words that were matched in the page, and a "relevancy" score between zero and one, along with the title and a link to the page being referenced.

Open Text

Open Text is similar to Lycos in many of its features, but has a wider variety of searching and reporting controls. Searches can be simple, have elaborate boolean search conditions, or indicate which keywords are more important than others.

Entries in the results contain a relevance score, and either the first few lines of the page, a page fragment surrounding the keywords, or a list of all the headings that the page contains as a summary. Each search result can be turned into a new search of "similar pages" by using the page to seed the list of keywords to search for.

Excite

Excite allows searches similar to Lycos and Open Text, but has its own variants on keyword entry and information display. Standard full-text keyword searches are augmented by the ability to enter "words describing a concept" which is more of a search for pages that contain the keywords specified as well as related keywords.

Results are presented in one of two modes, and the user may switch between the two modes without re-performing the search. The "confidence" view is similar to the results of a Lycos search with the addition of a link that will search for pages that are similar to a selected item. The "site" view lists just links to sites, but collects the page references in a hierarchical tree indicating the structure of pages at a site, indicating which sites have pages that are more or less relevant to a particular topic, and which area of the site contains the best information.

Deja News

Instead of providing indexes to the more conventional part of the web, Deja News provides a full-text keyword search over many Usenet groups. More notable, however, are the filtering methods that may be applied when doing a search. Articles can be removed from the search based on their age or the groups they do or do not belong to. Results are presented in either a terse or verbose format and sorted by relevancy score, Usenet group, date published, or author. The credibility of an author is indicated by showing that person's posting history.

Hyper Base

Hyper Base is similar to Yahoo, since it provides a hierarchical subject list and works from user contributions. Hyper Base is less structured, however, and new topics in the hierarchy and entries for each topic are automatically created upon user request, creating a somewhat haphazard effect. There is no search capability.

IBM infoMarket

IBM's infoMarket has a number of lofty claims, but was untestable at the time this survey was done. It aims to provide indexes for graphics, video, and sound, as well as the more mundane text, and has the future goal of being able to "securely distribute content and receive payment".

Currently, its only notable feature is that it provides a single, unified search of a number of other web databases, presents their results, and indicates which databases produced which hits and any overlapping results. Users can use this additional dimension to help determine which pages are of the most value.

Infohiway

Under the assumption that users are either trying to locate web sites in a particular geographic region or know the first two letters of a site's name, Infohiway claims to have "pre-searched" other databases and eliminated "complex search expressions." In reality, the interface is a set of hierarchical pages organized alphabetically. It is not possible to search the descriptions of sites.

Inktomi

Inktomi is like other keyword search systems, but uses an easier query language. Keywords are specified, with mandatory keywords being prefixed by a + character, and keywords that should not be on a page being prefixed by a - character. Relevancy results are indicated graphically and rated between zero and ten, instead of printing a numeric value between zero and one.
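
To make the flavor of this query language concrete, the following is a minimal sketch (in Python) of how keywords prefixed with + and - might be interpreted against a set of pages. The toy page data and matching logic are illustrative assumptions, not Inktomi's actual implementation.

    def parse_query(query):
        """Split a query into required (+), excluded (-), and optional keywords."""
        required, excluded, optional = set(), set(), set()
        for term in query.split():
            if term.startswith("+"):
                required.add(term[1:].lower())
            elif term.startswith("-"):
                excluded.add(term[1:].lower())
            else:
                optional.add(term.lower())
        return required, excluded, optional

    def matches(page_words, required, excluded):
        """A page matches if it contains every required word and no excluded word."""
        return required <= page_words and not (excluded & page_words)

    # Toy in-memory "index" of pages, purely for illustration.
    pages = {
        "dragons.html": {"dragon", "cave", "treasure"},
        "castle.html":  {"castle", "king", "dragon"},
    }
    req, exc, opt = parse_query("+dragon -king treasure")
    print([name for name, words in pages.items() if matches(words, req, exc)])
    # ['dragons.html']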

MathSearch

Although containing only math-related research, the MathSearch index provides an example of features that are targeted at a specific audience. The instructions are mathematically inclined (describing terms and results in terms of sets), and the help and error messages are more polite than those of standard search tools.

Although searching is still keyword based, it only matches documents that have the keywords in the same sentence. The results page has similarly made the concept of relevance clearer. Instead of assigning an apparently arbitrary number, the number of matches in each document is returned as an absolute number.
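
The following is a small sketch of the kind of same-sentence matching and absolute match counting described above, assuming a naive sentence split; it is illustrative only and not MathSearch's actual code.

    import re

    def same_sentence_matches(text, keywords):
        """Count the sentences in which every keyword appears together.

        Sentence splitting here is a crude period/question/exclamation
        split, purely for illustration."""
        keywords = [k.lower() for k in keywords]
        count = 0
        for sentence in re.split(r"[.!?]+", text):
            words = set(re.findall(r"[a-z]+", sentence.lower()))
            if words and all(k in words for k in keywords):
                count += 1
        return count

    doc = ("Banach spaces are complete normed spaces. "
           "Hilbert spaces are Banach spaces with an inner product.")
    print(same_sentence_matches(doc, ["Banach", "spaces"]))   # 2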

WebHunter / WebHound

Although it is currently down, and in search of a name, this search tool is different from all others since it is based on intelligent agent techniques. According to the documentation, it picks up patterns for a particular user and learns which pages are their favorites. From these, and from other user input, it makes suggestions about which other pages may be of interest.

Summary of Some Current Internet Indexing Tools

WAIS

The WAIS protocol was one of the original online content indexing tools, and is the cornerstone of WAIS Inc.'s online publishing initiative. The WAIS system is a combination of a database format and a standard protocol for clients to query the WAIS database server. The WAIS server and protocol offer more functionality than is examined here; the features described are the ones considered key to our application.

WAIS Inc. distributes a free version of its product, which is reputed to create indexes roughly equal in size to the original data files. The commercial version will index files besides text documents, and claims to create indexes that are 30-50% the size of the original data. Indexes may be incrementally generated as new documents are added to the archive. Stopwords are customized by each server maintainer, and the system uses various stemming techniques to derive variations of a queried word.

The server accepts queries structured as natural language questions, quoted keywords, boolean expressions, and references to other documents. Words and terms may be weighted, wildcarded, factored as part of the density of a document, and applied to specific fields in the source document if they existed. Results give extensive information about the query and documents returned.

Open Text

The Open Text text retrieval system is the commercial text database behind their indexing site. It is similar in many ways to the Wais product in its public features, but is more web oriented and does not use the WAIS protocol.

The Open Text database is structured somewhat differently, however. Instead of creating an index of words that point to various documents, the database is of the documents themselves, with methods to do string searching at the time requested. [OT2] Surprisingly, this still produces searches that are remarkably fast. This design has the side effect that there are no stop-words, and even the most common words and phrases may be part of a search.

Glimpse

Glimpse is a freely available indexing tool from the University of Arizona with a goal of duplicating the UNIX grep command over a set of indexed files. Queries in grep are normally substring or string pattern matching searches, which is extended in Glimpse to include approximate matching and boolean queries.

Unique to Glimpse is the ability for the administrator to make a trade-off between the size of the index and the speed at which searches are accomplished. Search results are also similar to grep, since they are pattern matching and not relevancy based. Indexes can be rapidly created and incrementally updated.

SWISH

SWISH is another freely available index, with licensing restrictions, available from the EIT Corporation. Its goal is to be a basic tool with a small index, at the sacrifice of some fancier features. Stemming and synonyms are not supported, and search queries are composed of a basic boolean expression. Stopwords are drawn from a provided list, or automatically generated for "common words" that exceed an administrator-defined threshold.
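
As an illustration of the automatic stopword idea, the sketch below treats any word appearing in more than an administrator-chosen fraction of documents as a stopword; the 60% threshold and the data layout are assumptions for the example, not SWISH's actual defaults.

    from collections import Counter

    def auto_stopwords(documents, threshold=0.6):
        """Return words that occur in more than `threshold` of all documents.

        `documents` is a list of word lists; the 60% default is an
        arbitrary illustrative value, not SWISH's actual default."""
        doc_freq = Counter()
        for words in documents:
            doc_freq.update(set(w.lower() for w in words))
        cutoff = threshold * len(documents)
        return {word for word, freq in doc_freq.items() if freq > cutoff}

    docs = [
        "the dragon guards the cave".split(),
        "the knight enters the castle".split(),
        "a map of the dungeon".split(),
    ]
    print(auto_stopwords(docs))    # {'the'}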

Results consist of some basic indexing information, and a relevancy number, file size, and title and path of matching documents. Since SWISH is designed to work on the Web, it can be configured to give a higher weighting if a word is found inside certain HTML tags.

An Improved Tool

Based on these search engines and tools, we will formulate the "best of the best" requirements and features, examine alternate approaches for some of their goals, and formulate a basic design for a new tool. Specific implementation and design details are not examined at this time. The tool as a whole is tentatively called Guide.

Intended Application

Guide is motivated by a project in user-contributed hypertextual fiction known as Addventure. Addventure currently has over 20,000 pages arranged in a cyclic, directed graph with unweighted edges, but they are usually accessed more as a tree, with cycles tending to be the result of returning to an earlier part of the tree. Users have expressed a desire to find pages in the story that have themes they are interested in, and to avoid themes they are not.

Requirements

Borrowing from some traditional systems, Guide must be able to index the content of the pages it covers, accept searches by keyword or by broader key-subject, and rank matching pages by relevance.

Since Addventure is a hypertextual story, it may not make sense for a reader to jump in at a particular page. This radically changes the format of search results: rather than presenting a flat list of matching links, Guide uses the results to point the reader along paths toward matching pages. This is the unique feature of Guide.

These features of Guide merge the best capabilities of hierarchical organization and searching, in a fashion related to Yahoo's search results. "Directed browsing" in this fashion gives the reader the freedom to travel wherever they wish, but with some basic information indicating where they may wish to go.

Although using existing tools is desirable, since there is no point in reinventing existing systems, commercial products are undesirable because Addventure is provided as a free service.

Design

The Guide system is composed of five major components:
  1. A Subject Editor to allow sets of keywords to be collected into a subject
  2. The Indexer which creates the searchable index from the pages
  3. The Searcher which accepts a query and returns matching pages
  4. A Path Computer which determines the shortest path between the current point and all targets
  5. And finally, the Guide, which takes the results from the Searcher and Path Computer, loads the original document, and annotates it for the reader with guidance suggestions

Indexer

The Indexer is responsible for creating the keyword index of all the pages participating in Guide. Since the survey has indicated that both SWISH and Glimpse are good candidates for a first implementation, our plans are to use SWISH because it is already in use for other portions of Addventure and because of its flexibility with HTML.

Searcher

The choice of SWISH for an indexing system drives the use of the SWISH query tool as a back end for a boolean query. The SWISH query program will return a list of pages and weights based on keywords and key-subjects. Since SWISH only searches for keywords, and not for related words or stems, a list of keywords that apply for a subject must be separately maintained.
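
A minimal sketch of how such a subject-to-keyword expansion might work is shown below; the subject table, the "and"/"or" query syntax, and the final hand-off to the SWISH query program are illustrative assumptions rather than SWISH's actual interface.

    # Hypothetical subject -> keywords table (maintained by the Subject Editor).
    subjects = {
        "dragons": ["dragon", "wyvern", "drake"],
        "humor":   ["funny", "joke", "pun"],
    }

    def expand_query(terms):
        """Expand key-subjects into a disjunction of their keywords and
        join everything into a single boolean expression."""
        parts = []
        for term in terms:
            if term in subjects:
                parts.append("(" + " or ".join(subjects[term]) + ")")
            else:
                parts.append(term)
        return " and ".join(parts)

    print(expand_query(["dragons", "treasure"]))
    # (dragon or wyvern or drake) and treasure
    # This string would then be handed to the index's query program,
    # which returns the matching pages and their weights.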

Subject Editor

Maintaining this list of keywords for each subject is the job of the Subject Editor. In typical Internet and Addventure tradition, new subjects and keywords are added by visitors to the game and are highly dynamic. A simple design for this involves an associative array provided by Perl scripting, using either a flat text file or a dbm hashed file as storage.
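
A minimal sketch of this storage scheme follows, written here in Python with a dbm hashed file rather than the Perl proposed above; the file name and the space-separated encoding of keyword lists are assumptions for illustration.

    import dbm

    def add_keyword(subject, keyword, path="subjects.db"):
        """Record a keyword under a subject in the hashed dbm file."""
        with dbm.open(path, "c") as db:
            key = subject.encode()
            keywords = db[key].decode().split() if key in db else []
            if keyword not in keywords:
                keywords.append(keyword)
            db[key] = " ".join(keywords).encode()

    def get_keywords(subject, path="subjects.db"):
        """Return the list of keywords recorded for a subject."""
        with dbm.open(path, "c") as db:
            key = subject.encode()
            return db[key].decode().split() if key in db else []

    add_keyword("dragons", "wyvern")
    add_keyword("dragons", "drake")
    print(get_keywords("dragons"))    # ['wyvern', 'drake']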

This module can be ignored completely if we are willing to do keyword searches only and not allow subject based searches.

Path Computer

The Path Computer is the cornerstone of the Guide system, since it provides the shortest path between the current episode and the target episodes. It is also responsible for collecting the link information from the pages. For Addventure, this information is already collected, but in a somewhat haphazard format. An automatically run process will periodically (either daily or twice daily) reorganize this data into a format usable by the Path Computer.

At this time, path computations will be done using Dijkstra's algorithm for each set of points. Given efficient implementations of the algorithm, we can achieve results between O(V^2) and O(E lg V). [COR, pg 531] Under reasonable machine load, our estimation is that path determination will not cause unreasonable delay for the user. Additional work is necessary to ensure this does not become an unreasonable burden for the machine and user.
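
As a sketch of the path computation, the following implements Dijkstra's algorithm with a binary heap, which gives roughly the O(E lg V) behavior cited above; the adjacency-list layout of the reorganized Addventure link data is an assumption. Since every link has weight one, a plain breadth-first search would compute the same distances, and may be one of the faster alternatives worth investigating.

    import heapq

    def shortest_paths(links, source):
        """Dijkstra's algorithm with a binary heap (roughly O(E lg V)).

        `links` maps an episode number to the list of episodes its exits
        lead to; every link has weight 1, since Addventure's graph is
        unweighted (this adjacency-list layout is an assumption)."""
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, node = heapq.heappop(heap)
            if d > dist.get(node, float("inf")):
                continue                         # stale heap entry
            for neighbor in links.get(node, []):
                if d + 1 < dist.get(neighbor, float("inf")):
                    dist[neighbor] = d + 1
                    heapq.heappush(heap, (d + 1, neighbor))
        return dist

    # Toy link table: episode -> episodes reachable through its exits.
    links = {1: [2, 3], 2: [4], 3: [4], 4: [1]}
    print(shortest_paths(links, 1))              # {1: 0, 2: 1, 3: 1, 4: 2}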

The Guide

Bringing it all together is the Guide, which is responsible for gathering the original search terms, executing the Searcher and storing its results for the journey.

For each episode traveled through, the full page is loaded and displayed to the user, with additional information appended at the bottom of the page. The additional information is created by running the targets through the Path Computer, and using the resulting path information to compute a weighted relevancy index for each possible exit from the page, as well as other required supplemental information. The weighted relevancy index divides the relevancy score of each target page by its distance from the current page, and then sums these values over all target episodes reachable through each exit from the page.
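
A sketch of this weighted relevancy computation follows; the data layout is an assumption, and the +1 in the denominator is an added guard so that an exit leading directly to a target does not divide by zero.

    def exit_scores(exits, relevancy, dist_from):
        """Weighted relevancy index for each exit of the current episode.

        `exits` maps an exit label to the episode it leads to, `relevancy`
        maps target episodes to their search scores, and `dist_from[e]` is
        the table of shortest-path distances from episode e as produced by
        the Path Computer (this data layout is an assumption)."""
        scores = {}
        for label, first_episode in exits.items():
            total = 0.0
            for target, score in relevancy.items():
                distance = dist_from[first_episode].get(target)
                if distance is not None:             # unreachable targets contribute nothing
                    total += score / (distance + 1)  # +1 guards against dividing by zero
            scores[label] = total
        return scores

    exits = {"north": 12, "south": 47}
    relevancy = {12: 0.9, 103: 0.4}
    dist_from = {12: {12: 0, 103: 3}, 47: {103: 6}}
    print(exit_scores(exits, relevancy, dist_from))   # {'north': 1.0, 'south': 0.057...}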

At any time, the user should be able to quit from the Guide, and the Guide will force a page reload without the presence of its own additional information.

Other Applications

The intent is for Guide to be applicable in any case where directed browsing proves useful to the reader. Hypertextual fiction is only one such application area. Another prime candidate for Guide is a CWIS. Most users currently browse their way through a CWIS, and are frequently unable to find information if the paths are set up incorrectly. With Guide, they can search for words that show promise, and have a lead on paths that will get them closer to their goal.

Future Work

The obvious next step in creating Guide is to complete low-level design and begin prototyping and implementation of the actual system. In particular, attention needs to be paid to the path computation algorithms to ensure this does not cause excessive slowdown, and to continue research for other methods that may be faster under our particular environment.

After initial implementation and fielding, evaluations will be done to ensure that the results and performance are reasonable, and to investigate other methods of improving the results. Other indexing engines, such as Glimpse, will be tested to see whether relevancy numbers can be generated from them and whether they work faster. Other relevancy formulas should also be investigated, to see if they provide better indications of what paths to travel.

Finally, further work should be done on keyword and key-subject handling to better capture user expectations of what they are looking for. AI methods, such as those used for WebHound, may contribute to this effort, but they are not the only possible solution. Other fields besides keywords, such as author and date created, may also become search criteria in the future. Any such approach, however, including the current solution, requires active involvement of the user.

Conclusion

We have examined a wide range of web searching and indexing tools, taking note of the best features for user interaction. These features have been rolled into Guide, a new search tool that permits directed browsing towards one or more pages that meet a user's criteria. Future work in this area is expected.

References

[ARCH] Archie Information Page, available at http://services.bunyip.com:8000/products/archie/info.html

[BOR] Geoffrey Borggard; Hyper Base homepage, available at http://azole.res.wpi.edu:8080/~bogamo/hyper/top.html

[CON] Library of Congress; Guide to the Library of Congress. 1982, Library of Congress, District of Columbia.

[COR] Thomas Cormen, Charles Leiserson, Ronald Rivest; Introduction to Algorithms. The MIT Press and McGraw-Hill Book Company, 1992.

[DEJA] DejaNews Research Service - homepage, available at http://www.dejanews.com/

[ENG1] Douglas C. Engelbart, Richard W. Watson, James C. Norton; "The Augmented Knowledge Workshop." AFIPS Conference Proceedings, Volume 42, June 1973.

[ENG2] Douglas C. Engelbart; "Collaboration Support Provisions in Augment." AFIPS 1984 Office Automation Conference Digest, Feb 1984.

[EXTE] excite Netsearch homepage, available at http://www.excite.com/

[GPH] Frequently Asked Questions about Gopher, available at gopher://mudhoney.micro.umn.edu:70/00/Gopher.FAQ

[GLM] Glimpse Working Group - University of Arizona, CS Dept., available at http://harvest.cs.colorado.edu/

[HAR] Technical Discussion of the Harvest System, available at http://harvest.cs.colorado.edu/harvest/technical.html

[HIWY] "Show me the Way!", infoHiway home page, available at http://www.infohiway.com/

[IBM] infoMarket service Welcome home page, available at http://www.infomkt.ibm.com/

[INK] Inktomi home page, available at http://inktomi.berkeley.edu/query.html

[KATZ] William Katz; Your Library: A Reference Guide. 1984, CBS College Publishing, New York.

[LYC] Lycos, Inc. homepage, available at http://www.lycos.com/

[MATH] MathSearch -- search a collection of mathematical Web material, available at http://www.maths.usyd.edu.au:8000/MathSearch.html

[OT1] Open Text homepage, available at http://www.opentext.com/

[OT2] Open Text Products & Services, available at http://www.opentext.com/otmkt/allprod.html

[SWI] SWISH Documentation, available at http://www.eit.com/software/swish/swish.html

[WEBH] The WEBHUNTER WWW Interface home page, available at http://webhound.www.media.mit.edu/projects/webhound/www-face/

[YAH] Yahoo! homepage, available at http://www.yahoo.com/


A copy of this paper is available at http://www.cs.odu.edu/~asf/. A copy of this paper, and follow-on papers and works, will be available at http://www.addventure.com.