Databases and Search Tools on the World Wide Web
- Communication information
- Allen S. Firstenberg
Graduate
1 Dec 1995
PREFACE
In this paper,
I examine several current indexing and searching tools in use
on the World Wide Web to determine the best and most
important features that each has to offer and then develop
a profile of an improved tool, including an outline of possible
methods of implementation.
The paper is divided into two major components. In the first part, I
do a comprehensive review of search tools that are available on the
World Wide Web (WWW) and examine what features they provide to the
public that uses the tools, as well as several indexing tools
that are available to build a searchable database. Based on the
results of these surveys, the second part contains a proposal for a
new searching tool providing the best features of existing tools, and
extending their abilities to provide feedback current tools do not.
Introduction
Two of the primary obstacles to any information sharing scheme, from
early libraries to documentation on the Internet, are the conflicting
requirements of comprehensive information location and information
overload reduction. In short, information accessors need to be able
to find exactly what they need -- no more and no less.
To solve these problems, indexes and categorizations have been
designed over time to narrow searches down to a manageable set of
information, but to also allow for "browsing" to the specific content
required. This concept has been translated to the WWW in various
forms -- but each lacks certain features that could further enhance
searching capabilities. I explore current search tools, and propose a
tool that I feel further empowers people while they are in the
browsing phase of their search.
Search and Indexing Tools
We first look at existing methods to search for information, and
briefly examine the database or indexing facilities that are required
to support such searches. These will give us features, most notably
user interaction requirements, that will be necessary in designing an
updated tool.
Pre-Internet Searching and Indexing
Libraries, reference works, and card-catalog systems existed long
before the Internet created its own requirements for information
searching. Search methods were built around the manual searching and
sorting tools that have existed since the 1800s, when modern scholarly
publications became popular.
Books
For books, two popular cataloging and indexing methods were created --
the Dewey decimal system and the Library of Congress system (LOC).
Both use hierarchical methods that pick broad categories and then
break them down into more specific sub-topics. While both are
indefinitely extendable, the LOC method is generally regarded as
better able to expand as new categories are created.
For indexing, the traditional method has been to create "cards" for
each book in a collection. Typically a title card, an author card, and
multiple subject cards are created for each book, indicating the
section in which the book may be located. Additionally, cards contain "tracings" which
indicate other possible references to investigate. Typically, these
cards (or at least the information for each card) and their
categorizations are established by the Library of Congress when a book
is copyrighted.
Journals and Magazines
Scholastic journals and papers are indexed through slightly different
means. Most papers are collected in individual journals which are
organized by the methods above and become a first cut for people
wishing to stay abreast of a field. Locating information in a
particular journal also leads a researcher to similar journals, and
thus to similar papers.
Papers also traditionally have keywords associated with them that are
assigned by the author. These keywords may then be used as part of an
index that is then compiled for one or a series of publications.
Historic Internet Tools
The Internet has contained information repositories since early in its
existence, and has always required methods for people to locate this
information.
NLS/Augment
As early as 1973, the ARPAnet was being used to exchange and index
documentation about itself. Researchers at the Stanford Research
Institute under Douglas Engelbart created a network-based "journal" of
all correspondence that occurred on the network.
Messages and all documents were linked to each other, and
automatically indexed. Indexes were generated by author, subject
keyword, and other keywords that the author could enter for a
document. Searching tools provided linked references to documents.
FTP and Archie
While NLS was a somewhat proprietary format, the File Transfer
Protocol (FTP) was one of the core application protocols that was
provided in the early days of the Internet. With the adoption of an
informal method to access file space without authentication, FTP
became a frequently used way to exchange a wide variety of information
and software.
Indexing, however, was practically non-existent for many years.
Informal theme-based indices were distributed (usually from other FTP
sites) containing lists of "well known" sites and their contents, but
no effort was made to formalize these lists or the contents of files
until the Archie service was unveiled. Archie was created as a
resource discovery and indexing system that surveys anonymous FTP
sites for their files.
What Archie is unable to do, however, is analyze files for content.
Since only file and directory names are known, queries are done
against names of files or programs. Searches return all matches,
placing "best matching" and most recent files at the beginning of the
list.
Usenet
Usenet has served as an interactive information forum, where questions
and answers can be posted and discussed in a wide variety of
categories. Even in its early days, it became an information
repository that required extensive filtering techniques so readers
didn't suffer from information overload.
The first such mechanism was the hierarchical grouping system that
allowed users and sites to select categories of groups they were
interested in. Under each hierarchy, words were used to further break
down the categories of discussion, with clear procedures used to
establish new groups in the hierarchy. Several research communities,
such as those in computer science and biology, established their own
hierarchies early in Usenet's history, with later groups establishing
themselves under broader categories. Either way, groups and
hierarchies allow a very rough cut for information filtering.
Users can then apply additional filters to the articles in their
selected groups. These filters are traditionally known as "killfiles"
because they automatically kill undesired articles based on
information contained in the article header. Traditionally, these
headers include a keyword and a subject, which the author either
specifies directly or inherits from referenced articles.
Gopher and Veronica
The Gopher protocol was designed as a way for users to "tunnel"
between Internet servers containing information. It was designed
around a Campus Wide Information System (CWIS), but was intended to be
scalable across the entire Internet. It makes the claim of
"combin[ing] the best features of browsing through collections of
information and fully indexed databases". [GPH]
Although Gopher servers were supposedly all registered, it was still
not easy to navigate through a hierarchy to the centralized list of
servers, and this still required knowledge of which servers were
"well known" for what resources. The Veronica system was designed to
provide a keyword search of menu titles at all the gopher systems.
Like Archie, however, Veronica does not index the content of most
items -- only their titles.
Summary of Some Current Internet Search Tools
With the popularity of the World Wide Web, a number of sites have
popped up to provide a quick reference to information that is
available on the web. Each site is notable because of the interface
features provided to the user. The sites examined are not
comprehensive, but are indicative of the interfaces and features
available.
Early Web Indexes
The earliest web indexes were all compiled by hand by organizations
who were interested in promoting use of the web. Most notable among
them were the list
of web servers organized at the home of the WWW, and the
"What's New" index that was organized at the home of the Mosaic
browser.
There was no searching or indexing performed on these resources, and
as the web grew, both became impossible to maintain and use because of
the volume of new web sites and lack of filtering.
Yahoo
Yahoo provides a user-contributed or administrator-researched list of
web sites, features, and Internet resources. Each item has a text
description of the link. Users can either enter keywords to search
for in the title or descriptive text, or navigate a category based
hierarchy. Each page contains a list of links and descriptive text
that belongs in one of the hierarchical categories.
The hierarchy is a text-based subject pseudo-tree: users
can follow links down, but some branches actually link to other
branches under a hierarchy deemed "more appropriate". It is easy to
move up to any level of the hierarchy, since the title of the current
page points to every previous page in the hierarchy.
The search system looks for substrings in the text and title and
returns a list of matches, their category, and the descriptive text.
A user-settable number of items may be returned, and items beyond this
number are inaccessible except by re-running the search.
From the search results, users can go directly to the site or to the
categories page to investigate related sites.
Lycos
At its heart, Lycos provides searchers with a keyword based search on
full pages. Collection is done by a web-wandering "spider"
that stores and indexes the full text of a page, but there is a form
for users to suggest pages that the spider should research. As a side
benefit, the spider also determines the 250 sites most linked to and
makes this available.
Searches are performed by keyword, and the user can indicate how
"loosely" matches are performed. Results are normally presented ten
to forty per page, with links to the next and previous sets of
results. Users may also choose how much information is presented for
each item, including an abstract, summary, fragment, and/or size of
the page, the words that were matched in the page, and a "relevancy"
score between zero and one, along with the title and a link to the
page being referenced.
Open Text
Open Text is similar to Lycos in many of its features, but has a wider
variety of searching and reporting controls. Searches can be simple,
can have elaborate boolean search conditions, or can indicate which
sets of keywords are more important than others.
Entries in the results contain a relevance score, and either the first
few lines of the page, a page fragment surrounding the keywords, or a
list of all the headings that the page contains as a summary. Each
search result can be turned into a new search of "similar pages" by
using the page to seed the list of keywords to search for.
Excite
Excite allows searches similar to Lycos and Open Text, but has its own
variants on keyword entry and information display. Standard full-text
keyword searches are augmented by the ability to enter "words
describing a concept" which is more of a search for pages that contain
the keywords specified as well as related keywords.
Results are presented in one of two modes, and the user may switch
between the two modes without re-performing the search. The
"confidence" view is similar to the results of a Lycos search with the
addition of a link that will search for pages that are similar to a
selected item. The "site" view lists just links to sites, but
collects the page references in a hierarchical tree reflecting the
structure of pages at a site. This view indicates which sites have
pages that are more or less relevant to a particular topic, and which
area of a site contains the best information.
Deja News
Instead of providing indexes to the more conventional part of the web,
Deja News provides a full-text keyword search on many Usenet groups.
More notable, however, are the filtering methods that may be applied
when doing a search. Articles can be removed from the search based on
their age or on groups they do or do not belong to. Results are
presented in either a terse or verbose format and sorted by relevancy
score, usenet group, date published, or author. The credibility of an
author is indicated by showing a posting history for that person.
Hyper Base
Hyper Base is similar to Yahoo, since it provides a hierarchical
subject list and works from user contributions. Hyper Base is less
structured, however, and new topics in the hierarchy and entries for
each topic are automatically created upon user request, creating a
somewhat haphazard effect. There is no search capability.
IBM infoMarket
IBM's infoMarket has a number of lofty claims, but was untestable at
the time this survey was done. It aims to provide indexes for
graphics, video, and sound, as well as the more mundane text, and has
the future goal of being able to "securely distribute content and
receive payment".
Currently, its only notable feature is that it provides a single,
unified search of a number of other web databases, presents their
results, and indicates which databases produced which hits and any
overlapping results. Users can use this additional dimension to
help determine which pages are of the most value.
Infohiway
Under the assumption that users are either trying to locate web sites
in a particular geographic region, or already know the first two
letters of a site's name, Infohiway claims to have "pre-searched" other databases
and eliminates "complex search expressions." In reality, the
interface is a set of hierarchical pages organized alphabetically. It
is not possible to search for descriptions of sites.
Inktomi
Inktomi is like other keyword search systems, but uses an easier query
language. Keywords are specified, with mandatory keywords being
prefixed by a + character, and keywords that should not be on a page
being prefixed by a - character. Relevancy results are indicated
graphically and rated between zero and ten, instead of printing a
numeric value between zero and one.
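To make this syntax concrete, a minimal sketch of how such a query
could be interpreted follows, written in Perl (the scripting language
used later in this paper). The function name and the treatment of
unprefixed words as merely optional are our own illustration, not a
description of Inktomi's actual implementation.

    #!/usr/bin/perl
    # Illustrative parser for an Inktomi-style query: "+" marks a
    # mandatory keyword, "-" marks a keyword that must not appear.
    use strict;
    use warnings;

    sub parse_query {
        my ($query) = @_;
        my (@required, @forbidden, @optional);
        for my $term (split ' ', $query) {
            if    ($term =~ s/^\+//) { push @required,  lc $term }
            elsif ($term =~ s/^-//)  { push @forbidden, lc $term }
            else                     { push @optional,  lc $term }
        }
        return (\@required, \@forbidden, \@optional);
    }

    my ($req, $forb, $opt) = parse_query('+gopher -veronica indexing');
    print "required: @$req\nforbidden: @$forb\noptional: @$opt\n";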
MathSearch
Although containing only math related research, the MathSearch index
provides an example of features that are targeted at a specific
audience. The instructions are mathematically inclined (describing
terms and results in terms of sets) and help and error messages are
more polite than standard search tools.
Although searching is still keyword based, it only matches documents
that have the keywords in the same sentence. The results page has
similarly made the concept of relevance clearer. Instead of assigning
an apparently arbitrary number, the number of matches in each document
is returned as an absolute number.
WebHunter / WebHound
Although it is currently down, and in search of a name, this search
tool is different from all others since it is based on intelligent
agent techniques. According to the documentation, it picks up patterns
for a particular user and learns which pages are their favorites.
From these, and from other user input, it makes suggestions about
which other pages may be of interest.
Summary of Some Current Internet Indexing Tools
WAIS
The WAIS protocol was one of the original online content indexing
tools, and is the cornerstone of WAIS Inc.'s online publishing
initiative. The WAIS system is a combination of a database format,
and a standard protocol for clients to query the WAIS database server.
The WAIS server and protocol are more functional than what is examined
here; the features discussed are those considered key to our
application.
WAIS distributes a free version of its product, which is reputed to
create indexes roughly the same size as the original data
files. The commercial version will index files besides text
documents, and claims to create indexes that are 30-50% the size of
the original data. Indexes may be incrementally generated as new
documents are added to the archive. Stopwords are customized by each
server maintainer, and the system uses various stemming techniques to
derive variations of a queried word.
The server accepts queries structured as natural language questions,
quoted keywords, boolean expressions, and references to other
documents. Words and terms may be weighted, wildcarded, factored as
part of the density of a document, and applied to specific fields in
the source document if they existed. Results give extensive
information about the query and documents returned.
Open Text
The Open Text text retrieval system is the commercial text database
behind their indexing site. It is similar in many ways to the Wais
product in its public features, but is more web oriented and does not
use the WAIS protocol.
The Open Text database is structured somewhat differently, however.
Instead of creating an index of words that point to various documents,
the database holds the documents themselves, with methods to do string
searching at the time a search is requested. [OT2]
Surprisingly, this still produces searches that are remarkably fast.
This design has the side effect that there are no stop-words, and even
the most common words and phrases may be part of a search.
Glimpse
Glimpse is a freely available indexing tool from the University of
Arizona with a goal of duplicating the UNIX grep command over a
set of indexed files. Queries in grep are normally substring
or string pattern matching searches, which is extended in Glimpse to
include approximate matching and boolean queries.
Unique to Glimpse is the administrator's ability to make a
trade-off between the size of the index and the speed at which
searches are performed. Search results are also similar to grep since they are
pattern matching, and not relevancy based. Indexes can be rapidly
created and incrementally updated.
SWISH
SWISH is another freely available indexer, with licensing restrictions,
available from the EIT Corporation. Its goal is to be a basic tool with
a small index, at the sacrifice of some fancier features. Stemming
and synonyms are not supported, and search queries are composed of
basic boolean expressions. Stopwords are drawn from a provided list, or
automatically generated for "common words" that exceed an
administrator-defined threshold.
Results consist of some basic indexing information, and a relevancy
number, file size, and title and path of matching documents. Since
SWISH is designed to work on the Web, it can be configured to give a
higher weighting if a word is found inside certain HTML tags.
An Improved Tool
Based on these search engines and tools, we will formulate the "best
of the best" requirements and features, examine alternate approaches
for some of their goals, and formulate a basic design for a new tool.
Specific implementation and design details are not examined at this
time. The tool as a whole is tentatively called Guide.
Intended Application
Guide is motivated by a project in user-contributed hypertextual
fiction known as
Addventure. Addventure currently has over 20000 pages that are
arranged in a cyclic, directed, unweighted graph, but are
usually accessed more as a tree, with cycles tending to result from
returning to an earlier part of the tree. Users have expressed a desire
to find pages in the story that have themes they are interested in,
and avoid themes they are not.
Requirements
Borrowing from some traditional systems, Guide must be able to
- Provide keyword and subject based searches against a set of
documents. The user should be able to enter keywords directly or
to select a subject to search on.
- Keep indexes small. Addventure currently uses over 30M of data,
and the value added by consuming an additional 30M for an index
is considered minimal.
- Automatically generate indexes, either as new
pages are added, or periodically.
Since Addventure is a hypertextual story, it may not make sense for a
reader to jump in at a particular page. This radically changes the
format of search results, and is the unique feature of Guide.
- Search results are not presented in a list, since this would
allow jumping directly to the page. This would remove contextual
information and could discourage "browsing" around similar topics.
Since most links are directional, once a reader is at a page they could
only view its children, not the parents which may be related to the page.
- Guide will travel with the reader, indicating links that will
lead them closer to pages that were in the result of the search.
Clearly understood relevance numbers will indicate, in a concise
format, which of the highest-valued matches are closest.
- Guide will always give suggestions for paths, even if the reader
travels off the shortest/best path to a goal. Since browsing and
wandering are encouraged, the user may find a path that is interesting
or related to the goal, even though it is not the direct route.
- In addition to relevance values, Guide will indicate which choice
leads to the highest ranking page and to the nearest page. When a
target page is reached, this will be indicated and will not be counted
in any other statistics.
These features of Guide merge the best capabilities of hierarchical
organization and searching, in a fashion related to Yahoo's search
results. "Directed browsing" in this fashion gives the reader the
freedom to travel wherever they wish, but with some basic information
indicating where they may wish to go.
Although using existing tools is desirable, since there is no point in
reinventing existing systems, commercial products are undesirable
because Addventure is provided as a free service.
Design
The Guide system is composed of five major components:
- A Subject Editor to allow sets of keywords to be collected
into a subject
- The Indexer which creates the searchable index from the
pages
- The Searcher which accepts a query and returns matching
pages
- A Path Computer which determines the shortest path between
the current point and all targets
- And finally, the Guide, which takes the results from the
Searcher and Path Computer, loads the original document, and
annotates it for the reader with guidance suggestions
Indexer
The Indexer is responsible for creating the keyword index of all the
pages participating in Guide. The survey indicated that
both SWISH and Glimpse are good candidates for a first
implementation; we plan to use SWISH because it is already in
use for other portions of Addventure and because of its flexibility
with HTML.
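As a sketch of what this involves, a minimal SWISH configuration for
Addventure might look like the following. The directive names are
taken from the SWISH documentation [SWI], but the paths, suffix list,
and threshold values are illustrative assumptions only; the index
would then be rebuilt periodically by running swish against this file.

    # swish.conf -- hypothetical configuration for indexing Addventure
    IndexDir    /home/addventure/episodes          # where episode pages live
    IndexFile   /home/addventure/guide/index.swish # the index to produce
    IndexOnly   .html                              # index episode pages only
    IgnoreLimit 80 256   # auto-stopwords: words in over 80% of the
                         # files and in more than 256 files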
Searcher
The choice of SWISH for an indexing system drives the use of the SWISH
query tool as a back end for a boolean query. The SWISH query program
will return a list of pages and weights based on keywords and
key-subjects. Since SWISH only searches for keywords, and not for
related words or stems, a list of keywords that apply for a subject
must be separately maintained.
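A minimal sketch of the Searcher follows, assuming the command-line
interface and result format described in the SWISH documentation [SWI]
(one rank, path, quoted title, and size per hit line, with comment
lines prefixed by #). The flag names, rank normalization, and file
locations here are assumptions to be adjusted to the installed version.

    #!/usr/bin/perl
    # Sketch of the Searcher: run a SWISH query and collect a
    # page => weight map for the Guide.
    use strict;
    use warnings;

    sub search_pages {
        my ($query) = @_;
        my %weight;
        open my $swish, '-|', 'swish', '-f', 'index.swish', '-w', $query
            or die "cannot run swish: $!";
        while (my $line = <$swish>) {
            next if $line =~ /^#/ or $line =~ /^\.$/;  # comments, terminator
            my ($rank, $path) = split ' ', $line, 3;
            next unless defined $path;
            $weight{$path} = $rank / 1000;  # assume ranks run 0..1000
        }
        close $swish;
        return \%weight;
    }

    my $hits = search_pages('dragon and castle');
    for my $page (sort { $hits->{$b} <=> $hits->{$a} } keys %$hits) {
        printf "%.3f  %s\n", $hits->{$page}, $page;
    }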
Subject Editor
The responsibility of maintaining this list of keywords that apply to
a subject is the job of the Subject Editor. In typical Internet and
Addventure tradition, new subjects and keywords are added by
visitors to the game and are highly dynamic. A simple design for this
involves an associative array provided by Perl scripting, using either
a flat text file, or a dbm hashed file as storage.
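A minimal sketch of that design follows; the database file name and
the space-separated storage format are illustrative assumptions.

    #!/usr/bin/perl
    # Sketch of the Subject Editor storage: a dbm-hashed associative
    # array mapping a subject to a space-separated list of keywords.
    use strict;
    use warnings;

    my %subjects;
    dbmopen(%subjects, '/home/addventure/guide/subjects', 0644)
        or die "cannot open subject database: $!";

    # Record a visitor-contributed keyword under a subject.
    sub add_keyword {
        my ($subject, $keyword) = @_;
        $keyword = lc $keyword;
        my %seen = map { $_ => 1 } split ' ', ($subjects{$subject} || '');
        return if $seen{$keyword};            # already known
        $seen{$keyword} = 1;
        $subjects{$subject} = join ' ', sort keys %seen;
    }

    # Expand a subject into the keywords handed to the Searcher.
    sub keywords_for {
        my ($subject) = @_;
        return split ' ', ($subjects{$subject} || '');
    }

    add_keyword('dragons', 'dragon');
    add_keyword('dragons', 'wyvern');
    print join(', ', keywords_for('dragons')), "\n";
    dbmclose(%subjects);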
This module can be ignored completely if we are willing to do keyword
searches only and not allow subject based searches.
Path Computer
The Path Computer is the cornerstone of the Guide system, since it
provides the shortest path between the current episode and the target
episodes. It is also responsible for collecting the link information
from the pages.
For Addventure, this information is already collected, but in a
somewhat haphazard format. An automatically run process will
periodically (either daily or twice daily) reorganize this data into a
format for the Path Computer.
At this time, path computations will be done using Dijkstra's
algorithm for each set of points. Depending on the implementation,
running time ranges from O(V^2) with a simple array to O(E lg V)
with a binary heap. [COR, pg 531]
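As a sketch, the simple O(V^2) variant is shown below in Perl; the
heap-based version would be a drop-in replacement. The graph
representation (each episode mapped to the list of episodes its links
lead to, every link costing one step) follows the Addventure structure
described above, but the names and sample data are illustrative.

    #!/usr/bin/perl
    # Sketch of the Path Computer: single-source Dijkstra over the
    # episode link graph, using the simple O(V^2) scan [COR].
    use strict;
    use warnings;

    sub shortest_paths {
        my ($graph, $source) = @_;
        my %dist = ($source => 0);   # episode => links from $source
        my %done;
        while (1) {
            # pick the closest episode not yet finalized (the O(V) scan)
            my ($u) = sort { $dist{$a} <=> $dist{$b} }
                      grep { !$done{$_} } keys %dist;
            last unless defined $u;
            $done{$u} = 1;
            for my $v (@{ $graph->{$u} || [] }) {
                my $d = $dist{$u} + 1;   # every link is one step
                $dist{$v} = $d if !defined $dist{$v} || $d < $dist{$v};
            }
        }
        return \%dist;
    }

    my %links = (
        start  => ['cave', 'forest'],
        forest => ['cave'],
        cave   => ['dragon'],
    );
    my $dist = shortest_paths(\%links, 'start');
    print "$_: $dist->{$_}\n" for sort keys %$dist;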
Under reasonable machine load, our estimation is that path
determination will not cause unreasonable delay for the user.
Additional work is necessary to ensure this does not become an
unreasonable burden for the machine and user.
The Guide
Bringing it all together is the Guide, which is responsible for
gathering the original search terms, executing the Searcher and
storing its results for the journey.
For each episode traveled
through, the full page is loaded and displayed to the user, with
additional information appended at the bottom of the page. This
information is created by running the targets through the Path
Computer and using the resulting path information to compute a
weighted relevancy index for each possible exit from the page, along
with other required supplemental information. The weighted relevancy
index divides the relevancy for a page by the distance to that page,
and then sums these values over every episode reachable from each
exit of the current page.
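A sketch of that computation follows. The relevancies come from the
Searcher and the distances from the Path Computer run with each exit's
destination as the source; treating a target at the exit itself as one
click away from the current page is our reading of "distance", and the
function and variable names are illustrative.

    #!/usr/bin/perl
    # Sketch of the weighted relevancy index: for each exit from the
    # current page, sum relevance/distance over every target episode
    # reachable through that exit.
    use strict;
    use warnings;

    sub exit_scores {
        my ($exits, $relevance, $dist_from) = @_;
        # $exits:     list of episodes the current page links to
        # $relevance: episode => relevancy weight (from the Searcher)
        # $dist_from: exit => { episode => links from that exit }
        my %score;
        for my $exit (@$exits) {
            my $dist = $dist_from->{$exit} or next;
            for my $target (keys %$relevance) {
                next unless defined $dist->{$target};
                # distance from the current page = one click to reach
                # the exit plus the path length from the exit
                $score{$exit} += $relevance->{$target}
                               / ($dist->{$target} + 1);
            }
        }
        return \%score;   # exit => weighted relevancy index
    }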
At any time, the user should be able to quit from the Guide, and the
Guide will force a page reload without the presence of its own
additional information.
Other Applications
The intent is for Guide to be useful in any case where directed
browsing proves useful to the reader. Hypertextual fiction is only
one such application area. Another prime candidate for Guide is a
CWIS. Most users currently browse
their way through a CWIS, and are frequently unable to find
information if the paths are set up poorly. With Guide, they can
search for words that show promise, and have a lead on paths that will
bring them closer to their goal.
Future Work
The obvious next step in creating Guide is to complete low-level
design and begin prototyping and implementation of the actual system.
In particular, attention needs to be paid to the path computation
algorithms to ensure they do not cause excessive slowdown, and to
continue researching other methods that may be faster in our
particular environment.
After initial implementation and fielding, evaluations will be done to
ensure that the results and performance are reasonable, and to
investigate other methods of improving the results. Other indexing
engines, such as Glimpse, will be tested to see if relevancy numbers
can be generated from them and if they work faster. Other relevancy
formulas should also be investigated, to see if they provide better
indications of which paths to travel.
Finally, further work should be done on keyword and key-subject
handling to better capture users' expectations of what they are
looking for. AI methods, such as those used by WebHound, may
contribute to this effort, but are not the only possible solution.
Other fields besides keywords, such as author and date created, may
also become search criteria in the future. Any such approach, however,
including the current one, requires the active involvement of the user.
Conclusion
We have examined a wide range of web searching and indexing tools,
taking note of the best features for user interaction. These features
have been rolled into Guide, a new search tool that permits directed
browsing towards one or more pages that meet a user's criteria. Future
work in this area is expected.
References
[ARCH] Archie Information Page, available at
http://services.bunyip.com:8000/products/archie/info.html
[BOR] Geoffrey Borggard; Hyper Base homepage,
available at
http://azole.res.wpi.edu:8080/~bogamo/hyper/top.html
[CON] Library of Congress; Guide to the Library
of Congress. 1982, Library of Congress, District of Columbia.
[COR] Thomas Cormen, Charles Leiserson, Ronald
Rivest; Introduction to Algorithms. The MIT Press and
McGraw-Hill Book Company, 1992.
[DEJA] DejaNews Research Service - homepage,
available at
http://www.dejanews.com/
[ENG1] Douglas C. Engelbart, Richard W. Watson,
James C. Norton; "The Augmented Knowledge Workshop." AFIPS Conference
Proceedings, Volume 42, June 1973.
[ENG2] Douglas C. Engelbart; "Collaboration Support
Provisions in Augment." AFIPS 1984 Office Automation Conference
Digest, Feb 1984.
[EXTE] excite Netsearch homepage, available at
http://www.excite.com/
[GPH] Frequently Asked Questions about Gopher,
available at
gopher://mudhoney.micro.umn.edu:70/00/Gopher.FAQ
[GLM] Glimpse Working Group - University of Arizona,
CS Dept., available at
http://harvest.cs.colorado.edu/
[HAR] Technical Discussion of the Harvest System,
available at
http://harvest.cs.colorado.edu/harvest/technical.html
[HIWY] "Show me the Way!", infoHiway home page,
available at
http://www.infohiway.com/
[IBM] infoMarket service Welcome home page,
available at
http://www.infomkt.ibm.com/
[INK] Inktomi home page, available at
http://inktomi.berkeley.edu/query.html
[KATZ] William Katz; Your Library: A Reference
Guide. 1984,
CBS College Publishing, New York.
[LYC] Lycos, Inc. homepage, available at
http://www.lycos.com/
[MATH] MathSearch -- search a collection of
mathematical Web material, available at
http://www.maths.usyd.edu.au:8000/MathSearch.html
[OT1] Open Text homepage, available at
http://www.opentext.com/
[OT2] Open Text Products & Services, available at
http://www.opentext.com/otmkt/allprod.html
[SWI] SWISH Documentation, available at
http://www.eit.com/software/swish/swish.html
[WEBH] The WEBHUNTER WWW Interface home page,
available at
http://webhound.www.media.mit.edu/projects/webhound/www-face/
[YAH] Yahoo! homepage, available at
http://www.yahoo.com/
A copy of this paper is available at
http://www.cs.odu.edu/~asf/.
A copy of this paper, and follow-on papers and works, will be
available at
http://www.addventure.com.