6 Searching And Indexing

Robots, Spiders and Worms
Aliweb
SWISH
WAIS
CNIDR Isite
Further Information

The tremendous growth in the number and extent of information services on WWW has made net-surfing an ineffective way of finding useful information. Fortunately, sophisticated indexing tools are being developed. Figure 6-1 shows a page which contains pointers to a number of searching tools.

Figure 6-1 A Collection Of WWW Search Engines.

A collection of WWW search engines is available at the URL http://cui.unige.ch/meta-index.html. Some of the main searching tools are listed below:

CUI WWW Catalog: http://cuiwww.unige.ch/cgi-bin/w3catalog
Yahoo: http://www.yahoo.com/
Globewide Network Academy: http://uu-gna.mit.edu:8001/cgi-bin/meta/
EINet's Galaxy: http://galaxy.einet.net/
Aliweb: http://web.nexor.co.uk/public/aliweb/aliweb.html
Lycos: http://fuzine.mt.cs.colorado.edu/mlm/lycos-all.html
World-Wide Web Worm: http://www.cs.colorado.edu/home/mcbryan/WWWW.html
WebCrawler: http://webcrawler.cs.washington.edu/WebCrawler/WebQuery.html
RBSE URL: http://rbse.jcs.nasa.gov/eichmann/urlsearch.html
Nikos: http://www.rns.com/cgi-bin/nomad
Jumpstation Robot: http://www.stir.ac.uk/jsbin/js
World-Wide Web Wanderer: http://www.mit.edu:8001/cgi/wandex

Robots, Spiders and Worms

During 1993 many WWW users discovered resources by net-surfing: going to one WWW server, exploring what was available, and then following links to other WWW servers. A number of software developers produced software which automated this process, so that a program went from server to server, indexing information, such as contents of the <TITLE> tag or the contents of server home pages. Such programs became known as robots or spiders; one robot was called WWWW, the World-Wide Web Worm.
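The behaviour described above can be illustrated with a minimal sketch. It is not the code of any particular robot: the class and function names, the in-memory PAGES table and the example.com-style addresses are all assumptions made for illustration. The sketch visits pages breadth-first, records the contents of each <TITLE> tag and follows the links it finds.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndLinkParser(HTMLParser):
    """Collects the <TITLE> text and the HREF of every <A> tag on a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=100):
    """Walk outwards from start_url and return a {url: title} index.

    fetch is supplied by the caller: it maps a URL to HTML text, or
    returns None if the page cannot be retrieved.
    """
    index = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = TitleAndLinkParser()
        parser.feed(html)
        index[url] = parser.title.strip()
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


# Hypothetical in-memory "web", so the sketch runs without network access.
PAGES = {
    "http://a.example/": '<title>Server A</title>'
                         '<a href="/reports.html">Reports</a>'
                         '<a href="http://b.example/">Server B</a>',
    "http://a.example/reports.html": "<title>Research Reports</title>",
    "http://b.example/": '<TITLE>Server B</TITLE>'
                         '<a href="http://a.example/">Back to A</a>',
}

index = crawl("http://a.example/", PAGES.get)
```

A real robot would replace PAGES.get with a function that retrieves pages over the network, which is exactly where the load on servers and networks arises.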

There are a number of problems with this approach to global indexing:

Server performance: When a robot arrives at your server it can place a heavy load on the server.
Network performance: Robots can place heavy loads on the network infrastructure.
Quality: Robots index all files. They are unable to differentiate between valuable information resources (such as a collection of research reports) and low quality, transient information (such as an undergraduate's personal interests).
Maintenance: The robot will not know if an information source that has been indexed is withdrawn from service.

A number of these issues have been addressed. Martijn Koster's Guidelines For Robots, which is available at the URL http://web.nexor.co.uk/mak/doc/robots/robots.html, provides guidelines for developers of robots.

A list of robots is kept at the URL http://web.nexor.co.uk/mak/doc/robots/active.html

Aliweb

Aliweb (Archie Like Indexing In The Web) provides another approach to the indexing of WWW resources. With Aliweb each site is responsible for indexing files. The server administrator is responsible for choosing and describing the services to be indexed.
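The idea of a site-maintained index can be sketched as follows. The record layout used here (blank-line-separated "Field: value" records with Title, URI, Description and Keywords fields) is an assumption made for illustration and should not be read as the exact Aliweb file format; the function names and the sample records are likewise hypothetical.

```python
def parse_site_index(text):
    """Parse blank-line-separated 'Field: value' records into dictionaries."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            if ":" in line:
                field, value = line.split(":", 1)
                record[field.strip().lower()] = value.strip()
        if record:
            records.append(record)
    return records


def search(records, word):
    """Return the records whose title, description or keywords mention word."""
    word = word.lower()
    hits = []
    for record in records:
        text = " ".join(
            record.get(field, "")
            for field in ("title", "description", "keywords")
        )
        if word in text.lower():
            hits.append(record)
    return hits


# Hypothetical index file, written by hand by a server administrator.
SITE_INDEX = """\
Title: Computing Service Newsletter
URI: /ucs/newsletter/
Description: Back issues of the newsletter
Keywords: newsletter, computing

Title: Research Reports
URI: /reports/
Description: Technical reports from the department
Keywords: research, reports
"""

records = parse_site_index(SITE_INDEX)
hits = search(records, "newsletter")
```

Because the administrator writes the descriptions, only resources judged worth indexing appear, and a central service needs to fetch just one small file per site.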

Further information about Aliweb is available at the URL http://web.nexor.co.uk/aliweb/doc/aliweb.html. The paper ALIWEB - Archie-Like Indexing In the Web, which was presented at the WWW 94 conference at CERN, is available at the URL http://web.nexor.co.uk/mak/doc/aliweb-paper/paper.html

SWISH

SWISH, which stands for Simple Web Indexing System for Humans, was announced on 16 November 1994. It is a program that allows you to index your Web site and search for files using keywords in a fast and easy manner. Documentation is available at the URL http://www.eit.com/software/swish/swish.html. The software is available at the URL ftp://ftp.eit.com/pub/web.software/swish/
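Keyword searching of a local site is commonly built on an inverted index, which maps each word to the files that contain it. The sketch below illustrates that general technique only; it makes no claim about how SWISH itself is implemented, and the file names and text in DOCS are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the set of file names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted file names that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


# Hypothetical documents standing in for the files of a Web site.
DOCS = {
    "intro.html": "An introduction to the World-Wide Web",
    "wais.html": "Indexing the Web with WAIS",
    "news.html": "Computing Service newsletter",
}

index = build_index(DOCS)
web_files = search(index, "web")
wais_files = search(index, "web wais")
```

The index is built once, so each search is a matter of a few set lookups rather than a scan of every file.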

WAIS

WAIS (Wide Area Information Server) is another mechanism for indexing resources. WAIS is used by the Computing Service, University of Leeds to index its documents and newsletters. An example of how the WAIS server and WAIS indexing software are used is given below.

The command:

waisserver -p 210 -d /apps/info/WWW/WAIS

is used to start the WAIS server software. The -p 210 argument specifies the number of the port on which the server runs, while the -d argument gives the name of the directory which will contain WAIS databases. Because the WAIS server normally runs continuously, it is usually started by the system administrator.

Newsletters are indexed by giving the command

waisindex -export -d /apps/info/WWW/ucs/newsletter/wais-sources/computing-service-newsletter -T HTML *.html

The name of the WAIS database is computing-service-newsletter. This long name is used since a single directory is used for all WAIS databases; it will save confusion if other departments wish to index their own departmental newsletters.

The WAIS database can be accessed by a dedicated WAIS client or by a WWW browser which contains support for the WAIS protocol. The WAIS database can be accessed by giving the URL wais://www.leeds.ac.uk/computing-service-newsletter
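As a rough illustration of how a client might interpret such a URL, the sketch below splits it into a host, a port and a database name. The function name is hypothetical, and falling back to port 210 when the URL names no port is an assumption based on the port used in the waisserver command shown earlier.

```python
from urllib.parse import urlsplit

# Assumption: use port 210 (as in the waisserver example) if none is given.
DEFAULT_WAIS_PORT = 210


def parse_wais_url(url):
    """Split a wais:// URL into (host, port, database name)."""
    parts = urlsplit(url)
    if parts.scheme != "wais":
        raise ValueError(f"not a WAIS URL: {url}")
    database = parts.path.lstrip("/")
    return parts.hostname, parts.port or DEFAULT_WAIS_PORT, database


host, port, database = parse_wais_url(
    "wais://www.leeds.ac.uk/computing-service-newsletter"
)
```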

WAIS Utilities

A number of utilities are available which can post-process the output from WAIS.

wais.pl is a CGI script which is distributed with the NCSA httpd server.

Son of wais.pl is a CGI script which is based on the wais.pl script.

SFGate is a CGI script which interfaces to WAIS servers. SFGate provides a forms interface which can be used to access a number of WAIS databases. It is available at the URL http://ls6-www.informatik.uni-dortmund.de/SFgate/SFgate.html. A demonstration is available at the URL http://ls6-www.informatik.uni-dortmund.de/SFgate/multiple.html

wwwwais is a small ANSI C program that acts as a gateway between waisq or waissearch (programs that search WAIS indexes) and a forms-capable World-Wide Web browser. With the freely distributable freeWAIS package, this program, and your local Web site, you can:

Create searchable databases of the information on your Web site.
Allow users to search multiple databases via their Web browser with customizable options.
Create a custom pop-up menu of servers to search through.
Produce hypertext search results, with file information and links directly to the relevant HTML documents.
Retrieve WAIS source descriptions and files.
Specify URLs and filters to map results to.
Allow only users from certain sites to search documents.

Documentation is at the URL http://www.eit.com/software/wwwwais/wwwwais.html

You can FTP the source and related files from the URL ftp://ftp.eit.com/pub/web.software/wwwwais/

You can see how it looks at the URL http://www.eit.com/cgi-bin/wwwwais

A WAIS Application

One interesting application of the use of WAIS is the multimedia archive prototype developed by Andy Walker, formerly of the CBL/Multimedia Unit, University of Leeds. The prototype was developed to investigate the feasibility of providing an archive of multimedia objects for use in CBL applications by members of the University of Leeds.

A directory is created for each multimedia object. The directory contains the multimedia object itself (e.g. a graphical file, video clip or sound file) together with a keyword file which describes the object. The keyword files are indexed using WAIS. A WWW browser which supports forms is used to run a CGI script. The CGI script invokes the waisq command to search the WAIS database. The output from waisq is then used to create an HTML file which contains pointers to thumbnail images of matching multimedia objects.
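The final step, turning search results into a page of thumbnails, might look something like the sketch below. It is not the prototype's code. The search itself is passed in as a function (run_search) rather than calling waisq, since the exact waisq invocation is not given here; the thumbnail.gif file name, the /archive/ paths and fake_search are all hypothetical.

```python
from html import escape


def render_results(query, object_dirs):
    """Build an HTML page with a thumbnail link for each matching object.

    object_dirs holds the URL paths of the directories whose keyword
    files matched. Assumption: each directory contains 'thumbnail.gif'.
    """
    lines = [
        "<HTML><HEAD><TITLE>Search results</TITLE></HEAD><BODY>",
        f"<H1>Results for {escape(query)}</H1>",
    ]
    if not object_dirs:
        lines.append("<P>No matching objects.</P>")
    for path in object_dirs:
        href = escape(path, quote=True)
        lines.append(
            f'<A HREF="{href}/"><IMG SRC="{href}/thumbnail.gif" ALT="{href}"></A>'
        )
    lines.append("</BODY></HTML>")
    return "\n".join(lines)


def handle_search(query, run_search):
    """run_search stands in for the waisq call; it returns matching paths."""
    return render_results(query, run_search(query))


def fake_search(query):
    """Hypothetical stand-in for a WAIS search over the keyword files."""
    keywords = {
        "/archive/heart-video": "heart video anatomy",
        "/archive/leeds-map": "map leeds campus",
    }
    return [path for path, words in keywords.items()
            if query.lower() in words.split()]


page = handle_search("heart", fake_search)
```

Escaping the query before it is written into the page matters because the text comes straight from a form filled in by the user.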

Figure 6-2 Multimedia Archive.

Which WAIS?

A number of WAIS servers are available. The freeWAIS software is currently used at the University of Leeds. This software is maintained by CNIDR, the Clearinghouse For Networked Information Discovery and Retrieval. The freeWAIS software, however, is based on the 1988 version of the Z39.50 protocol. An implementation of WAIS based on the 1992 version of Z39.50 is also believed to be available from CNIDR. freeWAIS is available at the URL ftp://ftp.cnidr.org/pub/NIDR.tools/freewais

freeWAIS-sf is an implementation of WAIS developed at Dortmund University. It is available at the URL ftp://ls6-www.informatik.uni-dortmund.de/pub/wais/freeWAIS-0.2-sf-beta.tar.gz

CNIDR Isite

CNIDR Isite is an integrated software package including a text indexer, search engine and Z39.50 communication tools to access databases. Isite includes the CNIDR ZDist, Isearch and Search API distributions.

A mailing list has been established to discuss Isite. To join, send an e-mail message to listserv@vinca.cnidr.org with the body of the message as subscribe ISITE-L your name. To post messages to the list, send to isite-l@vinca.cnidr.org.

Further information is available at the URL http://vinca.cnidr.org/software/Isite/Isite.html

Further Information

A tutorial on Mosaic and WAIS is available at the URL http://wintermute.ncsa.uiuc.edu:8080/wais-tutorial/wais.html

A WAIS overview is available at the URL http://info.cern.ch/hypertext/Products/wais/sources/Overview.html

A list of resources about the Z39.50 information discovery protocol is available at the URL http://ds.internic.net/z3950/z3950.html
