8 Implementation Aspects
Throughout the chapters, we have presented many aspects of the technology
behind the kind of search tool we outline in this thesis. Before we can
conclude the thesis, some loose threads need to be tied up, strengthened
or cut off.
8.1 Webrarians are Agents
We have claimed that the webrarians are agents. Our agency employs two
main types of agents:
- Autonomous information agents for retrieval and pre-classification of Web pages
- Interface agents for supporting users searching the Web page index
We will now show that the webrarians meet the agent definition from Chapter
4.1, and explain what possibilities agent technology opens up for our system.
8.1.1 Retrieval and Pre-classification Agents
The information retrieval webrarians:
- “Are long-lived”; The information retrieval agents are spawned
and sent out onto the Web to find new Web pages and gather all kinds of
information useful for the index. To avoid having too many agents consume
Internet resources, agents that have reached their goals and/or followed
a certain number of links without finding any new information must be programmed
to “die”.
- “Have goals, sensors and effectors”; Some agents will have
clearly defined goals, such as “verify the existence of this URL”, while
others have the more general goal of locating Web pages that are not yet in
the index. The “sensors and effectors” requirement is covered by the agents’
navigation and reporting abilities: they request information using the HTTP
protocol and report back by sending messages in the agent language.
- “Are autonomous”; Given a goal, the agents will use their
knowledge base to decide which links to follow in what order, how to act
when a link seems to be dead, and so on. Once an agent is activated it needs
no further instructions; all that remains is to wait for the agent to report
its results (a minimal sketch of such a decision loop follows this list).
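The sketch below illustrates what such an autonomous decision loop could look like. It is a minimal, hypothetical sketch: the class name, the barren-link threshold and the report_to_index callback are illustrative assumptions, not part of a specified implementation. The only capabilities assumed are the ones listed above, fetching pages over HTTP and reporting findings back to the agency.

```python
# Minimal, hypothetical sketch of a retrieval webrarian's main loop.
# Names and thresholds are illustrative only.
from urllib.request import urlopen
from urllib.error import URLError

class RetrievalAgent:
    def __init__(self, start_urls, known_urls, max_barren_links=50):
        self.frontier = list(start_urls)      # links still to follow
        self.known_urls = set(known_urls)     # pages already in the index
        self.max_barren_links = max_barren_links
        self.barren_links = 0                 # links followed without new findings

    def run(self, report_to_index):
        """Follow links until the frontier is empty or the agent 'dies'."""
        while self.frontier and self.barren_links < self.max_barren_links:
            url = self.frontier.pop(0)
            try:
                page = urlopen(url, timeout=10).read()
            except URLError:
                continue                      # dead link: skip it and move on
            if url in self.known_urls:
                self.barren_links += 1        # nothing new found here
                continue
            self.barren_links = 0
            self.known_urls.add(url)
            report_to_index(url, page)        # hand the finding to the agency
            # Extraction of new links from `page` into self.frontier is omitted.
        # Reaching this point corresponds to the agent "dying".
```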
The Web page pre-classification webrarians:
- “Are long-lived”; These agents’ life cycle has the phases
1. Birth, 2. Retrieve information,
3. Interpret information, 4. Report information, 5. Death.
- “Have goals, sensors and effectors”; The goal for this type
of agent is to make as much “sense” of a Web page as possible, and to come
up with information that can be of help to a human classifier of Web pages.
It senses through text analysis and lookups in the index and code tables,
and it “effects” by producing suggestions for Web page metadata that it
presents to a human classifier.
- “Are autonomous”; Based on experience, the agents build a
knowledge base. This experience is used to autonomously combine information
from the search index, the URL analyzer, the header analyzer and the text
analysis into a set of metadata for any particular Web page (a sketch of this
combination step follows this list). Humans continuously correct the agents’
work, and the rules in the knowledge base can be changed following the
corrections.
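As a concrete illustration of this combination step, here is a minimal sketch, under the assumption that the knowledge base can be represented as simple weighted mappings from URL fragments and keywords to Dewey codes. All names and the dictionary format are illustrative, not a specification of the EDDIC knowledge base.

```python
# Hypothetical sketch: merging analyzer evidence into one metadata suggestion
# that a human classifier can accept, adjust or reject.
def suggest_metadata(url, headers, text, rules):
    suggestion = {"url": url, "candidate_ddc": {}}

    def add_evidence(code, weight):
        suggestion["candidate_ddc"][code] = suggestion["candidate_ddc"].get(code, 0) + weight

    # Evidence from the URL analyzer, e.g. "/sport/" in the URL hints at DDC 796 (sports).
    for fragment, (code, weight) in rules.get("url_patterns", {}).items():
        if fragment in url:
            add_evidence(code, weight)

    # Evidence from the header analyzer: language and modification date.
    suggestion["language"] = headers.get("Content-Language")
    suggestion["last_modified"] = headers.get("Last-Modified")

    # Evidence from text analysis: keyword lookups in the code tables.
    for word in text.lower().split():
        if word in rules.get("keyword_to_ddc", {}):
            add_evidence(*rules["keyword_to_ddc"][word])

    # Rank candidate codes by accumulated weight for the human classifier.
    suggestion["candidate_ddc"] = dict(
        sorted(suggestion["candidate_ddc"].items(), key=lambda kv: kv[1], reverse=True))
    return suggestion
```

When a human classifier corrects a suggestion, the weights in the rules table can be adjusted accordingly, which is how the agents’ experience accumulates.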
8.1.2 Search Interface Agents
The search interface webrarians:
- “Are long-lived”; A search interface agent is spawned each
time a user comes to the search page and requires assistance. Its life
lasts until the user is satisfied or leaves the search site for other
reasons.
- “Have goals, sensors and effectors”; Their goal is to assist
the user by asking questions and filtering information from the index
so that the user only gets to see the index entries that match the answers
the user gives to the agent’s questions. The agent senses by reading
from the index and by receiving the user’s answers to its questions. The
agent’s effector is the filtering mechanism used to create the hit lists
for the user.
- “Are autonomous”; The agent looks at the available data and,
based on this, generates questions for the user to answer, aiming to narrow
down the search as efficiently as possible (a sketch of both the question
selection and the filtering follows this list).
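A minimal sketch of how such an agent could choose its next question and apply its filtering effector is given below. The facet-based representation of index entries is a simplifying assumption for illustration; the actual index structure is described elsewhere in the thesis.

```python
# Hypothetical sketch of a search interface webrarian's two core operations.
from collections import Counter

def next_question(hits, facets):
    """Pick the facet whose answers split the remaining hits most evenly,
    so that any answer from the user removes as many entries as possible."""
    best_facet, best_score = None, -1.0
    for facet in facets:
        values = Counter(h.get(facet) for h in hits if h.get(facet) is not None)
        if len(values) < 2:
            continue                       # this facet cannot narrow the hit list
        largest_share = max(values.values()) / sum(values.values())
        score = 1.0 - largest_share        # higher when no single answer dominates
        if score > best_score:
            best_facet, best_score = facet, score
    return best_facet

def filter_hits(hits, facet, answer):
    """The agent's effector: keep only the entries matching the user's answer."""
    return [h for h in hits if h.get(facet) == answer]
```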
8.1.3 Choosing A Webrarian Language
This report has concentrated on the search tool application only. This
may give the impression that the agents do not have to communicate
very much with each other, but can get the information they need directly
from the Web and the index that is built. However, in the future it will
be important that the webrarians are able to communicate with other agents
and applications, so we need to base an agent system implementation on
a widely “spoken” agent language. There are several reasons for this.
Most importantly, the architecture we have described is based on there
being one central index covering all Web pages on the whole Internet. This
is perhaps not the most realistic approach in practice. An alternative
is to create a number of indexes, where each index covers either Web pages
from a certain geographical area only, or Web pages from all over the world
that cover some more or less specific subject. Each index will have retrieval
agents trained to only report back information about pages that are of
interest to that particular index. Distributing the indexing system like
this has many advantages, as we shall see later. Such a situation requires
that webrarians can communicate with agents in other indexes, so that they
can tell the user where to look for the information the user is requesting.
Another very important reason for using agents with the ability to communicate
with other agents is that in the future more and more operations, especially
when it comes to information handling, will be performed by more or less
intelligent agents. Personal news agents should be able to look up breaking
news on the Web from all kinds of categories for the user through our index.
Broker agents may also be introduced, meaning that instead of the user
doing the search assisted by a webrarian, a personal agent that knows its
owner’s desires and interests really well may talk directly to our webrarians
and locate interesting information for the user. Web publishing software
may come equipped with publishing agents, which at publication generate
the necessary metadata and go straight to our index to register the new
Web pages quickly and correctly.
To make all this possible we must base our agent system on the de facto
standard agent communication language, ACL, as described in Chapter
4.5. When an actual standard agent communication language is decided on,
that must be our choice as well. Communication with other agent systems
is vital to the success of the webrarians.
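To make this concrete, the sketch below shows what a KQML-style request from one webrarian to an agent in another index could look like. The performative and parameter names follow KQML as described in Chapter 4.5; the ontology name, agent names and the KIF-like content expression are illustrative assumptions only.

```python
# Illustrative sketch of composing a KQML-style message between agents.
def kqml_ask_one(sender, receiver, content, reply_with):
    return ("(ask-one"
            f" :sender {sender}"
            f" :receiver {receiver}"
            " :language KIF"
            " :ontology eddic-index"      # assumed shared ontology name
            f" :reply-with {reply_with}"
            f" :content \"{content}\")")

# A webrarian asking a topical index whether it covers Dewey class 781.65 (jazz).
message = kqml_ask_one(
    sender="webrarian-17",
    receiver="index-agent-music",
    content="(covers-subject ?index \"781.65\")",
    reply_with="q42")
print(message)
```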
8.2 Centralized vs. Distributed Indexing
As we have already briefly mentioned, an alternative to having one large
“Mother of all indexes” is to divide the search tool’s functionality and
contents into several separate but agent-connected units. An index in the
distributed model can cover either a geographical area, a specific topic,
a certain context or a combination of these. The main advantages of distributing
the indexing and search system are:
- Faster searches in more up-to-date data become possible. This
means that as soon as a user finds an index that covers his or her most
important information needs, the user will get a faster service and higher
quality data.
- Smaller indexes mean that less raw computing and networking power is required,
as the workload is divided among several computers. Therefore, instead of
buying expensive equipment capable of handling massive amounts of requests
and transactions and storing enormous amounts of data, we can use standard
computers and Internet connections with a reasonable bandwidth.
- With the distributed workload, the chance of a system breakdown is reduced
dramatically. This automatically results in more robustness and flexibility
in the system. If one index computer or index site goes out of service,
temporarily or forever, the requests from users to this index can be routed
on to another index which holds information that may be of interest to
the user (a routing sketch is given at the end of this section).
These advantages are so significant that as soon as a prototype of the
search system is built, the next step should be to start distributing the
index.
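The routing mentioned under the third advantage could, for example, work as in the following sketch. The index descriptions and the order of the fallback rules are assumptions made for illustration; a real deployment would negotiate this between the agencies.

```python
# Hypothetical sketch of routing a request in a distributed EDDIC setup.
INDEXES = [
    {"name": "eddic-nordic", "region": "nordic", "topics": None,                  "alive": True},
    {"name": "eddic-music",  "region": None,     "topics": {"780", "781", "782"}, "alive": True},
    {"name": "eddic-world",  "region": None,     "topics": None,                  "alive": True},  # catch-all
]

def route(query_ddc, user_region):
    """Pick the most specific live index; fall back when an index is down."""
    # 1. A topical index whose Dewey range covers the query.
    for idx in INDEXES:
        if idx["alive"] and idx["topics"] and query_ddc[:3] in idx["topics"]:
            return idx["name"]
    # 2. The regional index closest to the user.
    for idx in INDEXES:
        if idx["alive"] and idx["region"] == user_region:
            return idx["name"]
    # 3. Any live catch-all index.
    for idx in INDEXES:
        if idx["alive"] and idx["topics"] is None and idx["region"] is None:
            return idx["name"]
    raise RuntimeError("no live index available")

print(route("781.65", "nordic"))   # -> 'eddic-music'
```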
8.3 Preparations For System Start-Up
In addition to the technical part, the programming and hardware setup,
the EDDIC system is dependent on a number of organizational and human factors.
To create and maintain our search tool, we need to team up with a number
of partners. These partners will provide the code framework for our indexing
system, they will help with improving the quality of the contents and they
will help us with comments and suggestions for general system improvements.
Who these partners shall be and how we can cooperate with them must be
clear before we start building the system.
8.3.1 The Codes
The main fundament of the EDDIC system suggested in this report is the
Dewey Decimal Classification code, which provides us with codes for describing
the context of Web pages. The DDC code is maintained by the Library of
Congress (LOC), so this will be an important partner for us. To ensure
international popularity, we also need to cooperate with organizations
in various countries that are responsible for translations of the DDC.
These organizations are already looking for new ways to ensure easy, public
access to all the information that is put on-line on the Internet. Because
of this, it should be possible to convince them of the importance of the
EDDIC system and their cooperation with us.
To develop a Web version of the Dewey codes, similar to the Dewey for
Windows software, we must cooperate with the Online Computer Library Center
(OCLC). In addition to letting people navigate in the hierarchy of codes
as described in Chapter 7, we must also define rules for the agents to
use, based on the notes from LOC on how to classify information using Dewey.
This partnership should come naturally, as the idea behind the EDDIC system
coincides well with OCLC’s purpose, namely to be “…a nonprofit, membership,
library computer service and research organization dedicated to the public
purposes of furthering access to the world’s information and reducing information
costs” [OCLC, 1998].
Finally, we must cooperate with other developers and suppliers of various
popular rating and code systems for Web pages. For now, that first of all
means the World Wide Web Consortium’s PICS group. From these partners we
need thorough descriptions of the code formats we are to include in our
index, so that our system may contribute to making new codes popular faster,
and make it easier to search and filter information from the Internet.
These two main effects should be motivation enough for various rating and
code format “owners” to cooperate with us.
8.3.2 The Co-workers
To run the EDDIC service we need a staff. The staff’s tasks can roughly
be divided into technical maintenance and development work on one side
and manual classification work to maintain the contents of the index on
the other.
For the technical part, we should work together with people who already
have experience in creating Web spiders for information retrieval. This
means that most search engine providers are possible candidates. However,
it will be best to start with a smaller domain, and cooperate with the people
behind a search engine that only covers one part of the Internet and has
comprehensive data about this part. A very interesting partner would be
the Nordic Net Centre [NNC, 1998].
This is the main actor in a joint effort by several large educational
institutions in the Nordic countries, running a project on metadata and
Web pages that is scheduled to end in the spring of 1998: the Nordic Metadata
Project, whose results are used in, for example, the Nordic Web Index [NWI, 1998].
Throughout this project a database of tens of thousands of pages with Dublin
Core metadata has been built, as well as 5-6 million fully indexed Web
pages, including link structure, from all over the Nordic region. The experience
and data from this workgroup can be very useful to us, both as a help in
creating the software and to test the hypothesis about the importance of
using URLs to decide the context of a Web page.
In the long run, we will need a number of technical people working on
keeping the search tool running and improving it. These should be full-time
employees, working directly for the institution or company offering the
search service.
For the manual work, we must employ people with classification experience,
and the natural choice is to use librarians. The number of librarians to
employ will be decided by how ambitious we want to be. In the beginning
we do not need specialists in any fields, but just general classification
personnel with a good knowledge of the languages we want to cover Web pages
in. Later, we may need specialists on some of the subjects that have a large
number of Web pages devoted to them. To find suitable librarians, we must
contact a university library or a technical library that has the resources
for, and an interest in, participating in projects such as ours.
8.3.3 The Contents
To ensure a high quality database from the start, we should initiate it
by mapping an existing directory into it. The directory should be divided
into topics in a well-designed hierarchy. The most comprehensive directory
suitable for the purpose is the Yahoo! service. Compared to building our
index from scratch, it is an easier task to map the categories from Yahoo!
into Dewey codes and, using the mapping, move Web page entries from the
Yahoo! hierarchy into the EDDIC hierarchy. The amount of data covered
by the Yahoo! index should be sufficient to give our webrarians a good
basis for their classification work.
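A sketch of how such a mapping could drive the initial import is given below. The category names and Dewey codes shown are examples only; the full mapping table would have to be built and verified by classification personnel.

```python
# Hypothetical sketch of seeding the EDDIC index from an existing directory.
CATEGORY_TO_DDC = {
    "Recreation/Sports":  "796",     # athletic and outdoor sports and games
    "Arts/Music/Jazz":    "781.65",  # jazz
    "Science/Astronomy":  "520",     # astronomy
}

def seed_index(directory_entries, add_to_eddic):
    """Move directory entries into the EDDIC hierarchy via the mapping table."""
    unmapped = []
    for entry in directory_entries:          # e.g. {"url": ..., "category": ...}
        ddc = CATEGORY_TO_DDC.get(entry["category"])
        if ddc is None:
            unmapped.append(entry)           # left for manual classification
            continue
        add_to_eddic(url=entry["url"], ddc=ddc, source="directory-import")
    return unmapped
```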
In return for providing their index information, Yahoo! can later be
given information useful to them, such as information about expired links
and the possibility of copying information about high quality Web pages
from our index.
Another option is to use the part of the Nordic Web Index that contains
Web pages with Dublin Core metadata to build a base set of index entries.
These roughly 80,000 entries contain information about title, author, subject,
description and publication date, which should make a suitable training
environment for the webrarian agents and the human classification personnel.
To reduce the amount of network traffic in the future, we may want to
cooperate with one or more of the major search engine companies. As long
as we verify the actual existence of a page, we may as well classify the
copy of the page stored in a search engine index instead of fetching the
page itself.
8.3.4 The Comments
The EDDIC system must be built iteratively, with a prototype for test users
to use and give feedback on as soon as possible. We need test users both
on the classification side and the search side. The test users may or may
not have had any experience with existing systems for organizing and searching
the Web or other large information spaces. Our system introduces innovative
ways of doing both, not very similar to any existing system, so
the most important thing is not the test users’ knowledge and experience,
but that they represent the exact kind of users the system is intended
for: everyone.
If the system behaves as we hope it will and actually makes it easier
to locate the information one wants, finding people to volunteer as test
users should not be difficult. All feedback received from the users should
be considered and rapidly result in changes to the system if this seems
called for. Successful modifications are kept as part of the system.
The comments from users are a very important part of a rapid development
of a functional system. Because of the way the Web is growing, it is necessary
to get the system up and running as soon as possible. Having many satisfied
test users is also important in the process of making the search tool known
to the world. In addition to advertisements, word of mouth is the best
way to spread the news about our system.
8.4 Financing the Service
To start and run the EDDIC system will be expensive, due to the cost of
developing the software, buying the necessary hardware, employing classification
personnel and making the world aware of the system’s existence. We must
consider how this all can be financed.
The EDDIC project can either be implemented within the university world
or by a purely commercial company or coalition. In any case it is important
that the actors understand that this service must be realized as soon as
possible, and are willing to cover the costs as they appear. The system
is more or less ready to be implemented, based on the specifications and
references contained in this report. Possible financial supporters may
be national research councils, funds reserved for major university projects,
or one of the main actors in the Internet/search engine world who wants
to be seen as innovative.
Independently of who the investor behind the system is, we have several
options when it comes to finding ways to return the investor’s money, and more.
First, the traditional solutions, which are easy to implement:
- Advertisements: We are in a very good position for attracting
advertisers to our search page. This is partially because the page obviously
will attract a large number of people, but mainly because, thanks to our
context search possibilities, we can channel advertisements towards exactly
the kind of customers each particular advertiser wants to reach. Examples:
If someone seems to be looking for information about new cars, give them
the ad for the new Volvo. If a search seems to regard horse riding in South
America, advertisements from the largest Chilean chain of riding holiday
resorts may very well be called for.
- Subscription service: Special high-quality indexes for certain
groups of users can be established, where higher speed and more personalized
selection and ranking algorithms are available. Access to this part of
the EDDIC system can be sold on a subscription basis. We can also keep
track of what domains our system receives especially many search requests
from, and ask for donations as a kind of subscription from major commercial
domains that have many employees using the system.
With the arrival of e-cash, we can start charging for the use of our service
even in very small units of money:
- Pay for indexing: A special subset of the index or even a
separate index may be established for commercial services that want to
have their information classified instantly. While submitting personal
Web pages for classification and indexing should be free, we may collect
a small fee from commercial companies that want to have particular pages
classified in our index. This “pay for indexing” service can be introduced
if the EDDIC system turns out to be a popular tool for people who want
to buy things.
- Pay per use – user: As an alternative to subscribing to faster
and higher-quality information, we can charge all or certain groups of
users a small fee for each search they perform using a special access-restricted,
improved part of our system. Instead of paying per search, payment
can also be calculated from how many of the links suggested by the webrarians
the user actually follows.
- Pay per use – provider: For commercial pages that offer certain
products and services for on-line purchase, we can charge the page owners
a small fee for each user that follows a link from our search tool to their
pages. These pages will have a special page class code, so that users know
these are Web pages that pay for being found through our index. Introducing
this mechanism must be done very carefully, as it borders on charging for
leading unknowing users only to the commercial actors with large advertising
budgets.
Selling access to the whole search tool, or just our index, to other search
tools and Web sites may also be a way to finance the service.
8.5 Dissecting the Monster
To build a functional, robust and scalable system like the one suggested
in this report, a number of subdisciplines from several research areas
must be brought together. Presented here is an outline of what problem
areas our system construction task consists of, based on a similar, more
general analysis for electronic commerce / digital library systems in [Adam
& Yesha, 1996].
Area 1: Acquiring and storing information.
- Web spider technology must be combined with agent technology and classification
algorithms.
- The suggested extensions to the Dewey Decimal Classification (DDC) must
be precisely defined.
- How to classify and represent non-textual document objects must be decided.
- The retrieval and classification system must be multilingual. We must therefore
create translation mechanisms and/or a multilingual index.
- Database techniques for providing efficient management of the large volumes
of information we will handle must be developed.
- Feature extraction techniques for recognizing specific classes of Web pages
must be developed, based on text analysis, pattern recognition in keywords,
URLs, file names, link structure, etc.
- Identical information from several Web pages must be identified and collected,
and not listed as several different findings (see the sketch after this list).
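One simple way to meet the last requirement is to fingerprint a normalized version of each page's text and group URLs that share a fingerprint, as in the sketch below. This is only one possible technique, stated here as an assumption, not as the chosen EDDIC mechanism.

```python
# Sketch: detecting identical information served from several URLs.
import hashlib
from collections import defaultdict

def fingerprint(text):
    """Normalize case and whitespace before hashing, so trivial formatting
    differences do not hide a duplicate."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def group_duplicates(pages):
    """pages: iterable of (url, text) pairs; returns fingerprint -> URLs
    for every group of pages carrying the same information."""
    groups = defaultdict(list)
    for url, text in pages:
        groups[fingerprint(text)].append(url)
    return {fp: urls for fp, urls in groups.items() if len(urls) > 1}
```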
Area 2: Finding and filtering information.
- Design the system to describe Web pages by their context and capabilities
rather than by more random qualities such as the name of the Web page owner
or the exact words used to describe the service.
- Develop mechanisms to perform content- and context-based search through
our index.
- Since the Web cannot be censored, it must be possible to censor what kind
of information our search and filtering webrarians show to the user.
- Provide users at different levels of computer expertise with a variety
of efficient ways to search for information on Web pages.
- Create graphical representations of the search index, using maps and three-dimensional
data to give the users new, intuitive ways to navigate in the data.
- Take actions to ensure that users can search for information using their
own language and terms from domains they are familiar with, their ontologies.
This is done by mapping sets of keywords from different domains and
languages to the DDC codes where they belong (see the sketch after this list).
- For commercial pages, design a code system following the EDDIC guidelines
that allows buyers to locate products and services with specific characteristics.
- Webrarians, the intelligent agents in the EDDIC system, must be equipped
with a certain learning ability, so that their performance may improve
over time.
- A mechanism for creating a personal user profile to improve the webrarians’
search assistance for each user must be developed, without intruding on
each particular user’s privacy.
- An agent communication language for exchanging information between agents
within and between agencies and indexes, based on the Knowledge Query and
Manipulation Language (KQML) and the Knowledge Interchange Format (KIF),
must be implemented.
- Techniques from data mining can be used to extract patterns in the users’
interests, and give the webrarians further help in deciding what pages
may interest a particular user.
- The webrarians must offer both query refinement and query expansion abilities,
so that they can guide users both to smaller and larger sets of findings,
depending on what the situation demands.
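The ontology requirement above, letting users search with terms from their own language and domain, can be illustrated with a simple lookup table from (language, term) pairs to Dewey codes. The vocabulary shown is a tiny illustrative sample; building the real mapping is part of the manual classification work described earlier in this chapter.

```python
# Hypothetical sketch: mapping a user's own terms to the Dewey codes they belong to.
KEYWORD_TO_DDC = {
    ("en", "jazz"):       "781.65",
    ("no", "jazz"):       "781.65",
    ("en", "astronomy"):  "520",
    ("no", "astronomi"):  "520",
}

def terms_to_codes(terms, language):
    """Translate query terms into candidate Dewey codes for the webrarian."""
    codes = set()
    for term in terms:
        code = KEYWORD_TO_DDC.get((language, term.lower()))
        if code:
            codes.add(code)
    return codes

print(terms_to_codes(["Jazz", "astronomi"], "no"))   # -> {'781.65', '520'}
```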
Area 3: Securing information and auditing access.
- Mechanisms to allow charging fees for the use of the EDDIC system without
divulging the identity of the user are needed.
- Mechanisms to register how many times each link is followed, and from where,
must be implemented, to be able to introduce a billing system for advertisers
on EDDIC pages.
- Webrarians must respect that not all Web sites want their pages to be indexed
(see the sketch after this list).
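The last point can be respected with the robots exclusion convention: before a webrarian fetches a page, it checks the site's robots.txt file. The user-agent string below is an illustrative assumption.

```python
# Sketch: letting retrieval webrarians respect sites that refuse indexing.
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse, urljoin

def may_fetch(url, user_agent="eddic-webrarian"):
    parts = urlparse(url)
    robots_url = urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt")
    parser = RobotFileParser(robots_url)
    try:
        parser.read()
    except OSError:
        return True              # no readable robots.txt: assume fetching is allowed
    return parser.can_fetch(user_agent, url)

if may_fetch("http://www.example.org/some/page.html"):
    pass  # the webrarian may go ahead and retrieve the page
```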
Area 4: Universal access.
- Access to the system should be possible with any widely used Web browser,
with no plug-ins required. This means the system must be available using the
standard HTTP protocol and with pages formatted in standard HTML.
- Good user interfaces must be developed. A highly interactive user interface
with hit lists efficiently created on the fly is necessary.
- The visual design of the search assistants and the user interface is important.
The appearance of the webrarians must not indicate that they are more “intelligent”
than they actually are, and it must be clear at all times how the webrarians
come up with their suggestions.
- Future (soon present) technology, such as XML, must be taken advantage
of to offer a more personalized user interface to the search system.
Area 5: Cost management and financial instruments.
- Ways to cover the costs of running the EDDIC search service must be found.
This includes developing a number of usage-time- and per-request-based
cost models, and identifying groups of customers that are willing
to pay for a faster and advertisement-free service.
- Electronic cash billing abilities must be part of the EDDIC system.
Area 6: Socioeconomic impact.
- A search system that also is able to guide a user to Web pages where what
the user requests is for sale can be a very important factor in the transition
into a global, electronic economy. We must ensure that this kind of user
guidance is used for the good of society and not just to support the
largest actors on the commercial scene.
- The system can and should be used to strengthen the position of languages
spoken by minorities, to prevent the Web from turning the world into a
purely English-speaking one.
- Ways to adapt the EDDIC services to people with different cultural backgrounds
must be introduced.
As we see, numerous fields from computer science, library science, mathematics,
graphical design, social sciences, economics and psychology must be combined
to successfully implement the EDDIC system. A broad cross-disciplinary
effort is required.