8 Implementation Aspects
Throughout the chapters, we have presented many aspects of the technology
behind the kind of search tool we outline in this thesis. Before we can
conclude the thesis, some loose threads need to be tied up, strengthened
or cut off.
8.1 Webrarians are Agents
We have claimed that the webrarians are agents. Our agency employs two
main types of agents:
- Autonomous information agents for retrieval and pre-classification of Web pages
- Interface agents for supporting users searching the Web page index
We will now show that the webrarians meet the agent definition from Chapter
4.1, and explain what possibilities agent technology opens up for our system.
8.1.1 Retrieval and Pre-classification Agents
The information retrieval webrarians:
- “Are long-lived”; The information retrieval agents are spawned
and sent out onto the Web to find new Web pages and gather all kinds of
information useful for the index. To avoid having too many agents consume
Internet resources, agents that have reached their goals and/or followed
a certain number of links without finding any new information must be programmed
to “die”.
- “Have goals, sensors and effectors”; Some agents will have
clearly defined goals, such as “verify the existence of this URL”, while
others have the more general goal of locating Web pages that are not yet in
the index. The “sensors and effectors” requirement is covered by the agents’
navigation and reporting abilities: they request information using the HTTP
protocol and report back by sending messages in the agent language.
- “Are autonomous”; Given a goal, the agents will use their
knowledge base to decide which links to follow in what order, how to act
when a link seems to be dead, and so on. Once an agent is activated it needs
no further instructions; all that remains is to wait for the agent to report
its results (a minimal sketch of such a decision loop follows this list).
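The sketch below illustrates what such an autonomous decision loop could look like. It is a minimal, hypothetical sketch: the class name, the barren-link threshold and the report_to_index callback are illustrative assumptions, not part of a specified implementation. The only capabilities assumed are the ones listed above, fetching pages over HTTP and reporting findings back to the agency.

```python
# Minimal, hypothetical sketch of a retrieval webrarian's main loop.
# Names and thresholds are illustrative only.
from urllib.request import urlopen
from urllib.error import URLError

class RetrievalAgent:
    def __init__(self, start_urls, known_urls, max_barren_links=50):
        self.frontier = list(start_urls)      # links still to follow
        self.known_urls = set(known_urls)     # pages already in the index
        self.max_barren_links = max_barren_links
        self.barren_links = 0                 # links followed without new findings

    def run(self, report_to_index):
        """Follow links until the frontier is empty or the agent 'dies'."""
        while self.frontier and self.barren_links < self.max_barren_links:
            url = self.frontier.pop(0)
            try:
                page = urlopen(url, timeout=10).read()
            except URLError:
                continue                      # dead link: skip it and move on
            if url in self.known_urls:
                self.barren_links += 1        # nothing new found here
                continue
            self.barren_links = 0
            self.known_urls.add(url)
            report_to_index(url, page)        # hand the finding to the agency
            # Extraction of new links from `page` into self.frontier is omitted.
        # Reaching this point corresponds to the agent "dying".
```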
The Web page pre-classification webrarians:
- “Are long-lived”; These agents’ life cycle has the phases
1. Birth, 2. Retrieve information,
3. Interpret information, 4. Report information, 5. Death.
- “Have goals, sensors and effectors”; The goal for this type
of agent is to make as much “sense” of a Web page as possible, and to come
up with information that can be of help to a human classifier of Web pages.
It senses through text analysis and lookups in the index and code tables,
and it “effects” by producing suggestions for Web page metadata that it
presents to a human classifier.
- “Are autonomous”; Based on experience, the agents build a
knowledge base. This experience is used to autonomously combine information
from the search index, the URL analyzer, the header analyzer and the text
analysis into a set of metadata for any particular Web page (a sketch of this
combination step follows this list). Humans continuously correct the agents’
work, and the rules in the knowledge base can be changed following the
corrections.
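As a concrete illustration of this combination step, here is a minimal sketch, under the assumption that the knowledge base can be represented as simple weighted mappings from URL fragments and keywords to Dewey codes. All names and the dictionary format are illustrative, not a specification of the EDDIC knowledge base.

```python
# Hypothetical sketch: merging analyzer evidence into one metadata suggestion
# that a human classifier can accept, adjust or reject.
def suggest_metadata(url, headers, text, rules):
    suggestion = {"url": url, "candidate_ddc": {}}

    def add_evidence(code, weight):
        suggestion["candidate_ddc"][code] = suggestion["candidate_ddc"].get(code, 0) + weight

    # Evidence from the URL analyzer, e.g. "/sport/" in the URL hints at DDC 796 (sports).
    for fragment, (code, weight) in rules.get("url_patterns", {}).items():
        if fragment in url:
            add_evidence(code, weight)

    # Evidence from the header analyzer: language and modification date.
    suggestion["language"] = headers.get("Content-Language")
    suggestion["last_modified"] = headers.get("Last-Modified")

    # Evidence from text analysis: keyword lookups in the code tables.
    for word in text.lower().split():
        if word in rules.get("keyword_to_ddc", {}):
            add_evidence(*rules["keyword_to_ddc"][word])

    # Rank candidate codes by accumulated weight for the human classifier.
    suggestion["candidate_ddc"] = dict(
        sorted(suggestion["candidate_ddc"].items(), key=lambda kv: kv[1], reverse=True))
    return suggestion
```

When a human classifier corrects a suggestion, the weights in the rules table can be adjusted accordingly, which is how the agents’ experience accumulates.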
8.1.2 Search Interface Agents
The search interface webrarians:
- “Are long-lived”; A search interface agent is spawned each
time a user comes to the search page and requires assistance. Its life
lasts until the user is satisfied or leaves the search site for other
reasons.
- “Have goals, sensors and effectors”; Their goal is to assist
the user by asking questions and filtering information from the index
so that the user only gets to see the index entries that match the answers
the user gives to the agent’s questions. The agent senses by reading
from the index and by receiving the user’s answers to its questions. The
agent’s effector is the filtering mechanism used to create the hit lists
for the user.
- “Are autonomous”; The agent looks at the available data and,
based on this, generates questions for the user to answer, aiming to narrow
down the search as efficiently as possible (a sketch of both the question
selection and the filtering follows this list).
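A minimal sketch of how such an agent could choose its next question and apply its filtering effector is given below. The facet-based representation of index entries is a simplifying assumption for illustration; the actual index structure is described elsewhere in the thesis.

```python
# Hypothetical sketch of a search interface webrarian's two core operations.
from collections import Counter

def next_question(hits, facets):
    """Pick the facet whose answers split the remaining hits most evenly,
    so that any answer from the user removes as many entries as possible."""
    best_facet, best_score = None, -1.0
    for facet in facets:
        values = Counter(h.get(facet) for h in hits if h.get(facet) is not None)
        if len(values) < 2:
            continue                       # this facet cannot narrow the hit list
        largest_share = max(values.values()) / sum(values.values())
        score = 1.0 - largest_share        # higher when no single answer dominates
        if score > best_score:
            best_facet, best_score = facet, score
    return best_facet

def filter_hits(hits, facet, answer):
    """The agent's effector: keep only the entries matching the user's answer."""
    return [h for h in hits if h.get(facet) == answer]
```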
8.1.3 Choosing A Webrarian Language
This report has concentrated on the search tool application only. This
may give the impression that the agents do not have to communicate
very much with each other, but can get the information they need directly
from the Web and the index that is built. However, in the future it will
be important that the webrarians are able to communicate with other agents
and applications, so we need to base an agent system implementation on
a widely “spoken” agent language. There are several reasons for this.
Most importantly, the architecture we have described is based on there
being one central index covering all Web pages on the whole Internet. This
is perhaps not the most realistic approach in practice. An alternative
is to create a number of indexes, where each index covers either Web pages
from a certain geographical area only, or Web pages from all over the world
that cover some more or less specific subject. Each index will have retrieval
agents trained to only report back information about pages that are of
interest to that particular index. Distributing the indexing system like
this has many advantages, as we shall see later. Such a situation requires
that webrarians can communicate with agents in other indexes, so that they
can tell the user where to look for the information the user is requesting.
Another very important reason for using agents with the ability to communicate
with other agents is that in the future more and more operations, especially
when it comes to information handling, will be performed by more or less
intelligent agents. Personal news agents should be able to look up breaking
news on the Web from all kinds of categories for the user through our index.
Broker agents may also be introduced, meaning that instead of the user
doing the search assisted by a webrarian, a personal agent that knows its
owner’s desires and interests really well may talk directly to our webrarians
and locate interesting information for the user. Web publishing software
may come equipped with publishing agents, which at publication generate
the necessary metadata and go straight to our index to register the new
Web pages quickly and correctly.
To make all this possible we must base our agent system on the de facto
standard agent communication language, ACL, as described in Chapter
4.5. When an actual standard agent communication language is decided on,
that must be our choice as well. Communication with other agent systems
is vital to the success of the webrarians.
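To make this concrete, the sketch below shows what a KQML-style request from one webrarian to an agent in another index could look like. The performative and parameter names follow KQML as described in Chapter 4.5; the ontology name, agent names and the KIF-like content expression are illustrative assumptions only.

```python
# Illustrative sketch of composing a KQML-style message between agents.
def kqml_ask_one(sender, receiver, content, reply_with):
    return ("(ask-one"
            f" :sender {sender}"
            f" :receiver {receiver}"
            " :language KIF"
            " :ontology eddic-index"      # assumed shared ontology name
            f" :reply-with {reply_with}"
            f" :content \"{content}\")")

# A webrarian asking a topical index whether it covers Dewey class 781.65 (jazz).
message = kqml_ask_one(
    sender="webrarian-17",
    receiver="index-agent-music",
    content="(covers-subject ?index \"781.65\")",
    reply_with="q42")
print(message)
```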
8.2 Centralized vs. Distributed Indexing
As we have already briefly mentioned, an alternative to having one large
“Mother of all indexes” is to divide the search tool’s functionality and
contents into several separate but agent-connected units. An index in the
distributed model can cover either a geographical area, a specific topic,
a certain context or a combination of these. The main advantages of distributing
the indexing and search system are:
- Faster searches in more up-to-date data become possible. This
means that as soon as a user finds an index that covers his or her most
important information needs, the user will get a faster service and higher
quality data.
- Smaller indexes mean that less raw computing and networking power is required,
as the workload is divided among several computers. Therefore, instead of
buying expensive equipment capable of handling massive amounts of requests
and transactions and storing enormous amounts of data, we can use standard
computers and Internet connections with a reasonable bandwidth.
- With the distributed workload, the chance of a system breakdown is reduced
dramatically. This automatically results in more robustness and flexibility
in the system. If one index computer or index site goes out of service,
temporarily or forever, the requests from users to this index can be routed
on to another index which holds information that may be of interest to
the user (a routing sketch is given at the end of this section).
These advantages are so significant that as soon as a prototype of the
search system is built, the next step should be to start distributing the
index.
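The routing mentioned under the third advantage could, for example, work as in the following sketch. The index descriptions and the order of the fallback rules are assumptions made for illustration; a real deployment would negotiate this between the agencies.

```python
# Hypothetical sketch of routing a request in a distributed EDDIC setup.
INDEXES = [
    {"name": "eddic-nordic", "region": "nordic", "topics": None,                  "alive": True},
    {"name": "eddic-music",  "region": None,     "topics": {"780", "781", "782"}, "alive": True},
    {"name": "eddic-world",  "region": None,     "topics": None,                  "alive": True},  # catch-all
]

def route(query_ddc, user_region):
    """Pick the most specific live index; fall back when an index is down."""
    # 1. A topical index whose Dewey range covers the query.
    for idx in INDEXES:
        if idx["alive"] and idx["topics"] and query_ddc[:3] in idx["topics"]:
            return idx["name"]
    # 2. The regional index closest to the user.
    for idx in INDEXES:
        if idx["alive"] and idx["region"] == user_region:
            return idx["name"]
    # 3. Any live catch-all index.
    for idx in INDEXES:
        if idx["alive"] and idx["topics"] is None and idx["region"] is None:
            return idx["name"]
    raise RuntimeError("no live index available")

print(route("781.65", "nordic"))   # -> 'eddic-music'
```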
8.3 Preparations For System Start-Up
In addition to the technical part, the programming and hardware setup,
the EDDIC system is dependent on a number of organizational and human factors.
To create and maintain our search tool, we need to team up with a number
of partners. These partners will provide the code framework for our indexing
system, they will help with improving the quality of the contents and they
will help us with comments and suggestions for general system improvements.
Who these partners shall be and how we can cooperate with them must be
clear before we start building the system.
8.3.1 The Codes
The main fundament of the EDDIC system suggested in this report is the
Dewey Decimal Classification code, which provides us with codes for describing
the context of Web pages. The DDC code is maintained by the Library of
Congress (LOC), so this will be an important partner for us. To ensure
international popularity, we also need to cooperate with organizations
in various countries that are responsible for translations of the DDC.
These organizations are already looking for new ways to ensure easy, public
access to all the information that is put on-line on the Internet. Because
of this, it should be possible to convince them of the importance of the
EDDIC system and their cooperation with us.
To develop a Web version of the Dewey codes, similar to the Dewey for
Windows software, we must cooperate with the Online Computer Library Center
(OCLC). In addition to letting people navigate in the hierarchy of codes
as described in Chapter 7, we must also define rules for the agents to
use, based on the notes from LOC on how to classify information using Dewey.
This partnership should come naturally, as the idea behind the EDDIC system
coincides well with OCLC’s purpose, namely to be “…a nonprofit, membership,
library computer service and research organization dedicated to the public
purposes of furthering access to the world’s information and reducing information
costs” [OCLC, 1998].
Finally, we must cooperate with other developers and suppliers of various
popular rating and code systems for Web pages. For now, that first of all
means the World Wide Web Consortium’s PICS group. From these partners we
need thorough descriptions of the code formats we are to include in our
index, so that our system may contribute to making new codes popular faster,
and make it easier to search and filter information from the Internet.
These two main effects should be motivation enough for various rating and
code format “owners” to cooperate with us.
8.3.2 The Co-workers
To run the EDDIC service we need a staff. The staff’s tasks can roughly
be divided into technical maintenance and development work on one side
and manual classification work to maintain the contents of the index on
the other.
For the technical part, we should work together with people who already
have experience in creating Web spiders for information retrieval. This
means that most search engine providers are possible candidates. However,
it will be best to start with a smaller domain, and cooperate with the people
behind a search engine that only covers one part of the Internet and has
comprehensive data about this part. A very interesting partner would be
the Nordic Net Centre [NNC, 1998].
This is the main actor in a joint effort by several large educational
institutions in the Nordic countries, running a project on metadata and
Web pages that is scheduled to end in the spring of 1998: the Nordic Metadata
Project, whose results are used in, for example, the Nordic Web Index [NWI, 1998].
Throughout this project a database of tens of thousands of pages with Dublin
Core metadata has been built, as well as 5-6 million fully indexed Web
pages, including link structure, from all over the Nordic region. The experience
and data from this workgroup can be very useful to us, both as a help in
creating the software and to test the hypothesis about the importance of
using URLs to decide the context of a Web page.
In the long run, we will need a number of technical people working on
keeping the search tool running and improving it. These should be full-time
employees, working directly for the institution or company offering the
search service.
For the manual work, we must employ people with classification experience,
and the natural choice is to use librarians. The number of librarians to
employ will be decided by how ambitious we want to be. In the beginning
we do not need specialists in any fields, but just general classification
personnel with a good knowledge of the languages we want to cover Web pages
in. Later, we may need specialists on some of the subjects that have a large
number of Web pages devoted to them. To find suitable librarians, we must
contact a university library or a technical library that has the resources
for, and an interest in, participating in projects such as ours.
8.3.3 The Contents
To ensure a high quality database from the start, we should initiate it
by mapping an existing directory into it. The directory should be divided
into topics in a well-designed hierarchy. The most comprehensive directory
suitable for the purpose is the Yahoo! service. Compared to building our
index from scratch, it is an easier task to map the categories from Yahoo!
into Dewey codes and, using the mapping, move Web page entries from the
Yahoo! hierarchy into the EDDIC hierarchy. The amount of data covered
by the Yahoo! index should be sufficient to give our webrarians a good
basis for their classification work.
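A sketch of how such a mapping could drive the initial import is given below. The category names and Dewey codes shown are examples only; the full mapping table would have to be built and verified by classification personnel.

```python
# Hypothetical sketch of seeding the EDDIC index from an existing directory.
CATEGORY_TO_DDC = {
    "Recreation/Sports":  "796",     # athletic and outdoor sports and games
    "Arts/Music/Jazz":    "781.65",  # jazz
    "Science/Astronomy":  "520",     # astronomy
}

def seed_index(directory_entries, add_to_eddic):
    """Move directory entries into the EDDIC hierarchy via the mapping table."""
    unmapped = []
    for entry in directory_entries:          # e.g. {"url": ..., "category": ...}
        ddc = CATEGORY_TO_DDC.get(entry["category"])
        if ddc is None:
            unmapped.append(entry)           # left for manual classification
            continue
        add_to_eddic(url=entry["url"], ddc=ddc, source="directory-import")
    return unmapped
```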
In return for providing their index information, Yahoo! can later be
given information useful to them, such as information about expired links
and the possibility of copying information about high quality Web pages
from our index.
Another option is to use the part of the Nordic Web Index that contains
Web pages with Dublin Core metadata to build a base set of index entries.
These roughly 80,000 entries contain information about title, author, subject,
description and publication date, which should make a suitable training
environment for the webrarian agents and the human classification personnel.
To reduce the amount of network traffic in the future, we may want to
cooperate with one or more of the major search engine companies. As long
as we verify the actual existence of a page, we may as well classify the
copy of the page stored in a search engine index instead of fetching the
page itself.
8.3.4 The Comments
The EDDIC system must be built iteratively, with a prototype for test users
to use and give feedback on as soon as possible. We need test users both
on the classification side and the search side. The test users may or may
not have had any experience with existing systems for organizing and searching
the Web or other large information spaces. Our system introduces innovative
ways of doing both, not very similar to any existing system, so
the most important thing is not the test users’ knowledge and experience,
but that they represent the exact kind of users the system is intended
for: everyone.
If the system behaves as we hope it will and actually makes it easier
to locate the information one wants, finding people to volunteer as test
users should not be difficult. All feedback received from the users should
be considered and rapidly result in changes to the system if this seems
called for. Successful modifications are kept as part of the system.
The comments from users are a very important part of a rapid development
of a functional system. Because of the way the Web is growing, it is necessary
to get the system up and running as soon as possible. Having many satisfied
test users is also important in the process of making the search tool known
to the world. In addition to advertisements, word of mouth is the best
way to spread the news about our system.
8.4 Financing the Service
To start and run the EDDIC system will be expensive, due to the cost of
developing the software, buying the necessary hardware, employing classification
personnel and making the world aware of the system’s existence. We must
consider how this all can be financed.
The EDDIC project can either be implemented within the university world
or by a purely commercial company or coalition. In any case it is important
that the actors understand that this service must be realized as soon as
possible, and are willing to cover the costs as they appear. The system
is more or less ready to be implemented, based on the specifications and
references contained in this report. Possible financial supporters may
be national research councils, funds reserved for major university projects,
or one of the main actors in the Internet/search engine world who wants
to be seen as innovative.
Independently of who the investor behind the system is, we have several
options when it comes to finding ways to return the investor’s money, and more.
First, the traditional solutions, which are easy to implement:
- Advertisements: We are in a very good position for attracting
advertisers to our search page. This is partially because the page obviously
will attract a large number of people, but mainly because, thanks to our
context search possibilities, we can channel advertisements towards exactly
the kind of customers each particular advertiser wants to reach. Examples:
If someone seems to be looking for information about new cars, give them
the ad for the new Volvo. If a search seems to regard horse riding in South
America, advertisements from the largest Chilean chain of riding holiday
resorts may very well be called for.
- Subscription service: Special high-quality indexes for certain
groups of users can be established, where higher speed and more personalized
selection and ranking algorithms are available. Access to this part of
the EDDIC system can be sold on a subscription basis. We can also keep
track of what domains our system receives especially many search requests
from, and ask for donations as a kind of subscription from major commercial
domains that have many employees using the system.
With the arrival of e-cash, we can start charging for the use of our service
even in very small units of money:
- Pay for indexing: A special subset of the index or even a
separate index may be established for commercial services that want to
have their information classified instantly. While submitting personal
Web pages for classification and indexing should be free, we may collect
a small fee from commercial companies that want to have particular pages
classified in our index. This “pay for indexing” service can be introduced
if the EDDIC system turns out to be a popular tool for people who want
to buy things.
- Pay per use – user: As an alternative to subscribing to faster
and higher-quality information, we can charge all or certain groups of
users a small fee for each search they perform using a special access-restricted,
improved part of our system. Instead of paying per search, payment
can also be calculated from how many of the links suggested by the webrarians
the user actually follows.
- Pay per use – provider: For commercial pages that offer certain
products and services for on-line purchase, we can charge the page owners
a small fee for each user that follows a link from our search tool to their
pages. These pages will have a special page class code, so that users know
these are Web pages that pay for being found through our index. Introducing
this mechanism must be done very carefully, as it borders on charging for
leading unknowing users only to the commercial actors with large advertising
budgets.
Selling access to the whole search tool, or just our index, to other search
tools and Web sites may also be a way to finance the service.
8.5 Dissecting the Monster
To build a functional, robust and scalable system like the one suggested
in this report, a number of subdisciplines from several research areas
must be brought together. Presented here is an outline of what problem
areas our system construction task consists of, based on a similar, more
general analysis for electronic commerce / digital library systems in [Adam
& Yesha, 1996].
Area 1: Acquiring and storing information.
- Web spider technology must be combined with agent technology and classification
algorithms.
- The suggested extensions to the Dewey Decimal Classification (DDC) must
be precisely defined.
- How to classify and represent non-textual document objects must be decided.
- The retrieval and classification system must be multilingual. We must therefore
create translation mechanisms and/or a multilingual index.
- Database techniques for providing efficient management of the large volumes
of information we will handle must be developed.
- Feature extraction techniques for recognizing specific classes of Web pages
must be developed, based on text analysis, pattern recognition in keywords,
URLs, file names, link structure, etc.
- Identical information from several Web pages must be identified and collected,
and not listed as several different findings (see the sketch after this list).
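One simple way to meet the last requirement is to fingerprint a normalized version of each page's text and group URLs that share a fingerprint, as in the sketch below. This is only one possible technique, stated here as an assumption, not as the chosen EDDIC mechanism.

```python
# Sketch: detecting identical information served from several URLs.
import hashlib
from collections import defaultdict

def fingerprint(text):
    """Normalize case and whitespace before hashing, so trivial formatting
    differences do not hide a duplicate."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def group_duplicates(pages):
    """pages: iterable of (url, text) pairs; returns fingerprint -> URLs
    for every group of pages carrying the same information."""
    groups = defaultdict(list)
    for url, text in pages:
        groups[fingerprint(text)].append(url)
    return {fp: urls for fp, urls in groups.items() if len(urls) > 1}
```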
Area 2: Finding and filtering information.
- Design the system to describe Web pages by their context and capabilities
rather than by more random qualities such as the name of the Web page owner
or the exact words used to describe the service.
- Develop mechanisms to perform content- and context-based search through
our index.
- Since the Web cannot be censored, it must be possible to censor what kind
of information our search and filtering webrarians show to the user.
- Provide users at different levels of computer expertise with a variety
of efficient ways to search for information on Web pages.
- Create graphical representations of the search index, using maps and three-dimensional
data to give the users new, intuitive ways to navigate in the data.
- Take actions to ensure that users can search for information using their
own language and terms from domains they are familiar with, their ontologies.
This is done by mapping sets of keywords from different domains and
languages to the DDC codes where they belong (see the sketch after this list).
- For commercial pages, design a code system following the EDDIC guidelines
that allows buyers to locate products and services with specific characteristics.
- Webrarians, the intelligent agents in the EDDIC system, must be equipped
with a certain learning ability, so that their performance may improve
over time.
- A mechanism for creating a personal user profile to improve the webrarians’
search assistance for each user must be developed, without intruding on
each particular user’s privacy.
- An agent communication language for exchanging information between agents
within and between agencies and indexes, based on the Knowledge Query and
Manipulation Language (KQML) and the Knowledge Interchange Format (KIF),
must be implemented.
- Techniques from data mining can be used to extract patterns in the users’
interests, and give the webrarians further help in deciding what pages
may interest a particular user.
- The webrarians must offer both query refinement and query expansion abilities,
so that they can guide users both to smaller and larger sets of findings,
depending on what the situation demands.
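The ontology requirement above, letting users search with terms from their own language and domain, can be illustrated with a simple lookup table from (language, term) pairs to Dewey codes. The vocabulary shown is a tiny illustrative sample; building the real mapping is part of the manual classification work described earlier in this chapter.

```python
# Hypothetical sketch: mapping a user's own terms to the Dewey codes they belong to.
KEYWORD_TO_DDC = {
    ("en", "jazz"):       "781.65",
    ("no", "jazz"):       "781.65",
    ("en", "astronomy"):  "520",
    ("no", "astronomi"):  "520",
}

def terms_to_codes(terms, language):
    """Translate query terms into candidate Dewey codes for the webrarian."""
    codes = set()
    for term in terms:
        code = KEYWORD_TO_DDC.get((language, term.lower()))
        if code:
            codes.add(code)
    return codes

print(terms_to_codes(["Jazz", "astronomi"], "no"))   # -> {'781.65', '520'}
```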
Area 3: Securing information and auditing access.
- Mechanisms to allow charging fees for the use of the EDDIC system without
divulging the identity of the user are needed.
- Mechanisms to register how many times each link is followed, and from where,
must be implemented, to be able to introduce a billing system for advertisers
on EDDIC pages.
- Webrarians must respect that not all Web sites want their pages to be indexed
(see the sketch after this list).
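The last point can be respected with the robots exclusion convention: before a webrarian fetches a page, it checks the site's robots.txt file. The user-agent string below is an illustrative assumption.

```python
# Sketch: letting retrieval webrarians respect sites that refuse indexing.
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse, urljoin

def may_fetch(url, user_agent="eddic-webrarian"):
    parts = urlparse(url)
    robots_url = urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt")
    parser = RobotFileParser(robots_url)
    try:
        parser.read()
    except OSError:
        return True              # no readable robots.txt: assume fetching is allowed
    return parser.can_fetch(user_agent, url)

if may_fetch("http://www.example.org/some/page.html"):
    pass  # the webrarian may go ahead and retrieve the page
```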
Area 4: Universal access.
- Access to the system should be possible with any widely used Web browser,
with no plug-ins required. This means the system must be available using the
standard HTTP protocol and with pages formatted in standard HTML.
- Good user interfaces must be developed. A highly interactive user interface
with hit lists efficiently created on the fly is necessary.
- The visual design of the search assistants and the user interface is important.
The appearance of the webrarians must not indicate that they are more “intelligent”
than they actually are, and it must be clear at all times how the webrarians
come up with their suggestions.
- Future (soon present) technology, such as XML, must be taken advantage
of to offer a more personalized user interface to the search system.
Area 5: Cost management and financial instruments.
- Ways to cover the costs of running the EDDIC search service must be found.
This includes developing a number of usage-time- and per-request-based
cost models, and identifying groups of customers that are willing
to pay for a faster and advertisement-free service.
- Electronic cash billing abilities must be part of the EDDIC system.
Area 6: Socioeconomic impact.
- A search system that also is able to guide a user to Web pages where what
the user requests is for sale can be a very important factor in the transition
into a global, electronic economy. We must ensure that this kind of user
guidance is used for the good of society and not just to support the
largest actors on the commercial scene.
- The system can and should be used to strengthen the position of languages
spoken by minorities, to prevent the Web from turning the world into a
purely English-speaking one.
- Ways to adapt the EDDIC services to people with different cultural backgrounds
must be introduced.
As we see, numerous fields from computer science, library science, mathematics,
graphical design, social sciences, economics and psychology must be combined
to successfully implement the EDDIC system. A broad cross-disciplinary
effort is required.