Creating a new kind of search tool is not a simple task. Most search engines are very similar to each other. The user interface is usually a field where the user can enter keywords, in some cases accompanied by additional fields for refining the search. Using this input, the search engine ranks the hits and presents them to the user. If the search tool is a directory, the user can either navigate through the hierarchy or do a text search in all or parts of the hierarchy, and the hits are listed as they are found in the hierarchy. None of these search tools offers much direct support in the search process, apart from supplying the search index/hierarchy. We must do better.
If we let a denote the relevant retrieved elements from the total information space, b the nonrelevant retrieved elements, and c the relevant elements not retrieved, recall is defined as a/(a+c) and precision as a/(a+b).
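As a minimal illustration, the two measures can be computed directly from these definitions; the counts below are invented for the example:

```python
def recall(a, c):
    """Fraction of all relevant elements that are retrieved: a/(a+c)."""
    return a / (a + c)

def precision(a, b):
    """Fraction of retrieved elements that are relevant: a/(a+b)."""
    return a / (a + b)

# A broad query: most of the relevant elements are found, but they
# drown in nonrelevant hits (invented counts).
print(recall(a=90, c=10))      # 0.9  -- high recall
print(precision(a=90, b=910))  # 0.09 -- low precision
```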
In Figure 7-1 a broad search query has resulted in high recall, meaning that a lot of the relevant information available (the top “triangle” of the square) is returned, but low precision, meaning that a lot of nonrelevant information is also returned. In Figure 7-2, the result of a narrow search query, precision is high, with only a little nonrelevant information returned, but recall is low, with most of the relevant information not returned by the search.
In general we may say that search engines, depending on how they are used, suffer from both of these problems, while directories mainly have the “high precision, low recall” problem, depending on how much of the available information on a topic can actually be found in the directory. We want to create a search tool that refines the user’s search, making the search definition less rigid without becoming user-unfriendly, until as much of the relevant information as possible, and very little nonrelevant information, is returned, as shown in Figure 7-3. In addition, the returned information should be presented ranked in an order decided by the user, instead of in an order the search tool assumes is interesting to the user.
These ways of classifying search behavior can be laid out in a two-dimensional grid, where our users need different kinds of support in each of the four main ways of searching:
Navigate, novice users: The hierarchy must be easy to browse, with good explanations on how to move in it and on what information the different parts of the hierarchy contain. The screen must not contain so much information that the user is overwhelmed.

Navigate, expert users: Less information on how to use the system is required, so more actual hierarchy information can be on-screen at any time. Classification codes can be used to a larger extent, instead of explanations.

Query, novice users: The user must be guided through the various options for refining the search in all the different ways offered by the system. It must be ensured that the hits are ranked the way the user actually wants them to be ranked, and explanations on what criteria each hit matched must be available.

Query, expert users: Instead of guiding the user from step to step, queries can be done through forms, where the expert user uses the fields necessary to perform the search. Boolean expressions can be used to formulate the queries.
It must be easy to see, at any time, what area of the hierarchy the user “is in”. In addition to showing what can be found in that area, it is necessary to show what can be found by moving one level up or down in the hierarchy.
While navigating in the hierarchy, the users must be able to toggle between seeing only the topic headings and seeing the actual entries indexed under the various topics. When the user has found a topic of interest, it should at any time be possible to navigate among the Web pages along other dimensions, for example by page class, language, ratings or graphics use. The dimension to navigate along can be changed at any time, and if this is done systematically, it can be used to efficiently narrow down the range of pages to navigate between. The downside is that for document properties that are optional rather than required for the indexing, pages without a value for that particular property will not be found through hierarchy navigation.
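As a sketch of how such dimension-switching could work, assume each indexed page carries a set of optional property values (the field names and data below are invented):

```python
pages = [
    {"url": "http://example.org/a", "dewey": "025.04", "page_class": "article",
     "language": "en"},
    {"url": "http://example.org/b", "dewey": "025.04", "page_class": "homepage",
     "language": "no", "graphics": "heavy"},
]

def narrow(pages, dimension, value):
    """Keep only pages whose value for the chosen dimension matches.
    Pages lacking a value for an optional property drop out -- the
    downside noted above."""
    return [p for p in pages if p.get(dimension) == value]

# Narrowing systematically, one dimension at a time:
candidates = narrow(pages, "language", "en")
candidates = narrow(candidates, "page_class", "article")
```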
When the number of pages within the navigation space has been reduced to a relatively small number, an interesting option may be to retrieve the first few lines of text from these pages for the user to view, helping the user decide which page to go to. If a Web page has been given several Dewey classification and/or page class codes, it should be possible to move directly to the other area(s) of the hierarchy where the page can also be found indexed.
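A text preview of this kind could be produced by a very simple retrieval step. The sketch below assumes plain HTML pages and uses only naive tag stripping; a real retrieval agent would need to be more robust:

```python
import re
import urllib.request

def preview(url, max_chars=200):
    """Fetch a page and return the first few lines of visible text."""
    with urllib.request.urlopen(url, timeout=5) as response:
        html = response.read().decode("utf-8", errors="replace")
    # Drop embedded scripts/styles, strip remaining tags, collapse whitespace.
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html,
                  flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text[:max_chars]
```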
The differences between navigating using the EDDIC system and a traditional directory like Yahoo! are the same for both novice and expert navigation:
In [Shneiderman, 1997] a four-phase framework for coordinating design of general search/query processes is mentioned. It aims to reduce user frustration and confusion while supporting the search process. Adapted to our use, this is what it looks like:
1. Formulation, expressing the search
2. Action, launching the search
3. Review, viewing and interpreting the results
4. Refinement, reformulating the next step of the search
The search support we introduce is the webrarian, an interface agent capable of getting to “know” the users. The webrarian will communicate with the user, guiding the user to pages that match the user’s needs. Since we have a systematically built index equipped with a vocabulary, we want to use this to simplify the communication.
Although we introduce the webrarian as an assistant to the user, we do not want our agent to pretend to be human. To be even more specific, nor do we want it to appear intelligent. This means we must be very careful when visualizing the webrarian. The users should think of the webrarian as an assistant with a very limited ability to understand human behavior. The only thing the webrarian is capable of is guiding people to the right spot in the index. Similarly, the choice of words in the communication with the user should not let on that the agent is anything other than a mindless, “narrow-minded” webrarian.
The search process must be quick, so we do not have time for any heavy natural language processing. The user’s typed input should initially be limited to something very short. Throughout the rest of the search, the agent-human dialog should take place as questions from the agent, to which the user answers by picking the one of a few agent-generated responses that best matches the user’s desires. When the webrarian has enough information to narrow the search down to a certain number of matching Web pages, the list of hits should appear on the screen. Then the user will have to decide whether to start browsing the hits, or to continue answering questions from the webrarian to further refine the search. The “certain number” should be adjustable by the user, but by default be set to a relatively low number. This is to avoid the kind of situation that often arises with the search engines of today, where a user gives the search engine one or more keywords, and the search engine enthusiastically claims to have found hundreds of thousands of matching documents, overwhelming the user.
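The overall flow could look roughly like the sketch below, where `index.lookup`, `index.generate_question` and `satisfies` are hypothetical stand-ins for the actual index operations, and `ask` stands in for the user interface that presents the agent-generated alternatives:

```python
MAX_HITS = 25  # the adjustable "certain number" of matching pages

def search_dialog(index, initial_keyword, ask):
    """Narrow the search by questions until few enough pages match."""
    matches = index.lookup(initial_keyword)    # hypothetical index API
    history = [matches]                        # kept for going back one step
    while len(matches) > MAX_HITS:
        question, alternatives = index.generate_question(matches)
        answer = ask(question, alternatives)   # user picks one alternative
        if answer is None:                     # user chose to browse instead
            break
        matches = [p for p in matches if p.satisfies(answer)]
        history.append(matches)
    return matches, history
```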
A preliminary user interface is shown in Figure 7-4. Following [Shneiderman]’s framework, the “Formulation” phase is covered by the upper part of the window. The webrarian is depicted as some kind of expressionless agent asking questions. The system generates responses it is capable of handling and offers them to the user in the top left area of the window. The user initiates the search by supplying a keyword. The agent compares this keyword with its knowledge base, the EDDIC classification codes and the index entries, and generates questions to narrow down the search scope.
The second phase, “Action”, starts when the webrarian has narrowed the search down so much that the number of matching documents from the index is small enough that it is practical to start showing them. From then on, the Formulation and Action phases will both be active. Each time the user responds to a new question from the agent, the hit list is immediately updated. The user decides when to stop answering questions and start browsing the hits instead.
The “Review” phase is covered by the lower part of the window. Just below the upper area of the window there is a field explaining what the information listed below is based on. This is basically just a review of the answers the user has given to the webrarian’s questions. The main area of the window is the listing, in the figure depicted as lines of various lengths, of Web pages matching the user’s desires. The list has headers explaining what the listing fields are. This part of the window contains a scrollbar, since it will often contain more information than fits on one screen. By clicking on one of the header fields it should be possible to decide how the pages on the list are ranked, with the presumably most interesting pages moved to the top of the list. This kind of explanation of how the webrarian has decided to list exactly these pages makes it easier for the user to trust the system. According to [Lieberman, 1997], if the user perceives the agent’s actions as actions “that I could have done myself”, the user is more willing to conceptualize the agent in the role of an assistant.
At the bottom of the page there is information about the search, telling how many documents presently match the user input, as well as how many documents matched before the most recent refining question was answered. There is also a button to press to go back one step in the search process. This, together with the questions generated by the agent, is how Shneiderman’s “Refinement” phase is covered by our system.
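Building on the history list from the dialog sketch above, the back button could be as simple as this (again only a sketch):

```python
def step_back(history):
    """Drop the most recent refinement and return the previous hit list."""
    if len(history) > 1:
        history.pop()
    return history[-1]

# The status line can then report both counts, for example:
# "132 documents match (2 417 before the last question)"
```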
A similar agent-human dialog can be used to find the right spot in the hierarchy for a specific topic the user wants to either find information about or add information to.
The main challenge in creating a search tool like this, in addition to building an index as already described, is to create a system for generating questions based on the hierarchic structure of the information and on the information itself. The questions asked must be formulated in a way that makes it easy to see the difference between the alternatives.
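One conceivable approach, sketched below with an invented data model, is to let the current candidate pages vote for the subtopics they are indexed under, and present those subtopics as the alternatives:

```python
from collections import Counter

def generate_question(candidates, topic_of_page):
    """Propose the subtopics that best split the remaining candidates.

    topic_of_page maps a page to the subtopic it is indexed under;
    returns a question and (subtopic, page count) alternatives."""
    counts = Counter(topic_of_page(page) for page in candidates)
    question = "Which of these topics best matches what you are looking for?"
    return question, counts.most_common()
```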
For both the agent-supported and the form-based search, clicking on
a Dewey classification code in the listings should bring up the Web pages
that are indexed with that particular code. If very specific (extended)
Dewey classification codes are chosen, this will be a very efficient way
to find information closely related to the subject the user is interested
in.
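Treating extended codes as prefixes makes this lookup straightforward; the index structure below is an invented stand-in:

```python
dewey_index = {
    "004":    ["http://example.org/computing"],
    "025.04": ["http://example.org/ir-systems"],
}

def pages_for_code(code):
    """Return all pages indexed with the given code or a more specific one."""
    return [url
            for indexed_code, urls in dewey_index.items()
            if indexed_code.startswith(code)
            for url in urls]
```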
The minimum information needed as input to the system is the URL of the Web page someone wants to have classified and indexed. Given this URL, we can check whether the page can already be found in the index. If it cannot, we can instruct a retrieval agent to go to the page and collect information about it as normal. When this information reaches the classification and indexing agency, it should be directed to an “express line”, so that it can be processed and included in the index quickly.
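In outline, the submission check and the express line could look like this sketch, with invented stand-ins for the index and the retrieval agent queue:

```python
def submit_url(url, index, express_queue):
    """Accept a URL for indexing unless it is already in the index."""
    if index.contains(url):     # hypothetical index API
        return "already indexed"
    express_queue.put(url)      # a retrieval agent will fetch the page
    return "queued for express classification and indexing"
```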
When creating a Web page that allows submitting Web pages to the index, we should take advantage of the opportunity to collect meta information. This can be done by designing a good “Submit Web page” on-line form, as shown in Figure 7-5, where the user that submits the page can include meta information. The user can use any keywords he or she likes, while the other fields must be compatible with the EDDIC code system. Some of the fields can be multiple choice or menu-like.
If the user fills in all the fields correctly, the classification of the Web page can be done quickly. The user can find the correct codes by reading explanations and navigating to the correct place in the Dewey hierarchy as described earlier in Chapter 7. The information from the form should be stored with the URL as the key, later to be joined with the information gathered by the agents. Using this, the human classification personnel can quickly check that the page actually is what the submitter claims it to be, and put the page in the correct spot in the hierarchy with the best keywords.
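A sketch of such storage, keyed by the URL as described; the field names are invented:

```python
submissions = {}

def store_submission(url, dewey_code=None, page_class=None,
                     language=None, keywords=()):
    """Keep the submitter's meta information until the agents' data arrives."""
    submissions[url] = {
        "dewey_code": dewey_code,    # must be a valid (extended) Dewey code
        "page_class": page_class,    # must match the EDDIC code system
        "language": language,
        "keywords": list(keywords),  # free-form, chosen by the submitter
    }

store_submission("http://example.org/paper", dewey_code="025.04",
                 language="en", keywords=["information retrieval"])
```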
The user may also fill in only the URL field and press the Submit button. In this case the classification may require a bit more human work, and may take a bit longer to perform. This will encourage people to supply as much meta information as they possibly can about their page when submitting it for indexing.
If we want our agents to vary the way they act in accordance with the varying needs of the users, we need a way for the agents to identify the different users. When the agent knows which user, or at least which kind of user, it is to assist, it can retrieve the correct set of rules for behavior and follow these rules in its work. However, for the vast majority of users an Internet search tool is something they want to pop up on their screen ready to be used without any further ado. This means we cannot base our user identification on any username + password login. It is also a matter of privacy to be able to search for information without having to identify yourself.
This means we have two choices: a profile model, where the system builds and stores a profile of each individual user, or a role model, where the user picks one of a set of predefined roles that the agent has behavior rules for.
Within this profile we need to store information about what Dewey codes and what kinds of Web pages the person repeatedly searches in and for, what keywords most often occur on the pages the person actually follows the links to, what languages are preferred and so on. This information can, when it becomes comprehensive enough, be used to guide the agent in its work, ranking and sorting the hits, so that fewer questions need to be asked of the user.
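A profile along these lines could be stored as something as simple as the following sketch (the structure and values are invented):

```python
profile = {
    "frequent_dewey_codes": {"025.04": 17, "004": 9},   # code -> search count
    "frequent_page_classes": {"article": 12},
    "followed_keywords": {"agents": 8, "indexing": 5},  # from followed links
    "preferred_languages": ["en", "no"],
}
```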
Both documents and queries can, in our system, be represented as weighted vectors, where each keyword or field code corresponds to one position in the vector. Documents from the search scope part of the index are ranked according to a normalized inner product between the query vector and the documents’ vectors. The user profile can be used to give certain keywords and field codes extra weight. Not all the query parameters need to be present in the document in order for the document to be retrieved. Likewise, a document that has many instances of a keyword or field with heavy weighting might be ranked higher than a document with few instances of several of the other query parameters. This is based on theory found in [Salton, 1989] and Chapter 5.2.
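A minimal sketch of this ranking, with the profile boost as an invented example:

```python
import math

def cosine(query, doc):
    """Normalized inner product of two sparse weight vectors (dicts)."""
    dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
    norm = (math.sqrt(sum(w * w for w in query.values()))
            * math.sqrt(sum(w * w for w in doc.values())))
    return dot / norm if norm else 0.0

query = {"agents": 1.0, "indexing": 1.0}
query["agents"] *= 2.0   # extra weight suggested by the user profile

docs = {
    "http://example.org/a": {"agents": 3.0, "retrieval": 1.0},
    "http://example.org/b": {"indexing": 1.0, "agents": 0.5, "www": 1.0},
}
# Documents lacking some query terms are still retrieved, just ranked lower.
ranked = sorted(docs, key=lambda url: cosine(query, docs[url]), reverse=True)
```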
Examples of predefined roles, each with its own set of behavior rules, could be the following (a sketch of such rules is given after the list):

- Researcher: Someone who is using the Web to find research papers, calls for papers, technical journals, reports about university activities and so on. The rules give high priority to pages from educational and research institutions, commercial sites with the word “research” somewhere in the URL, and pages with keywords like “research”, “CFP”, “conference”, “university”, “ph.d”, “paper”, “report” and so on.
- News reader: Someone who is looking for the latest news. The rules give priority to pages that have very recently been indexed, pages with the “Breaking news” page class code, pages from news sites and similar. The agent may also look at the user’s address and concentrate on providing news from the same part of the world as the user, since most available news is not of global interest.
- Child: Someone young who is looking for pages for children. The agents will only list pages that are classified as made for children, and definitely remove all pages containing adult material from the listings.
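As a sketch of what such behavior rules could look like in practice, here is an invented fragment for the Researcher role (the weights and matching heuristics are illustrative only):

```python
RESEARCHER_KEYWORDS = {"research", "cfp", "conference", "university",
                       "ph.d", "paper", "report"}

def researcher_boost(page):
    """Extra ranking weight for pages likely to interest a researcher."""
    boost = 0.0
    # Educational/research institutions, or "research" somewhere in the URL:
    if ".edu" in page["url"] or "research" in page["url"]:
        boost += 2.0
    boost += sum(1.0 for kw in page["keywords"]
                 if kw.lower() in RESEARCHER_KEYWORDS)
    return boost
```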
Whether the profile model or the role model is chosen, the result should be that the user feels more at home when using the search tool, since it seems as if the agent “knows” a (little) bit about the user. Some roles may give the user faster and in other ways higher-quality search access, and may require special agreements and/or a paid subscription.