In the introduction chapter, we defined our main goals to be: 1) To
find a way to efficiently gather and sort information about Web pages into
suitable contexts, and 2) To come up with a good user interface for flexible
searching inside a context, using our classified and indexed data. We will
now see if we have succeeded.
The problem of magnitude: The classification agency is meant to support human classification personnel, so that high-quality classification and indexing can be performed on a larger number of Web pages per time unit than is possible today. With the suggested Web page where page owners can submit their pages for indexing, we have also created a way to get new pages added to the index faster than retrieval agents could have done by exploring the Web more or less at random.
The problem of classification: In addition to traditional text analysis and keyword classification techniques, we suggest using URLs and the Web’s hyperlink structure to gather valuable classification clues. Because a hyperlink between two Web pages is most often there because the page linked to has something to do with the subject and/or context of the page containing the link, we suggest using the link as a context indicator. Similarities between URLs can also be used to identify the context of a Web page, since pages from the same site often share certain document properties. Both of these novel techniques can to a large degree be automated and performed by agents, which can lead to high-quality classification requiring a minimum of human effort.
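To illustrate how such clues might be combined, the sketch below gathers candidate context codes for a page from its own URL and from already-classified pages that link to it. The pattern table, the simple voting scheme and all URLs and codes are hypothetical examples, not the agency's actual implementation.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical table: URL path fragments that earlier, human-verified
# classifications have associated with a context code (Dewey-style).
KNOWN_URL_PATTERNS = {
    "/music/": "780",      # Music
    "/sports/": "796",     # Athletic and outdoor sports
    "/politics/": "320",   # Political science
}

def context_clues(page_url, linking_pages):
    """Collect candidate classification codes for one page.

    linking_pages is a list of (url, code) pairs for already-classified
    pages containing a link to page_url. Each inbound link and each
    matching URL pattern casts one vote for a context code; the ranked
    result is a suggestion for the human classifier, not a decision.
    """
    votes = Counter()

    # Clue 1: hints hidden in the page's own URL.
    path = urlparse(page_url).path.lower()
    for pattern, code in KNOWN_URL_PATTERNS.items():
        if pattern in path:
            votes[code] += 1

    # Clue 2: the classification of the pages that link to this page.
    for _link_url, code in linking_pages:
        votes[code] += 1

    return votes.most_common()

print(context_clues(
    "http://www.example.com/music/abba/history.html",
    [("http://fansite.example.org/seventies.html", "780"),
     ("http://links.example.net/pop.html", "780")],
))   # [('780', 3)]
```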
The problem of instability: We have suggested an automated periodical check of the status of links. Because of the nature of the Internet, an expired link cannot be detected immediately and automatically, but repeated periodical checks will reveal permanently dead links. Discovering duplicates of a page is not too difficult to implement, since similar pages will be classified and indexed in the same, specific subject area of the index hierarchy. By comparing the Title field of pages carrying similar classification codes, duplicates can be found and listed, so that users can pick the copy closest to them and presumably save download time, or try an alternative link if the first one appears to be dead.
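The sketch below illustrates both mechanisms under simple assumptions: an index entry is a plain dictionary with url, title, code and failure-count fields, and a link is flagged as dead only after a fixed number of consecutive failed checks. The field names and the threshold are illustrative, not part of the actual design.

```python
import urllib.request
from collections import defaultdict

FAILURES_BEFORE_DEAD = 3   # assumed threshold for "permanently dead"

def check_link(entry):
    """One periodical status check of a single index entry.

    A single failed request does not prove that a page is gone, so the
    entry is only flagged as dead after several consecutive failures.
    """
    try:
        with urllib.request.urlopen(entry["url"], timeout=10) as response:
            if response.status < 400:
                entry["failures"] = 0
                entry["dead"] = False
                return entry
    except OSError:
        pass
    entry["failures"] = entry.get("failures", 0) + 1
    entry["dead"] = entry["failures"] >= FAILURES_BEFORE_DEAD
    return entry

def find_duplicates(entries):
    """Group entries that share classification code and title.

    Because duplicates end up in the same subject area of the hierarchy,
    only pages carrying the same code ever need to be compared.
    """
    groups = defaultdict(list)
    for entry in entries:
        key = (entry["code"], entry["title"].strip().lower())
        groups[key].append(entry["url"])
    return {key: urls for key, urls in groups.items() if len(urls) > 1}

print(find_duplicates([
    {"url": "http://a.example.com/abba.html", "title": "ABBA lyrics", "code": "782"},
    {"url": "http://b.example.org/abba.html", "title": "ABBA Lyrics ", "code": "782"},
]))   # both URLs listed under the same (code, title) key
```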
The problem of user-friendliness: Users who are comfortable with existing search tools will be able to use the same kind of user interface in our search tool if they like. However, we also suggest an agent-supported search interface that helps even inexperienced users perform advanced searches. The possibility of contextual searching also adds user-friendliness by minimizing the chance of finding completely irrelevant information through our search tool. Finally, users do not have to use very specific words to find what they want: the suggested code format has one vocabulary for agents and one for human users, and the human vocabulary can contain words from different languages and ontologies, all leading to the same concept covered by a certain classification code.
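As a small illustration of the two-vocabulary idea, the sketch below maps human terms from different languages and ontologies to a single classification code that only the agents work with; the terms, codes and table are made-up examples.

```python
# Hypothetical fragment of the human vocabulary: words from different
# languages and ontologies that all lead to the same classification code.
# The agents never see these words, only the numeric code.
TERM_TO_CODE = {
    "music": "780",
    "musikk": "780",       # Norwegian
    "musik": "780",        # Swedish / German
    "football": "796.33",
    "fotball": "796.33",   # Norwegian
    "soccer": "796.33",
}

def resolve_terms(user_terms):
    """Translate the words a user is comfortable with into the numeric
    codes that the agents sort and search by."""
    return sorted({TERM_TO_CODE[t.lower()] for t in user_terms
                   if t.lower() in TERM_TO_CODE})

print(resolve_terms(["Musikk", "soccer"]))   # ['780', '796.33']
```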
The problem of abuse: Because of the suggested human quality check of the classification performed by agents, it is not possible to fool the system: Web pages linked to from one spot in the subject hierarchy will contain information related to that subject. The retrieval and classification agents can also be taught to ignore Web pages and submissions from Web sites known to be “abusive”, for example by listing a popular search term hundreds of times in small, invisible text at the bottom of their pages.
The problem of monopoly: Thanks to the compact code format we suggest, it will be possible to store search data about all Web pages in the world, now and in the future, using much less storage space than search engines that store Web pages in full. This means that, unlike the search tools available today, we do not need to introduce rules for, say, how many pages per site we will store. With a minimum of human effort everything can be indexed, so that all Web pages have the same chance of being found.
Our index is in theory capable of holding relatively up-to-date information about all pages on the World Wide Web at all times. For storage efficiency reasons, the information held in our index cannot be the actual, raw Web pages themselves, as in most search engines. Instead we store metadata, that is, information describing the location, contents and properties of the Web pages, much like Web directories do. In this way we combine the advantages of both search engines and directories, and add a few advantages on top of that. This is how some main aspects of search tools are improved by our system:
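A minimal sketch of what such a metadata entry could look like is given below; the field names and the property set are assumptions made for illustration, not the final record layout.

```python
from dataclasses import dataclass, field

@dataclass
class PageEntry:
    """Illustrative compact index entry: only metadata, never the raw page."""
    url: str                  # location of the page
    title: str                # short human-readable description
    code: str                 # classification code placing the page in a context
    properties: dict = field(default_factory=dict)   # e.g. language, media type
    keywords: tuple = ()      # only the few most important keywords

entry = PageEntry(
    url="http://www.example.com/music/abba/history.html",
    title="A short history of ABBA",
    code="780",
    properties={"language": "en", "last_checked": "1999-05-01"},
    keywords=("ABBA", "pop history"),
)
print(entry.code, entry.keywords)
```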
An important issue regarding how many of the available pages can be indexed is that, to give everyone an equal opportunity to present their information, all pages must be indexed. This is one of the prerequisites for fair competition in electronic commerce.
Indexing metadata in the amounts we aim for has not been possible using traditional methods, because of the computational cost of analyzing text and placing it in a context with a high probability of success. By using agents to preformat codes, and by taking advantage of the information hidden in the URL of the page, in the URLs of the pages that link to it, as well as in the pages it links further on to, we can classify a large number of pages with a minimum of human effort.
Quality: The information we store about a page is a certain classification code, based on international standards that have been used in libraries all over the world for many decades and are very well documented. Using these standards minimizes the possibility of confusion about which topic belongs to which code. Current directories may also have good hierarchies of topics, but they are all proprietary and probably not as thoroughly considered as the standards we have chosen.
In addition to the topic codes, we also store codes for other properties of the Web pages. These properties can be used to refine searches and filter Web pages in ways no current search engine or Web directory can. Using the codes wisely, we can allow users to search by context. This means that although we cannot let users search for the text strings “ABBA” and “Money, money, money” to find the lyrics of a certain pop song, we can let them go to the place in the topic hierarchy called “Western popular music from the 1970s in English” and let them search for the keyword “ABBA”. Which method is faster is hard to tell, but using our index, the user is guaranteed to find only music-related Web pages, and no economy-related pages or other kinds of pages with the word “money” repeated hundreds of times to attract attention.
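The sketch below illustrates contextual searching against a miniature, made-up index: the keyword search is restricted to entries whose classification code falls under the chosen context, so a finance page mentioning “money” can never show up under the popular-music context. All codes and URLs are illustrative only.

```python
# Miniature example index: each entry carries a classification code and
# the few keywords stored for the page.
INDEX = [
    {"url": "http://lyrics.example.com/abba.html",
     "code": "782.42164", "keywords": {"abba", "lyrics"}},
    {"url": "http://bank.example.com/money.html",
     "code": "332", "keywords": {"money", "interest"}},
    {"url": "http://fansite.example.org/abba-history.html",
     "code": "782.42164", "keywords": {"abba", "history"}},
]

def search_in_context(context_code, keyword):
    """Keyword search restricted to one branch of the topic hierarchy.

    Only entries whose code starts with the chosen context code are
    considered, so pages from unrelated contexts never match, no matter
    how often they repeat a popular keyword.
    """
    keyword = keyword.lower()
    return [e["url"] for e in INDEX
            if e["code"].startswith(context_code) and keyword in e["keywords"]]

print(search_in_context("782.42", "abba"))
# ['http://lyrics.example.com/abba.html',
#  'http://fansite.example.org/abba-history.html']
```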
Another quality improvement is the ability of our index to quickly incorporate new information, such as news articles from the fields of politics, sports, the computer scene and so on. This means that anyone can get their Web pages classified and indexed while the news on their pages is still news, and that people who search using EDDIC can find pages with fresh information.
Guidance: Another novel main feature of our system is the search webrarian, an agent that assists the user in searching the index. The webrarian asks the user questions about what information the user is looking for, and finds Web pages that match the user’s search criteria. To save the computing time needed to interpret free-text responses, most of the human–webrarian dialog consists of the user choosing one of a few agent-generated answers to questions asked by the webrarian. The questions are in turn chosen so that a small set of pages accurately matching the user’s needs can be identified quickly.
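A much simplified sketch of this narrowing dialog is given below. It assumes that each candidate entry carries a handful of searchable properties and that the webrarian generates its answer options from the values actually present among the remaining candidates; the property names, the stopping threshold and the example questions are all invented for the illustration.

```python
def webrarian_dialog(candidates, questions, ask):
    """Narrow a set of candidate index entries through multiple-choice questions.

    candidates: list of dicts describing indexed pages.
    questions:  list of (property, prompt) pairs the webrarian may ask about.
    ask:        callback that shows the prompt and options and returns the
                user's choice, so no free-text interpretation is needed.
    """
    for prop, prompt in questions:
        # Generate answer options from the values actually present among
        # the remaining candidates, so every choice narrows the set.
        options = sorted({entry[prop] for entry in candidates})
        if len(options) < 2:
            continue                      # this question would not narrow anything
        choice = ask(prompt, options)
        candidates = [e for e in candidates if e[prop] == choice]
        if len(candidates) <= 1:          # assumed "small enough" stopping point
            break
    return candidates

# Example run with canned answers instead of an interactive user.
entries = [
    {"url": "a.html", "language": "en", "type": "lyrics"},
    {"url": "b.html", "language": "no", "type": "lyrics"},
    {"url": "c.html", "language": "en", "type": "review"},
]
answers = iter(["en", "lyrics"])
result = webrarian_dialog(
    entries,
    [("language", "Which language should the pages be in?"),
     ("type", "What kind of page are you looking for?")],
    ask=lambda prompt, options: next(answers),
)
print([e["url"] for e in result])   # ['a.html']
```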
An important point is that while the users see textual descriptions of the Web pages’ contents, the agents work with numbers only. This means that people can search the index using the terms they are used to, in their own language and their own ontology. These terms are mapped to numeric codes, ideal for computer sorting and manipulation. Neither the agents nor the humans have to compromise their language to adapt to the other.
The core idea in our indexing system is that instead of describing each
page by a large number of keywords connected to each other, we classify
the page and its contents by connecting it to the context that best describes
it, and store only a few of the most important keywords for the page in
the actual Web page entry. The context code itself is equipped with many
sets of keywords from a number of different ontologies, and these keywords
can be used to guide the user to interesting documents.
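The sketch below illustrates this division of labour with made-up data: the keyword sets from different languages and ontologies are attached to the context code, while each page entry carries only its code and a few keywords of its own.

```python
# Hypothetical context code carrying keyword sets from several ontologies.
CONTEXT_CODES = {
    "782.42164": {
        "en": {"popular music", "pop songs"},
        "no": {"populærmusikk", "popmusikk"},
        "library": {"songs (popular)"},
    },
}

# Each page entry stores only the code and a few important keywords.
PAGE_ENTRIES = [
    {"url": "http://lyrics.example.com/abba.html",
     "code": "782.42164", "keywords": {"abba"}},
]

def describe(code, ontology="en"):
    """Return the keywords describing a context in the vocabulary
    (language or ontology) the user prefers."""
    return CONTEXT_CODES.get(code, {}).get(ontology, set())

for entry in PAGE_ENTRIES:
    print(entry["url"], "->", sorted(describe(entry["code"], "no")))
```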
If research shows that hypertext links and/or similarities between URLs can in many cases be used as an indication of the type of page and of what kind of material the page contains, the next step is to identify the most important document properties for searching among and filtering Web pages. Suggestions for what these properties may be are listed earlier in this report, but they are not necessarily the final selection of properties to include in the index. When the properties are identified, a set of codes covering all conceivable Web page material must be constructed. An extension of the Dewey Decimal Classification covering modern technology- and electronic commerce-related subjects must probably also be created. At this stage, classification and database experts must be consulted.
In parallel with the creation of the classification codes, work on implementing the software for the agent system can proceed, and contracts with the necessary partners can be written. A strategy for the introduction of the new EDDIC tool must also be chosen. Lately, most search tool Web sites have not concentrated on finding new and better ways to search the Web, but have instead added features like stock tickers, translation of Web pages, links to on-line bookstores, horoscopes, electronic versions of yellow pages, thesauruses and dictionaries, news services and weather forecasts. In short, the focus has shifted from improving the search service towards providing entertainment and newspaper- or magazine-like material. Teaming up with a major search site may help the search tool industry start concentrating on how to improve Internet search again. A future goal should be, when technology permits it, to combine the classification system suggested in this report with an index where pages are indexed with their full textual contents; this would result in very flexible and powerful search possibilities. If we choose not to cooperate with any existing search site, we must decide what profile our own search page shall have.
It is very important that the work begins as soon as possible, as the Web is certainly not waiting for us to catch up. The world needs a better way to organize Web searching. There has not been much development among the search engines and directories lately, except that the indexes have grown a bit and a number of strategic partnerships and cross-promotion deals have been agreed upon. The user interface for searching the Web has basically not changed since search tools were introduced. Something needs to be done, and this report has suggested what to do.