In the introduction chapter, we defined our main goals to be: 1) To
find a way to efficiently gather and sort information about Web pages into
suitable contexts, and 2) To come up with a good user interface for flexible
searching inside a context, using our classified and indexed data. We will
now see if we have succeeded.
The problem of magnitude: The classification agency is meant to support human classification personnel, so that high-quality classification and indexing can be performed on a larger number of Web pages per time unit than is possible today. With the suggested Web page where page owners can submit their pages for indexing, we have also created a way to get new pages added to the index faster than retrieval agents could have done by exploring the Web more or less at random.
The problem of classification: In addition to traditional text analysis and keyword classification techniques, we suggest using URLs and the Web’s hyperlink structure to gather valuable classification clues. Because a hyperlink between two Web pages is most often there because the page linked to has something to do with the subject and/or context of the page containing the link, we suggest using the link as a context indicator. Similarities between URLs can also be used to identify the context of a Web page, since pages from the same site often share certain document properties. Both of these novel techniques can to a large degree be automated and performed by agents, which can lead to high-quality classification requiring a minimum of human effort.
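To illustrate how such clues might be combined, the sketch below gathers candidate context codes for a page from its own URL and from already-classified pages that link to it. The pattern table, the simple voting scheme and all URLs and codes are hypothetical examples, not the agency's actual implementation.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical table: URL path fragments that earlier, human-verified
# classifications have associated with a context code (Dewey-style).
KNOWN_URL_PATTERNS = {
    "/music/": "780",      # Music
    "/sports/": "796",     # Athletic and outdoor sports
    "/politics/": "320",   # Political science
}

def context_clues(page_url, linking_pages):
    """Collect candidate classification codes for one page.

    linking_pages is a list of (url, code) pairs for already-classified
    pages containing a link to page_url. Each inbound link and each
    matching URL pattern casts one vote for a context code; the ranked
    result is a suggestion for the human classifier, not a decision.
    """
    votes = Counter()

    # Clue 1: hints hidden in the page's own URL.
    path = urlparse(page_url).path.lower()
    for pattern, code in KNOWN_URL_PATTERNS.items():
        if pattern in path:
            votes[code] += 1

    # Clue 2: the classification of the pages that link to this page.
    for _link_url, code in linking_pages:
        votes[code] += 1

    return votes.most_common()

print(context_clues(
    "http://www.example.com/music/abba/history.html",
    [("http://fansite.example.org/seventies.html", "780"),
     ("http://links.example.net/pop.html", "780")],
))   # [('780', 3)]
```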
The problem of instability: We have suggested an automated periodical check of the status of links. Because of the nature of the Internet, an expired link cannot be detected immediately and automatically, but repeated periodical checks will reveal permanently dead links. Discovering duplicates of a page is not too difficult to implement, since similar pages will be classified and indexed in the same, specific subject area of the index hierarchy. By comparing the Title field of pages carrying similar classification codes, duplicates can be found and listed, so that users can pick the copy closest to them and presumably save download time, or try an alternative link if the first one appears to be dead.
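The sketch below illustrates both mechanisms under simple assumptions: an index entry is a plain dictionary with url, title, code and failure-count fields, and a link is flagged as dead only after a fixed number of consecutive failed checks. The field names and the threshold are illustrative, not part of the actual design.

```python
import urllib.request
from collections import defaultdict

FAILURES_BEFORE_DEAD = 3   # assumed threshold for "permanently dead"

def check_link(entry):
    """One periodical status check of a single index entry.

    A single failed request does not prove that a page is gone, so the
    entry is only flagged as dead after several consecutive failures.
    """
    try:
        with urllib.request.urlopen(entry["url"], timeout=10) as response:
            if response.status < 400:
                entry["failures"] = 0
                entry["dead"] = False
                return entry
    except OSError:
        pass
    entry["failures"] = entry.get("failures", 0) + 1
    entry["dead"] = entry["failures"] >= FAILURES_BEFORE_DEAD
    return entry

def find_duplicates(entries):
    """Group entries that share classification code and title.

    Because duplicates end up in the same subject area of the hierarchy,
    only pages carrying the same code ever need to be compared.
    """
    groups = defaultdict(list)
    for entry in entries:
        key = (entry["code"], entry["title"].strip().lower())
        groups[key].append(entry["url"])
    return {key: urls for key, urls in groups.items() if len(urls) > 1}

print(find_duplicates([
    {"url": "http://a.example.com/abba.html", "title": "ABBA lyrics", "code": "782"},
    {"url": "http://b.example.org/abba.html", "title": "ABBA Lyrics ", "code": "782"},
]))   # both URLs listed under the same (code, title) key
```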
The problem of user-friendliness: Users who are comfortable with existing search tools will be able to use the same kind of user interface in our search tool if they like. However, we also suggest an agent-supported search interface that helps even inexperienced users perform advanced searches. The possibility of contextual searching also adds user-friendliness by minimizing the chance of finding completely irrelevant information through our search tool. Finally, users do not have to use very specific words to find what they want: the suggested code format has one vocabulary for agents and one for human users, and the human vocabulary can contain words from different languages and ontologies, all leading to the same concept covered by a certain classification code.
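As a small illustration of the two-vocabulary idea, the sketch below maps human terms from different languages and ontologies to a single classification code that only the agents work with; the terms, codes and table are made-up examples.

```python
# Hypothetical fragment of the human vocabulary: words from different
# languages and ontologies that all lead to the same classification code.
# The agents never see these words, only the numeric code.
TERM_TO_CODE = {
    "music": "780",
    "musikk": "780",       # Norwegian
    "musik": "780",        # Swedish / German
    "football": "796.33",
    "fotball": "796.33",   # Norwegian
    "soccer": "796.33",
}

def resolve_terms(user_terms):
    """Translate the words a user is comfortable with into the numeric
    codes that the agents sort and search by."""
    return sorted({TERM_TO_CODE[t.lower()] for t in user_terms
                   if t.lower() in TERM_TO_CODE})

print(resolve_terms(["Musikk", "soccer"]))   # ['780', '796.33']
```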
The problem of abuse: Because of the suggested human quality check of the classification performed by agents, it is not possible to fool the system: Web pages linked to from one spot in the subject hierarchy will contain information related to that subject. The retrieval and classification agents can also be taught to ignore Web pages and submissions from Web sites known to be “abusive”, for example by listing a popular search term hundreds of times in small, invisible text at the bottom of their pages.
The problem of monopoly: Thanks to the compact code format we suggest, it will be possible to store search data about all Web pages in the world, now and in the future, using much less storage space than search engines that store Web pages in full. This means that, unlike the search tools available today, we do not need to introduce rules for, say, how many pages per site we will store. With a minimum of human effort everything can be indexed, so that all Web pages have the same chance of being found.
Our index is in theory capable of holding relatively up-to-date information about all pages on the World Wide Web at all times. For storage efficiency reasons, the information held in our index cannot be the actual, raw Web pages themselves, as in most search engines. Instead we store metadata, that is, information describing the location, contents and properties of the Web pages, much like Web directories do. In this way we combine the advantages of both search engines and directories, and add a few advantages on top of that. This is how some main aspects of search tools are improved by our system:
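A minimal sketch of what such a metadata entry could look like is given below; the field names and the property set are assumptions made for illustration, not the final record layout.

```python
from dataclasses import dataclass, field

@dataclass
class PageEntry:
    """Illustrative compact index entry: only metadata, never the raw page."""
    url: str                  # location of the page
    title: str                # short human-readable description
    code: str                 # classification code placing the page in a context
    properties: dict = field(default_factory=dict)   # e.g. language, media type
    keywords: tuple = ()      # only the few most important keywords

entry = PageEntry(
    url="http://www.example.com/music/abba/history.html",
    title="A short history of ABBA",
    code="780",
    properties={"language": "en", "last_checked": "1999-05-01"},
    keywords=("ABBA", "pop history"),
)
print(entry.code, entry.keywords)
```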
An important issue regarding how many of the available pages can be indexed is that, to give everyone an equal opportunity to present their information, all pages must be indexed. This is one of the prerequisites for fair competition in electronic commerce.
Indexing metadata in the amounts we aim for has not been possible using traditional methods, because of the computational cost of analyzing text and placing it in a context with a high probability of success. By using agents to preformat codes, and by taking advantage of the information hidden in the URL of the page, in the URLs of the pages that link to it, as well as in the pages it links further on to, we can classify a large number of pages with a minimum of human effort.
Quality: The information we store about a page is a certain classification code, based on international standards that have been used in libraries all over the world for many decades and are very well documented. Using these standards minimizes the possibility of confusion about which topic belongs to which code. Current directories may also have good hierarchies of topics, but they are all proprietary and probably not as thoroughly considered as the standards we have chosen.
In addition to the topic codes, we also store codes for other properties of the Web pages. These properties can be used to refine searches and filter Web pages in ways no current search engine or Web directory can. Using the codes wisely, we can allow users to search by context. This means that although we cannot let users search for the text strings “ABBA” and “Money, money, money” to find the lyrics of a certain pop song, we can let them go to the place in the topic hierarchy called “Western popular music from the 1970s in English” and let them search for the keyword “ABBA”. Which method is faster is hard to tell, but using our index, the user is guaranteed to find only music-related Web pages, and no economy-related pages or other kinds of pages with the word “money” repeated hundreds of times to attract attention.
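The sketch below illustrates contextual searching against a miniature, made-up index: the keyword search is restricted to entries whose classification code falls under the chosen context, so a finance page mentioning “money” can never show up under the popular-music context. All codes and URLs are illustrative only.

```python
# Miniature example index: each entry carries a classification code and
# the few keywords stored for the page.
INDEX = [
    {"url": "http://lyrics.example.com/abba.html",
     "code": "782.42164", "keywords": {"abba", "lyrics"}},
    {"url": "http://bank.example.com/money.html",
     "code": "332", "keywords": {"money", "interest"}},
    {"url": "http://fansite.example.org/abba-history.html",
     "code": "782.42164", "keywords": {"abba", "history"}},
]

def search_in_context(context_code, keyword):
    """Keyword search restricted to one branch of the topic hierarchy.

    Only entries whose code starts with the chosen context code are
    considered, so pages from unrelated contexts never match, no matter
    how often they repeat a popular keyword.
    """
    keyword = keyword.lower()
    return [e["url"] for e in INDEX
            if e["code"].startswith(context_code) and keyword in e["keywords"]]

print(search_in_context("782.42", "abba"))
# ['http://lyrics.example.com/abba.html',
#  'http://fansite.example.org/abba-history.html']
```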
Another quality improvement is the ability of our index to quickly incorporate new information, such as news articles from the fields of politics, sports, the computer scene and so on. This means that anyone can get their Web pages classified and indexed while the news on their pages is still news, and that people who search using EDDIC can find pages with fresh information.
Guidance: Another novel main feature of our system is the search webrarian, an agent that assists the user in searching the index. The webrarian asks the user questions about what information the user is looking for, and finds Web pages that match the user’s search criteria. To save the computing time needed to interpret free-text responses, most of the human–webrarian dialog consists of the user choosing one of a few agent-generated answers to questions asked by the webrarian. The questions are in turn chosen so that a small set of pages accurately matching the user’s needs can be identified quickly.
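A much simplified sketch of this narrowing dialog is given below. It assumes that each candidate entry carries a handful of searchable properties and that the webrarian generates its answer options from the values actually present among the remaining candidates; the property names, the stopping threshold and the example questions are all invented for the illustration.

```python
def webrarian_dialog(candidates, questions, ask):
    """Narrow a set of candidate index entries through multiple-choice questions.

    candidates: list of dicts describing indexed pages.
    questions:  list of (property, prompt) pairs the webrarian may ask about.
    ask:        callback that shows the prompt and options and returns the
                user's choice, so no free-text interpretation is needed.
    """
    for prop, prompt in questions:
        # Generate answer options from the values actually present among
        # the remaining candidates, so every choice narrows the set.
        options = sorted({entry[prop] for entry in candidates})
        if len(options) < 2:
            continue                      # this question would not narrow anything
        choice = ask(prompt, options)
        candidates = [e for e in candidates if e[prop] == choice]
        if len(candidates) <= 1:          # assumed "small enough" stopping point
            break
    return candidates

# Example run with canned answers instead of an interactive user.
entries = [
    {"url": "a.html", "language": "en", "type": "lyrics"},
    {"url": "b.html", "language": "no", "type": "lyrics"},
    {"url": "c.html", "language": "en", "type": "review"},
]
answers = iter(["en", "lyrics"])
result = webrarian_dialog(
    entries,
    [("language", "Which language should the pages be in?"),
     ("type", "What kind of page are you looking for?")],
    ask=lambda prompt, options: next(answers),
)
print([e["url"] for e in result])   # ['a.html']
```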
An important point is that while the users see textual descriptions of the Web pages’ contents, the agents work with numbers only. This means that people can search the index using the terms they are used to, in their own language and their own ontology. These terms are mapped to numeric codes, ideal for computer sorting and manipulation. Neither the agents nor the humans have to compromise their language to adapt to the other.
The core idea in our indexing system is that instead of describing each
page by a large number of keywords connected to each other, we classify
the page and its contents by connecting it to the context that best describes
it, and store only a few of the most important keywords for the page in
the actual Web page entry. The context code itself is equipped with many
sets of keywords from a number of different ontologies, and these keywords
can be used to guide the user to interesting documents.
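The sketch below illustrates this division of labour with made-up data: the keyword sets from different languages and ontologies are attached to the context code, while each page entry carries only its code and a few keywords of its own.

```python
# Hypothetical context code carrying keyword sets from several ontologies.
CONTEXT_CODES = {
    "782.42164": {
        "en": {"popular music", "pop songs"},
        "no": {"populærmusikk", "popmusikk"},
        "library": {"songs (popular)"},
    },
}

# Each page entry stores only the code and a few important keywords.
PAGE_ENTRIES = [
    {"url": "http://lyrics.example.com/abba.html",
     "code": "782.42164", "keywords": {"abba"}},
]

def describe(code, ontology="en"):
    """Return the keywords describing a context in the vocabulary
    (language or ontology) the user prefers."""
    return CONTEXT_CODES.get(code, {}).get(ontology, set())

for entry in PAGE_ENTRIES:
    print(entry["url"], "->", sorted(describe(entry["code"], "no")))
```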
If research shows that hypertext links and/or similarities between URLs can in many cases be used as an indication of the type of page and of what kind of material the page contains, the next step is to identify the most important document properties for searching among and filtering Web pages. Suggestions for what these properties may be are listed earlier in this report, but they are not necessarily the final selection of properties to include in the index. When the properties are identified, a set of codes covering all conceivable Web page material must be constructed. An extension of the Dewey Decimal Classification covering modern technology- and electronic commerce-related subjects must probably also be created. At this stage, classification and database experts must be consulted.
In parallel with the creation of the classification codes, work on implementing the software for the agent system can proceed, and contracts with the necessary partners can be written. A strategy for the introduction of the new EDDIC tool must also be chosen. Lately, most search tool Web sites have not concentrated on finding new and better ways to search the Web, but have instead added features like stock tickers, translation of Web pages, links to on-line bookstores, horoscopes, electronic versions of yellow pages, thesauruses and dictionaries, news services and weather forecasts. In short, the focus has shifted from improving the search service towards providing entertainment and newspaper- or magazine-like material. Teaming up with a major search site may help the search tool industry start concentrating on how to improve Internet search again. A future goal should be, when technology permits it, to combine the classification system suggested in this report with an index where pages are indexed with their full textual contents; this would result in very flexible and powerful search possibilities. If we choose not to cooperate with any existing search site, we must decide what profile our own search page shall have.
It is very important that the work begins as soon as possible, as the Web is certainly not waiting for us to catch up. The world needs a better way to organize Web searching. There has not been much development among the search engines and directories lately, except that the indexes have grown a bit and a number of strategic partnerships and cross-promotion deals have been agreed upon. The user interface for searching the Web has basically not changed since search tools were introduced. Something needs to be done, and this report has suggested what to do.