The Letizia agent, as described in chapter 4.6, is only one of several examples of projects attempting to support the user in searching and browsing activities. This kind of support is important, but given how the search engines struggle to keep up with the growth of the Web, it is necessary to think more radically. We must give agents new tasks and make them help us build better search tools.
The most important function we need to add to the spiders and robots that create the search tool indexes today is the ability to decide what context each particular Web page occurs in. Our search agents must also be able to recognize other Web page properties, as mentioned in chapter 3. However, the context problem is the most important one to us, so we will begin by providing a solution to it.
We start by giving an overview of the classification part of the system we want to build. The aim is not to create the ultimate automatic text analyzer, but to come up with a system that can help us build bigger and better indexes with more searchable meta information than any of today's search tools offer. The guiding idea is to keep the classification process a manual one, but to support the human classification personnel with webrarians that can preprocess the indexing data to make the manual task easier and quicker. The last part of the chapter presents research that can be taken advantage of when the system is to be implemented.
The Classification & Indexing Agency (CIA) launches Web search and retrieval agents onto the Web. These agents do not have to be very smart, but they should follow the up-to-date rules for Web robots defined and agreed upon by the World Wide Web community [Robots, 1997]. This is necessary to optimize the work strategy and the number of robots on the Net, and to make sure that agents, robots, spiders, wanderers etc. do not overload the Web. These agents will in some cases be instructed to look for specific pages, while in other cases they will just “wander” from site to site and page to page, looking for new pages that have appeared on the Net and are ready to be indexed. When such a page is found, the agent supplies it with some extra information and sends it back to the CIA using standard network protocols.
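To illustrate how an agent can comply with these rules, the sketch below checks a site's robot exclusion file before a page is fetched. It is a minimal sketch using only the Python standard library; the agent name "webrarian-agent" is an invented placeholder.

    # Minimal sketch: check a site's robot exclusion rules before fetching a
    # page. The agent name "webrarian-agent" is illustrative only.
    from urllib import robotparser
    from urllib.parse import urlsplit, urlunsplit

    def may_fetch(url, agent_name="webrarian-agent"):
        """Return True if the target site's robots.txt allows this agent to
        retrieve the given URL."""
        parts = urlsplit(url)
        robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
        rules = robotparser.RobotFileParser()
        rules.set_url(robots_url)
        rules.read()                      # fetch and parse /robots.txt
        return rules.can_fetch(agent_name, url)

    # The agent only retrieves and forwards a page if may_fetch(url) is True.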
On arrival at the agency, the information about the Web page is split into:
A) URL information. Most important is the URL of the page itself, but information about the page or CGI script that led the agent to this page is also included, as are the URLs of any pages that this page links to.
B) The HTML-formatted contents of the Web page. This includes the text visible through a standard Web browser as well as meta information that is hidden from the normal user of the Web.
The procedure that takes place in module 1, which analyzes the URL information, can be described in this way (a small sketch of the URL comparison follows the list):
I. Locate URLs in the index that are similar to the URL of the page to be classified.
II. If the site (and directory) has so many pages indexed that it has been given a site contents code or a site contents code range, this code should be the initial “guesstimated” code for the page as well.
III. If there is no site contents code: Based on how similar the URL is to one or more indexed URLs (same site/directory/filename), gather probable keywords and indexing codes for the page.
IV. If there are no similar URLs: See if the page that led the agent to this page, or any of the pages this page links to, already exist in the index, and gather probable keywords and indexing codes for the page from these.
V. In some cases, the location of the document and what language the document is likely to be written in can be read from the domain-part of the URL. This is also information that should be noted by the system at this stage.
VI. Send the gathered information (the probable page contents code,
keywords and potentially other information that can be extracted from the
URL) to module 3, the indexing agents module.
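The URL comparison in steps I-IV can be sketched roughly as follows. The index is modelled as a plain dictionary from URL to an already assigned contents code and keywords; both the example URLs and the codes are invented for illustration.

    # Rough sketch of the module 1 URL comparison (steps I-IV): find indexed
    # pages on the same site, preferring those in the same directory, and
    # gather their codes and keywords as candidates for the new page.
    from urllib.parse import urlsplit

    def url_parts(url):
        parts = urlsplit(url)
        directory, _, filename = parts.path.rpartition("/")
        return parts.netloc.lower(), directory.lower(), filename.lower()

    def gather_candidates(new_url, index):
        """Return candidate entries from indexed URLs on the same site,
        with same-directory matches first."""
        site, directory, _ = url_parts(new_url)
        same_dir, same_site = [], []
        for known_url, entry in index.items():
            known_site, known_dir, _ = url_parts(known_url)
            if known_site != site:
                continue
            (same_dir if known_dir == directory else same_site).append(entry)
        return same_dir + same_site

    # Invented index entries, for illustration only:
    index = {
        "http://www.example.edu/physics/intro.html":
            {"code": "530", "keywords": ["physics", "mechanics"]},
        "http://www.example.edu/physics/optics.html":
            {"code": "535", "keywords": ["optics", "light"]},
    }
    print(gather_candidates("http://www.example.edu/physics/waves.html", index))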
The procedure that takes place in module 2, which analyzes the HTML-formatted contents, can be described in this way (a small sketch follows the list):
I. Scan the document header for meta information provided in any of a number of standardized ways and formats. This includes use of the <META> tag as defined in existing as well as in future classification standards.
II. Scan the document for occurrences of quoted keywords, such as the text string “Keywords:”, and treat the text string that follows as meta information describing the contents of the document.
III. Strip all HTML code except header codes, along with all in-line JavaScript and the like, so that all that remains of the page is essentially the text visible through a standard Web browser.
IV. Send the information gathered from <META> tags and the remaining
text to the automatic indexing agents in module 3.
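A minimal sketch of module 2, using only the Python standard library, might look like the following. It collects “keywords” and “description” from <META> tags and strips markup and in-line scripts, leaving roughly the text a browser would display.

    # Minimal sketch of module 2: extract <META> keywords/description and the
    # visible text of an HTML page, skipping <script> and <style> contents.
    from html.parser import HTMLParser

    class PageScanner(HTMLParser):
        def __init__(self):
            super().__init__()
            self.meta = {}      # e.g. {"keywords": "...", "description": "..."}
            self.text = []      # visible text fragments
            self._skip = 0      # depth inside <script>/<style>

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta":
                name = (attrs.get("name") or "").lower()
                if name in ("keywords", "description"):
                    self.meta[name] = attrs.get("content", "")
            elif tag in ("script", "style"):
                self._skip += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self._skip:
                self._skip -= 1

        def handle_data(self, data):
            if not self._skip and data.strip():
                self.text.append(data.strip())

    scanner = PageScanner()
    scanner.feed("<html><head><meta name='keywords' content='agents, indexing'>"
                 "</head><body><h1>Webrarians</h1><p>Supporting manual "
                 "classification.</p></body></html>")
    print(scanner.meta)            # {'keywords': 'agents, indexing'}
    print(" ".join(scanner.text))  # Webrarians Supporting manual classification.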
The procedure that takes place in module 3 can be described in this way:
I. Receive gathered info from module 1.
II. Receive gathered info from module 2.
III. Join the information from modules 1 and 2 by finding information packages carrying the same URL as identification.
IV. Analyze the remaining text filtered through module 2. The analysis procedure to be followed will be described in chapters 5.2 – 5.3.
V. Perform a variety of text analyses to decide different attributes of the document. The more that can be said about the document, the better. Which document properties the system shall identify will depend on what type of page, as discussed in chapter 3.3, the system determines it to be.
VI. Combine the results from step IV with the information provided by module 1 and module 2. The agents should weigh the probability that the information provided by the different modules is correct, and come up with a final suggested EDDIC contents code for the Web page (a small sketch of this weighting follows the list). To help in producing this code, the agents have a number of tables containing information about different classification standards, in addition to the knowledge base contained in the agents themselves.
VII. Pass the EDDIC contents code for the page and the page itself on to a human classification expert, shown as “module” 4 in the figure. An explanation of what decisions the agents have made and the background for them must be included.
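The weighting in step VI could, in its simplest form, look something like the sketch below. The evidence sources and their confidence figures are invented for the example; in the real system they would come from the agents' knowledge base and the classification tables.

    # Rough sketch of step VI: weigh the code suggestions from the different
    # evidence sources and produce one final suggestion plus an explanation
    # the human classifier can inspect. All weights are illustrative.
    from collections import defaultdict

    SOURCE_WEIGHT = {"site_code": 0.5, "similar_url": 0.3,
                     "meta_keywords": 0.6, "text_analysis": 0.8}

    def suggest_code(evidence):
        """evidence: list of (source, suggested_code) pairs from modules 1 and 2."""
        score = defaultdict(float)
        for source, code in evidence:
            score[code] += SOURCE_WEIGHT.get(source, 0.1)
        ranking = sorted(score.items(), key=lambda item: item[1], reverse=True)
        explanation = ["%s scored %.2f" % (code, value) for code, value in ranking]
        return ranking[0][0], explanation

    evidence = [("site_code", "530"), ("meta_keywords", "535"),
                ("text_analysis", "535")]
    print(suggest_code(evidence))   # ('535', ['535 scored 1.40', '530 scored 0.50'])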
To further reduce the manual effort, the agent can route the Web page to someone who knows especially well the particular part of the Dewey code hierarchy that the agent believes the page belongs to, for a manual check and correction. Instead of having people know the whole classification code hierarchy, they can concentrate on one area and handle indexing to that area faster. If the agent misdirects the Web page, the page can either be handled by the person who receives it, or he or she can pass it on to the right person manually. Once the index becomes multi-lingual, it will also be necessary to route the Web page for final classification to a person who knows the language used on the page.
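The routing itself can be as simple as a lookup on the first digit of the suggested code. In the sketch below, the assignment of desks to Dewey main classes is of course invented.

    # Illustrative sketch: route a page to the classification expert who is
    # responsible for the main class of the suggested Dewey code.
    SPECIALISTS = {
        "0": "generalities desk",
        "3": "social sciences desk",
        "5": "natural sciences desk",
        "6": "technology desk",
    }

    def route(suggested_code, default="general desk"):
        return SPECIALISTS.get(suggested_code[:1], default)

    print(route("535.2"))    # natural sciences desk
    print(route("781.65"))   # general desk (no music specialist in this example)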
What distinguishes this approach to an automated classification process from earlier attempts is the use of the URLs, and the idea of not automating the classification itself, but automating classification support instead. Similar to how citations have been exploited in earlier classification systems for paper-based printed matter, the hypertext link structure between documents will hopefully make it possible to perform a high-quality classification.
A project under development at the University of Lund [Lund, 1997] aims at providing a database of “all engineering resources on the Internet”. The database is automatically created. Starting from 14 engineering link collections already on the Internet, robots index all pages linked to from these collections, and follow the links on these pages to a depth of 2-3 levels. This is all based on the idea that pages about engineering should link to other pages that are also about engineering. Since the indexing process was completely automated, some pages in the index are not directly related to engineering. The resulting index contains 32,000 engineering pages indexed in full, and an additional 323,000 links found on the engineering pages, whether they are actual engineering links or not. This shows that the idea deserves to be explored further in the future.
Our indexing system has the potential to be much more accurate than the Lund project, since our agents will do an analysis based on three different “kinds” of URLs: the URL of the page itself, the URL of the page or CGI script that led the agent to the page, and the URLs of the pages that the page links to.
It is difficult to estimate how much indexing work there is to do to accomplish our goal. Alta Vista receives about 20,000 notifications of new Web pages from all over the world each day. Of these Web pages, about half are considered to be “spam”, meaning content that presents itself as something other than what it is. Most often, this is adult material disguised in popular search terms, or any kind of page consisting mainly of the same term repeated hundreds of times to attract attention. If we could automatically identify and remove these “spam” pages from the classification process, an actual new page would be submitted to Alta Vista for indexing every 8.64 seconds on average.
In most cases, it is not just single pages that are submitted for indexing, but front pages of whole sites consisting of a number of pages. For simplicity, let us think of a page as just a page for now. So, if Alta Vista wanted to classify all these pages using our system and had 20 classifiers on duty around the clock (3 shifts of 20, altogether 60 people), they could use 172.8 seconds, that is, close to 3 minutes, per page on average to index all pages submitted by users. Because of the necessity of indexing Web sites about “breaking news”, working around the clock is necessary anyway. This shows that if we can develop a system that brings the average indexing time down to a minute or two per Web page, it will be a system of significant use. Considering the number of people working on indexing the Web in different ways today, the number of people necessary to build our index is not frightening.
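The figures above follow from simple arithmetic, reproduced here for clarity:

    # The arithmetic behind the estimates above.
    submissions_per_day = 20000
    real_pages_per_day = submissions_per_day / 2     # half is assumed to be "spam"
    seconds_per_day = 24 * 60 * 60                   # 86,400

    print(seconds_per_day / real_pages_per_day)      # 8.64 seconds between new pages
    print(20 * seconds_per_day / real_pages_per_day) # 172.8 seconds per page with
                                                     # 20 classifiers on duty at a time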
To help them in their work, the classification personnel have their own experience, the existing index, a number of standard classification code tables and the suggested codes from the indexing agents. The procedure for the final indexing is as follows:
I. Retrieve a page accompanied by the EDDIC code suggestions from the agents. The actual Web page is automatically retrieved and shown in the browser, so that the classification expert does not have to wait for it to download from the Net.
II. Manually perform a quality check of the suggested code from the agents by looking through all available information, especially the actual Web page.
III. If necessary, correct the code so that it fits the contents and properties of the Web page.
IV. Give feedback to the agent system in a format that makes it possible for the agent system to “learn” from it by changing the rules in the knowledge base. This should happen whether the agent-suggested code was correct or not.
V. Update the index with a new entry for the Web page. The index will automatically produce a Web site code or code range as soon as it has enough information about a site to do so. (A small sketch of steps IV and V follows the list.)
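Steps IV and V might, in a very simplified form, look like the sketch below. The "learning" rule shown, nudging a per-source confidence figure up or down, is only an illustration of the kind of feedback the knowledge base could receive, not a finished design.

    # Simplified sketch of the feedback and index update (steps IV-V).
    def give_feedback(suggested_code, final_code, source_weights, sources_used,
                      learning_rate=0.05):
        """Raise confidence in the evidence sources used when the suggestion
        was accepted, lower it when the suggestion had to be corrected."""
        correct = (suggested_code == final_code)
        change = learning_rate if correct else -learning_rate
        for source in sources_used:
            source_weights[source] = min(1.0, max(0.0, source_weights[source] + change))
        return correct

    index = {}
    weights = {"text_analysis": 0.8, "similar_url": 0.3}
    accepted = give_feedback("535", "535.2", weights,
                             ["text_analysis", "similar_url"])
    index["http://www.example.edu/physics/waves.html"] = {"code": "535.2"}
    print(accepted, weights)   # the sources used are weakened after a correction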
The critical point in the procedure we have just described is that the agent system must be able to do its job both so well that the human classification personnel actually feel supported by it in their tasks, and so fast that the human classifiers never have to wait for data to arrive. Luckily, this is not a completely new research area, so we have some theory to build on.
We obviously need a very fast way to interpret the contents and context of a Web document. As we have seen, there is a lot of information on Web pages that can be used to support and enhance the quality of a context analysis of the page contents, but the context computing will have to be based on keywords that can be extracted from the text on the pages. The methods described here are based on theory presented in [Salton, 1989].
In as general an index as the one we will build, we cannot start by drawing up a definitive, complete list of all the keywords we want to be able to search for. The keywords must instead be found in actual, existing Web pages and added to a dictionary one at a time, as our agents collect them. Incidentally, Gerry Salton, Chris Buckley and other members of the SMART group at Cornell University did a lot of research in this field, mainly in the 1960s and 1970s, a time when data storage was much more expensive than it is today.
[Salton, 1989] presents a “blueprint for automatic indexing” for single-term word extraction systems. With some modifications, it can be used to index our universe, the World Wide Web. Based on general theory suggested by Salton, the process to determine the keywords for our use from a document on the Web should follow these steps:
Using the vector space information retrieval paradigm, documents are represented as vectors. A dictionary vector D, where each element, d(i), is a word, must be built. Using this, each document T can be assigned a document vector V, where element v(i) is the weight of word d(i) for that document.
To determine the word weights, we use the tf*idf scheme from [Salton, 1989]. We want the keywords that appear in the document and that are most unusual in the index overall to be given the most weight. The basic tf*idf formula for attaching a weight v(i) to word d(i) from the index dictionary is v(i) = tf(i) * log(N / df(i)), where tf(i) is the number of times d(i) occurs in the document, N is the total number of documents in the index, and df(i) is the number of indexed documents in which d(i) occurs.
Since we will be working with HTML documents that vary enormously in size, measured in number of words, we will need a more sophisticated tf*idf scheme. [Balabanovic, 1995] presents a weighting formula of this kind that also normalizes for document length.
Extra weight may be given to words found in headers and in the beginning and the end of the document, as summaries of Web pages are often found there. Names of people and products, that is, words that always begin with a capital letter or are written entirely in upper case, may also be given extra weight.
Having calculated the weights of all the words in the document, we keep the words whose weights exceed a defined value and dispose of the rest. The threshold can be fixed, or set for each particular document so that we get, for example, a maximum of 10 ranked keywords for any document. The remaining words can now be used as the keywords for the document, and the keyword identification process terminates.
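A small sketch of the weighting and selection just described is given below. The length normalization used, dividing term frequency by the number of words in the document, is one common choice and is not necessarily the exact formula from [Balabanovic, 1995]; the document frequency table is invented.

    # Sketch of normalized tf*idf keyword selection.
    import math
    from collections import Counter

    def keywords(doc_words, document_frequency, n_documents, top_k=10):
        """doc_words: stop-word-filtered words of one document.
        document_frequency: word -> number of indexed documents containing it.
        Returns the top_k words ranked by length-normalized tf*idf weight."""
        term_freq = Counter(doc_words)
        length = len(doc_words) or 1
        weights = {}
        for word, count in term_freq.items():
            df = document_frequency.get(word, 1)
            weights[word] = (count / length) * math.log(n_documents / df)
        ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
        return ranked[:top_k]

    # Toy example with an invented document frequency table:
    df = {"agent": 40, "index": 120, "webrarian": 2, "page": 300}
    doc = ["webrarian", "agent", "agent", "index", "page", "page", "page"]
    print(keywords(doc, df, n_documents=1000, top_k=3))
    # "agent" and "webrarian" rank highest in this toy example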
[Salton, 1989] suggests an outline for how to form term phrases from keywords and use them as content identifiers, with the intention of refining the meaning of single-term identifiers that would be too broad to be used alone (a small sketch follows the criteria):
I. The principal phrase component, known as the phrase head, should be a term with a high appearance frequency in the document, exceeding a stated threshold.
II. The other components of the phrase should be medium- or low-frequency terms with stated co-occurrence relationships with the phrase head. For example, the phrase components should co-occur in a common document sentence with the phrase head within a stated number of words of each other.
III. Common function words, that is stop words, should not be used in
the phrase-formation process.
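The three criteria can be sketched as a simple co-occurrence pass over the sentences of a document. The frequency threshold, the distance and the stop word list below are illustrative values only.

    # Sketch of phrase formation: pair a frequent head term with less frequent
    # terms that co-occur with it in the same sentence within a given distance.
    from collections import Counter

    STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}

    def phrases(sentences, head_min_freq=3, max_distance=2):
        words = [w for sentence in sentences for w in sentence]
        freq = Counter(words)
        heads = {w for w, c in freq.items()
                 if c >= head_min_freq and w not in STOP_WORDS}
        found = set()
        for sentence in sentences:
            for position, head in enumerate(sentence):
                if head not in heads:
                    continue
                window = sentence[max(0, position - max_distance):
                                  position + max_distance + 1]
                for other in window:
                    if other != head and other not in STOP_WORDS \
                            and freq[other] < head_min_freq:
                        found.add((head, other))
        return found

    sents = [["search", "agents", "index", "web", "pages"],
             ["the", "agents", "classify", "pages"],
             ["classification", "agents", "support", "manual", "indexing"]]
    print(phrases(sents))   # pairs such as ('agents', 'index'), ('agents', 'classify')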
Since we practically never see the keywords stated in the HTML META tag ourselves, it is easy to assume that the tag is not frequently used. According to a survey [Meta, 1997] of more than 60,000 Web pages, however, the META tag was used to quote keywords and a description of the document in a considerable share of the pages surveyed (11.81% had “keywords” and 9.63% had “description”). The documents were picked from the Yahoo! directory down to depth 4 and from regular runs on SearchBC, a Canadian search engine, so they may not represent a correct average of all Web pages when it comes to how much effort is put into making a page easy to find. Also, some of these tags were probably not very meaningful, but simply inserted automatically by the HTML editing tool used to create the page.
Another survey [SiteMetrics, 1998], covering commercial pages only, shows that just below one third of the 25,000 commercial Web sites owned by 20,000 of the largest U.S. companies used “META keywords” and/or the “META description” tags on their pages. The use rate of the tags was independent of the size of the company. This tells us that the technique is widely known and in use, and will probably become even more popular as soon as some search tool starts handling these tags seriously.
To be able to use the keywords and key phrases to classify information,
we need an index with a classification hierarchy where all kinds of topics
are systematically arranged. Classifying a document means finding the spot in the hierarchy that fits the document’s context as accurately as possible.
The hierarchy must be continuously maintained, so that information regarding
new topics that may emerge in the world can also be inserted in the index.
The system should not be too rigid, but should allow some flexibility
in mapping keywords to topics. Creating a framework like this is very time-consuming
and expensive, so instead of creating a new one, we should find an existing
framework that is in use and well maintained, and take advantage of the
work already put into it. Our choice is a system based on the Dewey Decimal
Classification, as described in chapter 2.3.2. To improve the functionality
of the classification system, we will need to add some extensions, as we
will see later.
As shown in Figure 5-2, Dewey for Windows offers a number of ways to navigate the Dewey Decimal Classification system. In the top left window of the screenshot, the “DDC Summary” window, the different main categories are listed. Clicking on a category group displays its sub-categories. This provides an overview of the top levels of the schedules and tables, and is the most basic way to navigate the hierarchy.
To do a more thorough search, the top right window, “Search”, can be used. Any keyword or key phrase can be searched for, and all Dewey listings containing the key will be listed in the window. The search takes place in an index file of about 40 megabytes on the CD-ROM, and usually takes less than a second to perform. In the screenshot the search key is “classification”, which gave 82 hits, ranging from “library and information sciences” to “specific kinds of viruses” and “instruments and their music”. All hits have the Dewey number in front, so a quick glance at the top left window can tell us something about the context of the hits (200’s = something concerning religion, 400’s = somehow about language, and so on).
Double-clicking on one of the search hits, e.g. “025.431 Dewey Decimal Classification”, takes us to the bottom right window, where the whole record for this particular Dewey code is displayed. This includes related Library of Congress headings and notes on how the code is to be used. Very specific instructions on what is supposed to be classified under this code are available for codes that may seem ambiguous. The bottom left window contains the same as the bottom right window, but here it is possible to scroll the window up or down to look at the descriptions for neighboring Dewey topics as well.
The “Dewey for Windows” user interface is obviously meant to be used
as a reference for people who need to use the DDC for either classifying
material or for searching in a Dewey-based index. The data that this application
offers, however, can also be used to implement a semi-automatic indexing
system.
If the agent detects that a page consists of elements it is not capable of making any sense of, it routes the document directly to manual classification; in that case there is no need for the agent to process the page any further. However, if we have good keywords for a page, an agent can take these keywords and look them up against the Dewey records in the database that “Dewey for Windows” uses. This comparison will result in a list of Dewey codes, of which at least one should fit the Web page well (a small sketch follows).
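The lookup could, in a very reduced form, look like the sketch below. The Dewey records shown are a tiny invented excerpt; a real implementation would query the full database behind “Dewey for Windows”.

    # Sketch of matching page keywords against Dewey records. The records are
    # an invented excerpt, included only to make the example runnable.
    DEWEY_RECORDS = {
        "025.431": "dewey decimal classification library classification systems",
        "535":     "light optics visible light physics",
        "781.65":  "jazz music instruments",
    }

    def candidate_codes(page_keywords, records=DEWEY_RECORDS):
        """Rank Dewey codes by how many of the page's keywords occur in the
        record text for each code."""
        scored = []
        for code, text in records.items():
            terms = set(text.split())
            hits = sum(1 for keyword in page_keywords if keyword.lower() in terms)
            if hits:
                scored.append((hits, code))
        return [code for hits, code in sorted(scored, reverse=True)]

    print(candidate_codes(["optics", "light", "classification"]))
    # ['535', '025.431'] - at least one of these should fit the page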
The suggested Dewey codes are then sent to a human classifier together with the actual page and whatever other aspects of the classification the agents have been instructed to come up with. The classifier will have to decide whether the agent was right in its attempted classification or not. If the agent was right, the Web page is indexed at the place in the hierarchy corresponding to the Dewey Decimal Classification code the agent suggested. The agent should also be notified of its success, which increases its confidence level. If the agent was wrong, the human classification expert can use personal knowledge and experience and a “Dewey for Windows”-like user interface to find the DDC code that best suits the page and do the classification manually. The agent must also be told that it was wrong, and what the result of its work should have been, in a way that helps the agent establish new and more accurate rules for how to classify Web pages.
If the agent cannot find any suitable Dewey categories for a Web page with the keywords it has found, the page may not be of sufficient interest to be indexed. This will have to be decided manually by the human classifier.
Because of the way the Web is constructed, two Web pages located at the same site and in the same directory are likely to belong to the same context. Therefore, as we have seen, when an agent is set to classify a page whose URL is very similar to that of an already classified and indexed Web page, the agent can use this information as extra help when deciding what context the Web page belongs to. If we know the classification code for any of the pages a page links to, or for a page that links to this page, that may also help in deciding the classification code for the page.
When the index contains many pages from the same site or the same directory on a site, an “average DDC code” or perhaps a “DDC code range” for the site or directory may be computed and added to the index hierarchy (a small sketch follows). This is convenient for users, who by following such site or directory links can find good starting pages for manual browsing. It is also convenient for the classification agents, which can quickly get an indication of what a new Web page may be about, based on the URL of the page alone.
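Deriving such a site-level code or code range can be sketched very simply: once enough pages from one site or directory are indexed, take the range spanned by their codes. The threshold of five pages is an arbitrary example value.

    # Sketch of computing a "DDC code range" for a site or directory from the
    # codes of the pages already indexed there.
    def site_code_range(codes, min_pages=5):
        """codes: Dewey codes assigned to the indexed pages of one site or
        directory. Returns (lowest, highest) once enough pages are known."""
        if len(codes) < min_pages:
            return None
        ordered = sorted(codes, key=float)
        return ordered[0], ordered[-1]

    print(site_code_range(["530", "535", "535.2", "539", "531"]))
    # ('530', '539') - a hint that the site is mainly about physics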
It is not necessary to build a whole Dewey indexing database for each language from which we want to be able to classify Web documents. Instead, we can translate the contents of a page into the most frequently used language on the Web, English, and use the English Dewey indexing database. Although we do not yet have software that translates poems very well from any language to any other, we do have translation software that can translate a text well enough for a human being to get a good idea of what the original author meant to say. For our use, namely grasping the important keywords of documents, such translations will suffice. The same procedure can be used when someone wants to search the index in their own language: instead of providing a search index for every language, the search requests are translated into English. If desired, only hits in one or more user-specified languages are presented to the user, so that the user is given the impression of having “talked” to the search tool in his or her own language.
Before we go on to show how the index this system builds can be used
to offer a search tool better than any of the search engines and directories
in existence today, we need to decide on a format for the entries in the
index. Cue chapter 6.