The Letizia agent, as described in chapter 4.6, is only one of several examples of projects attempting to support the user in searching and browsing activities. This kind of support is important, but given how the search engines struggle to keep up with the growth of the Web, it is necessary to think more radically. We must give agents new tasks and make them help us build better search tools.
The most important function we need to add to the spiders and robots that create the search tool indexes today is the ability to decide what context each particular Web page occurs in. Our search agents must also be able to recognize other Web page properties, as mentioned in chapter 3. However, the context problem is the most important one to us, so we will begin by providing a solution to it.
We start by giving an overview of the classification part of the system we want to build. The aim is not to create the ultimate automatic text analyzer, but to come up with a system that can help us build bigger and better indexes with more searchable meta information than any of today's search tools offer. The guiding idea is to keep the classification process a manual one, but to support the human classification personnel with webrarians that can preprocess the indexing data to make the manual task easier and quicker. The last part of the chapter presents research that can be taken advantage of when the system is to be implemented.
The Classification & Indexing Agency (CIA) launches Web search and retrieval agents onto the Web. These agents do not have to be very smart, but they should follow the up-to-date rules for Web robots defined and agreed upon by the World Wide Web community [Robots, 1997]. This is necessary to optimize the work strategy and the number of robots on the Net, and to make sure that agents, robots, spiders, wanderers etc. do not overload the Web. These agents will in some cases be instructed to look for specific pages, while in other cases they will just “wander” from site to site and page to page, looking for new pages that have appeared on the Net and are ready to be indexed. When such a page is found, the agent supplies it with some extra information and sends it back to the CIA using standard network protocols.
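To illustrate how an agent can comply with these rules, the sketch below checks a site's robot exclusion file before a page is fetched. It is a minimal sketch using only the Python standard library; the agent name "webrarian-agent" is an invented placeholder.

    # Minimal sketch: check a site's robot exclusion rules before fetching a
    # page. The agent name "webrarian-agent" is illustrative only.
    from urllib import robotparser
    from urllib.parse import urlsplit, urlunsplit

    def may_fetch(url, agent_name="webrarian-agent"):
        """Return True if the target site's robots.txt allows this agent to
        retrieve the given URL."""
        parts = urlsplit(url)
        robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
        rules = robotparser.RobotFileParser()
        rules.set_url(robots_url)
        rules.read()                      # fetch and parse /robots.txt
        return rules.can_fetch(agent_name, url)

    # The agent only retrieves and forwards a page if may_fetch(url) is True.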
On arrival at the agency, the information about the Web page is split into:
A) URL information. Most important is the URL of the page itself, but information about the page or CGI script that led the agent to this page is also included, as are the URLs of any pages that this page links to.
B) The HTML-formatted contents of the Web page. This includes the text visible through a standard Web browser as well as meta information that is hidden from the normal user of the Web.
The procedure that takes place in module 1, which analyzes the URL information, can be described in this way (a small sketch of the URL comparison follows the list):
I. Locate URLs in the index that are similar to the URL of the page to be classified.
II. If the site (and directory) has so many pages indexed that it has been given a site contents code or a site contents code range, this code should be the initial “guesstimated” code for the page as well.
III. If there is no site contents code: Based on how similar the URL is to one or more indexed URLs (same site/directory/filename), gather probable keywords and indexing codes for the page.
IV. If there are no similar URLs: See if the page that led the agent to this page, or any of the pages this page links to, already exist in the index, and gather probable keywords and indexing codes for the page from these.
V. In some cases, the location of the document and what language the document is likely to be written in can be read from the domain-part of the URL. This is also information that should be noted by the system at this stage.
VI. Send the gathered information (the probable page contents code,
keywords and potentially other information that can be extracted from the
URL) to module 3, the indexing agents module.
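The URL comparison in steps I-IV can be sketched roughly as follows. The index is modelled as a plain dictionary from URL to an already assigned contents code and keywords; both the example URLs and the codes are invented for illustration.

    # Rough sketch of the module 1 URL comparison (steps I-IV): find indexed
    # pages on the same site, preferring those in the same directory, and
    # gather their codes and keywords as candidates for the new page.
    from urllib.parse import urlsplit

    def url_parts(url):
        parts = urlsplit(url)
        directory, _, filename = parts.path.rpartition("/")
        return parts.netloc.lower(), directory.lower(), filename.lower()

    def gather_candidates(new_url, index):
        """Return candidate entries from indexed URLs on the same site,
        with same-directory matches first."""
        site, directory, _ = url_parts(new_url)
        same_dir, same_site = [], []
        for known_url, entry in index.items():
            known_site, known_dir, _ = url_parts(known_url)
            if known_site != site:
                continue
            (same_dir if known_dir == directory else same_site).append(entry)
        return same_dir + same_site

    # Invented index entries, for illustration only:
    index = {
        "http://www.example.edu/physics/intro.html":
            {"code": "530", "keywords": ["physics", "mechanics"]},
        "http://www.example.edu/physics/optics.html":
            {"code": "535", "keywords": ["optics", "light"]},
    }
    print(gather_candidates("http://www.example.edu/physics/waves.html", index))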
The procedure that takes place in module 2, which analyzes the HTML-formatted contents, can be described in this way (a small sketch follows the list):
I. Scan the document header for meta information provided in any of a number of standardized ways and formats. This includes use of the <META> tag as defined in existing as well as in future classification standards.
II. Scan the document for occurrences of quoted keywords, such as the text string “Keywords:”, and treat the text string that follows as meta information describing the contents of the document.
III. Strip all HTML code except header codes, along with all in-line JavaScript and the like, so that all that remains of the page is essentially the text visible through a standard Web browser.
IV. Send the information gathered from <META> tags and the remaining
text to the automatic indexing agents in module 3.
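A minimal sketch of module 2, using only the Python standard library, might look like the following. It collects “keywords” and “description” from <META> tags and strips markup and in-line scripts, leaving roughly the text a browser would display.

    # Minimal sketch of module 2: extract <META> keywords/description and the
    # visible text of an HTML page, skipping <script> and <style> contents.
    from html.parser import HTMLParser

    class PageScanner(HTMLParser):
        def __init__(self):
            super().__init__()
            self.meta = {}      # e.g. {"keywords": "...", "description": "..."}
            self.text = []      # visible text fragments
            self._skip = 0      # depth inside <script>/<style>

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta":
                name = (attrs.get("name") or "").lower()
                if name in ("keywords", "description"):
                    self.meta[name] = attrs.get("content", "")
            elif tag in ("script", "style"):
                self._skip += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self._skip:
                self._skip -= 1

        def handle_data(self, data):
            if not self._skip and data.strip():
                self.text.append(data.strip())

    scanner = PageScanner()
    scanner.feed("<html><head><meta name='keywords' content='agents, indexing'>"
                 "</head><body><h1>Webrarians</h1><p>Supporting manual "
                 "classification.</p></body></html>")
    print(scanner.meta)            # {'keywords': 'agents, indexing'}
    print(" ".join(scanner.text))  # Webrarians Supporting manual classification.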
The procedure that takes place in module 3 can be described in this way:
I. Receive gathered info from module 1.
II. Receive gathered info from module 2.
III. Join the information from modules 1 and 2 by finding information packages carrying the same URL as identification.
IV. Analyze the remaining text filtered through module 2. The analysis procedure to be followed will be described in chapters 5.2 – 5.3.
V. Perform a variety of text analyses to decide different attributes of the document. The more that can be said about the document, the better. Which document properties the system shall identify will depend on what type of page, as discussed in chapter 3.3, the system determines it to be.
VI. Combine the results from step IV with the information provided by module 1 and module 2. The agents should weigh the probability that the information provided by the different modules is correct, and come up with a final suggested EDDIC contents code for the Web page (a small sketch of this weighting follows the list). To help in producing this code, the agents have a number of tables containing information about different classification standards, in addition to the knowledge base contained in the agents themselves.
VII. Pass the EDDIC contents code for the page and the page itself on to a human classification expert, shown as “module” 4 in the figure. An explanation of what decisions the agents have made and the background for them must be included.
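The weighting in step VI could, in its simplest form, look something like the sketch below. The evidence sources and their confidence figures are invented for the example; in the real system they would come from the agents' knowledge base and the classification tables.

    # Rough sketch of step VI: weigh the code suggestions from the different
    # evidence sources and produce one final suggestion plus an explanation
    # the human classifier can inspect. All weights are illustrative.
    from collections import defaultdict

    SOURCE_WEIGHT = {"site_code": 0.5, "similar_url": 0.3,
                     "meta_keywords": 0.6, "text_analysis": 0.8}

    def suggest_code(evidence):
        """evidence: list of (source, suggested_code) pairs from modules 1 and 2."""
        score = defaultdict(float)
        for source, code in evidence:
            score[code] += SOURCE_WEIGHT.get(source, 0.1)
        ranking = sorted(score.items(), key=lambda item: item[1], reverse=True)
        explanation = ["%s scored %.2f" % (code, value) for code, value in ranking]
        return ranking[0][0], explanation

    evidence = [("site_code", "530"), ("meta_keywords", "535"),
                ("text_analysis", "535")]
    print(suggest_code(evidence))   # ('535', ['535 scored 1.40', '530 scored 0.50'])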
To further reduce the manual effort, the agent can route the Web page to someone who knows especially well the particular part of the Dewey code hierarchy that the agent believes the page belongs to, for a manual check and correction. Instead of having people know the whole classification code hierarchy, they can concentrate on one area and handle indexing to that area faster. If the agent misdirects the Web page, the page can either be handled by the person who receives it, or he or she can pass it on to the right person manually. Once the index becomes multi-lingual, it will also be necessary to route the Web page for final classification to a person who knows the language used on the page.
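The routing itself can be as simple as a lookup on the first digit of the suggested code. In the sketch below, the assignment of desks to Dewey main classes is of course invented.

    # Illustrative sketch: route a page to the classification expert who is
    # responsible for the main class of the suggested Dewey code.
    SPECIALISTS = {
        "0": "generalities desk",
        "3": "social sciences desk",
        "5": "natural sciences desk",
        "6": "technology desk",
    }

    def route(suggested_code, default="general desk"):
        return SPECIALISTS.get(suggested_code[:1], default)

    print(route("535.2"))    # natural sciences desk
    print(route("781.65"))   # general desk (no music specialist in this example)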
What distinguishes this approach to an automated classification process from earlier attempts is the use of the URLs, and the idea of not automating the classification itself, but automating classification support instead. Similar to how citations have been exploited in earlier classification systems for paper-based printed matter, the hypertext link structure between documents will hopefully make it possible to perform a high-quality classification.
A project under development at the University of Lund [Lund, 1997] aims at providing a database of “all engineering resources on the Internet”. The database is automatically created. Starting from 14 engineering link collections already on the Internet, robots index all pages linked to from these collections, and follow the links on these pages to a depth of 2-3 levels. This is all based on the idea that pages about engineering should link to other pages that are also about engineering. Since the indexing process was completely automated, some pages in the index are not directly related to engineering. The resulting index contains 32,000 engineering pages indexed in full, and an additional 323,000 links found on the engineering pages, whether they are actual engineering links or not. This shows that the idea deserves to be explored further in the future.
Our indexing system has the potential to be much more accurate than the Lund project, since our agents will do an analysis based on three different “kinds” of URLs: the URL of the page itself, the URL of the page or CGI script that led the agent to the page, and the URLs of the pages that the page links to.
It is difficult to estimate how much indexing work there is to do to accomplish our goal. Alta Vista receives about 20,000 notifications of new Web pages from all over the world each day. Of these Web pages, about half are considered to be “spam”, meaning content that presents itself as something other than what it is. Most often, this is adult material disguised in popular search terms, or any kind of page consisting mainly of the same term repeated hundreds of times to attract attention. If we could automatically identify and remove these “spam” pages from the classification process, an actual new page would be submitted to Alta Vista for indexing every 8.64 seconds on average.
In most cases, it is not just single pages that are submitted for indexing, but front pages of whole sites consisting of a number of pages. For simplicity, let us think of a page as just a page for now. So, if Alta Vista wanted to classify all these pages using our system and had 20 classifiers on duty around the clock (3 shifts of 20, altogether 60 people), they could use 172.8 seconds, that is, close to 3 minutes, per page on average to index all pages submitted by users. Because of the necessity of indexing Web sites about “breaking news”, working around the clock is necessary anyway. This shows that if we can develop a system that brings the average indexing time down to a minute or two per Web page, it will be a system of significant use. Considering the number of people working on indexing the Web in different ways today, the number of people necessary to build our index is not frightening.
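The figures above follow from simple arithmetic, reproduced here for clarity:

    # The arithmetic behind the estimates above.
    submissions_per_day = 20000
    real_pages_per_day = submissions_per_day / 2     # half is assumed to be "spam"
    seconds_per_day = 24 * 60 * 60                   # 86,400

    print(seconds_per_day / real_pages_per_day)      # 8.64 seconds between new pages
    print(20 * seconds_per_day / real_pages_per_day) # 172.8 seconds per page with
                                                     # 20 classifiers on duty at a time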
To help them in their work, the classification personnel have their own experience, the existing index, a number of standard classification code tables and the suggested codes from the indexing agents. The procedure for the final indexing is as follows:
I. Retrieve a page accompanied by the EDDIC code suggestions from the agents. The actual Web page is automatically retrieved and shown in the browser, so that the classification expert does not have to wait for it to download from the Net.
II. Manually perform a quality check of the suggested code from the agents by looking through all available information, especially the actual Web page.
III. If necessary, correct the code so that it fits the contents and properties of the Web page.
IV. Give feedback to the agent system in a format that makes it possible for the agent system to “learn” from it by changing the rules in the knowledge base. This should happen whether the agent-suggested code was correct or not.
V. Update the index with a new entry for the Web page. The index will automatically produce a Web site code or code range as soon as it has enough information about a site to do so. (A small sketch of steps IV and V follows the list.)
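Steps IV and V might, in a very simplified form, look like the sketch below. The "learning" rule shown, nudging a per-source confidence figure up or down, is only an illustration of the kind of feedback the knowledge base could receive, not a finished design.

    # Simplified sketch of the feedback and index update (steps IV-V).
    def give_feedback(suggested_code, final_code, source_weights, sources_used,
                      learning_rate=0.05):
        """Raise confidence in the evidence sources used when the suggestion
        was accepted, lower it when the suggestion had to be corrected."""
        correct = (suggested_code == final_code)
        change = learning_rate if correct else -learning_rate
        for source in sources_used:
            source_weights[source] = min(1.0, max(0.0, source_weights[source] + change))
        return correct

    index = {}
    weights = {"text_analysis": 0.8, "similar_url": 0.3}
    accepted = give_feedback("535", "535.2", weights,
                             ["text_analysis", "similar_url"])
    index["http://www.example.edu/physics/waves.html"] = {"code": "535.2"}
    print(accepted, weights)   # the sources used are weakened after a correction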
The critical point in the procedure we have just described is that the agent system must be able to do its job both so well that the human classification personnel actually feel supported by it in their tasks, and so fast that the human classifiers never have to wait for data to arrive. Luckily, this is not a completely new research area, so we have some theory to build on.
We obviously need a very fast way to interpret the contents and context of a Web document. As we have seen, there is a lot of information on Web pages that can be used to support and enhance the quality of a context analysis of the page contents, but the context computing will have to be based on keywords that can be extracted from the text on the pages. The methods described here are based on theory presented in [Salton, 1989].
In as general an index as the one we will build, we cannot start by drawing up a definitive, complete list of all the keywords we want to be able to search for. The keywords must instead be found in actual, existing Web pages and added to a dictionary one at a time, as our agents collect them. Incidentally, Gerry Salton, Chris Buckley and other members of the SMART group at Cornell University did a lot of research in this field, mainly in the 1960s and 1970s, a time when data storage was much more expensive than it is today.
[Salton, 1989] presents a “blueprint for automatic indexing” for single-term word extraction systems. With some modifications, it can be used to index our universe, the World Wide Web. Based on general theory suggested by Salton, the process to determine the keywords for our use from a document on the Web should follow these steps:
Using the vector space information retrieval paradigm, documents are represented as vectors. A dictionary vector D, where each element, d(i), is a word, must be built. Using this, each document T can be assigned a document vector V, where element v(i) is the weight of word d(i) for that document.
To determine the word weights, we use the tf*idf scheme from [Salton, 1989]. We want the keywords that appear in the document and that are most unusual in the index overall to be given the most weight. The basic tf*idf formula for attaching a weight v(i) to word d(i) from the index dictionary is v(i) = tf(i) * log(N / df(i)), where tf(i) is the number of times d(i) occurs in the document, N is the total number of documents in the index, and df(i) is the number of indexed documents in which d(i) occurs.
Since we will be working with HTML documents that vary enormously in size, measured in number of words, we will need a more sophisticated tf*idf scheme. [Balabanovic, 1995] presents a weighting formula of this kind that also normalizes for document length.
Extra weight may be given to words found in headers and in the beginning and the end of the document, as summaries of Web pages are often found there. Names of people and products, that is, words that always begin with a capital letter or are written entirely in upper case, may also be given extra weight.
Having calculated the weights of all the words in the document, we keep the words whose weights exceed a defined value and dispose of the rest. The threshold can be fixed, or set for each particular document so that we get, for example, a maximum of 10 ranked keywords for any document. The remaining words can now be used as the keywords for the document, and the keyword identification process terminates.
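A small sketch of the weighting and selection just described is given below. The length normalization used, dividing term frequency by the number of words in the document, is one common choice and is not necessarily the exact formula from [Balabanovic, 1995]; the document frequency table is invented.

    # Sketch of normalized tf*idf keyword selection.
    import math
    from collections import Counter

    def keywords(doc_words, document_frequency, n_documents, top_k=10):
        """doc_words: stop-word-filtered words of one document.
        document_frequency: word -> number of indexed documents containing it.
        Returns the top_k words ranked by length-normalized tf*idf weight."""
        term_freq = Counter(doc_words)
        length = len(doc_words) or 1
        weights = {}
        for word, count in term_freq.items():
            df = document_frequency.get(word, 1)
            weights[word] = (count / length) * math.log(n_documents / df)
        ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
        return ranked[:top_k]

    # Toy example with an invented document frequency table:
    df = {"agent": 40, "index": 120, "webrarian": 2, "page": 300}
    doc = ["webrarian", "agent", "agent", "index", "page", "page", "page"]
    print(keywords(doc, df, n_documents=1000, top_k=3))
    # "agent" and "webrarian" rank highest in this toy example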
[Salton, 1989] suggests an outline for how to form term phrases from keywords and use them as content identifiers, with the intention of refining the meaning of single-term identifiers that would be too broad to be used alone (a small sketch follows the criteria):
I. The principal phrase component, known as the phrase head, should be a term with a high appearance frequency in the document, exceeding a stated threshold.
II. The other components of the phrase should be medium- or low-frequency terms with stated co-occurrence relationships with the phrase head. For example, the phrase components should co-occur in a common document sentence with the phrase head within a stated number of words of each other.
III. Common function words, that is stop words, should not be used in
the phrase-formation process.
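The three criteria can be sketched as a simple co-occurrence pass over the sentences of a document. The frequency threshold, the distance and the stop word list below are illustrative values only.

    # Sketch of phrase formation: pair a frequent head term with less frequent
    # terms that co-occur with it in the same sentence within a given distance.
    from collections import Counter

    STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}

    def phrases(sentences, head_min_freq=3, max_distance=2):
        words = [w for sentence in sentences for w in sentence]
        freq = Counter(words)
        heads = {w for w, c in freq.items()
                 if c >= head_min_freq and w not in STOP_WORDS}
        found = set()
        for sentence in sentences:
            for position, head in enumerate(sentence):
                if head not in heads:
                    continue
                window = sentence[max(0, position - max_distance):
                                  position + max_distance + 1]
                for other in window:
                    if other != head and other not in STOP_WORDS \
                            and freq[other] < head_min_freq:
                        found.add((head, other))
        return found

    sents = [["search", "agents", "index", "web", "pages"],
             ["the", "agents", "classify", "pages"],
             ["classification", "agents", "support", "manual", "indexing"]]
    print(phrases(sents))   # pairs such as ('agents', 'index'), ('agents', 'classify')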
Since we practically never see the keywords stated in the HTML META tag ourselves, it is easy to assume that the tag is not frequently used. According to a survey [Meta, 1997] of more than 60,000 Web pages, however, the META tag was used to quote keywords and a description of the document in a considerable share of the pages surveyed (11.81% had “keywords” and 9.63% had “description”). The documents were picked from the Yahoo! directory down to depth 4 and from regular runs on SearchBC, a Canadian search engine, so they may not represent a correct average of all Web pages when it comes to how much effort is put into making a page easy to find. Also, some of these tags were probably not very meaningful, but simply inserted automatically by the HTML editing tool used to create the page.
Another survey [SiteMetrics, 1998], covering commercial pages only, shows that just below one third of the 25,000 commercial Web sites owned by 20,000 of the largest U.S. companies used “META keywords” and/or the “META description” tags on their pages. The use rate of the tags was independent of the size of the company. This tells us that the technique is widely known and in use, and will probably become even more popular as soon as some search tool starts handling these tags seriously.
To be able to use the keywords and key phrases to classify information,
we need an index with a classification hierarchy where all kinds of topics
are systematically arranged. Classifying a document means finding the spot in the hierarchy that fits the document’s context as accurately as possible.
The hierarchy must be continuously maintained, so that information regarding
new topics that may emerge in the world can also be inserted in the index.
The system should not be too rigid, but should allow some flexibility
in mapping keywords to topics. Creating a framework like this is very time-consuming
and expensive, so instead of creating a new one, we should find an existing
framework that is in use and well maintained, and take advantage of the
work already put into it. Our choice is a system based on the Dewey Decimal
Classification, as described in chapter 2.3.2. To improve the functionality
of the classification system, we will need to add some extensions, as we
will see later.
As shown in Figure 5-2, Dewey for Windows offers a number of ways to navigate the Dewey Decimal Classification system. In the top left window of the screenshot, the “DDC Summary” window, the different main categories are listed. Clicking on a category group displays its sub-categories. This provides an overview of the top levels of the schedules and tables, and is the most basic way to navigate the hierarchy.
To do a more thorough search, the top right window, “Search”, can be used. Any keyword or key phrase can be searched for, and all Dewey listings containing the key will be listed in the window. The search takes place in an index file of about 40 megabytes on the CD-ROM, and usually takes less than a second to perform. In the screenshot the search key is “classification”, which gave 82 hits, ranging from “library and information sciences” to “specific kinds of viruses” and “instruments and their music”. All hits have the Dewey number in front, so a quick glance at the top left window can tell us something about the context of the hits (200’s = something concerning religion, 400’s = somehow about language, and so on).
Double-clicking on one of the search hits, e.g. “025.431 Dewey Decimal Classification”, takes us to the bottom right window, where the whole record for this particular Dewey code is displayed. This includes related Library of Congress headings and notes on how the code is to be used. Very specific instructions on what is supposed to be classified under this code are available for codes that may seem ambiguous. The bottom left window contains the same as the bottom right window, but here it is possible to scroll the window up or down to look at the descriptions for neighboring Dewey topics as well.
The “Dewey for Windows” user interface is obviously meant to be used
as a reference for people who need to use the DDC for either classifying
material or for searching in a Dewey-based index. The data that this application
offers, however, can also be used to implement a semi-automatic indexing
system.
If the agent detects that a page consists of elements it is not capable of making any sense of, it routes the document directly to manual classification; in that case there is no need for the agent to process the page any further. However, if we have good keywords for a page, an agent can take these keywords and look them up against the Dewey records in the database that “Dewey for Windows” uses. This comparison will result in a list of Dewey codes, of which at least one should fit the Web page well (a small sketch follows).
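The lookup could, in a very reduced form, look like the sketch below. The Dewey records shown are a tiny invented excerpt; a real implementation would query the full database behind “Dewey for Windows”.

    # Sketch of matching page keywords against Dewey records. The records are
    # an invented excerpt, included only to make the example runnable.
    DEWEY_RECORDS = {
        "025.431": "dewey decimal classification library classification systems",
        "535":     "light optics visible light physics",
        "781.65":  "jazz music instruments",
    }

    def candidate_codes(page_keywords, records=DEWEY_RECORDS):
        """Rank Dewey codes by how many of the page's keywords occur in the
        record text for each code."""
        scored = []
        for code, text in records.items():
            terms = set(text.split())
            hits = sum(1 for keyword in page_keywords if keyword.lower() in terms)
            if hits:
                scored.append((hits, code))
        return [code for hits, code in sorted(scored, reverse=True)]

    print(candidate_codes(["optics", "light", "classification"]))
    # ['535', '025.431'] - at least one of these should fit the page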
The suggested Dewey codes are then sent to a human classifier together with the actual page and whatever other aspects of the classification the agents have been instructed to come up with. The classifier will have to decide whether the agent was right in its attempted classification or not. If the agent was right, the Web page is indexed at the place in the hierarchy corresponding to the Dewey Decimal Classification code the agent suggested. The agent should also be notified of its success, which increases its confidence level. If the agent was wrong, the human classification expert can use personal knowledge and experience and a “Dewey for Windows”-like user interface to find the DDC code that best suits the page and do the classification manually. The agent must also be told that it was wrong, and what the result of its work should have been, in a way that helps the agent establish new and more accurate rules for how to classify Web pages.
If the agent cannot find any suitable Dewey categories for a Web page with the keywords it has found, the page may not be of sufficient interest to be indexed. This will have to be decided manually by the human classifier.
Because of the way the Web is constructed, two Web pages located at the same site and in the same directory are likely to belong to the same context. Therefore, as we have seen, when an agent is set to classify a page whose URL is very similar to that of an already classified and indexed Web page, the agent can use this information as extra help when deciding what context the Web page belongs to. If we know the classification code for any of the pages a page links to, or for a page that links to this page, that may also help in deciding the classification code for the page.
When the index contains many pages from the same site or the same directory on a site, an “average DDC code” or perhaps a “DDC code range” for the site or directory may be computed and added to the index hierarchy (a small sketch follows). This is convenient for users, who by following such site or directory links can find good starting pages for manual browsing. It is also convenient for the classification agents, which can quickly get an indication of what a new Web page may be about, based on the URL of the page alone.
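Deriving such a site-level code or code range can be sketched very simply: once enough pages from one site or directory are indexed, take the range spanned by their codes. The threshold of five pages is an arbitrary example value.

    # Sketch of computing a "DDC code range" for a site or directory from the
    # codes of the pages already indexed there.
    def site_code_range(codes, min_pages=5):
        """codes: Dewey codes assigned to the indexed pages of one site or
        directory. Returns (lowest, highest) once enough pages are known."""
        if len(codes) < min_pages:
            return None
        ordered = sorted(codes, key=float)
        return ordered[0], ordered[-1]

    print(site_code_range(["530", "535", "535.2", "539", "531"]))
    # ('530', '539') - a hint that the site is mainly about physics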
It is not necessary to build a whole Dewey indexing database for each language from which we want to be able to classify Web documents. Instead, we can translate the contents of a page into the most frequently used language on the Web, English, and use the English Dewey indexing database. Although we do not yet have software that translates poems very well from any language to any other, we do have translation software that can translate a text well enough for a human being to get a good idea of what the original author meant to say. For our use, namely grasping the important keywords of documents, such translations will suffice. The same procedure can be used when someone wants to search the index in their own language: instead of providing a search index for every language, the search requests are translated into English. If desired, only hits in one or more user-specified languages are presented to the user, so that the user is given the impression of having “talked” to the search tool in his or her own language.
Before we go on to show how the index this system builds can be used
to offer a search tool better than any of the search engines and directories
in existence today, we need to decide on a format for the entries in the
index. Cue chapter 6.