6 The EDDIC Code Format

This chapter contains a discussion about the requirements for the Extended Dewey Decimal Internet Classification (EDDIC) code. We start by defining the factors that must be considered when designing the format, and present a suggestion for how the index entries should be built and what they will look like. An EBNF-like definition of the EDDIC code concludes the chapter.

6.1 Critical Factors For The Code Design

When designing the format for a code to be used for our index entries, there are a number of factors we have to keep in mind.

The code must create a basis for accurate, flexible searching. Document properties that can be used to efficiently distinguish among Web pages must be identified, ordered systemically and incorporated into the code system.
The code must be efficient in terms of disk space. Because of the size of an index covering the number of Web pages we want to classify, this is an aspect we need to emphasize. Although the solution for storage problems often has been to wait for new technology offering cheaper space, we do not have time to wait for new hardware this time. Even today, no search engine is capable of indexing more than about half of the Web pages in existence, and the number of Web documents is increasing drastically.
The code must be scalable. If the code can be used to index smaller units, such as a company’s common files or the files on any particular personal computer, it will be possible to create applications that offer seamless integration of files from any local computer and from the Internet. Alta Vista, through its Personal Search 97 package, offers the option of indexing your personal files and searching among them alongside searching the Internet. This product has one major downside that seems to make people reluctant to use it, though: the index files quickly fill your hard drive.
The code must result in a balanced hierarchy. To make possible quick insertion and update of (Web) document records and a similarly efficient read access to them, the records must be organized into a well-balanced hierarchy. This means we need to make wise decisions on what branches our index tree shall have, finding how the branches best can be used to do indexing and searching.
The code must allow for changes and additions. In addition to the Dewey-part of the EDDIC code, a number of extensions are needed to be able to offer a flexible but still accurate search. Instead of coming up with all kinds of special codes for every possible property of a document, we should be able to incorporate any useful, popular classification standard that already exists or appears on the stage in the future. An example of a standard to include in the EDDIC-code is the PICS rating system, as described in Chapter 3.5.
Code readability. While the code shall be numeric in its basic form, so that it is suitable for computing, a dictionary for translating between the numeric form and a form readable to humans with different levels of code reading skills must be made. This dictionary will also be used to translate search requests from people into search agent instructions.

6.2 Important Document Properties

To be able to offer a flexible narrowing of search scope, we need to decide which document properties are most important to store information about in our index. We must pick properties that can be automatically detected in as many cases as possible. The most important properties are probably the kind of properties that people quickly perceive themselves when they read and look at Web pages, as these are the ones people are most likely to use when they search for a particular Web page.

6.2.1 Unique Identifier

The only attribute of a Web document that is guaranteed to be unique to that particular page is the URL. Therefore, we pick the URL to be used as the identifier. This is automatically collected by the Web search agents, together with the HTML <TITLE> tag, which is a very short description of the page provided by the creator of the page.

Using the URL as the identifier will result in very large differences in length and contents between the different index entries, but we have no other options. It is necessary to provide robust mechanisms for handling the situations where URLs cease to exist and where a totally new page appears at the same URL as a page that is already in the index.

6.2.2 Topic / Category

The single most important document property when it comes to searching among and separating between home pages is the extended Dewey code that provides the context for the Web page. As a basis for the hierarchy of interest area codes, we have chosen to use the Dewey Decimal Classification system.

One criticism that may be raised against it is that it was made a long time ago and does not cover all newer areas very well. As described earlier, we are convinced that there are enough useful properties of the DDC to justify using it as a basis. An interesting addition to the 000-999 codes is to introduce new codes related to areas such as electronic commerce and specific kinds of Internet services. This should be kept in a format similar to the original Dewey codes, for example as E00-E99. Most users of the system will not see the actual codes anyway. Instead, they see the full title of the topic each particular code covers, automatically translated from the code numbers by the system.

6.2.3 Web Page Class

As discussed in Chapter 3.3, Web pages can be classified not only by what topic they cover, but also by what kind of pages they are. Different classes of Web pages are personal home pages, fan pages/pages dedicated to someone or something, major link pages, news pages, sports news pages, public information pages and so on. This is a document property that may be very useful to include in the index, particularly for pages that carry contents of high quality. Also, “Breaking news” could be a page class, making it easy to find news about very recent happenings. When the news is not so new anymore, the code can be removed from the page’s index entry. A two-digit code, 00-99, should be sufficient for indicating Web page classes.

6.2.4 Language

To indicate what language is used in a Web page we use the Z39-53 standard codes from the National Information Standards Organization (NISO). This is a three-character text code, covering some 400 different languages as of today.

6.2.5 Contents Ratings

Several formats for rating the contents of Web pages in different ways have appeared. There is no de facto standard for this rating system yet, but it seems as if the PICS project, described in Chapter 3.5, will be a major system in the near future. We shall support new systems as they appear on the scene, but for now we should start out with incorporating PICS codes in the EDDIC code where available. This means ratings level codes for violence, nudity, sex and profane language on the Web pages.

6.2.6 Graphics Use

To people using modems to access the Internet, an important factor when choosing what page to look at is how much graphics, that is pictures and animation, the page contains. If you are looking for textual information, you will probably be more interested in pages with a lot of text than in pages using a lot of graphics, which may make retrieval of the page slow. On the other hand, if you are looking for pictures of someone or something, you may want to concentrate your search on pages containing a lot of graphics.

How much graphics there is on the page may be measured by how many <IMG> tags the HTML-formatted page contains. It may be more fair to large Web documents to calculate a “graphics value” by counting how many <IMG> tags there are per 1,000 words. No matter what approach is chosen, some kind of numeric value for graphical contents level should be calculated and included in the index.
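The per-1,000-words measure can be sketched as follows. This is only an illustration under stated assumptions: the names GraphicsCounter and graphics_value are hypothetical, every whitespace-separated run of text counts as a word (including text inside <SCRIPT> or <TITLE>), and a page without any text is treated as having one word to avoid dividing by zero.

```python
from html.parser import HTMLParser


class GraphicsCounter(HTMLParser):
    """Counts <IMG> tags and words of text in an HTML page."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> and <img> both match.
        if tag == "img":
            self.images += 1

    def handle_data(self, data):
        # Simplification: any whitespace-separated token counts as a word.
        self.words += len(data.split())


def graphics_value(html):
    """Return the number of <IMG> tags per 1,000 words of text."""
    counter = GraphicsCounter()
    counter.feed(html)
    counter.close()
    # Assumption: a page without text is treated as having one word.
    return 1000.0 * counter.images / max(counter.words, 1)
```

A page with two images and 500 words gets the value 4.0 under this sketch. How such a value should be mapped to the short GR code in the index is left open here.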

6.2.7 Periodicity

How often a Web page is updated may be of interest in many situations. Therefore the index should contain information about how often the page is believed to change contents in one way or another. Values may be codes indicating if it changes for example “All the time”, “Daily”, “Weekly”, “Bi-weekly”, “Monthly”, “Annually” or “Never”.

6.2.8 Keywords

Since we do not want to store whole Web pages in the index, it is necessary to offer a mechanism for allowing some kind of search using keywords and search strings within the context the user is interested in. Each page can have up to ten keywords for this purpose. The words should be as distinctive for that page as possible. Typical examples of good keywords are names of people and geographic locations that do not have their own special code within the DDC system, registered trademarks, brand names and other words that are given sufficiently high values in the word weighting procedure described in Chapter 5.2.

Setting the maximum number of allowed keywords to 10 is merely a preliminary suggestion. What the ideal number of keywords is should be settled by future research. An alternative to including and using keywords in the index as suggested here is to first use the other document properties covered by the EDDIC system to limit the number of possibly interesting URLs, and then perform a free-text search on a “normal” search engine, limiting the search to only return hits from the Web sites assumed to be of most interest by EDDIC.

6.2.9 Future Extensions

To prevent our system from becoming obsolete within a few years, it must be possible to add and remove Web document search properties. When new formats and standards for including meta information are introduced and turn out to be in significant use, this should be incorporated by our system. If the system has enough capacity, additional Web document qualities, such as Dublin Core elements, Java scripts and different plug-ins, should be detected and registered by the EDDIC index.

An extension we may already consider including is expiration dates for Web pages. For certain happenings and business offers, the Web page may contain information on when the page will be taken off the Net, most often because the information has no value anymore. The agents should be able to perceive such meta information and make sure the index entry is removed on the same date as the page itself is taken off the Web, instead of leaving it to the system to detect the removal automatically.

Other interesting extensions may be for describing any costs related to accessing certain Web pages, what plug-ins or other technology is required to fully experience a Web page, codes for pages that offer special security for money transactions and whether the page contains advertisements or not.

6.2.10 Index Maintenance Data

Unfortunately, even though we get a Web page classified and indexed, the work is not over. Due to the dynamic nature of the Web, pages sometimes move from one URL to another, or they may even be taken off the Web completely. To prevent our index from containing links to non-existing pages, we need to check on every single link every now and then. This can be done automatically, but to do it systematically, we introduce the expiration check field to the index entries.

Each page must be checked at least once every month. If the expiration value is set to 30 when the page is indexed, and the value is decreased by 1 every day, an expiration check can be initiated when the value reaches 0 (zero). If the page seems to be OK, the counter is reset to 30. If the page cannot be found, a new existence check must be done in a few days to make sure a page is not prematurely removed from the index. Because of high load, server failures and temporary malfunctions in computers, networks and cables, a page is sometimes off the Web for a short period. This way of dividing the expiration check dates between all the pages will soon divide the work evenly from day to day. The value of 30 is only a preliminary suggestion that guarantees that “dead” links are removed from the index after just over a month in the worst case.
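The countdown described above can be sketched as a small daily routine. This is a hedged illustration, not a definitive implementation: the function daily_update, the dictionary layout and the retry interval of three days (our reading of “a few days”) are all assumptions, and the page_exists callable stands in for whatever existence check the agents perform.

```python
CHECK_INTERVAL_DAYS = 30  # preliminary value suggested in the text
RETRY_DAYS = 3            # assumption: "a few days" taken as three


def daily_update(index, page_exists):
    """Advance every entry by one day and run the checks that fall due.

    index maps URL -> {"days_left": int, "failed_once": bool}.
    page_exists is a callable taking a URL and returning True or False.
    Returns the list of URLs removed from the index today.
    """
    removed = []
    for url, entry in list(index.items()):
        entry["days_left"] -= 1
        if entry["days_left"] > 0:
            continue  # no check due today
        if page_exists(url):
            # Page is fine: start a new full countdown.
            entry["days_left"] = CHECK_INTERVAL_DAYS
            entry["failed_once"] = False
        elif entry["failed_once"]:
            # Second failed check in a row: remove the entry.
            del index[url]
            removed.append(url)
        else:
            # First failure: the page may be down temporarily, so retry soon.
            entry["days_left"] = RETRY_DAYS
            entry["failed_once"] = True
    return removed
```

With these values a dead page is removed at the latest 33 days after its last successful check, which matches the “just over a month” worst case mentioned above.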

6.3 Automatic and Analytic Meta Information

The document properties we have decided to include in the index entries for Web pages can either be found automatically by the agents or they need to be settled by manual classification personnel, supported by the agents’ preparatory work. In general, we can say that the properties will be found like this:

Automatically: URL, Title, Language, Ratings, Graphics, future extensions and expiration data.
Through agent-supported manual analysis: DDC code, Page Class, Periodicity, Keywords

In some cases it may be impossible to find suitable information to put into each of the possible fields of the index entry of a Web page. Instead of having an “Undecided” value for each field, we will leave these fields empty for that particular Web page. This means that each field must be headed by a short code saying what kind of information the field contains. This is also the most practical way of doing it when considering that we may also add new fields to index entries in the future. The suggested field codes and some additional information are shown in Table 6-1.

Field codes the “Required” column is ticked for must be a part of each index entry. Other field codes are optional. An index entry can contain several values for the field codes where the “Multivalue” column is ticked.

Field                         Field Code  Short for                      Required  Optional  Multivalue
URL                           AD          Address                        *
Title                         HD          Header                                   *
Dewey Decimal Classification  DC          Dewey Classification           *                   *
Page Class                    PC          Page Class                               *         *
Language, Z39-53              LC          Language Code                            *
Ratings                       RL          Ratings Levels                           *         *
Graphics Use                  GR          Graphics                       *
Periodicity                   PE          Periodicity                              *
Keywords                      Kn          Keyword n                      *                   *
Extensions                    Enn         Extension nn                             *         *
Expiration check              Xnn         Page eXistence due in nn days  *

Table 6-1, Field codes for use in the EDDIC index

Based on the table, this is what an index entry may look like, using only ad-hoc codes for now:

ADhttp://www.ntnu.no/indexe.html;;HDWelcome to NTNU;;DC378.05;;
PC41;;LCENG;;GR2;;PE0;;K1NTNU;;K2university;;K3Trondheim;;K4faculties;;X10;;

Explanation: The entry tells us

The URL, http://www.ntnu.no/indexe.html
The title, “Welcome to NTNU”
The Dewey code, 378.05, “Public colleges and universities”
Page class 41, Front page of educational institution
Language used, English
Graphics use, 2, meaning there is “some but not heavy” graphics
Periodicity, 0, meaning the page changes contents often but not regularly
Keywords, “NTNU”, “university”, “Trondheim” and “faculties” considered important ones
Expiration status, in 10 days the next page existence check will be initiated

Including the field delimiter “;;”, chosen because it is a string that is unlikely to occur in the fields themselves, all this information fits in an entry of about 140 characters. The codes used for page class, graphics use and periodicity are not actual codes, but codes created for the example only.
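To illustrate how an agent might read such an entry back, here is a minimal parsing sketch. The names parse_entry and FIELD_PATTERN are hypothetical, and the sketch assumes that Kn uses one digit and that Enn and Xnn use two digits, as the field codes in Table 6-1 suggest. Whitespace and line breaks around a field are ignored.

```python
import re

DELIMITER = ";;"
# Field codes from Table 6-1: fixed two-letter codes, then Kn, Enn and Xnn.
# Assumption: n stands for exactly one digit.
FIELD_PATTERN = re.compile(
    r"(AD|HD|DC|PC|LC|RL|GR|PE|K\d|E\d\d|X\d\d)(.*)", re.DOTALL
)


def parse_entry(entry):
    """Split one index entry into a list of (field code, value) pairs."""
    fields = []
    for chunk in entry.split(DELIMITER):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the final delimiter
        match = FIELD_PATTERN.fullmatch(chunk)
        if match is None:
            raise ValueError(f"unknown field code in {chunk!r}")
        fields.append((match.group(1), match.group(2)))
    return fields
```

Applied to the example entry, this yields twelve pairs, starting with ("AD", "http://www.ntnu.no/indexe.html") and ending with ("X10", ""). The expiration field carries its number in the field code itself, so its value is empty.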

Because we do not index the actual contents of the Web page, but only its “properties”, including topic/context, we also avoid any copyright infringements. This point has become more important as lawsuits have appeared between search engine companies and Web page owners concerning the rights to offer “second-hand” information.

6.4 EDDIC – The Code

To describe the EDDIC index entry in its general form, we use an EBNF-like [Backus, 1959] (Extended Backus-Naur Form) syntax, where “[ .. ]” means optional field(s), “{ .. }” means “one or more entry fields”, < .. > means “all required fields” and “|” indicates alternatives:

IndexEntry         ::= RequiredProperties [ OptionalProperties ]
RequiredProperties ::= < ReqProperty >
OptionalProperties ::= { OptProperty }
ReqProperty        ::= ReqPropertyCode Value Delimiter
OptProperty        ::= OptPropertyCode Value Delimiter
ReqPropertyCode    ::= AD | { DC } | GR | { Kn } | Xnn
OptPropertyCode    ::= HD | { PC } | LC | { RL } | PE | { Enn }
Delimiter          ::= ;;
Value              ::= LegalString
LegalString        ::= { LegalCharacter }
n                  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
LegalCharacter     ::= Any character except “;” when following a “;”

Since standard EBNF has no simple way of indicating what terms must be a part of an expression, let us stress that the above means that the index entry of any Web page shall contain at least the fields containing the URL (AD), one or more Dewey codes (DC), a graphics value (GR), one or more keywords (Kn) and an expiry code, Xnn. Optionally an index entry may also include a title field (HD), one or more page class codes (PC), a language code (LC), one or more ratings for the page (RL), the periodicity of the updating of the page (PE) and one or more extensions (Enn).

The string “;;AD” will indicate the start of a new index entry in a file containing several entries.
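The two rules above, required fields and entry boundaries, can be sketched together. This is an illustration only: split_entries, missing_required and REQUIRED_PATTERNS are hypothetical names, and the sketch assumes that a new entry begins at every field whose code is AD, which is equivalent to looking for the string “;;AD” once the file has been split on the delimiter.

```python
import re

# Required field codes from the definition above, as regular expressions.
# Assumption: n stands for exactly one digit.
REQUIRED_PATTERNS = {
    "AD": r"AD",
    "DC": r"DC",
    "GR": r"GR",
    "Kn": r"K\d",
    "Xnn": r"X\d\d",
}


def split_entries(text):
    """Split file contents into entries, each a list of raw fields."""
    entries = []
    for chunk in text.split(";;"):
        chunk = chunk.strip()
        if not chunk:
            continue
        if chunk.startswith("AD"):
            entries.append([])  # an AD field starts a new entry
        if not entries:
            raise ValueError("file does not start with an AD field")
        entries[-1].append(chunk)
    return entries


def missing_required(fields):
    """Return the names of required field codes absent from one entry."""
    return [
        name
        for name, pattern in REQUIRED_PATTERNS.items()
        if not any(re.match(pattern, field) for field in fields)
    ]
```

For a file holding one complete entry followed by an entry with only AD, HD and Xnn fields, missing_required returns an empty list for the first entry and ["DC", "GR", "Kn"] for the second.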

6.4.1 Meeting the Requirements

Our code meets the requirements from chapter 6.1, where we listed a number of important factors when designing a code framework for indexing use:

By creating and storing a code describing the page instead of indexing the page contents itself, we save a lot of space for every page indexed. Although we cannot offer free-text search in the Web pages we index, we can offer search within a context to a much larger degree than traditional search engines can, through intelligent use of the different document properties we have recorded information about.
The code is scalable. If someone finds that this seems like a good way to organize documents, they can use the same system to index their own documents and files in general on their own computer. Each user will have to decide which properties to include in the index. Instead of using the URL as unique identifier, the disk drive path and file name may be used. This can later be mapped to a network address if several people using several computers want to create a shared index.
To optimize the organization of the index database it is necessary to know much about the data that are to be organized. The best access times can probably be achieved by having several indexes, each based on a different sorting key. The properties identified in this chapter should cover and allow the most important kinds of searches people may want to perform.
Future additions are easily implemented as a consequence of the extension and field code system. When it is decided to add a new index field, the existing index entries do not have to be changed in any way. While old records can be updated by agents over time, what matters most is that the agents and the indexing system can easily be made to incorporate the desired changes when indexing new entries.
Changes to the codes can easily be introduced, since it will be only a matter of mapping codes from one number to another. This requires a minimum of manual work, and the process of actually implementing the change is a fast one, since it consists of having computers do what they do best, namely manipulate numbers.
The readability of the code is optimal. The agents work with numbers that are mapped by the system to words and expressions that are meaningful to the human users of the system.
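The point about changing codes by mapping one number to another can be illustrated with a short sketch. The function remap_codes and the example mapping are hypothetical. The sketch assumes entries in the “;;”-delimited form shown in section 6.3 and rewrites only the values of the one field code given.

```python
def remap_codes(entry, field_code, mapping):
    """Return the entry with the values of one field replaced via mapping.

    Values that do not appear in mapping are left unchanged.
    """
    fields = []
    for field in entry.split(";;"):
        if field.startswith(field_code):
            value = field[len(field_code):]
            field = field_code + mapping.get(value, value)
        fields.append(field)
    return ";;".join(fields)
```

For example, remapping page class 41 to a (made-up) new class 57 turns "ADhttp://x.example/;;PC41;;GR2;;" into "ADhttp://x.example/;;PC57;;GR2;;", leaving all other fields as they were.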

The parts of the EDDIC code that are directly related to describing contextual information about a Web page also answer to the factors mentioned in Chapter 3.2:

Unambiguity and full coverage: Any context can be described very accurately using the extended Dewey codes. If necessary, new and more precise codes may be added by classification experts. A document may be given several codes if they all describe the document's contents well.
Simplicity and agent-friendly: The code is available in two formats, one optimized for the use of agents, the other optimized for humans to read.
Compactness: As we have seen, the code is very compact, especially when compared to storing whole Web pages.
Navigable and Visuality: The codes are all arranged in a hierarchy with a limited number of choices when going from one level of details to another. Such hierarchies are well suited for creating visual navigation environments. In the beginning we will want to visualize the hierarchy as a traditional tree structure, but eventually we may move on to for example showing the index as a library, where each index entry can be found as a book of a certain color and a certain size in the right shelf in the right part of the library, depending on what codes it has been given.

Hence, we can conclude the chapter feeling satisfied about the outlines for the code, and go on to look at how our index can be used to offer new ways to search the Internet.

Author: Bjørn Christian Tørrissen (bct@pvv.org), http://www.pvv.org/~bct/
