6 The EDDIC Code Format
This chapter contains a discussion about the requirements for the Extended
Dewey Decimal Internet Classification (EDDIC) code. We start by defining
the factors that must be considered when designing the format, and present
a suggestion for how the index entries should be built and what they will
look like. An EBNF-like definition of the EDDIC code concludes the chapter.
6.1 Critical Factors For The Code Design
When designing the format for a code to be used for our index entries,
there are a number of factors we have to keep in mind.
-
The code must create a basis for accurate, flexible searching. Document
properties that can be used to efficiently distinguish among Web pages
must be identified, ordered systematically and incorporated into the code
system.
-
The code must be efficient in terms of disk space. Because of the
size of an index covering the number of Web pages we want to classify,
this is an aspect we need to emphasize. Although the solution for storage
problems often has been to wait for new technology offering cheaper space,
we do not have time to wait for new hardware this time. Already, no search
engine is capable of indexing more than about half of the Web pages in
existence, and the number of Web documents is increasing rapidly.
-
The code must be scalable. If the code can be used to index smaller
units, such as a company’s common files or the files on any particular
personal computer, it will be possible to create applications that offer
seamless integration of files from any local computer and from the Internet.
Alta Vista, through its Personal Search 97 package, offers the option of
indexing your personal files and searching among them alongside searching
the Internet. This product has one major downside that seems to make
people reluctant to use it, though: the index files quickly
fill your hard drive.
-
The code must result in a balanced hierarchy. To allow quick
insertion and update of (Web) document records and similarly efficient
read access to them, the records must be organized into a well-balanced
hierarchy. This means we need to make wise decisions on what branches our
index tree shall have, and find out how the branches can best be used for
indexing and searching.
-
The code must allow for changes and additions. In addition to the
Dewey-part of the EDDIC code, a number of extensions are needed to be able
to offer a flexible but still accurate search. Instead of coming up with
all kinds of special codes for every possible property of a document, we
should be able to incorporate any useful, popular classification standard
that already exists or appears on the stage in the future. An example of
a standard to include in the EDDIC-code is the PICS rating system, as described
in Chapter 3.5.
-
The code must be readable. While the code shall be numeric in its basic form,
so that it is suitable for computing, a dictionary must be made for translating
between the numeric form and a form readable to humans with different levels of
code reading skills. This dictionary will also be used to
translate search requests from people into search agent instructions.
6.2 Important Document Properties
To be able to offer a flexible narrowing of search scope, we need to decide
which document properties are most important to store information about
in our index. We must pick properties that can be automatically detected
in as many cases as possible. The most important properties are probably
the kind of properties that people quickly perceive themselves when they
read and look at Web pages, as these are the ones people are most likely
to use when they search for a particular Web page.
6.2.1 Unique Identifier
The only attribute of a Web document that is guaranteed to be unique to
that particular page is the URL. Therefore, we use the URL
as the identifier. It is automatically collected by the Web search
agents, together with the contents of the HTML <TITLE> tag, which is a very
short description of the page provided by its creator.
Using the URL as the identifier will result in very large differences
in length and contents between the different index entries, but we
have no other options. It is necessary to provide robust mechanisms for
handling the situations where URLs cease to exist and where a totally new
page appears at the same URL as a page that is already in the index.
6.2.2 Topic / Category
The single most important document property when it comes to searching
among and separating between home pages is the extended Dewey code that
provides the context for the Web page. As a basis for the hierarchy of
interest area codes, we have chosen to use the Dewey Decimal Classification
system.
One criticism that may be raised against it is that it was made a long
time ago and does not cover all newer areas very well. As described earlier,
we are convinced that there are enough useful properties of the DDC to
justify using it as a basis. An interesting addition to the 000-999 codes
would be new codes for areas such as electronic commerce
and specific kinds of Internet services. These should be kept in a format
similar to the original Dewey codes, for example as E00-E99. Most users
of the system will not see the actual codes anyway. Instead, they see the
full title of the topic each particular code covers, automatically translated
from the code numbers by the system.
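The translation from code numbers to topic titles could be sketched as follows. This is only an illustration under stated assumptions: the names TITLES, ancestors and title_for are hypothetical, the dictionary holds just a few entries, and the fallback to the nearest broader class is our own suggestion for codes that have no title of their own.

```python
# Hypothetical excerpt of a code-to-title dictionary. The titles for 300,
# 370 and 378 follow the DDC summaries; 378.05 uses the title given in
# this chapter's example entry.
TITLES = {
    "300": "Social sciences",
    "370": "Education",
    "378": "Higher education",
    "378.05": "Public colleges and universities",
}


def ancestors(code):
    """Return the code followed by its broader classes, most specific first."""
    whole, _, decimals = code.partition(".")
    # Drop decimal digits one at a time: 378.05 -> 378.0
    chain = [f"{whole}.{decimals[:i]}" for i in range(len(decimals), 0, -1)]
    # Then widen the three-character class: 378 -> 370 -> 300
    chain += [whole, whole[:2] + "0", whole[0] + "00"]
    # Remove repeats (e.g. for "300") while keeping the order
    return list(dict.fromkeys(chain))


def title_for(code):
    """Return the title of the code, or of its nearest broader class."""
    for candidate in ancestors(code):
        if candidate in TITLES:
            return TITLES[candidate]
    return None  # no title known anywhere in the chain
```

Because the widening step only looks at character positions, the same lookup would also work for additional codes in the suggested E00-E99 format.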
6.2.3 Web Page Class
As discussed in Chapter 3.3, Web pages can be
classified not only by what topic they cover, but also by what kind of
page they are. Examples of Web page classes are personal home pages, fan
pages/pages dedicated to someone or something, major link pages, news pages,
sports news pages, public information pages and so on. This is a document
property that may be very useful to include in the index, particularly
for pages that carry contents of high quality. Also, “Breaking news” could
be a page class, making it easy to find news about very recent happenings.
When the news is not so new anymore, the code can be removed from the page’s
index entry. A two-digit code should be sufficient for indicating Web page
classes; 00-99.
6.2.4 Language
To indicate what language is used in a Web page we use the Z39.53 standard
codes from the National Information Standards Organization (NISO). These are
three-character text codes, covering some 400 different languages as of today.
6.2.5 Contents Ratings
Several formats for rating the contents of Web pages in different ways
have appeared. There is no de facto standard for such ratings yet,
but it seems as if the PICS project, described in Chapter
3.5, will be a major system in the near future. We shall support new
systems as they appear on the scene, but for now we should start out with
incorporating PICS codes in the EDDIC code where available. This means
ratings level codes for violence, nudity, sex and profane language on the
Web pages.
6.2.6 Graphics Use
To people using modems to access the Internet, an important factor when
choosing what page to look at is how much graphics, that is pictures and
animation, the page contains. If you are looking for textual information,
you will probably be more interested in pages with a lot of text than pages
using a lot of graphics, which may make retrieval of the page slow. On
the other hand, if you are looking for pictures of someone or something,
you may want to concentrate your search on pages containing much graphics.
How much graphics there is on a page may be measured by how many <IMG>
tags the HTML-formatted page contains. It may be fairer to large Web
documents to calculate a “graphics value” by counting how many <IMG>
tags there are per 1,000 words. No matter what approach is chosen, some
kind of numeric value for graphical contents level should be calculated
and included in the index.
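The “graphics value” described above could be computed roughly like this. It is a minimal sketch using Python's standard html.parser module; the names GraphicsCounter and graphics_value are hypothetical, and the handling of pages without any text (counting them as one word) is an assumption made only to avoid division by zero.

```python
from html.parser import HTMLParser


class GraphicsCounter(HTMLParser):
    """Counts <IMG> tags and words of text data in one HTML page."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> arrives as "img"
        if tag == "img":
            self.images += 1

    def handle_data(self, data):
        # Rough count: all text data is included, also the page title
        self.words += len(data.split())


def graphics_value(html):
    """Return the number of <IMG> tags per 1,000 words."""
    counter = GraphicsCounter()
    counter.feed(html)
    counter.close()
    # Assumption: a page without text is treated as having one word
    return 1000 * counter.images / max(counter.words, 1)
```

A value like this would still have to be mapped to whatever scale the GR field ends up using; that mapping is not defined in this chapter.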
6.2.7 Periodicity
How often a Web page is updated may be of interest in many situations.
Therefore the index should contain information about how often the page
is believed to change contents in one way or another. Values may be codes
indicating if it changes for example “All the time”, “Daily”, “Weekly”,
“Bi-weekly”, “Monthly”, “Annually” or “Never”.
6.2.8 Keywords
Since we do not want to store whole Web pages in the index, it is necessary
to offer a mechanism for allowing some kind of search using keywords and
search strings within the context the user is interested in. Each page
can have up to ten keywords for this purpose. The words should be as distinctive
for that page as possible. Typical examples of good keywords are names
of people and geographic locations that do not have their own special code
within the DDC system, registered trademarks, brand names and other words
that are given sufficiently high values in the word weighting procedure
described in Chapter 5.2.
Setting the maximum number of allowed keywords to 10 is merely a preliminary
suggestion. The ideal number of keywords should be settled by future
research. An alternative to including keywords in the index as
suggested here is to first use the other document properties covered
by the EDDIC system to limit the number of possibly interesting URLs, and
then perform a free-text search on a “normal” search engine, limiting the
search to only return hits from the Web sites that EDDIC assumes to be of
most interest.
6.2.9 Future Extensions
To prevent our system from becoming obsolete within a few years, it must
be possible to add and remove Web document search properties. When new
formats and standards for including meta information are introduced and
turn out to be in significant use, they should be incorporated into our system.
If the system has enough capacity, additional Web document qualities, such
as Dublin Core elements, Java scripts and different plug-ins, should be
detected and registered by the EDDIC index.
An extension we may already consider including is expiration dates
for Web pages. For certain events and business offers, the Web page
may contain information on when the page will be taken off the Net, most
often because the information will no longer have any value. The agents should be
able to perceive such meta information and make sure the index entry is
removed on the same date as the page itself is taken off the Web, instead
of waiting for the system to detect the removal on its own.
Other interesting extensions may be for describing any costs related
to accessing certain Web pages, what plug-ins or other technology is required
to fully experience a Web page, codes for pages that offer special security
for money transactions and whether the page contains advertisements or
not.
6.2.10 Index Maintenance Data
Unfortunately, even though we get a Web page classified and indexed, the
work is not over. Due to the dynamic nature of the Web, pages sometimes
move from one URL to another, or they may even be taken off the Web completely.
To prevent our index from containing links to non-existing pages, we need
to check on every single link every now and then. This can be done automatically,
but to do it systematically, we introduce the expiration check field to
the index entries.
Each page must be checked at least once every month. If the expiration
value is set to 30 when the page is indexed, and the value is decreased
by 1 every day, an expiration check can be initiated when the value reaches
0 (zero). If the page seems to be OK, the counter is reset to 30. If the
page cannot be found, a new existence check must be done a few days later
to make sure a page is not prematurely removed from the index. Because
of high load, server failures and temporary malfunctions in computers,
networks and cables, a page is sometimes off the Web for a short period.
Spreading the expiration check dates across the pages in this way will
soon distribute the work evenly from day to day. The value of 30 is only a
preliminary suggestion that guarantees that “dead” links are removed from
the index after just over a month in the worst case.
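The countdown could be sketched as below. The value 30 comes from the text; everything else is an assumption made for illustration: the names Entry and daily_tick, a retry delay of 3 days as a reading of “a few days”, removal after two failed checks in a row, and a caller-supplied page_exists function that stands in for the agents' actual existence check.

```python
from dataclasses import dataclass

CHECK_INTERVAL_DAYS = 30  # the preliminary value suggested in the text
RETRY_DELAY_DAYS = 3      # assumption: our reading of "a few days"
MAX_FAILED_CHECKS = 2     # assumption: remove after two failed checks


@dataclass
class Entry:
    url: str
    days_left: int = CHECK_INTERVAL_DAYS
    failed_checks: int = 0


def daily_tick(entries, page_exists):
    """Advance every counter by one day and return the entries to keep.

    page_exists is a caller-supplied function mapping a URL to True/False.
    """
    kept = []
    for entry in entries:
        entry.days_left -= 1
        if entry.days_left > 0:
            kept.append(entry)  # no check due today
            continue
        if page_exists(entry.url):
            entry.days_left = CHECK_INTERVAL_DAYS
            entry.failed_checks = 0
            kept.append(entry)
        else:
            entry.failed_checks += 1
            if entry.failed_checks < MAX_FAILED_CHECKS:
                # Page may be down only temporarily: check again soon
                entry.days_left = RETRY_DELAY_DAYS
                kept.append(entry)
            # Otherwise the entry is dropped from the index
    return kept
```

With these assumed values a dead link disappears 33 days after its last successful check at the latest, which matches the “just over a month” worst case.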
6.3 Automatic and Analytic Meta Information
The document properties we have decided to include in the index entries
for Web pages can either be found automatically by the agents or
must be settled by manual classification personnel, supported by the agents’
preparatory work. In general, the properties will be found
like this:
Automatically: URL, Title, Language, Ratings, Graphics, future extensions
and expiration data.
Through agent-supported manual analysis: DDC code, Page Class, Periodicity
and Keywords.
In some cases it may be impossible to find suitable information to put
into each of the possible fields of the index entry of a Web page. Instead
of having an “Undecided” value for each field, we will simply leave such
fields out of the index entry for that particular Web page. This means that
each field must be headed by a short code saying what kind of information the
field contains. This is also the most practical way of doing it when
considering that we may also add new fields to index entries in the future.
The suggested field codes and some additional information are shown in
Table 6-1.
Field codes for which the “Required” column is ticked must be a part of each
index entry. Other field codes are optional. An index entry can contain
several values for the field codes where the “Multivalue” column is ticked.
Field                        | Field Code | Short for                      | Required | Optional | Multivalue
URL                          | AD         | Address                        | *        |          |
Title                        | HD         | Header                         |          | *        |
Dewey Decimal Classification | DC         | Dewey Classification           | *        |          | *
Page Class                   | PC         | Page Class                     |          | *        | *
Language, Z39.53             | LC         | Language Code                  |          | *        |
Ratings                      | RL         | Ratings Levels                 |          | *        | *
Graphics Use                 | GR         | Graphics                       | *        |          |
Periodicity                  | PE         | Periodicity                    |          | *        |
Keywords                     | Kn         | Keyword n                      | *        |          | *
Extensions                   | Enn        | Extension nn                   |          | *        | *
Expiration check             | Xnn        | Page eXistence due in nn days  | *        |          |
Table 6-1, Field codes for use in the EDDIC index
Based on the table, this is what an index entry may look like, using
only ad-hoc codes for now:
ADhttp://www.ntnu.no/indexe.html;;HDWelcome to NTNU;;DC378.05;;
PC41;;LCENG;;GR2;;PE0;;K1NTNU;;K2university;;K3Trondheim;;K4faculties;;X10;;
Explanation: The entry tells us
-
The URL, http://www.ntnu.no/indexe.html
-
The title, “Welcome to NTNU”
-
The Dewey code, 378.05, “Public colleges and universities”
-
Page class 41, Front page of educational institution
-
Language used, English
-
Graphics use, 2, meaning there is “some but not heavy” graphics
-
Periodicity, 0, meaning the page changes contents often but not regularly
-
Keywords, “NTNU”, “university”, “Trondheim” and “faculties” considered
important ones
-
Expiration status, in 10 days the next page existence check will
be initiated
Including the field delimiter “;;”, chosen because it is a string that
is unlikely to occur in the fields themselves, all this information is
contained in a 142 character entry. The codes used for page class, graphics
use and periodicity are not actual codes, but codes created for the example
only.
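To illustrate how such an entry could be taken apart again, here is a minimal sketch. The names EXAMPLE, FIELD_CODE and parse_entry are hypothetical, and we assume that the expiration field carries its number inside the field code (X10) and therefore has an empty value. Whitespace around a field is stripped, on the assumption that line breaks inside a stored entry carry no meaning.

```python
import re

DELIMITER = ";;"

# Field codes from Table 6-1: two letters, K plus one digit,
# or E/X plus two digits.
FIELD_CODE = re.compile(r"AD|HD|DC|PC|LC|RL|GR|PE|K\d|E\d\d|X\d\d")

EXAMPLE = (
    "ADhttp://www.ntnu.no/indexe.html;;HDWelcome to NTNU;;DC378.05;;"
    "PC41;;LCENG;;GR2;;PE0;;K1NTNU;;K2university;;K3Trondheim;;"
    "K4faculties;;X10;;"
)


def parse_entry(entry):
    """Split one index entry into (field code, value) pairs."""
    fields = []
    for chunk in entry.split(DELIMITER):
        chunk = chunk.strip()
        if not chunk:
            continue  # the trailing delimiter leaves an empty chunk
        match = FIELD_CODE.match(chunk)
        if match is None:
            raise ValueError(f"unknown field code in {chunk!r}")
        code = match.group(0)
        fields.append((code, chunk[len(code):]))
    return fields
```

Calling parse_entry(EXAMPLE) yields one pair per field, in the order the fields appear in the entry, starting with the AD field and its URL.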
Because we do not index the actual contents of the Web page, but
only its “properties” including topic/context, we also avoid any copyright
infringements. This point has become more important as lawsuits have
arisen between search engine companies and Web page owners concerning
the rights to offer “second-hand” information.
6.4 EDDIC – The Code
To describe the EDDIC index entry in its general form, we use an EBNF-like
(Extended Backus-Naur Form) syntax [Backus, 1959],
where “[ .. ]” means optional field(s), “{ .. }” means “one or more entry
fields”, “< .. >” means “all required fields” and “|” indicates alternatives:
IndexEntry         ::= RequiredProperties [ OptionalProperties ]
RequiredProperties ::= < ReqProperty >
OptionalProperties ::= { OptProperty }
ReqProperty        ::= ReqPropertyCode Value Delimiter
OptProperty        ::= OptPropertyCode Value Delimiter
ReqPropertyCode    ::= AD | { DC } | GR | { Kn } | Xnn
OptPropertyCode    ::= HD | { PC } | LC | { RL } | PE | { Enn }
Delimiter          ::= ;;
Value              ::= LegalString
LegalString        ::= { LegalCharacter }
n                  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
LegalCharacter     ::= Any character except “;” when following a “;”
Since standard EBNF has no simple way of indicating which terms must
be a part of an expression, let us stress that the above means that the
index entry of any Web page shall contain at least the fields containing
the URL (AD), one or more Dewey codes (DC), a graphics value (GR), one
or more keywords (Kn) and an expiry code, Xnn. Optionally an index entry
may also include a title field (HD), one or more page class codes (PC),
a language code (LC), one or more ratings for the page (RL), the periodicity
of the updating of the page (PE) and one or more extensions (Enn).
The string “;;AD” will indicate the start of a new index entry in a
file containing several entries.
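The rules stated above could be checked mechanically along these lines. This is a sketch only: the names KIND, check_entry and split_entries are hypothetical, the single-occurrence rule is derived from the fields not ticked as “Multivalue” in Table 6-1, and the splitting relies on the assumption that “;;AD” never occurs inside a field value.

```python
import re

# Letters of each field code; K, E and X must be followed by their digits.
KIND = re.compile(r"AD|HD|DC|PC|LC|RL|GR|PE|K(?=\d)|E(?=\d\d)|X(?=\d\d)")

REQUIRED = ("AD", "DC", "GR", "K", "X")
# Fields that are not ticked as "Multivalue" in Table 6-1
SINGLE_VALUED = ("AD", "HD", "LC", "GR", "PE", "X")


def check_entry(entry):
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    kinds = []
    for chunk in entry.split(";;"):
        chunk = chunk.strip()
        if not chunk:
            continue  # empty chunk after the trailing delimiter
        match = KIND.match(chunk)
        if match is None:
            problems.append(f"unknown field code: {chunk!r}")
        else:
            kinds.append(match.group(0))
    for kind in REQUIRED:
        if kind not in kinds:
            problems.append(f"missing required field: {kind}")
    for kind in SINGLE_VALUED:
        if kinds.count(kind) > 1:
            problems.append(f"field may occur only once: {kind}")
    return problems


def split_entries(text):
    """Split a file's contents wherever ";;" is directly followed by "AD"."""
    return re.split(r"(?<=;;)(?=AD)", text)
```

Splitting on an empty match, as split_entries does, requires Python 3.7 or later.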
6.4.1 Meeting the Requirements
Our code meets the requirements from Chapter 6.1, where we listed a number
of important factors to consider when designing a code framework for indexing use:
-
By creating and storing a code describing the page instead of indexing
the page contents itself, we save a lot of space for every page indexed.
Although we cannot offer free-text search in the Web pages we index, we
can offer search within a context to a much larger degree than traditional
search engines can, through intelligent use of the different document properties
we have recorded information about.
-
The code is scalable. If someone finds that this seems like a good way
to organize documents, they can use the same system to index their own
documents and files in general on their own computer. Each user will have
to decide which properties to include in the index. Instead of using the URL
as the unique identifier, the disk drive path and file name may be used.
This can later be mapped to a network address if several people using several
computers want to create a shared index.
-
To optimize the organization of the index database it is necessary to know
much about the data that are to be organized. The best access times can
probably be achieved by having several indexes, each based on a different
sorting key. The properties identified in this chapter should cover and
allow the most important kinds of searches people may want to perform.
-
Future additions are easily implemented as a consequence of the extension
and field code system. When it is decided to add a new index field, the
existing index entries do not have to be changed in any way. While updating
old records can be done by agents in time, what matters most is that the
agents and the indexing system can easily be made to incorporate desired
changes when indexing all new entries.
-
Changes to the codes can easily be introduced, since it will be only a
matter of mapping codes from one number to another. This requires a minimum
of manual work, and the process of actually implementing the change is
a fast one, since it consists of having computers do what they do best,
namely manipulate numbers.
-
The readability of the code is optimal. The agents work with numbers that
are mapped by the system to words and expressions that are meaningful to
the human users of the system.
The parts of the EDDIC code that are directly related to describing contextual
information about a Web page also address the factors mentioned in Chapter
3.2:
-
Unambiguity and full coverage: Any context can be described very accurately
using the extended Dewey codes. If necessary, new and more precise codes
may be added by classification experts. A document may be given several
codes if they all describe the document's contents well.
-
Simplicity and agent-friendliness: The code is available in two formats, one
optimized for the use of agents, the other optimized for humans to read.
-
Compactness: As we have seen, the code is very compact, especially when
compared to storing whole Web pages.
-
Navigability and visuality: The codes are all arranged in a hierarchy with
a limited number of choices when going from one level of details to another.
Such hierarchies are well suited for creating visual navigation environments.
In the beginning we will want to visualize the hierarchy as a traditional
tree structure, but eventually we may move on to, for example, showing the
index as a library, where each index entry can be found as a book of a
certain color and a certain size on the right shelf in the right part of
the library, depending on what codes it has been given.
Hence, we can conclude the chapter feeling satisfied about the outlines
for the code, and go on to look at how our index can be used to offer new
ways to search the Internet.
Visit the author's homepage : http://www.pvv.org/~bct/
E-mail the author, Bjørn Christian Tørrissen:
bct@pvv.org