The University of Queensland Library
      An Introduction to Metadata
 
 


Paper written by Chris Taylor
Manager, Information Access Service
University of Queensland Library
c.taylor@library.uq.edu.au
Revised: 29 July 2003


1. What is Metadata?

Metadata is structured data which describes the characteristics of a resource. It shares many characteristics with the cataloguing that takes place in libraries, museums and archives. The term "meta" derives from the Greek word denoting a nature of a higher order or more fundamental kind. A metadata record consists of a number of pre-defined elements representing specific attributes of a resource, and each element can have one or more values. Below is an example of a simple metadata record:

Element name    Value
Title           Web catalogue
Creator         Dagnija McAuliffe
Publisher       University of Queensland Library
Identifier      http://www.library.uq.edu.au/iad/mainmenu.html
Format          Text/html
Relation        Library Web site
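
As a rough sketch only, a record like this can be held in a program as a mapping from element names to lists of values, since each element can have one or more values. The names `record` and `values_of` are illustrative and not part of any metadata standard.

```python
# A minimal sketch: a metadata record as a mapping from element name to values.
# Each element holds a list because an element can have one or more values.
record = {
    "Title": ["Web catalogue"],
    "Creator": ["Dagnija McAuliffe"],
    "Publisher": ["University of Queensland Library"],
    "Identifier": ["http://www.library.uq.edu.au/iad/mainmenu.html"],
    "Format": ["Text/html"],
    "Relation": ["Library Web site"],
}


def values_of(rec, element):
    """Return all values recorded for an element (empty list if absent)."""
    return rec.get(element, [])
```

An element that has not been filled in, such as Subject here, simply yields an empty list rather than an error.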

Each metadata schema will usually have the following characteristics:

Typically, the semantics is descriptive of the contents, location, physical attributes, type (e.g. text or image, map or model) and form (e.g. print copy, electronic file). Key metadata elements supporting access to published documents include the originator of a work, its title, when and where it was published and the subject areas it covers. Where the information is issued in analog form, such as print material, additional metadata is provided to assist in the location of the information, e.g. call numbers used in libraries. The resource community may also define some logical grouping of the elements or leave it to the encoding scheme. For example, Dublin Core may provide the core to which extensions may be added.

Some of the most popular metadata schemas include:

While the syntax is not strictly part of the metadata schema, the data will be unusable unless the encoding scheme understands the semantics of the metadata schema. The encoding allows the metadata to be processed by a computer program. Important schemes include:

Metadata may be deployed in a number of ways:

The simplest method is for Web page creators to add the metadata as part of creating the page. Creating metadata directly in a database and linking it to the resource is growing in popularity as an activity independent of the creation of the resources themselves. Increasingly, it is being created by an agent or third party, particularly to develop subject-based gateways.

2. What is a search engine?

In a nutshell, search engines such as Google and HotBot consist of a software package that crawls the Web, then extracts the data and organises it in a database. People can then submit a search query using a Web browser. The search engine locates the appropriate data in the database and displays it via the browser. This is not to be confused with directories such as Yahoo, which provide subject lists created by humans that must be browsed. Of course, the Web being the Web, things change very rapidly. For example, in October 2002, Yahoo made a giant shift to using Google's crawler-based listings for its main results. Nonetheless, search engines have three major elements:

Search engine software is also available to run on a local Web site. The software has the same basic components, but the spider just visits the local site or a limited number of sites in a community.
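
The crawl, organise and query cycle described above can be sketched in a few lines. This is a toy illustration, not how any particular search engine works: the `PAGES` dictionary is a hypothetical stand-in for pages already fetched by a spider, and the function names are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for pages fetched by a spider (URL -> page text).
PAGES = {
    "http://example.org/a": "Metadata describes the characteristics of a resource",
    "http://example.org/b": "A search engine crawls the Web and builds an index",
    "http://example.org/c": "Dublin Core is a metadata schema",
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: word -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

Note that the index is built from the full text of each page. Nothing in it distinguishes a word in a title from a word in a footnote, which is part of the problem discussed in the next section.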

3. Why isn't an Internet search engine good enough?

The problem relates to the underlying nature of the World Wide Web. In the early 1990s, "surfing" the World Wide Web was popularised in the mass media. These days, the concept of browsing the Web is little used. The Web has become a two-edged sword. It is now very easy to publish information, but it is becoming more difficult to find relevant information [EC, p.4]. For outsiders and casual users, much of the useful material is difficult to locate and therefore is effectively unavailable [DC1, p.2].

At the global level, Internet search engines were developed to search across multiple Web sites. Unfortunately, these search engines have not been the panacea that some people had hoped for. Every search engine will give you good results some of the time and bad results some of the time. This is what information scientists term "high recall" and "low precision". The high recall refers to the well known (and frustrating) experience of using an Internet search engine and receiving thousands of "hits". It is popularly known as information overload. The low precision refers to not being able to locate the most useful documents among those hits. The search engine companies do not view the high hit rates as a problem. Indeed, they market their products on the basis of their coverage of the Web, not on the precision of the search results.
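
For readers who want the terms pinned down, the usual information-retrieval definitions are: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The sketch below computes both; the function name and the sample figures are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: what share of the results were actually useful?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what share of the useful documents did we get back?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative case: 1000 hits returned, of which only 20 are relevant,
# and those 20 are all the relevant documents that exist.
print(precision_recall(range(1000), range(20)))  # (0.02, 1.0)
```

In this made-up case nothing relevant was missed, yet only one hit in fifty is useful, which is the "thousands of hits" experience described above.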

The Working Group on Government Information Navigation outlined the problems with Internet search engines:

The introduction of the <META> element as part of HTML coding was, in part, an attempt to encourage search engines to extract and index more structured data, such as description and keywords. However, search engines are rather proprietorial in recognising <META> tags. Support ranges from none at all to reasonable. Details are available from Search Engine Watch [SEW]. As far as I am aware, none currently supports metadata schemas. It is the proverbial "chicken and egg" situation: Web page authors and publishers do not invest in providing metadata if the indexing services do not utilise it, and harvesters do not collect metadata if there is not enough data available. The other problem is the malicious "spoofing" of search engines, which makes them return pages that are irrelevant to the search at hand or rank pages higher than their content warrants.

Support for <META> tags by search engines designed for local Web servers varies from non-existent to good. Some of the specialist packages include support for Dublin Core or other metadata schemas.

4. Why use metadata?

The foregoing section has discussed the inadequacy of search engines in locating quality information resources. How does metadata solve the problem? A more formal definition of metadata offers a clue:

Metadata is data associated with objects which relieves their potential users of having full advance knowledge of their existence or characteristics. [DESIRE, p.2]

Information resources must be made visible in a way that allows people to tell whether the resources are likely to be useful to them. This is no less important in the online world, and in particular, the World Wide Web. Metadata is a systematic method for describing resources and thereby improving access to them. If a resource is worth making available, then it is worth describing it with metadata, so as to maximise the ability to locate it.

Metadata provides the essential link between the information creator and the information user.

While the primary aim of metadata is to improve resource discovery, metadata sets are also being developed for other reasons, including:

While this document concentrates on resource discovery and retrieval, these additional purposes for metadata should also be kept in mind.

5. Which Metadata schema?

There are literally hundreds of metadata schemas to choose from and the number is growing rapidly, as different communities seek to meet the specific needs of their members.

A recurring question is how a simple metadata record can be defined that sufficiently describes a wide range of electronic documents. Recognising the need to answer it, the Online Computer Library Center (OCLC), of which the University of Queensland Library is currently the only full member in Australia, combined with the National Center for Supercomputing Applications (NCSA) to sponsor the first Metadata Workshop in March 1995 in Dublin, Ohio [DC1]. The primary outcome of the workshop was a set of 13 elements (subsequently increased to 15) named the Dublin Metadata Core Element Set (known as Dublin Core). Dublin Core was proposed as the minimum number of metadata elements required to facilitate the discovery of document-like objects in a networked environment such as the Internet.

Below is a summary of the elements in Dublin Core. The metadata elements fall into three groups which roughly indicate the class or scope of information stored in them: (1) elements related mainly to the content of the resource, (2) elements related mainly to the resource when viewed as intellectual property, and (3) elements related mainly to the physical manifestation of the resource.

Content & about the Resource    Intellectual Property    Electronic or Physical manifestation
Title                           Author or Creator        Date
Subject                         Publisher                Type
Description                     Contributor              Format
Source                          Rights                   Identifier
Language
Relation
Coverage

A description of each element is given in Appendix 1. Below is an example of a Dublin Core record for a short poem, encoded as part of a Web page using the <META> tag:

<HTML> <!-- HTML 4.0 -->
<HEAD>
<TITLE>Song of the Open Road</TITLE>
<META NAME="DC.Title" CONTENT="Song of the Open Road">
<META NAME="DC.Creator" CONTENT="Nash, Ogden">
<META NAME="DC.Type" CONTENT="text">
<META NAME="DC.Date" CONTENT="1939">
<META NAME="DC.Format" CONTENT="text/html">
<META NAME="DC.Identifier" CONTENT="http://www.poetry.com/nash/open.html">
</HEAD>
<BODY><PRE>
I think that I shall never see
A billboard lovely as a tree.
Indeed, unless the billboards fall
I'll never see a tree at all.
</PRE></BODY>
</HTML>

The <META> tag is not normally displayed by Web browsers, but can be viewed by selecting "Page Source".
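
To illustrate how a harvester might read such tags, here is a minimal sketch using Python's standard `html.parser` module. The class and function names, and the shortened `SAMPLE` page, are assumptions made for this example; real harvesters will differ.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <META NAME="DC.*" CONTENT="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.upper().startswith("DC.") and content is not None:
            element = name[3:]  # strip the "DC." prefix
            # An element may repeat, so keep a list of values.
            self.record.setdefault(element, []).append(content)


def extract_dublin_core(html):
    """Return {element name: [values]} for the DC <META> tags in a page."""
    parser = DublinCoreParser()
    parser.feed(html)
    parser.close()
    return parser.record


# Shortened version of the example page, for illustration only.
SAMPLE = """<HTML>
<HEAD>
<TITLE>Song of the Open Road</TITLE>
<META NAME="DC.Title" CONTENT="Song of the Open Road">
<META NAME="DC.Creator" CONTENT="Nash, Ogden">
<META NAME="DC.Date" CONTENT="1939">
</HEAD>
<BODY></BODY>
</HTML>"""
```

Calling `extract_dublin_core(SAMPLE)` yields a dictionary keyed by element name (Title, Creator, Date), ready to be loaded into an index or database.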

In addition to the 15 elements, three qualifying aspects have been accepted to enable the Dublin Core to function in an international context and also meet higher level scientific and subject-specific resource discovery needs. These three Dublin Core Qualifiers are:

6. Why Dublin Core?

The Dublin Core metadata schema offers the following advantages:

Dublin Core has received widespread acceptance amongst the resource discovery community and has become the de facto Internet metadata standard [AGLS, p.3].

To date, the depth of implementation in individual sectors has been patchy. In Australia, much activity has taken place in the government sector, under the auspices of the Government Technology and Telecommunications Committee (GTTC). Dublin Core has been formally accepted as the standard for the Australian Government Locator Service [AGLS].

7. Which elements, sub-elements and schemes should I use?

There is no simple answer to this question. At a fundamental level, it becomes a compromise, based on:

The bottom line is that a simple description is better than no description at all, as long as it can aid in the consistent discovery of resources.

The level of specificity in resource description is also important. The resources can be described individually or at a collection or aggregate level. It would be practically impossible to provide guidelines as to the appropriate level of specificity. Cataloguing librarians have been arguing the toss for years without reaching a consensus. As always, we should think in terms of customer needs. As noted above, with the major search engines, it is possible to have too many records, such that our customers can't see the forest for the trees. Initially, it would be sensible to allow the creators to determine which resources deserve their own record. If a collection-level record is used, it is important to add as much information as possible to ensure appropriate retrieval.

Acting on customer feedback is also important. Monitoring the search terms input by customers is a well proven technique for improving the quality and coverage of a database. The downside is that the assessment process is essentially a manual one.
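
Although the assessment itself is manual, the tallying that precedes it is easy to script. The sketch below assumes a search log is available as pairs of query text and hit count; the log format, the sample entries and the function name are all hypothetical.

```python
from collections import Counter

# Hypothetical search log: (query as typed, number of hits returned).
SEARCH_LOG = [
    ("opening hours", 12),
    ("Opening Hours", 12),
    ("interlibrary loan", 0),
    ("thesis binding", 3),
    ("interlibrary loan", 0),
    ("inter-library loans", 0),
]


def summarise(log, top=3):
    """Return the most common queries and the queries that found nothing."""
    # Fold case and trim spaces so trivially different queries count together.
    normalised = [(query.strip().lower(), hits) for query, hits in log]
    popular = Counter(query for query, _ in normalised).most_common(top)
    zero_hit = Counter(query for query, hits in normalised if hits == 0)
    return popular, zero_hit
```

Queries that repeatedly return nothing are candidates for new records or for extra terms in existing descriptions. Deciding which still requires human judgement.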

8. What about using controlled terminology?

Consistent use of language within metadata descriptions can aid in the consistent discovery of resources. The primary tool for ensuring consistent language usage is controlled vocabulary, including the use of thesauri. A number of metadata elements would benefit from controlled values.
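
As a small illustration of what "controlled values" means in practice, the sketch below maps free-text subject terms onto preferred terms and sets aside anything it does not recognise. The tiny vocabulary and all names are invented for the example and are not drawn from any published thesaurus.

```python
# Hypothetical controlled vocabulary: preferred term -> non-preferred variants.
VOCABULARY = {
    "Libraries": ["library", "libraries"],
    "Metadata": ["metadata", "meta-data", "meta data"],
    "Search engines": ["search engine", "search engines", "web search"],
}

# Reverse lookup built once: lower-cased variant -> preferred term.
LOOKUP = {
    variant.lower(): preferred
    for preferred, variants in VOCABULARY.items()
    for variant in variants + [preferred]
}


def control_terms(terms):
    """Split free-text terms into (preferred terms, unrecognised terms)."""
    accepted, rejected = [], []
    for term in terms:
        preferred = LOOKUP.get(term.strip().lower())
        if preferred is None:
            rejected.append(term)
        elif preferred not in accepted:
            accepted.append(preferred)
    return accepted, rejected
```

For example, `control_terms(["meta-data", "Metadata", "Web search", "cataloguing"])` returns `(["Metadata", "Search engines"], ["cataloguing"])`, leaving the unrecognised term for a person to review.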

There are many subject thesauri available. However, most are designed for specialist resource communities. For example, the Edinburgh Engineering Virtual Library (EEVL) originally selected the Engineering Information thesaurus, but decided that it was too complex for the purpose. Instead they developed a modified version to suit their specific needs.

Ultimately, as the AGLS Metadata Element Set notes, "… a common sense, author-based approach is still effective and yields a high return to agencies." [AGLS1].

In the absence of a suitable subject thesaurus, some may be tempted to create one from scratch. This temptation is to be resisted at all cost. History is studded with failed attempts at developing new thesauri. It's like establishing a small business. People don't seem to understand that starting is easy; finding the resources to keep the thesaurus current is the real trick. Keeping a thesaurus up to date is a huge investment in resources that is very difficult to justify.

While strictly not a metadata issue, the mismatch between input and index terms has proven to be a major problem in retrieval from databases, particularly as a result of semantic problems, such as different spellings, singular and plural, etc. Although the basic query interfaces for search engines seem similar, there are important differences that affect the outcome of the search. For example, the query 'Mabo Legislation' could be interpreted by different engines as requesting resources that contain:

Obviously, these three different interpretations will produce different sets of results. Search engines differ in whether queries are case sensitive and how they handle singular versus plural forms of a word. Alternative spellings, for example, labour and labor, may have to be searched separately. The same applies to abbreviations, such as dept and department. This disconcerts the naive user and annoys the experienced user. One solution is to use a common query interface, or an intermediate query engine which takes a standard query and translates it into the specific forms required by the site search engine.
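
To make the difference concrete, the sketch below assumes the three interpretations are the common ones: the exact phrase, both words anywhere, or either word. That reading is an assumption for illustration, as are the function names and sample sentences.

```python
def words(text):
    """Lower-case a string and split it on whitespace."""
    return text.lower().split()


def matches_phrase(query, document):
    """True if the query words appear adjacent and in order."""
    q, d = words(query), words(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


def matches_all(query, document):
    """True if every query word appears somewhere (Boolean AND)."""
    return set(words(query)) <= set(words(document))


def matches_any(query, document):
    """True if at least one query word appears (Boolean OR)."""
    return bool(set(words(query)) & set(words(document)))


# Made-up sample texts for illustration.
DOCUMENTS = [
    "the mabo legislation was debated",
    "legislation following the mabo decision",
    "native title legislation",
]

for doc in DOCUMENTS:
    print(
        doc,
        matches_phrase("Mabo Legislation", doc),
        matches_all("Mabo Legislation", doc),
        matches_any("Mabo Legislation", doc),
    )
```

The same query matches one, two or all three of the sample texts depending on which rule the engine applies.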

9. Where will the metadata be stored?

Metadata may be deployed in a number of ways:

The simplest method is to ask Web page creators to add the metadata as part of creating the page. To support rapid retrieval, the metadata should be harvested on a regular basis by the site robot. This is currently by far the most popular method for deploying Dublin Core. An increasing range of software is being made available to assist in the addition of metadata to Web pages.

Creating metadata directly in a database and linking it to the resource is growing in popularity as an activity independent of the creation of the resources themselves. Increasingly, it is being created by an agent or third party, particularly to develop subject-based gateways. The University of Queensland Library is involved in a number of gateway projects, including AVEL and Weblaw.

10. Syntax Issues

For metadata attached to Web pages, the standard encoding scheme is HTML (HyperText Markup Language). RDF (Resource Description Framework) supports multiple metadata schemes. It uses XML (Extensible Markup Language) to express the structure. The advantages in using RDF/XML are many:

Its major drawback is that user-friendly tools to generate XML are still scarce. For metadata contained within a database, the encoding scheme is a lesser issue. What is important is its interoperability with other database schemas, to support cross-database searching and the sharing of metadata records.
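
As a hedged illustration of what an RDF/XML encoding involves, the sketch below serialises a few Dublin Core values with Python's standard `xml.etree.ElementTree` module. The two namespace URIs are the published RDF and Dublin Core element namespaces; the function name and the choice of a single `rdf:Description` per resource are simplifying assumptions.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def to_rdf_xml(resource_url, elements):
    """Serialise (element name, value) pairs as an RDF/XML string."""
    # Ask ElementTree to use the conventional prefixes in its output.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": resource_url}
    )
    for name, value in elements:
        ET.SubElement(description, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_rdf_xml(
    "http://www.poetry.com/nash/open.html",
    [("title", "Song of the Open Road"), ("creator", "Nash, Ogden")],
)
print(xml_text)
```

Because each element name is qualified by a namespace, elements from another schema could sit alongside the Dublin Core ones in the same description without colliding.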

In the context of Web indexing, there are currently two Webs in existence. The first is the "visible" Web, made up of static Web pages that can be harvested and indexed. The second is the "invisible" Web, made up of dynamic pages generated from a database. These pages can’t be directly harvested by a robot and indexed. The records have to be exported from the database, which is not always a trivial matter. Even if they could be harvested, the amount of data in a single, centralised database would be unmanageable.

One option is to interrogate multiple databases at the same time. There are proprietary systems that can do this, usually at great expense. Individual systems can also talk to one another if they conform to the US National Information Standards Organization (NISO) Z39.50 protocol [NISO]. The Z39.50 protocol for distributed information retrieval supports the searching of disparate databases, either singly or in combination, regardless of proprietary interfaces. Z39.50 supports a number of "profiles" in order to enable translation between various databases. Unfortunately, few databases and local search engines support Z39.50.

A more recent development in federated searching is the increasing availability of portal-type software [LC] that supports a single search across multiple databases. The actual techniques remain highly parochial, but in essence the software relies on a client to simultaneously interrogate the indexes of a number of databases, with the results being normalised for display by employing a locally defined metadata schema (usually DC). More sophisticated versions use some type of record de-duplication technique.
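
The normalise-then-de-duplicate step can be sketched as follows. This is an illustration only: the source names, field names, sample records and the crude match key are all invented, and commercial portal products use their own (often much more elaborate) techniques.

```python
# Hypothetical per-database field names mapped onto Dublin Core elements.
FIELD_MAPS = {
    "catalogue": {"ti": "title", "au": "creator", "yr": "date"},
    "eprints": {"Title": "title", "Author": "creator", "Year": "date"},
}


def normalise(source, raw):
    """Rename a raw record's fields to the common (Dublin Core) names."""
    mapping = FIELD_MAPS[source]
    return {mapping[field]: value for field, value in raw.items() if field in mapping}


def dedup_key(record):
    """Crude match key: lower-cased title without punctuation, plus date."""
    title = "".join(ch for ch in record.get("title", "").lower() if ch.isalnum())
    return (title, record.get("date", ""))


def federated_merge(results):
    """Merge {source: [raw records]} into one de-duplicated list."""
    merged, seen = [], set()
    for source, records in results.items():
        for raw in records:
            record = normalise(source, raw)
            key = dedup_key(record)
            if key not in seen:  # keep the first copy of each work
                seen.add(key)
                merged.append(record)
    return merged


# Made-up result sets from two databases, for illustration.
RESULTS = {
    "catalogue": [{"ti": "An Introduction to Metadata", "au": "Taylor, Chris", "yr": "2003"}],
    "eprints": [
        {"Title": "An introduction to metadata.", "Author": "C. Taylor", "Year": "2003"},
        {"Title": "A sample second record", "Author": "A. N. Other", "Year": "2001"},
    ],
}
```

Here `federated_merge(RESULTS)` returns two records rather than three, because the two differently formatted copies of the first title share a match key.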

Such software is achieving relatively quick penetration in the Library marketplace. This is partially due to the fact that the software has been largely developed by library system vendors seeking to broaden their marketplace. It has also come about as a result of librarians decrying the "search interface wars", i.e. there are just too many database search interfaces for librarians and their clients to learn. Such solutions do not come cheap, however.

The other recent development is the Open Archives Initiative [OAI], which seeks to harvest standards-based metadata (DC is the minimum standard) to build metadata repositories.
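
In the OAI Protocol for Metadata Harvesting, a harvester issues ordinary HTTP requests whose `verb` parameter names the operation, and unqualified Dublin Core is requested with the metadata prefix `oai_dc`. The sketch below only builds such a request URL; the function name and the repository address are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optionally restrict the harvest to one set within the repository.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address, for illustration only.
print(list_records_url("http://repository.example.org/oai"))
```

The response to such a request is an XML document containing the metadata records, which the harvester then loads into its own repository.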

11. How does one create metadata?

The more easily the metadata can be created and collected at the point of creation of a resource or at the point of publication, the more efficient the process and the more likely it is to take place. There are many such tools available and the number continues to grow. Such tools can be standalone or part of a package of software, usually with a backend database or repository to store and retrieve the metadata records. Some examples include:

Ideally, metadata should be created using a purpose-built tool, with the manual creation of data kept to an absolute minimum. The tool should support:


References

[AGLS] Australian Government Locator Service Implementation Plan: A Report by the Australian Government Locator Service Working Party (AGLS WG) December, 1997.

[AGLS1] AGLS Metadata Element Set. National Archives of Australia. http://www.naa.gov.au/recordkeeping/gov_online/agls/metadata_element_set.html

[DC1] The Essential Elements of Networked Object Description. Stuart Weibel. OCLC/NCSA Metadata Workshop, March, 1995. http://www.oclc.org:5046/oclc/research/metadata/dublin_core_report.html

[DESIRE] Specification for resource description methods Part 1: A review of metadata: a survey of current resource description formats. Lorcan Dempsey and Rachel Heery, March, 1997. http://www.ukoln.ac.uk/metadata/desire/overview/

[EC] Metadata Workshop. European Commission, Telematics for Libraries, December, 1997. http://hosted.ukoln.ac.uk/ec/metadata-1997/

[LC] The Library of Congress Portals Applications Issues Group. http://www.loc.gov/catdir/lcpaig/

[NISO] Information Retrieval (Z39.50) - Application Service: Definition and Protocol Specification (Version 3). The National Information Standards Organization, 1995. http://www.niso.org/

[OAI] Open Archives Initiative. http://www.openarchives.org/

[SEW] Search Engine Watch. Search Engine Feature Company. http://searchenginewatch.com/webmasters/features.html

[WGGIN] Improving Access to Information and Services of Australian Governments. Working Group on Government Information Navigation, July, 1997. http://www.nla.gov.au/lis/esd4.html


Appendix 1: Dublin Core Metadata schema

Element: Element description

Creator: Person or organisation primarily responsible for creating the intellectual content of the resource, e.g. authors in the case of written documents, artists, photographers, etc. in the case of visual resources.

Publisher: The entity (e.g. agency including unit/branch/section) responsible for making the resource available in its present form, such as a publishing house, a university department, or a corporate entity.

Contributor: Person or organisation not specified in a Creator element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organisation specified in a Creator element, e.g. editor, transcriber, illustrator.

Rights Management: A rights management statement, or an identifier that links to a rights management statement.

Title: The name given to the resource, usually by the creator or publisher. It can be the same as the title of the resource, or may be more descriptive.

Subject: The topic of the resource. Typically, it will be expressed as keywords or phrases that describe the subject or content of the resource. Controlled vocabularies and formal classification schemes are encouraged.

Date: A date associated with the creation or availability of the resource.

Identifier: A string or number used to uniquely identify the resource. Examples for networked resources include URLs, PURLs and URNs. ISBN or other formal names can be used.

Description: A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources.

Source: The work, either print or electronic, from which this object is derived, if applicable. Source is not applicable if the present resource is in its original form.

Language: The language of the intellectual content of the resource.

Relation: Relationship to other resources, e.g. images in a document, chapters in a book, items in a collection.

Coverage: Spatial locations and temporal duration characteristic of the resource.

Type: The category of the resource, such as home page, novel, poem, working paper, technical report, essay, dictionary.

Format: The data format of the resource, used to identify the software and possibly hardware that might be needed to display or operate the resource, e.g. postscript, HTML, text, jpeg, XML.

©2007 The University of Queensland, Brisbane Australia
  Last Updated: 30 August 2007.