Specification:2007-07-12

From UMBELwiki

This is an archived specification and has been significantly updated and modified; it is no longer relevant to the current UMBEL. Please do NOT modify.

UMBEL - Upper Mapping and Binding Exchange Layer

UMBEL Project Proposal 12 July 2007

This version
http://www.umbel.org/proposal-20070712.xhtml
Latest official version
http://www.umbel.org/proposal.xhtml
Editors
Michael K. Bergman, Web scientist and consultant
Authors
Frédérick Giasson, Zitgist

Creative Commons License

Copyright © 2007. UMBEL.org. Some Rights Reserved.


Abstract

Critical pieces of infrastructure are missing for announcing, finding and querying semantic and structured data sets on the Web. This infrastructure needs to be simple, targeted by domain or subject, applicable to the real-world diversity of structured data formalisms, and broadly accepted.

UMBEL (Upper Mapping and Binding Exchange Layer) is a very lightweight and high-level subject layer for mapping various ontologies — including less 'formal' structures — with simple binding mechanisms for any structured formalism. Unlike prior upper-level ontologies, UMBEL provides an organizational 'backbone' or 'umbrella' reference layer based on concrete and high-level subjects (that is, classes of subjects), and not abstract concepts nor complex relationships.

Simply stated, UMBEL is both a high-level reference "bag of subjects" and lightweight mechanisms for binding to Web ontologies via proxies for those subjects.

The UMBEL ontology has two parts. The first 'core' part is a flat listing, or pool, of concrete subject topics that are the proxy binding points for external data sets. The second 'unofficial' part is a reference look-up structure of hierarchical and interlinked subject relationships. We anticipate a number of other look-up structures from third parties may also arise based on the 'core' set of UMBEL subjects.

The high-level 'core' subject pool is itself extracted and derived from existing, accepted knowledge bases, including Wikipedia, WordNet, library systems and other widely accepted subject references. UMBEL's subject topic extraction and selection mechanisms in building its 'core' ontology are automated to allow periodic updates as the subject coverage of these underlying source knowledge bases grows and evolves.

Simple binding protocols through subject proxies and supporting tools are possible for any structured data formalism; the project is initially targeting the most widely employed approaches. The UMBEL binding API and protocols are meant to be as universal as possible, but will, at minimum, initially support Atom, microformats, OPML, OWL, RDF, RDFa, RDF Schema, RSS, tags (via Tag Commons), and topic maps.

The 'core' UMBEL ontology will have low inferential power, and its scope is solely restricted to subjects. Its subject level of representation targets no specific functional use, so all uses remain possible.

The power of the UMBEL ontology thus resides in its organizational and 'info-spatial' reference nature, with some ancillary value in high-level disambiguation. UMBEL helps define where to find the dots and which groups of dots may be appropriate candidates for access or queries. How those dots may be connected is outside the project's scope.

The UMBEL ontology is based on RDF and is written in the RDF Schema vocabulary of SKOS (Simple Knowledge Organization System) with best-practices support for the related RDFS ontologies of FOAF, DOAP, SIOC, and GeoNames. Support for other RDFS vocabularies will likely be added.

Besides development of the UMBEL ontology, the project will also provide an UMBEL ontology registration service, an information and collaboration Web site, and support for language translations and tools development.

The UMBEL ontology is open source, and licensed under the Creative Commons Attribution license version 3.0.

Status of this Document

This is the third review draft for the proposed UMBEL project, and the first one publicly released.

Release Schedule

The public announcement and release of the draft occurred on July 12, 2007. The release of the first version of the UMBEL ontology will occur Fall 2007.


1 Problem Statement

Structured data is now understood to represent a change in emphasis on the Web from a focus on documents to a focus on data objects. RDF data and data sets on the Web are now becoming available at rapidly increasing rates. Some of this data is natively published in RDF; other data can be easily converted. Tools such as GRDDL and "RDFizers" are further enabling the publishing of RDF data from non-RDF sources. RDF data availability in all domains now appears assured.

Structural formalism and semantic expressiveness of these data vary widely and — given the diversity of publication mechanisms, actors and motivations in the native Web — will likely remain so for the foreseeable future.

The rapid emergence of this diverse and potentially linked open data is now surfacing a critical gap in existing Web infrastructure: namely, how can users effectively find and query appropriate portions of this data?

This infrastructure gap has these components:

  • Lack of a central look-up point for where to find RDF data in reference to desired subject matter
  • Lack of a contextual frame of reference for which data sets might be appropriate within any given domain
  • Lack of an open means where any content structure — from formal ontologies to "RDFized" documents to complete RDF data sets — can 'bind' or 'map' to other data sets relevant to its subject domain, and
  • Lack of a registration or publication mechanism for data sets that do become properly placed, for use by SPARQL or similar query endpoints.

Prior attempts to deal with this problem have tended to be too complex. Uptake has been slow to non-existent.

Fortunately tools and approaches presently exist that can redress these gaps in a lightweight manner in fairly short order. An open community process, reflecting the current diversity of structural formalisms and domain interests, can also help achieve broad Web acceptance for these efforts.

2 Project Overview

UMBEL is a very lightweight way to describe the subject(s) of Web content, akin to the relationship "isAbout". UMBEL (Upper Mapping and Binding Exchange Layer) is meant to work universally with HTML, tagging or other standard practices, including various RDF schema and more formal ontologies. Its high-level reference subject 'backbone' is derived from the intersection of common subjects found on popularly used Web sites and other accepted subject references.

UMBEL is simple. The guiding principle is to provide a lightweight subject reference pool and widely applicable binding mechanisms for organizing the understanding of where structure may appear anywhere on the Web. Access and easy adoption are given preference over inferential or logical elegance.

The UMBEL ontology is an organized schema of classed pointers to appropriate ontologies for appropriate data spaces. The 'core' organizational schema is based around a flat 'bag of subjects' listing of high-level subjects extracted from widely accepted Web knowledge bases, with actual subject coverage chosen on the basis of concreteness, common usage, coverage and balance. External ontologies 'bind' to the UMBEL structure via subject 'proxies' that are merely binding reference representations of actual subjects through a predicate akin to the concept of isAbout.

This is not a new idea. About the year 2000 the topic map community was active with published subject indicators (PSIs) and other attempts at topic or subject landmarks. From that earlier community, Bernard Vatant has subsequently spoken of the need for and use of "hubjects" as organizing and binding points, as have Jack Park and Patrick Durusau using the related concept of "subject maps". An effort that has some overlap with a subject structure is the Metadata Registry being maintained by the National Science Digital Library (NSDL). The time is now ripe for these earlier efforts to find expression and bear fruit.

The UMBEL ontology has two parts. The first 'core' part is the flat listing, or pool, of concrete subject topics that are the proxy binding points for external data sets. The second optional part is a reference look-up structure of hierarchical and interlinked subject relationships. In addition to the UMBEL reference look-up structure, we anticipate other look-up structures from third parties may also arise based on the 'core' set of UMBEL subjects.

The high-level 'core' UMBEL subject ontology is itself derived from existing, accepted knowledge bases, including Wikipedia and WordNet, with others likely. The actual subjects and their structure emerge from these normative data sets. The selection of subject binding proxies using automated methodologies from broadly accepted knowledge bases is meant to remove contention and arbitrariness. It enables the UMBEL project to define and explicate (hopefully) straightforward justifications relating to candidate data sets and methodologies. UMBEL's actual subject pool is therefore what is already accepted on the Web.

UMBEL's extraction mechanisms in building its ontology are mostly automated to allow periodic updates as the subjects of these underlying source knowledge bases evolve.

Simple binding protocols through subject proxies and supporting tools are possible for any structured data formalism; the project is initially targeting the most widely employed approaches. Binding protocols are being developed for microformats, data feeds, topic maps, hierarchical facets, tagging and folksonomies, and the various W3C Semantic Web technologies.

The reference UMBEL ontology will have low inferential power, and its scope is solely restricted to subjects. Its subject level of representation targets no specific functional use, so all uses remain possible.

The power of the UMBEL ontology thus resides in its organizational and 'info-spatial' reference nature, with some ancillary value in high-level disambiguation. UMBEL helps define where to find the dots and which groups of dots may be appropriate candidates for access or queries. How those dots may be connected is outside the project's scope.

The UMBEL ontology is based on RDF and is written in the RDF Schema vocabulary of SKOS (Simple Knowledge Organization System) with support for the related RDFS ontologies of FOAF, DOAP, Dublin Core, SIOC, and GeoNames. RDF and these reference schemas provide the natural data model and characterization 'middle ground' across all Web ontology formalisms. Support for other RDFS vocabularies will likely be added.

In addition to the UMBEL ontology, the UMBEL project will also provide look-up, query, registration, pinging, and related services. The project is completely open and supported by an open community process. All project products are made available without charge under Creative Commons licenses. UMBEL's development is being backed by a number of leading open data efforts and entities.

2.1 Objectives

The overarching objective of the UMBEL project is to find a universal subject binding structure for all forms of data sets and constructs presently found on the Web.

Specific objectives for the UMBEL project are to develop:

  1. A reference umbrella subject binding ontology, with its own pool of high-level binding subjects
  2. Lightweight mechanisms for binding subject-specific ontologies to this structure
  3. A standard listing of subjects that can be referenced by resources described by other ontologies (e.g., dc:subject)
  4. Provision of a registration and look-up service for finding appropriate subject ontologies
  5. Identification of existing data sets for high-level subject extraction
  6. Codification of high-level subject structure extraction techniques
  7. Identification and collation of tools to work with this subject structure, and
  8. A public Web site for related information, collaboration and project coordination.

2.2 General Conceptual Model

UMBEL is the top layer of a three-tiered general conceptual model of:

  1. Conceptualization (subjects and subject structure)
  2. Representation (ontologies, loosely defined), and
  3. Instantiation (data sets and data spaces).

The aim of this general model is to promote searching and finding (subject structure), understanding (ontology), and then getting information (data spaces).

The UMBEL tier is the subject structure and binding mechanisms for the conceptualizations embracing possible subject content. The second layer is the representation layer, made up of informal to 'formal' ontologies. The third layer consists of the data sets that provide the actual content instantiations of these subjects and their ontology representations.

Here is a diagram of this general conceptual model:

[Figure: Three-tiered Conceptual Model]

The layers in this general model progress from the organizational and referential at the top layer, useful for directing where traffic needs to go, to concrete information and data at the lower level, the real object of manipulation and analysis.

The data spaces and ontologies of various formalisms in the lower two tiers exist in part today. The upper mapping layer does not. That is the role of UMBEL.
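
As a rough illustration of how the three tiers relate, the following Python sketch walks from a subject down to candidate data sets. The class names, the sample ontologies and the example.org endpoint URLs are all hypothetical and appear nowhere in this proposal; the sketch only shows the direction of look-up, not an UMBEL implementation.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """Tier 3 (instantiation): a data set or data space."""
    name: str
    endpoint: str  # hypothetical query endpoint URL


@dataclass
class Ontology:
    """Tier 2 (representation): an ontology, loosely defined."""
    name: str
    subjects: set[str]  # tier 1 subjects this ontology binds to
    data_sets: list[DataSet] = field(default_factory=list)


def find_data_sets(subject: str, ontologies: list[Ontology]) -> list[DataSet]:
    """Walk the tiers top-down: subject -> ontologies -> data sets."""
    return [
        data_set
        for ontology in ontologies
        if subject in ontology.subjects
        for data_set in ontology.data_sets
    ]


# Illustrative data only.
ontologies = [
    Ontology("music-onto", {"Music"},
             [DataSet("albums", "http://example.org/sparql/albums")]),
    Ontology("travel-onto", {"Travel", "Music"},
             [DataSet("festivals", "http://example.org/sparql/festivals")]),
]

music_sets = [d.name for d in find_data_sets("Music", ontologies)]
```

A subject bound by no ontology simply yields an empty list, which matches the idea that the top tier only directs traffic and holds no data itself.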

By its nature the top mapping layer, in its role as a universal reference 'backbone,' must be somewhat general in the scope of its reference subjects. As UMBEL begins to be fleshed out, it is also therefore likely that additional intermediate mapping layers will emerge for specific domains, which will have more specific scopes and terminology for more accurate and complete understanding with their contributing specific data spaces. UMBEL is a natural tie-in point for these intermediate subject layers, but is not itself meant to include them in its own subject scope.

2.3 Use of RDF and SKOS

The UMBEL ontology will be written in Resource Description Framework using the RDF Schema vocabulary SKOS.

Across all ontology 'formalisms' presently in use on the Web, RDF is both a middle ground of expressiveness and the emerging consensus choice as the canonical data model for conveying semantics. RDF is self-documenting in ways which enable the creation and combination of vocabularies in a devolved manner. This is particularly important for an ontology such as UMBEL that requires significant buy-in across diverse Web communities.

RDF is usually written using the XML or N3 syntaxes. RDF can be processed by one of the many toolkits available, such as Jena (Java) or Redland (C). More information about RDF can be found in the RDF Primer.

The knowledge representation extension to RDF, RDF Schema, has also been adopted by the leading reference ontologies of FOAF (people), SIOC (communities), DOAP (projects) and GeoNames (places). Though similar representations exist in other expressions, there are also many converters to translate these other forms into the RDF middle ground.

SKOS, or the Simple Knowledge Organization System, is a formal language and schema designed to represent structured information domains as thesauri, classification schemes, taxonomies, subject-heading systems, controlled vocabularies, or others; in short, almost all of the 'loosely defined' ontology approaches that exist on the Web. As such, SKOS is the perfect RDFS vocabulary upon which to base UMBEL (cf. Appendix 3).

2.4 Comparison with Other Efforts

Appendix 2 lists many other efforts with relevant subject structure that could contribute to UMBEL. It is likely the project will pick several of these to also contribute to the lightweight UMBEL structure.

Though not the focus of UMBEL, methods for comprehensive mapping of formal ontologies will continue to be an important challenge. Detailed conceptual and logical connections and inferences will only grow in importance as greater numbers of formal ontologies become published on the Web.

Most research to date has emphasized these very real and challenging issues. UMBEL, in contrast, has chosen a very narrow scope: providing high-level subject linkages between data structures at all levels of formalism. This emphasis is less intellectually and conceptually challenging than the issues of general ontology merging and mapping that have preoccupied researchers to date.

2.5 Name and Logo

[Figure: Draft UMBEL logo]

The proposed name for this project is the Upper Mapping and Binding Exchange Layer, or UMBEL. Umbel-ed flowers, like Queen Anne's Lace or licorice, come from the Apiaceae family (what used to be called the Umbelliferae, for "umbels of umbels"). Umbel has the same Latin root as umbrella (umbra for shade, or umbella for parasol). The name is meant to convey the umbrella-like nature of UMBEL's subject bindings.

In text references, the name is proposed to be shown in all caps, similar to related FOAF, SIOC, SKOS, DOAP, etc., ontologies. UMBEL is pronounced like "humble," except without the "h". A starting point for a possible logo graphic is also shown. (BTW, the font is Century Gothic.)

Note: There is some disagreement about the use of 'upper' in the UMBEL acronym, though there is consensus support for UMBEL itself. A later poll will be taken to resolve whether the 'U' should stand for 'upper,' 'umbel' (recursive), 'umbrella' or other alternatives.

2.6 Web Address

The domain of umbel.org has been secured.

2.7 Participants and Invited Participants

These are the draft recipients and/or confirmed participants to date:

[names withheld for privacy]

Public announcement of this project has been made to the following groups and mailing lists:

Linked Open Data <linking-open-data@simile.mit.edu>
Microformats <microformats-discuss@microformats.org>
Ontolog Forum <ontolog-forum@ontolog.cim3.net>
RDFa <public-rdf-in-xhtml-tf@w3.org>
SIOC-Dev <sioc-dev@googlegroups.com>
SKOS <public-esw-thes@w3.org> and <public-swd-wg@w3.org>
TagCommons <wg@tagcommons.org>

2.8 Anticipated Level of Effort

The major effort for this project is anticipated in the first six to twelve months, while the first UMBEL ontology is being created and the Web presence established. Because the efforts for extracting high-level structure are expected to be mostly automatic, subsequent updates are expected to require minimal further effort once the extraction methods are developed.

After the first months of the project, efforts are expected to shift to multiple language versions and maintenance of the ontology registration service. These longer-term efforts are hard to predict, but will likely require different individuals and skills than the early months of establishing the UMBEL ontology.

The public announcement and release of the draft occurred on July 12, 2007. The release of the first version of the UMBEL ontology will occur Fall 2007.

3 Major Project Initiatives

The anticipated dozen or so initiatives of the UMBEL project are briefly described below.

3.1 UMBEL Ontology

The UMBEL ontology has two parts. The first 'core' part is a flat listing or pool of concrete subject topics that are the proxy binding points for external data sets. The second 'unofficial' part is a reference look-up structure of hierarchical and interlinked subject relationships. There may arise multiple unofficial reference structures relating to the core UMBEL 'bag of subjects.'

In addition to their use as a binding layer, this standard listing of subjects can also be referenced by resources described by other ontologies (e.g., dc:subject or foaf:interest).

Though a guiding focus for the project, this task is largely a synthetic one, integrating the work on subject structure (3.2), the syntax for the binding mechanisms (3.3), and possible syntax extensions via standards activities (3.9).

Appendix 1 presents a general overview of the subject structure of this ontology as understood as of this version. This structure will be refined during the early efforts of this project.

3.1.1 'Core' Ontology

UMBEL's high-level 'core' subject pool is itself extracted and derived from existing, accepted knowledge bases, including Wikipedia, WordNet, library classification systems and other subject references. UMBEL's subject topic extraction and selection mechanisms in building its 'core' ontology are automated to allow periodic updates as the subject coverage of its underlying source knowledge bases evolves.

There are three main issues in constructing this 'bag of subjects' core portion of the UMBEL ontology:

  • Selecting the candidate subject-rich data sets
  • Finding the 'consensus' intersection of subjects amongst these sets
  • 'Pruning' the subject pool to a meaningful and tractable pool of subject 'proxies.'

The approaches and methodologies to be taken in addressing these issues are described in Appendix 1.
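
To make the second and third issues concrete, here is a minimal Python sketch of one possible consensus-and-prune pass. It is an assumption for illustration, not the methodology of Appendix 1: the voting rule (a term must appear in at least `min_sources` sources), the lowercase normalization, the usage counts and the toy term lists are all invented here.

```python
from collections import Counter


def consensus_subjects(sources, min_sources):
    """Keep terms that appear in at least `min_sources` of the source sets."""
    counts = Counter()
    for terms in sources.values():
        # Normalize and de-duplicate so each source votes once per term.
        counts.update({t.strip().lower() for t in terms})
    return {term for term, n in counts.items() if n >= min_sources}


def prune(subjects, usage, max_size):
    """Rank by usage (ties broken alphabetically) and keep the top `max_size`."""
    ranked = sorted(subjects, key=lambda s: (-usage.get(s, 0), s))
    return ranked[:max_size]


# Toy candidate sets; real sources would be far larger.
sources = {
    "wordnet": {"Music", "Dog", "Entity"},
    "wikipedia": {"music", "Dog", "Albert Einstein"},
    "library": {"Music", "Chemistry"},
}
usage = {"music": 120, "dog": 45}  # hypothetical frequency statistics

consensus = consensus_subjects(sources, min_sources=2)
pool = prune(consensus, usage, max_size=1)
```

In this toy run an abstract term ("Entity") and a specific person ("Albert Einstein") drop out only because a single source lists them; a real selection process would need explicit rules for concreteness and for excluding individuals.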

3.1.2 'Unofficial' Reference Look-up Structure

While making for a cleaner, less controversial binding layer, the flat 'core' UMBEL ontology is also harder to use for look-up or discovery. Thus, the second, 'unofficial' complement to the 'core' is a reference UMBEL structure that does utilize the SKOS properties of broader, narrower, related or hasTopConcept. Over time it is quite likely there could be multiple look-up structures created by different organizations or individuals to serve specific purposes. In any case, such look-up structures are totally independent of the 'core' subject proxy bindings.
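
The separation between the flat 'core' and an optional look-up structure can be sketched as follows. This is an illustrative Python model only: the subject names are invented, and the `broader` mapping merely stands in for statements made with the SKOS broader property.

```python
# The flat 'core': a plain set with no internal relationships.
core = {"Science", "Chemistry", "Organic chemistry", "Music"}

# One possible look-up structure layered over the core (child -> broader subjects).
broader = {
    "Chemistry": {"Science"},
    "Organic chemistry": {"Chemistry"},
}


def narrower_of(subject, broader):
    """Direct children of `subject`, i.e. the inverse of the broader relation."""
    return sorted(child for child, parents in broader.items() if subject in parents)


def ancestors(subject, broader):
    """All subjects reachable by following broader links transitively."""
    seen, stack = set(), [subject]
    while stack:
        for parent in broader.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


organic_ancestors = ancestors("Organic chemistry", broader)
```

Because `core` never refers to `broader`, a third party could swap in a different look-up structure without touching any existing subject proxy bindings.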

3.2 Subject Structure

Unlike traditional upper-level ontologies, the UMBEL high-level subject 'backbone' is not meant to be comprised of abstract concepts or a logical completeness of the "nature of knowledge". Rather, it is meant to be only the thinnest veneer of a 'bag of subjects' derived from external authoritative sources. The subject topics themselves will strive for topic consensus, domain completeness, and a sense of balance.

3.2.1 Subject Topics

According to Park and Durusau, building on work from the topic map community, a "subject" is something that can be discussed in a conversation. Subjects are represented by collections of properties, which are gathered under a named representation called a subject "proxy".

For UMBEL's purposes and to be tractable, "subjects" are a subset of all possible concepts or topics. Subjects are not individual objects or items or instances. They represent aggregations or classes of similar items. Such aggregate classes "roll up" into still more inclusive ('broader') classes, with the most aggregate of such representing the 'high-level' subjects.

This 'umbrella' subject structure could be thought of as the reference subject 'super-structure' to which other specific ontologies could place themselves in a sort of locational or 'info-spatial' context. That is not to say that more specific subject references won't emerge or be appropriate for specific domains; in fact, many authoritative ones already exist. The intent is that the UMBEL reference subject pool is the tie-in point for such specific maps.

Events, specific places, specific people, specific organizations or other specific items are defined as outside of the composition of UMBEL's subject structure. (This is despite the fact that such specific items may themselves be containers of many properties.) For instance, UMBEL is not intended to provide a mapping to any individual's FOAF profile, though a data set of FOAF profiles would be a suitable subject.

UMBEL's subject vocabulary is meant to be quite small, likely more than a few hundred reference subjects, but likely fewer than many thousands. A high-level and lightweight subject mapping layer does not warrant difficult (and potentially contentious) specificity. The point is not to comprehensively define the scope of all knowledge, but to provide the fewest choices necessary for what subject or subjects a given domain ontology may appropriately reference while still providing coverage across all possible domains.

UMBEL's overriding structural topology is flat. The ontology also includes a further controlled vocabulary of synonyms and synsets to assist context and disambiguation. The total term set, then, will likely range into the tens of thousands.

3.2.2 Non-subject Topics

UMBEL is intended as the subject reference ontology to complement a group of key RDFS reference ontologies that are emerging. These include DOAP, SIOC, FOAF and GeoNames for the concepts of projects, communities, people and places, respectively. Other standard referents may also emerge in areas such as time and events, organizations, or products.

One commitment of the UMBEL project is re-use. It will not repeat what other RDFS ontologies such as FOAF and SIOC are already doing well. UMBEL is the subject complement to those other structures, though with more content given its reference high-level subjects pool.

UMBEL thus provides particular guidance and tools for subject mapping and general support and guidance for other reference ontologies. In combination, these provide a sort of "best practices" set of reference ontologies for registering other large domain ontologies and data sets on the Web. This general framework is shown by the following diagram.

[Figure: Lightweight Binding to an Upper Subject Structure Can Bring Order]

3.3 Binding Mechanisms

The various subject- or domain-related ontologies that might bind to the UMBEL structure reside in the real world, and have a broad diversity of domains, topics and formality of structure. Thus, UMBEL should: a) provide a binding mechanism responsive to this real-world range of formalisms (that is, make no grand assumptions or requirements of structure); and b) place the responsibility to register and map subjects on the publisher of the contributing content ontology.

Moreover, the UMBEL binding mechanism should in no way impose any limits on what a specific community might do itself with respect to its own ontology scope, definition, format, schema or approach. There also should be no limits to the number of subject references a given data set or ontology can make to UMBEL.

As with other aspects of UMBEL, it is likely that some testing and evaluation of 'what works' will be needed to refine the actual binding mechanisms chosen. For example, many data sets clearly deal with multiple subjects or topics. Should UMBEL's bindings be at the class or property level, or at the level of multiple assignments across the entire data set? A multiple assignment approach makes granularity and splitting difficult, but overcomes limitations in some source data sets that may not have aggregation at the class level.

3.3.1 Subject Proxies

The UMBEL binding approach builds on the idea of subject "proxies" from Park and Durusau. Proxies serve as binding points for actual subject content. A subject proxy is a computer representation of a subject that can be implemented as an object, has an identity, and is addressable as a URI.

Subject "proxies" should not be confused with the actual subjects themselves. The proxy has no further semantics than its possible binding. No assertion is made about the accuracy, relevance or completeness of any resource claiming a relationship to a subject proxy. How such actual negotiations may occur resides outside of the scope of UMBEL's simple mapping and binding reference layer.

Each contributing source binding to a given UMBEL subject proxy has its own schema and its own definition of subjects. UMBEL's ontology should have sufficient scope, structure and supporting controlled vocabulary to facilitate accurate mapping to its proxy subjects, but the accuracy of that binding is still the responsibility of the outside party. A key assumption is that inaccurate assignments will be corrected due to self-interest. (Subject spamming is a separate topic.)

The UMBEL ontology thus provides the standard set of high-level subject reference "proxies". Subject proxies thus allow autonomy and a RESTful design for contributing data sets and ontologies, as this diagram shows:

[Figure: Example Binding Structure]

This diagram shows two contributing data sets and their respective ontologies (no matter how "formal") binding to a given subject "proxy". The UMBEL binding mechanism is nothing more than a simple binding API with a restricted syntax specific to each binding target (see next).

The initial view is that a subject proxy is represented in SKOS as skos:Concept. This representation is the lowest ontological commitment one can make about the proxy. More formal definitions (such as OWL ontologies) may be sort of "indexed" against this representation. The skos:Concept is also the binding point for external ontologies at any degree of formalism.
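
A subject proxy declared as a skos:Concept might serialize along the following lines. This Python sketch emits N-Triples strings using only the standard library; the RDF and SKOS namespace URIs are the published ones, while the `http://example.org/umbel/subject/` namespace and the "Music" subject are placeholders, since this proposal does not define the actual proxy URIs.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
SUBJECT_NS = "http://example.org/umbel/subject/"  # hypothetical proxy namespace


def proxy_ntriples(slug, label, lang="en"):
    """Return N-Triples lines declaring one subject proxy as a skos:Concept."""
    uri = SUBJECT_NS + slug
    # Escape backslashes and double quotes inside the literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"<{uri}> <{RDF_TYPE}> <{SKOS}Concept> .",
        f'<{uri}> <{SKOS}prefLabel> "{escaped}"@{lang} .',
    ]


lines = proxy_ntriples("Music", "Music")
```

Nothing beyond the type and a label is asserted, which reflects the low ontological commitment described above.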

3.3.2 Mapping Approach

The basic binding reference to UMBEL is via a predicate akin to umbel:isAbout. (It is not clear whether this needs to be a suggested syntax addition to SKOS or should be instantiated directly in UMBEL.) UMBEL will also prescribe other syntax related to identification of query endpoints, type of query endpoint, ontology formalism and so forth.
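
One way a binding statement could look is sketched below in Python. Both predicate URIs are invented for illustration: the proposal only says the predicate is "akin to" umbel:isAbout and that endpoint-related syntax will be prescribed later, so `isAbout` and `queryEndpoint` under example.org are assumptions, as are the data set and endpoint URLs.

```python
UMBEL_IS_ABOUT = "http://example.org/umbel/ns#isAbout"        # hypothetical predicate
UMBEL_ENDPOINT = "http://example.org/umbel/ns#queryEndpoint"  # hypothetical predicate
SUBJECT_NS = "http://example.org/umbel/subject/"              # hypothetical namespace


def bind(dataset_uri, subject_slugs, endpoint=None):
    """Build (subject, predicate, object) triples binding a data set to proxies."""
    triples = [(dataset_uri, UMBEL_IS_ABOUT, SUBJECT_NS + slug) for slug in subject_slugs]
    if endpoint is not None:
        triples.append((dataset_uri, UMBEL_ENDPOINT, endpoint))
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples text."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = bind(
    "http://example.org/data/festivals",
    ["Music", "Travel"],
    endpoint="http://example.org/sparql",
)
text = to_ntriples(triples)
```

A data set may carry any number of such statements, in keeping with the stated intent of placing no limit on subject references.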

Besides syntax, some semi-automated tools will be developed that will enable a source data set or ontology to be mapped against the UMBEL controlled vocabulary. While matches to subject proxies may be automatically determined and prompted, acceptance and then tagging of the sources would be the responsibility of the publisher/creator.
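
The division of labor described here, with automated suggestion followed by publisher acceptance, might be sketched as below. The vocabulary entries, the exact-match rule on synonyms and the function names are all assumptions for illustration; real tools would likely need fuzzier matching.

```python
# Hypothetical slice of a controlled vocabulary: subject -> synonyms.
vocabulary = {
    "Music": {"music", "songs", "musical works"},
    "Automobiles": {"automobiles", "cars", "motor vehicles"},
}


def suggest(source_terms, vocabulary):
    """Propose subject proxies for source terms via exact synonym matches."""
    index = {
        synonym.lower(): subject
        for subject, synonyms in vocabulary.items()
        for synonym in synonyms
    }
    suggestions = {}
    for term in source_terms:
        subject = index.get(term.strip().lower())
        if subject is not None:
            suggestions.setdefault(subject, []).append(term)
    return suggestions


def accept(suggestions, approved):
    """Keep only the suggestions the publisher explicitly approved."""
    return sorted(subject for subject in suggestions if subject in approved)


suggestions = suggest(["Cars", "Songs", "Recipes"], vocabulary)
bindings = accept(suggestions, approved={"Music"})
```

Unmatched terms such as "Recipes" are silently skipped, and nothing is bound until the publisher approves it.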

The 'tagging' mechanism (embedded in HTML or otherwise) will be a function of the source ontology approach.

3.3.3 Universal Binding API and Multiple Targets

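UMBEL's binding mechanisms should at minimum work with the following existing Web approaches and protocols ("binding targets"):

  • Atom
  • microformats
  • OPML
  • OWL
  • RDF
  • RDFa
  • RDF Schema
  • RSS
  • tags (via Tag Commons)
  • topic maps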

Simple and lightweight protocols and registration procedures are being prepared for these approaches. Given the simplicity of design, new targets or approaches can be easily added.

3.4 Data Set Selection and Incorporation

The acceptance of the actual subjects and their structure is one key to the acceptance — and thus use and usefulness — of the UMBEL ontology. (The other key is simplicity and ease-of-use or tools.) A suitable subject structure must be adaptable and self-defining. It should reflect expressions of actual social usage and practice, which of course changes over time as knowledge increases and technologies evolve.

A premise of the UMBEL project is that suitable subject content and structures already exist within widely embraced knowledge bases. A further premise is that the ongoing use of these popular knowledge bases will enable them to grow and evolve as societal needs and practices grow and evolve.

The major starting point for the core subject pool is WordNet. It is universally accepted, has complete noun and class coverage, has an excellent set of synonyms, and has frequency statistics. It also has data regarding hierarchies and relationships useful to the UMBEL look-up reference structure, the 'unofficial' complement to the core ontology.

A second obvious foundation to building a subject structure is Wikipedia. Wikipedia's topic coverage has been built entirely from the bottom up by 75,000 active contributors writing articles on nearly 1.8 million subjects in English alone, with versions in other degrees of completeness for about 100 different languages. There is also a wealth of internal structure within Wikipedia's templates.

These efforts suggest a starting formula for the UMBEL project of W + W + S + ? (for WordNet + Wikipedia + SKOS + other?). Other potential data sets with rich subject coverage include existing library classification systems, upper-level ontologies such as SUMO, Proton or DOLCE, the contributor-built Open Directory Project, subject 'primitives' in other languages such as Chinese, or the other sources listed in Appendix 2 - Candidate Subject Data Sets.

Though the choice of the contributing data sets from which the UMBEL subject structure is to be built will never be unanimous, using sources that have already been largely selected by large portions of the Web-using public will go a long way toward establishing authoritativeness. Moreover, since the subject structure is only intended as a lightweight reference, and not a complete closed-world definition, the UMBEL project is also setting realistic thresholds for acceptance.

3.5 UMBEL Registration System

Here is the basic process a publisher or creator of an appropriate subject data set or ontology would follow to utilize UMBEL:

  1. Depending on data set format and approach, choose one of the various UMBEL API syntaxes to describe the content subjects
  2. According to this API and the UMBEL ontology subject structure, characterize the subject content and other appropriate metadata regarding the data set
  3. Install the UMBEL registration bookmarklet (see below) and publish the registration
  4. Receive registration approval
  5. Ping the UMBEL project via the bookmarklet whenever updates warrant.

Note that under points #1 and #2 the project anticipates that various tools will evolve to aid the labeling and characterization task. Also, it will be possible for third-parties to characterize data sets as well (though publishers will have final say in the event of possible characterization disputes — see 5 Community Process).

UMBEL registration and updates (points #3 and #5) build from already proven infrastructure. OpenLink's Virtuoso Sponger will aid conversion for many existing binding targets; Zitgist's PingtheSemanticWeb (PTSW) already has an RDF bookmarklet and a ping recording and publishing system; and the SIOC Project has various ping and update systems, browsers and blog plug-ins (such as Semantic Radar for WordPress) that generate the appropriate RDF or support browsing of SIOC profiles.

Atom and RSS feeds may require simple tagging by site publishers, aided by extending existing bookmarklets, in order to remove any need for parsing or information (entity) extraction.

The fourth step in the registration system is provided to prevent spamming.

3.6 UMBEL Access and Query System

Entities from individuals to organizations will find value in UMBEL's central repository of useful Web data characterized by subject. Getting to that information in various ways and formats is therefore an integral part of the project's mission. Baseline examples of these systems may be found at the PTSW and SIOC Project references.

3.6.1 Access and Query

The UMBEL repository information may be accessed and queried on-line or via export. Query and selection fields will at minimum include subject(s), data format, query endpoints, record number, publishing organization and date.

The second, 'unofficial' part of the UMBEL ontology, namely its reference look-up structure, is aimed at helping this access and discovery. Though not the actual binding layer, this part will help in navigating the core UMBEL subject listing.

The full definition of these fields will be an integral part of the overall 3.1 UMBEL Ontology definition.
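
As a rough illustration only, the minimum query fields named in this subsection could be held in a record like the one sketched below and filtered by subject or format. The class name RegistrationRecord, its field names and the select helper are hypothetical; the actual field definitions are deferred to the UMBEL Ontology definition.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class RegistrationRecord:
    """Hypothetical repository entry; all field names are assumptions."""
    record_number: int
    subjects: List[str]
    data_format: str
    query_endpoints: List[str]
    publisher: str
    registered_on: date


def select(records, subject: Optional[str] = None,
           data_format: Optional[str] = None):
    """Return the records that match every criterion that is not None."""
    hits = []
    for rec in records:
        if subject is not None and subject not in rec.subjects:
            continue
        if data_format is not None and rec.data_format != data_format:
            continue
        hits.append(rec)
    return hits


# Made-up sample data, for illustration only.
records = [
    RegistrationRecord(1, ["music", "film"], "RDF/XML",
                       ["http://example.org/sparql"], "Example Org",
                       date(2007, 7, 12)),
    RegistrationRecord(2, ["music"], "Atom", [], "Another Org",
                       date(2007, 7, 13)),
]
music_in_atom = select(records, subject="music", data_format="Atom")
```

Filtering on plain attribute equality keeps the sketch independent of any particular store; an actual service would more likely expose these fields through a query endpoint.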

3.6.2 Serializations

Data exports in RDF/XML and N3 (Notation 3) are already native to PTSW. Given the breadth of non-RDF suppliers of registered data sets to UMBEL, it may be important to expand this roster (such as Atom, GData, JSON, etc.). Determining these additional serialization options is one of the tasks for the project.

3.7 UMBEL-related Toolsets

UMBEL may develop its own tools, see its project members extend existing tools, or watch as third parties independently create new tools. Tools development has been common in this space, with the extent of tools a function of the degree of acceptance and use.

Categories of tools useful to UMBEL include:

  • Ping and Registration -- commitments and plans for these are largely in place (see below)
  • Annotation Systems -- specific to the contributing formats in relation to the UMBEL ontology
  • Structure Browsers — means to navigate and browse the UMBEL ontology and subset directory (hierarchy) structure
  • Mapping Tools — for identifying and then confirming the ontology <—> subject proxy bindings
  • Subject Extractors — for processing the contributing subject structure data sets on a periodic basis
  • Translation Systems — for conversion of the UMBEL ontology and controlled vocabulary into other languages
  • Converters — for processing existing formats into ones suitable to UMBEL (these are largely outside of the UMBEL scope).

Initial and tentative commitments for integration with existing tools and data sets have been indicated for OpenLink's OpenLink Data Spaces (ODS) and Virtuoso Sponger, PingtheSemanticWeb (PTSW), SIOC Browser, Zitgist Semantic Search, DBpedia, Semantic Radar, and [add further names here as commitments are received].

In any event, maintaining a listing of UMBEL-related tools, whether developed by the project or not, is also an important project initiative.

3.8 Standards Activities

It is unclear if UMBEL's deliberations and activities may suggest input or revision to (in particular) the SKOS namespace or specifications, or to subsidiary RDFS reference ontologies such as FOAF, SIOC, DOAP or GeoNames. One open question, for example, is the inclusion of an isAbout or similar predicate.

However, if warranted, this is the placeholder for such standards activities by the UMBEL project.

It is also likely that UMBEL will provide a use case study to the SKOS effort (see http://www.w3.org/TR/skos-ucr/), given the project's own wishes and early expressions of interest by the W3C.

3.9 Multiple Language Versions

The project welcomes and will provide support to those interested in alternative language versions.

3.10 Communications

Periodic press releases, release announcements and contacts with the trade press are anticipated. Appropriate content, white papers and FAQs will be added to the UMBEL Web site on a periodic basis.

3.11 Web Site

As the focal point for its activities, an important initiative of the UMBEL project is its Web site. More particulars are provided under the 6 Web Portal section below.

4 UMBEL Registration

Anyone can use or label their ontology using UMBEL, but to become a part of the UMBEL repository, the data set or ontology must be registered with the UMBEL service. Registration as part of the UMBEL repository may be desirable because the source data set or ontology gains:

  • Automatic inclusion in searches of the UMBEL look-up service
  • Inclusion in other services that download and use the UMBEL repository for their own purposes, and
  • A degree of authoritativeness by virtue of accepted registration.

Though the registration process is straightforward, it is not automatic. Approval is required to prevent subject spamming.

4.1 Registration Portal

The basic registration process was listed under the 3.5 UMBEL Registration System. Actual registration may occur either via an on-line manual form at the UMBEL Web site or via an input file (TBD).

Since the registered data sets or ontologies are assumed to remain relatively fixed with respect to their subject coverage and topics, updates to registrations would likely be infrequent (if they occur at all). However, upon acceptance, the submitter may also update the registration directly through a particular pinging approach (TBD).

4.2 Registering Parties

Any party may register a data set or ontology with UMBEL so long as the registration information is complete. However, in the case of disputes, the original data set or ontology publisher or creator will be given priority.

4.3 Preventing Subject Spamming

Unfortunately, to prevent spam and subject spamming, approved registration is required for listing in the UMBEL repository or on the UMBEL site.

4.4 Acceptance Policies and Procedures

A data set or ontology will be accepted for registration if it:

  • Meets minimum registration requirements according to posted guidelines
  • Is not spam, either in subject references or locations of resource URIs, and
  • Is not pornographic or disrespectful to other individuals or communities.

Submitted data sets and ontologies will be accepted as soon as practicable. Acceptance policies and procedures may change from time to time.

UMBEL's acceptance policies will be publicly posted. Violation of these guidelines is grounds for removal from UMBEL registration. All acceptance decisions are the unilateral right of UMBEL.org.

5 Community Process

[TBD: What are some workable group processes for the UMBEL project? The Ontolog or SIOC efforts look to be useful exemplars. We might as well use a successful existing effort as our guidance.]

6 Web Portal

A close exemplar to the eventual UMBEL Web portal is the SIOC Project Web site.

The first-cut UMBEL.org Web site is available on-line. Efforts are underway to update this Web site with the following functionality:

  • Standard Web site
  • Registration portal
  • Query and look-up service
  • Blog
  • Wiki
  • FAQs, info, links, papers
  • Tools listing
  • A group forum and mailing list (perhaps replacing the current Google Group)
  • Code / standards distribution point.

The UMBEL project has a group discussion forum and mailing list located at http://groups.google.com/group/umbel-ontology.

7 License and Copyright

This work is licensed under a Creative Commons Attribution License version 3.0. This copyright applies to the UMBEL Ontology Specification and accompanying documentation and does not apply to UMBEL Ontology data formats, ontology terms, or technology. Users are free to share and re-mix the UMBEL Ontology Specification without restriction so long as full attribution to UMBEL.org and the ontology's version number is made.

The data for ontologies registered with UMBEL.org is specifically reserved, as are any tools used or developed by UMBEL.org, and may not be used without UMBEL.org's express written permission.

Regarding underlying technology, the UMBEL Ontology relies heavily on W3C's RDF technology, an open Web standard that can be freely used by anyone.

8 Glossary

Atom
The name Atom applies to a pair of related standards. The Atom Syndication Format is an XML language used for web feeds, while the Atom Publishing Protocol (APP for short) is a simple HTTP-based protocol for creating and updating Web resources.
Binding
Binding is the creation of a simple reference to something that is larger and more complicated and used frequently. The simple reference can be used instead of having to repeat the larger thing.
Data Space
A data space may be personal, collective or topical, and is a virtual "container" for related information irrespective of storage location, schema or structure.
DOAP
DOAP (Description Of A Project) is an RDF schema and XML vocabulary to describe open-source projects.
FOAF
FOAF (Friend of a Friend) is an RDF schema for machine-readable modelling of homepage-like profiles and social networks.
Folksonomy
A folksonomy is a user-generated set of open-ended labels called tags organized in some manner and used to categorize and retrieve Web content such as Web pages, photographs, and Web links.
GeoNames
GeoNames integrates geographical data such as names of places in various languages, elevation, population and others from various sources.
GRDDL
GRDDL is a markup format for Gleaning Resource Descriptions from Dialects of Languages; that is, for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT.
High-level Subject
A high-level subject is both a subject proxy and category label used in the hierarchical subject classification scheme (taxonomy) used by the UMBEL ontology. Higher-level subjects are classes for more atomic subjects, with the height of the level representing broader or more aggregate classes.
Microformats
A microformat (sometimes abbreviated µF or uF) is a piece of mark up that allows expression of semantics in an HTML (or XHTML) web page. Programs can extract meaning from a web page that is marked up with one or more microformats.
Ontology
An ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. Loosely defined, which is the preference of the UMBEL project, ontologies on the Web can have a broad range of formalism, or expressiveness or reasoning power.
OPML
OPML (Outline Processor Markup Language) is an XML format for outlines, and is commonly used to exchange lists of web feeds between web feed aggregators.
OWL
The Web Ontology Language (OWL) is designed for defining and instantiating formal Web ontologies. An OWL ontology may include descriptions of classes, along with their related properties and instances. There are generally three dialects recognized: OWL Lite, OWL DL (description logic) and OWL Full.
RDF
Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata model but which has come to be used as a general method of modeling information, through a variety of syntax formats. The RDF metadata model is based upon the idea of making statements about resources in the form of subject-predicate-object expressions, called triples in RDF terminology. The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object.
RDFa
RDFa is a set of extensions to XHTML being proposed by W3C. RDFa uses attributes from XHTML's meta and link elements, and generalises them so that they are usable on all elements allowing XHTML annotation markup with semantics.
RDF Schema
RDFS or RDF Schema is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources.
RSS
RSS (an acronym for Really Simple Syndication) is a family of web feed formats used to publish frequently updated digital content, such as blogs, news feeds or podcasts.
SIOC
Semantically-Interlinked Online Communities Project (SIOC) is based on RDF and is an ontology defined using RDFS for interconnecting discussion methods such as blogs, forums and mailing lists to each other.
SKOS
SKOS or Simple Knowledge Organisation System is a family of formal languages designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary; it is built upon RDF and RDFS.
SPARQL
SPARQL (pronounced "sparkle") is an RDF query language; its name is a recursive acronym that stands for SPARQL Protocol and RDF Query Language.
Subject
A subject is always a noun or compound noun and is a reference or definition to a particular object, thing or topic, or groups of such items. As used by UMBEL, subjects are meant to be concrete and specific, and not conceptual or abstract.
Subject Extraction
Subject extraction is an automatic process for retrieving and selecting subject names from existing knowledge bases or data sets. Extraction methods involve parsing and tokenization, and then generally the application of one or more information extraction techniques or algorithms.
Subject Proxy
A subject proxy as used by UMBEL is a canonical name or label for a particular object; other terms or controlled vocabularies may be mapped to this label to assist disambiguation. A subject proxy is always representative of its object but is not the object itself.
Subject Structure
The subject structure is the topology of the graph that represents the subjects within a given data set. In UMBEL 'core' this topology is flat; in the 'unofficial' reference look-up structure the UMBEL topology is hierarchical and interlinked.
Tag
A tag is a keyword or term associated with or assigned to a piece of information (e.g. a picture, article, or video clip), thus describing the item and enabling keyword-based classification of information. Tags are usually chosen informally by either the creator or consumer of the item.
Topic
The topic (or theme) is the part of the proposition that is being talked about (predicated). In topic maps, the topic may represent any concept, from people, countries, and organizations to software modules, individual files, and events. Topics and subjects are closely related. Topic is deprecated in UMBEL to limit confusion with topic maps.
Topic Map
Topic maps are an ISO standard for the representation and interchange of knowledge. A topic map represents information using topics, associations (similar to a predicate relationship), and occurrences (which represent relationships between topics and information resources relevant to them), quite similar in concept to the RDF triple.
UMBEL
UMBEL is the name for both the project and its associated high-level subject ontology. The acronym stands for Upper Mapping and Binding Exchange Layer.
YAGO
"Yet another great ontology" is a WordNet structure placed on top of Wikipedia.

Reference Links

Here are some of the main papers referred to in the proposal:

Appendix 1 - UMBEL Ontology Construction Methodology

The intent is for the methodologies used to construct the UMBEL ontology to be simple, transparent and straightforward. Their basis and use should be readily understandable and thus, hopefully, widely embraced.

'Core' Subject Ontology

The purpose of the 'core' UMBEL subject ontology is to provide a suitable number of consensus and concrete subject proxies that can bind to the subject matter of any conceivable external ontology. The subject coverage is not meant to be comprehensive nor complete to the constituent external ontology, but referentially accurate with respect to the subject matter(s) of its domain.

The 'core' UMBEL subject ontology is flat. It does not use the broader, narrower or hasTopConcept properties from SKOS; it relies on the skos:Concept and prefLabel and altLabel to handle synonyms (or "synsets").

The general approach to construct this 'core' subject ontology is to:

  1. Use WordNet as the governing controlled vocabulary; its classes would be the candidate SKOS prefLabels, the synsets would be the altLabels
  2. The subject terms in each contributing data set would be mapped to this prefLabel
  3. In each data set, the prefLabel would be given a 'value' x/total (to normalize between small and big subject term sets)
  4. Then, the subjects from all candidate data sets using the prefLabel would be summed and ranked
  5. Depending on WordNet frequency counts (and its Zipf distribution), determine the number of subject proxies in the pool
  6. By varying different contributing data sets, the project would analyze what subjects thus get included, what subjects get left out
  7. Based on a few combinations, the project would then make its version 1 choice for the 'core' UMBEL subject pool.

The candidate subject terms from the candidate data sets would be chosen without respect to hierarchy, but may have abstract, 'too inclusive' concepts removed (e.g., thing, entity, concept, object, event).

Though subject term intersections and frequencies will be the guiding decision factors as to which subject proxies are kept, the target number of total subjects is estimated to be about 5,000 or fewer. (This estimate derives from a nominal three-level hierarchy with 20 choices at each level, or 20 + 400 + 8,000 = 8,420 possible subjects, reduced to take into account sparsity of coverage.) According to Zipf's Law, 5,000 subjects would account for more than 95% of the total class coverage amongst the 45,000 or so unique classes and sub-classes within the WordNet corpus (see further http://www.umbel.org/WordNetClasses.xls ). Other candidate bases for selecting the UMBEL subject pool (see Appendix 2) range from about 200 basic Chinese subject "primitives" to the nearly 600,000 categories in the Open Directory Project.

The 'pruning' will be assisted by making sure that specific instances (individual persons or places) are not included in the 'core' UMBEL ontology.
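
The normalization and ranking steps (3 and 4) in the list above can be sketched as follows. This is a minimal illustration, not the project's method: it assumes that the 'value' x/total means a label's count divided by the total count within its data set, and the function name rank_subjects and the sample data are invented for the example.

```python
from collections import defaultdict


def rank_subjects(datasets):
    """Sum per-data-set normalized shares for each prefLabel and rank them.

    `datasets` maps a data set name to a dict of {prefLabel: count}.
    Each count is divided by its data set's total (step 3), and the
    resulting shares are summed across data sets (step 4).
    """
    scores = defaultdict(float)
    for counts in datasets.values():
        total = sum(counts.values())
        if total == 0:
            continue  # an empty data set contributes nothing
        for label, count in counts.items():
            scores[label] += count / total
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up counts: a small and a large contributing data set.
datasets = {
    "small_set": {"music": 3, "film": 1},
    "big_set": {"music": 100, "film": 300, "sport": 600},
}
ranking = rank_subjects(datasets)
pool = [label for label, _ in ranking[:2]]  # keep the top N labels
```

Because each data set's shares sum to 1, a large source cannot swamp a small one; varying which data sets are passed in corresponds to the comparison described in step 6.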

'Unofficial' Reference Look-up Structure

Since it is likely that the WordNet classes will be inclusive for the final subjects in the candidate subject proxy pool, the existing RDFS relations in the Yago data set may be directly translatable to the SKOS properties of broader, narrower, related or hasTopConcept.

Care will need to be taken to avoid too many cyclic relationships, such as this one:
orientation > position > placement > proportion > content > object > living thing > organism > person > unpleasant person > bore.
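
One way to check a look-up structure for relations that loop back on themselves is a depth-first search over the 'broader' links. The sketch below is only an illustration: it assumes the relations are held as a plain dictionary from each label to its broader labels, and the function name find_cycle is hypothetical.

```python
def find_cycle(broader):
    """Return one cycle as a list of labels, or None if there is none.

    `broader` maps each label to the labels it names as broader.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for parent in broader.get(node, ()):
            colour = state.get(parent, WHITE)
            if colour == GREY:
                # Back edge: the cycle runs from `parent` to here and back.
                return path[path.index(parent):] + [parent]
            if colour == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for start in list(broader):
        if state.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


chain = {"bore": ["unpleasant person"], "unpleasant person": ["person"],
         "person": ["organism"]}
loop = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

The recursive form is fine for small graphs; a structure with tens of thousands of classes would call for an iterative traversal to stay within Python's recursion limit.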

Appendix 2 - Candidate Subject Data Sets

The basic approach to determining the listing of subjects within the UMBEL core is to intersect subject listings from accepted subject-rich sources, after unifying subject terminology via WordNet "synsets" and normalizing weights by the size of each contributing source's subject pool. It is nearly certain that Wikipedia will also participate as a contributing source with WordNet. It, and other potential contributing sources, are briefly described below.
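
The unify-then-intersect idea can be sketched in a few lines. Everything here is an assumption made for illustration: synsets are modelled as a dictionary from a canonical label to its alternate terms, matching is a case-insensitive exact match, the names unify and shared_subjects are hypothetical, and weight normalization is left out.

```python
def unify(terms, synsets):
    """Map each term to its canonical label; unknown terms are dropped."""
    lookup = {}
    for canonical, alternates in synsets.items():
        lookup[canonical.lower()] = canonical
        for alt in alternates:
            lookup[alt.lower()] = canonical
    return {lookup[t.lower()] for t in terms if t.lower() in lookup}


def shared_subjects(sources, synsets, min_sources=2):
    """Return canonical labels found in at least `min_sources` sources."""
    counts = {}
    for terms in sources.values():
        for label in unify(terms, synsets):
            counts[label] = counts.get(label, 0) + 1
    return sorted(label for label, n in counts.items() if n >= min_sources)


# Made-up synsets and source term lists.
synsets = {"car": ["automobile", "motorcar"],
           "film": ["movie", "motion picture"]}
sources = {
    "source_a": ["Automobile", "movie", "opera"],
    "source_b": ["car", "Motion Picture"],
    "source_c": ["motorcar"],
}
common = shared_subjects(sources, synsets)
```

Raising min_sources tightens the intersection; with the sample data, requiring all three sources leaves only "car".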

WordNet

WordNet is a semantic lexicon for the English language, maintained by Princeton University. It was begun in 1985 and is now in its 3.0 version. The lexicon is heavily annotated, breaking words into noun, verb, adjective and adverb parts-of-speech (POS). The lexicon contains about 150,000 words organized in over 115,000 "synsets" (synonym sets) for a total of 207,000 word-sense pairs; these are further organized hierarchically and via relations into about 45,000 unique classes and sub-classes.

WordNet is an almost universally referenced English lexicon, and versions in other languages have been developed. There are robust data sets and toolkits for manipulating its database, including Prolog versions that provide frequency counts. Many of the existing upper-level ontologies and other text-based data sets have been mapped to WordNet.

Wikipedia

Another obvious foundation for building a subject structure is Wikipedia. That is because the starting basis of Wikipedia information (namely, what is a deserving topic) has been built entirely from the bottom up. This has served Wikipedia and the world extremely well, with now nearly 1.8 million articles online in English alone (versions exist for about 100 different languages). There is also a wealth of internal structure within Wikipedia's "infobox" templates, structure that has been utilized by DBpedia (among others) to actually transform Wikipedia into an RDF database. Because it is socially driven and evolving, Wikipedia should continue to be the substantive core at the center of a knowledge organizational framework for some time to come.

Open Directory Project

The Open Directory Project is the longest-standing, contributor-driven site of subject-rich content on the Web. It presently contains 590,000 categories maintained (or not!) by some 75,000 editors. Its subject directories are hierarchical, often deep, and annotated with interlinkages. Many of the listed categories are specific instances that would naturally be excluded from consideration for the UMBEL core.

Restricting the ODP to its first three to five levels may prove to be a useful means of pruning possible subject listings. Other techniques will be reviewed. (It may also be possible to get frequency use information by subject or category from the project leaders.) The Open Directory Project subject data is available as an RDF source.
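
Pruning a directory to its top levels amounts to a depth filter on category paths. The sketch below assumes, purely for illustration, that categories are available as slash-separated path strings; the function name prune_by_depth and the sample paths are hypothetical and do not reflect the actual layout of the ODP data.

```python
def prune_by_depth(category_paths, max_depth=3, separator="/"):
    """Keep only categories whose path has at most `max_depth` levels."""
    kept = []
    for path in category_paths:
        # Ignore empty segments caused by leading or trailing separators.
        levels = [part for part in path.split(separator) if part]
        if len(levels) <= max_depth:
            kept.append(path)
    return kept


# Made-up category paths.
paths = ["Arts", "Arts/Music", "Arts/Music/Styles", "Arts/Music/Styles/Jazz"]
top_three_levels = prune_by_depth(paths, max_depth=3)
```

Running the filter with max_depth set to 3, 4 and 5 in turn would show how many candidate subjects each cut-off retains.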

Library Classification Systems

Various systems of subject indexing have been used in libraries for more than a century. One of the most used universal library classification systems is the Dewey Decimal Classification (DDC). The DDC organizes all areas of knowledge into ten classes, which are then further subdivided. Elaborated rules can be used to combine subject elements, for instance geographical and temporal index terms, to represent more complex topics.

Another system is the Library of Congress Subject Headings (LCSH). The LCSH is a reference set of subjects, used by but different from the Library of Congress Classification system, that comprises a thesaurus of subject headings to apply to bibliographic records. The LCSH is maintained and officially published by the United States Library of Congress.

The DDC and LCSH as well as other library systems are promising candidates to connect semantic resources on the Web with millions of publications that are described in library catalogues.

Various Upper Level Ontologies

Of course, considerable effort has been expended by the ontological community over the past decade or more to build various upper-level ontologies. An upper ontology is an attempt to create an ontology which describes very general concepts applicable across all domains, and often in a hierarchical manner. Though no upper-level ontology has yet gained universal acceptance, a number of them are prominent and have credibility within the community.

Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON (PROTo ONtology), Cyc and BFO (Basic Formal Ontology). Most of the content in their upper levels is more akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus) than to “generic common knowledge.” Nearly all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak.

Nonetheless, these could also be useful sources of subject coverage, especially if their editors might work to prune their subject content to more concrete topics in keeping with UMBEL's objectives.

Chinese and Other Language 'Primitives'

The Chinese language (and related derivations such as Japanese Kanji) is built from some 214 'primitives' or 'radicals', also known as semantic root ideograms. In turn, these are used to construct the 3,000 to 4,000 characters in common use by the language. Similar constructs exist in other logographic languages.

Most of these primitives represent concrete "things" and could therefore inform an intersection of subject candidates. One issue is that such roots lack modern extensions; another issue is that the potential subject number and degree of aggregation may be "higher" than what is anticipated for UMBEL.

The emergence of Chinese as a leading global language, and the fact that a few thousand characters (or "subjects") can be fairly readily derived, suggest giving some attention to such sources as contributors to the UMBEL subject data sets. In any event, it is likely appropriate to consider frequencies and subjects from non-Western languages as well.

YAGO

One innovative approach to provide a more hierarchical structural underpinning to Wikipedia has been YAGO ("yet another great ontology"), an effort from the Max-Planck-Institute Saarbrücken. YAGO matches key nouns between Wikipedia and WordNet, and then uses WordNet's well-defined taxonomy of synsets (term clusters relating to individual concepts) to superimpose the hierarchical class structure. The match is more than 95% accurate; YAGO is also designed for extensibility with other quality data sets.

See Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum, "Yago - A Core of Semantic Knowledge," presented at the 16th International World Wide Web Conference (WWW 2007) in Banff, Alberta, on May 8-12, 2007.

YAGO contains over 900,000 entities (like persons, organizations, cities, etc.) and 6 million facts about these entities, organized under a hierarchical schema. YAGO is available for download (400Mb) and converters are available for XML, RDFS, MySQL, Oracle and Postgres. The YAGO data set may also be queried directly online.

Others

A partial list of other candidate subject sources also includes SENSUS, the Metadata Registry maintained by the National Science Digital Library (NSDL), OntoClean and CleanOnto, MetaNet, Penn Treebank, WordSmith, Omega, and developments in the High-level Thesaurus Project (HILT at http://hilt.cdlr.strath.ac.uk/), among possibly others. Many of these are retired projects, but may provide important insights for UMBEL.

Again, the objective is not to intersect all possible subject-rich sources, but to find a relatively small number of balanced and well-accepted ones that can lead to a tractable final subject pool within UMBEL.

Appendix 3 - Reference Ontologies

UMBEL is meant to be a subject-oriented mapping and binding reference layer only. Other ontologies exist that refer to such concepts as people, places, communities and projects. These existing reference ontologies will also be included within UMBEL as best practices. Finally, note there are some other areas that would also benefit from a reference ontology, but which do not yet exist.

SKOS

SKOS, or the Simple Knowledge Organization System, is a formal language and schema designed to represent such structured information domains as thesauri, classification schemes, taxonomies, subject-heading systems, controlled vocabularies, or others; in short, nearly all of the "loosely defined" ontology approaches discussed herein. It is a W3C initiative more fully defined in its SKOS Core Guide. SKOS is built upon the RDF data model of the subject-predicate-object "triple." The subjects and objects are akin to nouns, the predicate a verb, in a simple Dick-sees-Jane sentence. Subjects and predicates by convention are related to a URI that provides the definitive reference to the item. Objects may be either a URI resource or a literal (in which case it might be some indexed text, an actual image, a number to be used in a calculation, etc.).

Being an RDF Schema simply means that SKOS adds some language and defined relationships to this RDF baseline. This is a bit of recursive understanding, since RDFS is itself defined in RDF by virtue of adding some controlled vocabulary and relations. The power, though, is that these schema additions are also easily expressed and referenced.

This RDFS combination can thus be shown as a standard RDF triple graph, but with the addition of the extended vocabulary and relations:

Standard RDF Graph Model
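
To make the triple model concrete, the sketch below represents a triple in plain Python and prints it in a simplified, N-Triples-like form. It is an illustration under stated assumptions: the URI, Literal and Triple types and the to_ntriples helper are hypothetical, the example.org URIs are placeholders, and literal escaping, datatypes and language tags are ignored.

```python
from typing import NamedTuple, Union


class URI(NamedTuple):
    value: str


class Literal(NamedTuple):
    value: str


class Triple(NamedTuple):
    subject: URI
    predicate: URI
    object: Union[URI, Literal]  # an object may be a resource or a literal


def to_ntriples(triple):
    """Render one triple as a simplified N-Triples-like line (no escaping)."""
    obj = triple.object
    obj_text = f"<{obj.value}>" if isinstance(obj, URI) else f'"{obj.value}"'
    return f"<{triple.subject.value}> <{triple.predicate.value}> {obj_text} ."


# The "Dick sees Jane" sentence, with placeholder URIs.
dick_sees_jane = Triple(URI("http://example.org/Dick"),
                        URI("http://example.org/sees"),
                        URI("http://example.org/Jane"))
# The same subject with a literal object, using the SKOS prefLabel property.
dick_label = Triple(URI("http://example.org/Dick"),
                    URI("http://www.w3.org/2004/02/skos/core#prefLabel"),
                    Literal("Dick"))
```

Calling to_ntriples(dick_sees_jane) yields a line with three bracketed URIs, while the label triple ends in a quoted string, which mirrors the resource-versus-literal distinction described above.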

The power of the approach arises from the ability of the triple to express virtually any concept, further extended via the RDFS language defined for SKOS. SKOS includes concepts such as "broader" and "narrower", which enable hierarchical relations to be modeled, as well as "related" and "member" to support networks and arrays, respectively. We can visualize this transforming power by looking at how an "ontology" in a totally foreign scheme can be related to the canonical SKOS scheme. In the figure below the left-hand portion shows the native hierarchical taxonomy structure of the UK Archival Thesaurus (UKAT), next as converted to SKOS on the right (with the overlap of categories shown in dark purple). Note the hierarchical relationships visualize better via a taxonomy, but that the RDF graph model used by SKOS allows a richer set of additional relationships including related and alternative names:

Example Structural Comparison of Hierarchical Taxonomy with Network Graph
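
Since "broader" and "narrower" are inverse relations in SKOS, a hierarchy stored in one direction can be turned around mechanically. The sketch below assumes the relations are held as a dictionary from each concept label to its broader labels; the function name invert_broader and the sample labels are made up for the example.

```python
def invert_broader(broader):
    """Derive the narrower relation from a {concept: [broader, ...]} mapping."""
    narrower = {}
    for concept, parents in broader.items():
        for parent in parents:
            narrower.setdefault(parent, []).append(concept)
    # Sort the children so the result does not depend on input order.
    return {parent: sorted(children) for parent, children in narrower.items()}


# Made-up hierarchy: each concept points to its broader concept(s).
broader = {
    "jazz": ["music"],
    "opera": ["music", "theatre"],
    "music": ["arts"],
}
narrower = invert_broader(broader)
```

Note that "opera" has two broader concepts here, which a strict tree could not express but a graph model can.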

SKOS also has a rich set of annotation and labeling properties to enhance human readability of schema developed in it.

Combined, these constructs provide powerful mechanisms for giving contributory ontologies a common conceptualization. When added to other sibling RDF schema such as FOAF or SIOC or DOAP, still additional concepts can be collated.

The SKOS language has the following classes:

  • CollectableProperty — A property which can be used with a skos:Collection
  • Collection — A meaningful collection of concepts
  • Concept — An abstract idea or notion; a unit of thought
  • ConceptScheme — A set of concepts, optionally including statements about semantic relationships between those concepts. Thesauri, classification schemes, subject heading lists, taxonomies, 'folksonomies', and other types of controlled vocabulary are all examples of concept schemes. Concept schemes are also embedded in glossaries and terminologies.
  • OrderedCollection -- An ordered collection of concepts, where both the grouping and the ordering are meaningful

. . . and the following properties:

  • altLabel — An alternative lexical label for a resource. Acronyms, abbreviations, spelling variants, and irregular plural/singular forms may be included among the alternative labels for a concept
  • altSymbol — An alternative symbolic label for a resource
  • broader — A concept that is more general in meaning. Broader concepts are typically rendered as parents in a concept hierarchy (tree)
  • changeNote — A note about a modification to a concept
  • definition — A statement or formal explanation of the meaning of a concept
  • editorialNote — A note for an editor, translator or maintainer of the vocabulary
  • example — An example of the use of a concept
  • hasTopConcept — A top level concept in the concept scheme
  • hiddenLabel — A lexical label for a resource that should be hidden when generating visual displays of the resource, but should still be accessible to free text search operations
  • historyNote — A note about the past state/use/meaning of a concept
  • inScheme — A concept scheme in which the concept is included. A concept may be a member of more than one concept scheme
  • isPrimarySubjectOf — A resource for which the concept is the primary subject
  • isSubjectOf — A resource for which the concept is a subject
  • member — A member of a collection
  • memberList — An RDF list containing the members of an ordered collection
  • narrower — A concept that is more specific in meaning. Narrower concepts are typically rendered as children in a concept hierarchy (tree)
  • note — A general note, for any purpose. The other human-readable properties of definition, scopeNote, example, historyNote, editorialNote and changeNote are all sub-properties of note
  • prefLabel — The preferred lexical label for a resource, in a given language. No two concepts in the same concept scheme may have the same value for skos:prefLabel in a given language
  • prefSymbol — The preferred symbolic label for a resource
  • primarySubject — A concept that is the primary subject of the resource. A resource may have only one primary subject per concept scheme
  • related — A concept with which there is an associative semantic relationship
  • scopeNote — A note that helps to clarify the meaning of a concept
  • semanticRelation — A concept related by meaning. This property should not be used directly, but as a super-property for all properties denoting a relationship of meaning between concepts
  • subject — A concept that is a subject of the resource
  • subjectIndicator — A subject indicator for a concept. [The notion of 'subject indicator' is defined here with reference to the latest definition endorsed by the OASIS Published Subjects Technical Committee]
  • symbol — An image that is a symbolic label for the resource. This property is roughly analogous to rdfs:label, but for labelling resources with images that have retrievable representations, rather than RDF literals. Symbolic labelling means labelling a concept with an image.
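Some of the properties above carry constraints that are easy to check mechanically. As a minimal sketch, the following Python checks the prefLabel rule (no two concepts in the same concept scheme may share a preferred label in a given language). The `ex:` identifiers, the sample data and the `duplicate_pref_labels` helper are assumptions made for illustration; triples are plain tuples, and a label is represented as a (text, language) pair.

```python
from collections import defaultdict


def duplicate_pref_labels(triples):
    """Map (scheme, label, lang) to the concepts sharing it, if more than one."""
    schemes = defaultdict(set)  # concept -> concept schemes it belongs to
    labels = []                 # (concept, (text, lang)) pairs
    for subject, predicate, obj in triples:
        if predicate == "skos:inScheme":
            schemes[subject].add(obj)
        elif predicate == "skos:prefLabel":
            labels.append((subject, obj))

    seen = defaultdict(list)
    for concept, (text, lang) in labels:
        for scheme in schemes.get(concept, ()):
            seen[(scheme, text, lang)].append(concept)
    return {key: sorted(found) for key, found in seen.items() if len(found) > 1}


# Hypothetical sample data: ex:c1 and ex:c2 clash in English; ex:c3 does not,
# because its label is in a different language.
sample = [
    ("ex:c1", "skos:inScheme", "ex:scheme"),
    ("ex:c2", "skos:inScheme", "ex:scheme"),
    ("ex:c3", "skos:inScheme", "ex:scheme"),
    ("ex:c1", "skos:prefLabel", ("Banks", "en")),
    ("ex:c2", "skos:prefLabel", ("Banks", "en")),
    ("ex:c3", "skos:prefLabel", ("Banks", "fr")),
]

conflicts = duplicate_pref_labels(sample)
print(conflicts)
```

A concept that belongs to several schemes is checked once per scheme, which follows from inScheme allowing membership in more than one concept scheme.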

DOAP

DOAP (Description Of A Project) is an RDF Schema and XML vocabulary to describe software and related projects, initially for open source but not inherently limited to such. Its vocabulary enables international descriptions of a software project and its associated resources, including participants and Web resources. The DOAP project provides basic tools to enable the easy creation and consumption of such descriptions. The DOAP vocabulary is interoperable with other popular Web schema (RSS, FOAF, Dublin Core). The DOAP vocabulary can also be extended for specialist purposes.

Dublin Core

The Dublin Core metadata element set is a standard for cross-domain information resource description. It contains the 15 elements of title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage and rights. Dublin Core is the 'granddaddy' of metadata and a key reference set. UMBEL can be seen as a controlled vocabulary expansion to the Dublin Core subject element.
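To make the idea of a controlled vocabulary for the subject element concrete, here is a rough Python sketch. It assumes a record is a simple dictionary keyed by Dublin Core element name; the vocabulary URIs, the sample record and both helper functions are hypothetical placeholders, not actual UMBEL identifiers or APIs.

```python
# The 15 Dublin Core elements listed above.
DC_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})

# Hypothetical controlled vocabulary; these URIs are placeholders,
# not real UMBEL subject identifiers.
SUBJECT_VOCABULARY = {
    "http://example.org/subject/Finance",
    "http://example.org/subject/Music",
}


def unknown_elements(record):
    """Return the record keys that are not Dublin Core elements."""
    return sorted(key for key in record if key not in DC_ELEMENTS)


def uncontrolled_subjects(record, vocabulary):
    """Return subject values that do not come from the controlled vocabulary."""
    return [value for value in record.get("subject", []) if value not in vocabulary]


# Made-up record: one subject is controlled, one is free text.
record = {
    "title": "Annual report",
    "creator": "Example Corp",
    "subject": ["http://example.org/subject/Finance", "money stuff"],
    "keywords": "finance",
}

print(unknown_elements(record))
print(uncontrolled_subjects(record, SUBJECT_VOCABULARY))
```

The free-text subject is flagged because it cannot be resolved against the shared vocabulary, which is the gap a common subject reference layer is meant to close.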

FOAF

FOAF (Friend of a Friend) is a project for machine-readable modeling of homepage-like profiles and social networks. It has evolved to provide a rather complete description profile of people. FOAF is an RDF Schema. It is similar to the hCard microformat.

GeoNames

GeoNames is a geographical database freely available and accessible through various Web services. It contains over 8,000,000 geographical names corresponding to over 6,500,000 unique features. All features are categorized into one out of nine feature classes and further sub-categorized into one out of 645 feature codes. Beyond names of places in various languages, data stored include latitude, longitude, elevation, population, administrative subdivision and postal codes. All coordinates use the WGS84 system. GeoNames is available as RDF and in SKOS.

SIOC

Semantically-Interlinked Online Communities Project (SIOC) provides methods for interconnecting discussion methods such as blogs, forums and mailing lists to each other. It consists of the SIOC ontology, an open-standard machine readable format for expressing the information contained both explicitly and implicitly in internet discussion methods, of SIOC metadata producers for a number of popular blogging platforms and content management systems, and of storage and browsing / searching systems for leveraging this SIOC data. SIOC is an RDF Schema. The SIOC project Web site is also an exemplar for UMBEL.

Potential Missing References

Reference ontologies that appear to be missing from the above set, but which might complete a full set of reference ontologies, would include events and time dimensions, companies and organizations, and products. Additional areas such as the sciences, organisms, etc., would appear to also be useful, but likely more as mid-level ontologies as opposed to common, standard reference ontologies.

Change Log

First Draft: 2007-05-30
Second Draft: 2007-06-18 Added split between 'core' and the 'unofficial' structured look-up portions of the UMBEL ontology; added methodology discussion; expanded discussion of candidate subject reference sets and reference ontologies
Third Draft (first public release): 2007-07-12 Added material on proposed binding mechanisms; other minor edits in preparation for first public release
Next Planned Revision: TBD