The Vision of the Semantic Web
"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." - Tim Berners-Lee, James Hendler, Ora Lassila (2001)
According to the World Wide Web Consortium (W3C), the Web can reach its full potential only if it becomes a place where data can be shared, processed, and understood by automated tools as well as by people. For the Web to scale, tomorrow's programs must be able to share, process, and understand data even when these programs have been designed independently from one another.
The Semantic Web is still in its definition stage, and the term is perhaps new to many people, even to those within IT circles. But the problems it aims to address are ones we have been struggling to solve for decades – issues such as information overload, stovepipe systems, and poor content aggregation (Daconta, Obrst, and Smith, 2003). The fundamental roots of these problems are the lack of semantic definitions in individual systems, the lack of semantic integration among data sets, and the lack of semantic interoperability across disparate systems. The Semantic Web extends beyond the capabilities of the current Web and existing information technologies, enabling more effective collaborations and smarter decision-making. It is an aggregation of intelligent websites and data stores accessible by an array of semantic technologies, conceptual frameworks, and well-understood contracts of interaction to allow machines to do more of the work to respond to service requests – whether that be taking on rote search processes, providing better information relevance and confidence, or performing intelligent reasoning or brokering.
The steps to reach this state, however, are not likely to be accomplished in a few short years. Certainly, rapid progress will be made on some fronts, just as numerous websites appeared soon after the introduction of low-cost/no-cost web servers and free graphical browsers. But the development of websites progressed relatively chaotically over the course of a half-dozen years – starting from an ad hoc set of scripting languages, low-end tools, and custom-built server components, and steadily moving to a relatively unified set of core languages, application servers, content management systems, e-commerce engines, web services, and other enterprise-worthy components and offerings. The growth of the Semantic Web is likely to go through a similar progression in market dynamics. Although the business models of a connected world are now better understood and awareness of emerging web technologies is higher, there will nevertheless be a significant time lag until many of the pieces of the vision are assembled.
What the Semantic Web Is and Is Not
1. The Semantic Web is not a new and distinct set of websites.
The Semantic Web is an extension of the current World Wide Web, not a separate set of new and distinct websites. It builds on the current World Wide Web constructs and topology, but adds further capabilities by defining machine-processable data and relationship standards along with richer semantic associations. Existing sites may use these constructs to describe information within web pages in ways more readily accessible by outside processes such as search engines, spider searching technology, and parsing scripts. Additionally, new data stores, including many databases, can be exposed and made available to machine processing that can do the heavy lifting to federate queries and consolidate results across multiple forms of syntax, structure, and semantics. The protocols underlying the Semantic Web are meant to be transparent to existing technologies that support the current World Wide Web.
2. The Semantic Web is not being constructed with just human accessibility in mind.
The current Web relies mainly on text markup and data link protocols for structuring and interconnecting information at a very coarse level. The protocols are used primarily to describe and link documents in forms presentable for human consumption (though with useful hooks for first-order machine searching and aggregation). Semantic Web protocols define and connect information at a much more refined level. Meanings are expressed in formats understood and processed more easily by machines in ways that can bridge structural and semantic differences within data stores. This abstraction and increased accessibility mean that current web capabilities can be augmented and extended – and new, powerful ones introduced.
3. The Semantic Web is not built upon radical untested information theories.
The emergence of the Semantic Web is a natural progression in accredited information theories, borrowing concepts from the knowledge representation and knowledge management worlds as well as from revised thinking within the World Wide Web community. The newly approved protocols have lineages that go back many years and embody the ideas of a great number of skilled practitioners in computer languages, information theory, database management, model-based design approaches, and logics. These concepts have been proven within a number of real-world situations although the unifying set of standards from the W3C promises to accelerate and broaden adoption within the enterprise and on the Web. With respect to issues about knowledge representation and its yet-to-be-fulfilled promise, a look at history shows numerous examples of a unifying standard providing critical momentum for acceptance of a concept. HTML was derived from SGML, an only mildly popular text markup language, and yet HTML went on to cause a sea change in the use of information technology. Many in the field point to the long acceptance timeframes for both object-oriented programming and conceptual-to-physical programming models. According to Ralph Hodgson, “knowledge representation is a fundamental discipline that now has an infrastructure and a set of supporting standards to move it out of the labs and into real-world use.”
4. The Semantic Web is not a drastic departure from current data modeling concepts.
According to Tim Berners-Lee, the Semantic Web data model is analogous to the relational database model. “A relational database consists of tables, which consist of rows, or records. Each record consists of a set of fields. The record is nothing but the content of its fields, just as an RDF node is nothing but the connections: the property values. The mapping is very direct – a record is an RDF node; the field (column) name is RDF propertyType; and the record field (table cell) is a value. Indeed, one of the main driving forces for the Semantic Web has always been the expression, on the Web, of the vast amount of relational database information in a way that can be processed by machines.” (Berners-Lee, 1998) That said, the Semantic Web is a much more expressive, comprehensive, and powerful form of data modeling. It builds on traditional data modeling techniques – be they entity-relationship modeling or another form – and transforms them into much more powerful ways for expressing rich relationships in a more thoroughly understandable manner.
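The record-to-node mapping described in the quotation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table name, the rows, the base URI http://example.org/, and the function rows_to_triples are all hypothetical, and triples are represented as plain tuples rather than with any RDF library.

```python
# Hypothetical table name, rows, and base URI, chosen only for illustration.
BASE = "http://example.org/"

def rows_to_triples(table, key_column, rows):
    """Turn each record into triples: (record node, column name, cell value)."""
    triples = []
    for row in rows:
        # The record becomes a node identified by its primary-key value.
        subject = f"{BASE}{table}/{row[key_column]}"
        for column, value in row.items():
            if column == key_column:
                continue  # the key is already encoded in the node identifier
            # The column name plays the role of the RDF property.
            predicate = f"{BASE}{table}#{column}"
            triples.append((subject, predicate, value))
    return triples

employees = [
    {"id": 7, "name": "Ada", "dept": "Research"},
    {"id": 8, "name": "Grace", "dept": "Engineering"},
]

triples = rows_to_triples("employee", "id", employees)
for triple in triples:
    print(triple)
```

Each row yields one triple per non-key column, so the two records above produce four triples. Nothing about the table's shape is lost, but the statements can now be mixed freely with statements from other sources.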
5. The Semantic Web is not some magical piece of artificial intelligence.
The concept of machine-understandable documents does not imply some form of magical artificial intelligence that allows machines to comprehend human mumblings. It only indicates a machine's ability to solve a well-defined problem by performing well-defined operations on existing well-defined data (Berners-Lee, Hendler, and Lassila, 2001). Current search engines offer capabilities that would have seemed magical 20 years ago, but that we recognize now as being the result of Internet protocols, HTML, the concept of websites, web pages, links, graphical browsers, innovative search and ranking algorithms, and, of course, a large number of incredibly fast servers and equally large and fast disk storage arrays. Semantic Web capabilities will likewise be the result of a logical series of interconnected progressions in information technology and knowledge representation formed around a common base of standards and approaches.
6. The Semantic Web is not an existing entity, ready for users to make use of it.
The Semantic Web currently exists as a vision, albeit a promising and captivating one. Similar to the current Web, the Semantic Web will be formed through a combination of open standard and proprietary protocols, frameworks, technologies, and services. The W3C-approved standards – XML, RDF, and OWL – form the base protocols. New data schemas and contract mechanisms, built using these new protocols, will arise around communities of interest, industry, and intent; some will be designed carefully by experienced data architects and formally recognized by established standards bodies; others will appear from out of nowhere and gain widespread acceptance overnight. A host of new technologies and services will appear such as semantically-aware content publishing tools; context modeling tools; mediation, inference, and reputing engines; data-cleansing and thesaurus services; and new authentication and verification components. Although various elements of the vision already exist, rollout of these technologies, coordination amidst competitive forces, and fulfillment of the vision will take many years.
While the full vision of the Semantic Web may be a bit distant, there are, on the near horizon, capabilities that many think will make enterprise software more connectable, interoperable, and adaptable as well as significantly cheaper to maintain. The use of semantic approaches in combination with the existing and emerging semantics-based schemas and tools can bring immediate and/or near-term benefits to many corporate enterprise and government agency IT initiatives.
Semantic interoperability, for example, represents a more limited or constrained subset of the vision of the Semantic Web. Significant returns, however, can still be gained by using semantic-based tools to arbitrate and mediate the structures, meanings, and contexts within relatively confined and well-understood domains for specific goals related to information sharing and information interoperability. In other words, semantic interoperability addresses a more discrete problem set with more clearly defined endpoints (Pollock and Hodgson, 2004). Semantic technologies can also provide a loosely connected overlay on top of existing Web service and XML frameworks, which in turn can offer greater adaptive capabilities than those currently available. They can also make immediate inroads in helping with service discovery and reconciliation, as well as negotiation of requests and responses across different vocabularies. Considering the depth and difficulty of issues the federal, state, and local agencies have in these regards, semantic technologies may provide the first flexible, open, and comprehensive means to solve them.
Semantic computing is an emerging discipline being formed and shaped as this is written. As such, there are many definitions and interpretations, and even a few low-intensity philosophical wars being waged among thought-leaders and practitioners. That said, the release of RDF and OWL as W3C Recommendations in February 2004 has created a greater commonality in expression.
Because semantic computing makes use of various forms of abstraction and logical expression, it can be difficult to see how the languages provide many of the powerful capabilities expressed in earlier sections. But just as the Internet and World Wide Web are built upon layers of protocols and technologies, so too is the Semantic Web. Understanding several key concepts and becoming familiar with the core building blocks of the Semantic Web will form a basis for visualizing how higher order tools, components, and technologies can deliver on the promise of richer and more flexible machine-processable data. Understanding some of the foundational concepts will also allow readers to better understand the state of the technologies and the areas that still need to be refined in order to reach the full vision of the Semantic Web.
Semantic technologies differ from database schemas, data dictionaries, and controlled vocabularies in an important way: they have been designed with connectivity in mind, allowing different conceptual domains to work together as a network.
1. XML (eXtensible Markup Language)
XML stands for eXtensible Markup Language and is a standard way of describing, transporting, and exchanging data that was pioneered by the W3C in the late 1990s. XML serves as a mechanism for marking up data through the use of customized "tags" in such a way as to describe the data. XML is not necessarily related to HTML and, in fact, the two were designed for entirely different purposes. Despite this fact, the two can complement each other in various ways, depending on a user's needs.
The tags are typically the labels for the data such as “FirstName” or “StreetAddress.” When trying to use XML to define a standard interchange format, it is important to have agreement on the tags. For example, two book suppliers might wish to formalize a partnership involving data exchange. Specifying at the outset that Supplier A’s definition of “Author” is identical to Supplier B’s definition of “Writer” and codifying that in the XML structure would be an essential part of formulating proper data agreement. Additional terms that overlap and have the same meaning would also need to be formally identified, usually in something called a DTD or XML Schema. (XML Schema is a mechanism for defining XML documents in a formal way, thereby ensuring the accurate exchange of information.)
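The Author/Writer agreement between the two book suppliers can be illustrated with a small Python sketch that uses the standard library's xml.etree.ElementTree module. The catalog fragments, the tag names other than Author and Writer, and the normalize function are assumptions made for illustration. In practice such an agreement would usually be captured in a DTD or XML Schema rather than in a dictionary.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog fragments from two suppliers; tag names are assumptions.
SUPPLIER_A = "<Book><Title>Dune</Title><Author>Frank Herbert</Author></Book>"
SUPPLIER_B = "<Book><Title>Emma</Title><Writer>Jane Austen</Writer></Book>"

# The agreed equivalences, written down once instead of living in people's heads.
TAG_MAP_B_TO_A = {"Writer": "Author"}

def normalize(xml_text, tag_map):
    """Rename tags according to the agreed mapping and return the XML text."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")

normalized_b = normalize(SUPPLIER_B, TAG_MAP_B_TO_A)
print(normalized_b)
```

After normalization, Supplier B's record uses the same tag as Supplier A's, so a single piece of code can read both. The mapping table records only that the two tags are interchangeable. It does not say what an author is, which is the gap the later layers aim to fill.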
Examples of XML Schemas in working use can be found in many government and industry registries. According to the U.S. CIO Council XML Working Group, “The full benefits of XML will be achieved only if organizations use the same data element definitions and those definitions are available for partners to discover and retrieve. A registry/repository is a means to discover and retrieve documents, templates, and software (i.e., objects and resources) over the Internet. The registry is used to discover the object. It provides information about the object, including its location. A repository is where the object resides for retrieval by users.”
In the context of semantics and the Semantic Web effort, XML is a set of syntax rules for creating semantically rich markup languages in particular domains. XML allows users to add arbitrary structure to their documents but says nothing about what the structures mean (Berners-Lee, Hendler, and Lassila, 2001). In other words, whereas IT systems, databases, and content management systems have become good at describing things, they have not done so well at describing associations. More concrete and faithful descriptions that provide better senses of words, terms, and domains are needed.
2. RDF (Resource Description Framework)
RDF stands for Resource Description Framework and has been specifically designed to provide this associative information. RDF offers ways to make data richer and more flexible, and therefore able to exist in environments outside those explicitly defined by system programmers and data modelers. RDF encodes information in sets of triples, each triple being rather like the subject, verb, and object of an elementary sentence. (This same model can also represent resource, property, and value constructs.) RDF provides an infrastructure for linking distributed metadata and also serves in conjunction with OWL as a core language for describing and representing ontologies.
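As a rough illustration of the triple model, the following Python sketch stores a few statements as (subject, predicate, object) tuples and retrieves them by pattern. The "ex:" names and the match function are hypothetical, and a real application would more likely use an RDF library and store.

```python
# Each triple is (subject, predicate, object); all names here are illustrative.
triples = {
    ("ex:Report42", "ex:author", "ex:Alice"),
    ("ex:Report42", "ex:topic", "ex:WaterQuality"),
    ("ex:Alice", "ex:worksFor", "ex:AgencyA"),
}

def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

about_report = match(triples, subject="ex:Report42")
authors = match(triples, predicate="ex:author")
print(about_report)
print(authors)
```

Because every statement has the same three-part shape, one generic matching function can answer questions about any subject or property without a schema designed in advance.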
One of the primary benefits of using RDF to describe data associations is the scalability and flexibility it provides. Explicit database tables can be created that do much the same thing, but RDF provides a flexible mechanism that allows far greater associative capabilities, thereby increasing the ability to query and make inferences on topic matters not explicitly hard-wired into tables. The benefits only increase when trying to integrate new data sources, especially when they have different structures or semantics or, more importantly, when they cross conceptual domains (as in the case of environmental and public health data or, alternatively, law enforcement and intelligence data). RDF triples can be serialized in XML, providing a way to describe relationships between data elements using XML tags or other syntax in a format that can be easily processed by machines. In an effort to support a loosely coupled and/or virtual architecture, a Uniform Resource Identifier (URI) is used to identify each of the triple elements. The purpose of a URI is to uniquely identify a concept in the form of subject, verb, or object by linking to the origin where the concept is defined.
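The integration benefit can be sketched with the environmental and public health example. In this hypothetical Python fragment, two independently maintained sets of triples use the same URI for a county, so combining them is a plain set union. The URIs, property names, and values are invented for illustration.

```python
# Hypothetical URIs and data; neither source knows about the other.
COUNTY = "http://example.org/place/county-12"

environment_data = {
    (COUNTY, "http://example.org/env#nitrateLevel", "high"),
}
health_data = {
    (COUNTY, "http://example.org/health#asthmaRate", "elevated"),
}

# Because both sources name the county with the same URI, integration is
# simply the union of the two sets of triples; no schema change is needed.
merged = environment_data | health_data

def describe(store, subject):
    """Collect every property/value pair known about one subject."""
    return {predicate: value for s, predicate, value in store if s == subject}

profile = describe(merged, COUNTY)
print(profile)
```

The sketch assumes the two sources already agree on the identifier for the county. When they do not, an extra statement equating the two identifiers is needed, which is the kind of mapping work described later in this section.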
RDF Schema (sometimes written as RDFS or RDF-S) offers a way of semantically describing and extending RDF. It provides mechanisms for describing groups of related resources and the relationships among these resources. RDF Schema does the same thing for RDF that DTD and XML Schema do for XML. A number of query languages for RDF have been developed within academic and industry circles. In October 2004, the W3C RDF Data Access Working Group released a draft specification for SPARQL (pronounced "sparkle"), a query language for RDF that seeks to unify the way developers and end users write queries and consume RDF search results across a wide range of information.
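One way RDF Schema describes groups of related resources is through class hierarchies. The Python sketch below shows, under simplifying assumptions, how a tool might use rdfs:subClassOf statements to work out types that were never stated directly. The "ex:" vocabulary and both functions are hypothetical, and the sketch covers only this one RDFS feature.

```python
# Illustrative vocabulary; the "ex:" names are hypothetical.
triples = {
    ("ex:Sedan", "rdfs:subClassOf", "ex:Car"),
    ("ex:Car", "rdfs:subClassOf", "ex:Vehicle"),
    ("ex:myCar", "rdf:type", "ex:Sedan"),
}

def superclasses(store, cls):
    """Return all direct and indirect superclasses of a class."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in store:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found

def types_of(store, resource):
    """Return asserted types plus those implied by the class hierarchy."""
    asserted = {o for s, p, o in store if s == resource and p == "rdf:type"}
    inferred = set()
    for cls in asserted:
        inferred |= superclasses(store, cls)
    return asserted | inferred

my_car_types = types_of(triples, "ex:myCar")
print(sorted(my_car_types))
```

Only the statement that ex:myCar is a sedan was asserted. Its membership in the car and vehicle classes follows from the schema, so a query for all vehicles would find it.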
3. OWL (Web Ontology Language)
OWL stands for Web Ontology Language. (The acronym is purposely transposed from the actual name – OWL instead of WOL – as a conscious link to the name of the owl in the book Winnie the Pooh.) Whereas RDF's primary value can be seen in enabling association and integration of distributed data, OWL's main value is in enabling reasoning over distributed data.
OWL is a highly expressive modeling language that is compatible with existing data stores and modeling constructs including XML, relational, and object-oriented approaches. OWL also provides loosely-coupled “views” of data, which makes federated knowledge bases easy to build and evolve. Most importantly, OWL has machine-actionable semantics. Run-time and design-time software tools can do “things” with models, data, metadata, rules, and logic without human assistance or highly specific application code (Pollock, 2004).
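A small example of what machine-actionable semantics can mean in practice: OWL lets a model declare a property to be transitive, and a generic tool can then derive new statements from that declaration alone. The Python sketch below imitates this for one kind of declaration. The place names, the ex:locatedIn property, and the infer_transitive function are hypothetical, and a real OWL reasoner handles many more constructs.

```python
# Illustrative facts; the property and place names are hypothetical.
facts = {
    ("ex:Reston", "ex:locatedIn", "ex:Virginia"),
    ("ex:Virginia", "ex:locatedIn", "ex:UnitedStates"),
}
# Schema-level statement: ex:locatedIn is declared transitive in the model.
transitive_properties = {"ex:locatedIn"}

def infer_transitive(store, transitive):
    """Add implied triples until no new ones appear (a fixed point)."""
    inferred = set(store)
    changed = True
    while changed:
        changed = False
        for (a, p, b) in list(inferred):
            if p not in transitive:
                continue
            for (c, q, d) in list(inferred):
                if q == p and c == b and (a, p, d) not in inferred:
                    inferred.add((a, p, d))
                    changed = True
    return inferred

closure = infer_transitive(facts, transitive_properties)
print(sorted(closure - facts))
```

The function contains no knowledge of geography. The conclusion that Reston is located in the United States comes entirely from the data and the declaration in the model, so no application-specific code is needed.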
OWL is derived from a number of efforts to develop a set of flexible and computational logic constructs, many of which go back many years. It is the next generation of the ontology language called DAML+OIL, which in turn integrated two efforts: DAML, the DARPA Agent Markup Language, an effort that was based in the United States, and OIL, the Ontology Inference Layer (or Language), an effort that was based in Europe. It also has roots in SHOE (Simple HTML Ontology Extensions), an effort led by James Hendler at the University of Maryland, created specifically for incorporating machine-readable knowledge into web documents, thereby facilitating intelligent agent capabilities. There are three levels of OWL defined (OWL Lite, OWL DL, and OWL Full), each having progressively more expressiveness and inferencing power. These levels were created to make it easier for tool vendors to support a specified level of OWL.
RDF and OWL can operate together or separately. In some cases, supporting the distributed nature of data may be the primary objective, in which case only RDF may be used. In other cases, distribution plus reasoning capabilities may be desired, and so both RDF and OWL may be featured. In other instances, just reasoning capabilities are desired, and so OWL may suffice.
4. Other Language Development Efforts
Other languages are currently being developed to address additional layers within the Semantic Web vision. For example, a Rule language will provide the capability to express certain logical relationships in a form suitable for machine processing. This language will allow the expression of business rules and will provide greater reasoning and inference capabilities. RuleML was initially proffered as a rule language, although efforts to formalize the Semantic Web Rule Language (SWRL) are currently underway at W3C. A Logic language will conceivably provide a universal mechanism for expressing monotonic logic and validating proofs. A long-term hope is to eventually make use of assertions from around the Web to derive new knowledge. (The problem here is that deduction systems are not terribly interoperable. Rather than designing a single overarching reasoning system, current activities are focused on specifying a universal language for representing proofs. Systems can then digitally sign and export these proofs for other systems to use and incorporate.)
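To give a feel for what a rule layer adds, the Python sketch below hand-codes a single business rule of the if-then kind such a language would express declaratively. It is not RuleML or SWRL syntax. The facts, the property names, and the function apply_overdue_rule are all invented for illustration.

```python
# Illustrative facts and a single hand-written rule; all names are hypothetical.
facts = {
    ("ex:Acme", "ex:placedOrder", "ex:Order17"),
    ("ex:Order17", "ex:hasStatus", "ex:Overdue"),
    ("ex:Bolt", "ex:placedOrder", "ex:Order18"),
    ("ex:Order18", "ex:hasStatus", "ex:Paid"),
}

def apply_overdue_rule(store):
    """If ?c placedOrder ?o and ?o hasStatus Overdue, conclude ?c hasOverdueOrder ?o."""
    derived = set()
    for customer, p1, order in store:
        if p1 != "ex:placedOrder":
            continue
        if (order, "ex:hasStatus", "ex:Overdue") in store:
            derived.add((customer, "ex:hasOverdueOrder", order))
    return derived

new_facts = apply_overdue_rule(facts)
print(new_facts)
```

In a rule language the condition and conclusion would be data that any compliant engine could load and apply, rather than logic buried in one program's source code.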
Likewise, constructs, schemas, and architectures for inferring reputation and trust are also being developed, both within the W3C and by the larger web community. These approaches are being looked at not just to infer reputation and trust by and among individuals, but also of groups of people (such as companies, media sources, non-government organizations, and political movements), inanimate objects (such as books, movies, music, academic papers, and consumer products), and even ideas (such as belief systems, political ideas, and policy proposals) (Masum and Zhang, 2004).
One challenge faced by practitioners in the field is to create frameworks and languages with sufficient expressiveness to capture the knowledge that can be described in ambiguous human languages. At issue is how to create languages, tools, and systems that will support the easy expression of simple things, while making it possible to express complex things. Another challenge is to maintain compatibility with existing syntax standards such as HTML, XML, and RDF while dealing with issues pertaining to the readability and human accessibility of the syntax. Ultimately, better tools will be developed that will minimize these issues, but in the meantime complexities within some of the higher order languages may make it more difficult to develop fully compliant implementations using current editing and modeling tools.
Semantic Tools and Components
"Semantic Web tools are getting better every day. New companies are starting to form. Big companies are starting to move."
Several models exist that describe the lifecycle or stages of maturity that technologies go through. Typically these have four stages: entry (or definition), growth (or validation), maturity (or refinement), and decline (or consolidation). By most measures, the Semantic Web, as experienced in a publicly available format, is still in the entry/definition phase. Many of the semantic technologies, however, are well into the growth/validation phase. (The shift into maturity is often elusive; the tipping point being visible only after the fact, and at times passing through a period of hype and unmet expectations.)
Leaders in technology applications across government and private industry have been forging new paths and obtaining successful results from their semantic implementation projects. There are semantic research projects in a number of federal agencies. Semantic products are available from large and established companies such as Adobe, Hewlett Packard, and IBM, as well as from many small pioneering companies such as Unicorn, Network Inference, and Semagix. In addition, there are a number of open source and publicly available tools created by public and private research institutions and organizations.
What follows is a brief survey of commercial and open source tools that can be used to create applications powered by semantic technologies. One way to understand how these tools work together is to view them as either design-time tools or run-time tools. Design-time tools are used by document authors, system designers, and others as part of the creation, design, or authoring process. Examples include tools to create metadata or to create or populate ontologies. Other software components are used as run-time components to process queries, transform data, or otherwise produce operational results. Examples include mediation servers and inference engines. Many of the tools are used as a set within an implementation process – for example, modeling and mapping tools during design-time in partnership with query facilities and mediation servers at run-time.
Metadata Publishing and Management Tools
The process for creating metadata about a document or data item can occur when that item is created or authored, when it is imported into a content management system or a website, or when it is viewed or read by users. It can also be added by some other explicit or implicit action at any point over the course of the existence of that data item. In other words, metadata creation is not just a one-time occurrence. Metadata can continue to accumulate and can be modified at any time by conceivably any number of people.
At content creation, authors typically connect information such as the subject, creator, location, language, and copyright status with a particular document. This information makes the document much more searchable. RSS consists essentially of this type of information, providing newsreading applications with significantly expanded capabilities for searching and filtering information. Movable Type from a company called Six Apart is one of the more popular tools in the blogging community for creating RSS-compliant documents. The increasing popularity and simplicity of RSS are causing its use to extend outside of the blogging community into the general media and even into the enterprise. Other vendors of desktop and web-authoring tools are also moving quickly to provide RSS publishing capabilities.
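As a rough sketch of the kind of metadata a publishing tool might attach at creation time, the following Python fragment builds a single RSS-style item with creator, subject, and language fields. It uses the Dublin Core element namespace, which is commonly used alongside RSS for such fields. The post details and the build_item function are hypothetical, and the fragment is not a complete feed.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace, commonly used alongside RSS for metadata.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

def build_item(title, link, creator, subject, language):
    """Build one RSS <item> carrying descriptive metadata about a document."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, f"{{{DC}}}creator").text = creator
    ET.SubElement(item, f"{{{DC}}}subject").text = subject
    ET.SubElement(item, f"{{{DC}}}language").text = language
    return item

# Hypothetical post details, for illustration only.
item = build_item(
    title="Notes on RDF",
    link="http://example.org/posts/rdf-notes",
    creator="A. Author",
    subject="Semantic Web",
    language="en",
)
item_xml = ET.tostring(item, encoding="unicode")
print(item_xml)
```

A newsreader that understands these elements can filter by creator or subject instead of relying on a full-text search of the post body.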
The creation of metadata is only one step in the process. Metadata management tools are needed in order to maintain metadata vocabularies, perform metadata-driven queries, and provide visualization tools for monitoring changes in areas of interest. An example of a website that uses metadata as a key aspect of creating a collaborative and shared system of data is Flickr, a site for people to easily upload and share digital photos. What sets it apart from other digital photo services is that it provides photo-tagging capabilities as well as an innovative interface for viewing the categories of photos. (The tags are contained in a map and vary in size depending on the frequency of the tag within the data store.) What distinguishes it from earlier metadata implementations is that the feedback loop is extremely tight, meaning that the assignment of tags is bound closely to their use. As soon as photos and sets of photos are tagged, users see clusters of items carrying the same tag. Users can easily change tags to refine the clusters. In terms of tools for querying metadata, the components are not much different from current search engines, although the inclusion of metadata makes for richer data and therefore more precise and relevant searches. Query scripts and languages will likely adapt to allow users more precision, although the balance between simplicity and features is constantly in flux, especially in more publicly available search engines. As with the Flickr example above, however, new visualization tools are being developed to help users navigate through complex fields of related data.
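The tag map described above, in which a tag's display size depends on how often it is used, can be approximated with a short Python sketch. The photo tags, the size range, and the linear scaling are assumptions made for illustration and are not a description of how Flickr computes its display.

```python
from collections import Counter

# Hypothetical tags attached to a handful of photos.
photo_tags = [
    ["beach", "sunset"],
    ["beach", "family"],
    ["sunset", "beach", "travel"],
]

def tag_sizes(tagged_items, min_size=10, max_size=30):
    """Scale each tag's font size linearly between min_size and max_size."""
    counts = Counter(tag for tags in tagged_items for tag in tags)
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid dividing by zero when all counts match
    return {
        tag: min_size + (count - low) * (max_size - min_size) // span
        for tag, count in counts.items()
    }

sizes = tag_sizes(photo_tags)
print(sizes)
```

With these sample tags, the most frequent tag receives the largest size and the tags used once receive the smallest, so the display itself shows which categories dominate the collection.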
Modeling Tools (Ontology creation and modification)
Modeling tools are used to create and modify ontologies. Knowledge modelers use them to create and edit class structures and model domains. The tools often have an interface that is similar to a file system directory structure or bookmark folder interface. They also tend to offer the ability to import, transform, and re-purpose, in whole or in part, existing ontological structures that are often in the form of database schemas, product catalogues, and yellow pages listings. Other prominent features include advanced mechanisms for organizing, matching, and associating similar terms and concepts.
Also, because it is a common practice for modelers to create smaller interconnected ontologies instead of a single large monolithic model – primarily for better reusability and ease of use – support for splitting, merging, and connecting models can be an important capability in the ontology editor. Some editors even support collaborative work methods and rich visualization and graphical interaction modes. Protégé-2000 is a free ontology editor from Stanford University with a large and active user community. It features an open architecture that allows independent developers to write plug-ins that can significantly extend Protégé capabilities. Commercial modeling tools are available from a number of vendors including Network Inference, Language and Computing, and Intelligent Views. IBM’s Ontology Management System (also known as SNOBASE, for Semantic Network Ontology Base) is a framework for loading ontologies from files and via the Internet and for locally creating, modifying, querying, and storing ontologies. Internally, SNOBASE uses an inference engine, an ontology persistent store, an ontology directory, and ontology source connectors. Applications can query against the created ontology models, and the inference engine deduces the answers and returns result sets similar to JDBC (Java Database Connectivity) result sets. As of the time of publication of this paper, SNOBASE is not, however, compatible with OWL. The Sigma ontology development and reasoning system is also a fully formed design and run-time ontology management system. It can be freely licensed although it, like SNOBASE, is not compliant with OWL.
Arriving at the right ontology is often a critical element of successful implementation of semantics-based projects. Even more so than database design, ontology creation is a highly specialized field. Not only is there not yet a sizeable number of skilled practitioners, but it can also take considerable time to arrive at an ontology that successfully captures a conceptual domain. As a result, it is important to look at existing bodies of work that can be used (and reused) in lieu of having to create something from scratch. Likely sources of existing ontologies can typically be located in close association with ontology modeling tools, several of which are named above. Use of proprietary ontologies may be contingent upon licensing of the modeling tools, a practice which is not unreasonable considering the efforts expended to develop the ontologies. Other ontologies, however, may be open and free for use for commercial and non-commercial purposes, much in the vein of Linux, JBoss, Wikipedia, MusicBrainz, and other open source software and data repositories.
Current ontology development efforts vary in scope and size. Some ontologies have been developed specifically in answer to localized implementations such as reconciling charts of accounts or health care records, areas where the emphasis is primarily on information interoperability – arbitrating between syntaxes, structures, and semantics – and less on logic programming. Other ontology development efforts take a more top-down approach under the assumption that a shared view of a wide knowledge domain is critical to widespread proliferation of adaptive computing and intelligent reasoning capabilities. There is significant advocacy in these latter circles on the establishment of an enterprise-wide common upper ontology under the belief that it will provide the foundation for any number of domain ontologies. New domain ontologies could be extensions of, and fully compliant with, this upper ontology. Existing ontologies and legacy data models could be mapped to this upper ontology, which theoretically would constitute a significant number of the steps toward achieving greater semantic interoperability across domains. (It should be noted, however, that additional development and engineering is still needed to demonstrate the feasibility and scalability of this approach.)
Several candidate upper ontologies now exist, including DOLCE (Gangemi, et al., 2002), Upper Cyc (Lenat, 1995), and SUMO (Niles and Pease, 2001), but none of these as yet has gained significant market adoption. Proponents of this upper ontology approach believe that were the U.S. Department of Defense and/or the Federal Government to adopt one of these candidates, there is a good chance industry would follow, after which the U.S. could then propose it as a standard to the International Organization for Standardization (ISO).
Even where domain-specific ontologies do not exist, it is possible to jumpstart development by making use of existing taxonomies, XML standards, or other lower order data models. At the federal level, the Knowledge Management working group (http://km.gov) has made significant progress in sharing information about taxonomy projects across agencies. XML.Gov (http://xml.gov) has a mission to facilitate efficient and effective use of XML across agencies in order that seamless sharing of documents and data can be achieved. Many government agencies have existing taxonomies, or have begun to develop taxonomies for their information domains. JusticeXML, for example, is an impressive body of work that could be extended and enhanced by RDF and OWL to provide a more flexible data model, an effort that could pave the way for far easier access to federal, state, and local law enforcement information by other agencies.
Mapping Tools (Ontology population)
Once an ontology model is created, it needs to be populated with data (referred to as class instances in “ontology speak”). This process is usually accomplished by linking various data sources to the concepts in an ontology using a mapping tool. Once “maps” have been created, a query in one data source could be transformed by its map to the ontology and then from the ontology to the other data sources using their maps. The corresponding data could then be returned in the same manner without any of the data stores knowing or caring about the others. In other words, each data source may have a unique “map” to an overarching ontology that acts as a pivot table among the various sources and targets. Providing this abstraction layer requires some up-front effort to create the ontology and then the data maps, but once this has been done, each data source can interoperate with other data sources strictly within run-time processes. Bringing new data sources onboard will, in most cases, have little or no effect on existing data sources.
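The pivot idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two data sources, their field names, and the ontology concept names such as Person.fullName are hypothetical, and a real mapping tool would also handle structure and value conversion rather than field names alone.

```python
# Hypothetical field maps: each source maps its own field names to shared
# ontology concepts. None of these names come from a real system.
HR_MAP = {"emp_name": "Person.fullName", "dob": "Person.birthDate"}
PAYROLL_MAP = {"FullName": "Person.fullName", "BirthDt": "Person.birthDate"}


def to_ontology(record, source_map):
    """Rewrite a source record so that its keys are ontology concepts."""
    return {source_map[field]: value
            for field, value in record.items() if field in source_map}


def from_ontology(concepts, target_map):
    """Rewrite an ontology-keyed record into a target's field names."""
    reverse = {concept: field for field, concept in target_map.items()}
    return {reverse[concept]: value
            for concept, value in concepts.items() if concept in reverse}


def translate(record, source_map, target_map):
    """Translate source -> ontology -> target; neither side knows the other."""
    return from_ontology(to_ontology(record, source_map), target_map)


hr_record = {"emp_name": "Ada Lovelace", "dob": "1815-12-10"}
payroll_record = translate(hr_record, HR_MAP, PAYROLL_MAP)
print(payroll_record)
```

Adding a third source in this sketch would mean writing one more map to the ontology, with no change to HR_MAP or PAYROLL_MAP.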
This process drastically reduces the amount of data value mapping and semantic conflict resolution that typically takes place using current enterprise application approaches – approaches that up to now typically require n-squared mappings (mapping from and to each and every data source) or alternatively, exporting to hard-coded, inflexible, and explicit standards. Modeling and mapping make integration far less political and far more flexible and adaptable. Anomalies specific to a single data source, for example, can be handled almost transparently, whereas addressing such anomalies within the typical standards process would entail expending significant time and energy. Most of the tools used to handle structured data forms have features that automate the process of mapping database fields to ontologies. Network Inference and Unicorn are two vendors with tools of this type. Tools that aggregate, normalize, and map unstructured data forms to ontologies typically work with a variety of unstructured data forms including Word, RTF, text files, and HTML. Semagix is a leading vendor for unstructured data.
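The n-squared figure mentioned above can be checked with simple arithmetic. The sketch below assumes that point-to-point integration needs one directed map for every ordered pair of sources, and that in the ontology approach a single map per source can be used in both directions; both function names are illustrative.

```python
def pairwise_mappings(n):
    """Directed maps needed when every source maps to every other source."""
    return n * (n - 1)


def pivot_mappings(n):
    """Maps needed when each source maps once to a shared ontology."""
    return n


for n in (3, 10, 50):
    print(n, pairwise_mappings(n), pivot_mappings(n))
```

Under these assumptions the point-to-point count grows roughly with the square of the number of sources, while the ontology-based count grows linearly.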
Ontologies and other RDF data models can be stored in native RDF data stores or in relational databases that have been customized to support associative data techniques. Native RDF data stores are inherently designed to support the concept of triples and can offer an efficient out-of-the-box approach to storing ontologies. RDF native databases are available from companies such as Tucana Technologies and Intellidimensions. Several high-quality open source RDF data stores also exist, including Kowari, Redland, Sesame, and 3Store. To use a relational database, the database must be designed in a somewhat non-traditional way. Instead of having a table that describes each major concept, the database design typically mimics the concept of triples by using a single table containing four columns. Three of the columns store the triple while the fourth column is used to store its identification tag. (A report entitled “Mapping Semantic Web Data with RDBMSes” is an excellent resource for finding out more about implementing triple stores in relational databases.)
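A minimal sketch of the four-column layout described above, using Python's built-in sqlite3 module with an in-memory database. The table name, column names, and the ex: identifiers are assumptions chosen for illustration, not a schema taken from any particular product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table holds every statement: three columns for the triple and a
# fourth (id) that serves as the statement's identification tag.
conn.execute(
    "CREATE TABLE triples ("
    " id INTEGER PRIMARY KEY,"
    " subject TEXT NOT NULL,"
    " predicate TEXT NOT NULL,"
    " object TEXT NOT NULL)"
)

rows = [
    ("ex:Kowari", "rdf:type", "ex:RdfStore"),
    ("ex:Sesame", "rdf:type", "ex:RdfStore"),
    ("ex:Kowari", "ex:license", "open source"),
]
conn.executemany(
    "INSERT INTO triples (subject, predicate, object) VALUES (?, ?, ?)", rows
)

# Find every subject that is declared to be an RDF store.
stores = [row[0] for row in conn.execute(
    "SELECT subject FROM triples"
    " WHERE predicate = ? AND object = ? ORDER BY subject",
    ("rdf:type", "ex:RdfStore"),
)]
print(stores)
```

New kinds of facts, such as the ex:license statement, are added as rows rather than as new tables or columns, which is the main departure from a conventional relational design.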
Issues related to representing, storing, and querying using triples (i.e., RDF) versus traditional relational approaches, as well as the use and/or co-existence of the two types of data stores within implementations, are still working themselves out within industry and the marketplace. Each store-and-query facility provides unique capabilities that the other, at present, does not. RDF is great for situations when it is difficult to anticipate the types of queries that will be performed in the future. It is also terrific for handling metadata and for making queries that require inferences across imprecise or disparate data. For example, a query along the lines of, “How many energy producers qualify for ‘green’ status this year?” is much easier to perform using an RDF query language than in SQL (once the models have been created to tie together various data stores). At the same time, queries that are trivial in SQL, such as, “Which energy producers reduced their CO2 output the most this year?” can be quite complicated using an RDF query language.
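One reason triples cope well with unanticipated queries is that every question reduces to the same operation: matching a pattern against statements. The sketch below is a plain-Python illustration of that idea, not an RDF query language; the data and the ex: names are hypothetical.

```python
# Hypothetical statements about energy producers.
TRIPLES = [
    ("ex:AcmePower", "rdf:type", "ex:EnergyProducer"),
    ("ex:AcmePower", "ex:status", "ex:Green"),
    ("ex:CoalCo", "rdf:type", "ex:EnergyProducer"),
    ("ex:AcmePower", "ex:locatedIn", "ex:Ohio"),
]


def match(pattern, triples=TRIPLES):
    """Return triples matching (subject, predicate, object).

    None in any position acts as a wildcard.
    """
    return [t for t in triples
            if all(want is None or want == got
                   for want, got in zip(pattern, t))]


# Two unrelated questions answered by the same function.
green = match((None, "ex:status", "ex:Green"))
about_acme = match(("ex:AcmePower", None, None))
print(len(green), len(about_acme))
```

Aggregations such as ranking producers by change in output are not covered by pattern matching alone, which is consistent with the point that some queries remain easier in SQL.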
It is important to note that RDF query languages are still evolving, which may to some extent explain this limitation. Other limitations of RDF relate to performance issues. Because queries can be broadened, for example, to include concepts instead of just terms, the search space can be dramatically increased. Because RDF data stores are relatively new and the number of implementations relatively small, system developers need to iterate over their designs, paying particular attention to queries and functions that could have negative effects on performance. In terms of industry growth, it is difficult to predict how RDF will affect the database industry. RDF data stores may remain a distinct data storage category in their own right or their capabilities may be subsumed into relational databases in a manner similar to what happened with object-oriented databases.
Mediation engines are automated tools that can dynamically transform data among different syntaxes, structures, and semantics using models instead of hard-wired transformation code. They are critical components of any interoperability architecture. Using data maps, ontologies, and other forms of conceptual models, mediation engines are run-time processes that provide an abstraction layer between heterogeneous data sets, allowing organizations to essentially agree to disagree about how data and information should be represented. Mediation engines typically work with highly structured data. Unstructured and semi-structured data must first be bound to a schema prior to creating the mediation maps (Pollock, 2004).
Inference engines (sometimes referred to as reasoners) are software tools that derive new facts or associations from existing information. It is often said that an inference engine emulates the human capability to arrive at a conclusion by reasoning. In reality, inferencing is not some mythical artificial intelligence capability but, rather, a quite common approach in data processing. One can think of a complex data mining exercise as a form of inferencing. By creating a model of the information and relationships, we enable reasoners to draw logical conclusions based on the model. A common example of an inference is to use models of people and their connections to other people to gain new knowledge. Exploration of these network graphs can enable inferences about relationships that may not have been explicitly defined. Note that with just RDF and OWL, inferences are limited to the associations represented in the models, which primarily means inferring transitive relationships. With the addition of rule and logic languages, however, greater leaps in conceptual understandings, learning, and adaptation can take place, although implementations with these types of capabilities are, as yet, few and far between. Both free and commercial versions of inference engines are available. For example, Jena, an open source Java framework for writing Semantic Web applications developed by HP Labs, has a reasoner subsystem. The Jena reasoner includes a generic rule-based inference engine together with configured rule sets for RDFS and for the OWL-Lite subset of OWL Full. JESS is a popular OWL inference engine from Carnegie Mellon University. Network Inference offers a commercial reasoner based on description logic (OWL-DL).
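The transitive inference mentioned above can be sketched in a few lines. This is a naive fixed-point loop written for clarity, not the algorithm used by Jena or any other reasoner named here; the relation and the names in it are hypothetical.

```python
def transitive_closure(pairs):
    """Infer every (a, d) implied by chains (a, b), (b, d) in `pairs`."""
    closure = set(pairs)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Explicitly asserted "reports to" statements.
asserted = {("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Dave")}

# Statements that were never asserted but follow from transitivity.
inferred = transitive_closure(asserted) - asserted
print(sorted(inferred))
```

The three inferred pairs are facts that no data source stated directly, which is the sense in which a reasoner derives new associations from a model.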
Ordinary web pages are a good source of instance information; many tools for populating ontologies are based on annotation of web pages. The W3C Annotea project offers free annotation tools. Commercial vendors include Ontoprise and Lockheed Martin. Several software vendors, including Semagix, Siderean Software, and Entopia, offer products that use ontologies to categorize information and to provide improved search and navigation.
Applications of Semantic Technologies
Semantic technologies can solve problems that, using current technologies, are unsolvable at any price.
There is a wide variety of applications where semantic technologies can provide key benefits. At their core, semantic approaches are an infrastructure capability that, when combined with other key technologies, represents the next wave of computing. When taken with a multi-year view, there is great promise that these technologies will help the IT industry reach the ever-elusive goal of truly adaptive computing. In some respects, though, the future is already happening. Commercial enterprises and government agencies are implementing production-level programs using existing semantic data stores, ontologies, toolsets, and applications. A few of these near-term project areas include Semantic Web services, semantic interoperability, and intelligent search.
Semantic Web Services
A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. A Web service has an interface described in a machine-processable format using Web Services Description Language (WSDL). The combination of WSDL, UDDI, and SOAP forms a triad of technologies that will shift the entire market toward service-oriented architectures (SOA). Together, these technologies provide directory, component lookup, and exchange protocol services on top of an HTTP or SMTP network protocol. Microsoft, IBM, and most other large software vendors have embraced the concepts and languages that underlie the Web services model, and an increasing number of books and industry articles point to the benefits of adopting a service-oriented architecture. Web services, however, are not without shortcomings. Security issues have long been a concern, but several solutions that address these issues have been introduced over the last several years. Perhaps the most significant improvement opportunities for Web services that remain are in the areas of
(a) flexible look-up and discovery and
(b) information management and schema transformation.
Fundamentally, Web service technologies handle messages in a loosely coupled manner but they do not currently bridge differences in description terminologies nor do they inherently enable the recipient to understand a message that has been sent. With Web services, these parts of the exchange rely on custom-coded solutions and/or widespread community agreement upon some kind of document exchange standard (the latter is rarely achieved).
This difficulty in ensuring flexible discovery and service initiation, as well as seamless operational use of information exchanged with Web services, has led to W3C’s efforts to incorporate semantic technologies as part of its Semantic Web Services initiative. Semantic Web Services are a Web service implementation that leverages the Web Ontology Language for Services specification (OWL-S) to provide a flexible framework for describing and initiating Web services. OWL-S supplies Web service providers with a core set of markup language constructs for describing the properties and capabilities of their Web services in unambiguous, computer-interpretable form. OWL-S markup of Web services will facilitate the automation of Web service tasks, including automated Web service discovery, execution, interoperation, composition, and execution monitoring. Following the layered approach to markup language development, the current version of OWL-S builds upon W3C’s standard OWL.
Formally put, the use of semantic technologies makes it possible to describe the logical nature and context of the information being exchanged, while allowing for maximum independence among communicating parties. The results are greater transparency and more dynamic communication among information domains irrespective of business logic, processes, and workflows (Pollock and Hodgson, 2004).
The technical vision is one where flexible information models, not inflexible programs or code, are used to drive dynamic, self-healing, and emergent infrastructures for the sharing of mission-critical data in massively scalable environments. Recent advances in taxonomy and thesaurus technology, context modeling approaches, inferencing technology, and ontology-driven interoperability can be applied in a cohesive framework that dramatically changes the way information is managed in dispersed, decentralized communities of knowledge (Pollock and Hodgson, 2004).
NASA views semantic interoperability as an extremely promising way to make information available to all stakeholders without having to standardize on a particular format or vocabulary or re-key databases to conform to a uniform model. One example where NASA is using these concepts is to address serious and ongoing maintenance problems related to the aging wiring systems within the Space Shuttle fleet. The existing set of wiring system databases contains information about part specifications, bills of materials, drawings, change orders, standard practices, test procedures, test reports, inspection reports, failure tracking and reporting information, work orders, and repair disposition documentation. Tens of diverse databases and systems – each supporting different but related aspects of engineering and design work – are in use within NASA with related data dispersed among several contracting companies that support the Space Shuttle program. Troubleshooting wiring problems requires timely access to many cross-organizational systems, databases, and knowledge repositories, the breadth of which is enormous. The situation becomes especially critical for diagnosing and troubleshooting in-flight anomalies whereby a timely resolution is mission-critical as well as life-critical. The work to make these sets of data richer and more accessible to the numerous parties who need access to them is still in its early stages, but as highlighted in the quote at the beginning of this chapter, semantic technologies represent one of the more promising ways to address what is largely an unsolvable problem using current data integration approaches.
A highly distilled version of how such a project works is as follows. Design-time tools are used to develop RDF and OWL models that encompass a particular domain. These models could be based on existing XML standards or defined via other means. Other design-time tools can be used to flexibly map specific data representations to these models, thus eliminating the need to explicitly convert applications to adopt a certain data standard. Run-time processes can then use these models and maps as pivot tables to transform data from source to target or to perform federated queries from a single query statement. Semantic interoperability frameworks of this type can provide a solid basis for better resolving differences in syntax, structure, and semantics – ushering in a future where organizations can agree to disagree, yet still share data and interoperate without having to change their current methods of operation.
One of the key advantages of using semantic interoperability approaches is that they do not necessarily require the replacement of existing integration technologies, databases, or software applications. A semantic framework made up of various semantics-based components and application program interfaces (APIs) can be deployed with web services or traditional middleware APIs to leverage existing infrastructure investments, and yet still provide massive benefits by virtually centralizing the query, transformation, and business rules metadata that flows through the network infrastructure’s pipes. As such, the software will fit into the customer’s existing IT ecosystem with low overhead for installation, minimal coding, and maximum reusability (Pollock and Hodgson, 2004).
Related in some regards to semantic interoperability is the area of intelligent search. As mentioned above, semantic interoperability techniques can allow queries native to one system to be federated to other non-native systems. This eliminates the need to convert systems to a universal query language and enables systems to continue maintaining the information they have in their current formats. By overlaying a virtual layer on top of the data sources, queries can be defined in a universal manner, thereby enabling access to all mapped assets. Federated searching can also be made smarter by making searches more semantically precise. In other words, searches can be broadened to include concepts, or narrowed to include only specific key words. Controlling the depth, or granularity, of a search in this way lets individuals specify exactly the search they want. Another aspect of intelligent search is the ability to make searches more relevant to the person searching by making use of identity and relationship information. Relationships among people and information about them can be key links to greater relevance and confidence. Despite investments in knowledge management systems, many people still rely on their personal network of friends, neighbors, co-workers, and others to locate experts or find trusted information. Personal relationships are also useful in sales situations and in many organizational interactions. Social networking schemas and software are making broad use of this.
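Broadening a search from key words to concepts can be sketched with a small concept hierarchy. The hierarchy, the documents, and the function names below are assumptions for illustration; a production system would draw the hierarchy from an ontology or thesaurus rather than a hand-written dictionary.

```python
# Hypothetical concept hierarchy: each concept lists its narrower terms.
NARROWER = {
    "vehicle": ["car", "truck"],
    "car": ["sedan"],
}

# Hypothetical document collection keyed by document id.
DOCS = {
    1: "sedan maintenance schedule",
    2: "truck tire recall",
    3: "office lease terms",
}


def expand(concept):
    """Return the concept plus every narrower term beneath it."""
    terms = [concept]
    for child in NARROWER.get(concept, []):
        terms.extend(expand(child))
    return terms


def search(query, broaden=False):
    """Key-word search, optionally broadened to the whole concept."""
    terms = expand(query) if broaden else [query]
    return sorted(doc_id for doc_id, text in DOCS.items()
                  if any(term in text.split() for term in terms))


print(search("vehicle"))                # exact key word only
print(search("vehicle", broaden=True))  # concept and its narrower terms
```

The narrow search finds nothing because no document contains the word itself, while the broadened search reaches documents that mention only narrower terms.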
An example of how this information can be used on a larger scale is the case of a telephone company exploring technologies for providing more intelligent phone number look-up. Instead of providing a generic list of matched names, the telecommunications company is looking at combining information about the person searching and the list of possible names, in order to provide a more intelligent match. For example, inferring relationships between social networks could provide information on whether a person is known, or could be expected to be known, to the other person (by employing friend-of-a-friend forms of calculations). Other information such as locations or schools attended or past or current jobs could be used to infer matches. To be sure, there are significant privacy issues involved; many believe, however, that techniques such as hashing and encryption of personally identifiable information and progressive disclosure will likely resolve many privacy concerns. Semantic approaches for enabling intelligent search are beginning to find their way into knowledge management systems. Whereas current knowledge management systems tend to exist within their own silos and have difficulty crossing organizational boundaries, intelligent search techniques can be added as overlays to existing information infrastructures, thereby bridging physical data formats, knowledge domains, and organizational structures.
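A friend-of-a-friend calculation of the kind mentioned above can be sketched as a breadth-first search over a social graph. The graph, the names, and the idea of ranking matches by hop count are assumptions for illustration and do not describe the telephone company's actual approach.

```python
from collections import deque

# Hypothetical "knows" graph: each person maps to the people they know.
KNOWS = {
    "ann": {"bob"},
    "bob": {"ann", "cy"},
    "cy": {"bob", "dee"},
    "dee": {"cy"},
    "eve": set(),
}


def degrees_of_separation(graph, start, goal):
    """Breadth-first search; returns the hop count, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        for friend in graph.get(person, ()):
            if friend == goal:
                return dist + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None


print(degrees_of_separation(KNOWS, "ann", "cy"))   # friend of a friend
print(degrees_of_separation(KNOWS, "ann", "eve"))  # no known connection
```

A look-up service could, under these assumptions, list candidates with a small hop count ahead of those with no known connection to the person searching.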