-  中文XML论坛 (Chinese XML Forum) - a discussion board for XML technology  (http://bbs.xml.org.cn/index.asp)
--  『 WORD to XML, HTML to XML 』  (http://bbs.xml.org.cn/list.asp?boardid=13)
----  Automatic conversion from XML to RDF  (http://bbs.xml.org.cn/dispbbs.asp?boardid=13&rootid=&id=43632)


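--  Author: xiaoweixiong
--  Posted: 3/5/2007 1:15:00 PM

--  Automatic conversion from XML to RDF
Does anyone know how to implement "automatic conversion from XML to RDF using GRDDL"? I'm offering payment!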
--  Author: admin
--  Posted: 4/3/2007 11:47:00 PM

--  
Converting XML to RDF
by [URL=http://www.xml.com/pub/au/42]Bob DuCharme[/URL]
September 01, 2004


[URL=http://www.xml.com/pub/a/2004/08/04/tr-xml.html]Last month[/URL] we looked at the REST interface to Amazon Web Services (AWS), and how an f parameter in a URL calling this interface can point to an XSLT stylesheet. If you set f to "xml" instead of pointing it at a stylesheet, Amazon returns data in formats that conform to either the "lite" or "heavy" DTDs (and corresponding schemas) included with their SDK; if you do point it at a stylesheet, Amazon applies that stylesheet to the data on its own server before returning the result to you.

In that column, I promised to show how to use this feature to pull RDF from the Amazon servers. I had written a stylesheet called aws2rdf.xsl, but the more I thought about it the more I realized that such a stylesheet needed very few dependencies on the Amazon Web Services DTDs, and that it could convert a wide variety of XML to RDF. So, I revised and renamed it to xml2rdf.xsl, and we'll look at it here.

RDF and Data-Oriented XML
RDF/XML sometimes looks strange, but it doesn't need to. [URL=http://www.xml.com/pub/a/2002/10/30/rdf-friendly.html]RDF-friendly[/URL] XML adds a few things to otherwise typical-looking XML so that an RDF parser can treat all of its information as RDF triples.

This is not too difficult as long as your XML has no text nodes with elements as siblings. For example, <p>this p element has <emph>three</emph> text nodes and <emph>two</emph> emph elements</p>. XML developers often call this "mixed content" because the p element's contents are a mix of text nodes and elements. The [URL=http://www.w3.org/TR/xml11#sec-mixed-content]official definition[/URL] of mixed content, however, is any element type that may have any character data, <p>even a p element like this</p>.

An element that has only character data and isn't "mixed" in the more popular sense can often be converted to RDF/XML without much trouble. Many applications use these elements, along with [URL=http://www.w3.org/TR/xml11/#sec-element-content]element content[/URL] container elements that group these elements, to represent transactions and database records — what people often call "data-oriented" XML, despite the fact that all XML is data (or rather, [URL=http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-intro]data objects[/URL]). The kind of XML used to describe narrative content for publication in one medium or another — what people call "document-oriented" XML, despite [URL=http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-intro]all XML being in documents[/URL] — is more likely to have elements and text nodes as siblings of each other (like in the first p example in the preceding paragraph), and is not a good candidate for automated conversion to RDF.
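To make the distinction concrete, here is a minimal Python sketch (not part of the original article) that flags elements whose text nodes have elements as siblings, which is the popular sense of "mixed content" described above. The function names are made up for illustration, and the ProductName element in the sample record is a hypothetical placeholder.

```python
import xml.etree.ElementTree as ET


def has_text_with_element_siblings(elem):
    """True if elem has child elements and non-whitespace text among them."""
    if len(elem) == 0:
        return False  # only character data (or empty): no element children
    if elem.text and elem.text.strip():
        return True  # text before the first child element
    # child.tail holds the text that follows each child element
    return any(child.tail and child.tail.strip() for child in elem)


def is_rdf_candidate(root):
    """True if no element in the tree mixes text nodes with element siblings."""
    return not any(has_text_with_element_siblings(e) for e in root.iter())


# The p example from the text: text nodes and emph elements are siblings.
narrative = ET.fromstring(
    "<p>this p element has <emph>three</emph> text nodes "
    "and <emph>two</emph> emph elements</p>"
)
# A record-like document; ProductName is a hypothetical element name.
record = ET.fromstring(
    "<Details><Asin>B00005Q567</Asin><ProductName>Example</ProductName></Details>"
)

print(is_rdf_candidate(narrative))  # False
print(is_rdf_candidate(record))     # True
```

Whitespace used only for indentation is ignored, so pretty-printed "data-oriented" documents still count as candidates.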

The data being returned by Amazon Web Services, which obviously came from relational databases somewhere, is a fine candidate for conversion to RDF. Besides, Amazon is in the business of selling physical objects, and its site provides metadata about those objects. Having that data in RDF-friendly XML makes it easier to link this metadata with other metadata, thereby extending the potential reach of the Semantic Web.

A Somewhat Generic XML to RDF Converter
When processing XML documents that are good candidates for conversion to RDF/XML, a stylesheet can handle certain tasks generically. Other tasks require modifications to the conversion stylesheet to prepare it for the specific input that's coming. The generic parts of the stylesheet below, which come after the comment beginning with the words "End of template rules addressing," automate the advice given in the XML.com article [URL=http://www.xml.com/pub/a/2002/10/30/rdf-friendly.html]Make Your XML RDF-Friendly[/URL]. Rule numbers mentioned below refer to the numbered pieces of advice in that article.

The first half of the stylesheet has the parts that require editing to prepare the stylesheet for your particular source documents. The bold parts show my customizations to tailor the stylesheet for documents returned by Amazon Web Services:

As Rule Number 1 says, make sure that every element comes from a specific namespace, so the namespace must be declared. I clipped the filename off the URIs used for the U.S./Japan versions of the DTDs and schemas to come up with http://xml.amazon.com/schemas3/ as an Amazon Web Services namespace URI.

The result of the transformation will be metadata about a single resource, and the "resourceURL" variable is where the stylesheet stores the URL of that resource. While there are several variations on the basic URI that take you to the web page describing a particular book on Amazon, the developer's kit describes a format of http://www.amazon.com/exec/obidos/ASIN/ followed by the ASIN number, so the stylesheet below constructs this URL by appending the ASIN number (using an XPath expression to pull it out of the XML) to that URI string.

The generic code later in the stylesheet uses the namespace prefix for the described resource's properties in several different places, so storing it in a variable lets us leave the generic code alone. This should be the prefix declared with the namespace URI added to the xsl:stylesheet start-tag — in this case, "aws."

You won't necessarily want every element in your source document passed along to your RDF version, so add the names of the ones to suppress to the stylesheet's first template rule.

Similarly, certain container elements in the source won't add anything to the RDF version, so adding their names to the second template rule tells the stylesheet to pass along their contents without their enclosing tags. (As we'll see, certain containers are very useful, so we'll keep them.)


<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                xmlns:aws="http://xml.amazon.com/schemas3/">

  <!-- Convert XML to RDF that all describes one resource. Template
       rules after "End of template rules" comment are generic; those
       before are for customizing treatment of source XML
       (e.g. deleting elements). -->

  <!-- URL of the resource being described. -->
  <xsl:variable name="resourceURL">
    <xsl:text>http://www.amazon.com/exec/obidos/ASIN/</xsl:text>
    <xsl:value-of select="/ProductInfo/Details/Asin"/>
  </xsl:variable>

  <!-- Namespace prefix for predicates. Needs a corresponding xmlns
       declaration in the xsl:stylesheet start-tag above. If your set
       of predicates comes from more than one namespace, then this
       stylesheet is too simple for your needs. -->
  <xsl:variable name="nsPrefix">aws</xsl:variable>

  <!-- Elements to suppress. The priority attribute is necessary
       because the template that adds rdf:parseType (below) would
       otherwise take precedence. -->
  <xsl:template priority="1" match="Request|TotalResults|TotalPages"/>

  <!-- Just pass along contents without tags.  -->
  <xsl:template match="ProductInfo|Details">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- ========================================================
       End of template rules addressing specific element types.
       Remaining template rules are generic xml2rdf template rules.
       ======================================================== -->

  <xsl:template match="/">
    <rdf:RDF>
      <rdf:Description
       rdf:about="{$resourceURL}">
        <xsl:apply-templates/>
      </rdf:Description>
    </rdf:RDF>
  </xsl:template>

  <!-- Elements with URLs as content: convert them to store
       their value in rdf:resource attribute of empty element -->
  <xsl:template match="*[starts-with(.,'http://') or starts-with(.,'urn:')]">
    <xsl:element name="{$nsPrefix}:{name()}">
      <xsl:attribute name="rdf:resource">
        <xsl:value-of select="."/>
      </xsl:attribute>
    </xsl:element>
  </xsl:template>

  <!-- Container elements: if the element has children and an element parent
       (i.e. it isn't the root element) and it has no attributes, add
       rdf:parseType = "Resource". -->

  <xsl:template match="*[* and ../../* and not(@*)]">
    <xsl:element name="{$nsPrefix}:{name()}">
      <xsl:attribute name="rdf:parseType">Resource</xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Copy remaining elements, putting them in a namespace. -->
  <xsl:template match="*">
    <xsl:element name="{$nsPrefix}:{name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>

The generic part of the stylesheet has four template rules:

The first template rule in the generic part (the third template rule in the stylesheet) wraps the contents in an rdf:RDF element and identifies the resource being described.

The next template rule implements RDF-friendliness Rule Number 4, converting any elements whose contents consist of a URI (or rather, any elements whose contents begin with "http://" or "urn:") into empty elements with the URI stored in an rdf:resource attribute.

The stylesheet's second-to-last template rule follows the advice given near the end of RDF-friendliness Rule 6 by adding an rdf:parseType attribute with a value of "Resource" to container elements that aren't the root element of the document. This way, these containers won't throw off the striping pattern of nested predicate/object pairs that an RDF processor expects to find in an RDF/XML document.

The stylesheet's last template rule copies any elements not covered by the other template rules to the result tree with the namespace prefix from the nsPrefix variable added onto their names.
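For readers who want to trace the rules without an XSLT processor, the following Python sketch loosely mirrors them using only the standard library. It is an approximation written for this thread, not the author's code: attributes and whitespace-only text are ignored, the order of the checks only roughly follows XSLT template priorities, and the sample document (including the url attribute, ImageUrlSmall, Artists, Artist, and the example.com address) is invented for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
AWS_NS = "http://xml.amazon.com/schemas3/"  # namespace URI picked in the article
ASIN_BASE = "http://www.amazon.com/exec/obidos/ASIN/"
SUPPRESS = {"Request", "TotalResults", "TotalPages"}  # elements to drop
UNWRAP = {"ProductInfo", "Details"}  # containers whose tags are dropped

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("aws", AWS_NS)


def convert(elem, out_parent, is_root=False):
    """Apply the first matching rule to elem, in rough priority order."""
    if elem.tag in SUPPRESS:
        return
    # Container rule: has children, is not the root, has no attributes.
    is_container = len(elem) > 0 and not is_root and not elem.attrib
    value = "".join(elem.itertext())
    is_url = value.startswith(("http://", "urn:"))
    if not is_container and not is_url and elem.tag in UNWRAP:
        for child in elem:  # pass children along without the enclosing tags
            convert(child, out_parent)
        return
    out = ET.SubElement(out_parent, f"{{{AWS_NS}}}{elem.tag}")
    if is_container:
        out.set(f"{{{RDF_NS}}}parseType", "Resource")
    elif is_url:
        out.set(f"{{{RDF_NS}}}resource", value)  # empty element, URI in attribute
        return
    if len(elem) == 0:
        out.text = elem.text  # leaf element: copy its character data
    for child in elem:
        convert(child, out)


def xml_to_rdf(root):
    """Wrap the converted tree in rdf:RDF/rdf:Description for one resource."""
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description")
    desc.set(f"{{{RDF_NS}}}about", ASIN_BASE + root.findtext("Details/Asin", ""))
    convert(root, desc, is_root=True)
    return rdf


# Invented sample input; only the element names used in the stylesheet are real.
SAMPLE = """<ProductInfo>
  <Request><Args/></Request>
  <Details url="http://www.amazon.com/exec/obidos/ASIN/B00005Q567">
    <Asin>B00005Q567</Asin>
    <ImageUrlSmall>http://images.example.com/small.jpg</ImageUrlSmall>
    <Artists><Artist>The Velvet Underground</Artist></Artists>
  </Details>
</ProductInfo>"""

rdf = xml_to_rdf(ET.fromstring(SAMPLE))
print(ET.tostring(rdf, encoding="unicode"))
```

In this sample, Request is dropped, ProductInfo and Details are unwrapped, ImageUrlSmall becomes an empty element with an rdf:resource attribute, and Artists gets rdf:parseType="Resource".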

I tested this with both "lite" and "heavy" XML returned by Amazon Web Services for various books, CDs, authors, and bands, and the [URL=http://www.hpl.hp.com/personal/jjc/arp/]ARP2[/URL] RDF parser had no problem with any of the results. (For authors and bands, though, the RDF isn't quite semantically correct, because all of the triples created by the stylesheet have the same subject, so it makes more sense to use this for Amazon pages that describe a single work such as a book or CD.) For example, with the stylesheet stored at [URL=http://www.snee.com/xsl/xml2rdf.xsl]http://www.snee.com/xsl/xml2rdf.xsl[/URL], the following REST URL (with carriage returns deleted and a working developer ID substituted for "dev-ID-here") retrieves kosher RDF metadata (saved version [URL=http://www.snee.com/xsl/quinetapes.rdf]here[/URL]; when viewing with a browser, do a View Source to see the RDF/XML) about the boxed set of Robert Quine's live recordings of the Velvet Underground:

http://xml.amazon.com/onca/xml3?locale=us&t=bobducharmeA
&dev-t=dev-ID-here&AsinSearch=B00005Q567&mode=music
&type=heavy&f=http://www.snee.com/xsl/xml2rdf.xsl
    
With the appropriate revisions to the bold parts of the stylesheet above, there's a lot of regularly structured XML out there that could be converted to RDF. The great thing about using it on XML returned by Amazon Web Services is that we can execute the XSLT transformation on Amazon's servers, so a single REST URL can retrieve RDF directly from Amazon. This is the power that Amazon has put into our hands by letting us use its server-side XSLT processor with its database.
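As a small illustration of how such a REST URL is put together, the Python sketch below assembles the same query string from its parts. It only builds the string and makes no request; the endpoint and parameter names are the 2004-era ones quoted in the article and are not assumed to work today. The function name build_request_url is hypothetical.

```python
from urllib.parse import urlencode

BASE = "http://xml.amazon.com/onca/xml3"  # endpoint as quoted in the article


def build_request_url(dev_token, asin, stylesheet_url):
    """Assemble the AWS REST URL with f pointing at an XSLT stylesheet."""
    params = {
        "locale": "us",
        "t": "bobducharmeA",
        "dev-t": dev_token,
        "AsinSearch": asin,
        "mode": "music",
        "type": "heavy",
        "f": stylesheet_url,
    }
    # safe=":/" leaves the stylesheet URL unescaped, as in the article's example
    return BASE + "?" + urlencode(params, safe=":/")


url = build_request_url(
    "dev-ID-here", "B00005Q567", "http://www.snee.com/xsl/xml2rdf.xsl"
)
print(url)
```

Setting stylesheet_url to "xml" instead would correspond to asking for the untransformed "lite" or "heavy" data.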

(For more on mapping XML to RDF using XSLT, see Michael Sperberg-McQueen and Eric Miller's Extreme 2004 paper [URL=http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Sperberg-McQueen01/EML2004Sperberg-McQueen01.html]On mapping from colloquial XML to RDF using XSLT[/URL].)


--  Author: admin
--  Posted: 4/3/2007 11:50:00 PM

--  
Here is another paper devoted specifically to XML-to-RDF conversion:

http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Sperberg-McQueen01/EML2004Sperberg-McQueen01.html

Below is the Introduction section of that paper:


On mapping from colloquial XML to RDF using XSLT

C. M. Sperberg-McQueen [World Wide Web Consortium, MIT Computer Science and AI Laboratory]


Eric Miller [World Wide Web Consortium, MIT Computer Science and AI Laboratory]
Introduction
Let us begin by trying to make explicit some assumptions we are making which may or may not be fully shared by others.

The mapping problem
An application of XML or SGML defines what some people call a markup language, and other people would prefer to refer to as a markup vocabulary or namespace. Since some people prefer to reserve the term markup language for meta-languages like XML and SGML, the following discussion will use the term vocabulary — without, however, intending to obscure the fact that the XML-based applications in question do have rules that go beyond the provision of names and may be captured in whole or in part by syntactic formalisms.

Many people (vocabulary designers, schema and DTD authors, application developers, people trying to make it easier to work with documents in markup languages designed by others, and no doubt others, too) wish to say, for specific constructs in a vocabulary, what they mean. By the constructs of a vocabulary we mean primarily the element (type)s, attributes, notations, processing-instruction targets, and entities defined in that vocabulary; in some cases, it is convenient also to include simple or complex datatypes and substitution groups (as in XML Schema 1.0), non-terminals (as in Relax and Relax NG), classes (as in the ODD system used to generate the Text Encoding Initiative DTDs), or other abstractions under this term.

Some of those who wish to say what markup constructs mean wish to do so using some machine-processable notation; others would be happy with better tools for human-understandable documentation. We are here concerned mostly with the former, though good rules for machine-processable specification of meaning may also help make meaning clear to humans.

Two difficulties attend any effort to say what markup constructs mean. First of all, different people have very different ideas of what would be involved. And second, if such an attempt is not to remain a purely individual mental exercise, the results must be written down or spoken in some language with its own syntactic rules. What may have started as an attempt to focus on semantics to the exclusion of syntax thus concludes by looking like just another translation from one syntax to another. Let us examine these two difficulties in more detail.

First, different people have very different ideas of what it would mean, for the constructs of a vocabulary, to say what they mean. For purposes of discussion, we identify five. The first four are all mapping problems in one way or another:

Some people mean by this that they wish to be able to specify how data structures internal to some application software are serialized as XML, or how XML is de-serialized into data structures; questions like "When does an element become an object of class Foo, and when does it become an object of class Foobar?", asked with reference to some set of object classes defined in some programming language, are central to their concerns. Call this the concrete data-structure mapping problem. An example of this approach is [Krupnikov/Thompson 2001].
Others wish to specify how to map XML document instances into columns, rows, and tables in some SQL database management system; sometimes they wish to specify a mapping into new rows of existing tables, and sometimes what is needed is a mapping which would specify which new tables to create. Call this the abstract data structure mapping problem. It differs from the concrete data structure mapping problem as the abstraction of a SQL table differs from the various programming-language constructs which might be used to implement the abstraction. See [Vorthmann/Buck 2000a], [Vorthmann/Buck 2000b].
Still others wish to specify a mapping into first-order predicate calculus as a way of defining the correct interpretation of markup. Call this the FOPC mapping problem. Cf. [Sperberg-McQueen/Huitfeldt/Renear 2001a].
Some wish to map arbitrary XML into RDF. Call this the RDF mapping problem. See for example [Hazaël-Massieux/Connolly 2004].
These four mapping problems seem to cover the most frequently discussed ground among those interested primarily in machine-processable descriptions of meaning, but we have no proof that the classification is necessarily exhaustive, and nothing in the further argument requires that it be exhaustive.

Some people believe that the four mapping problems described above do not necessarily have much in common. Others believe that all of them are at root ‘the same thing’. Mostly, they seem to mean by this that if a formalism is provided for what they wish to do, they believe that everyone else's requirements will be met. They do not, in general, seem to mean that if anyone else's requirements are met, they will be able to do what they wish to do.

The four mapping problems identified above do have in common that they involve defining a meaning-preserving mapping from XML notation into some other model. Let us call this other model the target model. If the target model has a syntax in which it can be serialized, let us call that syntax the target formalism. We will be concerned only with target models which can be serialized in this way; it may be possible to extend our proposals to some models with non-serial notations, but not to ineffable models (those to which no notation at all is adequate).

If the target model has a corresponding target formalism, then all four of the mapping problems can be conceived of as involving the translation of information from one syntax (XML) into some other syntax. A mapping problem may thus be conceived of as a syntax-to-syntax translation even if, in practice, the result desired is not a string of characters denoting some abstraction, but some other representation of the abstraction (such as an in-memory data structure).

A fifth idea of what it means to describe the meaning of a vocabulary should also be mentioned:

Some wish to communicate enough information about each construct in a vocabulary to other human beings to enable them to recognize and use the elements and attributes correctly. Call this the documentation problem.
The documentation problem is known to be soluble, but the solution is not easy: intelligent humans must write clear natural-language descriptions of the vocabulary, and attentive humans must read them and interpret them correctly. This is straightforward but not automatable. Numerous vocabularies for describing markup vocabularies have been developed and used over the years; their use may make the construction of useful, fairly complete documentation easier, but they cannot make it mechanical. Nothing in this paper reduces the importance of the documentation problem or makes it any easier to solve.

Any solution to the documentation problem produces, by definition, correct understanding on the part of a hearer or reader. Since such understanding is a pre-requisite to the creation of any meaning-preserving transformation or mapping, it may be noted that the solution of the documentation problem appears to be a prerequisite to any solution of any of the mapping problems, except where the mapping problems are solved by the original specifier of a vocabulary without the need for communication with any other humans. The converse is not true: solutions to the mapping problems are neither prerequisites nor necessary consequences of solutions to the documentation problem. Solving the mapping problems, on the other hand, would make it possible to perform more useful work with marked up data without involving the need for quite so many attentive human programmers. This might be advantageous because attentive human programmers are commonly in short supply.

The fifth idea of meaning brings us to the second difficulty identified above. Reducing the mapping problem to a translation problem operating at the level of syntax may trouble some, particularly those interested in the documentation problem. If we regard the realm of meanings as distinct in some way from that of utterances or syntax, we are bound to be disappointed in a mapping-oriented solution, because the solution seems to shift ground from the ethereal to the mundane. But strictly speaking, any useful formulation of semantics is reducible in this way to a problem operating at the level of syntax. Any attempt to say what markup means necessarily involves constructing some utterance in some perceivable form. That utterance can only be described and interpreted in terms of some syntax: without syntax, all meanings are ineffable.

The involvement of the syntactic layer does not (pace some reviewers of this paper) render the mapping problems mentioned above meaningless, nor does it divorce them from meaning. The mapping problems are not solved by arbitrary mappings from XML into the target models, but only by mappings which retain the meaning of the original. (In specialized cases, it may suffice for practical purposes to capture only part of the meaning.) It is easy to dismiss these as merely pushing a bump in the rug from one location to another: having translated from XML into some other notation, are we not still faced with the task of specifying the meaning of that other notation? In cases where the target notation is as opaque to us as the original notation, the criticism has some justice. When the target notation is well understood, however, the translation does precisely what is needed. And we stress again: every successful explanation takes the form of translation from one syntax into another. The documentation problem is also a mapping problem and differs from the others only in substituting the syntax of English or French or some other natural language for the machine-processable target syntaxes of the other views. On the positive side, the fact that all specification of semantics is thus reducible to a problem in specifying syntactic transformations means that we can directly exploit without embarrassment the long history of work on mechanisms for syntax-driven transformations of marked up data.

Since in the long run every notation must be explained to be useful, it is an inescapable prerequisite for any useful work with marked up data that the documentation problem be solved for some notation or other. Since in the long run one of the main reasons for using markup is to reduce the need for human intervention in routine information processing, however, solving the documentation problem alone will not suffice to allow us to exploit markup to full advantage. Hence this paper's emphasis on machine-processable target notations.


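--  Author: duotiger
--  Posted: 4/4/2007 11:03:00 AM

--  
Wow~~
--  Author: peachpig
--  Posted: 4/4/2007 8:28:00 PM

--  
Wow~
--  Author: xiaoweixiong
--  Posted: 4/12/2007 12:43:00 PM

--  
Thanks to the poster above. Could you give me some pointers on the general process?
--  Author: magiclmm
--  Posted: 4/13/2007 3:03:00 PM

--  
Wow, so people are actually discussing this. I'd like to ask as well: is there a tool that does the conversion directly?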

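--  Author: micropuss
--  Posted: 4/8/2008 12:06:00 PM

--  
Good thread, bumping it!

Many people want a ready-made tool, and so do I, but I found the ready-made ones hard to use, so it is better to build your own. After all, other people built theirs according to their own ideas!!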


--  Author: wacn
--  Posted: 4/15/2008 9:38:00 AM

--  
Can OWL be converted to RDF?
--  Author: dengluxin
--  Posted: 4/16/2008 4:33:00 PM

--  
How do you convert between XML, OWL, and RDF? Are there any usable tools?
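--  Author: micropuss
--  Posted: 4/16/2008 10:26:00 PM

--  
Quoting dengluxin's post of 2008-4-16 16:33:00:
How do you convert between XML, OWL, and RDF? Are there any usable tools?

There are some, but they are hard to use. Take a look if you are interested; several similar projects are also in progress.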

http://seed.uma.pt/projects/jxml2owl/jxml2owlapi/index.html


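--  Author: XSLFO
--  Posted: 4/18/2008 11:02:00 PM

--  
You could implement it with XSL:FO. What are your styling requirements?
--  Author: Humphrey
--  Posted: 5/21/2009 8:26:00 PM

--  
In fact, the RDF files we usually encounter are written in XML syntax, and OWL likewise has its roots in XML syntax. But these three file formats differ in how rigidly they describe data, so unless the conversion is implemented for a specific purpose you may run into some rather annoying problems.
Also, what is the main goal of going to such lengths to implement this kind of lateral conversion? Is it really necessary?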