Chinese XML Forum (中文XML论坛), a professional XML technology discussion board (http://bbs.xml.org.cn/index.asp) -- 『 WORD to XML, HTML to XML 』 (http://bbs.xml.org.cn/list.asp?boardid=13) ---- Automatic conversion from XML to RDF (http://bbs.xml.org.cn/dispbbs.asp?boardid=13&rootid=&id=43632) |
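-- Author: xiaoweixiong -- Posted: 3/5/2007 1:15:00 PM -- Automatic conversion from XML to RDF. Does anyone know how to implement "automatic conversion from XML to RDF using GRDDL"? This is paid work! |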
-- Author: admin -- Posted: 4/3/2007 11:47:00 PM -- Converting XML to RDF by [URL=http://www.xml.com/pub/au/42]Bob DuCharme[/URL] September 01, 2004
In that column, I promised to show how to use this feature to pull RDF from the Amazon servers. I had written a stylesheet called aws2rdf.xsl, but the more I thought about it the more I realized that such a stylesheet needed very few dependencies on the Amazon Web Services DTDs, and that it could convert a wide variety of XML to RDF. So, I revised and renamed it to xml2rdf.xsl, and we'll look at it here.
RDF and Data-Oriented XML
This is not too difficult as long as your XML has no text nodes with elements as siblings. For example, <p>this p element has <emph>three</emph> text nodes and <emph>two</emph> emph elements</p>. XML developers often call this "mixed content" because the p element's contents are a mix of text nodes and elements. The [URL=http://www.w3.org/TR/xml11#sec-mixed-content]official definition[/URL] of mixed content, however, is any element type that may have any character data, <p>even a p element like this</p>. An element that has only character data and isn't "mixed" in the more popular sense can often be converted to RDF/XML without much trouble. Many applications use these elements, along with [URL=http://www.w3.org/TR/xml11/#sec-element-content]element content[/URL] container elements that group these elements, to represent transactions and database records, which people often call "data-oriented" XML, despite the fact that all XML is data (or rather, [URL=http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-intro]data objects[/URL]). 
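The distinction drawn above between "mixed" and data-oriented content can be checked mechanically before attempting a conversion. The following is a minimal Python sketch, not part of the article: the function names `has_mixed_content` and `find_mixed_elements` and the `Details`/`ProductName`/`Asin` sample record are assumptions made for illustration, and the check follows the popular sense of "mixed content" (text nodes with element siblings), not the official XML definition.

```python
import xml.etree.ElementTree as ET


def has_mixed_content(elem):
    """Return True if elem has child elements AND non-whitespace text beside them."""
    children = list(elem)
    if not children:
        # Only character data (or nothing): not "mixed" in the popular sense.
        return False
    if elem.text and elem.text.strip():
        return True  # text before the first child element
    # Text following any child element is stored on that child's .tail.
    return any(child.tail and child.tail.strip() for child in children)


def find_mixed_elements(root):
    """Return the tags of all elements (root included) that have mixed content."""
    return [e.tag for e in root.iter() if has_mixed_content(e)]


narrative = ET.fromstring(
    "<p>this p element has <emph>three</emph> text nodes and "
    "<emph>two</emph> emph elements</p>"
)
# Hypothetical data-oriented record, used only as a contrast case.
record = ET.fromstring(
    "<Details><ProductName>Example</ProductName><Asin>0000000000</Asin></Details>"
)

print(find_mixed_elements(narrative))
print(find_mixed_elements(record))
```

A document for which `find_mixed_elements` returns an empty list is, by the article's reasoning, a more plausible candidate for automated conversion to RDF.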
The kind of XML used to describe narrative content for publication in one medium or another (what people call "document-oriented" XML, despite [URL=http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-intro]all XML being in documents[/URL]) is more likely to have elements and text nodes as siblings of each other (like in the first p example in the preceding paragraph), and is not a good candidate for automated conversion to RDF. The data being returned by Amazon Web Services, which obviously came from relational databases somewhere, is a fine candidate for conversion to RDF. Besides, Amazon is in the business of selling physical objects, and its site provides metadata about those objects. Having that data in RDF-friendly XML makes it easier to link this metadata with other metadata, thereby extending the potential reach of the Semantic Web.
A Somewhat Generic XML to RDF Converter
The first half of the stylesheet has the parts that require editing to prepare the stylesheet for your particular source documents. The bold parts show my customizations to tailor the stylesheet for documents returned by Amazon Web Services: As Rule Number 1 says, make sure that every element comes from a specific namespace, so the namespace must be declared. I clipped the filename off the URIs used for the U.S./Japan versions of the DTDs and schemas to come up with http://xml.amazon.com/schemas3/ as an Amazon Web Services namespace URI. The result of the transformation will be metadata about a single resource, and the "resourceURL" variable is where the stylesheet stores the URL of that resource. While there are several variations on the basic URI that take you to the web page describing a particular book on Amazon, the developer's kit describes a format of http://www.amazon.com/exec/obidos/ASIN/ followed by the ASIN number, so the stylesheet below constructs this URL by appending the ASIN number (using an XPath expression to pull it out of the XML) to that URI string. 
The generic code later in the stylesheet uses the namespace prefix for the described resource's properties in several different places, so storing it in a variable lets us leave the generic code alone. This should be the prefix declared with the namespace URI added to the xsl:stylesheet start-tag, in this case "aws." You won't necessarily want every element in your source document passed along to your RDF version, so add the names of the ones to suppress to the stylesheet's first template rule. Similarly, certain container elements in the source won't add anything to the RDF version, so adding their names to the second template rule tells the stylesheet to pass along their contents without their enclosing tags. (As we'll see, certain containers are very useful, so we'll keep them.)
<!-- Convert XML to RDF that all describes one resource. Template
<!-- URL of the resource being described. -->
<!-- Namespace prefix for predicates. Needs a corresponding xmlns
<!-- Elements to suppress. priority attribute necessary
<!-- Just pass along contents without tags. -->
<!-- ========================================================
<xsl:template match="/">
<!-- Elements with URLs as content: convert them to store
<!-- Container elements: if the element has children and an element parent
<xsl:template match="*[* and ../../* and not(@*)]">
<!-- Copy remaining elements, putting them in a namespace. -->
</xsl:stylesheet>
The generic part of the stylesheet has four template rules: The first template rule in the generic part (the third template rule in the stylesheet) wraps the contents in an rdf:RDF element and identifies the resource being described. The next template rule implements RDF-friendliness Rule Number 4, converting any elements whose contents consist of a URI (or rather, any elements whose contents begin with "http://" or "urn:") into empty elements with the URI stored in an rdf:resource attribute. 
The stylesheet's second-to-last template rule follows the advice given near the end of RDF-friendliness Rule 6 by adding an rdf:parseType attribute with a value of "Resource" to container elements that aren't the root element of the document. This way, these containers won't throw off the striping pattern of nested predicate/object pairs that an RDF processor expects to find in an RDF/XML document. The stylesheet's last template rule copies any elements not covered by the other template rules to the result tree with the namespace prefix from the nsPrefix variable added onto their names. I tested this with both "lite" and "heavy" XML returned by Amazon Web Services for various books, CDs, authors, and bands, and the [URL=http://www.hpl.hp.com/personal/jjc/arp/]ARP2[/URL] RDF parser had no problem with any of the results. (For authors and bands, though, the RDF isn't quite semantically correct, because all of the triples created by the stylesheet have the same subject, so it makes more sense to use this for Amazon pages that describe a single work such as a book or CD.) For example, with the stylesheet stored at [URL=http://www.snee.com/xsl/xml2rdf.xsl]http://www.snee.com/xsl/xml2rdf.xsl[/URL], the following REST URL (with carriage returns deleted and a working developer ID substituted for "dev-ID-here") retrieves kosher RDF metadata (saved version [URL=http://www.snee.com/xsl/quinetapes.rdf]here[/URL]; when viewing with a browser, do a View Source to see the RDF/XML) about the boxed set of Robert Quine's live recordings of the Velvet Underground: http://xml.amazon.com/onca/xml3?locale=us&t=bobducharmeA (For more on mapping XML to RDF using XSLT, see Michael Sperberg-McQueen and Eric Miller's Extreme 2004 paper [URL=http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Sperberg-McQueen01/EML2004Sperberg-McQueen01.html]On mapping from colloquial XML to RDF using XSLT[/URL].) |
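For readers who want to experiment without an XSLT processor, the rules described in the article can be approximated in Python. This is a rough sketch under stated assumptions, not a port of xml2rdf.xsl: the functions `convert` and `xml_to_rdf`, the sample element names, the placeholder ASIN `0000000000`, and the `example.com` URL are all hypothetical, and the container test is simplified to "has child elements". It stores URI values in `rdf:resource`, the attribute RDF/XML defines for a property whose value is a URI reference.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
AWS_NS = "http://xml.amazon.com/schemas3/"  # namespace URI chosen in the article


def convert(elem, parent, ns_uri, suppress, unwrap):
    """Append an RDF/XML rendering of the source element elem to parent."""
    if elem.tag in suppress:
        return  # drop the element and everything inside it
    children = list(elem)
    if elem.tag in unwrap:
        for child in children:  # keep the contents, drop the enclosing tags
            convert(child, parent, ns_uri, suppress, unwrap)
        return
    # Every remaining element is copied into the chosen namespace.
    out = ET.SubElement(parent, f"{{{ns_uri}}}{elem.tag}")
    text = (elem.text or "").strip()
    if children:
        # Container: parseType="Resource" keeps the predicate/object striping.
        out.set(f"{{{RDF_NS}}}parseType", "Resource")
        for child in children:
            convert(child, out, ns_uri, suppress, unwrap)
    elif text.startswith(("http://", "urn:")):
        out.set(f"{{{RDF_NS}}}resource", text)  # URI-valued property, no text
    else:
        out.text = text  # plain literal value


def xml_to_rdf(root, resource_url, ns_uri=AWS_NS, suppress=(), unwrap=()):
    """Wrap the converted children of root in rdf:RDF / rdf:Description."""
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": resource_url}
    )
    for child in root:
        convert(child, desc, ns_uri, suppress, unwrap)
    return rdf


# Hypothetical input; element names and values are for illustration only.
sample = ET.fromstring("""
<ProductInfo>
  <Request>not needed in the RDF</Request>
  <Details>
    <Asin>0000000000</Asin>
    <ProductName>Example Title</ProductName>
    <ImageUrlSmall>http://example.com/small.jpg</ImageUrlSmall>
    <Artists><Artist>Example Artist</Artist></Artists>
  </Details>
</ProductInfo>
""")

# Build the resource URL by appending the ASIN to the base URI from the article.
resource_url = "http://www.amazon.com/exec/obidos/ASIN/" + sample.findtext("Details/Asin")
rdf = xml_to_rdf(sample, resource_url, suppress={"Request"}, unwrap={"Details"})

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("aws", AWS_NS)
print(ET.tostring(rdf, encoding="unicode"))
```

As in the article, all triples share one subject, so this approach only makes sense for source documents that describe a single resource.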
-- Author: admin -- Posted: 4/3/2007 11:50:00 PM -- Here is another paper devoted specifically to XML-to-RDF mapping: http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Sperberg-McQueen01/EML2004Sperberg-McQueen01.html The following is the Introduction of that paper:
C. M. Sperberg-McQueen [World Wide Web Consortium, MIT Computer Science and AI Laboratory]
The mapping problem
Many people (vocabulary designers, schema and DTD authors, application developers, people trying to make it easier to work with documents in markup languages designed by others, and no doubt others, too) wish to say, for specific constructs in a vocabulary, what they mean. By the constructs of a vocabulary we mean primarily the element (type)s, attributes, notations, processing-instruction targets, and entities defined in that vocabulary; in some cases, it is convenient also to include simple or complex datatypes and substitution groups (as in XML Schema 1.0), non-terminals (as in Relax and Relax NG), classes (as in the ODD system used to generate the Text Encoding Initiative DTDs), or other abstractions under this term. Some of those who wish to say what markup constructs mean wish to do so using some machine-processable notation; others would be happy with better tools for human-understandable documentation. We are here concerned mostly with the former, though good rules for machine-processable specification of meaning may also help make meaning clear to humans. Two difficulties attend any effort to say what markup constructs mean. First of all, different people have very different ideas of what would be involved. And second, if such an attempt is not to remain a purely individual mental exercise, the results must be written down or spoken in some language with its own syntactic rules. What may have started as an attempt to focus on semantics to the exclusion of syntax thus concludes by looking like just another translation from one syntax to another. Let us examine these two difficulties in more detail. 
First, different people have very different ideas of what it would mean, for the constructs of a vocabulary, to say what they mean. For purposes of discussion, we identify five. The first four are all mapping problems in one way or another: Some people mean by this that they wish to be able to specify how data structures internal to some application software are serialized as XML, or how XML is de-serialized into data structures; questions like "When does an element become an object of class Foo, and when does it become an object of class Foobar?", asked with reference to some set of object classes defined in some programming language, are central to their concerns. Call this the concrete data-structure mapping problem. An example of this approach is [Krupnikov/Thompson 2001]. Some people believe that the four mapping problems described above do not necessarily have much in common. Others believe that all of them are at root ‘the same thing’. Mostly, they seem to mean by this that if a formalism is provided for what they wish to do, they believe that everyone else's requirements will be met. They do not, in general, seem to mean that if anyone else's requirements are met, they will be able to do what they wish to do. The four mapping problems identified above do have in common that they involve defining a meaning-preserving mapping from XML notation into some other model. Let us call this other model the target model. If the target model has a syntax in which it can be serialized, let us call that syntax the target formalism. We will be concerned only with target models which can be serialized in this way; it may be possible to extend our proposals to some models with non-serial notations, but not to ineffable models (those to which no notation at all is adequate). 
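As a rough illustration of the concrete data-structure mapping problem described above, the Python sketch below de-serializes XML elements into objects and serializes them back. The class names `Foo` and `Foobar` echo the paper's question; everything else (the `item`, `name`, and `extra` element names, and the rule that an `extra` child selects `Foobar`) is an assumption invented for this example and is not taken from the paper.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Foo:
    name: str


@dataclass
class Foobar:
    name: str
    extra: str


def deserialize(elem):
    """Map one <item> element to Foo or Foobar.

    Hypothetical rule: an <extra> child means Foobar, otherwise Foo.
    """
    name = elem.findtext("name", default="")
    extra = elem.findtext("extra")
    if extra is None:
        return Foo(name)
    return Foobar(name, extra)


def serialize(obj):
    """Inverse mapping: write a Foo or Foobar back out as an <item> element."""
    elem = ET.Element("item")
    ET.SubElement(elem, "name").text = obj.name
    if isinstance(obj, Foobar):
        ET.SubElement(elem, "extra").text = obj.extra
    return elem


doc = ET.fromstring(
    "<items>"
    "<item><name>a</name></item>"
    "<item><name>b</name><extra>x</extra></item>"
    "</items>"
)
objects = [deserialize(e) for e in doc.findall("item")]
print(objects)

# A meaning-preserving mapping should survive a round trip.
round_tripped = [deserialize(serialize(o)) for o in objects]
print(round_tripped == objects)
```

Here the in-memory classes play the role of the target model, and the `<item>` markup is the XML side of the mapping; the round-trip check is one informal way to test that the mapping loses nothing the classes care about.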
If the target model has a corresponding target formalism, then all four of the mapping problems can be conceived of as involving the translation of information from one syntax (XML) into some other syntax. A mapping problem may thus be conceived of as a syntax-to-syntax translation even if, in practice, the result desired is not a string of characters denoting some abstraction, but some other representation of the abstraction (such as an in-memory data structure). A fifth idea of what it means to describe the meaning of a vocabulary should also be mentioned: Some wish to communicate enough information about each construct in a vocabulary to other human beings to enable them to recognize and use the elements and attributes correctly. Call this the documentation problem. Any solution to the documentation problem produces, by definition, correct understanding on the part of a hearer or reader. Since such understanding is a prerequisite to the creation of any meaning-preserving transformation or mapping, it may be noted that the solution of the documentation problem appears to be a prerequisite to any solution of any of the mapping problems, except where the mapping problems are solved by the original specifier of a vocabulary without the need for communication with any other humans. The converse is not true: solutions to the mapping problems are neither prerequisites nor necessary consequences of solutions to the documentation problem. Solving the mapping problems, on the other hand, would make it possible to perform more useful work with marked up data without involving the need for quite so many attentive human programmers. This might be advantageous because attentive human programmers are commonly in short supply. The fifth idea of meaning brings us to the second difficulty identified above. Reducing the mapping problem to a translation problem operating at the level of syntax may trouble some, particularly those interested in the documentation problem. 
If we regard the realm of meanings as distinct in some way from that of utterances or syntax, we are bound to be disappointed in a mapping-oriented solution, because the solution seems to shift ground from the ethereal to the mundane. But strictly speaking, any useful formulation of semantics is reducible in this way to a problem operating at the level of syntax. Any attempt to say what markup means necessarily involves constructing some utterance in some perceivable form. That utterance can only be described and interpreted in terms of some syntax: without syntax, all meanings are ineffable. The involvement of the syntactic layer does not (pace some reviewers of this paper) render the mapping problems mentioned above meaningless, nor does it divorce them from meaning. The mapping problems are not solved by arbitrary mappings from XML into the target models, but only by mappings which retain the meaning of the original. (In specialized cases, it may suffice for practical purposes to capture only part of the meaning.) It is easy to dismiss these as merely pushing a bump in the rug from one location to another: having translated from XML into some other notation, are we not still faced with the task of specifying the meaning of that other notation? In cases where the target notation is as opaque to us as the original notation, the criticism has some justice. When the target notation is well understood, however, the translation does precisely what is needed. And we stress again: every successful explanation takes the form of translation from one syntax into another. The documentation problem is also a mapping problem and differs from the others only in substituting the syntax of English or French or some other natural language for the machine-processable target syntaxes of the other views. 
On the positive side, the fact that all specification of semantics is thus reducible to a problem in specifying syntactic transformations means that we can directly exploit without embarrassment the long history of work on mechanisms for syntax-driven transformations of marked up data. Since in the long run every notation must be explained to be useful, it is an inescapable prerequisite for any useful work with marked up data that the documentation problem be solved for some notation or other. Since in the long run one of the main reasons for using markup is to reduce the need for human intervention in routine information processing, however, solving the documentation problem alone will not suffice to allow us to exploit markup to full advantage. Hence this paper's emphasis on machine-processable target notations. |
-- Author: duotiger -- Posted: 4/4/2007 11:03:00 AM -- Wow~~ |
-- Author: peachpig -- Posted: 4/4/2007 8:28:00 PM -- Wow~ |
-- Author: xiaoweixiong -- Posted: 4/12/2007 12:43:00 PM -- Thanks to the poster above. Could you outline the general process? |
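-- Author: magiclmm -- Posted: 4/13/2007 3:03:00 PM -- Wow, so people are actually discussing this. I would like to ask as well: is there a tool that does the conversion directly? |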
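-- Author: micropuss -- Posted: 4/8/2008 12:06:00 PM -- Good thread, bumping it! Many people want a ready-made tool, and so do I, but I found the ready-made ones hard to use, so it is better to build your own. After all, those tools were built around their authors' own ideas!! |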
-- Author: wacn -- Posted: 4/15/2008 9:38:00 AM -- Can OWL be converted to RDF? |
-- Author: dengluxin -- Posted: 4/16/2008 4:33:00 PM -- How does one convert between XML, OWL, and RDF? Are there any usable tools? |
-- Author: micropuss -- Posted: 4/16/2008 10:26:00 PM --
Yes, but they are not easy to use. Take a look if you are interested; several similar projects are also in progress. http://seed.uma.pt/projects/jxml2owl/jxml2owlapi/index.html |
-- Author: XSLFO -- Posted: 4/18/2008 11:02:00 PM -- You could implement it with XSL:FO. What are your styling requirements? |
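-- Author: Humphrey -- Posted: 5/21/2009 8:26:00 PM -- In fact, the RDF files we usually encounter are written in XML syntax, and OWL likewise has its roots in XML syntax. But these three formats differ in how rigidly they describe data, so a conversion that is not implemented for a specific purpose may run into rather annoying problems. Also, what is the main purpose of going to such lengths to achieve this kind of lateral conversion? Is it really necessary? |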