    When Word-to-XML conversions get nasty


    [B][URL=mailto:mikegross@dclab.com]Mike Gross[/URL],  Chief Technology Officer at Data Conversion Laboratory, Inc., reveals the five ways your conversion engine can get broken when converting MS Word documents to XML. (First published in [URL=http://www.cmswatch.com/]CMSWatch magazine[/URL]).[/B]
    OTHER XML RESOURCES ON DCLAB.COM

    [URL=http://www.dclab.com/dclfaq.asp#QuarkXML]Converting from Quark to XML[/URL]
    [URL=http://www.dclab.com/raq1.asp]Converting Adobe PageMaker and InDesign documents to XML[/URL]
    [URL=http://www.dclab.com/dclfaq.asp#diff]XML & SGML - What's the Difference?[/URL]
    [URL=http://www.dclab.com/techlibrary1.asp?GRP=1]DCL Technical Library, XML pages[/URL]

    One of the first hurdles facing any major content or document management implementation is what to do with legacy documents. Chances are that many or most of those documents reside in Microsoft Word format, but enterprises often want to get them into a more open format, like XML. This is particularly the case for STM (Scientific, Technical, Medical) publishing, where you find complicated - but highly structured - information along with tantalizingly attractive re-use opportunities. But it is also true for everyday corporate documents.

    The Conundrum of Structure in Word Processing
    Most of the challenges you face when converting documents from MS Word to XML are typical of moving from any word-processing or desktop-publishing model, where formatting exists to give text a particular look on a piece of paper, to a structure- and content-based model, where you are trying to indicate explicitly what something is, not how it looks.

    Like most publishing tools, Word provides users with elegant ways to produce their documents. Features such as style templates, paragraph formats, and table editors make it easy to give documents a consistent appearance. In large part, therefore, the ease of conversion will be a function of how well document authors have followed such guidelines in producing their documents. Consistently styled Word documents, as well as simple documents (such as memos), can be fairly straightforward to convert.

    Real Word in the Real World
    In the real world, most enterprises do not employ Word's styling capabilities very well, which is actually quite understandable, since Word was never really intended to be a structured editor.  So a conversion must infer structural tagging from the visual clues that exist within the authored document.  

    And here you run into the many different ways people use word-processing tools in the real world.

    Many if not most content authors possess only a minimal working knowledge of Word. Microsoft Office exists on millions of users' machines, but most Word users never learn how to manipulate its features, and so they use Word in primitive ways to accomplish a particular appearance. Not only are these types of documents difficult to convert with an automated (and therefore less expensive) approach, but they are also difficult to clean up in any significant way before conversion.

    Let's not blame authors here, though; in most cases they never knew that their employer would someday want to preserve and repurpose the content, and even where re-use was sought, likely no one ever took the time to train them how to achieve those goals.

    5 ways to break your conversion engine
    In this article, I'll take a look at some particularly "nasty" examples (derived from real-world samples) of the types of Word constructs I encounter on a regular basis that are difficult to convert. I'll use Word screenshots to display how the documents are formatted. The examples are from Word 2000, but you can expect to see the same types of issues in other versions of Word. Note: In the Word screenshots, spaces are represented by a period, tabs by a right arrow, and paragraph hard returns by a paragraph symbol.

    1) Improperly Formatted Paragraphs
    [Image: Word screenshot]

    In this section from a legal document, list item (f) contains a nested paragraph, which is indented using Word's paragraph formatting, so that the text of the paragraph shifts to the right.

    Of course, list item (g) should be formatted in a similar way. Unfortunately, the person typing that segment did not know how to use proper Word paragraph formatting and indentation, and as a result, what logically represents only one paragraph now contains hard returns at the end of each line, followed by tabs to indent the next line.

    Even from a Word perspective, this is bad, because if you decide to add a couple of words to a line, the rest of the paragraph must be "rewrapped" by hand, which is incredibly tedious. When transforming this document to XML, it will be very difficult for conversion software to determine that section (g) really represents just one list item. You might find this mistake quite obvious and even absurd; in fact, I find this type of misuse of Word to be the most common.

    While autoformatting features in more recent versions of Word theoretically make it easier to produce such autonumbered lists, my experience is that this kind of "power" feature in the hands of authors who don't understand how to employ it properly often results in even worse types of bizarre paragraph constructs.
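
A minimal sketch of how a conversion script might repair the item (g) pattern described above, assuming the lines have already been extracted as plain strings. The `LIST_LABEL` pattern, the function name, and the sample text are all hypothetical; real documents will need more cautious rules.

```python
import re

# Hypothetical pattern for list labels such as "(f)" or "(g)" at the start of a line.
LIST_LABEL = re.compile(r"^\([a-z]+\)\s")


def merge_hard_wrapped(lines):
    """Join tab-indented continuation lines onto the list item that precedes them."""
    items = []
    for line in lines:
        # A line that starts with a tab and carries no label of its own is
        # assumed to be a hand-wrapped continuation of the previous item.
        is_continuation = line.startswith("\t") and not LIST_LABEL.match(line.lstrip())
        if is_continuation and items:
            items[-1] = items[-1].rstrip() + " " + line.strip()
        else:
            items.append(line.strip())
    return items


# Made-up sample: item (g) was typed with a hard return and a tab on every line.
sample = [
    "(g)\tThe tenant shall keep the premises",
    "\tin good repair and shall permit the",
    "\tlandlord to inspect them.",
    "(h)\tThe tenant shall not sublet.",
]
merged = merge_hard_wrapped(sample)
```

The heuristic cannot tell a hand-wrapped line from a genuinely separate indented paragraph, which is exactly why this kind of source is expensive to convert automatically.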

    2) Absolute Positioning of Text Boxes
    [Image: Word screenshot]

    Authors sometimes use text boxes, along with absolute positioning, to accomplish a specific appearance on the page. In the example above, absolute positioning has been used to position the table boxes next to the text to the left of them. Unfortunately, it will be quite difficult for a conversion program to determine what goes where, and the text boxes can conceivably come out jumbled and in completely the wrong position on the page.

    This sort of construct can be mimicked using Word's table editor, which would make conversion far easier.  Also, a table construct would allow you to maintain a structural relationship between the first and second (boxed) entries in each row.  
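
As a rough illustration of the recovery work involved, the sketch below groups absolutely positioned boxes into table-like rows by their vertical position. It assumes the box coordinates have already been extracted by some other means; the `Box` class, the `tolerance` value, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    left: float  # horizontal offset of the box, in points
    top: float   # vertical offset of the box, in points


def boxes_to_rows(boxes, tolerance=6.0):
    """Group boxes whose tops lie within `tolerance` points of the row's first box."""
    rows = []
    for box in sorted(boxes, key=lambda b: b.top):
        if rows and abs(box.top - rows[-1][0].top) <= tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Within each row, read the boxes from left to right.
    return [[b.text for b in sorted(row, key=lambda b: b.left)] for row in rows]


# Made-up positions, deliberately listed out of reading order.
boxes = [
    Box("Value 1", left=300, top=102),
    Box("Label 2", left=72, top=160),
    Box("Label 1", left=72, top=100),
    Box("Value 2", left=300, top=158),
]
rows = boxes_to_rows(boxes)
```

A guess like this works only while the boxes line up cleanly; a real Word table carries the row relationship explicitly, so no guessing is needed.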

    3) Simulated Tables with Spaces and Tabs
    [Image: Word screenshot]

    In this scientific data example, a table has been simulated using tabs, spaces, and line drawing characters. Tables done strictly with tabs are hard enough, as a conversion program must work out where the tab(s) have positioned the next chunk of text.

    In this table, some of the column positioning was actually "brute forced" using spaces - notice in the last four rows of the table, there is only a sequence of spaces separating columns 2 and 3.  Conversion to some sort of XML table tagging structure (such as CALS or HTML) will be very difficult, as typical conversion tools will not be able to recognize that the spaces in the middle of a cell are actually serving to divide columns.  

    In general, most of these types of problems can be avoided by using Word's table editing facility properly.  But, even within that, there are potential problems, as our next two examples will demonstrate.
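
To make the difficulty concrete, here is a minimal sketch of a column splitter for simulated tables. It assumes that any whitespace run containing a tab, or any run of two or more spaces, marks a column break; that rule, the function name, and the sample line are assumptions for illustration only.

```python
import re

# Assumed rule: a whitespace run that contains a tab, or two or more
# consecutive spaces, separates columns. A single space stays inside a cell.
COLUMN_BREAK = re.compile(r"\s*\t\s*| {2,}")


def split_simulated_row(line):
    """Split one line of a tab-and-space 'table' into cell strings."""
    return [cell.strip() for cell in COLUMN_BREAK.split(line.strip())]


# Made-up row: a tab after the first column, "brute force" spaces after the second.
cells = split_simulated_row("Sample 1\t4.20     0.15")
```

The rule breaks as soon as an author types two spaces inside a cell or separates columns with a single space, so output from a splitter like this still needs review.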

    4) Misaligned table column separators
    [Image: Word screenshot]

    Within this Word table, the author has - accidentally or purposefully - shifted the column width very slightly from the second row to the third row on the screen (as highlighted by a small purple circle).  

    Word tends to think of tables as stacks of independent rows, rather than as an organic collection of cells.  So this minute column shift is not much of a problem within Word, and visually, most people would probably not even notice the tiny change in width.  

    But converting this table just got very difficult, because XML table structures won't handle the discrepancy very well.  You typically end up with all kinds of bizarre column spanning out of this example, since the conversion utility infers that a cell is straddling multiple columns - which technically it is, however imperceptible to the naked eye.
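
The spanning problem can be sketched as follows, assuming each row is described by the right-edge positions of its cells in points. The function name, the `tolerance` parameter, and the measurements are hypothetical; the point is that snapping nearly equal edges together removes the spurious spans.

```python
def column_spans(rows, tolerance=3.0):
    """Return, for each row, how many canonical columns each cell spans.

    rows: one list per table row, holding the right-edge position of each cell.
    Edges closer than `tolerance` to the previous canonical edge are merged.
    """
    canonical = []
    for edge in sorted(e for row in rows for e in row):
        if not canonical or edge - canonical[-1] > tolerance:
            canonical.append(edge)

    spans = []
    for row in rows:
        previous = -1
        row_spans = []
        for edge in row:
            # Snap each cell edge to the nearest canonical column edge.
            index = min(range(len(canonical)), key=lambda i: abs(canonical[i] - edge))
            row_spans.append(index - previous)
            previous = index
        spans.append(row_spans)
    return spans


# Made-up measurements: the third row's second edge is shifted by 1.5 points.
table = [[100, 200, 300], [100, 200, 300], [100, 201.5, 300]]
snapped = column_spans(table)               # every cell spans one column
literal = column_spans(table, tolerance=0)  # the tiny shift creates a fourth column
```

With `tolerance=0` the shifted edge is treated as its own column, so cells in every row start straddling columns, which mirrors the bizarre spanning described above.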

    5) Improper Table Row Separation
    [Image: Word screenshot]

    In this example, a table was laid out in the table editor, which, as mentioned previously, is the best way to construct tables in Word. But things can still go wrong. Each of the rows in the table body should have been entered as its own row in the Word table, i.e. 1 header row along with 4 body rows. Unfortunately, the author has put the entire body of the table into one table row, with the rows in the body aligned via the insertion of hard return characters.

    This is difficult to convert, because the XML table tagging should contain five rows in a table.  Typical conversion tools will mimic the look of the original table, and simply output the one body row, which may even render properly in some cases.

    But remember, we seek less to mimic appearance and more to glean structure, and the example above is not a correct logical representation of the table. In fact, if the XML table is ultimately displayed on a device where space is limited, it is likely, for instance, that the "Mass Merchandiser" cell will wrap to two lines, throwing off the alignment of the "Total" row completely. This is another example of bad usage of the Word table editing facility, and, because of the forced hard returns, one that is also difficult to correct.
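
A minimal sketch of the repair, assuming each cell of the single body row has been extracted as a string with hard returns preserved as newline characters. The function name and the figures are invented; only the "Mass Merchandiser" and "Total" labels come from the example.

```python
def split_merged_row(cells):
    """Turn one row whose cells hold several hard-return-separated lines into real rows."""
    columns = [cell.split("\n") for cell in cells]
    height = max(len(column) for column in columns)
    # Pad short columns so every logical row has a value for every column.
    padded = [column + [""] * (height - len(column)) for column in columns]
    return [list(row) for row in zip(*padded)]


# Made-up body row: four logical rows typed into a single Word table row.
body_row = [
    "Department Store\nMass Merchandiser\nSpecialty\nTotal",
    "10\n20\n30\n60",
]
logical_rows = split_merged_row(body_row)
```

This only works when every cell contains the same number of hard returns in the right places; a cell whose text was itself wrapped by hand will shift everything below it.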

    Other Issues
    There are other issues that, while rare, can still befuddle your conversion efforts:

    Fonts. Word authors are free to use whatever fonts they have available, or any that they find (locally or on the web). Mapping symbol characters in these fonts (such as a mathematical plus-minus, "±") to consistent ISO character entities or Unicode is often challenging. Authors also sometimes employ specific bold and italic fonts to highlight particular words, rather than using Word's native Bold/Italic character-formatting capability. The specific fonts used by the author need to be analyzed for these types of situations, and accounted for in any conversion routine.
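
As a sketch of the font analysis described above, the snippet below maps a few characters set in the Symbol font to Unicode. The table is a small illustrative subset rather than a complete mapping, the function name is hypothetical, and the handling of the U+F000 private-use offset is an assumption about how such characters may arrive from an extractor.

```python
# Small, illustrative subset of the Symbol font encoding (not a complete table).
SYMBOL_TO_UNICODE = {
    0x61: "\u03b1",  # "a" in Symbol renders as Greek alpha
    0x62: "\u03b2",  # "b" in Symbol renders as Greek beta
    0x70: "\u03c0",  # "p" in Symbol renders as Greek pi
    0xB1: "\u00b1",  # plus-minus sign
}


def normalize_run(text, font):
    """Map a run of text set in the Symbol font to plain Unicode characters."""
    if font.lower() != "symbol":
        return text
    out = []
    for ch in text:
        code = ord(ch)
        # Assumption: symbol characters may be stored in the private-use
        # range U+F000..U+F0FF, offset from their font code.
        if 0xF000 <= code <= 0xF0FF:
            code -= 0xF000
        out.append(SYMBOL_TO_UNICODE.get(code, ch))
    return "".join(out)


converted = normalize_run("a\u00b1b", "Symbol")
```

Each unfamiliar font needs its own table, built by inspecting how the author actually used it, which is the analysis step mentioned above.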


    Linking. Most enterprises seeking to migrate to an XML repository also want to leverage natural (but often implicit) links within and between documents. A typical conversion therefore requires some level of content tagging and hypertext cross-references that usually do not exist as such in a source Word document, so even documents that are formatted reasonably well in Word still require a significant amount of cleanup. Document references such as "See Pages 4 through 6" or "See sections 4, 5, 6, and bottom of 7" all require a fair amount of sophistication to convert, if the goal is to maintain critical relationships. Footnotes implemented as superscripts, and bibliographic entries (typical in scholarly journals) that need to be decomposed into full element markup (such as author, title, date, page), will also require effort to convert, as the required structure does not usually exist in the source document.
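
A minimal sketch of detecting the simplest kind of reference quoted above, assuming plain-text input. The pattern and function name are hypothetical and cover only the "See Pages N through M" wording.

```python
import re

# Matches "See Page 9" or "See Pages 4 through 6"; other phrasings are not handled.
PAGE_REF = re.compile(r"\bSee\s+Pages?\s+(\d+)(?:\s+through\s+(\d+))?", re.IGNORECASE)


def find_page_refs(text):
    """Return one list of page numbers for each page reference found in the text."""
    refs = []
    for match in PAGE_REF.finditer(text):
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        refs.append(list(range(first, last + 1)))
    return refs


found = find_page_refs("See Pages 4 through 6 for details.")
missed = find_page_refs("See sections 4,5,6, and bottom of 7")
```

The second call returns nothing, which illustrates the point: each new phrasing needs its own rule, and a reference like "bottom of 7" has no obvious link target at all.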

    Conclusion
    Because of the numerous issues that we've discussed, the cost of legacy document conversion can vary greatly. Documents that require only simple tagging, or that were authored fairly consistently, are much cheaper to convert than complex documents authored with little attention to consistency or little training in proper Word usage (and I realize that some poorly authored documents are the result of authors being under tremendous time pressure).

    I should note that the new Word 2003 includes native XML support, and if you and your authors can implement and consistently follow the proper styling rules while authoring documents in Word 2003, your team should be able to produce many types of XML documents without too much pain.  While the prospect of being able to use Word 2003 to map and export XML is exciting, it does not provide an easy means of converting existing or legacy Word documents to XML, and none of the nasty examples you saw above are any easier to deal with in Word 2003.

    Before you begin a conversion, look through your source Word documents to see how well they were formatted, but be prepared: you may be horrified by what you find.

    Mike Gross
    Data Conversion Laboratory

    [URL=mailto:mikegross@dclab.com]Send Feedback[/URL]

    Michael Gross is responsible for solution engineering at [URL=http://www.dclab.com/]Data Conversion Laboratory, Inc.[/URL], a leading New York-based data conversion and XML firm. Michael has been solving digital publishing conversion problems at DCL for almost 20 years, where he has overseen thousands of legacy conversion projects, and is the chief architect of DCL's document conversion toolset, including its proprietary hub and spoke technology.


    Posted by admin (W3China) on 2005/2/24 0:06:00