When Word-to-XML conversions get nasty

[B][URL=mailto:mikegross@dclab.com]Mike Gross[/URL], Chief Technology Officer at Data Conversion Laboratory, Inc., reveals the five ways your conversion engine can get broken when converting MS Word documents to XML. (First published in [URL=http://www.cmswatch.com/]CMSWatch magazine[/URL]).[/B]

One of the first hurdles facing any major content or document management implementation is what to do with legacy documents. Chances are, many or most of those documents reside in Microsoft Word format, but enterprises often want to get them into a more open format, like XML. This is particularly the case for STM (Scientific, Technical, Medical) publishing, where you find complicated - but highly structured - information along with tantalizingly attractive re-use opportunities. But it is also true for everyday corporate documents.

The Conundrum of Structure in Word Processing
Most of the challenges faced when converting documents from MS Word to XML are typical of the challenges you'll face when converting documents from any word-processing or desktop-publishing model (where formatting exists to give content a particular look on a piece of paper) to a structure- and content-based model, where you're trying to explicitly indicate what something is, not how it looks.

Like most publishing tools, Word provides users with elegant ways to produce their documents. Features such as style templates, paragraph formats, and table editors make it easy to give documents a consistent appearance. In large part, then, the ease of conversion will be a function of how well document authors have followed such conventions in producing their documents. Consistently styled Word documents, as well as simple documents (such as memos), can be fairly straightforward to convert.

Real Word in the Real World
In the real world, most enterprises do not employ Word's styling capabilities very well, which is actually quite understandable, since Word was never really intended to be a structured editor. So a conversion must infer structural tagging from the visual clues that exist within the authored document.

And here you run into the problem of the many different ways people use word processing tools in the real world.

Many if not most content authors possess only a minimal working knowledge of how to use Word. We live in a world where Microsoft Office exists on millions of users' machines, but most Word users never learn how to manipulate its features, and so they use Word in primitive ways to accomplish a particular appearance. Not only are these types of documents difficult to convert with a more automated (and therefore less expensive) approach, but they are also difficult to modify in any significant way to clean them up in the first place.

Let's not blame authors here, though; in most cases they never knew that their employer would someday want to preserve and repurpose the content, and even where re-use was sought, likely no one ever took the time to train them how to achieve those goals.

5 ways to break your conversion engine
In this article, I'll take a look at some particularly "nasty" examples (derived from real-world samples) of the types of Word constructs I encounter on a regular basis that are difficult to convert. I'll use Word screenshots to display how the documents are formatted. The examples are from Word 2000, but you can expect to see the same types of issues in other versions of Word. Note: In the Word screen dumps, spaces are represented by a period, tabs by a right arrow, and paragraph hard returns by a paragraph symbol.

1) Improperly Formatted Paragraphs

In this section from a legal document, list item (f) contains a nested paragraph, which is indented using Word's paragraph formatting, so that the text of the paragraph shifts to the right.

Of course, list item (g) should be formatted in a similar way. Unfortunately, the person typing that segment did not know how to use proper Word paragraph formatting and indentation, and as a result, what logically represents only one paragraph now contains hard returns at the end of each line, followed by tabs to indent the next line.

Even from a Word perspective, this is bad, because if you decide to add a couple of words to a line, the rest of the paragraph must be "rewrapped" by hand, which is incredibly tedious. When transforming this document to XML, it will be very difficult for conversion software to determine that section (g) really represents just one list item. You might find this mistake quite obvious and even absurd; actually, I find this type of misuse of Word to be the most common.

While autoformatting features in more recent versions of Word theoretically make it easier to produce such autonumbered lists, my experience is that this kind of "power" feature in the hands of authors who don't understand how to employ it properly often results in even worse types of bizarre paragraph constructs.
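
As a rough illustration of what a conversion routine has to do with list item (g), the sketch below re-joins hand-wrapped lines into logical paragraphs. It assumes the Word paragraphs have already been extracted as plain strings, and the function name, the label pattern, and the sample text are all hypothetical, not taken from any particular conversion tool.

```python
import re

# Assumed heuristic: a line opening with a short label such as "(g)" or "1."
# starts a new list item; any other indented line continues the previous one.
LABEL = re.compile(r"^\s*\(?[A-Za-z0-9]{1,3}[.)]\s")

def merge_hard_wrapped(lines):
    """Join manually wrapped lines back into logical paragraphs."""
    paragraphs = []
    for raw in lines:
        text = " ".join(raw.split())  # collapse tabs and repeated spaces
        if not text:
            continue  # skip empty paragraphs
        is_continuation = raw[:1] in ("\t", " ") and not LABEL.match(raw)
        if is_continuation and paragraphs:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
    return paragraphs

sample = [
    "(f) The tenant shall pay rent.",
    "(g)\tThe landlord shall keep",
    "\tthe premises in good",
    "\trepair.",
]
for paragraph in merge_hard_wrapped(sample):
    print(paragraph)
```

A heuristic like this will misfire on documents where genuine new paragraphs are also indented with tabs, which is part of why this kind of source is expensive to convert.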

2) Absolute Positioning of Text Boxes

Authors sometimes use text boxes, along with absolute positioning, to accomplish a specific appearance on the page. In the example above, absolute positioning has been used to place the text boxes next to the text to the left of them. Unfortunately, it will be quite difficult for a conversion program to determine what goes where; the text boxes can conceivably come out jumbled and in completely the wrong position on the page.

This sort of construct can be mimicked using Word's table editor, which would make conversion far easier. Also, a table construct would allow you to maintain a structural relationship between the first and second (boxed) entries in each row.
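
One way to recover the row relationship, sketched here under strong assumptions, is to pair each left-hand paragraph with the text box nearest to it vertically. The sketch assumes a vertical offset is available for every paragraph and box, which is often not the case in practice; the function name and the sample values are invented for illustration.

```python
def rows_from_positions(paragraphs, boxes):
    """Pair each left-hand paragraph with the vertically nearest text box.

    paragraphs and boxes are lists of (top_offset, text) tuples.
    Returns [left_text, box_text] rows in top-to-bottom order.
    """
    rows = []
    unused = sorted(boxes)
    for top, text in sorted(paragraphs):
        if not unused:
            rows.append([text, ""])  # no box left to pair with
            continue
        nearest = min(unused, key=lambda box: abs(box[0] - top))
        unused.remove(nearest)
        rows.append([text, nearest[1]])
    return rows

# Hypothetical offsets and text, not taken from the example document.
paragraphs = [(100, "Term"), (300, "Payment")]
boxes = [(310, "Net 30 days"), (95, "12 months")]
print(rows_from_positions(paragraphs, boxes))
```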

3) Simulated Tables with Spaces and Tabs

In this scientific data example, a table has been simulated using tabs, spaces, and line-drawing characters. Tables done strictly with tabs are hard enough, as a conversion program needs to determine where the tab(s) have positioned the next chunk of text.

In this table, some of the column positioning was actually "brute forced" using spaces - notice in the last four rows of the table, there is only a sequence of spaces separating columns 2 and 3. Conversion to some sort of XML table tagging structure (such as CALS or HTML) will be very difficult, as typical conversion tools will not be able to recognize that the spaces in the middle of a cell are actually serving to divide columns.
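
To make the difficulty concrete, here is a minimal sketch of the kind of guess a converter has to make. It assumes that a tab, a line-drawing character, or a run of two or more spaces marks a column boundary; the function name and the sample rows are hypothetical.

```python
import re

# Assumption: a tab (with any surrounding whitespace) or a run of two or
# more spaces separates columns; a single space stays inside a cell.
COLUMN_GAP = re.compile(r"\s*\t\s*| {2,}")
# Vertical bars and Unicode box-drawing characters used as table rules.
RULES = re.compile("[|\u2500-\u257f]")

def split_simulated_row(line):
    """Split one line of a tab/space-simulated table into cell strings."""
    cleaned = RULES.sub("\t", line).strip()
    return [cell.strip() for cell in COLUMN_GAP.split(cleaned) if cell.strip()]

print(split_simulated_row("Sample A\t12.5   mg/L"))
print(split_simulated_row("| pH | 7.2 |"))
```

The guess breaks down as soon as a row has an empty cell or a cell whose text contains a double space, so the output of a routine like this still needs to be checked against the expected column count.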

In general, most of these types of problems can be avoided by using Word's table editing facility properly. But, even within that, there are potential problems, as our next two examples will demonstrate.

4) Misaligned table column separators

Within this Word table, the author has - accidentally or purposefully - shifted the column width very slightly from the second row to the third row on the screen (as highlighted by a small purple circle).

Word tends to think of tables as stacks of independent rows, rather than as an organic collection of cells. So this minute column shift is not much of a problem within Word, and visually, most people would probably not even notice the tiny change in width.

But converting this table just got very difficult, because XML table structures won't handle the discrepancy very well. You typically end up with all kinds of bizarre column spanning out of this example, since the conversion utility infers that a cell is straddling multiple columns - which technically it is, however imperceptible to the naked eye.
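
One common remedy, sketched below, is to snap column boundaries that differ by less than some tolerance onto a single shared position before computing spans. The sketch assumes each row's column edges are available as integer offsets (for example in twips, 1/1440 of an inch); the function name, the tolerance, and the sample values are assumptions for illustration.

```python
def snap_boundaries(rows, tolerance=30):
    """Map near-identical column edges onto one shared position.

    rows is a list of rows, each a list of column edge offsets.
    Edges within `tolerance` of a cluster's first edge are merged into it.
    """
    edges = sorted({edge for row in rows for edge in row})
    canonical = {}
    anchor = None
    for edge in edges:
        if anchor is None or edge - anchor > tolerance:
            anchor = edge  # start a new cluster of edges
        canonical[edge] = anchor
    return [[canonical[edge] for edge in row] for row in rows]

# Hypothetical table: the second row's middle edge is 15 units off.
table = [[0, 2160, 4320], [0, 2175, 4320]]
print(snap_boundaries(table))
```

With the edges aligned, every row shares the same three boundaries, so no cell appears to straddle an extra column. The tolerance has to be chosen carefully, since too large a value would merge columns that really are distinct.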

5) Improper Table Row Separation

In this example, a table was laid out in the table editor, which, as mentioned previously, is the best way to construct tables in Word, but things can still go wrong. Each of the rows in the table body should have been separated into its own row in the Word table, i.e., one header row along with four body rows. Unfortunately, the author has put the entire body of the table into one table row, with the rows in the body aligned via the insertion of hard return characters.

This is difficult to convert, because the XML table tagging should contain five rows. Typical conversion tools will mimic the look of the original table and simply output the one body row, which may even render properly in some cases.

But remember, we seek less to mimic appearance and more to glean structure, and the example above is not a correct logical representation of the table. In fact, if the XML table is ultimately displayed on a device where space is limited, it is likely, for instance, that the "Mass Merchandiser" cell will wrap to two lines, throwing off the alignment of the "Total" row completely. This is another example of bad usage of the Word table editing facility, and one that is also difficult to correct because of the forced hard returns.
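
A minimal sketch of the repair follows, assuming the body row's cells have been extracted as strings with the hard returns preserved as newline characters. The function name is hypothetical, and apart from "Mass Merchandiser" and "Total" the sample values are invented.

```python
def split_stacked_row(cells):
    """Turn one table row whose cells hold newline-separated values
    into one logical row per line."""
    columns = [cell.split("\n") for cell in cells]
    if len({len(column) for column in columns}) != 1:
        # Unequal line counts mean the visual rows cannot be trusted.
        raise ValueError("cells hold different numbers of lines; review by hand")
    return [list(row) for row in zip(*columns)]

body = [
    "Department Store\nMass Merchandiser\nSpecialty\nTotal",
    "10\n20\n30\n60",
]
for row in split_stacked_row(body):
    print(row)
```

The check on line counts matters: if one cell's text wrapped naturally while another used hard returns, the lines will not correspond and the row needs manual attention.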

Other Issues
There are other issues that, while rare, can still befuddle your conversion efforts:

Fonts. Word authors are free to use whatever fonts they have available, or any that they find (locally or on the web). Mapping symbol characters in these fonts (such as a mathematical plus-minus, "±") to consistent ISO character entities or Unicode is often challenging. Authors also sometimes employ specific bold and italic fonts to highlight specific words, rather than using Word's native Bold/Italic character-formatting capability. The specific fonts used by the author need to be analyzed for these types of situations, and accounted for in any conversion routine.
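
As an illustration of font-dependent character mapping, the sketch below converts text set in the Symbol font to Unicode. It assumes the document has been split into (font name, text) runs, covers only a handful of Symbol code points, and treats characters in the U+F000 to U+F0FF private use range as offset Symbol codes, which is an assumption about how the source file stores them. All names are hypothetical.

```python
# Partial, illustrative mapping from Symbol font code points to Unicode.
SYMBOL_TO_UNICODE = {
    0x61: "\u03b1",  # a -> alpha
    0x62: "\u03b2",  # b -> beta
    0x6D: "\u03bc",  # m -> mu
    0x70: "\u03c0",  # p -> pi
    0xB1: "\u00b1",  # plus-minus
}

def symbol_char_to_unicode(ch):
    """Map one Symbol-font character to Unicode, or return it unchanged."""
    code = ord(ch)
    if 0xF000 <= code <= 0xF0FF:  # assumed private-use storage of symbol glyphs
        code -= 0xF000
    return SYMBOL_TO_UNICODE.get(code, ch)

def normalize_runs(runs):
    """runs is a list of (font_name, text) pairs; returns one Unicode string."""
    parts = []
    for font, text in runs:
        if font.lower() == "symbol":
            text = "".join(symbol_char_to_unicode(ch) for ch in text)
        parts.append(text)
    return "".join(parts)

runs = [
    ("Times New Roman", "5 "),
    ("Symbol", "\u00b1"),
    ("Times New Roman", " 0.2 "),
    ("Symbol", "m"),
    ("Times New Roman", "m"),
]
print(normalize_runs(runs))
```

Every symbol font an author has used needs its own table of this kind, and characters missing from the table should be logged for review instead of passed through silently.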

Linking. Most enterprises seeking to migrate to an XML repository also want to leverage natural (but often implicit) links within and between documents. A typical conversion therefore requires some level of content tagging and hypertext cross-referencing that is usually absent from the source Word document, so even documents that are formatted reasonably well in Word still require a significant amount of cleanup. Document references such as "See Pages 4 through 6" or "See sections 4,5,6, and bottom of 7" all require a fair amount of sophistication to convert, if the goal is to maintain critical relationships. Footnotes implemented as superscripts, and bibliographic entries (typical in scholarly journals) that need to be decomposed into full element markup (such as author, title, date, page), will also require effort to convert, as the required structure does not usually exist in the source document.
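
A first pass at finding such references might look like the sketch below. The pattern and function name are hypothetical, and the routine only recognizes simple lists and ranges of page or section numbers.

```python
import re

# Assumed pattern: "See page(s)/section(s)" followed by numbers joined
# by commas, "through", or "to".
REFERENCE = re.compile(
    r"\bsee\s+(?P<kind>page|section)s?\s+"
    r"(?P<targets>\d+(?:\s*(?:,|through|to)\s*\d+)*)",
    re.IGNORECASE,
)

def find_references(text):
    """Return (kind, [numbers]) for each page or section reference found."""
    found = []
    for match in REFERENCE.finditer(text):
        targets = match.group("targets")
        numbers = [int(n) for n in re.findall(r"\d+", targets)]
        if re.search(r"through|to", targets, re.IGNORECASE):
            # Treat the whole expression as one inclusive range.
            numbers = list(range(numbers[0], numbers[-1] + 1))
        found.append((match.group("kind").lower(), numbers))
    return found

print(find_references("See Pages 4 through 6 for details."))
print(find_references("See sections 4,5,6, and bottom of 7"))
```

Note that the second example yields only sections 4, 5, and 6; the "bottom of 7" part is missed, which is exactly the kind of reference that still needs human review or more sophisticated handling.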

Conclusion
Because of the numerous issues that we've discussed, the cost of legacy document conversion can vary greatly. Those types of documents that require only simple tagging or were authored fairly consistently are much cheaper to convert than complex documents that were authored without much effort placed on document consistency or training in proper Word usage (and I realize that some poorly authored documents are the results of authors being under tremendous time pressure).

I should note that the new Word 2003 includes native XML support, and if you and your authors can implement and consistently follow the proper styling rules while authoring documents in Word 2003, your team should be able to produce many types of XML documents without too much pain. While the prospect of being able to use Word 2003 to map and export XML is exciting, it does not provide an easy means of converting existing or legacy Word documents to XML, and none of the nasty examples you saw above are any easier to deal with in Word 2003.

Before you begin a conversion, look through your source Word documents to see how well they were formatted. But be prepared: you may be horrified by what you find.

Mike Gross
Data Conversion Laboratory

Michael Gross is responsible for solution engineering at [URL=http://www.dclab.com/]Data Conversion Laboratory, Inc.[/URL], a leading New York-based data conversion and XML firm. Michael has been solving digital publishing conversion problems at DCL for almost 20 years, where he has overseen thousands of legacy conversion projects, and is the chief architect of DCL's document conversion toolset, including its proprietary hub and spoke technology.

