Provides the tools for converting StarWriter XML to and from the AportisDoc format.
It follows the {@link org.openoffice.xmerge} framework for the conversion process.
Since they convert to and from a Palm application format, these converters follow the PalmDB stream format for writing out to, or reading in from, the Palm sync client.
Note that PluginFactoryImpl also provides a DocumentMerger object, i.e. {@link org.openoffice.xmerge.converter.xml.sxw.aportisdoc.DocumentMergerImpl DocumentMergerImpl}.
This functionality is derived from its superclass {@link org.openoffice.xmerge.converter.xml.sxw.SxwPluginFactory SxwPluginFactory}.
The AportisDoc pdb format is widely used by different Palm applications, e.g. QuickWord, AportisDoc Reader, MiniWrite, etc. Note that some of these applications put tweaks into the format. The converters will only support the default AportisDoc format, plus some very minor tweaks to accommodate other applications.
The text content of the format is plain text, i.e. there are no styles or structures. There is no notion of lists, list items, paragraphs, headings, etc. The format does have support for bookmarks.
For most Doc applications, the default supported character encoding is the extended ASCII character set, i.e. ISO-8859-1. StarWriter XML uses the UTF-8 encoding scheme. Since UTF-8 covers far more characters, converting UTF-8 strings into extended ASCII can lose characters that have no mapping.
Using JAXP, XML files can be parsed and read in as Java Strings, which are in Unicode format, so there is no loss of character mapping from UTF-8 to Java Strings. There is possible loss of character mapping in converting Java Strings to extended ASCII bytes. Java characters that cannot be represented in extended ASCII are converted into the ASCII character '?' (0x3F) by the String.getBytes(encoding) API.
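The character loss described above can be seen in a minimal sketch. The class and method names here are hypothetical and are not part of the converter; the sketch also uses the Charset overload of String.getBytes rather than the encoding-name overload mentioned above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the xmerge code base.
final class Latin1LossDemo {

    /** Encodes text as ISO-8859-1; characters with no mapping are replaced by '?' (0x3F). */
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) exists in ISO-8859-1; U+20AC (euro sign) does not.
        byte[] bytes = toLatin1("caf\u00e9 \u20ac5");
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // prints "63 61 66 E9 20 3F 35"
    }
}
```

The accented character survives as the single byte E9, while the euro sign is silently replaced by 3F, so the original character cannot be recovered on the way back.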
The DocumentSerializerImpl class implements org.openoffice.xmerge.DocumentSerializer.
This class specifically provides the conversion process from a given SxwDocument object to DOC formatted records, which are then passed back to the client via the ConvertData object.
The following XML tags are handled. [Note that some may not be implemented yet.]
Paragraphs <text:p> and Headings <text:h>
Heading elements are classified the same as paragraph elements since both may contain the same elements. Their main difference is that they refer to different types of style information, which is stored outside of their element tags. Since there are no styles in the DOC format, headings are converted the same way as paragraphs.
For paragraph elements, the converter transfers the essential text nodes, i.e. those directly contained within paragraph nodes. A paragraph element may also contain a number of other elements, which are explained in their own context.
At the end of the paragraph, an EOL character is added by the converter to provide a separation for each paragraph, since the Doc format does not have a notion of a paragraph.
White spaces <text:s> and Tabs <text:tab-stop>
In SXW, normally 2 or more white-space characters are collapsed into a single space character. In order to make sure that the document content really contains those white-space characters, there are special elements assigned to them.
The space element specifies the number of spaces it represents. Thus, converting it just means emitting the specific number of spaces that the element requires.
There is also the tab-stop element. This is a bit tricky. In a StarWriter document, tab-stops are specified by a column position. A tab is not an exact number of spaces, but rather a specific column position. Say regular tab-stops are set at every 5th column. At column 4, if I hit a tab, the cursor goes to column 5. At column 1, hitting a tab would put the cursor at column 5 as well. SmartDoc and AportisDoc applications go by columns for the ASCII tab character. The only problem is that in StarWriter one can specify a different tab-stop, but not in most of these Doc applications; at least I have not seen one that allows it. The solution is to convert to the ASCII tab character and not do anything for different tab-stop positioning.
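The column arithmetic in the tab-stop example can be written out as a small sketch. The names are hypothetical, columns are assumed to be 1-based, and the behaviour when the cursor already sits on a tab-stop (advance to the next one) is an assumption that the text above does not state.

```java
// Hypothetical helper, only illustrating the column arithmetic of the example above.
final class TabStopDemo {

    /**
     * Returns the column reached by a tab typed at the given 1-based column
     * when tab-stops are set at every interval-th column (5, 10, 15, ...).
     * Assumption: a tab typed on a tab-stop advances to the next tab-stop.
     */
    static int nextTabStop(int column, int interval) {
        return (column / interval + 1) * interval;
    }

    public static void main(String[] args) {
        System.out.println(nextTabStop(4, 5)); // prints 5
        System.out.println(nextTabStop(1, 5)); // prints 5
        System.out.println(nextTabStop(5, 5)); // prints 10
    }
}
```

The converter itself does none of this arithmetic: it emits a single ASCII tab character and leaves the column positioning to the Doc application.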
Line breaks <text:line-break>
To represent line breaks, it is simplest to just emit an ASCII LF character. Note that the side effect of this is that the end of a paragraph is also marked by an ASCII LF character. Thus, for the DOC to SXW conversion, line breaks are not distinguishable from the end of a paragraph.
Text spans <text:span>
Text spans contain text that has different style attributes from the enclosing paragraph's. Text spans can be embedded within another text span. Since a span exists purely for style tagging, we only need to convert and transfer the text elements within it.
Hyperlinks <text:a>
Convert and transfer the text portion.
Bookmarks <text:bookmark> <text:bookmark-start> <text:bookmark-end> [Not implemented yet]
In SXW, bookmark elements are embedded inside paragraph elements. Bookmarks can either mark a text position or a text range. <text:bookmark> marks a position, while the pair <text:bookmark-start> and <text:bookmark-end> marks a text range. The DOC format only supports bookmarking a text position. Thus, for the conversion, <text:bookmark> and <text:bookmark-start> will both mark a text position.
Change Tracking <text:tracked-changes> <text:change*> [Not implemented yet]
Change tracking elements are not yet supported by the current OpenOffice XML filters, so this will need to be watched. The text within these elements has to be interpreted properly during the conversion process.
Lists <text:unordered-list> and <text:ordered-list>
A list can only contain one optional <text:list-header> and one or more <text:list-item> elements.
A <text:list-header> contains one or more paragraph elements. Since there are no styles, the conversion process does not do anything special for list headers. The paragraphs within list headers are converted as explained above.
A <text:list-item> may contain one or more paragraphs, headings, lists, etc. Since the Doc format does not support any list structure, there will not be any special handling for this element. Conversion for the elements within it shall be applied according to the element type. Thus, a list with paragraphs within it will result in just plain paragraphs. Sublists will not be identifiable, but the paragraphs in sublists will still appear.
<text:section>
I am not sure what this is yet; it needs more investigation.
There may be other tags that will still need to be addressed for this conversion.
Refer to {@link org.openoffice.xmerge.converter.xml.sxw.aportisdoc.DocumentSerializerImpl DocumentSerializerImpl}
for details of the implementation. It uses the DocEncoder class to do the encoding part.
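The element handling described for the serializer can be approximated with a short DOM walk. This is only a sketch under stated assumptions and is not the DocumentSerializerImpl code: the class name is hypothetical, the parser is left non-namespace-aware so elements are matched by their prefixed names, and the repeat count of a space element is assumed to be carried in a text:c attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch; not the actual DocumentSerializerImpl.
final class FlattenSketch {

    /** Flattens SXW-like body XML into plain Doc text. */
    static String flatten(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            walk(root, out, false);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    private static void walk(Node node, StringBuilder out, boolean inParagraph) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                // Only text inside a paragraph or heading is content.
                if (inParagraph) out.append(child.getNodeValue());
                continue;
            }
            if (child.getNodeType() != Node.ELEMENT_NODE) continue;
            switch (child.getNodeName()) {
                case "text:p":
                case "text:h":
                    // Headings are treated like paragraphs; each ends with an LF.
                    walk(child, out, true);
                    out.append('\n');
                    break;
                case "text:s":
                    // Assumption: text:c holds the number of spaces (default 1).
                    String count = ((Element) child).getAttribute("text:c");
                    int n = count.isEmpty() ? 1 : Integer.parseInt(count);
                    for (int k = 0; k < n; k++) out.append(' ');
                    break;
                case "text:tab-stop":
                    out.append('\t');
                    break;
                case "text:line-break":
                    out.append('\n');
                    break;
                default:
                    // Spans, hyperlinks, lists, list items: keep only the content.
                    walk(child, out, inParagraph);
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<office:body xmlns:office=\"urn:example:office\" xmlns:text=\"urn:example:text\">"
                + "<text:h>Title</text:h>"
                + "<text:p>x <text:span>y</text:span><text:line-break/>z</text:p>"
                + "</office:body>";
        System.out.print(flatten(xml));
    }
}
```

Because a line break and a paragraph end both become an LF, the output of this sketch shows the same ambiguity noted above: the two cannot be told apart when converting back.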
The DocumentDeserializerImpl class implements org.openoffice.xmerge.DocumentDeserializer. It is passed the device document in the form of a ConvertData object.
It will then create a SxwDocument object from the conversion of the DOC formatted records.
The text content of the Doc format will be transferred as text. Paragraph elements will be formed based on the existence of an ASCII LF character. There will be at least one paragraph element.
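The paragraph rule above can be sketched as a split on the LF character. The class and method names are hypothetical, and the treatment of a trailing LF (it closes the last paragraph rather than opening an empty one) is an assumption of this sketch, not something the text specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the actual DocumentDeserializerImpl.
final class ParagraphSplitSketch {

    /** Splits Doc text into paragraphs at each ASCII LF; always returns at least one paragraph. */
    static List<String> toParagraphs(String docText) {
        List<String> paragraphs = new ArrayList<>();
        int start = 0;
        int lf;
        while ((lf = docText.indexOf('\n', start)) >= 0) {
            paragraphs.add(docText.substring(start, lf));
            start = lf + 1;
        }
        // Trailing text without a final LF is still a paragraph,
        // and empty input yields a single empty paragraph.
        if (start < docText.length() || paragraphs.isEmpty()) {
            paragraphs.add(docText.substring(start));
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        System.out.println(toParagraphs("first\nsecond\n")); // prints [first, second]
    }
}
```

Each returned string would then become the text of one paragraph element in the resulting document.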
Bookmarks in the Doc format will be converted to the bookmark element <text:bookmark> [Not implemented yet].
As mentioned above, the DocumentMerger object produced by PluginFactoryImpl is DocumentMergerImpl.
Refer to the javadocs for that package/class for its merging specifications.