You want to find the differences between two XML files and output a "colored diff" file of the differences.
Background on XML Differences
Unlike plain text files, XML files have structural properties that must be considered when comparing two of them.
For example, when comparing the attributes of an element, the order in which the attributes appear in the file is not significant. The following two lines are equivalent even though the order of the attributes is different:
<myelement attr1="abc" attr2="def"/>
<myelement attr2="def" attr1="abc"/>
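The point about attribute order can be checked directly in XQuery. The following is a minimal sketch that uses the standard `deep-equal()` function, which compares attributes as an unordered set, and contrasts it with a plain string comparison of the same two lines. The variable names and the `comparison` wrapper element are only illustrative.

```xquery
xquery version "1.0";

let $a := <myelement attr1="abc" attr2="def"/>
let $b := <myelement attr2="def" attr1="abc"/>
(: the same two elements, but held as plain strings :)
let $a-text := '<myelement attr1="abc" attr2="def"/>'
let $b-text := '<myelement attr2="def" attr1="abc"/>'
return
  <comparison>
    <as-xml>{ deep-equal($a, $b) }</as-xml>
    <as-text>{ $a-text eq $b-text }</as-text>
  </comparison>
```

The `as-xml` element should contain `true` and the `as-text` element `false`: the two lines are equal as XML but differ as text.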
XML differences also tend to ignore the spaces and tabs used to indent an XML file to make it more readable.
So the traditional Longest Common Subsequence (LCS) algorithms used by tools such as UNIX diff, GNU diff, or Subversion diff will not usually give us the results that we desire.
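Indentation can be handled by removing whitespace-only text nodes before comparing. The sketch below assumes a hypothetical helper, `local:strip-space`, that copies an element without those nodes. The `boundary-space preserve` declaration is there only so that the indentation inside the inline sample survives parsing. Note that this approach is not safe for mixed content, where whitespace between inline elements can be meaningful.

```xquery
xquery version "1.0";

declare boundary-space preserve;

(: Hypothetical helper: copy a node, dropping whitespace-only text nodes. :)
declare function local:strip-space($node as node()) as node()? {
  typeswitch ($node)
    case element() return
      element { node-name($node) } {
        $node/@*,
        for $child in $node/node() return local:strip-space($child)
      }
    case text() return
      if (normalize-space($node) eq '') then () else $node
    default return $node
};

let $compact  := <a><b>x</b></a>
let $indented :=
  <a>
    <b>x</b>
  </a>
return (
  (: false: the indented version has extra whitespace text nodes :)
  deep-equal($compact, $indented),
  (: true: after stripping, both trees have the same structure :)
  deep-equal(local:strip-space($compact), local:strip-space($indented))
)
```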
XML Differencing Algorithms
There are many different algorithms for comparing tree-structured data. Because hierarchical data can be complex, each algorithm makes different trade-offs between precision and performance. There are also many options to consider. For example:
- Do you want to ignore XML comments?
- Do you want to ignore processing instructions (PIs)?
- Do you want to ignore case (uppercase/lowercase) differences?
- Do you want to ignore whitespace between elements?
- Can you assume that the structure of the XML documents being compared is identical and only the text is different?
- Do you care whether the order of attributes changes?
- Should the algorithm output its list of changes relative to the first file or the second file?
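Several of these options can be passed to a comparison as a small configuration element. The sketch below is one possible shape, not a fixed design: the `options` element, its child names, and the `local:significant-children` function are all assumptions made for illustration. The function returns only the child nodes that should take part in a comparison.

```xquery
xquery version "1.0";

(: Hypothetical filter: keep only the children that the options treat as significant. :)
declare function local:significant-children(
  $parent as element(),
  $options as element(options)
) as node()* {
  $parent/node()[
    not(self::comment() and $options/ignore-comments = 'true') and
    not(self::processing-instruction() and $options/ignore-pis = 'true') and
    not(self::text() and $options/ignore-whitespace = 'true'
        and normalize-space(.) eq '')
  ]
};

let $strict  := <options/>
let $relaxed :=
  <options>
    <ignore-comments>true</ignore-comments>
    <ignore-pis>true</ignore-pis>
    <ignore-whitespace>true</ignore-whitespace>
  </options>
let $sample := <a><!-- note --><?target data?><b>x</b></a>
return (
  count(local:significant-children($sample, $relaxed)),
  count(local:significant-children($sample, $strict))
)
```

With the relaxed options only the `b` element is counted (1); with the empty options element the comment, the processing instruction, and the element are all counted (3).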
For our first version we will just do a simple scan of the elements and text within the elements.
We will create a recursive XQuery function that compares all the nodes of the two XML files.
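A first version of such a function might look like the sketch below. It is only an illustration under simplifying assumptions: child elements are matched by position, attributes are not compared, and text is compared only on leaf elements, so mixed content is not handled. The function name `local:compare` and the `old` and `new` wrapper elements are hypothetical.

```xquery
xquery version "1.0";

(: Hypothetical sketch: compare two elements position by position. :)
declare function local:compare($a as element()?, $b as element()?) as element()* {
  if (empty($a)) then
    <addition>{ $b }</addition>
  else if (empty($b)) then
    <deletion>{ $a }</deletion>
  else if (node-name($a) ne node-name($b)) then
    <change><old>{ $a }</old><new>{ $b }</new></change>
  else if (empty($a/*) and empty($b/*)) then
    (: leaf elements: compare the text content :)
    (if (string($a) eq string($b)) then ()
     else <change><old>{ $a }</old><new>{ $b }</new></change>)
  else
    (: same name, at least one side has children: recurse pairwise :)
    for $i in 1 to max((count($a/*), count($b/*)))
    return local:compare($a/*[$i], $b/*[$i])
};

let $old := <person><name>Ann</name><city>Oslo</city></person>
let $new := <person><name>Ann</name><city>Bergen</city><phone>555</phone></person>
return <diffs>{ local:compare($old, $new) }</diffs>
```

For this sample the result holds one `change` for the `city` element and one `addition` for the `phone` element. Positional matching means that a single inserted element shifts every later sibling, which is one reason the more elaborate algorithms exist.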
XML Difference Output Format
We want to create an XML output format that allows the user to easily display the output using a side-by-side file comparison method.
For example, the output might look like:
<xml-diffs>
   <parameters>
      <output-format-code>xml</output-format-code>
      <show-original-indicator>false</show-original-indicator>
   </parameters>
   <diff>
      <change>...</change>
   </diff>
   <diff>
      <addition>...</addition>
   </diff>
   <diff>
      <deletion>...</deletion>
   </diff>
</xml-diffs>
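One way to produce this format is to wrap a sequence of difference items in the outer structure. The sketch below assumes a hypothetical function, `local:xml-diffs`, whose parameter names are invented here. The element names are taken from the sample output above, and the item contents are left elided as in that sample.

```xquery
xquery version "1.0";

(: Hypothetical wrapper: put each difference item in its own diff element. :)
declare function local:xml-diffs(
  $items as element()*,
  $format as xs:string,
  $show-original as xs:boolean
) as element(xml-diffs) {
  <xml-diffs>
    <parameters>
      <output-format-code>{ $format }</output-format-code>
      <show-original-indicator>{ $show-original }</show-original-indicator>
    </parameters>
    { for $item in $items return <diff>{ $item }</diff> }
  </xml-diffs>
};

local:xml-diffs(
  (<change>...</change>, <addition>...</addition>, <deletion>...</deletion>),
  'xml',
  false()
)
```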
Formatting the output for HTML and CSS
The output above can be considered raw semantic markup that does not prescribe how a web site should display the differences using standard HTML div blocks and CSS. As a second step we can place the output in two HTML div blocks, one per file, shown side by side.
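A side-by-side display could be sketched as follows. Everything here is an assumption made for illustration: the function names, the CSS class names and colors, and the choice to show the text of a `change` item in both columns, since the raw format does not say how old and new values are stored. The left column leaves a blank row for each addition and the right column leaves a blank row for each deletion, so that the rows stay aligned.

```xquery
xquery version "1.0";

(: Hypothetical renderer: one column per file; $hidden names the item type to blank out. :)
declare function local:column(
  $diffs as element(xml-diffs),
  $hidden as xs:string
) as element(div) {
  <div class="column">{
    for $item in $diffs/diff/*
    return
      if (local-name($item) eq $hidden)
      then <div class="blank">&#160;</div>
      else <div class="{ local-name($item) }">{ string($item) }</div>
  }</div>
};

declare function local:to-html($diffs as element(xml-diffs)) as element(html) {
  <html>
    <head>
      <style type="text/css">
        .column   {{ float: left; width: 48%; margin: 0 1%; font-family: monospace; }}
        .change   {{ background-color: #ffffaa; }}
        .addition {{ background-color: #aaffaa; }}
        .deletion {{ background-color: #ffaaaa; }}
      </style>
    </head>
    <body>
      { local:column($diffs, 'addition') }
      { local:column($diffs, 'deletion') }
    </body>
  </html>
};

local:to-html(
  <xml-diffs>
    <diff><change>city</change></diff>
    <diff><addition>phone</addition></diff>
    <diff><deletion>fax</deletion></diff>
  </xml-diffs>
)
```

The curly braces in the CSS rules are doubled because a single brace inside an XQuery element constructor starts an enclosed expression.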
The O(ND) difference algorithm was originally designed to compare text files, using lines as the fundamental unit of comparison. We will need to modify it to compare XML elements and attributes recursively. The XML comparison should also not report differences in the order of attributes.
To be continued...
References
- S. Chawathe, A. Rajaraman, H. Garcia-Molina and J. Widom (June 1996). "Change Detection in Hierarchically Structured Information". Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal.
- Eugene Myers (1986). "An O(ND) Difference Algorithm and its Variations". Algorithmica, Vol. 1, No. 2, p. 251.
- Yuan Wang, David J. DeWitt and Jin-Yi Cai, University of Wisconsin – Madison. "X-Diff: An Effective Change Detection Algorithm for XML Documents". http://www.cs.wisc.edu/niagara/papers/xdiff.pdf