Exciting things with XML

So XML exists, and you probably have to deal with it occasionally.

It’s a bit more complex than a CSV or a fixed-width file, so an enormous number of people have created an even greater number of toolkits to help you read and write the things.

Some generic chunk of XML

What the industry standard appears to be is to create a toolkit that implements half of the XML standard, and then drop that and start another which has a different, but overlapping subset of supported features.

If you’re a member of the W3C, you probably want to have several different XML implementations, with a whole raft of SAXTransformerFactorys and DOMImplementationRegistrys just to work out which implementation to use, none of which really work, but that’s the sort of thing that keeps committees busy.

Why not use XmlUtil ? I know I certainly do. Especially in that example from a few weeks ago.

It has the following methods, which I will explain briefly in the bit below this bit.

  • getCleanXml
  • getText
  • getTextPreserveElements
  • getTextNonRecursive
  • getXmlString
  • toDocument
  • compact
  • processContentHandler
  • static inner class SimpleTableContentHandler
  • static inner class AbstractStackContentHandler
  • static inner class NodeListIterator

getCleanXml

This method runs a document through tagsoup to get an ‘XML-clean’ document.

A fairly common use-case for me is having a fairly horrible looking bit of HTML taken from the wastelands of the internet, and wanting to extract some data from it.

The tagsoup library does a fairly decent job of cleaning up things that look like HTML so that it starts to approach something that more closely resembles XML.

It’s a useful first step before performing XPath-style data extraction, or before attempting to modify a webpage to pass the W3C validator. Which is the sort of thing that SEO companies think gives you google brownie points.

Incidentally, if you search for “SEO company” on google, you get about 93,900,000 results. What I want to know is why aren’t they all on the first page of results ?

I have my own views on search-engine optimisation, which are that:

  • google knows what you’re searching for
  • google knows when you stop searching for it
  • google probably knows therefore whether a website is a good result for a particular search query or not

which, you’ll notice, doesn’t involve the phrase ‘meta tags’ anywhere in it.

You can send my $20,000 fee to the usual account.

On a slightly more conspiratorial note,

  • google is aware of your IP address
  • google is therefore more likely to show you your own ‘ads’ that you’ve purchased on its network
  • which is therefore more likely to get you to throw more money at google seeing as it’s doing such a good job getting your extra-specially-important site at the top of those search results

That was a bit of a longer side-rant than I thought it would be. I might move this bit into a separate blog entry. Maybe. Later.

So anyway, this is how you’d use this getCleanXml thing:

// uses the Scanner trick from http://weblogs.java.net/blog/pat/archive/2004/10/stupid_scanner.html
String rubbishHtml = new Scanner(new URL("http://www.microsoft.com").openStream(), 
  "UTF-8").useDelimiter("\\A").next();
String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
Document d = XmlUtil.toDocument(cleanXml);
Element titleEl = (Element) XPathAPI.selectSingleNode(d, "/html/head/title");

getText

getText() returns the text between an opening and closing element in an XML document.

So extending the example above:

// uses the Scanner trick from http://weblogs.java.net/blog/pat/archive/2004/10/stupid_scanner.html
String rubbishHtml = new Scanner(new URL("http://www.microsoft.com").openStream(), 
  "UTF-8").useDelimiter("\\A").next();
String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
Document d = XmlUtil.toDocument(cleanXml);
Element titleEl = (Element) XPathAPI.selectSingleNode(d, "/html/head/title");
String title = XmlUtil.getText(titleEl);
System.out.println(title);

currently generates

Microsoft Australia | Devices and Services

with nary a regex in sight.
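If you don't have Xalan's XPathAPI on the classpath, the same sort of lookup can be sketched with the javax.xml.xpath classes that ship with the JDK. This is only a rough equivalent, not what XmlUtil does: the class name TitleExtractSketch and the inline page are made up for illustration, and it assumes the input is already well-formed (i.e. it has already been through something like getCleanXml).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of XmlUtil
class TitleExtractSketch {

    /** Returns the text of /html/head/title from an already well-formed document. */
    static String extractTitle(String wellFormedXml) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(expression, item) returns the string value of the first match
            return xpath.evaluate("/html/head/title", d);
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract title", e);
        }
    }

    public static void main(String[] args) {
        // made-up page standing in for the cleaned-up HTML
        String page = "<html><head><title>Example page</title></head>"
            + "<body><p>hi</p></body></html>";
        System.out.println(extractTitle(page));
    }
}
```

The string-returning evaluate() saves the cast to Element and the separate getText() call, at the cost of not giving you the node itself.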

getTextPreserveElements

Same as getText(), but allows a predefined list of tags to be returned as well.

This is useful when processing formatted text in which you might want to keep certain tags (B, I, IMG etc)

String input = "<p>Here is some formatted text <b>in bold</b>, <i>in italics</i>," +
  " <blink>blinking</blink>, and with an image at the end <img src=\"frog.png\"/></p>";
Document d = XmlUtil.toDocument(input);
Element paraEl = d.getDocumentElement();
System.out.println(XmlUtil.getText(paraEl));
System.out.println(XmlUtil.getTextPreserveElements(paraEl, new String[] { "b", "i", "img" }));

which would give you something like

Here is some formatted text in bold, in italics, blinking, and with an image at the end 
Here is some formatted text <b>in bold</b>, <i>in italics</i>, blinking, and with an image at the end <img src="frog.png"></img>

getTextNonRecursive

Same as getText(), but doesn’t attempt to recurse into child elements within the parent XML element

String input = "<p>This is the bit we're interested in <span>but not this bit</span></p>";
Document d = XmlUtil.toDocument(input);
Element paraEl = d.getDocumentElement();
System.out.println(XmlUtil.getText(paraEl));
System.out.println(XmlUtil.getTextNonRecursive(paraEl));

which generates:

This is the bit we're interested in but not this bit
This is the bit we're interested in
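For comparison, here is a sketch of the same recursive / non-recursive distinction using only the JDK's DOM classes. Node.getTextContent() behaves much like getText() for simple cases such as this one; the directText() method and the TextExtractionSketch class are invented names for this example and are not part of XmlUtil.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of XmlUtil
class TextExtractionSketch {

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Text of the element's direct Text/CDATA children only; child elements are skipped. */
    static String directText(Element element) {
        StringBuilder buf = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    || child.getNodeType() == Node.CDATA_SECTION_NODE) {
                buf.append(child.getNodeValue());
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        Element p = parseRoot(
            "<p>This is the bit we're interested in <span>but not this bit</span></p>");
        System.out.println(p.getTextContent()); // recurses into the span
        System.out.println(directText(p));      // direct children only
    }
}
```

Note that the non-recursive result keeps the trailing space that sat before the span.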

getXmlString

Converts a Document into a String. As in the following example, which attempts to extract the ‘keywords’ and ‘description’ meta tag elements from google’s home page, and send them to stdout.

String rubbishHtml = new Scanner(new URL("http://www.google.com").openStream(), 
  "UTF-8").useDelimiter("\\A").next();
String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
Document d = XmlUtil.toDocument(cleanXml);
NodeList nodes = XPathAPI.selectNodeList(d, "/html/head/meta");
for (int i=0; i<nodes.getLength(); i++) {
	System.out.println(i + ": " + XmlUtil.getXmlString(nodes.item(i),  true));
}

which currently, and not too surprisingly, generates:

0: <meta content="/images/google_favicon_128.png" itemprop="image"/>

It’s amazing that anyone can find them, really.

toDocument

Converts a String into a W3C Document object

Most of the examples above use this method, so hopefully you get the idea
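One thing to be aware of: the posted toDocument(String) goes via text.getBytes(), which uses the platform default charset. A possible variation, sketched below under the assumption that you only ever start from a String, hands the parser a Reader instead so that no byte encoding is involved. The class name ToDocumentSketch is made up, and unlike the original it wraps parse failures in an unchecked exception to keep the example short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical variation on XmlUtil.toDocument(String)
class ToDocumentSketch {

    /** Parses XML text into a DOM Document without a String-to-bytes round trip. */
    static Document toDocument(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(text)));
            doc.getDocumentElement().normalize(); // merge adjacent text nodes
            return doc;
        } catch (SAXException se) {
            throw new IllegalArgumentException("Text is not well-formed XML", se);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Error creating or running DOM parser", e);
        }
    }

    public static void main(String[] args) {
        // \u00e9 is an e-acute; it survives regardless of the platform charset
        Document d = toDocument("<drink>caf\u00e9</drink>");
        System.out.println(d.getDocumentElement().getTextContent());
    }
}
```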

compact

Removes whitespace surrounding text nodes contained within an Element

Such as this attempt to generate the Cliff’s notes for All’s Well That Ends Well:

// from http://www.ibiblio.org/xml/examples/shakespeare/all_well.xml
String input = 
	"<PERSONAE>\n" +
	"    <TITLE>Dramatis Personae</TITLE>\n" +
	"    \n" +
	"    <PERSONA>KING OF FRANCE</PERSONA>\n" +
	"    <PERSONA>DUKE OF FLORENCE</PERSONA>\n" +
	"    <PERSONA>BERTRAM, Count of Rousillon.</PERSONA>\n" +
	"    <PERSONA>LAFEU, an old lord.</PERSONA>\n" +
	"    <PERSONA>PAROLLES, a follower of Bertram.</PERSONA>\n" +
	"    \n" +
	"    <PGROUP>\n" +
	"        <PERSONA>Steward</PERSONA>\n" +
	"        <PERSONA>Clown</PERSONA>\n" +
	"        <GRPDESCR>servants to the Countess of Rousillon.</GRPDESCR>\n" +
	"    </PGROUP>\n" +
	"    \n" +
	"    <PERSONA>A Page. </PERSONA>\n" +
	"    <PERSONA>COUNTESS OF ROUSILLON, mother to Bertram. </PERSONA>\n" +
	"    <PERSONA>HELENA, a gentlewoman protected by the Countess.</PERSONA>\n" +
	"    <PERSONA>An old Widow of Florence. </PERSONA>\n" +
	"    <PERSONA>DIANA, daughter to the Widow.</PERSONA>\n" +
	"    \n" +
	"    <PGROUP>\n" +
	"        <PERSONA>VIOLENTA</PERSONA>\n" +
	"        <PERSONA>MARIANA</PERSONA>\n" +
	"        <GRPDESCR>neighbours and friends to the Widow.</GRPDESCR>\n" +
	"    </PGROUP>\n" +
	"    \n" +
	"    <PERSONA>Lords, Officers, Soldiers, &amp;c., French and Florentine.</PERSONA>\n" +
	"</PERSONAE>\n";
 
Document d = XmlUtil.toDocument(input);
System.out.println("Before compact():");
System.out.println(XmlUtil.getXmlString(d, true));
 
XmlUtil.compact(d.getDocumentElement());
System.out.println();
System.out.println("And after compact(), the much more readable:");
System.out.println(XmlUtil.getXmlString(d, true));

which generates:

Before compact():
<PERSONAE>
    <TITLE>Dramatis Personae</TITLE>
    
    <PERSONA>KING OF FRANCE</PERSONA>
    <PERSONA>DUKE OF FLORENCE</PERSONA>
    <PERSONA>BERTRAM, Count of Rousillon.</PERSONA>
    <PERSONA>LAFEU, an old lord.</PERSONA>
    <PERSONA>PAROLLES, a follower of Bertram.</PERSONA>
    
    <PGROUP>
        <PERSONA>Steward</PERSONA>
        <PERSONA>Clown</PERSONA>
        <GRPDESCR>servants to the Countess of Rousillon.</GRPDESCR>
    </PGROUP>
    
    <PERSONA>A Page. </PERSONA>
    <PERSONA>COUNTESS OF ROUSILLON, mother to Bertram. </PERSONA>
    <PERSONA>HELENA, a gentlewoman protected by the Countess.</PERSONA>
    <PERSONA>An old Widow of Florence. </PERSONA>
    <PERSONA>DIANA, daughter to the Widow.</PERSONA>
    
    <PGROUP>
        <PERSONA>VIOLENTA</PERSONA>
        <PERSONA>MARIANA</PERSONA>
        <GRPDESCR>neighbours and friends to the Widow.</GRPDESCR>
    </PGROUP>
    
    <PERSONA>Lords, Officers, Soldiers, &amp;c., French and Florentine.</PERSONA>
</PERSONAE>

And after compact(), the much more readable:

<PERSONAE><TITLE>Dramatis Personae</TITLE><PERSONA>KING OF FRANCE</PERSONA><PERSONA>DUKE OF FLORENCE</PERSONA><PERSONA>BERTRAM, Count of Rousillon.</PERSONA><PERSONA>LAFEU, an old lord.</PERSONA><PERSONA>PAROLLES, a follower of Bertram.</PERSONA><PGROUP><PERSONA>Steward</PERSONA><PERSONA>Clown</PERSONA><GRPDESCR>servants to the Countess of Rousillon.</GRPDESCR></PGROUP><PERSONA>A Page.</PERSONA><PERSONA>COUNTESS OF ROUSILLON, mother to Bertram.</PERSONA><PERSONA>HELENA, a gentlewoman protected by the Countess.</PERSONA><PERSONA>An old Widow of Florence.</PERSONA><PERSONA>DIANA, daughter to the Widow.</PERSONA><PGROUP><PERSONA>VIOLENTA</PERSONA><PERSONA>MARIANA</PERSONA><GRPDESCR>neighbours and friends to the Widow.</GRPDESCR></PGROUP><PERSONA>Lords, Officers, Soldiers, &amp;c., French and Florentine.</PERSONA></PERSONAE>
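The compact() in the listing trims each text node in place, which means whitespace-only nodes stay in the tree as empty text nodes (they just don't show up when serialised). If you'd rather they were gone entirely, something along these lines might do. It is a sketch of a different technique rather than the posted method; compactAndPrune and CompactSketch are names invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical variation on XmlUtil.compact()
class CompactSketch {

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Trims text nodes below this node, and removes those that end up empty. */
    static void compactAndPrune(Node node) {
        NodeList children = node.getChildNodes();
        // walk backwards so that removals don't shift the indexes still to be visited
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String trimmed = child.getNodeValue().trim();
                if (trimmed.isEmpty()) {
                    node.removeChild(child);
                } else {
                    child.setNodeValue(trimmed);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                compactAndPrune(child);
            }
        }
    }

    public static void main(String[] args) {
        Element group = parseRoot(
            "<PGROUP>\n  <PERSONA>Steward </PERSONA>\n  <PERSONA>Clown</PERSONA>\n</PGROUP>");
        System.out.println("children before: " + group.getChildNodes().getLength());
        compactAndPrune(group);
        System.out.println("children after: " + group.getChildNodes().getLength());
    }
}
```

Pruning matters mostly when later code walks getChildNodes() by index and doesn't expect text nodes between the elements.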

processContentHandler

Runs a SAX ContentHandler over a Document

You need a ContentHandler in order to process it, so why not keep reading to see an example of one of those.

Class SimpleTableContentHandler

A simple ContentHandler that parses data from a HTML table (into a List of Lists).
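To give a feel for the general idea without needing XmlUtil on the classpath, here is a cut-down sketch of a table-to-list-of-lists handler built on the JDK's DefaultHandler. It is not the SimpleTableContentHandler from the listing (it has no explicit state enum and no logging), and the TableSketch name and the sample table are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical cut-down handler, not the class from XmlUtil
class TableSketch extends DefaultHandler {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> row;   // non-null only while inside a tr
    private StringBuilder cell; // non-null only while inside a td

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("tr")) {
            row = new ArrayList<>();
            rows.add(row);
        } else if (qName.equals("td") && row != null) {
            cell = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // text inside nested formatting tags is still appended to the current cell
        if (cell != null) { cell.append(ch, start, length); }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("td") && cell != null) {
            row.add(cell.toString());
            cell = null;
        } else if (qName.equals("tr")) {
            row = null;
        }
    }

    static List<List<String>> parseTable(String xml) {
        TableSketch handler = new TableSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse table", e);
        }
        return handler.rows;
    }

    public static void main(String[] args) {
        System.out.println(parseTable(
            "<table><tr><td>USD</td><td>US Dollar</td></tr>"
            + "<tr><td>NZD</td><td>New Zealand <b>Dollar</b></td></tr></table>"));
    }
}
```

Like the real thing, this will probably get confused by tables nested inside other tables.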

The example here parses the currency exchange tables from xe.com, and selects a currency using the polar method of G. E. P. Box, M. E. Muller, and G. Marsaglia, as described by Donald E. Knuth in The Art of Computer Programming, Volume 2: Seminumerical Algorithms, section 3.4.1, subsection C, algorithm P.

It then goes on to tell you whether you’d have made a profit or loss by selling that currency 12 months later.

public static List<List<String>> getCurrencyRates(String date) throws MalformedURLException, IOException, SAXException, TransformerException {
	// this doesn't actually work due to http://www.xe.com/errors/noautoextract.htm
	// but I'm sure you enterprising types can get around that
	String url = "http://www.xe.com/currencytables/?from=AUD&date=" + date;
	String rubbishHtml = new Scanner(new URL(url).openStream(), 
	  "UTF-8").useDelimiter("\\A").next();
 
	// hint:
	// String rubbishHtml = Text.getFileContents(date.equals("2012-01-01") ? "c:\\rate1.txt" : "c:\\rate2.txt");
	String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
 
	Document d = XmlUtil.toDocument(cleanXml);
	Element tableEl = (Element) XPathAPI.selectSingleNode(d, ".//table[@id=\"historicalRateTbl\"]");
 
	XmlUtil.SimpleTableContentHandler stch = new XmlUtil.SimpleTableContentHandler();
	XmlUtil.processContentHandler(stch, XmlUtil.getXmlString(tableEl, false));
	List<List<String>> table = stch.getTable();
	return table;
}
 
// see what amount of abstract currency units you can get in exchange for
// the shiny Australian cylinder with a kangaroo on it.
List<List<String>> startTable = getCurrencyRates("2012-01-01");
List<List<String>> endTable = getCurrencyRates("2013-01-01");
 
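// pick an investment strategy; nextGaussian() (java.util.Random) is unbounded and
// can be negative, so fold the result back into a valid row index
int tradingOption = (int) Math.abs(new Random().nextGaussian() * startTable.size()) % startTable.size();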
 
// sanity check that the row with the same index in each table contains the same currency
if (!startTable.get(tradingOption).get(0).equals(endTable.get(tradingOption).get(0))) {
	throw new RuntimeException("Rate tables don't line up");
}
 
System.out.println("You're investing in the " + startTable.get(tradingOption).get(1));
System.out.println("Of which you could pick up " + startTable.get(tradingOption).get(2) + " for a measly AUD$1 way back in 2012");
System.out.println("and then sell for AUD$" + endTable.get(tradingOption).get(3) + " each barely a year later");
 
// and then let's pretend that xe.com is accurate, and that rounding, 
// buy/sell prices and transaction costs don't exist
BigDecimal bd = new BigDecimal(startTable.get(tradingOption).get(2));
BigDecimal bd2 = new BigDecimal(endTable.get(tradingOption).get(3));
BigDecimal val = bd.multiply(bd2);
 
System.out.println("landing you AUD$" + val);
System.out.println(val.compareTo(new BigDecimal(1))>0 ? "Winner!" : "You lose!");

which generates:

You're investing in the Belizean Dollar
Of which you could pick up 2.0266687920 for a measly AUD$1 way back in 2012
and then sell for AUD$0.4810598095 each barely a year later
landing you AUD$0.97494890299911512400
You lose!
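The arithmetic behind that verdict can be checked in isolation. The sketch below just multiplies the two rates from the output above; the class and method names (ProfitSketch, roundTrip) are made up for the example, and it carries over the same pretence that rounding, spreads and transaction costs don't exist.

```java
import java.math.BigDecimal;

// Hypothetical helper reproducing the buy-then-sell calculation
class ProfitSketch {

    /** AUD$1 buys unitsPerAud units, each of which is later sold for audPerUnit. */
    static BigDecimal roundTrip(String unitsPerAud, String audPerUnit) {
        return new BigDecimal(unitsPerAud).multiply(new BigDecimal(audPerUnit));
    }

    public static void main(String[] args) {
        BigDecimal val = roundTrip("2.0266687920", "0.4810598095");
        System.out.println("landing you AUD$" + val); // landing you AUD$0.97494890299911512400
        // compareTo() ignores scale, so 1 and 1.00 compare as equal (equals() would not)
        System.out.println(val.compareTo(BigDecimal.ONE) > 0 ? "Winner!" : "You lose!");
    }
}
```

Multiplying two BigDecimals with ten decimal places each gives a result with twenty, which is where the long tail of digits comes from.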

Class AbstractStackContentHandler

Similar to the Apache Digester, but much much smaller, this ContentHandler creates a ‘stack’ of elements separated by a slash, which it passes through to the concrete implementation.

This example parses a device.xml file (as used in DMX-web, an example of which is attached to the foot of this blog entry), and lists the name of all the devices contained within.

public static class DeviceContentHandler 
  extends XmlUtil.AbstractStackContentHandler {
	List<String> names = new ArrayList<String>();
 
	// process the start of an XML element
	public void element(String path) throws SAXException { }
 
	// process the text of an XML element 
	public void elementText(String path, String content) throws SAXException { 
		if (path.equals("devices/device/name")) {
			names.add(content);
		}
	}
 
	// return the names of all devices
	public List<String> getNames() { return names; }
}
 
DeviceContentHandler dch = new DeviceContentHandler();
XmlUtil.processContentHandler(dch, Text.getFileContents("c:\\device.xml"));
List<String> names = dch.getNames();
for (int i=0; i<names.size(); i++) {
	System.out.println(i + ": " + names.get(i));
}

which generates:

0: DMXKing USB
1: Art-Net
2: WinAMP Controller
3: Ye olde dmxy plugigne

Class NodeListIterator

A wrapper for a NodeList that makes it Iterable

Here’s an example using the WindowTreeDom object from a couple of weeks back

// get the Windows window tree in an XML object and attempt to find the Outlook windows in there
WindowTreeDom wtd = new WindowTreeDom();
Document windows = wtd.getDom();
NodeList outlookWindows = XPathAPI.selectNodeList(windows, 
  ".//window[@class='rctrl_renwnd32']/window[@class='AfxWndW']/window[@class='#32770']");   
 
for (Node n : new XmlUtil.NodeListIterator(outlookWindows)) {
    Element e = (Element) n;
    logger.debug("Found possible window " + XmlUtil.getXmlString(n, true));
}

Which could come in handy to prove that your commercialised electronic marketing campaigns look readable in all three types of email client that exist.
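For the curious, a wrapper like this only takes a handful of lines. The sketch below shows one way it could be written; it isn't the NodeListIterator from the listing, and the IterableNodeList name, the classNames() helper and the sample window XML are all invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical wrapper, not the class from XmlUtil
class IterableNodeList implements Iterable<Node> {
    private final NodeList nodeList;

    IterableNodeList(NodeList nodeList) { this.nodeList = nodeList; }

    @Override
    public Iterator<Node> iterator() {
        return new Iterator<Node>() {
            private int index = 0;
            @Override public boolean hasNext() { return index < nodeList.getLength(); }
            @Override public Node next() {
                if (!hasNext()) { throw new NoSuchElementException(); }
                return nodeList.item(index++);
            }
        };
    }

    /** Collects the class attribute of every window element, in document order. */
    static List<String> classNames(String xml) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> result = new ArrayList<>();
            for (Node n : new IterableNodeList(d.getElementsByTagName("window"))) {
                result.add(((Element) n).getAttribute("class"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(classNames(
            "<desktop><window class='a'/><window class='b'><window class='c'/></window></desktop>"));
    }
}
```

The iterator reads from the live NodeList each time, so modifying the document while looping over it is best avoided.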

So anyhoo, here’s the code already.

XmlUtil.java
package com.randomnoun.common;
 
/* (c) 2013 randomnoun. All Rights Reserved. This work is licensed under a
 * BSD Simplified License. (http://www.randomnoun.com/bsd-simplified.html)
 */
 
import java.io.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
 
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
 
import org.ccil.cowan.tagsoup.*;
import org.ccil.cowan.tagsoup.Parser;
 
import org.w3c.dom.*;
import org.w3c.dom.Element;
import org.xml.sax.*;
 
import org.apache.log4j.Logger;
 
/** XML utility functions
 *
 * @author knoxg
 * @blog http://www.randomnoun.com/wp/2013/01/25/exciting-things-with-xml/
 * @version $Id: XmlUtil.java,v 1.5 2013-09-24 02:37:09 knoxg Exp $
 */
public class XmlUtil {
 
    /** A revision marker to be used in exception stack traces. */
    public static final String _revision = "$Id: XmlUtil.java,v 1.5 2013-09-24 02:37:09 knoxg Exp $";
 
 
	/** Clean some HTML text through the tagsoup filter. The returned string is guaranteed to be 
	 * well-formed XML (and can therefore be used by other tools that expect valid XML). 
	 * 
	 * @param inputXml input XML document
	 * @param isHtml if true, uses the HTML schema, omits the XML declaration, and uses the html method
	 * 
	 * @throws SAXException if the tagsoup library could not parse the input string
	 * @throws IllegalStateException if an error occurred reading from a string (should never occur)
	 */ 
	public static String getCleanXml(String inputXml, boolean isHtml) throws SAXException {
		return getCleanXml(new ByteArrayInputStream(inputXml.getBytes()), isHtml);
	}
 
	/** Clean a HTML inputStream through the tagsoup filter. The returned string is guaranteed to be 
	 * well-formed XML (and can therefore be used by other tools that expect valid XML). 
	 * 
	 * @param inputStream input XML stream
	 * @param isHtml if true, uses the HTML schema, omits the XML declaration, and uses the html method
	 * 
	 * @throws SAXException if the tagsoup library could not parse the input string
	 * @throws IllegalStateException if an error occurred reading from a string (should never occur)
	 */ 
	public static String getCleanXml(InputStream inputStream, boolean isHtml) throws SAXException {
		try {
			ByteArrayOutputStream baos = new ByteArrayOutputStream();
			InputSource is = new InputSource();
			is.setByteStream(inputStream); // could use raw inputstream here later
 
			XMLReader xmlReader = new Parser();
			Writer w = new OutputStreamWriter(baos);
			XMLWriter tagsoupXMLWriter = new XMLWriter(w);
			tagsoupXMLWriter.setOutputProperty(XMLWriter.OMIT_XML_DECLARATION, "yes");
			if (isHtml) {
				HTMLSchema theSchema = new HTMLSchema();
				xmlReader.setProperty(Parser.schemaProperty, theSchema);
 
				tagsoupXMLWriter.setOutputProperty(XMLWriter.METHOD, "html");
				tagsoupXMLWriter.setPrefix(theSchema.getURI(), "");
			}
 
			xmlReader.setContentHandler(tagsoupXMLWriter);
			xmlReader.parse(is);
			return baos.toString();
		} catch (IOException ioe) {
			throw (IllegalStateException) new IllegalStateException("IO Exception reading from string").initCause(ioe);		
		}
	}
 
 
	/**
	 * Iterates through the child nodes of the specified element, and returns the contents
	 * of all Text and CDATA elements among those nodes, concatenated into a string.
	 *
	 * <p>Elements are recursed into.
	 *
	 * @param element the element that contains, as child nodes, the text to be returned.
	 * @return the contents of all the CDATA children of the specified element.
	 */
	public static String getText(Element element)
	{
		if (element == null) { throw new NullPointerException("null element"); }
		StringBuffer buf = new StringBuffer();
		NodeList children = element.getChildNodes();
		for (int i = 0; i < children.getLength(); ++i) {
			org.w3c.dom.Node child = children.item(i);
			short nodeType = child.getNodeType();
			if (nodeType == org.w3c.dom.Node.CDATA_SECTION_NODE) {
				buf.append(((org.w3c.dom.Text) child).getData());			
			} else if (nodeType == org.w3c.dom.Node.TEXT_NODE) {
				buf.append(((org.w3c.dom.Text) child).getData());
			} else if (nodeType == org.w3c.dom.Node.ELEMENT_NODE) {
				buf.append(getText((Element) child));
			}
		}
		return buf.toString();
	}
 
	/**
	 * Iterates through the child nodes of the specified element, and returns the contents
	 * of all Text and CDATA elements among those nodes, concatenated into a string. 
	 * Any elements with tagNames that are included in the tagNames parameter of this
	 * method are also included. 
	 * 
	 * <p>Attributes of these tags are also included in the result, but may be reordered.
	 * 
	 * <p>Self-closing elements (e.g. <code>&lt;br/&gt;</code>)
	 * are expanded into opening and closing elements (e.g. <code>&lt;br&gt;&lt;/br&gt;</code>)
	 *
	 * <p>Elements are recursed into.
	 *
	 * @param element the element that contains, as child nodes, the text to be returned.
	 * @param tagNames the names of the tags to preserve in the returned text
	 * @return the text content of the specified element, including the preserved tags.
	 */
	public static String getTextPreserveElements(Element element, String[] tagNames) {
		if (element == null) { throw new NullPointerException("null element"); }
		Set<String> tagNamesSet = new HashSet<String>(Arrays.asList(tagNames));
		StringBuffer buf = new StringBuffer();
		NodeList children = element.getChildNodes();
		for (int i = 0; i < children.getLength(); ++i) {
			org.w3c.dom.Node child = children.item(i);
			short nodeType = child.getNodeType();
			if (nodeType == org.w3c.dom.Node.CDATA_SECTION_NODE) {
				buf.append(((org.w3c.dom.Text) child).getData());			
			} else if (nodeType == org.w3c.dom.Node.TEXT_NODE) {
				buf.append(((org.w3c.dom.Text) child).getData());
			} else if (nodeType == org.w3c.dom.Node.ELEMENT_NODE) {
				String tagName = ((Element) child).getTagName();
				boolean includeEl = tagNamesSet.contains(tagName);
				if (includeEl) {
					buf.append('<');
					buf.append(tagName);
					NamedNodeMap nnm = ((Element) child).getAttributes();
					for (int j = 0; j < nnm.getLength(); j++) {
						Attr attr = (Attr) nnm.item(j);
						buf.append(" " + attr.getName());
						if (attr.getValue()!=null) {
							buf.append("=\"" + attr.getValue() + "\"");
						}
					}
					buf.append('>');
				}
				buf.append(getTextPreserveElements((Element) child, tagNames));
				if (includeEl) {
					buf.append("</" + tagName + ">");
				}
			}
		}
		return buf.toString();
	}	
 
 
 
	/**
	 * Iterates through the child nodes of the specified element, and returns the contents
	 * of all Text and CDATA elements among those nodes, concatenated into a string.
	 * 
	 * <p>Elements are not recursed into.
	 *
	 * @param element the element that contains, as child nodes, the text to be returned.
	 * @return the contents of all the CDATA children of the specified element.
	 */
	public static String getTextNonRecursive(Element element)
	{
		if (element == null) { throw new NullPointerException("null element"); }
		StringBuffer buf = new StringBuffer();
		NodeList children = element.getChildNodes();
		for (int i = 0; i < children.getLength(); ++i) {
			org.w3c.dom.Node child = children.item(i);
			short nodeType = child.getNodeType();
			if (nodeType == org.w3c.dom.Node.CDATA_SECTION_NODE) {
				buf.append(((org.w3c.dom.Text) child).getData());			
			} else if (nodeType == org.w3c.dom.Node.TEXT_NODE) {
				buf.append(((org.w3c.dom.Text) child).getData());
			} else if (nodeType == org.w3c.dom.Node.ELEMENT_NODE) {
				// ignore child elements
			}
		}
		return buf.toString();
	}
 
	/** Return a DOM document object from an XML string
	 * 
	 * @param text the string representation of the XML to parse 
	 */
	public static Document toDocument(String text) throws SAXException {
		return toDocument(new ByteArrayInputStream(text.getBytes()));
	}
 
	/** Return a DOM document object from an InputStream
	 * 
	 * @param is the InputStream containing the XML to parse 
	 */
	public static Document toDocument(InputStream is) throws SAXException {
		try {
			DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
			DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
			Document doc = docBuilder.parse(is);
			doc.getDocumentElement().normalize(); // Collapses adjacent text nodes into one node.
			return doc;
		} catch (ParserConfigurationException pce) {
			// this can never happen 
			throw (IllegalStateException) new IllegalStateException("Error creating DOM parser").initCause(pce);
		} catch (IOException ioe) {
			// this can also never happen
			throw (IllegalStateException) new IllegalStateException("Error retrieving information").initCause(ioe);
		} 
	}
 
	/** Converts a document node subtree back into an XML string 
	 * 
	 * @param node a DOM node 
	 * @param omitXmlDeclaration if true, omits the XML declaration from the returned result
	 * 
	 * @return the XML for this node
	 * 
	 * @throws TransformerException if the transformation to XML failed
	 * @throws IllegalStateException if the transformer could not be initialised 
	 */
	public static String getXmlString(Node node, boolean omitXmlDeclaration) 
		throws TransformerException 
	{
		try {
			ByteArrayOutputStream baos = new ByteArrayOutputStream();
			TransformerFactory transformerFactory = TransformerFactory.newInstance();
			Transformer transformer = transformerFactory.newTransformer();
			DOMSource source = new DOMSource(node);
			StreamResult result = new StreamResult(baos);
			transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, omitXmlDeclaration ? "yes": "no");
			transformer.transform(source, result);
			return baos.toString();
		} catch (TransformerConfigurationException tce) {
			throw (IllegalStateException) new IllegalStateException("Could not initialise transformer").initCause(tce);
		}
	}
 
 
	/** Remove leading/trailing whitespace from all text nodes in this node.
	 * Will iterate through subnodes recursively.
	 * 
	 * @param node the node to compact
	 */
	public static void compact(Node node) {
		if (node.getNodeType()==Node.TEXT_NODE) {
			org.w3c.dom.Text el = (org.w3c.dom.Text) node;
			if (el.getNodeValue()!=null) {
				el.setNodeValue(el.getNodeValue().trim());
			}
		} else if (node.getNodeType()==Node.ELEMENT_NODE) {
			NodeList childNodes = node.getChildNodes();
			if (childNodes != null && childNodes.getLength() > 0) {
				int len = childNodes.getLength();
				for (int i=0; i<len; i++) {
					Node childNode = childNodes.item(i);
				    compact(childNode);
				}
			}
		}
	}
 
 
	/** Parse a string of XML text using a SAX contentHandler. Nothing is returned by this method - it 
	 * is assumed that the contentHandler supplied maintains its own state as it parses the XML supplied,
	 * and that this state can be extracted from this object afterwards.
	 * 
	 * @param contentHandler a SAX content handler 
	 * @param xmlText an XML document (or part thereof)
	 * 
	 * @throws SAXException if the document could not be parsed
	 * @throws IllegalStateException if the parser could not be initialised, or an I/O error occurred 
	 *   (should not happen since we're just dealing with strings)
	 */
	public static void processContentHandler(ContentHandler contentHandler, String xmlText) throws SAXException {
		 SAXParserFactory factory = SAXParserFactory.newInstance();
		 try {
			 // Parse the input
			 SAXParser saxParser = factory.newSAXParser();
			 XMLReader xmlReader = saxParser.getXMLReader();
			 xmlReader.setContentHandler(contentHandler);
			 xmlReader.parse(new InputSource(new ByteArrayInputStream(xmlText.getBytes())));
		 } catch (IOException ioe) {
		 	throw (IllegalStateException) new IllegalStateException("IO Exception reading from string").initCause(ioe);
		 } catch (ParserConfigurationException pce) {
			throw (IllegalStateException) new IllegalStateException("Could not initialise parser").initCause(pce);		 		
		 }
	}
 
	/** Convert a table into a List of Lists (each top-level list represents a table row,
	 * each second-level list represents a table cell). Only contents are returned; attributes
	 * and formatting are ignored.
	 * 
	 * <p>This class will probably not work when tables are embedded within other tables
	 */
	public static class SimpleTableContentHandler
		implements ContentHandler 
	{
		/** Logger instance for this class */
		public static final Logger logger = Logger.getLogger(SimpleTableContentHandler.class);
 
		/** Current table */
		List<List<String>> thisTable = null;
		/** Current row in table */
		List<String> thisRow = null;
		/** Current cell in row */
		String thisCell = "";
 
		/** The state of this parser */
		private enum State {
			/** start of doc, expecting 'table' */
			START,
			/** in table element, expecting 'tr' */
			IN_TABLE,
			/** in tr element, expecting 'td' (or other ignored elements) */
			IN_TR,
			/** in td element, capturing to closing tag */
			IN_TD
		}
 
		State state = State.START;
 
		// unused interface methods
		public void setDocumentLocator(Locator locator) { }
		public void startDocument() throws SAXException { }
		public void endDocument() throws SAXException { }
		public void startPrefixMapping(String prefix, String uri) throws SAXException { }
		public void endPrefixMapping(String prefix) throws SAXException { }
		public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException { }
		public void processingInstruction(String target, String data) throws SAXException { }
		public void skippedEntity(String name) throws SAXException { }
 
 
		public void startElement(String uri, String localName, String qName, Attributes atts)
			throws SAXException 
		{
			switch (state) {
				case START: 
					if (qName.equals("table")) {
						thisTable = new ArrayList<List<String>>(); 
						state = State.IN_TABLE; 
					} else {
						logger.warn("Warning: top-level element '" + qName + "' found (expected 'table')");
					}
					break;
 
				case IN_TABLE:
					if (qName.equals("tr")) {
						thisRow = new ArrayList<String>();
						thisTable.add(thisRow);
						state = State.IN_TR;
					}
					break;
 
				case IN_TR: 
					if (qName.equals("td")) {
						thisCell = "";
						state = State.IN_TD;
					}
					break;
 
				case IN_TD:
					break;
 
				default:
					throw new IllegalStateException("Illegal state " + state + " in SimpleTableContentHandler");
 
			}
		}
 
		public void characters(char[] ch, int start, int length)
			throws SAXException {
			if (state==State.IN_TD) {
				thisCell += new String(ch, start, length);
			}
		}
 
		public void endElement(String uri, String localName, String qName)
			throws SAXException 
		{
			if (state == State.IN_TD && qName.equals("td")) {
				thisRow.add(thisCell);
				state = State.IN_TR;
			} else if (state == State.IN_TR && qName.equals("tr")) {
				state = State.IN_TABLE;
			}
		}
 
		public List<List<String>> getTable() {
			return thisTable;
		}
	}
 
	/** An abstract stack-based XML parser. Similar to the apache digester, but without
	 * the dozen or so dependent JARs.
	 * 
	 * <p>Only element text is captured 
	 * <p>Element attributes are not parsed by this class.
	 * <p>Mixed text/element nodes are not parsed by this class.
	 * 
	 */
	public abstract static class AbstractStackContentHandler implements ContentHandler 
	{
		/** Logger instance for this class */
		public static final Logger logger = Logger.getLogger(AbstractStackContentHandler.class);
 
		/** Location in stack */
		private String stack = "";
		private String text = null;     // text captured so far
 
		// unused interface methods
		public void setDocumentLocator(Locator locator) { }
		public void startDocument() throws SAXException { }
		public void endDocument() throws SAXException { }
		public void startPrefixMapping(String prefix, String uri) throws SAXException { }
		public void endPrefixMapping(String prefix) throws SAXException { }
		public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException { }
		public void processingInstruction(String target, String data) throws SAXException { }
		public void skippedEntity(String name) throws SAXException { }
 
		public void startElement(String uri, String localName, String qName, Attributes atts)
			throws SAXException 
		{
			stack = stack.equals("") ? qName : stack + "/" + qName;
			text = "";
			element(stack);
		}
		public void characters(char[] ch, int start, int length) throws SAXException {
			text += new String(ch, start, length);
		}
		public void endElement(String uri, String localName, String qName)
			throws SAXException 
		{
			elementText(stack, text);
			text = ""; // probably not necessary
			stack = stack.contains("/") ? stack.substring(0, stack.lastIndexOf("/")) : "";
		}
		public abstract void element(String path) throws SAXException;
		public abstract void elementText(String path, String content) throws SAXException;
	}
 
 
	/** Convert a NodeList into something that Java1.5 can treat as Iterable,
	 * so that it can be used in <tt>for (Node node : nodeList) { ... }</tt> style
	 * constructs.
	 * 
	 * <p>(org.w3c.dom.traversal.NodeListIterator doesn't currently implement Iterable)
	 * 
	 */
	public static class NodeListIterator implements Iterable<org.w3c.dom.Node> {
		private final NodeList nodeList;
		public NodeListIterator(NodeList nodeList) {
			this.nodeList = nodeList;
		}
		public Iterator<org.w3c.dom.Node> iterator() {
			return new Iterator<org.w3c.dom.Node>() {
				private int index = 0;
				public boolean hasNext() {
					return index < nodeList.getLength();
				}
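				public org.w3c.dom.Node next() {
					// honour the Iterator contract rather than returning null past the end
					if (!hasNext()) {
						throw new java.util.NoSuchElementException("no more nodes in NodeList");
					}
					return nodeList.item(index++);
				}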
				public void remove() {
					throw new UnsupportedOperationException("remove() not allowed in NodeList");
				}
			};
		}
	}
 
 
 
}
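
The NodeListIterator wrapper at the end of that class is easiest to understand in use. What follows is a minimal stand-alone sketch of the same idea that needs only the JDK; the names NodeListIterationSketch, IterableNodeList and personaNames are made up for this illustration and are not part of XmlUtil.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; none of these names exist in XmlUtil.
class NodeListIterationSketch {

    /** Wraps a NodeList so that it can be used in a for-each loop. */
    static class IterableNodeList implements Iterable<Node> {
        private final NodeList nodeList;

        IterableNodeList(NodeList nodeList) {
            this.nodeList = nodeList;
        }

        public Iterator<Node> iterator() {
            return new Iterator<Node>() {
                private int index = 0;
                public boolean hasNext() {
                    return index < nodeList.getLength();
                }
                public Node next() {
                    if (!hasNext()) { throw new NoSuchElementException(); }
                    return nodeList.item(index++);
                }
                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    }

    /** Returns the text of every persona element in the supplied XML. */
    static List<String> personaNames(String xml) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<String>();
            // the wrapper is what makes this for-each loop compile
            for (Node node : new IterableNodeList(d.getElementsByTagName("persona"))) {
                names.add(node.getTextContent());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // prints [Steward, Clown]
        System.out.println(personaNames(
            "<personae><persona>Steward</persona><persona>Clown</persona></personae>"));
    }
}
```

With the real class you would presumably write new XmlUtil.NodeListIterator(nodes) in place of the hypothetical wrapper above.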

And here is the medley of contrived examples from above:

XmlUtilExample.java
package com.example.contrived.cli;
 
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.math.BigDecimal;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
 
import javax.xml.transform.TransformerException;
 
import org.apache.xpath.XPathAPI;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
 
import com.randomnoun.common.XmlUtil;
 
public class XmlUtilExample {
 
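	// utility method: read an entire file into a String (platform default charset)
    public static String getFileContents(String filename)
        throws IOException {
        File file = new File(filename);
        byte[] data = new byte[(int) file.length()];
        FileInputStream fis = new FileInputStream(file);
        try {
            // a single read() may return fewer bytes than requested, so loop
            int offset = 0;
            while (offset < data.length) {
                int len = fis.read(data, offset, data.length - offset);
                if (len < 0) {
                    throw new IOException("Buffer read != size of file");
                }
                offset += len;
            }
        } finally {
            fis.close();
        }
        return new String(data);
    }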
 
 
	public void getTextExample() throws TransformerException, 
	  SAXException, MalformedURLException, IOException 
	{
		// uses the Scanner trick from http://weblogs.java.net/blog/pat/archive/2004/10/stupid_scanner.html
		String rubbishHtml = new Scanner(new URL("http://www.microsoft.com").openStream(), 
		  "UTF-8").useDelimiter("\\A").next();
		String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
		Document d = XmlUtil.toDocument(cleanXml);
		Element titleEl = (Element) XPathAPI.selectSingleNode(d, "/html/head/title");
		String title = XmlUtil.getText(titleEl);
		System.out.println(title);
	}
 
	public void getTextPreserveElementsExample() throws SAXException {
		String input = "<p>Here is some formatted text <b>in bold</b>, <i>in italics</i>," +
		  " <blink>blinking</blink>, and with an image at the end <img src=\"frog.png\"/></p>";
		Document d = XmlUtil.toDocument(input);
		Element paraEl = d.getDocumentElement();
		System.out.println(XmlUtil.getText(paraEl));
		System.out.println(XmlUtil.getTextPreserveElements(paraEl, 
		  new String[] { "b", "i", "img" } ));
	}
 
	public void getTextNonRecursiveExample() throws SAXException {
		String input = "<p>This is the bit we're interested in <span>but not this bit</span></p>";
		Document d = XmlUtil.toDocument(input);
		Element paraEl = d.getDocumentElement();
		System.out.println(XmlUtil.getText(paraEl));
		System.out.println(XmlUtil.getTextNonRecursive(paraEl));
	}
 
	public void getXmlStringExample() throws MalformedURLException, 
	  IOException, SAXException, TransformerException 
	{
		// how to determine what 'keywords' and 'description' meta tags google has
		String rubbishHtml = new Scanner(new URL("http://www.google.com").openStream(), 
		  "UTF-8").useDelimiter("\\A").next();
		String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
		Document d = XmlUtil.toDocument(cleanXml);
		NodeList nodes = XPathAPI.selectNodeList(d, "/html/head/meta");
		for (int i=0; i<nodes.getLength(); i++) {
			System.out.println(i + ": " + XmlUtil.getXmlString(nodes.item(i),  true));
		}
	}
 
	public void compactExample() throws SAXException, TransformerException {
		// from http://www.ibiblio.org/xml/examples/shakespeare/all_well.xml
		String input = 
			"<PERSONAE>\n" +
			"    <TITLE>Dramatis Personae</TITLE>\n" +
			"    \n" +
			"    <PERSONA>KING OF FRANCE</PERSONA>\n" +
			"    <PERSONA>DUKE OF FLORENCE</PERSONA>\n" +
			"    <PERSONA>BERTRAM, Count of Rousillon.</PERSONA>\n" +
			"    <PERSONA>LAFEU, an old lord.</PERSONA>\n" +
			"    <PERSONA>PAROLLES, a follower of Bertram.</PERSONA>\n" +
			"    \n" +
			"    <PGROUP>\n" +
			"        <PERSONA>Steward</PERSONA>\n" +
			"        <PERSONA>Clown</PERSONA>\n" +
			"        <GRPDESCR>servants to the Countess of Rousillon.</GRPDESCR>\n" +
			"    </PGROUP>\n" +
			"    \n" +
			"    <PERSONA>A Page. </PERSONA>\n" +
			"    <PERSONA>COUNTESS OF ROUSILLON, mother to Bertram. </PERSONA>\n" +
			"    <PERSONA>HELENA, a gentlewoman protected by the Countess.</PERSONA>\n" +
			"    <PERSONA>An old Widow of Florence. </PERSONA>\n" +
			"    <PERSONA>DIANA, daughter to the Widow.</PERSONA>\n" +
			"    \n" +
			"    <PGROUP>\n" +
			"        <PERSONA>VIOLENTA</PERSONA>\n" +
			"        <PERSONA>MARIANA</PERSONA>\n" +
			"        <GRPDESCR>neighbours and friends to the Widow.</GRPDESCR>\n" +
			"    </PGROUP>\n" +
			"    \n" +
			"    <PERSONA>Lords, Officers, Soldiers, &amp;c., French and Florentine.</PERSONA>\n" +
			"</PERSONAE>\n";
 
		Document d = XmlUtil.toDocument(input);
		System.out.println("Before compact():");
		System.out.println(XmlUtil.getXmlString(d, true));
 
		XmlUtil.compact(d.getDocumentElement());
		System.out.println();
		System.out.println("And after compact(), the much more readable:");
		System.out.println(XmlUtil.getXmlString(d, true));
	}
 
	/** @see #simpleTableContentHandlerExample() */
	public static List<List<String>> getCurrencyRates(String date) throws MalformedURLException, 
	  IOException, SAXException, TransformerException 
	{
		// this doesn't actually work due to http://www.xe.com/errors/noautoextract.htm
		// but I'm sure you enterprising types can get around that
		//String url = "http://www.xe.com/currencytables/?from=AUD&date=" + date;
		//String rubbishHtml = new Scanner(new URL(url).openStream(), 
		//  "UTF-8").useDelimiter("\\A").next();
 
		String rubbishHtml = getFileContents(date.equals("2012-01-01") ? "c:\\rate1.txt" : "c:\\rate2.txt");
		String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
 
		Document d = XmlUtil.toDocument(cleanXml);
		Element tableEl = (Element) XPathAPI.selectSingleNode(d, ".//table[@id=\"historicalRateTbl\"]");
 
		XmlUtil.SimpleTableContentHandler stch = new XmlUtil.SimpleTableContentHandler();
		XmlUtil.processContentHandler(stch, XmlUtil.getXmlString(tableEl, false));
		List<List<String>> table = stch.getTable();
		return table;
	}
 
	/** @throws TransformerException 
	 * @throws SAXException 
	 * @throws IOException 
	 * @throws MalformedURLException 
	 * @see #getCurrencyRates(String) */
	public void simpleTableContentHandlerExample() throws MalformedURLException, IOException, 
	  SAXException, TransformerException 
	{
		// see what amount of abstract currency units you can get in exchange for
		// the shiny Australian cylinder with a kangaroo on it.
		List<List<String>> startTable = getCurrencyRates("2012-01-01");
		List<List<String>> endTable = getCurrencyRates("2013-01-01");
		// pick an investment strategy
		int tradingOption = (int) (Math.random() * startTable.size());
		if (!startTable.get(tradingOption).get(0).equals(endTable.get(tradingOption).get(0))) {
			throw new RuntimeException("Rate tables don't line up");
		}
		System.out.println("You're investing in the " + startTable.get(tradingOption).get(1));
		System.out.println("Of which you could pick up " + startTable.get(tradingOption).get(2) + " for a measly AUD$1 way back in 2012");
		System.out.println("and then sell for AUD$" + endTable.get(tradingOption).get(3) + " each barely a year later");
 
		// and then let's pretend that rounding, buy/sell prices and transaction costs don't exist
		BigDecimal bd = new BigDecimal(startTable.get(tradingOption).get(2));
		BigDecimal bd2 = new BigDecimal(endTable.get(tradingOption).get(3));
		BigDecimal val = bd.multiply(bd2);
 
		// which should be $1 exactly, but xe.com can't put an infinite number of decimal points 
		// on their tables
		System.out.println("landing you AUD$" + val);
		System.out.println(val.compareTo(new BigDecimal(1))>0 ? "Winner!" : "You lose!");
	}
 
	/** @see #abstractStackContentHandlerExample() */
	public static class DeviceContentHandler extends XmlUtil.AbstractStackContentHandler {
		List<String> names = new ArrayList<String>();
 
		// process the start of an XML element
		public void element(String path) throws SAXException { }
 
		// process the text of an XML element 
		public void elementText(String path, String content) throws SAXException { 
			if (path.equals("devices/device/name")) {
				names.add(content);
			}
		}
 
		// return the names of all devices
		public List<String> getNames() { return names; }
	}
 
	/** @throws IOException 
	 * @throws SAXException 
	 * @see XmlUtilExample.DeviceContentHandler */
	public void abstractStackContentHandlerExample() throws SAXException, IOException {
		DeviceContentHandler dch = new DeviceContentHandler();
		XmlUtil.processContentHandler(dch, getFileContents("c:\\device.xml"));
		List<String> names = dch.getNames();
		for (int i=0; i<names.size(); i++) {
			System.out.println(i + ": " + names.get(i));
		}
	}
 
 
	/**
	 * @param args
	 * @throws SAXException 
	 * @throws IOException 
	 * @throws TransformerException 
	 * @throws MalformedURLException 
	 */
	public static void main(String[] args) throws SAXException, MalformedURLException, 
	  TransformerException, IOException 
	{
		XmlUtilExample xue = new XmlUtilExample();
 
		xue.getTextExample();
		xue.getTextPreserveElementsExample();
		xue.getTextNonRecursiveExample();
		xue.getXmlStringExample();
		xue.compactExample();
		xue.simpleTableContentHandlerExample();
		xue.abstractStackContentHandlerExample();
	}
 
}
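
Since the currency example depends on a couple of files on my C: drive that you don't have, here is a stand-alone sketch of the table-scraping idea that runs against an inline string. It is a simplification rather than the XmlUtil code: it extends the JDK's DefaultHandler and tracks its position with two nullable fields instead of the State enum, and the names TableSketch, TableHandler and parseTable are invented for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-alone sketch; none of these names exist in XmlUtil.
class TableSketch {

    /** Collects the text of each td cell, row by row. Nested tables are not supported. */
    static class TableHandler extends DefaultHandler {
        final List<List<String>> table = new ArrayList<List<String>>();
        private List<String> row = null;       // non-null only while inside a tr
        private StringBuilder cell = null;     // non-null only while inside a td

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("tr")) {
                row = new ArrayList<String>();
                table.add(row);
            } else if (qName.equals("td") && row != null) {
                cell = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // text outside a td is ignored
            if (cell != null) { cell.append(ch, start, length); }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("td") && cell != null) {
                row.add(cell.toString());
                cell = null;
            } else if (qName.equals("tr")) {
                row = null;
            }
        }
    }

    /** Parses well-formed table markup into a list of rows, each a list of cell texts. */
    static List<List<String>> parseTable(String xml) {
        try {
            TableHandler handler = new TableHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.table;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse table", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<table><tr><td>A1</td><td>B1</td><td>C1</td></tr>"
            + "<tr><td>A2</td><td>B2</td></tr></table>";
        // prints [[A1, B1, C1], [A2, B2]]
        System.out.println(parseTable(xml));
    }
}
```

The input has to be well-formed XML already, which is why the real example runs the page through getCleanXml first.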

And the unit tests and associated resources:

XmlUtilTest.java
package com.randomnoun.common;
 
/* (c) 2013 randomnoun. All Rights Reserved. This work is licensed under a
 * BSD Simplified License. (http://www.randomnoun.com/bsd-simplified.html)
 */
 
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerException;
 
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;
 
import com.randomnoun.common.XmlUtil.SimpleTableContentHandler;
 
import junit.framework.TestCase;
 
/** Test the XmlUtil class.
 * 
 * @author knoxg
 * @blog http://www.randomnoun.com/wp/2013/01/25/exciting-things-with-xml/
 * @version $Id: XmlUtilTest.java,v 1.9 2013-09-24 02:37:09 knoxg Exp $
 *
 */
public class XmlUtilTest extends TestCase {
 
	private String getResourceAsString(String name) throws IOException {
		ByteArrayOutputStream baos = new ByteArrayOutputStream();
		InputStream is = XmlUtilTest.class.getResourceAsStream("./" + name);
		if (is==null) { 
			is = XmlUtilTest.class.getResourceAsStream(name);
			if (is==null) {
				throw new IllegalStateException("Missing resource"); 
			}
		}
		int ch = is.read(); while (ch!=-1) { baos.write(ch); ch=is.read(); }
		is.close();
		return baos.toString();
	}
 
	public void testGetCleanXml() throws IOException, SAXException {
		String input, output;		
		input = "<html>So here's <br>some HTML with<p>some <b>unclosed <i>formatting</p>tags.";
		output = XmlUtil.getCleanXml(input, false); // apply HTML rules
		assertEquals("<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>So here's <br clear=\"none\"></br>some HTML with<p>some <b>unclosed <i>formatting</i></b></p><b><i>tags.</i></b></body></html>", output.trim());
 
		output = XmlUtil.getCleanXml(input, true); // apply XML rules
		assertEquals("<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>So here's <br clear=\"none\">some HTML with<p>some <b>unclosed <i>formatting</i></b></p><b><i>tags.</i></b></body></html>", output.trim());
 
		// clean the 'extreme' HTML from http://ccil.org/~cowan/XML/tagsoup/extreme.html
		input = getResourceAsString("/tagsoupInput.txt");
		output = XmlUtil.getCleanXml(input, true);
		String expected = getResourceAsString("/tagsoupOutput.txt");
		expected = expected.replaceAll("\r\n", "\n");
		assertEquals(expected, output);
 
	}
 
	public void testGetText() throws SAXException {
		String input, output;
 
		input = "<p>Here is some <b>bold text</b> and <i>some <u>underlined</u> italics</i> text</p>";
		Document d = XmlUtil.toDocument(input);
		Element paraEl = d.getDocumentElement(); // top-level element is a paragraph element
		output = XmlUtil.getText(paraEl);
		assertEquals("Here is some bold text and some underlined italics text", output);
 
	}
 
	public void testGetTextPreserveElements() throws SAXException {
		String input, output;
		Document d;
 
		input = "<p>Here is some <b>bold text</b> and <i>some <u>underlined</u> italics</i> text</p>";
		d = XmlUtil.toDocument(input);
		Element paraEl = d.getDocumentElement(); // top-level element is a paragraph element
		output = XmlUtil.getTextPreserveElements(paraEl, new String[] { "b", "i", "u" } );
		assertEquals("Here is some <b>bold text</b> and <i>some <u>underlined</u> italics</i> text", output);
 
		output = XmlUtil.getTextPreserveElements(paraEl, new String[] { "b", "i" } );
		assertEquals("Here is some <b>bold text</b> and <i>some underlined italics</i> text", output);
 
		output = XmlUtil.getTextPreserveElements(paraEl, new String[] { "i" } );
		assertEquals("Here is some bold text and <i>some underlined italics</i> text", output);
 
		// attributes are preserved, but may be reordered
		// self-closing elements are expanded to an opening and closing element
		input = "<p>And this would be a paragraph with two images in it: <img src=\"src1.png\" /> and <img src=\"src2.png\"></img></p>";
		d = XmlUtil.toDocument(input);
		paraEl = d.getDocumentElement(); // top-level element is a paragraph element
		output = XmlUtil.getTextPreserveElements(paraEl, new String[] { "img" } );
		assertEquals("And this would be a paragraph with two images in it: <img src=\"src1.png\"></img> and <img src=\"src2.png\"></img>", output);
 
	}
 
	public void testGetTextNonRecursive() throws SAXException {
		String input, output;
 
		input = "<p>Here is some <b>bold text</b> and <i>some <u>underlined</u> italics</i> text</p>";
		Document d = XmlUtil.toDocument(input);
		Element paraEl = d.getDocumentElement(); // top-level element is a paragraph element
		output = XmlUtil.getTextNonRecursive(paraEl);
		assertEquals("Here is some  and  text", output);
 
	}
 
	public void testToDocumentString() {
		// should have been tested above, but maybe do some more samples here
 
	}
 
	public void testToDocumentInputStream() throws SAXException, TransformerException {
		String input, output;
		InputStream is;
		Document d;
 
		// DOCTYPEs are discarded
		input = "<!DOCTYPE something><p>XML with a DOCTYPE</p>";
		is = new ByteArrayInputStream(input.getBytes());
		d = XmlUtil.toDocument(is);
		output = XmlUtil.getXmlString(d.getDocumentElement(), false);
		assertEquals("<?xml version=\"1.0\" encoding=\"UTF-8\"?><p>XML with a DOCTYPE</p>", output);
 
		// XML with namespaces
		input = "<root xmlns:dc=\"http://purl.org/dc/terms/\"" +
          " xmlns:html=\"http://www.w3.org/1999/xhtml\">" +
		  "<html:p><dc:something>dublin core dublin core</dc:something>" +
		  "Well this should be aware of namespaces then</html:p></root>";
		is = new ByteArrayInputStream(input.getBytes());
		d = XmlUtil.toDocument(is); 
		output = XmlUtil.getXmlString(d.getDocumentElement(), false);
		assertEquals("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + input, output);
 
	}
 
	public void testGetXmlString() throws SAXException, TransformerException {
		String input, output;
		Document d;
		Element paraEl;
 
		input = "<p>Here is some <b>bold text</b> and <i>some <u>underlined</u> italics</i> text</p>";
		d = XmlUtil.toDocument(input);
		paraEl = d.getDocumentElement(); 
		output = XmlUtil.getXmlString(paraEl, true);
		assertEquals(input, output); // output should match input
 
		output = XmlUtil.getXmlString(paraEl, false);
		// output should match input with XML declaration
		assertEquals("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + input, output);
 
	}
 
 
	public void testCompact() throws ParserConfigurationException, SAXException, TransformerException {
		String input =
			"<body>\n" +
		    "  <el>  This is an element\n" +
			"    <el2> This is another one</el2>\n" +
		    "    <el3>and another <!-- with some comments --></el3>\n" +
			"  </el>\n" +
		    "</body>";
		Document d = XmlUtil.toDocument(input);
		String output = XmlUtil.getXmlString(d.getDocumentElement(), true);
		assertEquals(input.replaceAll("\n", System.getProperty("line.separator")), output);
		XmlUtil.compact(d.getDocumentElement());
		output = XmlUtil.getXmlString(d.getDocumentElement(), true);
		assertEquals("<body><el>This is an element<el2>This is another one</el2>" +
		  "<el3>and another<!-- with some comments --></el3></el></body>", output);
 
	}
 
	public void testProcessContentHandler_SimpleTable() throws SAXException {
		String input = 
			"<table>" +
			"<tr><td>A1</td><td>B1</td><td>C1</td></tr>" +
			"<tr><td>A2</td><td>B2</td></tr>" +
			"<tr><td>A3</td><td>B3</td><td>C3</td></tr>" +
			"</table>";
 
		SimpleTableContentHandler stch = new SimpleTableContentHandler();
		XmlUtil.processContentHandler(stch, input);
		List<List<String>> table = stch.getTable();
		assertEquals(3, table.size()); // 3 rows
		assertEquals(3, table.get(0).size()); // 3 columns in first row
		assertEquals(2, table.get(1).size()); // 2 columns in second row
		assertEquals("C1", table.get(0).get(2)); // contents of row 1, cell 3
	}
 
	/** A test class that extends XmlUtil.AbstractStackContentHandler.
	 * The "real" DeviceContentHandler populates DeviceTO and DevicePropertyTO objects here, 
	 * but for the purposes of this unit test, I'm just storing the result in 
	 * structured Lists of Maps.   
	 */
	public static class DeviceContentHandler extends XmlUtil.AbstractStackContentHandler {
		List<Map<String, Object>> result = new ArrayList<Map<String, Object>>();
		Map<String, Object> d = null;  // current device
		Map<String, String> prop = null;  // current property
 
		// the 'path' variable is maintained by the AbstractStackContentHandler as a 
		// slash-delimited path of the current element within the XML document. 
 
		/** paths that match this pattern define attributes of a device */
		Pattern p1 = Pattern.compile("^devices/device/(name|className|type|active|universeNumber)$");
 
		/** paths that match this pattern define attributes of a deviceProperty */
		Pattern p2 = Pattern.compile("^devices/device/deviceProperties/deviceProperty/(key|value)$");
 
		/** process the start of an XML element */
		public void element(String path) throws SAXException {
			if (path.equals("devices/device")) {
				d = new HashMap<String, Object>();
				d.put("deviceProperties", new ArrayList<Map<String, String>>());
				result.add(d);
			} else if (path.equals("devices/device/deviceProperties")) {
				// 
			} else if (path.equals("devices/device/deviceProperties/deviceProperty")) {
				prop = new HashMap<String, String>();
				((List) d.get("deviceProperties")).add(prop);
			}
		}
 
		/** process the text of an XML element */
		public void elementText(String path, String content) throws SAXException {
			Matcher m1 = p1.matcher(path);
			if (m1.matches()) {
				d.put(m1.group(1), content);
			} else {
				Matcher m2 = p2.matcher(path);
				if (m2.matches()) {
					prop.put(m2.group(1), content);
				}
			}
		}
	}
 
	public void testProcessContentHandler_AbstractStack() throws SAXException, IOException {
		String input = getResourceAsString("/device.xml");
    	DeviceContentHandler dch = new DeviceContentHandler();
    	XmlUtil.processContentHandler(dch, input);
 
    	List<Map<String, Object>> devices = dch.result;
    	assertEquals(4, devices.size());
 
    	Map<String, Object> device = devices.get(1); // second device
    	assertEquals("Art-Net", device.get("name"));
    	assertEquals("com.randomnoun.dmx.dmxDevice.artNet.ArtNet", device.get("className"));
 
    	List<Map<String, String>> properties = (List) device.get("deviceProperties");
    	assertEquals(6, properties.size()); // six properties
 
    	Map<String, String> property = properties.get(0);
    	assertEquals("artNetSubnetId", property.get("key"));
    	assertEquals("0", property.get("value"));
 
	}
 
 
 
}
tagsoupInput.txt
<META-START>
John Cowan
<TABLE>
<ROW>
<CELL>SOUPE</CELL>
<CELL>BE EVIL!</CELL></ROW>
DE BALISES</TABLE>
<CORR NEW="U" LOC="PI"/>
<G ID="P1">
Ecritez une balise ouvrante (sans attributs)
</G>
ou fermante HTML ici, s.v.p.</META-START>
<FONT>X Y <p> ABC </FONT> xyz
QRS<sup>TUV<sub>WXY</sup>Z</sub>
<script language="javascript"><p></script>
<table><tbody><tr><th>ABC
</table><nr/>
<meta><meta><meta><meta>
<pre xml:space="default">test</pre>
<test xmlns:xml="http://www.example.org/>
</test><hr/>
(add a random HTML tag above)
<r:r:r:test/>
<b><i></B></I>
<b>
  <p>bbb</b></p>
  <p>bbb</b></p>
  <p>bbb</b></p>
<blink>&grec;
<p xmlns:xqp="http://www.w3.org/1998/XML">
 <span xqp:space="preserve">~~~</span>
</p></blink>
<html:p xmlns:html="http://...."></p>
<@/><!--Apple logo in PUA-->
<!--comment--comment-->
<!--comment--comment>
<P>]]>
<P id="7" id="8">M</p>
<p xmlns:a="urn" xmlns:b="urn"
   a:id="7" b:id="9">~~~</p>
<p id="a" idref="a"/>  BE EVIL!
<extreme sID="a" mood="happy"/>
<extreme eID="a" mood="sad"/>
<math><mi>2</mi><msup>3
  </msup></math>  <title>
<verse><seg>When,</seg><seg>in</line>
<line>the beginning</line><line>God created
the heaven and the earth.</line></verse>
<How/><To/><Markup/><Legibly/>
<Name Name="Name">Name</Name>
<list 4 text </p>
<marquee>foo!</marquee>
tagsoupOutput.txt
<META-START xmlns="http://www.w3.org/1999/xhtml">
John Cowan
<table><ROW>
<CELL>SOUPE</CELL>
<CELL>BE EVIL!</CELL></ROW></table>
DE BALISES
<CORR new="U" loc="PI"></CORR>
<G id="P1">
Ecritez une balise ouvrante (sans attributs)
</G>
ou fermante HTML ici, s.v.p.
<font>X Y </font><p> ABC  xyz
QRS<sup>TUV<sub>WXY</sub></sup><sub>Z</sub>
<script language="javascript"><p></script>
<table><tbody><tr><th colspan="1" rowspan="1">ABC
</th></tr></tbody></table><nr></nr>
</p><meta><meta><meta><meta>
<pre xml:space="default">test</pre>
<test http:_="http:_" www.w3.org="www.w3.org" _1998="_1998" xml="xml" xmlns:http="urn:x-prefix:http">
 <span xqp:space="preserve" xmlns:xqp="urn:x-prefix:xqp">~~~</span>

<html:p xmlns:html="urn:x-prefix:html">
<_></_>

</html:p></test></META-START>

device.xml
<devices>
 
    <device>
        <name>DMXKing USB</name>
        <className>com.randomnoun.dmx.dmxDevice.usbPro.UsbProWidget</className>
        <type>D</type>
        <active>Y</active>
        <universeNumber>1</universeNumber>
        <deviceProperties>
            <deviceProperty>
                <key>portName</key>
                <value>COM8</value>
            </deviceProperty>
        </deviceProperties>
    </device>
 
    <device>
        <name>Art-Net</name>
        <className>com.randomnoun.dmx.dmxDevice.artNet.ArtNet</className>
        <type>D</type>
        <active></active>
        <universeNumber>1</universeNumber>
        <deviceProperties>
            <deviceProperty>
                <key>artNetSubnetId</key>
                <value>0</value>
            </deviceProperty>
            <deviceProperty>
                <key>artNetUniverseId</key>
                <value>0</value>
            </deviceProperty>
            <deviceProperty>
                <key>broadcastAddress</key>
                <value>192.168.0.62</value>
            </deviceProperty>
            <deviceProperty>
                <key>udpRecvPort</key>
                <value>6454</value>
            </deviceProperty>
            <deviceProperty>
                <key>udpSendPort</key>
                <value>6454</value>
            </deviceProperty>
            <deviceProperty>
                <key>unicastAddress</key>
                <value>192.168.0.62</value>
            </deviceProperty>
        </deviceProperties>
    </device>
 
    <device>
        <name>WinAMP Controller</name>
        <className>com.randomnoun.dmx.audioController.winampNg.WinampAudioController</className>
        <type>C</type>
        <active></active>
        <universeNumber>null</universeNumber>
        <deviceProperties>
            <deviceProperty>
                <key>defaultPath</key>
                <value>C:\DCB\audio</value>
            </deviceProperty>
            <deviceProperty>
                <key>host</key>
                <value>localhost</value>
            </deviceProperty>
            <deviceProperty>
                <key>password</key>
                <value>abc123</value>
            </deviceProperty>
            <deviceProperty>
                <key>port</key>
                <value>18443</value>
            </deviceProperty>
            <deviceProperty>
                <key>timeout</key>
                <value>3000</value>
            </deviceProperty>
        </deviceProperties>
    </device>
 
    <device>
        <name>Ye olde dmxy plugigne</name>
        <className>com.randomnoun.dmx.audioSource.winamp.WinampAudioSource</className>
        <type>S</type>
        <active>Y</active>
        <universeNumber>null</universeNumber>
        <deviceProperties>
            <deviceProperty>
                <key>host</key>
                <value>localhost</value>
            </deviceProperty>
            <deviceProperty>
                <key>port</key>
                <value>58273</value>
            </deviceProperty>
            <deviceProperty>
                <key>timeout</key>
                <value>1000</value>
            </deviceProperty>
        </deviceProperties>
    </device>
 
</devices>
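
If you'd like to see the slash-delimited path trick applied to a file shaped like this without pulling in XmlUtil, the following is a rough stand-alone sketch. It assumes the same devices/device/name layout as device.xml, uses the JDK's DefaultHandler rather than AbstractStackContentHandler, and the names PathSketch, NameHandler and deviceNames are invented for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-alone sketch; none of these names exist in XmlUtil.
class PathSketch {

    /** Tracks the current element path and records the text of devices/device/name. */
    static class NameHandler extends DefaultHandler {
        final List<String> names = new ArrayList<String>();
        private String path = "";                          // e.g. "devices/device/name"
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // push the element name onto the path
            path = path.isEmpty() ? qName : path + "/" + qName;
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (path.equals("devices/device/name")) {
                names.add(text.toString());
            }
            text.setLength(0);
            // pop the last element name off the path
            int slash = path.lastIndexOf('/');
            path = slash >= 0 ? path.substring(0, slash) : "";
        }
    }

    /** Returns the name of every device in the supplied XML, in document order. */
    static List<String> deviceNames(String xml) {
        try {
            NameHandler handler = new NameHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse devices", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<devices>"
            + "<device><name>DMXKing USB</name><type>D</type></device>"
            + "<device><name>Art-Net</name><type>D</type></device>"
            + "</devices>";
        // prints [DMXKing USB, Art-Net]
        System.out.println(deviceNames(xml));
    }
}
```

As with the original class, attributes and mixed text/element content are ignored; only the text of leaf elements is reliable.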

This class is now part of the com.randomnoun.common:common-public artifact, which you can reference directly in your pom.xml from the Maven Central repository.

Update 25/9/2013: It’s in central now
