Exciting things with XML

So XML exists, and you probably have to deal with it occasionally.

It’s a bit more complex than a CSV or a fixed-width file, so an enormous number of people have created an even greater number of toolkits to help you read and write the things.

Some generic chunk of XML

What the industry standard appears to be is to create a toolkit that implements half of the XML standard, and then drop that and start another which has a different, but overlapping subset of supported features.

If you’re a member of the W3C, you probably want to have several different XML implementations, with a whole raft of SAXTransformerFactorys and DOMImplementationRegistrys just to work out which implementation to use, none of which really work, but that’s the sort of thing that keeps committees busy.

Why not use XmlUtil? I know I certainly do. Especially in that example from a few weeks ago.

It has the following methods, which I will explain briefly in the bit below this bit.

  • getCleanXml
  • getText
  • getTextPreserveElements
  • getTextNonRecursive
  • getXmlString
  • toDocument
  • compact
  • processContentHandler
  • static inner class SimpleTableContentHandler
  • static inner class AbstractStackContentHandler
  • static inner class NodeListIterator

getCleanXml

This method runs a document through tagsoup to get an ‘XML-clean’ document.

A fairly common use-case for me is having a fairly horrible-looking bit of HTML taken from the wastelands of the internet, and wanting to extract some data from it.

The tagsoup library does a fairly decent job of cleaning up things that look like HTML so that it starts to approach something that more closely resembles XML.

It’s a useful first step before performing XPath-style data extraction, or before attempting to modify a webpage to pass the W3C validator. Which is the sort of thing that SEO companies think gives you google brownie points.

Incidentally, if you search for “SEO company” on google, you get about 93,900,000 results. What I want to know is why they aren’t all on the first page of results?

I have my own views on search-engine optimisation, which are that:

  • google knows what you’re searching for
  • google knows when you stop searching for it
  • google probably knows therefore whether a website is a good result for a particular search query or not

which, you’ll notice, doesn’t involve the phrase ‘meta tags’ anywhere in it.

You can send my $20,000 fee to the usual account.

On a slightly more conspiratorial note,

  • google is aware of your IP address
  • google is therefore more likely to show you your own ‘ads’ that you’ve purchased on its network
  • which is therefore more likely to get you to throw more money at google seeing as it’s doing such a good job getting your extra-specially-important site at the top of those search results

That was a bit of a longer side-rant than I thought it would be. I might move this bit into a separate blog entry. Maybe. Later.

So anyway, this is how you’d use this getCleanXml thing:

// uses the Scanner trick from http://weblogs.java.net/blog/pat/archive/2004/10/stupid_scanner.html
String rubbishHtml = new Scanner(new URL("http://www.microsoft.com").openStream(), 
  "UTF-8").useDelimiter("\\A").next();
String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
Document d = XmlUtil.toDocument(cleanXml);
Element titleEl = (Element) XPathAPI.selectSingleNode(d, "/html/head/title");

getText

getText() returns the text between an opening and closing element in an XML document.

So extending the example above:

// uses the Scanner trick from http://weblogs.java.net/blog/pat/archive/2004/10/stupid_scanner.html
String rubbishHtml = new Scanner(new URL("http://www.microsoft.com").openStream(), 
  "UTF-8").useDelimiter("\\A").next();
String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
Document d = XmlUtil.toDocument(cleanXml);
Element titleEl = (Element) XPathAPI.selectSingleNode(d, "/html/head/title");
String title = XmlUtil.getText(titleEl);
System.out.println(title);

currently generates

Microsoft Australia | Devices and Services

with nary a regex in sight.

getTextPreserveElements

Same as getText(), but allows a predefined list of tags to be returned as well.

This is useful when processing formatted text in which you might want to keep certain tags (B, I, IMG etc)

String input = "<p>Here is some formatted text <b>in bold</b>, <i>in italics</i>," +
  " <blink>blinking</blink>, and with an image at the end <img src=\"frog.png\"/></p>";
Document d = XmlUtil.toDocument(input);
Element paraEl = d.getDocumentElement();
System.out.println(XmlUtil.getText(paraEl));
System.out.println(XmlUtil.getTextPreserveElements(paraEl, new String[] { "b", "i", "img" }));

which would give you something like

Here is some formatted text in bold, in italics, blinking, and with an image at the end 
Here is some formatted text <b>in bold</b>, <i>in italics</i>, blinking, and with an image at the end <img src="frog.png"></img>
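If you're wondering what that might look like on the inside, here's a rough sketch using nothing but the JDK's DOM classes. It is not the actual XmlUtil source; the names PreserveSketch, textPreserving and parseRoot are made up for illustration, and it cheerfully skips re-escaping of attribute values and text.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PreserveSketch {

    // Hypothetical helper: returns the text beneath 'node', re-emitting only
    // those child tags whose names appear in 'keep'.
    static String textPreserving(Node node, Set<String> keep) {
        StringBuilder sb = new StringBuilder();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                boolean kept = keep.contains(child.getNodeName());
                if (kept) {
                    sb.append('<').append(child.getNodeName());
                    NamedNodeMap attrs = child.getAttributes();
                    for (int i = 0; i < attrs.getLength(); i++) {
                        Node a = attrs.item(i);
                        // simplification: attribute values are not re-escaped
                        sb.append(' ').append(a.getNodeName())
                          .append("=\"").append(a.getNodeValue()).append('"');
                    }
                    sb.append('>');
                }
                // the text inside a dropped tag is still kept
                sb.append(textPreserving(child, keep));
                if (kept) {
                    sb.append("</").append(child.getNodeName()).append('>');
                }
            }
        }
        return sb.toString();
    }

    // Parses a String and returns its root element; checked exceptions are
    // wrapped to keep the sketch short.
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Node p = parseRoot("<p>Some <b>bold</b> and <blink>blinking</blink> text"
            + " <img src=\"frog.png\"/></p>");
        Set<String> keep = new HashSet<>(Arrays.asList("b", "img"));
        System.out.println(textPreserving(p, keep));
    }
}
```

Writing the close tag out by hand is also why an empty element like the img comes back as an open/close pair rather than a self-closing tag.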

getTextNonRecursive

Same as getText(), but doesn’t attempt to recurse into child elements within the parent XML element

String input = "<p>This is the bit we're interested in <span>but not this bit</span></p>";
Document d = XmlUtil.toDocument(input);
Element paraEl = d.getDocumentElement();
System.out.println(XmlUtil.getText(paraEl));
System.out.println(XmlUtil.getTextNonRecursive(paraEl));

which generates:

This is the bit we're interested in but not this bit
This is the bit we're interested in
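The difference between getText() and getTextNonRecursive() can be sketched with a single flag. Again, this is an illustration built on the JDK's DOM classes rather than the real XmlUtil code, and TextSketch, text and parseRoot are names I've invented for the purpose.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextSketch {

    // Hypothetical helper: concatenates the text directly beneath 'node'.
    // When 'recurse' is true it also descends into child elements.
    static String text(Node node, boolean recurse) {
        StringBuilder sb = new StringBuilder();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            } else if (recurse && type == Node.ELEMENT_NODE) {
                sb.append(text(child, true));
            }
        }
        return sb.toString();
    }

    // Parses a String and returns its root element; checked exceptions are
    // wrapped to keep the sketch short.
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Node p = parseRoot("<p>This is the bit we want <span>but not this bit</span></p>");
        System.out.println(text(p, true));   // includes the span's text
        System.out.println(text(p, false));  // stops at the paragraph's own text nodes
    }
}
```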

getXmlString

Converts a Document into a String. As in the following example, which attempts to extract the ‘keywords’ and ‘description’ meta tag elements from google’s home page, and send them to stdout.

String rubbishHtml = new Scanner(new URL("http://www.google.com").openStream(), 
  "UTF-8").useDelimiter("\\A").next();
String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
Document d = XmlUtil.toDocument(cleanXml);
NodeList nodes = XPathAPI.selectNodeList(d, "/html/head/meta");
for (int i=0; i<nodes.getLength(); i++) {
	System.out.println(i + ": " + XmlUtil.getXmlString(nodes.item(i), true));
}

which currently, and not too surprisingly, generates:

0: <meta content="/images/google_favicon_128.png" itemprop="image"/>

It’s amazing that anyone can find them, really.
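For the curious, turning a DOM node back into a String with only the JDK usually means an identity Transformer. The sketch below is one way to do it, not necessarily how XmlUtil does it; SerializeSketch and toXmlString are hypothetical names, and I'm assuming the boolean argument means "leave out the XML declaration".

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SerializeSketch {

    // Hypothetical helper: serialises a DOM node using an identity transform.
    // Assumption: 'omitXmlDeclaration' suppresses the <?xml ...?> header.
    static String toXmlString(Node node, boolean omitXmlDeclaration) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION,
                omitXmlDeclaration ? "yes" : "no");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialise node", e);
        }
    }

    // Parses a String and returns its root element; checked exceptions are
    // wrapped to keep the sketch short.
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Node root = parseRoot("<head><title>Hello</title></head>");
        System.out.println(toXmlString(root, true));
    }
}
```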

toDocument

Converts a String into a W3C Document object

Most of the examples above use this method, so hopefully you get the idea
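If you'd rather see the idea than take my word for it, a String-to-Document conversion using the JDK's DocumentBuilder might look something like this. It's a minimal sketch under my own assumptions (ParseSketch and toDocument are illustrative names, and bad input is rethrown as an unchecked exception), not a copy of the XmlUtil method.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ParseSketch {

    // Hypothetical helper: parses well-formed XML text into a W3C Document.
    static Document toDocument(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // wrap the String in a Reader so no character encoding guesswork is needed
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document d = toDocument("<greeting lang=\"en\">hello</greeting>");
        System.out.println(d.getDocumentElement().getNodeName());
        System.out.println(d.getDocumentElement().getAttribute("lang"));
    }
}
```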

compact

Removes whitespace surrounding text nodes contained within an Element

Such as this attempt to generate the CliffsNotes for All’s Well That Ends Well:

// from http://www.ibiblio.org/xml/examples/shakespeare/all_well.xml
String input = 
	"<PERSONAE>\n" +
	"    <TITLE>Dramatis Personae</TITLE>\n" +
	"    \n" +
	"    <PERSONA>KING OF FRANCE</PERSONA>\n" +
	"    <PERSONA>DUKE OF FLORENCE</PERSONA>\n" +
	"    <PERSONA>BERTRAM, Count of Rousillon.</PERSONA>\n" +
	"    <PERSONA>LAFEU, an old lord.</PERSONA>\n" +
	"    <PERSONA>PAROLLES, a follower of Bertram.</PERSONA>\n" +
	"    \n" +
	"    <PGROUP>\n" +
	"        <PERSONA>Steward</PERSONA>\n" +
	"        <PERSONA>Clown</PERSONA>\n" +
	"        <GRPDESCR>servants to the Countess of Rousillon.</GRPDESCR>\n" +
	"    </PGROUP>\n" +
	"    \n" +
	"    <PERSONA>A Page. </PERSONA>\n" +
	"    <PERSONA>COUNTESS OF ROUSILLON, mother to Bertram. </PERSONA>\n" +
	"    <PERSONA>HELENA, a gentlewoman protected by the Countess.</PERSONA>\n" +
	"    <PERSONA>An old Widow of Florence. </PERSONA>\n" +
	"    <PERSONA>DIANA, daughter to the Widow.</PERSONA>\n" +
	"    \n" +
	"    <PGROUP>\n" +
	"        <PERSONA>VIOLENTA</PERSONA>\n" +
	"        <PERSONA>MARIANA</PERSONA>\n" +
	"        <GRPDESCR>neighbours and friends to the Widow.</GRPDESCR>\n" +
	"    </PGROUP>\n" +
	"    \n" +
	"    <PERSONA>Lords, Officers, Soldiers, &amp;c., French and Florentine.</PERSONA>\n" +
	"</PERSONAE>\n";
 
Document d = XmlUtil.toDocument(input);
System.out.println("Before compact():");
System.out.println(XmlUtil.getXmlString(d, true));
 
XmlUtil.compact(d.getDocumentElement());
System.out.println();
System.out.println("And after compact(), the much more readable:");
System.out.println(XmlUtil.getXmlString(d, true));

which generates:

Before compact():
<PERSONAE>
    <TITLE>Dramatis Personae</TITLE>
    
    <PERSONA>KING OF FRANCE</PERSONA>
    <PERSONA>DUKE OF FLORENCE</PERSONA>
    <PERSONA>BERTRAM, Count of Rousillon.</PERSONA>
    <PERSONA>LAFEU, an old lord.</PERSONA>
    <PERSONA>PAROLLES, a follower of Bertram.</PERSONA>
    
    <PGROUP>
        <PERSONA>Steward</PERSONA>
        <PERSONA>Clown</PERSONA>
        <GRPDESCR>servants to the Countess of Rousillon.</GRPDESCR>
    </PGROUP>
    
    <PERSONA>A Page. </PERSONA>
    <PERSONA>COUNTESS OF ROUSILLON, mother to Bertram. </PERSONA>
    <PERSONA>HELENA, a gentlewoman protected by the Countess.</PERSONA>
    <PERSONA>An old Widow of Florence. </PERSONA>
    <PERSONA>DIANA, daughter to the Widow.</PERSONA>
    
    <PGROUP>
        <PERSONA>VIOLENTA</PERSONA>
        <PERSONA>MARIANA</PERSONA>
        <GRPDESCR>neighbours and friends to the Widow.</GRPDESCR>
    </PGROUP>
    
    <PERSONA>Lords, Officers, Soldiers, &c., French and Florentine.</PERSONA>
</PERSONAE>

And after compact(), the much more readable:

<PERSONAE><TITLE>Dramatis Personae</TITLE><PERSONA>KING OF FRANCE</PERSONA><PERSONA>DUKE OF FLORENCE</PERSONA><PERSONA>BERTRAM, Count of Rousillon.</PERSONA><PERSONA>LAFEU, an old lord.</PERSONA><PERSONA>PAROLLES, a follower of Bertram.</PERSONA><PGROUP><PERSONA>Steward</PERSONA><PERSONA>Clown</PERSONA><GRPDESCR>servants to the Countess of Rousillon.</GRPDESCR></PGROUP><PERSONA>A Page.</PERSONA><PERSONA>COUNTESS OF ROUSILLON, mother to Bertram.</PERSONA><PERSONA>HELENA, a gentlewoman protected by the Countess.</PERSONA><PERSONA>An old Widow of Florence.</PERSONA><PERSONA>DIANA, daughter to the Widow.</PERSONA><PGROUP><PERSONA>VIOLENTA</PERSONA><PERSONA>MARIANA</PERSONA><GRPDESCR>neighbours and friends to the Widow.</GRPDESCR></PGROUP><PERSONA>Lords, Officers, Soldiers, &c., French and Florentine.</PERSONA></PERSONAE>
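A compaction like the one above can be sketched as a walk over the tree that trims every text node and throws away the ones that turn out to be empty. This is my guess at the general shape, not the XmlUtil implementation, and CompactSketch, compact and parseRoot are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CompactSketch {

    // Hypothetical helper: trims each text node beneath 'node' and removes
    // text nodes that contain only whitespace.
    static void compact(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // remember the sibling first, because 'child' may be removed
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE) {
                String trimmed = child.getNodeValue().trim();
                if (trimmed.isEmpty()) {
                    node.removeChild(child);
                } else {
                    child.setNodeValue(trimmed);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                compact(child);
            }
            child = next;
        }
    }

    // Parses a String and returns its root element; checked exceptions are
    // wrapped to keep the sketch short.
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Node root = parseRoot("<PGROUP>\n    <PERSONA>A Page. </PERSONA>\n</PGROUP>");
        compact(root);
        System.out.println(root.getChildNodes().getLength());
        System.out.println("[" + root.getFirstChild().getTextContent() + "]");
    }
}
```

Note that trimming is not always safe: in mixed content the space between two inline elements can be significant, so this is best kept for data-style XML like the cast list.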

processContentHandler

Runs a SAX ContentHandler over a Document

You need a ContentHandler in order to process it, so why not keep reading to see an example of one of those.
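In the meantime, here's roughly what feeding a chunk of XML text to a ContentHandler involves when you only have the JDK to hand. It's a sketch rather than the real method: ProcessSketch, process and ElementCounter are hypothetical names, and it takes a String because that's how the examples further down call it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ProcessSketch {

    // Hypothetical helper: parses 'xml' and sends the SAX events to 'handler'.
    static void process(ContentHandler handler, String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("SAX processing failed", e);
        }
    }

    // A trivial handler for demonstration: counts start tags.
    static class ElementCounter extends DefaultHandler {
        int count;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            count++;
        }
    }

    public static void main(String[] args) {
        ElementCounter counter = new ElementCounter();
        process(counter, "<devices><device/><device/></devices>");
        System.out.println(counter.count + " elements seen");
    }
}
```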

Class SimpleTableContentHandler

A simple ContentHandler that parses data from a HTML table (into a List of Lists).

The example here parses the currency exchange tables from xe.com, and selects a currency using the polar method of G. E. P. Box, M. E. Muller, and G. Marsaglia, as described by Donald E. Knuth in The Art of Computer Programming, Volume 2: Seminumerical Algorithms, section 3.4.1, subsection C, algorithm P.

It then goes on to tell you whether you’d have made a profit or loss by selling that currency 12 months later.

public static List<List<String>> getCurrencyRates(String date) throws MalformedURLException, IOException, SAXException, TransformerException {
	// this doesn't actually work due to http://www.xe.com/errors/noautoextract.htm
	// but I'm sure you enterprising types can get around that
	String url = "http://www.xe.com/currencytables/?from=AUD&date=" + date;
	String rubbishHtml = new Scanner(new URL(url).openStream(), 
	  "UTF-8").useDelimiter("\\A").next();
 
	// hint:
	// String rubbishHtml = Text.getFileContents(date.equals("2012-01-01") ? "c:\\rate1.txt" : "c:\\rate2.txt");
	String cleanXml = XmlUtil.getCleanXml(rubbishHtml, false);
 
	Document d = XmlUtil.toDocument(cleanXml);
	Element tableEl = (Element) XPathAPI.selectSingleNode(d, ".//table[@id=\"historicalRateTbl\"]");
 
	XmlUtil.SimpleTableContentHandler stch = new XmlUtil.SimpleTableContentHandler();
	XmlUtil.processContentHandler(stch, XmlUtil.getXmlString(tableEl, false));
	List<List<String>> table = stch.getTable();
	return table;
}
 
// see what amount of abstract currency units you can get in exchange for
// the shiny Australian cylinder with a kangaroo on it.
List<List<String>> startTable = getCurrencyRates("2012-01-01");
List<List<String>> endTable = getCurrencyRates("2013-01-01");
 
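// pick an investment strategy: a Gaussian-distributed row index (needs java.util.Random),
// folded back into range because nextGaussian() can be negative or larger than 1
int tradingOption = Math.abs((int) (new Random().nextGaussian() * startTable.size())) % startTable.size();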
 
// sanity check that the row with the same index in each table contains the same currency
if (!startTable.get(tradingOption).get(0).equals(endTable.get(tradingOption).get(0))) {
	throw new RuntimeException("Rate tables don't line up");
}
 
System.out.println("You're investing in the " + startTable.get(tradingOption).get(1));
System.out.println("Of which you could pick up " + startTable.get(tradingOption).get(2) + " for a measly AUD$1 way back in 2012");
System.out.println("and then sell for AUD$" + endTable.get(tradingOption).get(3) + " each barely a year later");
 
// and then let's pretend that xe.com is accurate, and that rounding, 
// buy/sell prices and transaction costs don't exist
BigDecimal bd = new BigDecimal(startTable.get(tradingOption).get(2));
BigDecimal bd2 = new BigDecimal(endTable.get(tradingOption).get(3));
BigDecimal val = bd.multiply(bd2);
 
System.out.println("landing you AUD$" + val);
System.out.println(val.compareTo(new BigDecimal(1))>0 ? "Winner!" : "You lose!");

which generates:

You're investing in the Belizean Dollar
Of which you could pick up 2.0266687920 for a measly AUD$1 way back in 2012
and then sell for AUD$0.4810598095 each barely a year later
landing you AUD$0.97494890299911512400
You lose!
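A table-gobbling ContentHandler of this sort doesn't need much. Here's a cut-down sketch of the idea using the JDK's SAX classes; it is not the SimpleTableContentHandler source, the names TableSketch, TableHandler and parseTable are invented, and it makes no attempt to cope with nested tables.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class TableSketch {

    // Hypothetical handler: collects the cell text of each row into a List of Lists.
    static class TableHandler extends DefaultHandler {
        final List<List<String>> table = new ArrayList<>();
        private List<String> row;     // non-null while inside a <tr>
        private StringBuilder cell;   // non-null while inside a <td> or <th>

        private static boolean isCell(String name) {
            return "td".equalsIgnoreCase(name) || "th".equalsIgnoreCase(name);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("tr".equalsIgnoreCase(qName)) {
                row = new ArrayList<>();
            } else if (isCell(qName) && row != null) {
                cell = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (cell != null) {
                cell.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (isCell(qName) && cell != null) {
                row.add(cell.toString().trim());
                cell = null;
            } else if ("tr".equalsIgnoreCase(qName) && row != null) {
                table.add(row);
                row = null;
            }
        }
    }

    // Runs the handler over a well-formed <table> fragment.
    static List<List<String>> parseTable(String xml) {
        TableHandler handler = new TableHandler();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("SAX processing failed", e);
        }
        return handler.table;
    }

    public static void main(String[] args) {
        System.out.println(parseTable("<table><tr><th>Code</th><th>Name</th></tr>"
            + "<tr><td>BZD</td><td>Belizean Dollar</td></tr></table>"));
    }
}
```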

Class AbstractStackContentHandler

Similar to the Apache Digester, but much much smaller, this ContentHandler creates a ‘stack’ of elements separated by a slash, which it passes through to the concrete implementation.

This example parses a device.xml file (as used in DMX-web, an example of which is attached to the foot of this blog entry), and lists the names of all the devices contained within.

public static class DeviceContentHandler 
  extends XmlUtil.AbstractStackContentHandler {
	List<String> names = new ArrayList<String>();
 
	// process the start of an XML element
	public void element(String path) throws SAXException { }
 
	// process the text of an XML element 
	public void elementText(String path, String content) throws SAXException { 
		if (path.equals("devices/device/name")) {
			names.add(content);
		}
	}
 
	// return the names of all devices
	public List<String> getNames() { return names; }
}
 
DeviceContentHandler dch = new DeviceContentHandler();
XmlUtil.processContentHandler(dch, Text.getFileContents("c:\\device.xml"));
List<String> names = dch.getNames();
for (int i=0; i<names.size(); i++) {
	System.out.println(i + ": " + names.get(i));
}

which generates:

0: DMXKing USB
1: Art-Net
2: WinAMP Controller
3: Ye olde dmxy plugigne
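The slash-separated 'stack' trick is easy enough to sketch: push each element name as it opens, pop it as it closes, and join whatever is on the stack to get the current path. The version below is a guess at the general approach rather than the real AbstractStackContentHandler; StackSketch, StackHandler and deviceNames are names I've made up, and it only reports text for elements that have some.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class StackSketch {

    // Hypothetical base class: tracks the path of open elements and hands it to subclasses.
    abstract static class StackHandler extends DefaultHandler {
        private final Deque<String> stack = new ArrayDeque<>();
        private final StringBuilder text = new StringBuilder();

        abstract void element(String path);
        abstract void elementText(String path, String content);

        // addLast/removeLast keep iteration order root-first, so join gives "a/b/c"
        private String path() {
            return String.join("/", stack);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            stack.addLast(qName);
            text.setLength(0);
            element(path());
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String content = text.toString().trim();
            if (!content.isEmpty()) {
                elementText(path(), content);
            }
            text.setLength(0);
            stack.removeLast();
        }
    }

    // Collects the text of every devices/device/name element.
    static List<String> deviceNames(String xml) {
        final List<String> names = new ArrayList<>();
        StackHandler handler = new StackHandler() {
            @Override void element(String path) { }
            @Override void elementText(String path, String content) {
                if (path.equals("devices/device/name")) {
                    names.add(content);
                }
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("SAX processing failed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(deviceNames("<devices><device><name>DMXKing USB</name></device>"
            + "<device><name>Art-Net</name></device></devices>"));
    }
}
```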

Class NodeListIterator

A wrapper for a NodeList that makes it Iterable

Here’s an example using the WindowTreeDom object from a couple of weeks back

// get the Windows window tree in an XML object and attempt to find the Outlook windows in there
WindowTreeDom wtd = new WindowTreeDom();
Document windows = wtd.getDom();
NodeList outlookWindows = XPathAPI.selectNodeList(windows, 
  ".//window[@class='rctrl_renwnd32']/window[@class='AfxWndW']/window[@class='#32770']");   
 
for (Node n : new XmlUtil.NodeListIterator(outlookWindows)) {
    Element e = (Element) n;
    logger.debug("Found possible window " + XmlUtil.getXmlString(n, true));
}

Which could come in handy to prove that your commercialised electronic marketing campaigns look readable in all three types of email client that exist.
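NodeList predates the for-each loop, which is why a wrapper is needed at all. A minimal version of such a wrapper might look like the sketch below; it's written from scratch against the JDK's DOM interfaces, so NodeListSketch, NodeListIterable and parse are hypothetical names rather than anything lifted from XmlUtil.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.NoSuchElementException;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeListSketch {

    // Hypothetical wrapper: lets a NodeList be used in a for-each loop.
    static class NodeListIterable implements Iterable<Node> {
        private final NodeList nodes;

        NodeListIterable(NodeList nodes) {
            this.nodes = nodes;
        }

        @Override
        public Iterator<Node> iterator() {
            return new Iterator<Node>() {
                private int index = 0;

                @Override
                public boolean hasNext() {
                    return index < nodes.getLength();
                }

                @Override
                public Node next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    return nodes.item(index++);
                }
            };
        }
    }

    // Parses a String into a Document; checked exceptions are wrapped to keep the sketch short.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document d = parse("<windows><window class=\"a\"/><window class=\"b\"/></windows>");
        for (Node n : new NodeListIterable(d.getElementsByTagName("window"))) {
            System.out.println(n.getAttributes().getNamedItem("class").getNodeValue());
        }
    }
}
```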

So I used to have some more code here, but have moved all that to github. Well some of it. The classes are here:

* XmlUtil.java
* XmlUtilTest.java

And here’s the github repos and the maven site docs:

common-public
git@github.com:randomnoun/common-public.git

common-public
com.randomnoun.common:common-public

Update 2013-09-25: It’s in central now
Update 2021-01-29: It’s in github now
