Monday, January 26, 2009

JAXP pipelines using SAX

XProc looks handy, but it is not in a usable state yet, so I recently had to roll my own pipeline as a one-off, and I struggled more than I expected. The requirement was fairly simple:

  1. ingest some HTML and convert it to well-formed XML for processing;

  2. filter that XML to remove unwanted content;

  3. convert the XML to a different format.


Step 2 was the new bit; I already had well-tested code for the other two parts, so I wanted to reuse as much of it as possible. JAXP pipelines using SAX looked to be (and are!) very nice for this, but examples seemed a bit thin on the ground. I've put a version of mine here in the hope that others may find it useful.

/* Create an InputSource for the pipeline input document. */
InputSource in = new InputSource(new ByteArrayInputStream(StringUtils.getBytes(text, "utf-8")));

/* Step 1. TagSoup parsing to get well-formed XML (Parser is org.ccil.cowan.tagsoup.Parser). */
XMLReader reader = new Parser();

try {
    SAXTransformerFactory stf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();

    /* The end of the pipeline: serialize the final SAX events into a string. */
    StringWriter sb = new StringWriter();

    OutputFormat outputFormat = new OutputFormat();
    outputFormat.setOmitXMLDeclaration(true);

    XMLSerializer serializer = new org.apache.xml.serialize.XMLSerializer(outputFormat);
    serializer.setOutputCharStream(sb);

    /* Step 2. Remove unwanted markup from the well-formed XML. */
    InputStream stripContent = getResourceAsStream("strip-content.xslt");
    XMLFilter removeUnwanted = stf.newXMLFilter(new StreamSource(stripContent));

    /* Step 3. Convert to preferred markup format. */
    InputStream xsltResourceInputStream = getResourceAsStream("xhtml2dial.xslt");
    XMLFilter xhtml2dial = stf.newXMLFilter(new StreamSource(xsltResourceInputStream));

    /* Wire the chain: reader -> removeUnwanted -> xhtml2dial -> serializer. */
    removeUnwanted.setParent(reader);
    xhtml2dial.setParent(removeUnwanted);
    xhtml2dial.setContentHandler(serializer.asContentHandler());

    /* Parsing pushes the SAX events through both filters into the serializer. */
    reader.parse(in);

    return sb.toString();
} catch (TransformerException e) {
    throw new ConversionException(e.getMessage(), e);
} catch (IOException e) {
    throw new ConversionException(e.getMessage(), e);
} catch (SAXException e) {
    throw new ConversionException(e.getMessage(), e);
}
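
For anyone who wants to try the same wiring without TagSoup or Xerces on the classpath, here is a minimal sketch that uses only JDK classes. It is not the code above. The class name SaxPipelineSketch and the two inline stylesheets are hypothetical stand-ins for strip-content.xslt and xhtml2dial.xslt. The JDK's own SAX parser replaces TagSoup, so the input must already be well-formed, and an identity TransformerHandler replaces XMLSerializer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;

class SaxPipelineSketch {

    // Identity template shared by both hypothetical stylesheets.
    static final String HEAD =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'><xsl:copy>"
      + "<xsl:apply-templates select='@*|node()'/></xsl:copy></xsl:template>";

    // Stand-in for strip-content.xslt (assumption): drop <script> elements.
    static final String STRIP = HEAD
      + "<xsl:template match='script'/></xsl:stylesheet>";

    // Stand-in for xhtml2dial.xslt (assumption): rename <p> to <para>.
    static final String CONVERT = HEAD
      + "<xsl:template match='p'><para>"
      + "<xsl:apply-templates select='@*|node()'/></para></xsl:template>"
      + "</xsl:stylesheet>";

    static String run(String xml) throws Exception {
        // Step 1: a namespace-aware SAX parser (stands in for TagSoup here).
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // Steps 2 and 3: each stylesheet becomes an XMLFilter.
        SAXTransformerFactory stf =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        XMLFilter strip = stf.newXMLFilter(new StreamSource(new StringReader(STRIP)));
        XMLFilter convert = stf.newXMLFilter(new StreamSource(new StringReader(CONVERT)));

        // Wire the chain: reader -> strip -> convert.
        strip.setParent(reader);
        convert.setParent(strip);

        // An identity TransformerHandler serializes the final SAX events.
        // Output properties are set before the result is attached.
        StringWriter out = new StringWriter();
        TransformerHandler serializer = stf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        serializer.setResult(new StreamResult(out));
        convert.setContentHandler(serializer);

        // Calling parse on the last filter drives the whole chain.
        convert.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            run("<html><body><script>x()</script><p>Hello</p></body></html>"));
    }
}
```

The sketch calls parse on the last filter rather than on the underlying reader, which is the usual idiom for XMLFilter chains; each filter passes the request up to its parent until the real parser is reached.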
