- Ingest some HTML and convert to well-formed XML for processing;
- filter that XML to remove unwanted content;
- convert the XML to a different format.
Step 2 was the new bit. I already had well-tested code for the other two steps, so I wanted to re-use as much of it as possible. JAXP pipelines using SAX looked to be (and are!) very nice for this, but examples seemed a bit thin on the ground. I've put a version of my code here, in the hope that others may find it useful.
/* Create an InputSource for the pipeline input document. */
InputSource in = new InputSource(new ByteArrayInputStream(StringUtils.getBytes(text, "utf-8")));
/* Step 1. TagSoup parsing to get well-formed XML (Parser is org.ccil.cowan.tagsoup.Parser). */
XMLReader reader = new Parser();
try {
    SAXTransformerFactory stf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    /* Serializer that collects the final output of the pipeline as a String. */
    StringWriter sb = new StringWriter();
    OutputFormat outputFormat = new OutputFormat();
    outputFormat.setOmitXMLDeclaration(true);
    XMLSerializer serializer = new org.apache.xml.serialize.XMLSerializer(outputFormat);
    serializer.setOutputCharStream(sb);
    /* Step 2. Remove unwanted markup from the well-formed XML. */
    InputStream stripContent = getResourceAsStream("strip-content.xslt");
    XMLFilter removeUnwanted = stf.newXMLFilter(new StreamSource(stripContent));
    /* Step 3. Convert to preferred markup format. */
    InputStream xsltResourceInputStream = getResourceAsStream("xhtml2dial.xslt");
    XMLFilter xhtml2dial = stf.newXMLFilter(new StreamSource(xsltResourceInputStream));
    /* Wire the stages together: reader -> removeUnwanted -> xhtml2dial -> serializer. */
    removeUnwanted.setParent(reader);
    xhtml2dial.setParent(removeUnwanted);
    xhtml2dial.setContentHandler(serializer.asContentHandler());
    /* Start the parse from the last filter; each filter asks its parent to parse.
       Calling reader.parse(in) directly is not portable, because some XSLT
       implementations only attach their handlers when the filter's own parse() runs. */
    xhtml2dial.parse(in);
    return sb.toString();
} catch (TransformerException | IOException | SAXException e) {
    throw new ConversionException(e.getMessage(), e);
}
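If you don't have TagSoup or the Xerces serializer on the classpath, the same wiring can be tried with nothing but the JDK. What follows is only a minimal sketch under some assumptions, not the code above: the class name, the method name and the two inline stylesheets are made up for illustration (they stand in for strip-content.xslt and xhtml2dial.xslt), the JDK's own SAX parser replaces TagSoup (so the input must already be well-formed), and an identity TransformerHandler replaces XMLSerializer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;

class FilterPipelineSketch {

    private static final String XSL_OPEN =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        // Identity template: copy everything that no other template matches.
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>";

    // Hypothetical stand-in for strip-content.xslt: drops <script> elements.
    static final String STRIP_XSLT =
        XSL_OPEN + "<xsl:template match='script'/></xsl:stylesheet>";

    // Hypothetical stand-in for xhtml2dial.xslt: renames <b> to <strong>.
    static final String CONVERT_XSLT =
        XSL_OPEN
        + "<xsl:template match='b'>"
        + "<strong><xsl:apply-templates select='@*|node()'/></strong>"
        + "</xsl:template></xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        // Step 1: a plain namespace-aware SAX parser (TagSoup's Parser in the original).
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        SAXTransformerFactory stf =
            (SAXTransformerFactory) TransformerFactory.newInstance();

        // Steps 2 and 3: one XMLFilter per stylesheet.
        XMLFilter strip = stf.newXMLFilter(new StreamSource(new StringReader(STRIP_XSLT)));
        XMLFilter convert = stf.newXMLFilter(new StreamSource(new StringReader(CONVERT_XSLT)));

        // Identity handler that serializes the final SAX events to a String.
        // Output properties are set before the result is attached.
        StringWriter out = new StringWriter();
        TransformerHandler serializer = stf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        serializer.setResult(new StreamResult(out));

        // Wire parents first, then the content handler, then parse from the last filter.
        strip.setParent(reader);
        convert.setParent(strip);
        convert.setContentHandler(serializer);
        convert.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            transform("<doc><p>keep <b>this</b></p><script>drop()</script></doc>"));
    }
}
```

With the sample input in main, the script element is dropped by the first filter and the b element comes out as strong from the second. Adding another stage is just one more newXMLFilter call and one more setParent link in the chain.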