RTF hyperlink conversion with ScroogeXHTML (XPath based post processing)

ScroogeXHTML for the Java platform supports hyperlink conversion to HTML in two ways. Many RTF documents use special RTF keywords which include the hyperlink target as “invisible” text, so that the HTTP address is already available in the document. Other RTF documents, however, only mark links as underlined blue text and contain no hidden link address. In this case, the conversion requires a different solution.

Earlier versions of ScroogeXHTML used a hard-coded solution for text-to-hyperlink conversion, which did not support advanced tweaks and manipulations of the resulting HTML.

The next release of ScroogeXHTML provides an XPath-based post processor class as a starting point for customized blue/underlined hyperlink conversion.

One line of code is required to add the post processor class:

scrooge.getPostProcessListeners()
       .add(new ConvertUnderlinedToHyperlinks());

The post processor locates all text elements which are underlined and blue, and turns each of them into a hyperlink that uses the element's text as the link target.

With some custom code, the post processor may be adjusted to your special needs. For example, it may use a dictionary (map) to assign specific URLs to the blue/underlined text.

Note: this is a breaking change. The next version will no longer have the property ConvertHyperlinksForBlueUnderlinedText.
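The dictionary idea could look roughly like the following sketch. It works on a plain DOM document with the standard Java XML APIs only, so it does not show the ScroogeXHTML listener interface itself. The class name MappedLinkSketch, the method linkKnownTexts, the color value #0000FF and the example URL are assumptions made for illustration, not part of the library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MappedLinkSketch {

  // Spans styled blue and underlined; the exact color value is an assumption.
  static final String LINK_SPANS =
      "//span[contains(@style, 'color:#0000FF;')"
      + " and contains(@style, 'text-decoration:underline;')]";

  /** Wraps each matching span whose text is a key of urlByText in an a element. */
  static int linkKnownTexts(Document doc, Map<String, String> urlByText) throws Exception {
    NodeList found = (NodeList) XPathFactory.newInstance().newXPath()
        .evaluate(LINK_SPANS, doc, XPathConstants.NODESET);

    // Copy the result first so that changing the tree cannot affect the iteration.
    List<Element> spans = new ArrayList<>();
    for (int i = 0; i < found.getLength(); i++) {
      spans.add((Element) found.item(i));
    }

    int linked = 0;
    for (Element span : spans) {
      String url = urlByText.get(span.getTextContent().trim());
      if (url == null) {
        continue; // text not in the dictionary: leave the span untouched
      }
      Element anchor = doc.createElement("a");
      anchor.setAttribute("href", url);
      // Put the anchor where the span was, then move the span into it.
      span.getParentNode().replaceChild(anchor, span);
      anchor.appendChild(span);
      linked++;
    }
    return linked;
  }

  /** Parses a small well-formed XML/XHTML fragment into a DOM document. */
  static Document parse(String xml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
  }

  public static void main(String[] args) throws Exception {
    Document doc = parse("<p>Visit <span style=\"color:#0000FF;"
        + "text-decoration:underline;\">our site</span>.</p>");
    int linked = linkKnownTexts(doc, Map.of("our site", "https://www.example.com/"));
    Element a = (Element) doc.getElementsByTagName("a").item(0);
    System.out.println(linked + " link(s), first target: " + a.getAttribute("href"));
  }
}
```

In this sketch, text that has no entry in the map is left as plain blue/underlined text. Whether to fall back to using the text itself as the URL instead is a design choice for the custom post processor.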

Source code excerpt:

  @Override
  public void postProcess(PostProcessEventObject e) {
    try {
      XPathFactory xpathFactory = XPathFactory.newInstance();

      String exp = String.format("//span[contains(@style, 'color:%s;') and contains(@style ,'text-decoration:underline;') ]", color);
      XPathExpression xpathExp = xpathFactory.newXPath().compile(exp);

      NodeList hyperlinkNodes = (NodeList) xpathExp.evaluate(e.getDocument(), XPathConstants.NODESET);

      // Iterate over all found nodes
      for (int i = 0; i < hyperlinkNodes.getLength(); i++) {
        Element linkNode = (Element) hyperlinkNodes.item(i);

        // remove the hyperlink style
        String style = linkNode.getAttribute("style");
        style = style.replace("color:" + color + ";", "");
        style = style.replace("text-decoration:underline;", "");
        if (style.isEmpty()) {
          linkNode.removeAttribute("style");
        } else {
          linkNode.setAttribute("style", style);
        }

        // create anchor with href attribute
        Element anchor = e.getDocument().createElement("a");
        String linkText = linkNode.getTextContent();
        anchor.setAttribute("href", linkText);

        // insert the a element
        Node parent = linkNode.getParentNode();

        if (linkNode.getAttributes().getLength() == 0) {
          anchor.setTextContent(linkText);
          // replace in place so that the anchor keeps the position of the span
          parent.replaceChild(anchor, linkNode);
        } else {
          parent.insertBefore(anchor, linkNode);
          anchor.appendChild(linkNode);
        }
      }
    } catch (XPathExpressionException ex) {
      LOGGER.error(ex.getMessage(), ex);
    }
  }
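
To try the steps of the excerpt without ScroogeXHTML, the same logic can be applied to a small DOM document, as in the sketch below. The class name UnderlinedLinkDemo, the method convert, the color value #0000FF and the sample URLs are assumptions for illustration. The sketch covers both branches: a span that carries only the link style is replaced by the anchor, while a span with additional style is kept and wrapped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class UnderlinedLinkDemo {

  static final String COLOR = "#0000FF"; // assumed color value of link text

  /** Turns blue/underlined spans into anchors, using the span text as href. */
  static void convert(Document doc) throws Exception {
    String exp = String.format(
        "//span[contains(@style, 'color:%s;')"
        + " and contains(@style, 'text-decoration:underline;')]", COLOR);
    NodeList found = (NodeList) XPathFactory.newInstance().newXPath()
        .evaluate(exp, doc, XPathConstants.NODESET);

    // Copy the matches before changing the tree.
    List<Element> spans = new ArrayList<>();
    for (int i = 0; i < found.getLength(); i++) {
      spans.add((Element) found.item(i));
    }

    for (Element span : spans) {
      // Remove the two style properties that marked the text as a link.
      String style = span.getAttribute("style")
          .replace("color:" + COLOR + ";", "")
          .replace("text-decoration:underline;", "");
      if (style.isEmpty()) {
        span.removeAttribute("style");
      } else {
        span.setAttribute("style", style);
      }

      Element anchor = doc.createElement("a");
      String linkText = span.getTextContent();
      anchor.setAttribute("href", linkText);

      Node parent = span.getParentNode();
      parent.replaceChild(anchor, span); // anchor takes the span's position
      if (span.getAttributes().getLength() == 0) {
        anchor.setTextContent(linkText); // span no longer needed
      } else {
        anchor.appendChild(span); // keep the remaining formatting
      }
    }
  }

  static Document parse(String xml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
  }

  public static void main(String[] args) throws Exception {
    Document doc = parse("<p>"
        + "<span style=\"color:#0000FF;text-decoration:underline;\">http://a.example/</span>"
        + " and <span style=\"font-weight:bold;color:#0000FF;"
        + "text-decoration:underline;\">http://b.example/</span></p>");
    convert(doc);
    NodeList anchors = doc.getElementsByTagName("a");
    for (int i = 0; i < anchors.getLength(); i++) {
      System.out.println(((Element) anchors.item(i)).getAttribute("href"));
    }
  }
}
```

Note that the XPath expression matches the style string literally, so a span whose style contains extra whitespace or a different color notation would not be found.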