Apache Ant/Cleaning up HTML

From Wikibooks, open books for an open world

Motivation

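We want to clean up HTML that is not well formed, for example HTML with unclosed or improperly nested tags. We will run the Apache Tika command-line application (tika-app) from Ant to convert this dirty HTML to well-formed XHTML. As the sample output shows, Tika keeps the document structure (head, title, paragraphs) but drops inline markup such as bold, italic and line-break tags.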

Sample Ant File

<project name="tika tests" default="extract-xhtml-from-html">
    <description>Sample invocations of Apache Tika</description>
    <property name="lib.dir" value="../lib"/>
 
    <property name="input-dirty-html-file" value="input-dirty.html"/>
    <property name="output-clean-xhtml-file" value="output-clean.xhtml"/>
    <target name="extract-xhtml-from-html">
        <echo message="Cleaning up dirty HTML file: ${input-dirty-html-file} to ${output-clean-xhtml-file}"/>
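        <!-- Run tika-app in a forked JVM: the dirty HTML is fed to standard input
             and Tika's output is captured in the XHTML file. -->
        <java jar="${lib.dir}/tika-app-1.3.jar" fork="true" failonerror="true"
            maxmemory="128m" input="${input-dirty-html-file}" output="${output-clean-xhtml-file}">
            <!-- -x tells Tika to emit XHTML -->
            <arg value="-x" />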
        </java>
    </target>
</project>
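
The build file above feeds the HTML to Tika on standard input. As a possible variant, tika-app also accepts a file name as a command-line argument, which lets Tika see the file name when detecting the document type. The following is a minimal sketch under that assumption; the project name and the target name `extract-xhtml-from-file-arg` are made up for illustration, and the properties are copied from the sample above.

```xml
<project name="tika file argument sketch" default="extract-xhtml-from-file-arg">
    <!-- Same locations as in the sample build file above -->
    <property name="lib.dir" value="../lib"/>
    <property name="input-dirty-html-file" value="input-dirty.html"/>
    <property name="output-clean-xhtml-file" value="output-clean.xhtml"/>

    <!-- Hypothetical target name, chosen for this sketch -->
    <target name="extract-xhtml-from-file-arg">
        <java jar="${lib.dir}/tika-app-1.3.jar" fork="true" failonerror="true"
            maxmemory="128m" output="${output-clean-xhtml-file}">
            <!-- -x tells Tika to emit XHTML -->
            <arg value="-x"/>
            <!-- The input is passed as a file argument instead of standard input;
                 Ant expands the file attribute to an absolute path -->
            <arg file="${input-dirty-html-file}"/>
        </java>
    </target>
</project>
```

Either form should produce the same XHTML for this input; the standard-input form is handy when the HTML comes from another process, while the file-argument form keeps the file name visible to Tika.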

Sample Input

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Dirty HTML</title>
    </head>
    <body>
        <p><b>test</b></p>
        <p><b>test<b></p>
        <p>test<br/>test</p>
        <p>test<br>test<br>test</p>
        <p>This is <B>bold, <I>bold italic, </b>italic, </i>normal text</p>
    </body>
</html>

Sample Output

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="Content-Type" content="application/xhtml+xml"/>
<meta name="dc:title" content="Dirty HTML"/>
<title>Dirty HTML</title>
</head>
<body>
        <p>test</p>
 
        <p>test</p>
 
        <p>test
test</p>
 
        <p>test
test
test</p>
 
        <p>This is bold, bold italic, italic, normal text</p>
 
    </body></html>
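
To confirm that the result really is well-formed, one option is Ant's `xmlvalidate` task with `lenient="true"`, which only checks well-formedness and does not validate against a DTD. The following is a minimal sketch; the project name and the target name `check-well-formed` are illustrative, and the output file name is assumed to match the one used in the sample build file.

```xml
<project name="tika output check" default="check-well-formed">
    <!-- Assumed to be the file written by the extract-xhtml-from-html target -->
    <property name="output-clean-xhtml-file" value="output-clean.xhtml"/>

    <!-- Hypothetical target name, chosen for this sketch -->
    <target name="check-well-formed">
        <!-- lenient="true": check well-formedness only, no DTD validation.
             The build fails here if the file cannot be parsed as XML. -->
        <xmlvalidate file="${output-clean-xhtml-file}" lenient="true" failonerror="true"/>
        <echo message="${output-clean-xhtml-file} is well-formed XML"/>
    </target>
</project>
```

Running the same check against the dirty input file would be expected to fail, since tags such as `<br>` and the mis-nested `<B>` and `<I>` elements are not well-formed XML.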