Removing Microsoft Word HTML in PHP

I’ve been trying to figure out how to strip the extra HTML that MS Word adds to a document when you save it as an .htm file.

Here’s what I found to be most successful, although it may not be perfect in all instances.

/**
 * Removes FONT, SPAN, DEL and INS tags (keeping their contents) and the
 * class, lang, style, size and face attributes.
 * Designed to get rid of non-standard Microsoft Word HTML.
 */
function cleanHTML($html) {
    // ereg_replace() was removed in PHP 7, so this uses preg_replace() instead.
    // Start by completely removing all unwanted tags (case-insensitive).
    $html = preg_replace('#</?(?:font|span|del|ins)\b[^>]*>#i', '', $html);

    // Then strip unwanted attributes. Each pass removes at most one attribute
    // per tag, so repeat until a pass changes nothing.
    $pattern = '#<([^>]*?)\s+(?:class|lang|style|size|face)\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)([^>]*)>#i';
    do {
        $html = preg_replace($pattern, '<$1$2>', $html, -1, $count);
    } while ($count > 0);

    return $html;
}

$stringData = cleanHTML($stringData);
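
Regular expressions can trip over unusual markup, which is part of why this may not be perfect in all instances. As an alternative, here is a rough sketch of the same cleanup done with PHP's DOM extension, so a real parser does the tag matching. The function name cleanWordHtmlWithDom() is made up for this example, and the sketch assumes the DOM extension is enabled and that the input is plain ASCII or already entity-encoded (character encoding handling is left out on purpose).

```php
<?php
// Sketch only: cleanWordHtmlWithDom() is a hypothetical helper name.
// Assumes ext-dom is available and the input is ASCII or entity-encoded.
function cleanWordHtmlWithDom(string $html): string
{
    $unwrapTags = ['font', 'span', 'del', 'ins'];
    $stripAttrs = ['class', 'lang', 'style', 'size', 'face'];

    // Word markup usually triggers parser warnings, so collect them quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrap the fragment in a div so the cleaned markup can be read back out.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($doc);

    // Unwrap unwanted elements: keep their children, drop the tag itself.
    foreach ($unwrapTags as $tag) {
        foreach (iterator_to_array($xpath->query('.//' . $tag, $wrapper)) as $node) {
            while ($node->firstChild) {
                $node->parentNode->insertBefore($node->firstChild, $node);
            }
            $node->parentNode->removeChild($node);
        }
    }

    // Remove the presentational attributes from every remaining element.
    foreach ($xpath->query('.//*', $wrapper) as $element) {
        foreach ($stripAttrs as $attr) {
            $element->removeAttribute($attr);
        }
    }

    // Serialize only the wrapper's children, not the wrapper itself.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$sample = '<p class="MsoNormal" style="margin:0"><span lang="EN-US">Hello <font face="Arial">world</font></span></p>';
echo cleanWordHtmlWithDom($sample);
```

The parser normalizes the markup as it goes (for example, it lowercases tag names and may close unclosed tags), so the output can differ from the input in ways the regex version would leave alone.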
