I recently wrote a Word 2003 document that I later turned into a blog post. The transition between Word doc and HTML presented some problems. Word offers two HTML options in its save dialog: "Save as HTML" and "Save as Filtered HTML". In practice, that means you get to choose between totally nasty HTML and slightly less nasty HTML.
I searched around for any existing Word cleanup solutions and found the Textism Word HTML Cleaner, and Tim Mackey's set of regular expressions. The Textism solution is great, but requires a subscription for files over 20kb. And I wasn't quite happy with Tim's regular expressions, either. So I created my own Word HTML cleanup solution.
This C# 2.0 code removes all unnecessary cruft from Word documents saved as HTML, stripping the HTML down to the bare-bones basics:
using System;
using System.Collections.Specialized;
using System.IO;
using System.Text.RegularExpressions;

static void Main(string[] args)
{
    if (args.Length == 0 || String.IsNullOrEmpty(args[0]))
    {
        Console.WriteLine("No filename provided.");
        return;
    }
    // A bare filename is resolved against the current directory
    string filepath = args[0];
    if (Path.GetFileName(filepath) == args[0])
    {
        filepath = Path.Combine(Environment.CurrentDirectory, filepath);
    }
    if (!File.Exists(filepath))
    {
        Console.WriteLine("File doesn't exist.");
        return;
    }
    string html = File.ReadAllText(filepath);
    Console.WriteLine("input html is " + html.Length + " chars");
    html = CleanWordHtml(html);
    html = FixEntities(html);
    // The result is written to the current directory as <name>.modified.htm
    filepath = Path.GetFileNameWithoutExtension(filepath) + ".modified.htm";
    File.WriteAllText(filepath, html);
    Console.WriteLine("cleaned html is " + html.Length + " chars");
}
static string CleanWordHtml(string html)
{
    StringCollection sc = new StringCollection();
    // get rid of unnecessary tag spans (comments and title)
    sc.Add(@"<!--(\w|\W)+?-->");
    sc.Add(@"<title>(\w|\W)+?</title>");
    // Get rid of classes and styles
    sc.Add(@"\s?class=\w+");
    sc.Add(@"\s+style='[^']+'");
    // Get rid of unnecessary tags
    sc.Add(
        @"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>");
    // Get rid of empty paragraph tags
    sc.Add(@"(<[^>]+>)+&nbsp;(</\w+>)+");
    // remove bizarre v: element attached to <img> tag
    sc.Add(@"\s+v:\w+=""[^""]+""");
    // remove extra lines
    sc.Add(@"(\n\r){2,}");
    foreach (string s in sc)
    {
        html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
    }
    return html;
}
static string FixEntities(string html)
{
    // Map literal typographic characters to their HTML entities
    NameValueCollection nvc = new NameValueCollection();
    nvc.Add("“", "&ldquo;");
    nvc.Add("”", "&rdquo;");
    nvc.Add("–", "&mdash;");
    foreach (string key in nvc.Keys)
    {
        html = html.Replace(key, nvc[key]);
    }
    return html;
}
Some caveats:
If you're feeling frisky, you can cut and paste the code above to build it yourself. Or you can just download it, lazyweb style:
That's what you get for trying to use word as a Blog editor. ;)
Haacked on January 9, 2006 5:40 AM
The deliverable, in this case, WAS a Word doc. So it didn't make sense to do it as HTML first. Or so I thought at the time, anyway.
Jeff Atwood on January 9, 2006 5:49 AM
Your macro does a thing that reminds me of a tangential question, which is why so many people *don't* use styles in Word. For all the yacky-yack about CSS in HTML, you'd think people would be a lot hipper to the virtues of a style sheet in Word.
Given the way we work around here, a conversion from WordML/.doc format to HTML is no good unless it preserves (references to) our styles. As it happens, Word's conversion chooses to apply local formatting to certain conversions (w/in tables, if I remember correctly), making our styles useless.
mike on January 9, 2006 5:56 AM
For all the yacky-yack about CSS in HTML, you'd think people would be a lot hipper to the virtues of a style sheet in Word.
But you can't view the markup "tags" in Word. It's all magical and hidden, which makes it far more difficult to deal with. At least with HTML you can do a view source and see what's wrong; with Word you're just plain screwed. This happens to me all the time in Word, too. I'm happily editing a paragraph and all of a sudden I delete some hidden magic tag and I'm hosed. Drives me absolutely bonkers.
Word's conversion .. mak[es] our styles useless.
Right. When you save a Word doc as HTML, every single paragraph has a style. Every. Single. Paragraph!
If you really really want HTML, you should start with HTML, because Word is absolutely not the way to build HTML.
Jeff Atwood on January 9, 2006 6:07 AM
When I have to paste something from Word into an HTML doc, I've found that FCKeditor (http://www.fckeditor.net/) has a nice utility that imports text from MS Word. It's also really nice about creating XHTML.
Good little editor if you have people posting lots of stuff from Word onto a website.
Ryan Smith on January 9, 2006 6:18 AM
Anybody got an idea why Word creates HTML that way?
Some form of "We need to sell more operating systems, so we need to make the applications work more"? :)
Documents written for paper and slides should not be the same as a web page. It is a great feature, or with Jeff's tool it can be a great feature, but since it is not 100% compliant HTML and does not display the same in every browser, why make it so dramatic? Why generate such complicated HTML when the viewer probably doesn't have the required font, browser or operating system?
No, this is not an I-hate-MS thread. I just don't like Word and still have nightmares from my studies, when Office 97 crashed because it could not handle the size of my paper and the number of equations. Yes I know things have changed, but if it creates html like that, how does it handle word documents? :-(
/P
Peter Palludan on January 10, 2006 3:17 AM
Any reason you didn't use HtmlTidy (http://tidy.sourceforge.net/)? It's easy enough to run SWIG over it and use it from .NET.
I was just about to suggest HTMLTidy as well, although it doesn't seem to do a complete cleanup of word. I'm looking to create an open-source Java project that whitelists standard HTML tags and reduces Word HTML to simple HTML as well. (Small blurb on project intentions here: http://www.critical-masses.com/projects.html -- scroll down to HTMLMin.) I've been busy and never really got off the ground on that one, although I do have a need for something to replace the combination of HTMLTidy and regex cleanup I'm using now. I just learned that Blogger has a Word add-in that pretty much does the job, though.
jake on January 10, 2006 8:21 AM
Bloggers add-in only works with Blogger. If you could write a Word add-in that works with just about any blog/CMS package, you might have something.
I know there are blog tools out there, but with so many folks using Word, why not make a Word plugin?
Eric D. Burdo on January 10, 2006 9:47 AM
This is great, thanks. I normally compose my blog posts in Word, as I like to keep them as an off-line collection of Word documents as well (for my memoirs one day, you see).
Brady Kelly on January 10, 2006 12:13 PM
I was just about to suggest HTMLTidy as well, although it doesn't seem to do a complete cleanup of word.
That's right, HTMLTidy doesn't clean out all the craziness. To see the craziness yourself, just save a Word doc as HTML or Filtered HTML and view it in a text editor.
Bloggers add-in only works with Blogger
Vertigo Software (i.e., we) wrote the Blogger add-in for Word:
http://help.blogger.com/bin/answer.py?answer=1180&topic=14
The add-in is written in VB6 for compatibility reasons.
I have backported the code to .NET 1.1 and added a few features - remove class attributes when they contain Microsoft classes (i.e. name beginning with Mso), ignore spans and ignore divs (leave those tags intact, but still remove attributes). Is it fine if I post my modified version (it is your code after all)?
Sam on January 11, 2006 5:01 AM
Made a version for .NET 1.1. A few bugs fixed (quoted classes were not removed, not all empty tags were deleted).
http://webdevel.blogspot.com/2006/01/clean-word-html-command-line-tool.html
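The behaviour described in these two comments (strip a class attribute only when its value is one of Word's built-in Mso classes, whether or not the value is quoted) can be sketched with one regular expression. This is a minimal JavaScript illustration, not Sam's actual .NET 1.1 code; the function name removeMsoClasses is made up for the example.

```javascript
// Remove class attributes only when the class name starts with "Mso"
// (Word's built-in styles). Handles double-quoted, single-quoted and
// unquoted values; any other class attribute is left untouched.
function removeMsoClasses(html) {
  return html.replace(/\s+class=(?:"Mso[^"]*"|'Mso[^']*'|Mso[\w-]*)/gi, "");
}

const sample =
  "<p class=MsoNormal>one</p>" +
  '<p class="MsoListParagraph">two</p>' +
  "<p class='note'>three</p>";

console.log(removeMsoClasses(sample));
// prints: <p>one</p><p>two</p><p class='note'>three</p>
```

Like the other regex-based cleaners here, this does not know where tags begin and end, so text content that happens to look like class=MsoSomething would be altered too.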
Sam on January 18, 2006 6:04 AM
Very interesting thread.
I also knew about HTMLTidy and I was not convinced, first because it's not so easy to use, and second because it's incomplete.
I'm used to the Dreamweaver "Clean Word 97/100" function. From what I remember, around 2/3 of the page size was deleted, but I still had to make manual searches for "MSO" crap.
How do you rank this Dreamweaver MX tool, if you know it?
What does the word nasty mean?
harry on January 20, 2006 11:10 AM
Heh... when I saw the post I thought that the miracle solution that cleans up Word HTML while preserving the formatting and look of the document was finally here. I wonder if there is one :)
Sergei Shelukhin on January 22, 2006 12:44 PM
This is great. I'd love to be able to use this with a Drupal website I'm working on. Until I know how to create a Drupal input filter module in PHP, I'll just use one of the two posted variations to clean up Word and Publisher files for posting. Or, perhaps I'll add a form so users can drag and drop a file and display the cleaned-up HTML they can copy and paste into Drupal.
Terry Westley on March 14, 2006 11:52 AM
"Yes I know things have changed, but if it creates html like that, how does it handle word documents? :-("
Considering the Word format has changed to an XML-based one, I would guess a lot has changed, and conversion to HTML will be easier and slightly cleaner.
If anyone is looking for a word html cleaner that doesn't use regexps (so it won't break with varying versions of word), I wrote one in javascript that actually parses the html character by character to remove everything but white listed tags and attributes.
http://ethilien.net/websoft/wordcleaner/cleaner.htm
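A rough sketch of the whitelist idea in JavaScript: scan the input, copy text through, and rebuild each tag from scratch only if its name is on an allowed list, keeping only allowed attributes. This is not the code behind the link above; the names ALLOWED, DROP_WITH_CONTENT, whitelistHtml and openTag and the choice of allowed tags are assumptions for illustration. It also assumes reasonably well-formed input (no bare < in text, no > inside attribute values).

```javascript
// Hypothetical whitelist: tag name -> attributes allowed on that tag.
const ALLOWED = {
  p: [], b: [], i: [], ul: [], ol: [], li: [],
  a: ["href"],
  img: ["src", "alt"],
};
// Elements whose text content should vanish along with the tags.
const DROP_WITH_CONTENT = ["style", "script", "title"];
// Plain tag names only; namespaced names such as o:p do not match.
const TAG_RE = /^(\/?)([a-z][a-z0-9]*)(?=[\s\/]|$)/i;

function whitelistHtml(html) {
  let out = "";
  let i = 0;
  while (i < html.length) {
    if (html[i] !== "<") {          // ordinary text: copy one character
      out += html[i];
      i += 1;
      continue;
    }
    if (html.startsWith("<!--", i)) { // skip comments, including Word's conditional ones
      const close = html.indexOf("-->", i + 4);
      if (close === -1) break;
      i = close + 3;
      continue;
    }
    const end = html.indexOf(">", i);
    if (end === -1) break;          // unterminated tag: discard the rest
    const inner = html.slice(i + 1, end);
    i = end + 1;
    const m = TAG_RE.exec(inner);
    if (!m) continue;               // o:p, st1:place, <![if ...]> and friends
    const closing = m[1] === "/";
    const name = m[2].toLowerCase();
    if (!closing && DROP_WITH_CONTENT.includes(name)) {
      const closeRe = new RegExp("</" + name + "\\b", "gi");
      closeRe.lastIndex = i;
      const found = closeRe.exec(html);
      if (!found) break;
      i = found.index;              // the closing tag is dropped on the next pass
      continue;
    }
    if (!Object.prototype.hasOwnProperty.call(ALLOWED, name)) continue;
    out += closing ? "</" + name + ">" : openTag(name, inner);
  }
  return out;
}

// Rebuild an opening tag, keeping only whitelisted attributes.
function openTag(name, inner) {
  const attrRe = /([a-z-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;
  let tag = "<" + name;
  let a;
  while ((a = attrRe.exec(inner)) !== null) {
    const attr = a[1].toLowerCase();
    if (!ALLOWED[name].includes(attr)) continue;
    const value = a[2] || a[3] || a[4] || "";
    tag += " " + attr + '="' + value.replace(/"/g, "&quot;") + '"';
  }
  return tag + ">";
}

console.log(whitelistHtml(
  "<p class=MsoNormal style='margin:0'>Hi <o:p></o:p>" +
  '<a href="http://example.com" target=_blank>link</a></p><!-- junk -->'
));
// prints: <p>Hi <a href="http://example.com">link</a></p>
```

Because unknown tags are dropped by default, a new Word version that invents new markup should not break the cleaner, which is the advantage over a blacklist of regular expressions. Note that this is a formatting cleaner, not a security sanitizer: attribute values are passed through unchecked.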
Ethilien on March 23, 2006 8:49 AM
Hey there, I have been on a fruitless search for years for something that might help me get docs straight into nice sanitized HTML.
I have a VB6 app I wrote a while back (for compatibility reasons) that is based on htmltidy, although I did the rest with regular expressions. Works fine, but I just wanted something neater, and now that I can use .NET it is time for another go.
Any suggestions on going from a word doc to an online published clean html file?
I guess I will have to write an ActiveX control that takes in a Word doc and publishes the crap-free result (I think it will be better to have the processing done on the client, although updates could be an issue)... hmm, I have typed too much.
Tony on March 27, 2006 8:51 AM
Any suggestions on going from a word doc to an online published clean html file?
That's exactly why I wrote this post-- I needed to go from Word doc to published HTML file!
1) Save the word doc as "filtered HTML"
2) Run this utility on the saved HTML file
voila. ;)
Jeff Atwood on March 27, 2006 12:40 PM
Thanks dude! Your code does a great job cleaning those Word tags! I love it.
rosdi kasim on September 12, 2006 2:40 AM
There is another way, you know. Instead of creating a file in Word then going through the rigmarole to clean it, why not just start with OpenOffice.org Writer (http://openoffice.org) and have compliant documents from the start? Since it's free and you can even carry it on your USB drive (http://portableapps.com/apps/office/openoffice_portable), it seems like the simplest solution to me. It even opens all of your old Word documents, though WordPerfect support is sketchy. The only thing it doesn't do as well as or better than Word in my experience is macros, and I rarely use them on documents intended for the Web.
scott cushman on October 3, 2006 9:18 AM
If you're interested in converting a BLOCK of MS Word (from, say, a copy/paste operation), I just blogged about how to do this. You may be able to use the same technique for an entire Word HTML doc. Just put the DHTML control into Design Mode (see post below) and then save web.Document.InnerHTML to a file.
Copy Paste HTML From MS Word: IE's DHTML Editing Control (in a .NET WinApp)
http://blogs.msdn.com/noahc/archive/2006/10/16/copy-paste-html-from-ms-word-ie-s-dhtml-editing-control-in-a-net-winapp.aspx
Thank you! This will come in very helpful as I'm converting an intranet site at work, and a lot of the pages are in god-awful Word HTML format.
Chris on December 21, 2006 12:02 PM
Thanks for your function. Works fine!
Here's the VB (.NET 1.1) function:
[code]
Public Function CleanWordHtml(ByVal html As String) As String
    Dim sc(7) As String
    'get rid of unnecessary tag spans (comments and title)
    sc(0) = "<!--(\w|\W)+?-->"
    sc(1) = "<title>(\w|\W)+?</title>"
    'Get rid of classes and styles
    sc(2) = "\s?class=\w+"
    sc(3) = "\s+style='[^']+'"
    'Get rid of unnecessary tags
    sc(4) = "<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>"
    'Get rid of empty paragraph tags
    sc(5) = "(<[^>]+>)+&nbsp;(</\w+>)+"
    'remove bizarre v: element attached to img tag
    sc(6) = "\s+v:\w+=""[^""]+"""
    'remove extra lines
    sc(7) = "(\n\r){2,}"
    For Each s As String In sc
        html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase)
    Next
    Return html
End Function
[/code]
Thank you so much!
This was exactly what I needed right now.
Gr,
Ben
Your link for the console app is incorrect. I found the correct link on http://www.dickson.me.uk/2007/02/08/howto-blog-using-a-microsoft-word-2003-file/ it's http://www.codinghorror.com/blog/images/WordHtmlCleaner-executable.zip
boardtc on July 18, 2007 3:24 AM
Running wordhtmlcleaner /? at the command line causes an exception, as does trying it on an HTML file saved from OpenOffice...
boardtc on July 18, 2007 3:29 AM
Thanks... You have saved me time.
Shrini on August 15, 2007 2:39 AM
After struggling to find why curly quotes and the dash in the input Word file got converted to garbage in the HTML, I found I needed to change the file encoding type used by File.ReadAllText. This change works for me (I have no idea what encoding scheme is used when you don't specify the .Default):
string html = File.ReadAllText(filepath, System.Text.Encoding.Default);
Nice code!
But I have one enhancement:
Change
// Get rid of empty paragraph tags
sc.Add(@"(<[^>]+>)+&nbsp;(</\w+>)+");
to:
// Get rid of empty paragraph tags
sc.Add(@"(<[^>]+>){1}&nbsp;(</\w+>){1}");
Otherwise, it will remove unexpected data.
For example, following line will be fully removed:
nbsp;
My fix resolves the problem.
Thanks.
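One way to see the over-matching this comment is about: the opening group in the original pattern also matches closing tags, so an empty paragraph can drag the end of the previous element along with it. A small JavaScript sketch follows; the sample markup and the third, opening-tags-only variant are assumptions for illustration and do not come from the post or the comment.

```javascript
// The pattern from the post: one or more tags, &nbsp;, one or more closing tags.
const original = /(<[^>]+>)+&nbsp;(<\/\w+>)+/gi;
// The commenter's variant: exactly one tag on each side.
const narrowed = /(<[^>]+>){1}&nbsp;(<\/\w+>){1}/gi;
// Illustrative alternative: any number of tags, but opening tags only.
const openOnly = /(<[^\/>][^>]*>)+&nbsp;(<\/\w+>)+/gi;

const sampleHtml = "<p><b>Title</b></p><p>&nbsp;</p><p>Body</p>";

console.log(sampleHtml.replace(original, ""));
// prints: <p><b>Title<p>Body</p>      (the </b></p> before the empty paragraph is lost)
console.log(sampleHtml.replace(narrowed, ""));
// prints: <p><b>Title</b></p><p>Body</p>
console.log(sampleHtml.replace(openOnly, ""));
// prints: <p><b>Title</b></p><p>Body</p>
```

The trade-off is that the {1} version only removes the innermost pair, so something like `<p><span>&nbsp;</span></p>` leaves an empty `<p></p>` behind, while the opening-tags-only variant removes the whole nested paragraph.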
*THANK YOU!!!*
I have tens of word documents I need to convert into accessible html format. Your cleaner app saved me hours and hours of work and frustration!
*THANK YOU!!!*
I don't consider myself a stupid person (much), but how does one use this application? I want in!!
Laurie on February 26, 2008 2:11 AM
This is superb!!! I have been converting PDF to HTML, which has been very messy, and used Word as the spell checker as none of the available freeware could do the job properly. I am trying to create the simplest of HTML for the world's most basic mobile phones, as only the Symbian ones can view PDF at the moment. This will save me hours. Thanks
Laurie - Extract it to a directory on your drive. Bring up a command prompt and change to that directory. Type the program name followed by the full path and filename of the HTML doc. It will drop the fixed HTML in the same dir as the program.
Kris on March 7, 2008 1:57 AM
how do i use this????
Vic on March 20, 2008 3:21 AM
Works well for me too - been searching for years for an elegant solution to converting simple Word documents which covers Word tables of content, semantic markup for headings for SEO and simple tables!
Nice clear instructions still at:
http://www.dickson.me.uk/2007/02/08/howto-blog-using-a-microsoft-word-2003-file/
Download of console app now:
http://www.codinghorror.com/blog/archives/000485.html
I found it throws an exception if you still have the source .htm file open, so you need to close Word first.
Dave Chaffey on April 22, 2008 4:37 AM
That's great. One nice addition would be to search for common style attributes and replace them with generated class attributes.
So if there is a load of <span style='font-size:10.0pt;font-family:"Courier New";color:#A31515'>
tags, like Word generates, it would add
span.style1
{
    font-size:10.0pt;
    font-family:"Courier New";
    color:#A31515;
}
to the in-page <style> tag and replace the style attributes with class='style1'
A bit more work but it would be really useful. HtmlTidy doesn't do this from what I can see.
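A minimal JavaScript sketch of that suggestion, assuming every style attribute is quoted and the element has no class attribute already. The function name stylesToClasses and the style1, style2, ... naming are made up for the example, and the rules are emitted as plain class selectors rather than span.style1 so they apply to any tag.

```javascript
// Replace each distinct style="..." value with a generated class name
// and return the CSS rules that go with the rewritten markup.
function stylesToClasses(html) {
  const classFor = new Map(); // style text -> generated class name
  const body = html.replace(
    /\sstyle=(?:'([^']*)'|"([^"]*)")/gi,
    (match, single, dbl) => {
      const style = (single !== undefined ? single : dbl).trim();
      if (!classFor.has(style)) {
        classFor.set(style, "style" + (classFor.size + 1));
      }
      return " class='" + classFor.get(style) + "'";
    }
  );
  let css = "";
  for (const [style, name] of classFor) {
    css += "." + name + " { " + style + " }\n";
  }
  return { css, body };
}

const result = stylesToClasses(
  "<span style='color:#A31515'>a</span>" +
  "<span style='color:#A31515'>b</span>" +
  '<p style="margin:0">c</p>'
);
console.log(result.body);
// prints: <span class='style1'>a</span><span class='style1'>b</span><p class='style2'>c</p>
console.log(result.css);
// prints: .style1 { color:#A31515 }
//         .style2 { margin:0 }
```

Identical style strings share one class, so the hundreds of repeated inline styles Word emits collapse to a handful of rules; the css string would then be placed in the page's style block.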
Chris W on May 28, 2008 8:56 AM
Hey guys :)
Really, that damn MS Word is a pain in the ass when working with HTML. As for free solutions, there is one that I find quite good at:
http://www.wordhtmlcleaner.co.uk/
It can handle files up to 1mb in size. I haven't tested it with very complex Word files, but so far it has been adequate for me...
Great work on the code, by the way! :)
Mehdi on June 18, 2008 10:26 AM
This site doesn't clean as such, but it starts from scratch:
http://www.documentsfortheweb.com/free/
The HTML is clean and it even makes CSS.
WORKSFORME!
Thanks FredK
I had a go and it's good; they even have a converter just for Word. Cool!
This is a work in progress, but the Drupal modules
http://www.drupal.org/project/word2web and
http://www.drupal.org/project/xslt_book
clean up Word HTML with XSLT expressions, and do a quite nice job with it. And since it's just XSL at the core, you can download them, rip out the stylesheets, and use them in whatever environment you like.
tom on June 30, 2008 12:18 PM
Thanks for sharing that very handy code! And thanks to everybody else for all the illuminating comments, too. What a great thread.
For the record, though, to call Word's HTML output garbage isn't really fair. To the garbage.
That's an interesting point about using OpenOffice. Have it but had never thought of using it to get around this M$ issue.
svend on July 6, 2008 3:24 AM
thanks for the great!
A few little suggestions:
1. keep the extension name.
(if .html is used, it gets changed to .htm)
2. add a target name if a second argument exists.
(ex: whc foo.htm target.htm)
great thanks again. ^^y
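Both suggestions come down to how the output name is chosen. A small JavaScript sketch of that logic follows, with a hypothetical function outputPath that takes the command-line arguments as an array. Unlike the C# code in the post, which writes to the current directory, this sketch keeps the input file's directory; that is a choice made for the example.

```javascript
// Decide where the cleaned file goes.
// args[0] is the input file; args[1], if present, is an explicit target.
function outputPath(args) {
  const [input, target] = args;
  if (target) return target; // suggestion 2: explicit target wins
  const slash = Math.max(input.lastIndexOf("/"), input.lastIndexOf("\\"));
  const dot = input.lastIndexOf(".");
  // No extension (or only a leading dot): just append the marker.
  if (dot <= slash + 1) return input + ".modified";
  // Suggestion 1: keep whatever extension the input had.
  return input.slice(0, dot) + ".modified" + input.slice(dot);
}

console.log(outputPath(["foo.html"]));              // prints: foo.modified.html
console.log(outputPath(["foo.htm", "target.htm"])); // prints: target.htm
```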
vegalou on July 31, 2008 1:53 PM
Hi.
I want to clean the Word tags, but I also need to keep the formatting.
So the line:
sc.Add(@"<!--(\w|\W)+?-->");
cleans out my format styles.
How do I fix it?
Please help me!
Shlomi on August 6, 2008 9:53 AM
Why don't you guys try Office 2000 HTML Filter 2.0 http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN
This worked great for Word 2007, except that a ' turns out to be ?T and I cannot do a simple Replace. Left and right quotes as well as a few other characters turn into weird-looking things like that as well. Any ideas how to avoid this?
Jeremy on January 7, 2009 8:43 AM
hi
i use the following
// remove inner <?...> declaration
//
((\<\s*\?)(.*?)(\s*/\s*\>))
//remove o:p like constructs
(\<\w\:\w\>(.*?)(<\/\w\:\w\>))
before running tidy, and it has worked for me till now; hoping it continues to work
kazim
kazim mehdi on January 14, 2009 5:58 AM
THANK YOU VERY MUCH!!!!!!!!!!!!!!!!!!!!
Onur Onal on February 26, 2009 3:32 AM
I just wrote a tool to do this work.
It can clean the nasty HTML while preserving the appearance.
and it also has the function to get pure HTML as other tools do.
I think it's fast and easy to use.
I name it HTML Cleaner for Word, which can be downloaded from my site at
http://www.wonderstudio.cn/soft/down/cleanerW.zip
or just from download.com at http://download.cnet.com/HTML-Cleaner-for-Word/3000-2079_4-10913372.html?tag=mncol
I hope it will solve such problems for you.
http://www.algotech.dk/word-html-cleaner-input.htm
I used this at work and it did the trick.
Gus on June 16, 2009 12:45 PM
http://www.algotech.dk/word-html-cleaner-input.htm#developer
i used this at work and it did the trick
Gus on June 16, 2009 12:45 PM
Hey Jeff,
Love the tool.
Of course, since I'm such a klutz at the keyboard, I found a bug when I typed the file name wrong. There should be a return after line 16: Console.WriteLine("File doesn't exist"); to prevent the user seeing the exception get thrown.
Luckily, I'm feeling frisky :)
Jason Kemp on February 6, 2010 9:46 PM
Like many of you, I needed a way to clean the crap from the HTML code created by Word from a DOC file. I use the HTML for generating help files with RoboHelp and the excess HTML coding added by Word was causing problems.
After many weeks of research and testing, I found a commercial program that does a good job of cleaning all the garbage out of HTML. It is called WordCleaner (not the program created by Ethilien above).
You can get more information on WordCleaner at this address:
John Larson on February 6, 2010 9:46 PM
I know Windows is not big on drag-and-drop, but is there a drag-and-drop version of this in the works so we can just drag a folder full of mucked-up Word HTML docs onto the program and have it clean them all in batch?
Brandon on February 6, 2010 9:46 PM
Excellent. Thanks a lot for the time savings!
Jason on February 6, 2010 9:46 PM
Why didn't you just ask this on Stack Overflow ;)
Here's a link for anyone running into this problem in other versions of Word as well. Looks like Word 2007 actually has a feature to publish a document as a blog post with clean HTML.
http://stackoverflow.com/questions/67964/what-is-the-best-free-way-to-clean-up-word-html
Even Mien on February 26, 2010 5:13 AM
Jeff, I have ported your code to JavaScript.
Thank you, it helped so much.
function cleanWord(str) {
    // get rid of unnecessary tag spans (comments and title)
    str = str.replace(/<!--(\w|\W)+?-->/gim, '');
    str = str.replace(/<title>(\w|\W)+?<\/title>/gim, '');
    // Get rid of classes and styles
    str = str.replace(/\s?class=\w+/gim, '');
    str = str.replace(/\s+style='[^']+'/gim, '');
    // Get rid of unnecessary tags
    str = str.replace(/<(meta|link|\/?o:|\/?style|\/?div|\/?st\d|\/?head|\/?html|body|\/?body|\/?span|!\[)[^>]*?>/gim, '');
    // Get rid of empty paragraph tags
    str = str.replace(/(<[^>]+>)+&nbsp;(<\/\w+>)/gim, '');
    // remove bizarre v: element attached to <img> tag
    str = str.replace(/\s+v:\w+="[^"]+"/gim, '');
    // remove extra lines
    str = str.replace(/(\n\r){2,}/gim, '');
    // Fix entities (regexes with the g flag, so every occurrence is replaced)
    str = str.replace(/“/g, '"');
    str = str.replace(/”/g, '"');
    str = str.replace(/—/g, '–');
    return str;
}
SerkanYersen on April 22, 2010 6:44 AM
I also recently met the problem of obtaining clean HTML from Word. I found solution in DOC to HTML converter that can be downloaded from http://opilsoft.com/doctohtml.html. It does a very good job for me, actually I don't have to edit anything manually in produced HTML code.
Alan Morris on April 26, 2010 12:24 PM
Thanks to you all. After a year's development, HTML Cleaner for Word has been updated to 1.8; it can now be downloaded at http://www.htmlcleaner.com
Wong Frank on July 4, 2010 3:13 AM
* Download the VS.NET 2005 solution (3kb)
* Download the CleanWordHtml console application (3kb, requires .NET 2.0 runtime)
The download links appear to be dead...
http://www.codinghorror.com/blog/files/WordHtmlCleaner-vsnet2005-solution.zip
http://www.codinghorror.com/blog/files/WordHtmlCleaner-executable.zip
Many of the comments above are spam.
Any advances?
Thanks.
hm2k on July 7, 2010 11:45 AM
My solution is simpler: If it's going to wind up as HTML eventually, don't write it in Microsoft Word. Use the free OpenOffice.org Writer, the open source derivative of Sun Microsystems' Star Office Writer. It can compose documents with all the styles and nice appearance of Word documents, but when it exports the document as HTML, the resulting code is much, much cleaner. Any styles used in the document are included as HTML inline style blocks. It can also import Word documents, although I have no experience using it as a filter to clean up Word files prior to export to HTML. OpenOffice.org is a free suite similar to Microsoft Office, and is available for Microsoft Windows, Apple Macintosh and Linux systems. (Up to version 2.4 it also ran on Windows 98 systems, and you can probably still find the installer in archives.)
Upaj Os on December 8, 2010 9:14 AM
http://www.memonic.com/ - This is a very easy way to strip all the Word cruft away.
Keith Hill on July 2, 2011 9:28 AM
The comments to this entry are closed.