I <3 Steve McConnell*
Coding Horror
programming and human factors
by Jeff Atwood

March 28, 2005

Building Mht Files from URLs revisited

I finally finished updating my CodeProject article, Convert any URL to a MHTML archive using native .NET code. It's based on RFC 2557, aka Multipart MIME Message (MHTML web archive). You may also know it as that crazy File, Save As, "Web Archive, Single File" menu option in Internet Explorer. It's basically a way to package an entire web page as a (mostly) functional single file that can be emailed, stored in a database, or what have you. Lots of interesting possibilities, including quick and dirty offline functionality for ASP.NET websites using loopback HTTP requests.
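
To make the format concrete, here is a minimal sketch in Python (not the .NET code from the article) of how a page and one image could be packaged as a single multipart/related message in the spirit of RFC 2557. The function name `build_mht`, the `images` argument, and the example.com URLs are assumptions made for illustration; Mht.Builder itself may assemble its output differently.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mht(page_url, html, images):
    """Package an HTML page and its images as one multipart/related message.

    images maps each absolute image URL to a (subtype, raw_bytes) tuple.
    All names here are hypothetical and exist only for this sketch.
    """
    root = MIMEMultipart("related", type="text/html")
    root["Subject"] = page_url

    # The first part is the page itself, labelled with its original URL.
    page = MIMEText(html, "html", "utf-8")
    page["Content-Location"] = page_url
    root.attach(page)

    # Each resource is a further part; Content-Location lets a viewer match
    # the URLs referenced in the HTML to the parts stored in the archive.
    for url, (subtype, data) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = url
        root.attach(part)

    return root.as_string()


if __name__ == "__main__":
    archive = build_mht(
        "http://example.com/",
        "<html><body><img src='http://example.com/logo.png'></body></html>",
        {"http://example.com/logo.png": ("png", b"placeholder image bytes")},
    )
    print(archive)
```

Printing the result shows the overall shape of an archive: one set of top-level headers, then each part between boundary lines, with the binary parts base64-encoded.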

This was a truly painful total rewrite, but it offers tons of new functionality:

  • Completely rewritten!
  • Autodetection of content encoding (e.g., international web pages), tested against multi-language websites
  • Now correctly decompresses both types of HTTP compression
  • Supports completely in-memory operation for server-side use, or on-disk storage for client use
  • Now works on web pages with frames and iframes, using recursive retrieval
  • HTTP authentication and HTTP Proxy support
  • Allows configuration of browser ID string to retrieve browser-specific content
  • Basic cookie support (needs enhancement and testing)
  • Much improved regular expressions used for parsing HTTP
  • Extensive use of VB.NET 2005 style XML comments throughout
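
As an illustration of the compression bullet above, and assuming the two types are the gzip and deflate codings named by the HTTP `Content-Encoding` header, a response body could be decoded along these lines. This is a Python sketch with a hypothetical helper name, `decode_body`, not the code the class actually uses.

```python
import gzip
import zlib


def decode_body(body, content_encoding):
    """Return the decompressed bytes of an HTTP response body.

    content_encoding is the value of the Content-Encoding header, or None.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # Servers are supposed to send zlib-wrapped data for "deflate"...
            return zlib.decompress(body)
        except zlib.error:
            # ...but some send a raw deflate stream with no zlib header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No (or unrecognized) encoding: hand the bytes back untouched.
    return body
```

The fallback for raw deflate streams is a design choice of this sketch; it covers servers that omit the zlib wrapper.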

If you're interested, you can download the VS.NET 2003 solution from my blog until the CodeProject site gets updated. Here's a screenshot of the demo app packaged with the Mht.Builder class:

screenshot of Mht.Builder demo app

Posted by Jeff Atwood

Comments

Looks great! Now are you going to extend it to be able to put more than one page into the archive?

Oliver Sturm on March 29, 2005 05:45 AM

I don't think that's possible.. I believe *clicked* links will always try to resolve a real host and access the network instead of checking the MHT file for the resource. I can run an experiment to see if it will work or not, but I doubt it.

Jeff Atwood on March 29, 2005 09:28 AM

Well, WinMHT can definitely do it: http://www.winmht.com

The RFC talks about "subsidiary resources". Obviously it's no problem if these are HTML pages again... a multi-page archive, as created with WinMHT, can be opened in IE without problems and the various pages inside can be browsed from the archive if they are interlinked (WinMHT can also create a TOC page).

Oliver Sturm on March 29, 2005 11:48 AM

That's interesting. The links it builds are in this format:

mhtml:file://C:\Documents and Settings\jya13970\My Documents\My MHTs\Spidersoft - WinMHT Start Page.mht!http://www.spidersoft.com/winmht/start.asp

mhtml:file://C:\Documents and Settings\jya13970\My Documents\My MHTs\Spidersoft - WinMHT Start Page.mht!http://www.spidersoft.com/winmht/default.asp

Hmm. I wasn't aware of these crazy mhtml:file:// format links and the exclamation points..

I guess if I remapped all the links to that format, you could create one giant MHT that contained all the sub-pages of a website.
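
A rough sketch of that remapping, in Python rather than VB.NET. The helper names `mhtml_link` and `remap_links` are made up for illustration, the link format is copied from the WinMHT examples above, and whether IE needs the links stored in this absolute form is an untested assumption.

```python
import re


def mhtml_link(archive_path, resource_url):
    """Build a link of the form mhtml:file://<archive path>!<original URL>."""
    return "mhtml:file://" + archive_path + "!" + resource_url


def remap_links(html, archive_path, archived_urls):
    """Point href attributes at the archive for every URL stored inside it."""

    def swap(match):
        url = match.group(2)
        if url in archived_urls:
            return match.group(1) + mhtml_link(archive_path, url) + match.group(3)
        # Links to pages that are not in the archive are left alone.
        return match.group(0)

    # Only handles double-quoted href values; a real page needs a real parser.
    return re.sub(r'(href=")([^"]*)(")', swap, html)
```

Only URLs that were actually captured get rewritten, so links to the outside world still go to the network.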

I need to also revisit how Firefox deals with this. It is an RFC standard, but last I checked there was a special add-in you needed to view them in FF.

Jeff Atwood on March 29, 2005 03:04 PM

I guess the links are probably relative in the file (otherwise it wouldn't work as soon as you copy it elsewhere), so part of the links you are showing aren't really in the file itself. However, it would be great if your MhtBuilder could do this!

As for Firefox: it doesn't support MHT files natively, which strikes some people as funny because actually Thunderbird does. I'm not sure about the exact extent of the support, but it's definitely possible to view MHT files attached to email directly in the mailer. I looked at this a while ago, but I believe I was able to find a bug tracking entry about this at the time.

Oliver Sturm on March 30, 2005 04:10 AM

Can your product possibly pull down linked active pages like ASP? I'm building active content systems with a local server and being able to bundle all the pages as single files would be insanely great! IE's save as mht will not do that.
Great effort so far, keep up the good work!
David

David on May 14, 2005 02:20 PM

How can I save stuff from my local disk? I have a report that I'm creating in HTML, and I already have my stuff locally. Any easy way to just "get" the stuff from there and package into an MHT? The image links are all local (<img src="file.png">)...

Thanks

Kurt Koller on May 20, 2005 12:27 PM

Also, I ran your demo, saved the default codinghorror pages to disk, and tried to open it with Microsoft Word, which told me that it's not a valid single page archive file.

Kurt Koller on May 20, 2005 12:33 PM

and as a followup to that, if I save from IE, it works fine in word.

Kurt Koller on May 20, 2005 12:36 PM

> I'm building active content systems with a local server and being able to bundle all the pages as single files would be insanely great! IE's save as mht will not do that.

Yes, I would like to get to this.. eventually.. I wasn't aware it was possible until Oliver pointed out WinMHT.

> tried to open it with Microsoft Word

Does it open from IE?

Jeff Atwood on May 20, 2005 12:48 PM

I am having the same problem as Kurt.

> Does it open from IE?
Yep

I want to be able to convert any html page on my web server's local disk into .mht, rename it to .doc, and be able to open it in word. Any advice?

Raj on May 26, 2005 08:13 PM

Well, I never tested Word.. it never occurred to me that you could even do this!

Jeff Atwood on May 26, 2005 11:48 PM

hi thx first for the nice app
most sites i tried so far seem to work fine.
But when i try a download (web complete) on http://www.heise.de/ it crashed after a few seconds.
I assume it's a problem cause of invalid filename that it wants to create on the local hdd like "this is a long , filename.txt"

arne on May 31, 2005 02:21 PM

> tried to open it with Microsoft Word, which told me that it's not a valid single page archive file.

I know what this is now. Word is looking for a trailing "--" at the final boundary. So at the very end of the file change this..

------=_NextPart_000_00

to this..

------=_NextPart_000_00--

Why, I have no idea, but this one change causes the file to open fine in Word 2003.
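
For reference, MIME (RFC 2046) defines the last delimiter of a multipart body as the boundary line followed by two more hyphens, which matches what Word expects. A small Python sketch of patching an existing file's text follows; the helper name `ensure_closing_boundary` is hypothetical, and `boundary` is the tag without its leading "--".

```python
def ensure_closing_boundary(mht_text, boundary):
    """Make sure the archive text ends with the closing delimiter.

    The closing delimiter is "--" + boundary + "--"; ordinary part
    delimiters are just "--" + boundary.
    """
    delimiter = "--" + boundary
    stripped = mht_text.rstrip()
    if stripped.endswith(delimiter + "--"):
        # Already terminated correctly; leave the text untouched.
        return mht_text
    if stripped.endswith(delimiter):
        # Final boundary is present but lacks the trailing "--".
        return stripped + "--\r\n"
    # No boundary at the end at all: append a closing delimiter.
    return stripped + "\r\n" + delimiter + "--\r\n"
```

With the boundary from the example above, the tag passed in would be `----=_NextPart_000_00`, since the first two hyphens on the line belong to the delimiter syntax rather than to the tag.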

Jeff Atwood on June 6, 2005 01:09 PM

> But when i try a download (web complete) on http://www.heise.de/ it crashed after a few seconds.

The problem with that URL is its *insane* use of the <link> tag. Just take a look at the top of the file for all the <link> elements. Not easy to fix, because I assume most linked elements are embedded. In this case, they're not at all..

Jeff Atwood on June 6, 2005 05:41 PM

Why would you want to use MHT for anything? It is a Microsoft pseudo-standard, and IE is the only browser that will open them, so you can forget Linux users and Mac users, except for those few who have IE for Mac. While Windows may currently be the most widely used OS for normal users, the internet should not be a place where only Windows users are welcome. Considering the flaws in IE, I would be loath to build a site that required opening the viewer up to security holes, just to make life easier for me.

Yes a MHT file can store images and multiple webpages in one file, but so can a tarred or zipped folder, and as far as browsing those pages, the technique is relative links.

Nathaniel Troutman on July 14, 2005 03:10 PM

It's not a pseudo-standard, it's RFC2557 almost verbatim!

http://www.ietf.org/rfc/rfc2557.txt

The main benefit is keeping everything in a single file.

Jeff Atwood on July 14, 2005 03:59 PM

Have you patched the code for the word problem? Just curious.

Also yes, this is a standard. There are extensions to Firefox to save/read this as well. And a bunch of other things support it as well.

Kurt Koller on August 5, 2005 09:21 PM

I've made two code changes to allow for the file to be opened in Word 2003. This made it work for me anyway.

Kyle

In builder.vb starting on line 474, change the procedure to the following:

    ' Writes a MIME boundary line; the final boundary in the file gets a trailing "--"
    Private Sub AppendMhtBoundary(Optional ByVal bEndOfFile As Boolean = False)
        AppendMhtLine()
        If bEndOfFile = False Then
            AppendMhtLine("--" & _MimeBoundaryTag)
        Else
            AppendMhtLine("--" & _MimeBoundaryTag & "--")
        End If
    End Sub

In builder.vb on line 438, change the procedure call to: AppendMhtBoundary(True)

Kyle on August 15, 2005 05:30 PM

Is it possible to compile this into a dll, for use in vb6, or just convert the code?

Thanks

Fred on August 29, 2005 01:41 PM

Any way to apply gzip compression to MHTs?

Taras Tielkes on November 12, 2005 08:51 PM

Great idea, thanks.

Have you considered extending this to save in other widely used formats - like perhaps .doc?

Because .MHT doesn't seem to be that widely/consistently supported yet, I'm a little hesitant about adopting it as a format for collecting/archiving saved content.

So a version that supported the .doc format would be a welcome alternative (IMHO). I know, it's not a truly open standard... but the OpenOffice folks seem to have managed, so perhaps it's possible?

Gregg on December 16, 2005 10:33 AM

This is a great, great tool and will work perfectly for my intranet reporting project. Thanks Jeff!

Texrat on January 5, 2006 12:12 PM

the program is very cool but it seems i must use the commercial versions

what i need is to convert a local file to mht and send this with vb.net built in mail

Andre

Andre on February 6, 2006 03:38 PM

I use Visual Studio 2k5.
How do I create just a simple .mht file, with the headers, etc.?

Paca on April 27, 2006 12:46 AM

Hi Jeff,

I have posted a question on The Code Project in regards to my question, but I thought it might be easier to reach you here.

Is this app supposed to loop through the entire site folder and create an mht file for each html file? 1 for 1. It seems to only get the first file it finds, creates an mht (perfectly I might add) but stops there.

I have an intranet that has 12000 files, and the reason I am interested in your app is because I need to convert them into mht, and then upload them into Sharepoint. This then allows users to view all pages in the IE browser, but edit the pages easily in Word.

So as you can imagine, a batch process is paramount.

Oh and I was a little confused about why it doesn't keep the original file name, but rather takes on the Page Title instead. Why is that?

Thanks for any help you might have, and this page has some good reading

Sean on June 27, 2006 05:34 PM

Hi Jeff,

Any chance you can just let me know if your script can convert each file on a website (batch style), and I am just doing it wrong, or it only does one file per run. I have a rather fast approaching deadline to get these files converted.

Sean on June 29, 2006 03:34 PM

Everyone seems to think that only IE can open an mht file, which is incorrect. Opera supports it natively. I believe Firefox has an extension for it somewhere if you're willing to look for it, download it and keep updating it.

Jadd on November 20, 2006 09:00 AM

The problem that really annoys me, is with IE7 (final version), where saving files as MHTML does NOT guarantee that IE7 itself will actually be able to read them from disk!
If anyone has a work-around (apart from using Firefox, or saving as Complete HTML from IE7), let me know! ildotthomasatiinetdotnetdotau

Ian on November 21, 2006 09:15 PM

Hi,
Thank you very much for providing this well crafted code. I need to save html as "Web page complete", "Web page archive", "Web page as PDF" for an application to back up blogs. Currently I am doing it using your code. What I am interested in is to show the download progress as the web page is being saved. So I was thinking of combining the functions provided in MHT builder into the extended web browser control found at http://www.codeproject.com/csharp/ExtendedWebBrowser.asp.

I would request your help/guidance/indication/criticism on it.

S M Mahbub Murshed on February 3, 2007 02:13 PM

I noticed in your list of functionality that it now supports iframes. Do iframes cause issues when trying to save in MHT format?
Thanks!

Roger Benedict on March 21, 2007 11:01 AM

Hi

I really need to work out how to convert a locally stored HTML file to MHT?

If I enter a file path I get an error:

Unable to cast object of type 'System.Net.FileWebRequest' to type 'System.Net.HttpWebRequest'.

Bob Davidson on June 8, 2007 07:21 AM

I've used your class to make a little application for myself to download and update episode guides, and it works very nicely except for a couple of urls that only display as text in IE7 once saved as mht files:

http://epguides.com/smallville/

Any ideas?

Jacques on August 24, 2007 12:04 AM

This works great for most sites, but I have found a few where the mht file displays as text only in IE7.

Here is one example: http://epguides.com/smallville/

Any Ideas?

Jacques on August 24, 2007 12:07 AM

I have been trying to get your cool mht tool to work with file based urls without much luck (such as file:///C:/App-Dev-2.0/test.htm). I have looked through the code and have added some additional code in the GetUrlData sub in the WebClientEx.vb to account for file based urls:

    Dim bFile As Boolean = False
    Dim freq As FileWebRequest '= DirectCast(WebRequest.Create(Url), FileWebRequest)
    Dim wreq As HttpWebRequest '= DirectCast(WebRequest.Create(Url), HttpWebRequest)

    If LCase(Microsoft.VisualBasic.Left(Url, 8)) = "file:///" Then
        freq = DirectCast(WebRequest.Create(Url), FileWebRequest)
        bFile = True
    Else
        bFile = False
        wreq = DirectCast(WebRequest.Create(Url), HttpWebRequest)
    End If

Everything works fine until it hits the FinializeMht sub in Builder.vb. I get a null exception with Dim sr As New StreamWriter(outputFilePath, False, _HtmlFile.TextEncoding).

Any thoughts?

Thanks so much.

itsky on December 2, 2007 09:19 AM

Jeff, all the methods take a URL and then convert it to MHT. A method that takes a string of HTML and converts it to MHT would be an added bonus as well.

Rahul Atlury on December 26, 2007 11:21 PM

Hello Jeff,
I have seen your article on CodeProject and from there I was redirected to this URL on CodingHorror for downloading the MhtBuilder 2.0. But I don't see any link on this CodingHorror page, http://www.codinghorror.com/blog/archives/000249.html, to download it. Can you pls provide me the link where I can download the code? Also I tried your code available on CodeProject to create the MHT pages. It doesn't throw an exception and saves the *.mht file, but the MHT file will not open in IE7. It shows all the things in plain text format. Is there any issue with MSIE7.0, VS2005, Vista Home Premium? Pls help me.

Thanks in advance.
Nitin

Nitin on May 11, 2008 11:46 PM











Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.