Coding Horror
programming and human factors
by Jeff Atwood

May 13, 2008

Is HTML a Humane Markup Language?

One of the things we're thinking about while building stackoverflow.com is how to let users style the questions and answers they're entering on the site. Nothing's decided at this point, but we definitely won't be giving users one of those friendly-but-irritating HTML GUI browser layout controls.

[Image: an example HTML GUI editor]

I have one iron-clad design guide: this is a site for programmers, so they should be comfortable with basic markup. None of that nancy-boy GUI toolbar handholding nonsense for us, thankyouverymuch. If you can sling code, a little bit of presentation markup is child's play.

We will support some sort of markup language to style the questions and answers. But what markup language?

I mentioned in podcast #4 that we consider Wikipedia a defining influence. Let's see how Wikipedia handles markup syntax. This is what the edit page for Joel Spolsky's Wikipedia entry looks like:

[Image: Wikipedia edit page for the Joel Spolsky entry]

It's an effective markup language, but I think you'll agree that it's more intimidating than humane. Wikipedia's How to Edit a Page and the accompanying Wikipedia syntax cheatsheet help. Some. I'd argue that writing a Wikipedia entry is a step beyond mere presentational markup; it's almost like coding, as you weave the article into the Wikipedia gestalt. (Incidentally, if you haven't ever edited a Wikipedia article, you should. I consider it a rite of passage, a sort of internet merit badge for anyone who is serious about their online presence.)

Let's consider a simpler example. What we're looking for is some kind of middle ground, a humane text format. Let's start with some basic HTML.

Lightweight Markup Languages

According to Wikipedia:

A lightweight markup language is a markup language with a simple syntax, designed to be easy for a human to enter with a simple text editor, and easy to read in its raw form.

Some examples are:

  • Markdown
  • Textile
  • BBCode
  • Wikipedia

Markup should also extend to code:

10 PRINT "I ROCK AT BASIC!"
20 GOTO 10

Here's what that looks like expressed in a variety of lightweight markup languages. Bear in mind that each of these will produce HTML equivalent to the above.

Textile

h1. Lightweight Markup Languages

According to *Wikipedia*:

bq. A "lightweight markup language":http://is.gd/gns
is a markup language with a simple syntax, designed 
to be easy for a human to enter with a simple text 
editor, and easy to read in its raw form. 

Some examples are:

* Markdown
* Textile
* BBCode
* Wikipedia

Markup should also extend to _code_: 

pre. 10 PRINT "I ROCK AT BASIC!"
20 GOTO 10

Markdown

Lightweight Markup Languages
============================

According to **Wikipedia**:

> A [lightweight markup language](http://is.gd/gns)
is a markup language with a simple syntax, designed 
to be easy for a human to enter with a simple text 
editor, and easy to read in its raw form. 

Some examples are:

* Markdown
* Textile
* BBCode
* Wikipedia

Markup should also extend to _code_: 

    10 PRINT "I ROCK AT BASIC!"
    20 GOTO 10

Wikipedia

==Lightweight Markup Languages==

According to '''Wikipedia''':

:A [[lightweight markup language]]
is a markup language with a simple syntax, designed 
to be easy for a human to enter with a simple text 
editor, and easy to read in its raw form. 

Some examples are:

* Markdown
* Textile
* BBCode
* Wikipedia

Markup should also extend to ''code'': 

<source lang=qbasic>
10 PRINT "I ROCK AT BASIC!"
20 GOTO 10
</source>

BBCode

[size=150]Lightweight Markup Languages[/size]

According to [b]Wikipedia[/b]:

[quote]
A [url=http://is.gd/gns]lightweight markup language[/url]
is a markup language with a simple syntax, designed 
to be easy for a human to enter with a simple text 
editor, and easy to read in its raw form. 
[/quote]

Some examples are:

[list]
[*]Markdown
[*]Textile
[*]BBCode
[*]Wikipedia
[/list]

Markup should also extend to [i]code[/i]: 

[code]
10 PRINT "I ROCK AT BASIC!"
20 GOTO 10
[/code]

None of these lightweight markup languages are particularly difficult to understand -- and they're easy on the eyes, as promised. But I still had to look up the reference syntax for each one and map it to the HTML that I already know by heart. I also found them disturbingly close to "magic" for some of the formatting rules, to the point that I wished I could just write literal HTML and get exactly what I want without guessing how the parser is going to interpret my fake-plain-text.

Which leads directly to this question: why not just stick with what we already know and use HTML? This c2 wiki page titled Why Doesn't Wiki Do HTML? makes the case that -- at least for Wiki content -- you're better off leaving HTML behind:

  1. In a Wiki, the emphasis is on content, not presentation. Simple Wiki markup rules let people focus on expressing their ideas.
  2. Why not use a domain-specific markup language designed to do "the simplest thing that could possibly work"?
  3. Some HTML tags are difficult to work with and can break the flow of your thoughts. The table tag, for example.
  4. Does the average user really need total HTML and CSS layout power?
  5. Allowing the full range of HTML tags can lead to major security vulnerabilities.
  6. Many people don't know HTML. A simple Wiki markup language is easier to learn.

I'm not sure I agree with all of this, but it can make sense in the context of a full-blown Wiki. It's worth considering.

After all this research on humane markup languages, much to my chagrin, I've come full circle. I now no longer think humane markup languages make sense for most uses. I agree with the guy at fileformat.info -- HTML is generally the better choice:

  • Simplicity

    If the source and destination are the web, why not use the native markup language of the web?

  • Readability

    HTML is a bit less readable than the lightweight markup languages, it's true. But basic HTML is not onerous to read, particularly if we hide the repetitive paragraph tags.

  • Security

    With a bit of careful coding, it is possible to whitelist specific HTML tags that you will allow. This way you avoid exposing yourself to risky/vulnerable tags.

  • Conversion

    It's not at all clear that any existing lightweight markup language has critical mass, with the possible exception of Wikipedia's flavor. On the other hand, text parsers and tools will always understand HTML.

  • What people know

    A lot more people know HTML than any given flavor of humane text. If you're a programmer, you damn well better know HTML. For the handful of wiki-like functions we may need, it's possible to add some optional attributes to the HTML tags. And wouldn't that be easier to learn than some weird, pseudo-ASCII derivation of HTML?
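
The whitelisting idea from the Security point can be sketched in a few lines of Python. This is only an illustration under assumptions, not a description of what any real site does: the `ALLOWED_TAGS` set and the `sanitize` helper are made-up names, and the sketch simply drops every attribute instead of checking them.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; the post does not say which tags would be allowed.
ALLOWED_TAGS = {"b", "i", "em", "strong", "code", "pre", "blockquote", "ul", "ol", "li", "p"}


class WhitelistSanitizer(HTMLParser):
    """Re-emit whitelisted tags without attributes; escape everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes are dropped on purpose
        else:
            # Show the disallowed tag as literal text instead of markup.
            self.out.append(escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")
        else:
            self.out.append(escape(f"</{tag}>"))

    def handle_data(self, data):
        self.out.append(escape(data))


def sanitize(markup: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Calling `sanitize('<b onclick="x()">hi</b>')` keeps the bold tag but loses the `onclick` handler, while a `<script>` element comes out as harmless escaped text. A whitelist that also permits selected attributes (an `href` on links, say) would need per-attribute validation on top of this.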

I do think we'll adopt some of the cleverer functions of Textile and Markdown, insofar as they remove mundane HTML markup scutwork. But in general, I'd much rather rely on a subset of trusty old HTML than expend brain cells trying to remember the fake-HTML way to make something bold, or create a hyperlink. HTML isn't perfect, but it's an eminently reasonable humane markup language.

Posted by Jeff Atwood
Comments

The official implementation of Markdown supports HTML in the input, so you can use Markdown, and your users will still be able to use HTML if they want to.

Peter Hosey on May 14, 2008 5:05 AM

Isn't the textile language just sort of troff lite? We can leave troff in the horrible bad old days where it belongs, please.

reed on May 14, 2008 5:09 AM

The biggest feature I can see in wikipedia that would seem to be missing in basic HTML is the automatic cross referencing functionality. A user shouldn't have to look up the URL to type [a href="http://en.wikipedia.org/Markup_Languages#Light_Weight"] when the server can figure it out for them from [[lightweight markup language]].
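
As a rough illustration of that server-side lookup, here is a minimal Python sketch. Everything in it is assumed for the example: the `PAGES` index, its URL, and the `link_wiki_refs` name are hypothetical, and a real site would query its own page store instead of a dictionary.

```python
import re
from html import escape

# Hypothetical title index; a real site would look titles up in its own store.
PAGES = {
    "lightweight markup language": "/wiki/Lightweight_markup_language",
}

# Matches [[Title]] where the title contains no brackets or pipes.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)\]\]")


def link_wiki_refs(text: str) -> str:
    """Replace [[Title]] with an anchor when the title is known."""

    def replace(match):
        title = match.group(1).strip()
        url = PAGES.get(title.lower())
        if url is None:
            return match.group(0)  # unknown title: leave the markup visible
        return f'<a href="{escape(url)}">{escape(title)}</a>'

    return WIKI_LINK.sub(replace, text)
```

The sketch only rewrites the bracketed references; the surrounding text is passed through unchanged, so it would still need whatever escaping or tag filtering the rest of the pipeline applies.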

I guess you'll be adding some special syntax to html for those sorts of issues?

Mike on May 14, 2008 5:09 AM

We opted for Markdown in our CMS, because clients in combination with visual editors invariably screwed things up horribly. Although the output would be well-formed, it was inevitably nonsense, and it was far too easy to copy and paste the wrong bits of formatting from Word or somesuch (and lo, if we disabled that bit of functionality, there'd be complaints that they could no longer copy and paste other bits of formatting from Word).

Markdown has a double-pronged advantage for us:

1. It's simple for clients to learn how to mark stuff up properly. Because they have to think at least a tiny bit about the separation between content and formatting, it's easier for them to retrospectively tweak the markup to match what the content's supposed to convey as opposed to what Word made it look like

2. We can stick raw HTML into posts where a client's asked us to do something more complicated than they can manage themselves—Markdown's smart enough to leave the HTML as-is. Our clients, not being programmers, aren't likely to ever put in HTML themselves (and are aware that if they do, they stand a greater risk of screwing up their pages and so caveat emptor).

Works well for us.

Mo on May 14, 2008 5:10 AM

Html is harder to learn than the others when it comes to people without any experience. It has tags and attributes, which can be hard to wrap your mind around. These Lightweight types are easier to use for a beginner. Wikipedia is not a wiki for developers, it is for users who have never made a website before.

Think about your audience, if it's developers, they would be able to use html and have no problems with that. Although, they might then be able to interfere with your site code, which can be quite damaging. Leaving an <a> tag open, <table> open, javascript(!) etc.

Though, I have to say that it is a lot easier to do simple styling like bold and italic in bb code than html (especially if you are to make it xhtml strict valid)

Thomas Winsnes on May 14, 2008 5:12 AM

I really think you ought to provide a simple wysiwyg editor, with the ability to edit code by hand. There are plenty of free, cross-browser applications available that you simply need to drop in and tell which tags to allow.

Why make people do the markup by hand just because they can? That's like making a user edit a config file instead of providing them options within the program, just because they can.

I'm glad you at least went with HTML though, so no need to learn a new markup syntax. Especially with the completely _unintuitive_ underscore to mean italics. I can't think of a worse choice. I mean, there's the slash which is s/anted like italics, or the underscore, which looks much like an underline. Ugh. =)

Sammy Larbi on May 14, 2008 5:12 AM

At Pedant's corner over here I have noticed that none of the markup examples would produce the HTML above. Replace "Some examples" with "A Few examples" at first glance.

Pedant on May 14, 2008 5:13 AM

Use ReST!

Calvin Spealman on May 14, 2008 5:15 AM

We had the same discussion when we developed a wiki-like interface in our application.

It seemed that Markdown was easier for users to understand than Textile after initial tests.

I would not go the HTML way since it allows users to break any semantic value you could find in their entry.

I would neither create my own language based on both Markdown and Textile, since users, especially blog users, are very used to one of them. You would just create confusion and mistakes.

Vincent on May 14, 2008 5:15 AM

Amen. I curse every second that I have to think about (or, God forbid, actually look up) the correct markup to link something or make whitespace non-wrapping or whatever. I already know HTML. You already know HTML. And you, over there, who doesn't already know HTML: the time you spend learning the tiny subset of HTML that you need to post a comment to a web site will be much more worthwhile than spending that same time learning one of the umpteen subtly different "lightweight" markup systems out there.

John on May 14, 2008 5:17 AM

Technically, wiki syntax should be
"[http://is.gd/gns lightweight markup language]", not [[lightweight markup language]]. But that's because I'm anal. :]

I personally disagree with you, html markup, while easy to understand for us coders, is quite harder to type than Textile or Wikimarkup. (and less pleasing to the eye, imho)

lucasbfr on May 14, 2008 5:22 AM

Isn't the point to be able to let people express their ideas quickly and easily? Why not let us use the GUI editor buttons, it's not like we're trying to prove our l33tne55 to anyone; we just want to push text into the computer efficiently.

Failing that, BBCode since it's simple and doesn't create tons of visual clutter - we're writing human readable text with markup, not Perl ;) - unless we're writing about Perl, of course.

Whatever you choose, please let it handle code in a sane way - what I mean is a little scrolling box with the code in rather than a five-screen scrolling mess, not decorated with line numbers, and in a format that can be easily copy and pasted (so no random blank lines or loss of indenting). Oh, it should handle non-wrapping lines of code correctly too without destroying your page template or making the browser have a horizontal scrollbar.

James on May 14, 2008 5:25 AM

I've recently been mulling over this very subject, because my company uses a CMS with a *horrible* GUI that completely mangles input, produces invalid markup, etc.

Markdown certainly looks the 'easiest' to learn, although I'm suspecting there's a lot more to it than presented here (off to research later ...)

In my experience, though, however simple the HTML subset, and however much training you give re: elements, attributes, valid nesting, etc., people will always struggle with that most fundamental of beasts: the humble hyperlink. Let's be fair, to a non-developer, a URL is a pretty complex string of syntax. And editors simply resort to copy+pasting. If I were addressing a 'low-tech' audience, I'd seriously consider one of:

a) Denying any out-bound links + clever wiki-style auto-linking
b) Auto-linking allowing out-bound links via a search engine (or similar)
c) Robust URL parsing looking for obvious errors

bobby on May 14, 2008 5:30 AM

+1 Markdown...

It allows HTML and does a very nice job of very easy to use formats...

Jake Good on May 14, 2008 5:30 AM

I'd like HTML (and as a result Markdown is good too).
It would be nice to do something slightly pretty with code snippets though.

Des Traynor on May 14, 2008 5:33 AM

Why invent something new when there are so many reasonable choices? I agree with those of you that say that an HTML posting syntax would be ideal. If that is not possible for security reasons, please don't invent something new. Let me leverage the time spent learning textile or markdown or whatever existing markup technology you decide to use. My time is valuable and I'd rather spend it conveying a message rather than learning a new way to format a message.

Jay on May 14, 2008 5:34 AM

addressing those wiki points -

1. HTML focuses on content, not presentation - semantic html lets people focus on (or sometimes even gain deeper understanding of the format of) their ideas.

2. Why use domain-specific markup when you've already got global markup that serves all your needs?

3. Tables aren't any less difficult to understand than the puzzling mixture of dashes, asterisks, and brackets that wikis employ

4. No - they don't need it, and you don't have to give it to them.

5. Only if you leave yourself open to it... "we're too lazy/busy to address security concerns" is not a good reason.

6. What makes Wiki markup easier to learn than HTML? Why would you learn a new markup language, which will just get converted back to HTML again? Isn't that a little redundant? If people need to learn a markup language, why not learn the one that is universally used in every page on the web?

... I'm not a big fan of wiki markup either - bbs tags are only marginally better.

matt on May 14, 2008 5:36 AM

Just a thing: a way to get code coloration is, I think, necessary. Seriously.

Also, bbcode blows (and 9 out of 10 bbcode parsers are purely regex-based translators, thus break down real fast), thanks for not using it.
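
To make the regex complaint concrete, here is a minimal Python sketch of a translator in that style. The two rules and the `bbcode_to_html` name are hypothetical and not taken from any real BBCode library.

```python
import re

# Hypothetical rules in the style of a regex-only BBCode translator.
RULES = [
    (re.compile(r"\[b\](.*?)\[/b\]", re.S), r"<b>\1</b>"),
    (re.compile(r"\[i\](.*?)\[/i\]", re.S), r"<i>\1</i>"),
]


def bbcode_to_html(text: str) -> str:
    # Each rule is applied independently; nothing tracks tag nesting.
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Well-formed input such as `[b]bold[/b]` comes out fine, but `[b]one [i]two[/b] three[/i]` becomes `<b>one <i>two</b> three</i>`: the overlapping tags are copied straight into the HTML because no rule knows about the others.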

Masklinn on May 14, 2008 5:36 AM

I generally agree that a subset of HTML is fine for formatting. If all you want to do is have lists and paragraphs and bold and italics, it's exactly as clear as almost any other markup language. If you're willing to automatically add p tags on double newlines, most people can muddle through without touching it at all.
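
A minimal Python sketch of that double-newline rule, assuming the input is plain text with inline tags only; the `auto_paragraphs` name is made up for the example.

```python
import re


def auto_paragraphs(text: str) -> str:
    """Wrap blank-line-separated blocks in <p> tags."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return "\n".join(f"<p>{block.strip()}</p>" for block in blocks if block.strip())
```

This version would also wrap block-level elements such as lists or pre-formatted code in paragraph tags, so a fuller implementation would have to skip those.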

However, Mediawiki is a special case, in that the html tags don't actually fully represent what most of the corresponding markup means. As you say, it represents the structure of the data, and the structure you give the data using wiki markup has side effects beyond the formatting you'd get from basic html.

For example, take the 'triple equal sign' - on first glance, it's just an h3. That doesn't tell the whole story, though - there's some deeper meaning to that tag. Not only does it do your h3 formatting, but it also generates a named anchor, and it automatically appends a link to it in the table of contents. It does have the same logical meaning as h3, but it does more - h3 is a subset of triple equal. You could of course impart that power upon h3, but I'd argue that's even more confusing than having a separate syntax.
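
The extra work behind a triple equal sign can be sketched in Python roughly as follows. This is not MediaWiki's actual code: the `render_headings` name and the anchor scheme (spaces replaced with underscores) are assumptions made for the illustration.

```python
import re
from html import escape

# Matches ===Title=== only; titles containing "=" are out of scope here.
HEADING = re.compile(r"^===\s*([^=]+?)\s*===\s*$")


def render_headings(lines):
    """Return (html_lines, toc) for a list of wiki-style source lines."""
    html_lines, toc = [], []
    for line in lines:
        match = HEADING.match(line)
        if match is None:
            html_lines.append(line)
            continue
        title = match.group(1)
        anchor = re.sub(r"\s+", "_", title)  # assumed anchor scheme
        # One source line yields both the heading and a table-of-contents link.
        html_lines.append(f'<h3 id="{escape(anchor)}">{escape(title)}</h3>')
        toc.append(f'<a href="#{escape(anchor)}">{escape(title)}</a>')
    return html_lines, toc
```

The point of the sketch is the second return value: a plain h3 typed by hand would give the heading but not the table-of-contents entry.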

This doesn't even touch on the templating language or the category system, both of which have no equivalent in html. So with mediawiki, you *know* html won't meet all of your needs - so coming up with a language that does allow for everything only makes sense.

Jeremy T on May 14, 2008 5:39 AM

I usually love your posts... but this is exactly the kind of attitude which stops developers from making good UI imo. You've obviously thought about this a lot, but you immediately ruled out all of the best approaches by making a big assumption about your target audience.

Just because you expect every good programmer to be comfortable with markup doesn't make it so... and as you often remind us, there are plenty of bad programmers out there.

Maybe I got it a bit wrong... but I don't think you should expect your users to understand your markup, or even HTML. Showing the markup and allowing the user to edit it is fine, (ala wiki) but not implementing nice buttons and an interface... you shouldn't demand anything of users that isn't necessary imo.

I'll take it all back if you plan on having the buttons as well... but that's not how the post came across. :)

Jheriko on May 14, 2008 5:39 AM

Jeff, glad to hear you've settled on HTML for the input method. As you say, we developers already (should!) know HTML.

My only issue with use of HTML versus lightweight markup is the few extra characters needed to type out HTML tags, as opposed to the comparatively fewer characters needed to do formatting in the lightweight markup languages. But that's just one more reason that developers should all know how to touch type, right?

It seems like several other commenters are advocating Markdown... one option to accommodate these folks might be to give users an option to have their posts parsed for HTML, for Markdown syntax (or whatever lightweight markup language you choose), or both. Post entry in forums based on the UBB.threads (http://www.ubbcentral.com/) package does this, for example.

Jon Schneider on May 14, 2008 5:43 AM

As much as the bugs in Blogger annoy me, the one thing they do right is to allow the user to go to source and edit the HTML. For the users that don’t understand markup, they have a WYSIWYG editor.

Reinventing a markup language is the wrong approach. I've been creating HTML pages since 1994, and every time I edit a Wikipedia page I roll my eyes because I still have to look up that URL syntax. I agree with Calvin, use REST if you want to implement "automatic cross referencing functionality" but remember, that is a server-side function, not a mark-up issue.

One recommendation, I would create a White List of HTML you will support. This way you don't have to try to manage a Black List of restricted tags.

Josh Hurley on May 14, 2008 5:43 AM

It seems to me that an assumption is being made that all developers know how to code in HTML. As a desktop developer I rarely, if ever, have to touch web code and hence will have to invest time and effort into learning a whole new 'syntax' if I am expected to format my posts correctly.

While I understand that there will always be a need to have some kind of markup I cannot see the reasoning behind forcing us to hand craft it. If I am popping onto the site to post a question (or indeed an answer) then I probably have that problem space loaded up in my brain. Having to interrupt that and find out the correct way to markup my post seems like a sure fire way of reducing the integrity of that post.

I can see no logical reasoning why you feel the need to forgo a simple GUI driven text box, that requires minimal thought while using, in favour of forcing us to learn whatever you choose to be 'my way is the best way'.

That just smells of elitism.

One of the books on your reading list says it all - "Don't Make Me Think: A Common Sense Approach to Web Usability".

Martin Wallace on May 14, 2008 5:46 AM

Take a page from MS' book. Code up some intellisense for whatever markup you use. Even if markup isn't intuitive you can hint the user to where they need to be...

Add some syntax highlighting and users won't know they aren't using something they already know...

JPunyon on May 14, 2008 5:47 AM

I've always liked 37signals' solution - give them just a few whitelisted tags for bold, italic, links, quotes. Forget about attributes. Keeps it pretty clean, and they can explain it in a sentence.

Evan on May 14, 2008 5:48 AM

In the spirit of various other articles on this very blog, wouldn't the correct answer be to allow both html _and_ simpler markup?

I know HTML inside out, but given the choice I'd rather write in textile whenever possible.

Jack on May 14, 2008 5:52 AM

"this is a site for programmers, so they should be comfortable with basic markup."

Well, at least you put your prejudices involving what a 'real programmer' is right there where everyone can see it.

Embedded programmer who's never marked up with anything other than a highlighter pen on May 14, 2008 5:54 AM

Isn't the point of software to make things easier? Just because as a programmer, I can code in HTML doesn't mean I want to code in HTML just to write a comment. WYSIWYG is not a bad word.

Although, for a developer site there is a very limited set of markup needed:

* Plain format (the default)
* Lists (ordered/unordered)
* Bold/Italic/underline
* Hyperlinks
* Sourcecode (the biggie for a programming site)

One nice thing would be color formatting sourcecode automatically based on language.

Jeff Cuscutis on May 14, 2008 5:55 AM

If I were you I'd definitely go for plain old html but whitelist only a bunch of tags and limited attributes per tag, and of course validate properly before accepting anything. I've done it before and it's rather simple.

Yet I strongly disagree with the whole "If you're a programmer, you damn well better know HTML" thing; I might be comfortable programming for the web but that doesn't mean all programmers are. I still know some that only do specific FoxPro based stuff, or even just a small subset of C on embedded devices. These people are definitely programmers, but don't have any reason to know anything about HTML.

Therefore, and because I am not you, I'd still go for a nicer custom wysiwyg editor to generate nice lean and valid HTML with a code button for advanced users perhaps. I have this love for the KISS principle, but I realize it applies to my users' point of view, not to mine.

Kris

kris on May 14, 2008 5:56 AM

I find the simple formatting^ offered by the Australian broadband forum Whirlpool to be quite good. It isn't as full featured as many others but in most instances, it's more than enough - they conveniently allow you to enter in raw HTML if that suits your purposes better as well.

^ http://whirlpool.net.au/wiki/?tag=whirlcode

Al on May 14, 2008 5:57 AM

I'll note that the blog Making Light uses a subset of html for comment markup, including urls, and the users there seem to have no trouble figuring it out. The users are typically science fiction nerds, but at least 2/3rds do not come from technical backgrounds, but when sufficiently motivated can even figure out html, with prompting.

I'll append a section of prompting, but I had to mangle the angle brackets to abide by the "no HTML" rule you have for comments. (Irony!)

HTML Tags:
[strong>Strong[/strong> = Strong
[em>Emphasized[/em> = Emphasized
[a href="http://www.url.com">Linked text[/a> = Linked text

Spelling reference:
Tolkien. Minuscule. Gandhi. Millennium. Delany. Embarrassment. Publishers Weekly. Occurrence. Asimov. Weird. Connoisseur. Accommodate. Hierarchy. Deity. Etiquette. Pharaoh. Teresa. Its. Macdonald. Nielsen Hayden. It's. Fluorosphere. More here.

sled reference on May 14, 2008 5:58 AM

Have you tried out markItUp:

http://markitup.jaysalvat.com/home/

It's an excellent javascript utility that puts a friendlier face on the standard textarea. I use the HTML version to allow my users to enter snippets of XHTML, but it also works with MarkDown, Textile, etc.

I also wanted control over what tags and attributes the users are allowed to enter, so I wrote an extension that validates the XHTML by parsing it against a list of valid tags/attributes (defined in JSON).

Ben Mills on May 14, 2008 5:58 AM

Good for you. I'm sick to death of being coaxed into joining some new forum or social networking site or wiki and finding out that I have to learn a new, totally arbitrary set of rules that are kinda sorta like the ones I already know but not quite, and always getting mixed up between the 8 different markup styles.

On the other hand, I don't think it would hurt you to have a WYSIWYG editor - some developers may know HTML but are very slow typists and having to type HTML would slow them down significantly. Just make sure you leave the option for people to use a normal textarea that's not horribly mangled (Community Server, I'm looking at you).

Aaron G on May 14, 2008 5:58 AM

I'm wondering if you have put any thought into automatically supporting syntax highlighting for code snippets? I've used several sites that do this (forums.devshed.com, for example), and while it's rarely perfect it can really help when reading through posted questions.

Joel Coehoorn on May 14, 2008 6:00 AM

Doesn't the conclusion here go against the reasoning in the "XML: The Angle Bracket Tax" post a couple of days ago? I know you can get away without all the closing tags in HTML, so it is slightly better than XML, but to willfully twist your own words:

1. Should HTML be the default choice? The authors of most styled text entry code would probably say NO to this.
2. Is HTML the simplest possible thing that can work for your intended use? NO.
3. Do you know what the HTML alternatives are? YES
4. Wouldn't it be nice to have easily readable, understandable posts, without all those sharp, pointy angle brackets jabbing you directly in your ever-lovin' eyeballs? Ummm, Yes?

As pointed out above, whatever you do it isn't going to actually be HTML. You're going to have to add your own stuff to it and limit it in some ways. I admit my HTML knowledge is basic, but I've no idea how to enter some syntax highlighted javascript in HTML, but I can manage it in mediawiki syntax.

Chris on May 14, 2008 6:02 AM

Your link to "why doesn't wiki do HTML" is broken. It should be

http://c2.com/cgi/wiki?WhyDoesntWikiDoHtml

Dave on May 14, 2008 6:02 AM

I vote for html format with a WYSIWYG editor, such as one of these:

http://www.geniisoft.com/showcase.nsf/WebEditors

Just because I know html, doesn't mean I always want to type it or any other markup to enter a comment. All of these editors I have looked at supply a Design (lazy) mode and a raw html mode.

Michael Lang on May 14, 2008 6:02 AM

@Jheriko: seriously?! Any *competent* (not "good") programmer must be comfortable with markup. Otherwise it's not, *by definition*, competent programmer.

As for bad programmers out there, they're not stackoverflow.com's target audience and the more of them run away screaming, the better -- their input equals noise and degrades the value of the site for people who actually do have something useful to say.

Peter on May 14, 2008 6:07 AM

You could be different and split the difference. Add a few tags to html, like a <markdown> tag or a tag for the other commonly used internet formatting options, with everything untagged defaulting to a whitespace sensitive version of html (so people don't have to type paragraph tags).

It requires some additional processing work, and that's never fun, but it seems trivial to me to implement and is adaptable to different formats in the future.

Ben on May 14, 2008 6:08 AM

So you are going to use plain-old HTML!

What about new-lines? Are we going to have to type "BR" all over the place?

Also... people are going to want to post HTML code *AS CODE*, without having to type all those escape sequences just to post some example of a DIV that isn't working or whatever.

Of course, all these things COULD be done - we ARE programmers!

But will anyone bother? Or will they just go somewhere else?

Finally, you may be making a mistake in saying that it is "secure, with careful parsing" - this sounds like pride coming before a fall to me!

Syd on May 14, 2008 6:10 AM

An additional consideration to my above thoughts occurred to me. You could add pre-processors to tags for each language <c++>, etc, and they could make an attempt to apply proper indentation and code highlighting that would be more versatile than a language agnostic version.

Ben on May 14, 2008 6:11 AM

The one thing I know that argues against using actual HTML for post styling is if you want people to be able to post markup or even code-- if someone posts an example containing a 'for' loop, the angle brackets can cause all manner of weirdness. And if they try to post some sample HTML, then look out Francis.

So you either have to define a markup pattern that passes through untouched whatever's inside ('code' is a common choice) or else move to something like Textile or Markdown and set it up to encode stuff like angle brackets so it passes through untouched.
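
The first option can be sketched in Python like this, assuming a literal `<code>` element is the pass-through marker; the `protect_code` name is hypothetical.

```python
import re
from html import escape

CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.S)


def protect_code(text: str) -> str:
    """Escape everything between <code> and </code> so it displays literally."""
    return CODE_BLOCK.sub(
        lambda m: "<code>" + escape(m.group(1), quote=False) + "</code>", text
    )
```

With this in place, a pasted `for (i = 0; i < n; i++)` loop or a sample DIV shows up as text instead of being parsed. The obvious gap is sample code that itself contains a closing `</code>` tag, which would end the block early.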

Eric Meyer on May 14, 2008 6:11 AM

I really think you should just use (or build) a good WYSIWYG editor. As a coder I can write in assembly language, but it doesn't mean I want to.

If you want to offer the ability to alter the raw markup then you can give users that option, but I want my editor fully cooked please...

Chris E on May 14, 2008 6:12 AM

I agree with all those that just want to use HTML. It fits the target audience and likely usage model:

Audience = programmers
Usage = occasionally

It's not like I will be on stackoverflow.com for hours on end everyday trying to write programs. There's no need to learn anything extra or new.

Solburn on May 14, 2008 6:16 AM

@Jeff:
> "I'd much rather rely on a subset of trusty old HTML than expend brain cells trying to remember the fake-HTML way to make something bold"

Ironically enough, there isn't a way to make something bold* in (modern semantic) HTML since, as you pointed out in a previous blog entry, HTML is the 'model', not the view :)

* The <b> tag was officially "discouraged" way back in 1999 with HTML4.
The current HTML5 working draft doesn't go as far as deprecating it (yet), but it does say "The b element should be used as a last resort".

Graham Stewart on May 14, 2008 6:17 AM

"Any *competent* (not "good") programmer must be comfortable with markup. Otherwise it's not, *by definition*, competent programmer." - Peter, about five comments ago.

Brilliant, just absolutely brilliant. You couldn't make this kind of ignorant comment up. Reminds me of the Java coders who can't believe there's still a place for C in the world ("It's, like, over a decade old, man! Move on!" - Java School grads, everywhere).

Embedded programmer who's never marked up with anything other than a highlighter pen on May 14, 2008 6:18 AM

Try to make a plain jane table.

In HTML it's sane. In light-weight markup lingo, it sucks; they've tried to reduce the tr and td tags even more; Wiki makes it completely incomprehensible.

The "rich" text editors are rich, but not robust. To write HTML fast, you have tag completion and a suggestion system for the values. RTEs fail on this point.

I find it immensely more pleasing to just get the raw HTML, dump it in my text editor of choice, and copy + paste it back again.

Rob Janssen on May 14, 2008 6:21 AM

Seeing as how most of the potential users of Stack Overflow currently reside at forums, it might make sense to cater to them and use whatever markup language is most prevalent across the inter-tubes. In my travels I've found the most widely used markup is BBCode or HTML.

BBCode is easy to use and offers the potential to add custom tags to allow special functionality. I think it's your best bet.

Matt Briancon on May 14, 2008 6:24 AM

<CleverEndQuote>Again.</CleverEndQuote>

Adam on May 14, 2008 6:24 AM

And you'd be building that in VB.NET Jeff?

Or have you jumped ship to C# now too? Is all that "love" you show about VB.NET just empty talk?

VBMan on May 14, 2008 6:25 AM

I'd still prefer a row of buttons. What, I'm a coder, so on your site I have to hand code because I can? Why not combine them: let me code, or click a button when I'm lazy.

Mike on May 14, 2008 6:26 AM

A problem with HTML is unclosed tags. Leaving, for example, a bold tag open can cause text farther down your page (i.e. your footer) to render in bold. So if you do allow straight HTML, you'll have to create a script which finds and closes any tags left open. Considering all the different types of tags, this is no easy task. I'd recommend Textile or Markdown for this reason.

Chad Braun-Duin on May 14, 2008 6:26 AM
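
A minimal sketch of the "find and close open tags" script mentioned above, using Python's standard html.parser module. The names OpenTagTracker and close_open_tags are illustrative, and the list of void elements is deliberately short. It only appends closers for tags still open at the end of the fragment; it does not repair mis-nested tags in the middle.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a short, illustrative list).
VOID_TAGS = {"br", "hr", "img"}

class OpenTagTracker(HTMLParser):
    """Keeps a stack of tags that have been opened but not yet closed."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def close_open_tags(fragment: str) -> str:
    """Append closing tags for everything still open at the end of the fragment."""
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    return fragment + "".join(f"</{tag}>" for tag in reversed(tracker.stack))

print(close_open_tags("<b>bold and <i>italic"))
```

Even this small version shows why the task gets hard quickly: tables, lists, and optional end tags all need their own rules.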

Darn, was meant to be <CleverEndQuote>Again.</CleverEndQuote>

Adam on May 14, 2008 6:27 AM

@Jeff
> Incidentally, if you haven't ever edited a Wikipedia article, you should. I consider it a rite of passage, a sort of internet merit badge for anyone who is serious about their online presence

I'm not serious about my online presence but I am serious about programming. You've edited wikipedia, I haven't. I know C, you're a web celebrity. Go figure.

Anon on May 14, 2008 6:30 AM

Actually, what I think matters most is if you provide a good, accessible quick reference on the editing page.

It doesn't really matter which language it describes, as long as it's present and small but complete enough. I mean, people will need to look HTML up too eventually, editing your site.

Adam on May 14, 2008 6:30 AM

I agree totally. Of the four markups you presented, the only one that was readable enough that I didn't have to refer back to the rendered version to see what was going on was the BBcode. (For a couple I'm still not sure how the first quoted section's end is delimited). But BBcode is practically html with square brackets, so why bother?

As for developers like Martin who don't know HTML, I'd say you should be prepared to learn. This isn't Swahili we are talking about. Learning a markup language for a real developer should be trivial.

In fact, I'd say lack of basic HTML skills in posts might be a good way to spot the posers.

T.E.D. on May 14, 2008 6:34 AM

It depends on what you want to do.

HTML is fine for just formatting (That's what it is for!) but you then have the problem of cleaning the HTML, filtering broken syntax, and your pages are not a consistent format anymore....

Wiki syntax is more than formatting: it adds meaning to the text, which as a side effect might format it. For example, tables are a standard format, so they can be read and processed by the wiki as data; internal links work both ways automatically; categories/tags aggregate data automatically; etc.

Perhaps you should use XML instead? [The Angle Bracket Tax] ;-)

BTW Internal links in mediawiki are [[article]] or [[article#section]] external links are not much harder [http://otherwebsite.net/Light_Weight.htm] but are deliberately avoided ....

Jaster on May 14, 2008 6:36 AM

I have used freetext (http://freetextbox.com/default.aspx) for a few projects and it has worked well. I think it works for novice and advanced users.

BrianK on May 14, 2008 6:40 AM

I think one of the biggest concerns with allowing raw html is all the crazy things people can put into your website.

XSS, ugly images, ads, annoying ads, spam, and the like.

Special markup has the function of limiting what people can do.

Jeff Davis on May 14, 2008 6:42 AM

so if someone wanted to come to your site to learn more about html they'd be screwed?

is the site intended for 1337 programmers to come and get more 1337 or are you intending on allowing beginners to come and learn too?

you've completely gone off the rails on this one, especially if you're considering writing your own hybrid mark up language.

burnside on May 14, 2008 6:44 AM

I have a function for my forums that strips out <script>,<img>,etc and everything in between the tags. I have a small warning for the user on what tags not to use.

BrianK on May 14, 2008 6:46 AM

> Ironically enough, there isn't a way to make something bold* in
> (modern semantic) HTML since, as you pointed out in a previous blog
> entry, HTML is the 'model', not the view :)

That's a pretty good point. Perhaps a standard style sheet could be set up, which posters could reference?

Then again, what you are *supposed* to be using is tags like em (emphasis) and strong (strong emphasis), and let the user's browser do that however the user wants such things presented (boldface, underlining, big font, yelling the word, whatever).

This is precisely why I don't use WYSIWYG editors for HTML. They invariably have tons of style buttons and almost no proper element buttons. If your development tool completely misses the point of the language, the results can't be good.

T.E.D. on May 14, 2008 6:46 AM

+1 BBCode

easy to parse. easy to remember.

Joe Beam on May 14, 2008 6:47 AM

Stupid me, I meant to list script,img tags etc in my previous post. They got filtered.

BrianK on May 14, 2008 6:47 AM

Keep it simple! Consider the most common use, probably a short post of a few paragraphs, some bold, a link, a code section, and a list. In these cases any of the markup languages result in a much simpler and easier to understand post than would be with HTML.

Yes, with HTML you get the "I can do anything" but don't focus on the edge cases and ignore what people will be using it for 99% of the time. I've been writing HTML since '95 and one place I don't want to see it is in a forum (offhand I can't think of any forums I frequent that actually use HTML).

At this point there may well be more people familiar with the Wiki syntax than with HTML...

Dave on May 14, 2008 6:47 AM

Direct HTML input is the autobahn to invalid XHTML.

http://iamacamera.org/default.aspx?section=develop&id=73

In ten years, we will look back with nostalgia at the days when we left comments on your site via direct HTML input -- the way we fondly recall bygone years when we configured our ISDN modems and put jumpers on hard drives to designate them master/slave.

Direct HTML input is at best, [i]quaint[/i], but by no means a long-term viable solution to online markup entry.

Carl Camera on May 14, 2008 6:49 AM

@James

http://code.google.com/p/syntaxhighlighter/ this JavaScript library seems to be the best way to document code with syntax highlighting, automatic line numbering and copy and paste support. I use it in a lot of my documentation.

Robert S. Robbins on May 14, 2008 6:50 AM

"As for developers like Martin who don't know HTML, I'd say you should be prepared to learn. This isn't Swahili we are talking about. Learning a markup language for a real developer should be trivial."

It may be trivial, but it should also be optional. The interface should never get in the way of usability. Jeff makes that very point himself in "Reducing User Interface Friction" (http://www.codinghorror.com/blog/archives/000866.html):

"Reduced interface friction goes a long way toward explaining the popularity of services like twitter and tumblr. What's the minimum amount of effort a user can expend to produce something? The answer could be a key competitive advantage.

That single input box on the Google homepage starts to look more and more like an optimal user experience. It might be unrealistic to reduce your application's UI to a single text box-- but you should continually strive to reduce the friction of your user interface."

Martin Wallace on May 14, 2008 6:51 AM

Please don't use wikipedia as a model markup language. It's badly defined to the extent that the only 'compliant' parser is mediawiki itself, which consists basically of a long series of regexes. It's a huge shame that one of the largest consolidated sources of information on the web is all written in a language that's extremely difficult to robustly parse.

As far as using HTML goes, it depends on your target audience. For stackoverflow, I would agree, but for more lay-person sites, HTML seems unnecessarily complicated. One forum my wife and I both post on uses a subset of HTML, and I've lost track of how many times I've had to tell my wife what the syntax for links is. "a href" is second-nature to us, but it's not intuitive if you're not already familiar with HTML.

Nick Johnson on May 14, 2008 6:57 AM

I think making it HTML or even subsetted HTML would work well. But my first thought was that people are going to mess up your layout. True, posts or pages should be content-centered, but all the more reason to limit the freedom of users to a small but large enough set of layout items. Look at Myspace! It's eye- and brainhurt, because everybody puts in their own fonts, sizes, colors, background images, etc... Of course you can limit all this, but I think you will have only 5 to 10 remaining html tags, and then consider these in lightweight ML's. Looking up markup syntax is a bitch, but if you put the syntax of the 5 or 10 most used (and probably 95% of the time, are only used) near the editing-field, it doesn't matter if it's html, BBCode, or whatever imo.

Just please don't let people change fonts, font size, add their own smileys, and css mods.

You will have a select crowd of intelligent people writing articles and solutions to problems. But the questions themselves are going to be asked by Co0dingNewb015. Maybe people with a low post count can have even a subset of your subset of layout items.

Now I'm just rambling

Ps: You're right Jeff. Nobody reads 200 blog comments, except you (Podcast 1). I tried to see I wasn't 'double posting' but stopped around 30 something.

joon on May 14, 2008 7:00 AM

I think anyone who calls himself a web developer should be proficient in HTML. Not just good or familiar with it but proficient. Take Visual Studio for example. I see too many developers squeak by working in Design Mode and when work in Design Mode breaks down (as it often does) they are lost in the sea of code in source mode. I don't even use Design mode. I code entirely in source mode. It's a sad state in our profession when a good percentage of developers can't "debug" HTML code. Sorry about the detour.

BrianK on May 14, 2008 7:02 AM

Whatever markup you go for, please make sure you only offer a limited but useful set of formatting/style tags.
One of the problems with sites that have a lot of user formatted content is that they have a horrible inconsistent mix of styles, layout and structure that makes flicking through the site a constantly jarring experience.

Things I'd want when posting a question/answer/article:

- the ability to include real source code (without having to alter it to remove HTML characters etc) AND have all my formatting/indenting preserved AND have the code automatically coloured in the post.

- include images/diagrams (without having to host them myself on some other site).

- link to other articles on the same site and to relevant external sites (e.g. Sun, MSDN, W3c etc).

- attach example source files.

Graham Stewart on May 14, 2008 7:08 AM

(My point being that none of the goals above are entirely satisfied by HTML, or most of the other simple markups)

Graham Stewart on May 14, 2008 7:09 AM

See reddit's comment box. Little expandable notes on how to use markdown (very handy as a reference when you forget something). And as someone mentioned near the top, the official markdown engine supports html tags. Best of both worlds.

dude on May 14, 2008 7:09 AM

I agree with sticking with HTML, if you aren't going to toolbar the text widget.

One good example is Lifehacker (and probably other Gawker blogs). They have a live comment preview system that uses HTML markup. Nice.

Otherwise, do it with less friction. Give me some toolbar icons.

piyo on May 14, 2008 7:11 AM

As others have said, if source-code is going to be included in messages inline, and I think this is a highly desirable feature, raw HTML is not a good choice, for three reasons:

1. You have to escape special characters, which means at the very least splattering &lt; and &amp; (or should that be &amp;lt; and &amp;amp;?) everywhere.
2. You need explicit <br> or <p> tags. (Again, should I write &lt;br&gt; and &lt;p&gt;?)
3. You need painful contortions of &nbsp; &nbsp; to get indentation right. (&amp;nbsp;?) It's annoying to read C++ without indentation, but it's generally impossible to try to guess what Python code with the indentation stripped out is supposed to be.

All of this makes it hard to paste in source code, and hard to edit it in-place. Even if you allow <pre> tags, it makes it pretty nasty to embed HTML code which might contain </pre> somewhere.

Maybe you should go for 78 columns in a monospace font for everything. ;)

Weeble on May 14, 2008 7:14 AM

Meh. I vote for pre tags around the lot, and autolinking of urls. Plaintext is THE humane WYSIWYG markup language.

james on May 14, 2008 7:17 AM

I've done sites with comments allowed in HTML. It works well; the only issue I run into is that people like to do things to screw up your layout when they leave table tags/divs open (so you should probably set up some system to make sure their tags are closed), and I've also run into issues where spammers put in Javascript redirects or popups. So HTML isn't perfect either.

The traditionally [B]Bulletin Board[/B] format is also widely known so that it won't take a user looking up things. Or limit the HTML a user can use.

Kris on May 14, 2008 7:17 AM

Jeff,

If you want a fun challenge, figure out how to make the form input color-coded a'la Visual Studio or Expression Web. HTML is easier to read and write with all the blue and red tags.

Zack on May 14, 2008 7:20 AM

Flickr allows simple HTML tags such as:
<a href="URL">link</a>
<strong>strong</strong>
<b>bold</b>
<blockquote>blockquote</blockquote>
<em>emphasis</em>
<i>italic</i>
<img src="URL">
<u>underlined</u>
<s>strike</s>
<del>deleted</del>

Ali Karbassi on May 14, 2008 7:23 AM

I read through maybe the first 15 comments which were mostly anti-HTML (to some extent), so I'll chime in some encouragement. HTML makes a _lot_ of sense for your purposes, and all these esoteric things are quite annoying in the end. (I have on two separate Wiki systems inadvertently created links to nonexistent pages just by using formatting marks that seemed innocuous at the time, for one example. In general, remembering _which_ fake-HTML the current textbox wants is the problem.)

I'm a big fan of just saying "these are the tags I want to allow," then maybe extending them with extra attributes or use cases as needed (e.g. <a page="lightweight markup language">LWML</a> or <a>lightweight markup language</a>). No need to have two syntaxes floating around (HTML + Markdown, I guess, is popular with lots of people.) Making sure the input comes out as well-formed XHTML is a solved problem, to be sure.

Domenic Denicola on May 14, 2008 7:36 AM

Wow, uh, your comment box strips out angle-bracketed phrases, instead of passing them through. Well, here's an ironic rephrasing of the first sentence of my second paragraph...

I'm a big fan of just saying "these are the tags I want to allow," then maybe extending them with extra attributes or use cases as needed (e.g. [a page="lightweight markup language"]LWML[/a] or [a]lightweight markup language[/a]).

Domenic Denicola on May 14, 2008 7:38 AM

I'm a fan of Markdown myself. It's easy to learn and already accepts a lot of conventions that pre-date HTML (like asterisks and underscores). I *know* HTML, but that doesn't mean I want to use it. In fact, HTML is so annoying to type, I would rather use a graphical editor and clean up any mistakes afterwards.

Also, I agree with those who said:

1. Not all programmers do any sort of markup. You should offer a graphical editor, and the option to turn it off.

2. If you offer any sort of HTML, it has to be a small subset.

Rhywun on May 14, 2008 7:42 AM

My websites use BBCode because the module I use for forums supports that. I never quite understood why BBCode because you end up using much of the same syntax as HTML except you use square brackets instead of angle brackets.

<b>Bold Phrase</b>
[b]Bold Phrase[/b]

What's the difference? Why create an entirely new syntax when one is already available and well documented?

Textile was written (and it isn't from troff!) with the idea that marked up text should be readable as plain text. Underline (now italicize) by putting underscores around something. Bold by putting asterisks around it. Make a list by putting asterisks in front of each line. Simple to understand, clean, and easy. Unfortunately, not very powerful.

I personally prefer to enter things in HTML. I know it, and I don't find it all that unreadable. What I really can't stand is each site having different standards. I don't mind learning something, but I hate learning to do the same thing dozens of different ways. HTML is standard and that's good enough for me.

My suggestion: A modified HTML. One where you don't put <p> for paragraph breaks and things in the format of http://xxxx.yyy or xxxxx@yyyy.zzz are automatically linked. But at the same time it allows you to add a bit of HTML for the more complex stuff.

That way, you can type your entire comment without a lick of markup code, but if you know you want to emphasize a word here or there or add a link, you know how to do it. That'll satisfy everyone.

David W. on May 14, 2008 7:43 AM

I was researching on the exact same topic today for my project and I've chosen markup specification from

http://www.wikicreole.org/

I particularly liked their reasoning, and I think the father of the wiki is behind it too.

lubos on May 14, 2008 7:47 AM

+1 markdown or wysiwyg.

XML derivatives were made for ease of parsing, not ease of use. The rule [Don't make me think] is superseded by [Don't make me do extra work]. Of course, optimally you'd just give a wysiwyg textbox with options to switch to markdown view. Just as 90% of your readership knows/should know HTML, 98% of your readership knows/should know how to use a wysiwyg text editor. Even if somehow 90% of your readers knew or should know emacs, it doesn't give you license to require knowledge of emacs commands at stackoverflow.

Of course, I completely understand if Joel completely overrode your objections to build a new markup language that cross-compiles to VBscript, javascript, PHP, XHTML, Markdown, ARM, SPARC, and is hot-pluggable as a Linux kernel module. Otherwise, I might have just heard 50 thousand heads exploding in the distance.

Jimmy on May 14, 2008 7:52 AM

The more I think about it the more I think a basic WYSIWYG editor is the only real way to go.

It requires minimal thought to use and allows you to properly support the various features needed for a useful coding site (e.g. press the "Insert Code Block" button and get prompted to select which language it is, so that syntax-colouring can be applied)

Graham Stewart on May 14, 2008 7:55 AM

I think it's a good idea to provide options. There's no guarantee that EVERY user of stackoverflow is going to be comfortable using HTML, especially if they just want to write a quick post. For example, even if you enforce HTML, then parse newlines as BR and P automatically; don't make me think.

Let users set up their markup preference in their profile, be it HTML, BBCode, Markdown, whatever.

I also agree that colored syntax for code blocks is a great idea, since the whole focus of this project is on code.

Erick on May 14, 2008 7:55 AM

First, why do you assume every programmer is familiar with HTML? Your site will appeal to a WIDE range of developers who may never have written HTML.

Second, HTML is very broad. Do you really want your users entering inline styles? Reusing your parent CSS classes? re-arranging your layout with relative and absolute positioning? You certainly don't want users to enter javascript of any kind.

Third, can you really whitelist HTML? Can you deal with all the clever XSS hacks (http://ha.ckers.org/xss.html)? If so, you have crippled HTML to the point that it resembles lightweight markup, except your users won't know in advance which parts of HTML will work.

I would love to offer a WYSIWYG editor + friendly editable markup that doesn't open up big XSS holes. If you make that work with HTML please let us know how you did it.

Mark Porter on May 14, 2008 7:56 AM
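
To make the whitelisting question concrete, here is a rough sketch of the "accept only known good" idea using Python's standard html.parser. The ALLOWED table, the SAFE_SCHEMES tuple, and the sanitize function are hypothetical names chosen for illustration. This is not a complete XSS defense; it only shows the shape of the approach: rebuild the output from scratch, keeping listed tags and attributes and escaping all text.

```python
import html
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attributes allowed on that tag.
ALLOWED = {"b": set(), "i": set(), "em": set(), "strong": set(),
           "code": set(), "pre": set(), "blockquote": set(), "a": {"href"}}
SAFE_SCHEMES = ("http://", "https://")

class WhitelistSanitizer(HTMLParser):
    """Re-emits only whitelisted tags and attributes; all text is escaped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself; its text content is kept, escaped
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # drops onclick, style, and anything else unlisted
            if name == "href" and not value.lower().startswith(SAFE_SCHEMES):
                continue  # rejects javascript: and other unlisted schemes
            kept.append(f' {name}="{html.escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(html.escape(data, quote=False))

def sanitize(fragment: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)

print(sanitize('<b onclick="x()">hi</b><script>alert(1)</script>'))
```

The point about users not knowing which parts of HTML will work still stands: whatever the whitelist contains has to be documented next to the edit box.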

Can't we just type all our comments in Wasabi?

Martin Wallace on May 14, 2008 7:57 AM

Cool, let's make the users generate their own POST command too.

Enabling HTML editing is great, but requiring it for simple formatting just adds friction to the communication process. +1 for the GUI editor.

Kevin Dente on May 14, 2008 8:00 AM

I use Markdown. Clean syntax, particularly for linking, and it gives me the freedom to use HTML if I want. Works great for me.

Having done a fair amount of Wiki work, I absolutely hate how MediaWiki formats tables, though I find most of the rest of its syntax at least tolerable.

Markdown is, in my opinion, the best compromise between light-weight formatting, and the raw power of HTML.

Jeff Craig on May 14, 2008 8:00 AM

Imagine you want code-coloring. So instead of

<source lang=qbasic>
10 PRINT "I ROCK AT BASIC!"
20 GOTO 10
</source>

you have to write

<pre>
<span class="codeLineNumber">10</span> <span class="codeStatement">PRINT</span> <span class="codeString">"I ROCK AT BASIC!"</span>
<span class="codeLineNumber">20</span> <span class="codeStatement">GOTO</span> <span class="codeNumber">10</span>
</pre>

?

That's ugly.

Matthias on May 14, 2008 8:02 AM
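
The span markup in the comment above is the kind of output a server-side highlighter would generate from the short form, so nobody has to type it. A toy sketch in Python, assuming the same CSS class names and a two-keyword subset of BASIC (the TOKEN pattern and function names are illustrative, not a real BASIC grammar):

```python
import html
import re

# Alternatives are tried in order; the group name doubles as the CSS class.
TOKEN = re.compile(
    r'(?P<codeLineNumber>^\d+)'
    r'|(?P<codeString>"[^"]*")'
    r'|(?P<codeStatement>\b(?:PRINT|GOTO)\b)'
    r'|(?P<codeNumber>\b\d+\b)'
)

def highlight_line(line: str) -> str:
    """Wrap each recognized token in a span; escape everything else."""
    out, pos = [], 0
    for m in TOKEN.finditer(line):
        out.append(html.escape(line[pos:m.start()], quote=False))
        text = html.escape(m.group(), quote=False)
        out.append(f'<span class="{m.lastgroup}">{text}</span>')
        pos = m.end()
    out.append(html.escape(line[pos:], quote=False))
    return "".join(out)

def highlight(source: str) -> str:
    """Highlight a whole listing and wrap it in <pre>."""
    body = "\n".join(highlight_line(line) for line in source.splitlines())
    return "<pre>\n" + body + "\n</pre>"

print(highlight('10 PRINT "I ROCK AT BASIC!"\n20 GOTO 10'))
```

With something like this behind a language-tagged code block, the author writes the two-line listing and the site produces the spans.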

One benefit of using something like Markdown is you automatically get things like escaping and potentially code coloring, which is arguably a very important aspect for stack overflow. I personally use reStructuredText for most my HTML editing because it takes care of the HTML aspects for me such as escaping XML and coloring code.

Eric Larson on May 14, 2008 8:03 AM

WTF? Do I have to use &lt;???

Matthias on May 14, 2008 8:04 AM

We use MediaWiki in our internal Intranet, and we found that the Wiki Syntax is hard for non-technical users, but technical users usually "got it" after a week or so. I think it's one of the cleanest syntaxes, because of its headings (==), its tables ({|bla) and its lists (* ).
BBCode is a bad solution for a nonexistent problem in my opinion, as it is essentially HTML with square brackets.

Bare HTML works fine, but keep in mind that there are multiple ways to do lists.
<ul>
<li>Bla
<li>Blu
</ul>

works, but without the closing </li> tags, you are not XHTML Compliant anymore. You could either:
* Live with that
* Write a parser that tries to fix that, with all the bug testing and fixing that goes along with that
* Use another syntax

It should be noted that Wiki Syntax != Wiki Syntax. Pretty much every Wiki Software has its own Syntax, that is not 100% compatible with other Wiki systems.

Markdown looks like my favorite: It exactly does what is needed, with an intuitive syntax.

Michael on May 14, 2008 8:10 AM

YES! HTML markup is king! Don't make us learn another markup language! Everyone who disagrees with you (and me) is crazy and/or an idiot!

Peter on May 14, 2008 8:12 AM

I'm less worried about bold and italic text than for code, I would love to see some code coloring (keywords like int in different color for example), but that's a lot of work, but it will be sweet.

Juan Zamudio on May 14, 2008 8:14 AM

My 15 year old learned HTML for his MySpace page.

Charles on May 14, 2008 8:15 AM

Jeff, why don't you use the wiki technology of fogbugz?

Eduardo Diaz on May 14, 2008 8:16 AM

now i'm really confused, XML has angle bracket tax and HTML doesn't. not only that but, as I type this I get: "Your comments: (no HTML)". Hm?

:/

/mp

Mauricio Pastrana on May 14, 2008 8:16 AM

Sorry, Jeff. I have to call shenanigans.

In "The Angle Bracket Tax," you had fairly harsh words about working with tags within a human-read document. You pointed out how XML tags can degrade a document's readability, because they add extraneous noise around the text. You also envisioned an ideal world where the tags are hidden, created and managed in the background.

Fast forward to today and you appear to say the exact opposite, only we're talking about HTML instead of XML. The loss in readability is now worth it because the layout becomes much more precise.

You were pining after interfaces that hide tags a few days ago. In the XML argument, you said "You might argue that XML was never intended to be human readable, that XML should be automagically generated via friendly tools behind the scenes, never exposed to a single living human eye. It's a spectacularly grand vision." If I replace the word "XML" with "HTML", your vision becomes reality, as there are countless WYSIWYG HTML editors on the market today. But today's post puts you firmly in the camp of inline markup editing.

Personally, I prefer inline editing to WYSIWYG, and XML over fancy, fuzzy markup replacement. I also think that XML is a wonderful way to facilitate communication among disparate systems. It may not have been the original intent, but sure as hell is an awesome side-benefit.

I think you agree with me, but first you need to clarify your position.

Frank on May 14, 2008 8:17 AM

This, written by Jeff just this week about XML. Somehow, I don't see how requiring HTML will escape this criticism either.

"Wouldn't it be nice to have easily readable, understandable data and configuration files, without all those sharp, pointy angle brackets jabbing you directly in your ever-lovin' eyeballs?"

K|O|G|I on May 14, 2008 8:18 AM

"If you can sling code, a little bit of presentation markup is child's play."

Clearly you don't play around on forums very much. You're delusional if you think your site will be primarily good programmers. It will be 10% good programmers and 90% noobz and script-kiddies like everywhere else.

Use BBCode, HTML, or whatever, but don't expect the users to understand it. Personally, I don't care - I must know 50 different markups used on different forums - you just figure it out, and if you're not smart enough, you don't.

Jasmine on May 14, 2008 8:18 AM

Yes, you should use HTML for stackoverflow. I'm not sure if it's the best choice for CMSs in general, but for programmers it is the better choice. While I think something like Haml (http://haml.hamptoncatlin.com/) would be fun and interesting, HTML would provide the perfect barrier to entry: not too easy and not too difficult. Like you said, programmers should know it.

It is extremely annoying when I enter a comment somewhere, include an HTML link, and the comment is rendered with the href value as a link and the other HTML converted to angle brackets and crap in the comment. It's made worse when I can't edit it.

Lance Fisher on May 14, 2008 8:19 AM

Too funny. Your argument in favor of ubiquity and convention was exactly my point against your argument yesterday in your anti-XML post.

dinah on May 14, 2008 8:23 AM

HTML is fine by me. If you don't know it, now is as good a time as any to learn it.

PaulG. on May 14, 2008 8:25 AM

Whatever mark-up you go for you should also allow HTML, if just to accommodate those nice IDEs which allow you to copy code as HTML (automatic highlighters suck).

[ICR] on May 14, 2008 8:28 AM

Seriously... how advanced are the comments you normally write on a forum?

I never, ever, use any more than these(bb-code).
[b]
[img]
[url] (often automatically generated from correct urls)
[code]
[quote]

For these simple things html is overkill. First of all you would have to create a huge whitelist; the simple [b]-tag can be written in a hundred different ways using html. A whitelist for css would be even harder to write; imagine parsing font-size:9999 etc.

Secondly, the code-tag usually does server-side syntax highlighting, and the same goes for quote-tags, which can be used to link to the original message. Doing this with classic html-tags would be really confusing.

Syntax highlighting is also a (the only) good reason to use WYSIWYG-editors; these usually(?) allow you to paste pre-formatted text directly from your IDE (at least the one on the msdn-forum does this, even though that editor sucks in a thousand other ways).

crazy ivan on May 14, 2008 8:31 AM
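
For the five tags listed above, a BBCode renderer can be quite small. The sketch below is a naive regex-based illustration in Python (the bbcode_to_html name and the tag-to-element mapping are assumptions); it escapes all HTML first, so only the tags it knows about ever produce markup. It does not handle nesting rules, such as leaving [b] untouched inside [code].

```python
import html
import re

# Hypothetical mapping from BBCode tag to the HTML element it renders as.
SIMPLE_TAGS = {"b": "strong", "code": "code", "quote": "blockquote"}
URL_BODY = r"(https?://[^\s\[\]]+)"
FLAGS = re.DOTALL | re.IGNORECASE

def bbcode_to_html(text: str) -> str:
    """Convert a tiny BBCode subset to HTML; everything else is escaped text."""
    out = html.escape(text, quote=True)  # escape first, so no raw HTML survives
    for bb, tag in SIMPLE_TAGS.items():
        out = re.sub(rf"\[{bb}\](.*?)\[/{bb}\]", rf"<{tag}>\1</{tag}>", out, flags=FLAGS)
    # Only http(s) URLs are accepted; anything else stays as literal text.
    out = re.sub(rf"\[url\]{URL_BODY}\[/url\]", r'<a href="\1">\1</a>', out, flags=FLAGS)
    out = re.sub(rf"\[img\]{URL_BODY}\[/img\]", r'<img src="\1" alt="">', out, flags=FLAGS)
    return out

print(bbcode_to_html("[b]bold[/b] and [url]http://example.com/[/url]"))
```

Because the tag set is closed, the whitelist problem described above mostly disappears: there is no attribute syntax to police.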

Jeff,

What you describe sounds very similar to what Dan Brettle has written in NeatHTML. Have you heard of this?

From Dan's description:
"NeatHtml™ is a highly-portable open source website component that displays untrusted content securely, efficiently, and accessibly. Untrusted content is any content that is not trusted by the website owner. Typical examples include blog comments, forum posts, or user pages on social networking sites. NeatHtml uses an “accept only known good” (whitelist) approach to security to help prevent attacks which are not yet known."

You can read more about it @ http://www.brettle.com/neathtml

I think he strikes it right on the nail. Allow use of HTML but keep it safe.

CyteShoppe on May 14, 2008 8:33 AM

I'm with you 100% on scrubbed HTML. It's easiest to implement /and/ explain ("You can use HTML.") Most novices already know HTML. It's like learning your ABCs these days.

If you don't know HTML, does that really matter? Seriously now. People can read plain text just as well. This will be a wiki fer chrissakes. If your plain text is /that/ much of an eyesore, the other 1337 HTML h4xx0rz can pretty it up.

Furthermore, you *learn* before you *teach*. If you don't know HTML, fine. You can learn it by, I dunno, /using the site/. Read the relevant HTML literature, which is sure to be present.

Chuck Rector on May 14, 2008 8:34 AM

My vote would go to HTML with a couple of minor extensions that would handle most comments with no markup at all. First, treat a blank line as an implicit paragraph boundary. Second, treat an unadorned URL as a link.

To avoid problems with parsing more complex HTML, these extensions should only be active at the root level and deactivated inside any open HTML tags.

Beyond this, any markup system you choose will require users to type something. HTML is much more widely understood than most of the alternative markup languages - especially amongst programmers.
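
The two extensions described above could be sketched roughly as follows. This is only a minimal illustration in Python, assuming the whole comment is root-level plain text: the function name `render_plain_comment` and the URL pattern are made up for the example, and the "deactivate inside open HTML tags" rule is not implemented (any markup is simply escaped).

```python
import html
import re

# Hypothetical URL pattern: http(s) links, not ending in sentence punctuation.
URL_RE = re.compile(r'https?://[^\s<>"]+[^\s<>".,;:!?)]')


def _link_urls(paragraph):
    """Escape a paragraph and wrap each bare URL in an <a> element."""
    parts = []
    pos = 0
    for match in URL_RE.finditer(paragraph):
        # Text before the URL is escaped as ordinary content.
        parts.append(html.escape(paragraph[pos:match.start()], quote=False))
        url = html.escape(match.group(0), quote=True)
        parts.append(f'<a href="{url}">{url}</a>')
        pos = match.end()
    parts.append(html.escape(paragraph[pos:], quote=False))
    return "".join(parts)


def render_plain_comment(text):
    """Treat blank lines as paragraph boundaries and bare URLs as links."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "\n".join(f"<p>{_link_urls(p)}</p>" for p in paragraphs)


if __name__ == "__main__":
    print(render_plain_comment("First line.\n\nSee http://example.com for more."))
```

A comment with no markup at all then comes out as well-formed paragraphs with clickable links, which is the case this proposal is aiming at.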

Stephen C. Steel on May 14, 2008 8:34 AM

I think, as others have suggested, that you have stumbled upon some of the benefits of XML in thinking about markup that you seemed to overlook when discussing XML in the last entry. The existing tools and standards make very quick work of the types of things you want to do: white list certain tags, validate input, make sure it's well-formed, etc. Just write a simple XML Schema(or DTD or RELAX NG) and validate the input against it.
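
As a rough illustration of that idea, here is a sketch in Python. It does not use a real XML Schema, DTD or RELAX NG validator; it stands in for one by parsing the input with the standard library's ElementTree (which checks well-formedness) and comparing tag names against a hypothetical whitelist. The `ALLOWED` set and the function name are assumptions made for the example, and attributes are not checked at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist of element names a comment may contain.
ALLOWED = {"p", "b", "i", "em", "strong", "code", "pre",
           "a", "ul", "ol", "li", "blockquote"}


def validate_fragment(fragment):
    """Return a list of problems; an empty list means the fragment passed."""
    try:
        # Wrap the fragment so that several top-level elements are accepted.
        root = ET.fromstring(f"<root>{fragment}</root>")
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    problems = []
    for element in root.iter():
        if element is root:
            continue
        if element.tag not in ALLOWED:
            problems.append(f"tag not allowed: <{element.tag}>")
    return problems


if __name__ == "__main__":
    print(validate_fragment("<p>Hi <b>there</b></p>"))
    print(validate_fragment("<p>unclosed"))
```

One catch with this approach: input has to be strict XML, so a bare `<br>` or an HTML entity such as `&nbsp;` is rejected as not well-formed.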

Mike Ivanov on May 14, 2008 8:34 AM

What about consistency? If you use something like Markdown, every title, list and emphasized text will look the same. If you allow HTML, you're going to have bold titles, italic titles, different types of headers, maybe a few font tags and whatnot; all kinds of lists, and all kinds of emphasized texts.

I used to have my own markup language, but I switched to Markdown for all of my projects.

LKM on May 14, 2008 8:34 AM

Add another solid vote for Markdown. It mimics what I'd naturally do for formatting in a text-only document (except for the headings bit, but that's rarely needed in a Q&A forum in any case). Plus, if you can't think of how to do something, HTML syntax is fully allowed as well.

I don't know about you, but the main formatting I'll ever do in a page are:

*mildly* emphasized text
**strongly** emphasized text
* Bullet lists
1. Numbered lists

The only time I hit the Markdown manual after discovering it the first time was to confirm that it really was as easy and intuitive as that. And the headings bits :)

HTML is obviously the lingua franca of the web, but that doesn't make it easy to read. If I want to read content embedded in HTML, I put it into a browser. If I want to write content embedded in HTML, I write it in markdown (multimarkdown, actually, which is a minor variant on markdown) then paste the generated HTML into the web page. HTML is good for doing all the other ancillary stuff around the content, but always gets in the way of the content itself.

Of the options presented here, Textile and Markdown are the most transparent markups, IMHO.

The only thing I'd add is that, please, as is obvious in the broken discussion about XML, make sure you just escape the HTML of unrecognized tags, not filter them out!
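
To make the escape-versus-filter distinction concrete, here is a minimal Python sketch. It escapes everything first and then restores only a small, hypothetical set of attribute-less tags, so unrecognized tags (and things like `List<String>`) stay visible as text instead of vanishing. The tag list and function name are assumptions for illustration; tag balancing is not handled.

```python
import html
import re

# Hypothetical whitelist; only bare tags without attributes are restored.
ALLOWED_TAGS = {"b", "i", "em", "strong", "code", "pre", "blockquote"}

# Matches an escaped, attribute-less opening or closing tag.
ESCAPED_TAG_RE = re.compile(r"&lt;(/?)([a-zA-Z][a-zA-Z0-9]*)&gt;")


def escape_unknown_tags(text):
    """Escape all markup, then un-escape only the whitelisted tags."""
    escaped = html.escape(text, quote=False)

    def restore(match):
        slash, name = match.group(1), match.group(2).lower()
        if name in ALLOWED_TAGS:
            return f"<{slash}{name}>"
        return match.group(0)  # leave unknown tags escaped, not removed

    return ESCAPED_TAG_RE.sub(restore, escaped)


if __name__ == "__main__":
    print(escape_unknown_tags("<b>bold</b> and List<String>"))
```

The point of the design is that nothing the user typed is ever silently dropped; the worst case is that a tag shows up literally in the rendered comment.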

Tom Dibble on May 14, 2008 8:35 AM

Also, I'd strongly advise you to not do Yet Another Tweaked HTML Version. Going into a forum which speaks in HTML you have to read the manual every single time: are double-line-breaks automagically converted to <p>? Are stray tags automagically escaped? How does this particular site support quoting?

Going into a forum which speaks (ick!) BBCode the specifics are generally assured (although the quoting syntax changes inexplicably).

Going into a forum which speaks markdown or textile, I know precisely what I'm getting.

Remember: your site will not likely be the only one people type in throughout their day. Make the experience adhere to a common standard. Your users will thank you.

Tom Dibble on May 14, 2008 8:40 AM

I'm glad you're looking for alternatives to wikipedia and bbcode markup, but I'm not sure Joel Spolsky's Wikipedia page in basic HTML would be any less intimidating.

"<p>Spolsky grew up in <a href="/wiki/Albuquerque%2C_New_Mexico" title="Albuquerque, New Mexico">Albuquerque</a>, <a href="/wiki/New_Mexico" title="New Mexico">New Mexico</a>"

Using html as the editor language doesn't change the need for a 'help' page, at the very least showing allowable tags (<script> not being one of them), which means I'd need to look up what I can/cannot use.

If I wanted to make your basic html example, is that a <div> or a <span> being used for the code snippet? What's its id or class? As a programmer, I don't think I'd necessarily be sure on the details. Does <p> need a closing </p>? How about <li> inside <ul>? Should it be <br> or <br />? Is it <b> or the css font-style: bold;?

<p>I think for a wiki, you don't want more than one way to do things. Should the heading be <h2>, or will it end up being <font size=+5> half the time? Are you sure tables are only used for tabular data and not as a layout format everywhere? Policing content seems bad enough, I'd hate to have to police meta-content as well.

<p>Finally, for each paragraph in basic html, I need to begin it with a <p>? That's terrible for writing flow and thought process, but maybe that's just me.

Samson Yeung on May 14, 2008 8:42 AM

Still rebuilding Code Project, Jeff? Seriously, we have thousands of regular posters, some of whom have posted thousands of messages over the last nine years, and some who've posted tens or even hundreds of good articles. Yes, there's some dross and we're trying to filter out some of the crap before it gets posted; we're finding that traditional editors don't scale up to this level of activity, so we're allowing long-serving members to take a first pass on the article queue.

CP uses HTML for its articles and forum posts. Over the years the blacklist of disallowed HTML has tightened as people have abused it, but generally the model has been 'trust the poster'. It may now have changed to a whitelist in the ASP.NET rewrite, I don't know.

Mike Dimmick on May 14, 2008 8:47 AM

I think you are making an assumption that everyone who goes to stackoverflow and wants to comment knows HTML.

I am a DBA but read your site quite a bit and don't know a single command in HTML. I know many developers in large companies (I work for a bank) that never use HTML either, so why not make it easy on all of us with a "friendly-but-irritating HTML GUI browser layout control"? But you can use html if you want to.

CHOICE IS GOOD!!

-jfc-

JFC on May 14, 2008 8:49 AM

OMG... are you building StackOverflow by committee?

Just build it already!

j/k :)

Jonny on May 14, 2008 8:51 AM

Markdown looks easier on the eyes, but I'd need a cheat sheet handy for a while since I've never actually had to use it.

With HTML I have only one hesitation: embedded > < and &. Even programmers forget to encode these (or find them all) and if code samples are going to be a frequent occurrence, then mistakes are going to happen. A lot.

You could help this by making "<code>" tags or something which will ignore *ALL* markup between them. Assume that everything surrounded by <code> is to be taken literally.

Clinton Pierce on May 14, 2008 8:52 AM

"Presentation markup" is an oxymoron. Markup is for tagging content to capture meaning, not style. If you want to give users control over presentation, then you don't want a markup language, you want a formatting language.

Modern HTML is primarily a markup language. As HTML evolved there was a push to get away from presentation and back to pure markup. Bold and italic tags persist for compatibility reasons, but in an ideal world, they'd be history. HTML is the wrong choice for specifying presentation.

Most text doesn't require any styling. Fancy formatting can enhance text, but it shouldn't be necessary to express the idea. Perhaps the best solution is not to give the user any ability to control the presentation, hoping that they'll instead focus on the content. Barring that, I'd go for some very simple formatting conventions.

Adrian on May 14, 2008 8:54 AM

Of course, my previous comment was mangled because I assumed the "no HTML" instruction meant that if I used HTML-like things in my text they'd be ignored. Text is text, right?

UI FAIL.

Embedded <, > and & are going to cause all kinds of problems. Perhaps a <code> tag that causes everything inside it to be automagically encoded. Because even good programmers will make mistakes and forget to encode characters, or just not find them all in their code sample.
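
A minimal Python sketch of that idea, assuming a literal `<code>...</code>` pair marks the region to be taken verbatim (the function name is made up for the example):

```python
import html
import re

# Non-greedy match so each <code>...</code> pair is handled separately.
CODE_BLOCK_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL | re.IGNORECASE)


def escape_code_blocks(text):
    """Encode <, > and & inside every <code> pair; leave the rest alone."""
    def encode(match):
        body = html.escape(match.group(1), quote=False)
        return f"<code>{body}</code>"

    return CODE_BLOCK_RE.sub(encode, text)


if __name__ == "__main__":
    print(escape_code_blocks("Use <code>if (a < b && c > d)</code> here"))
```

The obvious limitation is that a code sample containing the text `</code>` itself ends the block early, so a real implementation would need some way to quote the closing tag.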

Clinton Pierce on May 14, 2008 8:55 AM

I support your idea of using HTML. A rather simple white-list of accepted tags (dismissing all others) and a help-page listing them all should be good enough. Instead of the typical WYSIWYG-toolbar you could provide a small list of allowed tags for quick reference on what works and what doesn't.

Additionally, you can provide predefined formatting styles via CSS classes and IDs that you allow people to use (with example on the aforementioned page on what they look like), again dismissing any other classes or IDs, as they are going to use inline-CSS anyway, if not just basic tags, hehe.

About the issue with annoying paragraph-tags, think of blogger.com; a carriage return in user input is transformed into a paragraph-tag and, when editing an entry, re-transformed into a carriage return. Now that is user-friendly. :)

Mephane on May 14, 2008 8:59 AM

Another recommendation for Markdown. As others have noted, the ability to include HTML is a huge bonus.

Also check out PHP Markdown Extra:

http://michelf.com/projects/php-markdown/extra/

some very nice additional features including Fenced Code Blocks which will be handy for stackoverflow.com.

Go on Jeff - give Markdown some prominence on stackoverflow.com - maybe it will start to gain critical mass. Your readers helped choose the name for the new site - maybe you could poll us for a choice of comment markup?

Tom A on May 14, 2008 9:01 AM

People should not have to write in HTML, but they should have the option to edit the HTML.

There are plenty of good WYSIWYG editors. Keep life simple.

Steve on May 14, 2008 9:04 AM

HTML shouldn't be used for anything like this. Lightweight markup languages let the users concentrate on the content and not on the syntax. Although this should be a programmers' site, that doesn't mean one shouldn't care about usability. And by the way: I'm pretty sure that there are some excellent C++ or Python programmers out there who have little to no experience with HTML. Just my two cents.

Florian Potschka on May 14, 2008 9:09 AM

"Flickr allows simple HTML tags such as:
<a href="URL">link</a>
<strong>strong</strong>
<b>bold</b>
<blockquote>blockquote</blockquote>
<em>emphasis</em>
<i>italic</i>
<img src="URL">
<u>underlined</u>
<s>strike</s>
<del>deleted</del>

Ali Karbassi on May 14, 2008 07:23 AM"

That, plus a '[code]' tag should be more than sufficient for everyone's needs. Who the hell uses HTML tables in comments? And if you forget anything, you can just View -> Source to remind yourself. ;-)

As for Wikipedia's markup:-- the only reason Wikipedia is so well organized is because of the constant layout editing of a few, hardcore users. Remember when the Wikipedia founder said that all those edits were for content, not layout? Meaning that around 500 users were inputting nearly all the information into Wikipedia! Hah...

P.S: doesn't having "orange" as a constant undermine the purpose of a CAPTCHA?

transciber on May 14, 2008 9:11 AM

I've got to add my voice to the "no HTML as default" crowd. I'm not a "real" developer, so maybe I'm outside your target demo, but I follow codinghorror pretty religiously, and several other developer-oriented sites as well.

HTML is just clunky. I don't see how 7 keystrokes (with repeated press/release on the shift key) to bold something vs. 1 ctrl-B keystroke is defensible. I can code it, sure, but I sure don't like to. (BBCode is scarcely better in this regard.)

Posting to forums is about speed, not precision. I can count the times I've needed to add a table to a forum post on one hand. But boldface? Italics? All the time. HTML makes me pay a hefty toll on the roads I drive every day to subsidize that bridge I only cross once in a blue moon.

Personally, I like Textile. It's based on the ersatz formatting people used for years in plain text e-mail, so it's pretty familiar. Specifying link aliases is dead simple. And it's got the edge in speed. 2 unshifted keystrokes to bold or italicize text is a reasonable compromise in a plain text editor. And that's what I'm doing 90% of the time in a forum post. (I also like TiddlyWiki's code formatting token - three open braces on their own line to start, three close braces on their own line to end. Nice and quick.)
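
The three-brace code token mentioned above is simple enough to sketch in a few lines of Python. This is only an illustration of the convention as described here (braces alone on their own line), not TiddlyWiki's actual parser; the function name and the choice of `<pre><code>` output are assumptions.

```python
import html


def render_braced_code(text):
    """Turn {{{ ... }}} line-delimited blocks into escaped <pre><code>."""
    out = []
    code_lines = None  # None while outside a code block

    def flush():
        body = html.escape("\n".join(code_lines), quote=False)
        out.append(f"<pre><code>{body}</code></pre>")

    for line in text.split("\n"):
        marker = line.strip()
        if code_lines is None and marker == "{{{":
            code_lines = []          # start collecting code
        elif code_lines is not None and marker == "}}}":
            flush()                  # end of block: emit it escaped
            code_lines = None
        elif code_lines is not None:
            code_lines.append(line)  # inside a block: keep verbatim
        else:
            out.append(line)         # ordinary text passes through
    if code_lines is not None:
        flush()                      # unterminated block: close it anyway
    return "\n".join(out)


if __name__ == "__main__":
    print(render_braced_code("before\n{{{\nif (a < b) x();\n}}}\nafter"))
```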

Jim Doria on May 14, 2008 9:12 AM

Could you please, please, *please* escape < and > instead of stripping them in the comments? Or at the very least place a reminder that you *do* strip them instead of just "no HTML"?

Adam on May 14, 2008 9:18 AM

>P.S: doesn't having "orange" as a constant undermine the purpose of a CAPTCHA?

This has been asked a million times, and answered a million times. Go look it up!

Adam on May 14, 2008 9:19 AM

Just as with licenses (http://www.codinghorror.com/blog/archives/000833.html), just pick a markup language, any markup language. They are a necessary evil.

Hoffmann on May 14, 2008 9:23 AM

In making the case for HTML as a lingua franca, you're also making the case for using XML, something you disputed in a previous post, particularly as it relates to "Conversion" and "What People Know".

Graham on May 14, 2008 9:26 AM

One of the reasons I like Markdown so much is that you can mix it with HTML, and Markdown's parser doesn't puke.

If you choose to implement Markdown syntax verbatim, you can allow people to use a combination of Markdown and HTML with nearly no additional work over just allowing HTML.

Nifty, no?

Darren on May 14, 2008 9:33 AM

CodeProject has used HTML as an option on its forums for years. In fact, it's the *only* option if you want formatted text - the choices are HTML (escape everything yourself, with newlines and emoticons converted) or raw text (output is exactly what you type).

As you suspect, this is great for those of us who are comfortable with HTML. However, there are problems:
* Not all programmers know HTML, and those that do aren't always comfortable with it.
* It's verbose. I'm typing in a bullet list here - while I'm actually anal enough about formatting to take the time to enter the proper tags, it's nowhere near as fast as just indenting and typing asterisks.
* It makes simple things difficult. Not just bullet lists, but bold, italics, special characters like angle brackets and ampersands - pasting in code samples often requires a lot of escaping.
* It puts the responsibility for proper styling on the user. The syntax for a multi-line block of code and a single keyword are different in HTML. Which one will be used? Both? Neither? Good formatting is nice, but even among users who do have a working knowledge of HTML, expecting good semantic markup is often too much to ask.
* You can't use it raw anyway. Security concerns, broken HTML (mismatched tags, pasted in from Word or just sloppy typing, etc.), CSS that isn't compatible with the site layout... You pretty much need to have a good (== error-tolerant) HTML parser server-side.
* You aren't really accepting HTML documents anyway - you're accepting snippets, with strict rules on what's actually allowed. This, combined with the need to accept malformed markup, strip either all CSS or just the more dangerous styles, pretty much kills the idea that what a user can type in is the same, predictable, no-magic-involved HTML they might be using day-to-day.
* Verbose, unreliable linking. HTML links are very powerful, but if you expect most links to be to other resources *on the same site*, then they end up adding a lot of extra typing and being unnecessarily fragile.
* Many of us have been trained by the many blogs and forums that don't deal well with HTML to avoid it - so even when new users *could* use HTML, many of us won't out of fear that our replies will be mangled. This isn't so much an argument against HTML as it is an argument *for* including a visible syntax reference or WYSIWYG editor.

There is one big advantage though, and you mentioned it already: the rules for processing *well-formed* HTML are reasonably stable and will likely remain that way for the foreseeable future. The extra work required from users for marking up aspects of their text *can* pay off then by removing a lot of ambiguity and helping to keep things stable over time. Whether this is worth the tradeoffs is another matter; I suspect that a good editor can do a lot to reduce the stress and frustration of markup, escaping, etc.

Shog9 on May 14, 2008 9:39 AM

I think you should use a control. I don't want to type in any formatting, just provide a simple control that applies the formatting for me.

Allow for:

Limited styling: (Bold, Italics, Font Size and maybe Color)
Allow for Pasting of code: (HTML PRE Tag)
Allow for hyperlinking, maybe lists

Those are the basics.

Jon Raynor on May 14, 2008 9:39 AM

What happened to you being in favour of skin-ability like you argued for on the podcast? Markdown looks good to me though :)

Martin Clarke on May 14, 2008 9:43 AM

I kind of like [b]some bold text[/b] -style. It's easy and easier to remember because almost all use it. If I should try to remember something else, that would be more difficult. I should be able to select text and click B-button and the text becomes surrounded with [b][/b].

Silvercode on May 14, 2008 9:45 AM

It's worth pointing out that wikis, at least popular ones like Wikipedia, have a de facto division of labor. Some study found from the statistics that:

1) The majority of edits are very small (in terms of diff), and made by a small group of people who make a lot of small edits (call them editors).

2) The majority of content/words on the wiki came from very large edits, made by a large group of people who contribute very rarely, often just once (call them contributors).

Essentially, wikis have the same writing process as any other collaborative effort (encyclopedias, newspapers, etc.). Contributors supply large, content-rich, but less than perfect content, which is then swarmed upon by copy editors, fact checkers, decision making editors, etc. who polish it into finished form.

Since the people who will presumably do most of the formatting, cross linking, citation-adding and the like are probably a small core of people willing to put in significant time, the biggest challenge is how to make it easy for someone with useful specialist info to easily add content without much of a learning curve, even if the result is less than ideal.

Matthew L. on May 14, 2008 9:46 AM

Personally, I like Markdown but if you're going for something simple, why not go all the way and just limit comments to plain text? What's the big deal with all the fancy formatting? People get by on newsgroups, email, chat and SMS just fine without special formatting.

Whatever you do support, I'm betting most people will not use the features so you're worrying about a feature that only the minority will ever use. Lack of formatting certainly hasn't been an issue with Coding Horror's comments.

Just a thought...

David Avraamides on May 14, 2008 9:49 AM

So if you go for HTML then how would you specify what language your code is?

You'll end up writing something like:

<h1>How to set a text on a label</h1>
<p>Here is how to do this:</p>
<code lang="x-csharp">
/* Here is teh codez */
myLabel.Text = "Hello World";
</code>

Yuk!

Graham Stewart on May 14, 2008 9:57 AM

A dropdown list with "Textile", "Markdown", "HTML", etc. would do. Parse to HTML for storage. Let users choose the preferred markup in their profiles. Set textile default. This simple :).

alex on May 14, 2008 10:15 AM

I'm less worried about bold and italic text than about code. I would love to see some code coloring (keywords like int in a different color, for example). That's a lot of work, but it would be sweet.

Juan Zamudio on May 14, 2008 10:19 AM

Sorry for the double post.

Juan Zamudio on May 14, 2008 10:21 AM

Besides being a perfect example of the *wrong* approach to designing what should be an extremely simple web component (i.e. insisting people write markup to decorate their comments) - I mean, do you want your site to be open to beginners and those that want to learn about development? Guess not - anyway:

"Incidentally, if you haven't ever edited a Wikipedia article, you should. I consider it a rite of passage, a sort of internet merit badge for anyone who is serious about their online presence."

To me this is madness. You really want the sheep of this site to go and flock to Wikipedia *randomly* and without clear purpose, and edit some poor article? At least qualify it so people who might think your special Merit Badge is worth earning realize that they might affect others with this action.

SpongeJim on May 14, 2008 10:28 AM

One of the main themes of your website has been "usability". Hell, just a few days ago you went off on how bad XML was. Now you're proposing to use a subset of it, with the exact same difficulties? Because people "should already know it"?

Sorry, Jeff, you're off your rocker on this one. WYSIWYG exists for a reason. If I have to type "a href=blahblah" every time I want to show a link, I just won't use your site. Whether we should know it or not is irrelevant. Whether it's easier to use is. If you really need HTML editing, include it as an extra option, but it should by no means be the only choice, or even the default.

Brandon on May 14, 2008 10:29 AM

Jeff,

Re "If the source and destination are the web, why not use the native markup language of the web?"... we invented higher level languages so that we *don't* have to write everything in the exact representation it's consumed. Compilers take our higher level code and translate it into machine code. How does "if the source and destination are CPU, why not use the native machine code of the CPU" sound to you? It sounds really obsolete to me.

Re "If you're a programmer, you damn well better know HTML", that's really myopic. How does "if you're a programmer, you damn well better know C" sound to you? You personally don't know C. I doubt that makes you less of a programmer in and of itself.

Anyway, I really hope you end up using some lightweight markup language (and thus adding more mass to it; nothing reaches "critical mass" without people supporting it beforehand) instead of HTML. You are making a community site at Stack Overflow so you have to stop thinking about *your* brain cells and start thinking about brain cells of *all* programmers that you would like to use your site. You want only web programmers? I wish you all the luck. But personally I was hoping for a less specific target group.

Ivan on May 14, 2008 10:30 AM

> Of the four markups you presented, the only one that was readable
> enough that I didn't have to refer back to the rendered version to
> see what was going on was the BBcode. (For a couple I'm still not
> sure how the first quoted section's end is delimited). But BBcode
> is practically html with square brackets, so why bother?

I vote for BBCode. The reason for the square brackets is that it lets you quote HTML/XML code snippets without any effort. On a programming site, that's a win.

rblaa on May 14, 2008 10:31 AM

How about a WYSIWYG Silverlight or Flash control? You don't want search engines crawling that page anyway.

Zack on May 14, 2008 10:38 AM

@Zack: what about people who don't have Silverlight or Flash? There are plenty of pretty good HTML editors. TinyMCE is one of them.

Cristian Ciupitu on May 14, 2008 10:50 AM

I meant HTML editors written in HTML + JavaScript (that don't need Flash, ActiveX etc.).

Cristian Ciupitu on May 14, 2008 10:52 AM

As we are currently writing our personal blog, we have to decide which way to go as well. We prefer a lightweight markup language instead of writing our articles in HTML.

Getting used to the syntax is a process of writing one or two articles and you are familiar with it if you don't use different markup languages on different websites ;).

For user comments I second the idea of keeping it as simple as possible.

Martin Czura on May 14, 2008 10:53 AM

Even knowing html, I just find it painful to have to write all this markup. I don't understand what is so hard about Markdown? Do your users really need anything fancy?

Also, realize this about Markdown (and similar): it was not written as an easy-to-parse unambiguous markup language. It was written as a way to make writing formatted text *easy* and highly legible. You should not be stopped by some undetermined corner cases in the markup. When I write a comment, I am not writing code. These are 2 different things, and you should not apply the thinking of one to the other, I believe.

Just my 2 cents!

charles on May 14, 2008 10:55 AM

Another vote for TinyMCE/FCKEditor with "View Source" enabled, ideally remembering which mode you used last, as Blogger does.

I'd also recommend an "automatic line breaks" checkbox for source view; even Slashdot does autobreaks by default now in its Ajax comment form.

Braden on May 14, 2008 11:01 AM

Tangentially, I highly recommend HTML Purifier ( http://htmlpurifier.org ).

Braden on May 14, 2008 11:07 AM

Seems to me the conclusion you've come to is that a WYSIWYG HTML editor is what is required. No weird code and wiki functionality a button-click away. And most allow you to drop into the unadorned HTML.

Does MS-Word make you type codes? No, the focus is the content. This seems like a no-brainer to me.

And not every programmer knows HTML. Some do systems/device programming or windows forms applications, exclusively. Gasp! I know.

Robert Barth on May 14, 2008 11:16 AM

the link for 'why doesn't wiki do html' is broken, it should be http://c2.com/cgi/wiki?WhyDoesntWikiDoHtml

Wilfred on May 14, 2008 11:22 AM

I agree with your decision. I've had this fight with clients and former bosses so many times. Whether good programmers should know HTML or not is irrelevant; the fact is, a larger percentage of your readership knows HTML than knows any other markup language, I can assure you. You please the most, and the rest will have to catch up.

I would suggest adding some classes/ids for use in markup, though. Perhaps list styles etc. These can be documented briefly on the site, and anyone who knows HTML will know how to use them.

Someone mentioned the benefits of having an abbreviated link tag so that you didn't have to remember how to type an entire URI to a page; but if you plan your URIs well and use some URI mapping/rewrite magic, this shouldn't be an issue; URIs will be simple enough to remember or paste with little fuss.

Lucas Oman on May 14, 2008 11:41 AM

It's not that writing HTML for your post is hard (it's not) or that it takes a lot of time (it doesn't). You need a WYSIWYG editor for the site because it forces you to focus on the content and not the presentation. Also, this allows you to more easily apply a consistent style across the website.

Jim Greco on May 14, 2008 11:58 AM

I have to throw my weight behind Markdown (as so many of the above posters have done). Of the examples you showed it is the least verbose (wikipedia's format is horrible in my opinion, extremely verbose)

Writing html in my text editor is fine, but I don't really relish the idea of inputting straight html into a web form. If I post in your community site, I'm not trying to write my own page, just trying to enter some of my thoughts.

As one of the previous posters mentioned, Markdown supports straight html, so if they are right, then Markdown seems a flexible option.

Or I can put it another way: I would feel less inclined to post on stack overflow if I had to write html to do it. Markdown on the other hand, wouldn't bother me.

Justin Standard on May 14, 2008 12:19 PM

If you did use html, could you also have a standard no frills text only mode? I hate using break and paragraph tags when an "enter" would do nicely.

brian on May 14, 2008 12:31 PM

I agree that HTML should be allowed in forms, but the problem is XSS. When you come up with a really good way to allow XHTML (attributes, too!) and prevent XSS in a bulletproof manner, please do share. I've been wanting a solution for this problem for quite some time. I even asked Haacked to explain how he does it in Subtext quite a while ago (I'm not even sure how effective it is in Subtext). While he agreed it would make an interesting blog post, he evidently does not have the time to put it together (which I completely understand).

While I am on the topic, this is the only PHP "solution" I could find: http://shiflett.org/blog/2007/mar/allowing-html-and-preventing-xss
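
For a sense of what "allow attributes but prevent XSS" involves, here is a rough whitelist sketch in Python using the standard library's `html.parser`. It is not a vetted or bulletproof filter and is not taken from Subtext or the linked article; the tag and attribute lists, the set of safe URL schemes, and the function name `sanitize` are all assumptions for illustration. It keeps only listed tags and attributes, drops `href` values that are not absolute http, https or mailto URLs, and escapes everything else. It does not balance tags.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical whitelist: tag name -> attributes permitted on that tag.
ALLOWED_ATTRS = {
    "a": {"href", "title"},
    "b": set(), "i": set(), "em": set(), "strong": set(),
    "p": set(), "code": set(), "pre": set(), "blockquote": set(),
}
# Only absolute URLs with these schemes survive (relative URLs are dropped).
SAFE_SCHEMES = {"http", "https", "mailto"}


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_ATTRS:
            # Unknown tag: show it as text instead of rendering it.
            self.out.append(html.escape(self.get_starttag_text(), quote=False))
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS[tag] or value is None:
                continue  # drops onclick, style, and anything unlisted
            if name == "href":
                scheme = urlparse(value.strip()).scheme.lower()
                if scheme not in SAFE_SCHEMES:
                    continue  # drops javascript:, data:, etc.
            kept.append(f' {name}="{html.escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_ATTRS:
            self.out.append(f"</{tag}>")
        else:
            self.out.append(html.escape(f"</{tag}>", quote=False))

    def handle_data(self, data):
        self.out.append(html.escape(data, quote=False))


def sanitize(fragment):
    """Return the fragment with only whitelisted tags and attributes kept."""
    parser = Sanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<a href="javascript:alert(1)" onclick="x()">hi</a>'))
```

For anything public-facing, a maintained sanitizer library is a safer bet than a hand-rolled filter like this one.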

Josh Stodola on May 14, 2008 12:43 PM

On 4: No, they don't. Which is why HTML is a reasonable choice, since whatever HTML they need to learn to make a comment is very limited and quite simple to grasp.

On 6: Who says Wiki markup is easier to learn than the subset of HTML required to post simple comments on a blog?

Anders Sandvig on May 14, 2008 12:49 PM

If you define what can be used, I'll be happy. If you just say that "HTML is allowed" and I have to guess which tags are disabled, that will be annoying.

Joseph on May 14, 2008 12:51 PM

Looks like you just dropped down to the lowest common denominator. Just because we all know HTML doesn't make it the best choice! Are you going to let us post CSS with our HTML? What attributes will be allowed - any IE specific ones? Where are you going to draw the line?

There, I've played Devil's Advocate and posted the counter-argument first. I hate HTML, but I think it's the best choice for stackoverflow.com due to the reader-base.

Rick on May 14, 2008 12:56 PM

MARKUP
Whitelist of markup is a must; I can't see a way you'd avoid having to parse and validate input anyway. You'd need to consider how to Tidy ( http://www.w3.org/People/Raggett/tidy/ ) the markup, keep the site look and feel consistent and support fixing markup when the page DOCTYPE and user agent behaviours inevitably change. This is perhaps a small risk, but something you want to be able to fix in one place rather than in every back post in the site's history.

SOURCE CODE
This is a programmer's site; making code legible is important; that includes indentation and (most likely) colorization support. You can allow users to carry this burden with tools like jEdit's Code2HTML plugin if you go with HTML markup. You take on a maintenance task if this support goes onto the server - updating the parser/encoder for every syntax change in every programming language.

[As a side note: I've noticed that automatically converting carriage return/linefeed into HTML elements can result in interesting battles with the software when it comes to source code, depending on the approach chosen.]

Whatever markup you choose, I would create a minimal list of key must-have elements rather than supporting things just because you can. To me, that means supporting source code, links and the ability to paragraph text - pretty much anything else can be omitted to begin with and added as needs are identified.

McDowell on May 14, 2008 1:04 PM

Limited subset HTML (explicitly listed).

That is, just basic formatting and hypertext (, , , MAYBE . No need for font and color control and their inevitable massive abuse.).

BBCode is the devil.

Sigivald on May 14, 2008 1:07 PM

I agree that HTML is the markup to use. It's as simple as following the law of least astonishment, even though it is certainly not the simplest thing that could possibly work. It is ironic that the text field that I'm typing in now has "(no HTML)" at the top, which is pretty lame, or should I say, <b>lame</b>. ;-)

Simon on May 14, 2008 1:13 PM

I think you're looking at this from the wrong point of view. You're looking at the technology (various markup syntaxes in this case) whereas you should be looking at the problem you want to solve.

What's the problem?
Problem Statement: Allow [somewhat technical] users the ability to input data which includes some basic format specifications.

What kind of data?
Human-readable text.

Does this "human-readable" text support localization (i18n)?
Hopefully. In that case you need the ability to segregate text blocks based on locale (this block is "en-US", this block is "de-DE", etc).

Is there non-localizable text?
Yes. Code. But those code blocks are themselves somewhat localizable into different computer languages (C#, C++, PHP, VB, Python, Ruby, Perl, F#, Haskell, etc). It's just a different pool of "locales" if you will.

Do you plan to support externally created content?
By and large, the web browser is a HORRIBLE data-entry application. Word, OpenOffice, whatever is much better (and faster) at producing formatted text. Will you support cut'n'paste of RTF, DOC, ODF, OOXML, etc?

There are lots more questions ;) But the bottom line, focus on the problem and then think of the solution. Don't assume the solution already exists in the form of another technology. This smacks of trying to put the square peg in the round hole.

It may be that one of the aforementioned markup languages is appropriate. It is more likely that bits of each are preferred. HTML (while a reasonable rendering language) impedes the content generation process IMHO. It may be that you need to support multiple input formats, or that you need to create your own variant.

Simon (another one) on May 14, 2008 1:33 PM

WYSIWYG with HTML, thank you.

Damian on May 14, 2008 1:53 PM

> If you're a programmer, you damn well better know HTML.

WTF is that for a stupid statement?!

I contend that a big portion (if not the majority) of the world-wide programming population doesn't know HTML because they never needed it and never will.

A crapload of code is written in C. Do you know its specs like the back of your hand? No? Oh, well, you're not a programmer then!

W on May 14, 2008 1:53 PM

I, like most programmers, prefer HTML.

Also, it would be really nice to have some kind of syntax highlighting for code blocks. It doesn't need to be anything complex, just enough to highlight common control structures and strings.

javier on May 14, 2008 1:59 PM

Well, don't go for half measures then - if GUI handholding is worthless crap for incompetent losers, demand people use telnet from a command line and type the binary network protocol themselves. That's the only way that you'll limit contributions to true uber-geeks.

People who will never get laid even if they try to pay for it are definitely the best people to ask questions about how to design software that will actually appeal to the mass market.

Bob on May 14, 2008 2:01 PM

am I crazy or the url I entered for "website" was changed?
I entered this: https://twitter.com/flupkear
and got this: http://flupkear/

javier on May 14, 2008 2:02 PM

yep, for some reason your blog is changing the Twitter url :S

javier on May 14, 2008 2:04 PM

+1 for Markdown.

As others have stated, it's extremely intuitive (I used much of the syntax in plaintext files before even learning about it), and it allows you to drop down to HTML if you want to.

James on May 14, 2008 2:12 PM

Back when I was a boy, we didn't have no fancy-dancy Wysi-whatchimacallit editing. We coded up our typesetting on things like the Compugraphic Quadex with code like:

[p10, l12, m26, t24, il18p]

That's 10 point type on 12 point leading on a 24 PICA dammit, PICA line with a first line indent of 18 points. POINTS, dad blast it! You young whippersnappers don't know how easy you have it, by golly!

ThatGuyInTheBack on May 14, 2008 2:15 PM

"I have one iron-clad design guide: this is a site for programmers, so they should be comfortable with basic markup. None of that nancy-boy GUI toolbar handholding nonsense for us, thankyouverymuch. If you can sling code, a little bit of presentation markup is child's play."

Jeff,
why would you think a programmer has to know markup? As somebody who does, I work in an organisation that has a range of developers who span COBOL, C, VB and Java. A good percentage of these would not use markup but would still be regarded as valuable programmers.

So are you just confining yourself to web programmers?

Stephen on May 14, 2008 2:25 PM

I've been thinking this too... possibly limiting to a subset of the more modern HTML might be a good thing (allow strong, but not b, for instance).

Stu on May 14, 2008 2:36 PM

I also prefer HTML as formatting language. However I also like "some" preprocessing. There should be an option to turn returns into <br /> or something...

Have you checked out the YUI Rich Text Editor (http://developer.yahoo.com/yui/editor/)? You should provide that for "simple editing". In most cases it will be more than sufficient, because in most cases you'll only need plain text with some highlighting of single words.

BlaM on May 14, 2008 2:47 PM

There are a million opinions here, but I would suggest a markup that doesn't get in the way if the user is just typing a response.

I think markdown does a remarkably good job at letting you just type. In fact I often don't realize sites use it, but then find out that my lists got formatted nicely. That's great.

Questions and answers probably don't need the vast number of special options that say Textile offers. I vote for keeping it simple :)

Adam Sanderson on May 14, 2008 2:48 PM

+1 for Markdown. It looks the cleanest to me.

Tom Robinson on May 14, 2008 2:49 PM

Go with BBcode. I'm sure everyone on here already uses it in various forums.

tim on May 14, 2008 3:30 PM

*TEXTILE* looks the best IMO.

*bold* and _underline_ are pretty standard, even GTalk uses it to format text.

Looks simple enough, however, I prefer to do everything with HTML...

kevin on May 14, 2008 3:41 PM

Do yourself a favor and use wiki markup. I used to prefer HTML or BBcode (since I was more used to them). However, at work we now use an internal specialized wiki.

I can tell you that I am a complete convert. Wiki markup takes fewer keystrokes, fewer brackets, and less non-obvious syntax.

Lots of people know it these days, it's very easy and powerful, and it isn't burdened by a ton of ANGLE BRACKETS.

It also allows you to "force" some continuity throughout the site visually.

I think you will be making a huge mistake if you just use HTML.

TM on May 14, 2008 3:50 PM

Let me share my experience in this area. I ran into this problem while creating http://dotnettipoftheday.org. The goal was to give users the ability to enter new tips which may contain C#/VB.NET code. And of course the code should be well formatted for easier reading. I tried JavaScript WYSIWYG editors but they are far from perfect. They don't provide enough options to format code examples. Now, taking into account that all site users are .NET developers, I think that the best solution will be a Silverlight WYSIWYG editor. Such an editor can give you a desktop-like experience and enough control over formatting.

kostya.ly on May 14, 2008 4:05 PM

+1 for Markup from me.

html is presentation, not content. You should be using Xml :-P

I'm only half joking. I thought this new site was supposed to be about the content. Better to have a DSL (Domain Specific Language) to handle this.

You've got to ask yourself, what do users NEED to write about? What is 'good enough' to satisfy the needs of the most. We don't care about the 1% who want to write their doctoral thesis on the site. We want people to post things quickly, easily, and be nice to read. If you look at the abhorrent mess of websites, you might soon rethink the 'all coders can do html just fine' line of reasoning. MySpace anyone? For popularity, I reckon more people are better at BBCode from boards than they are used to writing raw html, but I still don't want BBCode

Also, because of the already overused angle bracket tax, are we going to have to escape all < and > or risk them being interpreted as HTML?

For coders, by coders.

1. Headings, only 2 levels necessary
2. Code blocks - necessity
- optional ability to indicate language. pretty printing is not a 'nice to have'.
3. Links. external and cross-referencing.
4. Basic markup (bold, italic, highlighted?)

We don't need colors, divs, margins, padding, JavaScript, alternate fonts (or do we?), different size fonts, etc. You don't even need lists or bullets. That can be done well enough manually. Simple tables might be nice.

Otherwise the postings will look like a big pile of dog crap, and in web 2.0 nobody likes crappy looking websites.

Wikipedia is a good example because they can take their DSL and convert it to anything. In my mind, what you're writing is wikipedia'ish, so look to the leaders, follow their example, and improve where they have made mistakes.

Oh, and add a 'preview' function too.

fluffy on May 14, 2008 4:15 PM

Why does it have to be just one method? Let them choose html-lite, or Textile, or whatever they prefer. That way, you don't have to create a new markup method. Just let the user select one from a (hopefully short) list.

Neil Baylis on May 14, 2008 4:38 PM

I'd prefer it if you left the choice of markup up to me for each individual submission. Sometimes I need full HTML to format something properly, often I just want to use plain text. Obviously this doesn't work for a wiki page where there are many contributors, but for individual submissions, choice of markup would allow everyone to write in a format they are familiar with.

Peter on May 14, 2008 4:44 PM

Jeff,

What kind of complex visual structure do you want to appear in your site that cannot be expressed in those simple markup languages?

Besides, the fact that I can program HTML, doesn't mean I don't prefer something simpler if it's available. So, I'd go with some of the other markup languages if you asked me.

Gustavo on May 14, 2008 5:05 PM

In the last sentence it sounds like you were going to write your own markup language, perhaps inspired by some of the above.

If you invent a new markup language, or one which uses a combination of features from other ones, you are doing it wrong.

No matter how clever you think you are, no matter how frustrated you are with existing standards, the world does not need a new markup language. I don't care about the conventions on your site. Unless you expect that I will be using your site more than any other, I want it to work with conventions I learned elsewhere.

Nor do they want to deal with the inevitable bugs your new markup language parser will have. Use an existing standard or don't use any at all. I suggest: Wikipedia, a subset of HTML, or plain old text. These are the only reasonable choices.

Perhaps I misunderstood you, because you seemed to understand this fact, but in the last sentence your "BUT I'M SO MUCH CLEVERER THAN ANYONE ELSE" brain took over.

Neil Kandalgaonkar on May 14, 2008 5:29 PM

A vote for Textile. If you just want to type, and be able to enter some bold text, lists, headings or links, it works very naturally. Novices have no problem with it either (I use it as the default formatter for a CMS backend). I prefer it (slightly) over MarkDown because I find the way you enter headings in it clunky.

Textile also makes quotes beautiful: &ldquo;Like this.&rdquo; when you enter "Like this."

Joost on May 14, 2008 6:01 PM

I argued the same thing at last year's Wikimania conference. Why spend all this work on a common wiki format when we could just use a subset of HTML? For those concerned about usability, we're finally getting some decent rich text editors for HTML textareas. These will be fine for most users and they also already produce valid XHTML. Yeah, I think you're completely on the right track.

Though, I do like to use Markdown for some documents. IMHO it's the best of the lightweight formats.

Aaron on May 14, 2008 6:02 PM

"This c2 wiki page titled Why Doesn't Wiki Do HTML?..." The link is broken.

Also, I'm sure Wikipedia loves it when you link your image directly to the "Edit" page for Joel's entry. Remember, folks, if you have to make a test edit, please make it to [[Chicken]], not [[Joel Spolsky]]!

Anonymous Cowherd on May 14, 2008 6:05 PM

A problem with Markdown is that it interprets a single underscore as an emphasis (italic) tag. A hassle if you're trying to talk about programming or something that uses underscores.
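
The underscore complaint is easy to reproduce with a toy converter. What follows is only a sketch in Python: the two regular expressions and the function names are assumptions made for illustration, not Markdown's actual implementation. The first pattern treats any pair of underscores as emphasis; the second only fires when the underscores sit at word boundaries.

```python
import re

# Naive rule: anything between two underscores becomes emphasis.
NAIVE_EM = re.compile(r"_(.+?)_")

# Stricter rule (an assumption, not Markdown's real grammar): the opening
# underscore must not follow a word character and the closing underscore
# must not precede one, so identifiers like get_user_name are left alone.
BOUNDARY_EM = re.compile(r"(?<!\w)_(.+?)_(?!\w)")


def naive_emphasis(text):
    """Convert _text_ to <em>text</em> with no regard for identifiers."""
    return NAIVE_EM.sub(r"<em>\1</em>", text)


def word_boundary_emphasis(text):
    """Convert _text_ to <em>text</em> only at word boundaries."""
    return BOUNDARY_EM.sub(r"<em>\1</em>", text)


print(naive_emphasis("call get_user_name() first"))
print(word_boundary_emphasis("call get_user_name() first"))
print(word_boundary_emphasis("this is _really_ important"))
```

The naive version mangles the identifier, while the boundary-aware version leaves it alone and still handles ordinary prose emphasis.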

Jonathan Drain's D&D Blog on May 14, 2008 6:46 PM

They all look fine for the most part except for how they handle internal and external links. There the Wiki format wins out in terms of being intuitive and easy. It's frankly the most important bit, and I think even HTML screwed that one up -- a href="" is not intuitive, it's possibly the most non-humane way I've ever seen for how to do links.

"I know, let's make the tag to link to external sites the same one as we use to make internal anchor points. And while we're at it, let's use a totally opaque acronym to designate the link element. Because LINK would have been too straightforward."

Shmork on May 14, 2008 7:04 PM

I generally avoid using markup in posts to any website (other than a wiki) simply because I have no idea what they're using for markup unless I'm familiar with the particular bb software they're using or I hit one of those idiot buttons on the text entry box to see what pops up. I think it's generally a mistake not to include at least something to give people a reminder of what they can use on the site, if you want them to use it at all.

Vizeroth on May 14, 2008 8:14 PM

What's wrong with plain old text? It's simple, it's easy, and there are established conventions (hello Usenet!) for *bold* and /italics/ and _underline_ (oh, and RAISED VOICE as well, mustn't forget that). Don't even need to make it pretty -- just about anyone with half a brain can parse such "markup" directly.

As for programmers being able to use HTML, well, yes, but that's a long way from _liking_ it. Besides, a decent programmer ought to be able to pick up a minimal markup language. A programmer that is put off by having to pick up something new isn't really much of a programmer, so the whole "programmers won't have a problem" argument is bogus.

If you really *do* want a good presentation language, use TeX. It's established, widely known, respected, and does a better job than HTML ever will....

SJS on May 14, 2008 8:45 PM

In the course of time, you will have to embrace the idea of having "friendly-but-irritating HTML GUI browser layout controls" because you will find out that many of your users are complaining.
Simplicity is the keyword I think. Just use some good HTML editor such as TinyMCE.
Let me tell you, Jeff: most of the users (even if you think they should know HTML) will screw up the markup.

Niyaz PK on May 14, 2008 9:44 PM

interesting situation Jeff --

i think if you choose html you'll be trapped with that decision, because it's hard to convert from html to anything else.

if you start with markdown or textile, then you can store them as markdown or textile for now, then if you later change your mind, you can convert them to html.
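
One way to read that advice, sketched in Python: keep the text exactly as the user typed it, remember which markup it was written in, and produce HTML only when rendering. Everything named here (`Post`, `RENDERERS`, `render_plain`) is hypothetical, and only a plain-text renderer is filled in; a Markdown or Textile renderer would be registered the same way.

```python
from dataclasses import dataclass
from html import escape
from typing import Callable, Dict


def render_plain(source: str) -> str:
    """Escape the text and turn blank lines into paragraph breaks."""
    return "<p>" + escape(source).replace("\n\n", "</p><p>") + "</p>"


# Hypothetical registry: markup name -> function producing HTML.
RENDERERS: Dict[str, Callable[[str], str]] = {"plain": render_plain}


@dataclass
class Post:
    source: str  # what the user typed, stored untouched
    markup: str  # which renderer to use for this post

    def to_html(self) -> str:
        # HTML is derived on demand, so a renderer can be fixed or swapped
        # later without rewriting stored posts.
        return RENDERERS[self.markup](self.source)


print(Post("a < b\n\nc", "plain").to_html())
```

Because the stored form is the source, changing your mind later means changing a renderer rather than reverse-engineering HTML.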

i went with markdown on a skunkworks text-adventure game i'm writing.

the amount of time taken from choosing markdown, to finding available c# implementation, to having it working in my project was literally thirty minutes.

best of luck
lb

secretGeek.net on May 14, 2008 10:06 PM

REALLY agree. Those WYSIWTF editors screw up the markup even more than users. I don't even word processors.

Example: Type "hello world". Make "hello" bold. Delete world (and the space). Type some more "o"s after "hello" (making helloooo). Those extra o's may or may not be bold. I can never tell beforehand.

In a markup language, I always know if I have the cursor before or after the end tag.

Nicolas on May 14, 2008 10:15 PM

One day you rant about the "angle bracket tax" in XML and the next day you want a markup language based on... angle brackets? Have you become a fully fledged schizophrenic lately?

Chris Nahr on May 14, 2008 10:58 PM

HTML is a data format that works specifically with data; it doesn't care about the format or the actual rendering or handling of this data, it only describes it in a way the browser decides to handle. At least originally.
You'd strip away most features that you don't need, you wouldn't have CSS in it, because HTML is about data (words) and saying what to do with it, CSS cares about how you do it, but that's what the website would do. I can also think of a few tags that are downright annoying (if you remember the horror of 90s websites you know the pain of <blink> and <marquee>).
You might also remove tags that don't make sense in the context (say, DIV doesn't make much sense in a forum comment), or remove attributes that aren't necessary.
You could also add more tags and attributes specific to your job.
To give an example, say wikipedia cross-referencing
<a cref="Red Power Ranger">Jason Lee Scott</a>
It might have its problems, but most of those would be abuses of the system, and those are fixed by the community, as wikipedia has shown. The important thing is that when I read that I realize I'm making a link, and that it's not a normal link, but a cross-reference link.

Now for those who complain about the readability of a:
A bit of trivia that might be wrong (can't find a source that says it outright). <a> originally stands for "anchor". You'd put an anchor on the text between <a> tags, and could "link" that anchor to other anchors or HTMLs through a "hyper-reference" (href). It doesn't make as much sense now because links have become much more powerful. Also it's not that human readable, but we are talking about the days where 28kbps was blazing fast, <anchor> is 5 bytes heavier than <a>, and I'm not talking about hyper-reference or anything of the sort. So really it's one of those things that happens when you grab a very specific language and use it for something it wasn't meant to handle (applications and dynamic web-sites instead of powerful hyperlinked documents).

Charlie Lobo on May 14, 2008 11:11 PM

The argument could be made that if you are a programmer you should not know html. That way people like me who have done half-serious, but not critical web applications would have to do things properly and use templating engines.

Personally I would keep the paragraph tags, or tell the textarea to put them in automatically somehow. Blogger's composer widget annoys me because you can put in html, but it adds extra line breaks in the preview corresponding to the line breaks in the source, even though html is supposed to ignore whitespace other than a single space character.

John Ferguson on May 14, 2008 11:35 PM

I like your sanitized HTML plan. And I love reading 4000 comments before posting. @frederik and @pierre are crazee!!!

Jon Galloway on May 15, 2008 12:24 AM

Thank you for the great comments and links! I have a browser instance full of links I'll be exploring now. Yes, I do read all the comments, as I said in Podcast #1. :)

Also, I apologize for the way this old Movable Type system strips out content between HTML tags. I should upgrade one of these days.

> live comment preview system

Absolutely. I think this is essential. And incidentally, an HTML-subset makes this pretty easy to achieve through JavaScript..

> Your argument in favor of ubiquity and convention was exactly my point against your argument yesterday in your anti-XML post.

http://www.dehora.net/journal/2008/05/15/blubml/#comments

I realize there is some cognitive dissonance here.

I fully set out *intending* to pick one of the lightweight markup languages (Textile, Markdown, BBCode, Wikipedia, etc), but after struggling to understand their rules and peculiarities, I couldn't get past the ease and ubiquity of HTML. I kept coming back to it. I can't say that about XML after working with the alternatives. XML is everything and nothing; HTML is a very clearly defined set of tags that do very specific things.

The key words, though, are "subset of .. HTML" -- along with inferring the paragraph tag. I find that <b>, <i>, <code>, <pre>, <blockquote>, <li>, etc are simple to use and don't obscure the underlying content.
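
For what it's worth, here is roughly what such a subset filter could look like. This is a minimal sketch in Python using the standard library's `html.parser`; the tag list is just the tags named above, and the names `ALLOWED_TAGS`, `WhitelistSanitizer` and `sanitize` are made up for illustration. It keeps whitelisted tags, drops all their attributes, and shows everything else as literal text.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: only the tags named in the comment above.
ALLOWED_TAGS = {"b", "i", "code", "pre", "blockquote", "li"}


class WhitelistSanitizer(HTMLParser):
    """Keeps whitelisted tags (without attributes) and escapes everything else."""

    def __init__(self):
        # convert_charrefs=False so entities such as &lt; pass through untouched.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes (onclick, style, ...) are dropped
        else:
            self.out.append(escape(self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are not on the whitelist; show them as literal text.
        self.out.append(escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        closing = f"</{tag}>"
        self.out.append(closing if tag in ALLOWED_TAGS else escape(closing))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(sanitize('<b onclick="x()">bold</b> <script>alert(1)</script>'))
```

This sketch does not balance unclosed tags or handle every malformed input, so a Tidy-style cleanup pass and a proper security review would still be needed before trusting it with real user input.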

Jeff Atwood on May 15, 2008 12:43 AM

HTML is a fine markup language but it does have several flaws that I think make it unfit for stack overflow:

1. If I want to type in a question or answer in English then I don’t want to mess about with the p tag, the br tag or nbsp. I just want to write it in plain text and have it saved with all whitespace intact (for example, I didn’t use angle brackets or the ampersand character in this paragraph because I don’t know how your blog software handles it).

2. Since it’s a programming forum there is a good chance the answer will contain some XML. Typing ampersand, l, t, semicolon to start a tag is both tedious and will prevent me from proofreading my text.

3. And then there’s code. Just try to type a medium-length code block in HTML: you have to think about whitespace, you have to take care to replace some operators with HTML entities, and if you copy-paste from an IDE you will either get misformatted plain text or HTML with more syntax highlighting markup than a forum should support.

You should make the forum easy to use, not put roadblocks in front of people trying to post questions and answers. Formatting code in HTML (and, let’s face it, formatting in HTML in general) is too much work and not a good use of the poster’s time.

And also, not every programmer today works on web applications and a lot of programmers don’t know HTML well enough to format code.

I would go with plain text, taking care to preserve whitespace, maybe with automatic turning of URLs into links like Joel’s forums; everything else will just get in the way.
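
A rough sketch of that plain-text approach in Python, under some assumptions: the URL pattern is deliberately simple (it will swallow trailing punctuation), the function names are invented, and whitespace is preserved with a `white-space: pre-wrap` wrapper rather than by rewriting the text.

```python
import re
from html import escape

# Simplistic URL pattern (an assumption): http/https up to the next
# whitespace, angle bracket, or double quote.
URL = re.compile(r"(https?://[^\s<>\"]+)")


def linkify(text):
    """Escape plain text for HTML and wrap bare URLs in anchor tags."""
    parts = URL.split(text)
    out = []
    for index, part in enumerate(parts):
        safe = escape(part)
        if index % 2:  # odd indexes are the captured URLs
            out.append(f'<a href="{safe}">{safe}</a>')
        else:
            out.append(safe)
    return "".join(out)


def plain_text_to_html(text):
    """Render a plain-text post so that spaces and line breaks survive."""
    return f'<div style="white-space: pre-wrap">{linkify(text)}</div>'


print(plain_text_to_html("if (a < b)\n    see http://example.com/?x=1&y=2"))
```

The poster never types an entity or a tag; angle brackets and ampersands in code are escaped on the way out.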

NirD on May 15, 2008 1:25 AM

Absolutely! - limit HTML tags to the most basic units that you want to allow, and let people get on with things. I mean, is the bold <b></b> tag that much more difficult to remember than pseudo-HTML [b][/b]?

It's about time someone did a sanity check on pseudo-html in forums/blog software etc.

Goatslayer on May 15, 2008 1:29 AM

I thought about this problem a while back and I reached the conclusion that regular HTML + Tidy + stripping end of lines can keep you away from most of the problems in security category.

HN on May 15, 2008 1:54 AM

If you can type plain text and it is not reformatted reinterpreted or mangled then fine

If you can type code without the same and without having to escape characters or specially mark it then fine

HTML is a bad choice since it mangles plain text so you have to think about the text you are typing and not just the content!

Either use plain text only, or a minimal markup language (e.g. bbCode, Wiki) with known restrictions and a very simple syntax, or go for a full blown formatting language (e.g. TeX)

The comment above about Python "it's generally impossible to try to guess what Python code with the indentation stripped out is supposed to be." sums up my dislike of Python ....

Jaster on May 15, 2008 2:20 AM

@Jeff:
> "I find that <b>, <i>, <code>, <pre>, <blockquote>, <li>, etc are simple to use and don't obscure the underlying content."

Mmmmm.... hardly obscured at all...

<pre lang="x-csharp">
foreach (KeyValuePair&lt;int,string&gt; kvp in messages)
{
    if ( kvp.Key &gt; 0 &amp;&amp; kvp.Key &lt; 10 )
        Debug.WriteLine(kvp.Key + &quot;-&amp;gt;&quot; + kvp.Value);
}
</pre>

Assuming you want a minimal barrier to pasting in source code (absolutely essential in my opinion) then you'd have to automatically handle that HTML-encoding for us.
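
Handling the encoding automatically is not much code. A minimal sketch in Python, assuming the site has some way of knowing that a chunk of text is a code block; the `code_block` helper and its `lang` parameter are hypothetical, with `lang` mirroring the attribute used in the example above.

```python
from html import escape


def code_block(source, lang=None):
    """Wrap pasted source code in <pre>, escaping &, < and > for the poster."""
    attr = f' lang="{escape(lang)}"' if lang else ""
    return f"<pre{attr}>{escape(source, quote=False)}</pre>"


print(code_block("if (a < b && c > d)", lang="x-csharp"))
```

The poster pastes raw code and the entities only ever exist in the generated page.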

That would be better, but then how do you tell the difference between <b> when someone is trying to bold a line of code and <b> literally appearing as a generic-type or in some HTML/XML source code?

No, any way that you do HTML input it is going to involve character-escaping.
WYSIWYG is the only sensible way to go in my opinion.

Also I notice that you didn't include <img> on that list. So no way to illustrate articles with useful images then? (I'm thinking class/architecture/sequence diagrams, UI images, etc).
Is stackoverflow purely intended for keybashers or is it for engineers as well?

Graham Stewart on May 15, 2008 2:23 AM

Also consider that <b> and <i> are not the "right way" to bold and italic text in HTML.
So stackoverflow will be effectively endorsing bad practice before anyone even writes an article.

Graham Stewart on May 15, 2008 2:28 AM

I agree with your article. I'm too lazy to learn all the other lightweight markup languages.

Eng Lee on May 15, 2008 2:42 AM

Don't forget to make the textarea vi'esque, also. Please.

mike on May 15, 2008 2:50 AM

No way around it: provide a UI for formatting text that is close to the one MS Word provides. Strip down the formatting options to the ones absolutely needed for the site, and use some lightweight JavaScript text editor.

When producing quality content it's a bad idea to make ppl switch "languages" used to write down their thoughts.

Pointernil on May 15, 2008 2:56 AM

As long as GUI is provided in the editor, I will not care what is being used under the hood :).

So your design decision should focus on bettering the GUI toolbar, so that people don't have to resort to HTML to write anything. The underlying details are irrelevant. Even a wiki-like syntax will do, but do not expect people to learn new languages just for writing up their views.

deepank on May 15, 2008 4:02 AM

Jeff, could you please clarify if you will use some sort of graphical interface?
I have nothing against HTML, but using it as the only way to format a post is not a good idea. First, it's quite ridiculous to assume every programmer knows HTML. I know several with no HTML knowledge at all, and I only know HTML because I worked two years as a PHP programmer.
Second, even if you know HTML, it's not sure that you know the needed tag. For example, I've never used <i> or <pre> before.
And how do you define which language a pasted code-snippet is? I think that's quite important for correct highlighting.

So please add a way to layout your post without knowledge of your html-dialect!

prengel on May 15, 2008 4:37 AM

My two cents: DON'T USE TEXTILE

We (unfortunately) use Textile on one of our web applications (Sassins.com) and it causes some big problems. Namely, when someone writes text like "I -generally- use dashes", they really didn't want to strike out the words between the dashes. That's the problem with Textile.

It's happened a bunch of times on our website.

Also, thanks for the blog, Jeff! I enjoy it.

David Grayson on May 15, 2008 5:21 AM

If you are going to make us use HTML at the very least give us an HTML GUI to help with tables and such.


Also, you missed one thing about Wiki Markup (or other simplified markup) versus HTML: Because your choices are more limited it makes pages look consistent.

If you let everyone willy-nilly toss in whatever HTML they like, your site WILL be ugly because all of the pages/articles will look different.

Rickasaurus on May 15, 2008 6:23 AM

About six months ago I wrote some thoughts of my own on the subject (http://jamesmckay.net/2007/11/is-it-time-to-kill-off-wikitext/), and I came to the conclusion that it was probably time for wikitext and markup languages to make way for rich text editors.

I still think that rich text editors could be more widely used on wikis and blogs and the like, but I must admit that there are some things that rich editors simply can't handle. Source code is the primary culprit. Finding a decent way of inputting source code on my own blog has been the source of a few headaches and in the end I resorted to writing my own WordPress plugin to handle it, and turning off the built in rich text editor entirely.

I somehow wonder if the two approaches could be combined with a bit of JavaScript jiggery-pokery. I'm quite impressed with the way it's done on the ASP.NET forums: you normally use a rich text editor, but you can click on a toolbar button to enter some plain text source code, and it somehow gets preserved when you post your response. It isn't perfect of course -- my main gripe is that there is no obvious way of telling where the source code blocks start and end -- but it's certainly an idea that's worth looking into.

Out of the four markup languages that you mention, personally I think wikitext has the cleanest syntax. I'd avoid BBCode in particular -- it is *very* heavily used by spammers -- you can block about 80% of spam messages simply by checking for URLs in BBCode format.
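
The BBCode check mentioned above could be as small as this. A sketch in Python, assuming the site itself does not accept BBCode, so any `[url]` or `[url=...]` tag in a comment is suspicious; the pattern and function name are illustrative, and the 80% figure is the commenter's own estimate, not something this code measures.

```python
import re

# Matches [url] and [url=http://...] in any letter case.
BBCODE_URL = re.compile(r"\[url(?:=[^\]]*)?\]", re.IGNORECASE)


def looks_like_bbcode_spam(comment):
    """Return True if the comment contains a BBCode-style URL tag."""
    return bool(BBCODE_URL.search(comment))


print(looks_like_bbcode_spam("cheap pills [url=http://spam.example]here[/url]"))
print(looks_like_bbcode_spam("see http://example.com for details"))
```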

I'd also disagree with your assessment of considering HTML to be more secure than the others. Yes you can have a whitelist, but as with any of the other formats, you need to be very careful with your parser to avoid canonicalisation vulnerabilities from jiggery-pokery with encodings and the like.

James McKay on May 15, 2008 6:32 AM

Why not use a WYSIWYG editor like FCKEditor? You can write HTML source in it or use the Word-like editor (which hides tags completely and is good for people who don't know or don't care about HTML).

mart on May 15, 2008 7:33 AM

Why don't the rich text editors just support the html <CODE /> or <PRE /> tags?

Sounds to me like that would be a better way to go.

Chris Lively on May 15, 2008 7:35 AM

I don't really care what choice you make as long as:

You have a live preview of the comment.

You have a clear list of the approved codes with examples.

Code doesn't wrap.

I can create the comment in my text editor and then paste it in the comment area.

CuriousRustColoredApe on May 15, 2008 7:52 AM

If you pick HTML I think you'll also need to study XSS some more, particularly if you have features that allow users to input HTML. I think they often make an easy target for an XSS attack.

I've seen numerous pro coders leave an XSS hole open. That's why XSS and phishing are so darn popular. Fire up any security audit tool and every one of 'em would offer XSS scans for a trial.

I just want you to be uber-careful with accepting HTML, that is...

chakrit on May 15, 2008 10:56 AM

I hate Markdown. The link syntax is the worst I've ever seen.

I still prefer BBcode for all of my formatting. It's too bad that not many things other than forums support it.

atomicthumbs on May 15, 2008 11:14 AM

One huge benefit of (most) lightweight markup languages is that they meet the behavior most people expect (or assume...) much better, e.g. they don't strip out whitespace or ignore (i.e. not link) raw URLs or interpret angled brackets etc. On the other hand some markup languages get too smart and start doing unexpected things...

So a suitable markup language for commenting on a public site (IMHO) handles normal, email-style formatting gracefully, provides a convenient shorthand for common, non-basic formatting, and allows (a subset of) HTML to be embedded (if you really need to be flexible).

Eric Jain on May 15, 2008 11:18 AM

LiveJournal took that approach, actually. Their in-page editor has two modes: "just HTML" and "convenient HTML". Both modes limit the HTML tags allowed to a list of about 20 or so. The "just HTML" mode does nothing to the HTML typed except strip out disallowed tags. The "convenient HTML" mode only does a couple of "so common you don't want to have to keep typing them over and over again" things, like replacing \n characters with br/ tags. They added a couple of their own convenience tags for commonly-desired functionality (lj-cut, to collapse long / annoying / spoiler sections of text; lj user="" to reference another LJ user; etc). Pretty simple, but works absolutely _great_.
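
To make the "convenient HTML" idea concrete, here is a small Python sketch of newline-to-br conversion that leaves `<pre>` blocks alone, since that is where automatic conversion tends to fight with source code. This is not LiveJournal's code; the function name and the regular expression are assumptions for illustration.

```python
import re

# One capturing group, so re.split keeps the <pre>...</pre> blocks in the result.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre>)", re.IGNORECASE | re.DOTALL)


def convenient_html(text):
    """Turn newlines into <br /> everywhere except inside <pre> blocks."""
    parts = PRE_BLOCK.split(text)
    # Even indexes are ordinary text; odd indexes are the captured <pre> blocks.
    for index in range(0, len(parts), 2):
        parts[index] = parts[index].replace("\n", "<br />\n")
    return "".join(parts)


print(convenient_html("first line\nsecond line<pre>x = 1\ny = 2</pre>done\n"))
```

The "just HTML" mode would simply skip this step and go straight to stripping disallowed tags.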

Thought on May 15, 2008 12:57 PM

I agree!!
Just use HTML!!!

If you cannot code a bit of HTML then you're in the wrong business...

Jonathan on May 15, 2008 1:04 PM

Wiki syntax. Ugh. I hate wiki syntax with a passion.

I've still yet to meet a person who has an easier time understanding wikisyntax vs HTML.

http://internetducttape.com/2007/09/12/wiki-mistakes-building-wikis-that-dont-suck/

engtech on May 15, 2008 1:40 PM

Also, you might want to look at the kind of software ecosystem that has risen around pasties (eg: http://pastie.caboo.se ). There should be a REST API for posting up code so that people can write plugins for their text editors, shell scripts for the console, etc.

engtech on May 15, 2008 1:53 PM

I've always been a fan of Textile for writing blog entries, since it's so similar to the posting style I have for email and the good old Fidonet days.

Having to pause to write long HTML tags seriously disturbs my flow of thought when I'm writing a blog entry. Textile works much better and doesn't get in the way of my thinking.

Johan Svensson on May 15, 2008 5:04 PM

Can I suggest, again, that the problem here is in the links? On everything else it really doesn't matter. Pick the one with the easiest links. BBCode is pretty straightforward, in my opinion, when it comes to links. Textile requires me to re-look up exactly what order things go in because it is exactly opposite of what I expect it to be. Markdown is similarly confused. BBCode is basically HTML but lighter and without all of the tag/property/class nonsense that you're not going to want to allow in your comments anyway. Wiki code wouldn't be bad, but unless you're talking about just internal links, I don't see it as being as intuitive as BBCode.

BBCode!! do it! you are convinced!!

Shmork on May 15, 2008 5:47 PM

I have had the exact same thoughts for years now!
At the Uni I work at, almost all people who get degrees there need to do Foundation Computing, which includes HTML. You've therefore got all coders, almost anyone with a degree, and anyone else that has had to learn it before (I personally had to learn HTML when I was 15 so I could edit a myspace-like page I had back in 2000). Whereas for any wiki language, you only have users of that particular site who will know it.

XTremeEd on May 15, 2008 8:02 PM

Pick something that works and go with it. I'm guessing that sooner rather than later, you're going to have more important things to spend your time on as you ramp up your company.

Bruce on May 15, 2008 8:48 PM

As there are already over a hundred comments before mine, this is pointless, but I'd definitely cast a vote in favor of a subset of HTML, and ideally a good RTE that can optionally be used. (The YUI rich text editor is a great choice. Those guys are maniacal about cross browser support.)

If you do support HTML, take a look at the way that Wordpress formats blog entries and comments. Line-breaks and paragraphs are inserted intelligently, "bad" things are stripped out, and everything else Just Works.

Isaac Z. Schlueter on May 15, 2008 9:26 PM

Again I repeat:
It's not about using HTML v2.0 as the w3 defines it. It's about using the HTML conventions of coding. Think about how many languages follow certain conventions, such as calling functions name(arguments) and arrays name[index]. And since the underlying code is HTML you want to show the best possible way how the transformation occurs.

Now some problems I thought about a bit: say that we have a <code> tag. Now say that there is an article that explains how to use the <code> tag. How would you write the following?
"
<code>
... (your code)
</code>
"
I mean, written literally, it would be coded something like
<code>
<code>
... (your code)
</code>
</code>

How would it know where the code ends?
But using a purely WYSIWYG editor doesn't solve it. A person could still "inject" malicious code, and make things like the above happen. Also the biggest problem is that "What You See is What You Get" != "What You See Is What You Want" which maybe should be more important. As programmers we'd feel a GUI as a limitation, like suddenly having to wear diapers to your high school graduation, it'd be not cool (unless that's your kinda stuff).
Using both doesn't solve anything either, it makes it worse, WYSIWYG editors interpret code one way, code is badly written if not by hand, etc. etc. etc.
It'd be easier to just have coding and a preview (maybe as you go) of the code. A little guide on the side. I mean, the target audience is programmers; it's OK to expect them to express their ideas better as code than as a bunch of clicks and selections on their code.

Charlie Lobo on May 15, 2008 9:56 PM

One thing not touched upon is what format you actually store this user-entered information in. That's important if you change your choice of markup language at a later date due to user feedback or security concerns.

ian_scho on May 16, 2008 1:29 AM

@Shmork:
> "BBCode is basically HTML but lighter and without all of the tag/property/class nonsense"

So if you don't have attributes then how do you specify what language your source code is in BBCode?
Surely you'd need something like [code language="csharp"]?
Knowing the language will be essential if we want syntax-highlighting.
Looking at the four humane samples from Jeff, only the Wikipedia-format apparently supported that.

@Charlie Lobo:
> "As programmers we'd feel a GUI as a limitation"

I'm not sure how only having bold, italic, code and pre HTML tags is any less limiting than being forced to use a GUI. And as many others point out, limitation is a good thing: the articles will have a more consistent layout.

Graham Stewart on May 16, 2008 1:50 AM

What's wrong with a friendly toolbar? Just because we're programmers doesn't mean we actually enjoy having to try to figure out where the help link is so we can learn what weird variant of 'make this word bold' markup you've picked.

Why are they irritating? Highlight text, click 'bold', job done.

Making me think to achieve something so trivial is irritating.

@Charlie Lobo
> As programmers we'd feel a GUI as a limitation, like suddenly having
> to wear diapers to your high school graduation, it'd be not cool
> (unless that's your kinda stuff).

What are we, persistent masochists? This isn't a programming text editor religious war - it's a way of putting a comment on a website.

izb on May 16, 2008 4:09 AM

Whatever you do, make it easy for users.

HTML is too much typing.

Stuff people have to look up is no good either.

On the other hand, the only thing people really *need* is links. Everything else they can do without if they can't figure out how to do it. If I *really* want some bold text, I'll figure out how to do it or maybe put it in CAPS or whatever. If it isn't worth finding out how to do it then I don't need it.

So if you want to make it easy for users enough to do some extra work, then:

1. Allow limited HTML, not enough to be easy to break things but enough to provide the formatting you want.

2. Allow something like Markup, notably *bold* _italics_

* bullets

And a few others. Many of these are things people were using for emphasis before.

3. Put a toolbar on your online editor that does stuff.

The more ways that work, the larger will be the minority of your users that are satisfied. Provided the different approaches don't get in each other's way. Nobody will use HTML by accident and nobody will use a toolbar by accident, so those are two good complementary approaches. I doubt Markup would be much trouble either.

Speaking for myself, Markup is best for every Markup command I remember, because it's a keystroke or two in the middle of my typing. No stopping for a mouse, and less bother than HTML. But of course people who already know HTML and don't know any Markup will find it harder. No doubt egyptian scribes who were used to hieroglyphics believed that was easier than writing in demotic.

J Thomas on May 16, 2008 4:30 AM

I guess I should mention, I learned how to use Markdown in a few minutes without ever seeing its name or discussing it with anybody. Far far easier than HTML and easier to use too, though of course limited.

I'm not sure I approve of this idea that nobody should ever learn to do anything effective that's new. If we'd had this attitude and this technology when the automobile was invented, we'd probably all be driving cars now with reins. You turn by pulling one or the other, you brake by pulling both, you start up by shaking them, and the car would have voice recognition circuits to respond to Gee Haw Giddyup etc. You'd switch to high gear by kicking a spot below the saddle.

J Thomas on May 16, 2008 4:57 AM

I agree, the word is definitely "subset" of HTML. You don't want every post you see to be in different fonts and sizes and ... Conform generally to a standard, like Wikipedia does; then you can make design changes if need be using your style sheets. Users generally like page formatting to remain the same.

Make sure there is a preview for the HTML. My favourite types are the ones like the http://www.w3schools.com/ code examples, where it keeps it on the same page so you can view your HTML and output at the same time.

pete on May 16, 2008 6:12 AM

While I really like the simplicity of LaTeX, my personal preference is for a GUI editor that only allows the basics: link, strong, italics, underline, ordered/unordered lists and, for sites targeting developers, a "code" style.

jake on May 16, 2008 6:58 AM

@J Thomas:
>"Nobody will use HTML by accident..."

On a programming website it is entirely possible that someone might want to actually post some HTML or XML. Or even run into issues with code in other languages: e.g. "if (a<b) "

>"On the other hand, the only thing people really *need* is links."

I would have thought that on a programming website the only thing people need is the ability to post their example source code without it getting mangled.

Graham Stewart on May 16, 2008 7:06 AM

+1 for HTML with no wysiwyg editor

Yes, there are good programmers that don't know HTML because they haven't needed to learn it. However, the programmers that can't be bothered to learn the maybe half-dozen or so tags that would be whitelisted on stackoverflow probably aren't going to get that much out of the site anyway.

Alex on May 16, 2008 7:49 AM

I am shocked. Comments usually flow minutes after each blog. This blog was posted 3 days ago and not one single comment?

Scot McPherson on May 16, 2008 9:56 AM

Oh geez, god love the cache.

Scot McPherson on May 16, 2008 9:57 AM

Some days I think the cache is actively working against me to frustrate me, so you are not alone, Scot!

Adam on May 16, 2008 10:17 AM

Whoa! Too many comments. Can't wade through them all.

I only have one concern if HTML markup is allowed (or any in-line markup is allowed).

a. Don't do smileys.

b. Provide some filtering that doesn't mistake use of &, <, and > for tags and entity references unless they pass more scrutiny than that. There are programmers who do not know HTML and when you are not thinking in HTML it is easy to use those characters without appreciating what is happening.

c. You do need to do something about explicit new-lines, but don't fall into the trap of having newlines break recognition of an element tag (e.g., between attributes or elsewhere) or an attribute-value string. This is difficult to get right.

d. <pre> should work.

I have mixed feelings about all of this myself. But if you go with allowing HTML (or another scheme with HTML mixed in) I think this guidance matters.

orcmid on May 16, 2008 11:19 AM

OK, I guess I have 4 concerns. Another one is that you have a filter on these comments that deletes explicit less-than and greater-than marks; so some of the previous comment has missing text and makes no sense. So when you say no HTML it would have been smarter to escape anything that looks like HTML (escape, less-than, and greater-than are the key characters) before splicing the comment into the page. It looks like you are filtering out the characters instead, which behaves badly on false positives, aye?
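A small Python sketch of the difference, assuming the filter works roughly like a "delete anything between angle brackets" regex (how this blog's filter actually works is not known, so `strip_tags_naively` is a hypothetical stand-in). Escaping keeps every character the commenter typed; stripping silently eats text on false positives.

```python
import re
from html import escape


def strip_tags_naively(comment):
    """Lossy approach (hypothetical filter): delete anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", comment)


def escape_for_page(comment):
    """Lossless approach: turn &, <, > and quotes into entities so the
    browser displays them instead of interpreting them."""
    return escape(comment, quote=True)


code_sample = "if (a<b) return; if (c>d)"

# Stripping treats everything from the first < to the next > as a "tag".
mangled = strip_tags_naively(code_sample)

# Escaping leaves the text intact once the browser decodes the entities.
preserved = escape_for_page(code_sample)
```

Here `mangled` comes out as `if (ad)`, while `preserved` is `if (a&lt;b) return; if (c&gt;d)`, which renders exactly as typed.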

orcmid on May 16, 2008 11:26 AM

"On a programming website it is entirely possible
that someone might want to actually post some HTML
or XML. Or even run into issues with code in other
languages: e.g. "if (a<b) "

True. So you have to be able to turn HTML off. Preferably, you should be able to turn it on if you want it.

>"On the other hand, the only thing people really *need* is links."

"I would have thought that on a programming website
the only thing people need is the ability to post
their example source code without it getting mangled."

I'm getting pedantic here, but given a choice between posting code with no links allowed, versus posting links to code plus whatever other links I want, I'd take the second.

But there's no reason to face that choice so it's silly of me to argue which is more important.

J Thomas on May 16, 2008 4:26 PM

Go WYSIWYG so I don't need to do the preview step. There are good options for WYSIWYG editors these days. Enable just the heading level, bullets and other structural elements so that pages remain clean and sans awful bright red giant text.

wioota on May 16, 2008 4:27 PM

So anyway, HTML should be fine provided you can turn it off. Or better yet, have it off by default and you can turn it on. So people won't write OVER < IF EXIT THEN and find out they were writing HTML and didn't know it.

J Thomas on May 16, 2008 5:10 PM

If you allow HTML, even with a carefully constructed whitelist, I guarantee someone will figure out how to do XSS.

HTML is just too complicated to secure.

Sean on May 16, 2008 5:13 PM

http://namb.la/popular/tech.html

Here, you can't secure html, no matter how smart you think you are. Better to escape it all.

Sean on May 16, 2008 11:47 PM

I vote for no styling. (BTW, the stackoverflow.com home page has 2 paragraphs, 3 <p> tags, no closing paragraph tags.)

Ross on May 17, 2008 4:33 AM

I have one iron-clad design guide: this is a site for programmers, so they should be comfortable with basic markup.

There you have it. It's a site for programmers, not average users. So understanding HTML is a valid requirement.

As for whitelisting certain HTML tags and forbidding others, I think that's a good idea. Of course TABLES will be a pain, but you can't have it all.

If the site uses XHTML, it should be possible to use XSL to transform the content into other formats with comparatively little effort. Just be sure the system outputs valid code. (For some reason I keep thinking I saw someone say the lightweight markup languages had the advantage here because they could be exported to LaTeX, etc. but now I can't find the comment.)

Is there any reason you can't provide a live preview of what the final page will look like? Maybe use JavaScript to update the content of an IFRAME or something.

matt on May 17, 2008 6:17 AM

@Ross: Actually you don't need </p> closing tags for valid HTML 4.01 Transitional.
But you do make a good point: even on that very simple page there are validation errors (e.g. the alt attribute is specified twice, some tags are improperly closed with />).

Apparently even those that can "sling code" and are "comfortable with basic markup" can make mistakes.

So if you do decide to enter HTML directly then not only will you have to filter it for XSS and blacklisted tags and provide character escaping, but you'd also have to run it through an HTML validator.

WYSIWYG FTW!

Graham Stewart on May 17, 2008 6:23 AM

I'm with the many people who want at least the option to use a lightweight markup language. Using HTML seems like elitism and a "because we can" mentality. I don't want to have to mess with ampersand codes and manual line breaks, I just want to write my comment. If I need special HTML code, you can give me the option to include it inline like in Markdown, but don't make everyone use it for all comments all the time.

Rory on May 17, 2008 2:37 PM

Wow.. stackoverflow.com will be some kind of oldschool site I'll possibly never ever use. There are reasons why nearly no one uses HTML for communities and why the web is evolving toward WYSIWYG solutions.

First: HTML with whitelists would be something you need to learn for one page only. The next page which tries something similar could have other whitelisted tags. Why should I think about how to write something on a site? For me it's about the content, and thinking about tags, code, comments etc. is just thinking about something which doesn't have much in common with the pure meaning of the comments.

Second: Why the hell should programmers know HTML? I know a lot of programmers who have nothing to do with the web. The only thing where HTML is needed for them is e.g. something like a chm file, and even for that they are using a WYSIWYG editor with the applied styles of the company they are working for.

Third: Why do you want the user to take care of conversions? You want someone to use your site. So take care of things in the back. As a customer I would fire a programmer who told me that I need to do something complicated just because of data conversion.

Fourth: If you want to go oldschool and forget everything about user friendly interfaces, why don't you just use a newsgroup?

Wolfgang on May 18, 2008 7:19 AM

@Graham Stewart
> So if you don't have attributes then how do you specify what language your source code is in BBCode?
> Surely you'd need something like [code language="csharp"]?

[code="csharp"]

BBCode allows tags to have a single attribute like that. Tags don't seem to have specific attributes, like you posted, but some tags can have an attribute specified on the tag name (like BBCode's url tag) for a specific effect or just have a general effect without having anything defined.

In the case of a custom [code] tag, having the attribute undefined could just be like a <pre> tag without any highlighting.
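A rough Python sketch of such a custom [code] tag, under a few assumptions of mine: the post is HTML-escaped first, the language name is limited to a small set of safe characters, and the output uses a `lang-` class prefix for some highlighter to pick up later. None of those details come from a BBCode standard.

```python
import re
from html import escape

# Matches [code]...[/code] and [code="lang"]...[/code] (quotes optional).
# The language is restricted to safe characters so it can go into a class name.
CODE_TAG = re.compile(
    r'\[code(?:="?([A-Za-z0-9_+#-]+)"?)?\](.*?)\[/code\]',
    re.DOTALL,
)


def _to_pre(match):
    lang, body = match.group(1), match.group(2)
    if lang:
        # "lang-" is a made-up class prefix for a syntax highlighter.
        return f'<pre class="lang-{lang}">{body}</pre>'
    # No attribute: behaves like a plain <pre>, no highlighting.
    return f"<pre>{body}</pre>"


def render_code_tags(post):
    """Escape the whole post, then turn [code] blocks into <pre> elements.
    Because escaping happens first, code bodies are displayed, never executed."""
    safe = escape(post, quote=False)
    return CODE_TAG.sub(_to_pre, safe)
```

So `[code="csharp"]if (a < b) {}[/code]` becomes a `<pre class="lang-csharp">` block with the less-than sign escaped, and a bare `[code]` falls back to an unstyled `<pre>`.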

Harley on May 18, 2008 7:25 AM

Are people saying that the problem with HTML is that no matter how you try you can't keep people from writing malicious HTML code that will annoy or damage other users?

If so, then you could let users turn off HTML or take their chances. And if they have HTML turned off but some other markup language turned on, then it would be good to translate a limited set of HTML to the other markup language. The malicious code that doesn't translate could be ignored or left intact for readers to puzzle over.

It makes some sense to allow a subset of HTML for people who want it, since they want it so much. But I'm starting to wonder -- people who won't spend 30 seconds to find out how to do things but insist that the layout be arranged in a complex inefficient pattern to suit what they're already used to -- would people like that actually be an asset?

Won't they tend to be the sort who do everything they can to delay progress? "We don't need any new programming languages because all real programmers already know C." Etc.

J Thomas on May 18, 2008 8:39 AM

@J Thomas
Um, the goal of malicious HTML is that it would be posted to affect anyone viewing it, not to affect the person writing it. And the goal for the forum system should be to prevent it from being posted so viewers don't get affected by any malicious HTML, whether or not they are registered. Registered users shouldn't have to protect themselves on someone else's forum, and what about unregistered users?

That's why many things that take text input from viewers, like forums and blogs, don't allow HTML. Like this blog.

Allowing HTML in a forum system means you have to parse every post with a whitelist to break any unallowed HTML and attributes, which may or may not be malicious, but since it's also a programming forum, you also have to have some way to detect HTML to post "as-is", so other users can see the HTML code and not have it actually parse.

Other systems, like BBCode, attempt to bypass that issue by just not allowing HTML at all, so users can only post with specific tags that can't be made into malicious HTML.



Personally, I wouldn't see what would hurt to have a pseudo-WYSIWYG system with the quick formatting bar that, when you click a button, visibly adds the tag to the text of your post (or puts the tags around whatever you have highlighted), which is something some vBulletin forums allow you to do (they call it the Standard Editor, which is basic text with the formatting controls).

Harley on May 18, 2008 10:02 AM

Harley, I think I got what you're saying but let me spell it out at great length to make sure.

If I put malicious HTML into a message then it potentially affects everybody who sees it. So the server side has to be carefully written so it allows proper HTML but erases all malicious HTML. Anything that gets past the special server code will affect all users.

You're talking like there's no way for the server to cooperate with users who don't want that to happen and who don't fully trust the server to stop it, who would rather simply have HTML disabled in messages sent to them.

Isn't this starting to sound like a traditional problem? We have a powerful system, and then we want to let random users use some of the power, but we can't trust all of them to use it wisely. And the least trustworthy of them are good at wiggling past the safeguards we set up to stop them. Lots of complexity for a zero-sum game, where we try to provide good power but not bad power to everybody and the bad guys try to find ways to abuse it....

J Thomas on May 18, 2008 7:17 PM

The problem with having an option for users to "disable viewing HTML in posts" is that it forces the server to do additional parsing when it is displaying posts instead of just plugging in the post content. And since viewing happens more often than posting, it's less work overall for the server to parse things out of a post when it's being posted instead of when it's being viewed.
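The "parse when posted, not when viewed" idea can be sketched in Python like this. The `render` function is a stand-in that just escapes everything, and `PostStore` and its `render_calls` counter are hypothetical names used only to show that viewing does no parsing work. Keeping the raw text alongside the rendered copy is my addition, so a post could be re-rendered if the markup rules ever change.

```python
from dataclasses import dataclass
from html import escape


def render(raw):
    """Stand-in renderer (hypothetical): escape everything, keep line breaks."""
    return escape(raw, quote=True).replace("\n", "<br>\n")


@dataclass
class StoredPost:
    raw: str       # what the user typed, kept for later re-rendering
    rendered: str  # HTML produced once, at posting time


class PostStore:
    def __init__(self):
        self._posts = {}
        self.render_calls = 0  # only here to make the cost visible

    def save(self, post_id, raw):
        # All parsing and escaping happens here, once per post.
        self.render_calls += 1
        self._posts[post_id] = StoredPost(raw=raw, rendered=render(raw))

    def view(self, post_id):
        # Viewing just hands back the stored HTML.
        return self._posts[post_id].rendered


store = PostStore()
store.save(1, "a < b")
pages = [store.view(1) for _ in range(3)]
```

Three views cost one render. A per-reader "HTML off" switch would push that work back into `view`.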

It also would likely dissuade users because it'd be a form of admitting that malicious code could be posted by another user and "you're responsible for your own safety on this forum". Which is the kind of thing that gets a site "This site may harm your computer" in Google searches.

Really, if formatting with HTML is desired, a whitelist is the best bet to prevent malicious HTML, if it's set up correctly, but it may leave very little in options for formatting, probably not much more than something like BBCode offers in the first place.

<hypothetical>Also, would you have to do your own line breaks or paragraph tags as HTML, or is that done for you (which isn't like writing HTML at all)?</hypothetical>

It'd probably also have to strip out things like inline JavaScript (like, oh, <a href="#" onclick="for(;;) { alert('Hello World'); }">Click me</a>, which is pretty tame as far as malicious code would go). In which case, it'd be even more like, say, BBCode, just with < and > instead of [ and ], so then, why not just use something that's established and just modify it slightly (and leave in the GUI editor style, for users that prefer it like that)?
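For the onclick case, an attribute allowlist is one way to do it. A minimal Python sketch, assuming only the `a` tag is allowed, only its `href` attribute survives, and only http/https links are kept (all three are my assumptions, and the names are made up). It is an illustration of the idea, not a vetted sanitizer.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical policy: which attributes survive on which tags.
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = {"http", "https"}


def is_safe_url(value):
    """Accept only absolute http(s) URLs; this rejects javascript: links."""
    try:
        return urlsplit(value).scheme.lower() in SAFE_SCHEMES
    except ValueError:
        return False


class AttributeFilter(HTMLParser):
    """Keeps allowed tags, drops every attribute not on the allowlist
    (onclick, style, ...), and escapes all text content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_ATTRS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS[tag] or value is None:
                continue
            if name == "href" and not is_safe_url(value):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        attr_text = "".join(kept)
        self.parts.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_ATTRS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))


def filter_attributes(text):
    parser = AttributeFilter()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

With this policy, an `onclick` handler disappears and a `javascript:` href is dropped, while an ordinary http link passes through unchanged.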

Harley on May 18, 2008 9:29 PM

I haven't read the comments above (just too much), but I think you should have a WYSIWYG editor, independent of the choice of markup language. OK, the contenteditable implementation results in HTML. But with some clever programming you can convert this to Textile/Markdown/etc.

This way you can offer users the choice between WYSIWYG (even most programmers prefer this) and hand coding (my favorite).

Doekman on May 19, 2008 2:28 AM

**************

Let them eat SYMANTEC markup, hence html.

Define some standard tags that have CSS classes ready; code, nsfw, idea, spellbee, quote, good, bad, stickittotheman, hippysource etc.

Add to the list as new groups of things become apparent.

Move with the content.

***************

Phil H on May 19, 2008 4:15 AM

<p class="spellbee sarc">
I think SYMANTEC&reg; might object :)
</p>

Graham Stewart on May 19, 2008 12:37 PM

How about using vi as the text editor?

http://gpl.internetconnection.net/vi/

Evan on May 19, 2008 4:00 PM

BBCode is significantly worse than HTML, I'll agree, and Wikipedia's is truly a brainfuck, but I'm hardpressed to see the problem with Textile and Markdown, seeing as both are handily leaning on Usenet 'formatting' conventions. Textile seems to lean on HTML tags, too.

It's also a little arrogant to think that all developers know HTML, although honestly if they know BBcode they know HTML.

Merus on May 19, 2008 11:53 PM

A good input box should provide both WYSIWYG and markup (either simultaneously or by switching).

WYSIWYG is so much easier to use, and it won't stupidly convert anything typed in.
The replacement for typing tags is to have explicit keyboard shortcuts.

Markup is more powerful as it shows what is not directly visible and allows pasting pre-built content.

Musaran on May 20, 2008 4:22 AM

Are you for real? 'Programmers' most important job is to get computers to do useful things for us, things that might have taken a long time or been error-prone to do manually (like typing code). I.e to get computers to work for us.

The GUI is the best thing since sliced bread for getting something done quickly and accurately. The eight simple buttons in your example GUI above will cater for 99% of everything you would need to do in this context. Forcing end-users to type a human message in a textbox using special coded tags for formatting is turning the clock back 20 years in IT.

GM on May 20, 2008 5:10 PM

I agree that HTML is not ideal but the best of a bad lot. The problem with all the other alternatives, nice as some of them are, is that none of them is a standard. I can't tell you how annoyed I am to find, every time I'm coaxed or forced into using a new forum/wiki/CMS/whatever, that it has its own syntax that overlaps enough to be confusing, but not enough to be useful.

Until there's an ISO or ANSI standard for one of these beasts, I don't see anything better than a subset of HTML.

Unless, that is, you're the one who can lock all the wiki developers into a large room and not allow any of them to go to the toilet until they have converged on a single markup language...

Robert Goldman on May 20, 2008 6:16 PM

Check out the extension to wikipedia syntax providing syntax directed source code highlighting in http://en.literateprograms.org/LiteratePrograms:How_to_write_an_article

Get such an approach to work with markdown! You'd have a great combination.

Plus, the noweb markup syntax for interweaving documentation and source code which renders formatted and can be downloaded executable/compilable is wowsah for your venture. I'm guessing.

wowsah!

malcook on May 20, 2008 9:11 PM

"I agree that HTML is not ideal but the best of a bad lot. The problem with all the other alternatives, nice as some of them are, is that none of them is a standard."

If HTML gave you a standard that let people do markup (bold, emphasis, maybe tabs, maybe headers, links, and code that isn't affected by the above) and didn't let people do weird malicious stuff, then this would be a different discussion.

We'd be talking about what we preferred, and developers would try to provide as many of our preferences as they reasonably could and let us choose among those. So you could have a menu: HTML Markdown Wikipedia BBCode Textile WYSIWYG vi Wordpad Applewriter. Pick your poison. Sheer personal preference, and they provide what they can.

But HTML is dangerous, and the question is how to reduce the danger. The consensus is to white out everything except a small subset of HTML that you want users to have. Another approach is to spike the guns -- turn other HTML code into something that displays and isn't executed, and people can see whether they think it's malicious. But then it looks ugly and people who cut-and-paste it might convert it back into a dangerous form. (If programmers can post sample HTML code that doesn't get executed, which we might want to do, that could also get executed later by people who copy it.)

If you don't give users direct access to HTML, that's easier and safer than trying to take it away from them after they use it.

So, if there's no direct access to HTML the choices are to provide plaintext (which is good enough for me, but some readers will go elsewhere). Or provide one or more lightweight editors like Markdown. Or maybe also provide WYSIWYG. The original author said he wouldn't do that last. Apparently he wanted to weed out people who insisted on markup but who weren't willing to use a markup language. It's one way to weed out users. If he only wanted people who knew cryptography he could require users to solve a simple cryptogram before they posted. Like that.

So, the way you get a standard is you provide the best mix of features in your opinion, and over time people will tend to agree about what should always be provided, and they'll get together and make a standard. Make your standard too early and it will standardise the wrong things. Make it too late and everybody already knows what they need. If it isn't time yet for a standard, why go with an HTML standard that doesn't fit the need?

J Thomas on May 21, 2008 7:24 AM

@ Jeff

> The key words, though, are "subset of .. HTML" -- along with inferring the paragraph tag. I find that <b>, <i>, <code>, <pre>, <blockquote>, <li>, etc are simple to use and don't obscure the underlying content.

Be aware that <b> and <i> are saying something about how the content is formatted. One should use <strong> and <em> - the semantic equivalents.

Troels Thomsen on May 26, 2008 2:01 AM

One should learn to encode tags.

My point is, that <em> should be used instead of <i> - and <strong> instead of <b> - generally.

Troels Thomsen on May 26, 2008 2:02 AM

Why don't you use something similar to LaTeX? You can define some basic commands, and let the users (programmers, supposedly) write their favorite macros. LaTeX syntax is, I believe, very simple and content-oriented. Furthermore, once the programmers have mastered the basics, they have a very powerful tool to express themselves without worrying the admin about breaking the layout. Furthermore, the thing that I hate the most in HTML (XML in general) is its wordiness and emphasis on presentation, both of which we don't really need. Last point: programmers usually strive to use the right tool for the right task. If a website by programmers for programmers uses the wrong tool (HTML) for the task (expressing ideas), is that not a bit wrong?

Lam Luu on June 1, 2008 7:16 AM

I read this in my RSS Reader so I decided this before reading the comments above, but I just want to say - after quickly eyeballing the four cases, it's pretty clearly Markdown that is the most 'humane' of all the above.

I definitely would rather write my text comments in Markdown than HTML.

The single first problem with 'pure html' source editing is the carriage returns. In a text box - like for this comment, having to add carriage returns by putting everything into <p></p> tags gets really, really old fast. The next thing is you want URLs to be automatically turned into links... once you start down that path you're just going to keep going. You might as well face it and use a standard 'easy' markup format - like Markdown. HTML is a *lot* harder than Markdown any which way you look at it. Even for programmers.
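Those first two conveniences (automatic paragraphs and automatic links) fit in a few lines of Python. This is only a sketch with assumed rules: a blank line starts a new paragraph, a single newline becomes a line break, and anything starting with http:// or https:// up to the next whitespace is a link. Trailing punctuation after a URL is not handled.

```python
import re
from html import escape

# One capture group, so re.split() returns text and URLs alternately.
URL = re.compile(r"(https?://[^\s<>\"']+)")


def linkify(raw_line):
    """Escape a line of plain text and wrap bare URLs in <a> tags."""
    pieces = URL.split(raw_line)  # odd indices hold the matched URLs
    out = []
    for index, piece in enumerate(pieces):
        safe = escape(piece, quote=True)
        if index % 2:
            out.append(f'<a href="{safe}">{safe}</a>')
        else:
            out.append(safe)
    return "".join(out)


def auto_format(comment):
    """Blank lines separate paragraphs; single newlines become <br>."""
    blocks = re.split(r"\n\s*\n", comment.strip())
    rendered = []
    for block in blocks:
        lines = [linkify(line) for line in block.split("\n")]
        rendered.append("<p>" + "<br>\n".join(lines) + "</p>")
    return "\n".join(rendered)
```

A commenter who just types two paragraphs and pastes a URL gets `<p>` tags and a working link without writing any markup.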

Miles Thompson on June 4, 2008 8:34 PM

What's up, really? You can no longer read plain text? Are your eyes burnt by TV, comics, flash sites and colourful lights that you can't stand good old black on white? Did you all forget that content is why something is written? I LOVE old sites with a lot of information made by someone who THINKS and hate those new flash-bling-rolloverme-highlight-popup-annoythehelloutofme web2.0 "sites".

Clever guy on June 6, 2008 1:22 PM

I am sorry I don't have the stamina to read all the comments to make sure someone else hasn't said this before, but here's a suggestion. HTML, BBcode, RST, Textile or MediaWiki?

Solution: all of them!

Really.

Put the burden of syntax on the code, not on the community.

For example, triple single-quotes make bold in MediaWiki, double asterisks at the start and end make bold in Textile, [b][/b] in BBCode and <b></b> in HTML. Upon encountering ANY of those, give bold to your poor commenter.
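
A rough Python sketch of that "accept any of them" rule, covering only the four bold spellings listed above. The names `BOLD_PATTERNS` and `render_bold` are invented for the example. It also assumes the input is HTML-escaped first, so the only tags in the output are the `<b>` pairs the renderer itself emits.

```python
import html
import re

# One pattern per dialect; each captures the text that should become bold.
BOLD_PATTERNS = [
    re.compile(r"'''(.+?)'''"),                               # MediaWiki
    re.compile(r"\*\*(.+?)\*\*"),                             # Textile
    re.compile(r"\[b\](.+?)\[/b\]", re.IGNORECASE),           # BBCode
    re.compile(r"&lt;b&gt;(.+?)&lt;/b&gt;", re.IGNORECASE),   # HTML, after escaping
]


def render_bold(text: str) -> str:
    """Escape the input, then turn every recognised bold syntax into <b>."""
    # quote=False leaves apostrophes alone so the MediaWiki pattern can match.
    out = html.escape(text, quote=False)
    for pattern in BOLD_PATTERNS:
        out = pattern.sub(r"<b>\1</b>", out)
    return out


print(render_bold("'''wiki''' **textile** [b]bbcode[/b] <b>html</b>"))
```

Unmatched or unknown markup simply stays visible as escaped text, a safe fallback for the cases the heuristics do not cover.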

Inconsistency, I know. But it need not be as big a problem as it might at first seem. Provide for the common case and use heuristics to decide the strange formatting. There are lots of shared assumptions between all the markups. Covering what is actually common ground will get you a long way.

In the cases where the various semantics actually diverge, use as many tricks as you can to extract what you can from that.

The bottom-line case, obviously, should be to render whatever you're given as pure text -- with all the assumed markup noise removed.

Also, the precise rendering of the different markups should change with time according to trends you perceive in the data! Yes. For example, if at first you assumed that lines starting with blanks were a common-sense way to present a blockquote, but you start to see more and more people using them for code, change it! Or develop a test where the system tries to find common programming constructs within and, if any are found, assumes the block to be code. Or even more clever ideas. It might break some comments, in a way, but it will still be an overall working system. Along the lines of worse-is-better.
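
One way such a test could look, as a deliberately crude Python sketch. The list of "common programming constructs" in `CODE_HINTS`, the 50% threshold and the function names are all assumptions made up for this example, not a tuned classifier.

```python
import re

# Hypothetical hints that a line is code: braces, a trailing semicolon,
# comparison operators, something that looks like a call, or a line that
# starts with a common keyword.
CODE_HINTS = re.compile(
    r"[{}]|;\s*$|==|!=|\w+\([^)]*\)|^\s*(def|class|return|function|var|void|int)\b"
)


def looks_like_code(lines, threshold=0.5):
    """Return True if at least `threshold` of the non-blank lines hit a hint."""
    non_blank = [line for line in lines if line.strip()]
    if not non_blank:
        return False
    hits = sum(1 for line in non_blank if CODE_HINTS.search(line))
    return hits / len(non_blank) >= threshold


def classify_indented_block(lines):
    """Decide how to render a block whose lines all start with blanks."""
    return "code" if looks_like_code(lines) else "blockquote"
```

Changing the hint list or the threshold as the data comes in is exactly the kind of tuning over time described above.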

Also, some of the code entered by commenters might be altered. I know, "don't mess with my code!"; I also feel like this. But for instance -- storing data in HTML might become cumbersome, so you change entered tags to a basic common-ground markup. (I am somehow a fan of TXT, so I think data should not be stored in an assumed structured way if it is not dealt with ONLY in structured ways, and a simple text editor or text area is not structured, but that is just me and my manias.) Common mistakes could be corrected. Obviously this has a major risk of screwing up, but I also think with proper care it can be a least common denominator that allows things to just get done.

You can provide a basic recommended syntax (with what is most common-ground) to allow for "predictability" for users who really want to be in control -- and for those there is always HTML which is clearly specified anyway.

Also, notice that most of the systems out there already have this "interbred" syntax approach I'm putting forward -- at least by making double newlines equal a paragraph even if all of the rest of the syntax is HTML.

Finally, take all the code, let it mature by throwing lots of data at it (as you will probably have when your site opens up), then isolate it as a library and open-source it. That would be great.

I would say just my 2 cents, but I guess that was quite a big comment, huh?

MarcioRPS on June 10, 2008 11:50 AM

Count yourself lucky that you only have to deal with programming specific markup.

Mathematicians would need math markup as well:
1) MathML (no)
2) MediaWiki markup (somewhere between never and acceptable)
3) custom math markup --> LaTeX --> image (a little better than 2). There's a PHP file to let you do that.

Chemists would have it very bad, especially in organic chemistry.

BTW: where's the preview button?

Timbo on June 12, 2008 3:04 AM

Since this is a site for programmers, there is no need for any pseudo-markup language which doesn't do anything other than be typed and then parsed, which adds a) load to the server and b) load to the brain (even if such languages are simple, you need to look up how to do what you want, while most would already know HTML)...

I used to post on a question/answer platform (for general questions about nearly everything) which used something like that... To this day I don't know how I should make a list there... sometimes it works; other times the list items are just all written one after another, separated by an asterisk (*), which should (but doesn't always) create a new item in the list. This is very odd and doesn't make someone look very competent when answering a web-related question ;-)

If you wanted to use an editor (you do not, I understood that and I understand why), I would recommend the editor MarkitUp! It offers great features (for example, you can tab right out of the generated markup: just write what should be inside and press Tab).

It is very user-friendly, easy to understand and doesn't do any pseudo WYSIWYG shit.
And you can use it for whichever markup language you want.

Alex

Xel on July 17, 2008 3:52 PM

HTML is not a human language, but Arabic numerals are not user-friendly either. 5 does not look like ::.

Using Markdown, Textile or Wikipedia markup instead of HTML is ridiculous.
Only AGCODe is needed, for security reasons.

fake watches on November 4, 2008 1:58 AM

One solution to this age-old problem is a WYSIWYM editor that I'm developing at http://dockion.creationix.com/.

It lets you do the layout in a GUI fashion, but the content is in something content-oriented like Markdown, with toolbars to help when you can't seem to remember the right syntax.

WMD is a great example of how my Markdown editor will eventually end up.

Tim on December 6, 2008 9:04 PM

Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.