Mixing Oil and Water: Authorship in a Wiki World

February 2, 2009

When you visit Wikipedia's entry on asphalt, you get some reasonably reliable information about asphalt. What you don't get, however, is any indication of who the author is. That's because the author is irrelevant. Wikipedia is a community effort, the result of tiny slices of effort contributed by millions of people around the world. The focus is on the value of the aggregated information, not who the individual authors are.

But who is that community? According to Jimmy Wales, most of the work on Wikipedia is done by a tightly knit Gang of 500:

Wales decided to run a simple study to find out: he counted who made the most edits to the site. “I expected to find something like an 80-20 rule: 80% of the work being done by 20% of the users, just because that seems to come up a lot. But it’s actually much, much tighter than that: it turns out over 50% of all the edits are done by just .7% of the users … 524 people. … And in fact the most active 2%, which is 1400 people, have done 73.4% of all the edits.” The remaining 25% of edits, he said, were from “people who [are] contributing … a minor change of a fact or a minor spelling fix … or something like that.” 

Stack Overflow has some wiki-like aspects, and even my limited experience with the genre tells me this claim is implausible. Aaron Swartz ran his own study and came to a very different conclusion:

I wrote a little program to go through each edit and count how much of it remained in the latest version. Instead of counting edits, as Wales did, I counted the number of letters a user actually contributed to the present article.

If you just count edits, it appears the biggest contributors to the Alan Alda article (7 of the top 10) are registered users who (all but 2) have made thousands of edits to the site. Indeed, #4 has made over 7,000 edits while #7 has over 25,000. In other words, if you use Wales’s methods, you get Wales’s results: most of the content seems to be written by heavy editors.

But when you count letters, the picture dramatically changes: few of the contributors (2 out of the top 10) are even registered and most (6 out of the top 10) have made less than 25 edits to the entire site. In fact, #9 has made exactly one edit — this one! With the more reasonable metric — indeed, the one Wales himself said he planned to use in the next revision of his study — the result completely reverses.

Insiders account for the vast majority of the edits. But it's the outsiders who provide nearly all of the content.

Satisfying the needs of these two radically different audiences – the insiders and the outsiders – is the art of wiki design. That's why, on Stack Overflow, we mix oil and water:

  1. There's a strong sense of authorship, with a reputation system and a signature block attached to every post, like traditional blogs and forums.
  2. Once the system learns to trust you, you can edit anything – and we sometimes switch into a mode where authorship is de-emphasized to focus on the resulting content, like a wiki.

I'm not sure mixing these opposing elements would work for a project on the scale of Wikipedia. But I think it works for us (and when I say us, I mean programmers) because it's analogous to the version control system baked into the DNA of every programmer. Communal ownership is all well and good, but sometimes you still need to know Who Wrote This Crap. Authorship matters, ownership matters – and yet there's still something bigger, a larger goal we're all working toward, that trumps any individual contribution we might make. Both elements are in play.

Still, we absorbed a lot of tension with this design choice, because authorship and wiki are fundamentally opposing goals. How do you balance self-interest (vote for me) with selflessness (vote for this content)? Sometimes it breaks down. There's a rough area around the edges where these two systems meet. For example, consider the Stack Overflow question titled Significant new inventions in computing since 1980.

Stack Overflow post from Alan Kay

If you knew this question was from Turing Award winning computer scientist Alan Kay, would it change the way you reacted to it? Of course it would!

But you'd never know that, because our wiki signature block only tells you:

  1. The last editor (Out Into Space)
  2. How many revisions there have been to this question so far (5)
  3. How many users have created those revisions (4)

It's a lot of information, by typical wiki standards. Who cares who wrote the question, as long as it's a good question, right?

But that doesn't entirely work; we also need to know who the primary author is, because that information will color and influence our responses to the question. I'll grant you this is an extreme example; no disrespect to my fellow programmers, but you haven't won a Turing Award. Even in more typical cases, attaching authorship matters. It lets us know who we're talking to, what their background is, what their skills are, and so forth. Furthermore, how can you possibly form a community when everyone is a random, anonymous contributor?

So the challenge, then, is tracking authorship – strictly for informational purposes – across a series of edit revisions. Jimbo erred in tracking only edit counts. Aaron used Python's difflib.SequenceMatcher.find_longest_match to establish ownership across revisions. This is the basic technique visualized in IBM's History Flow.
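To make the letter-counting idea concrete, here's a minimal sketch of character-level attribution across revisions. It is not Aaron's actual program; the function name and the (author, text) input shape are assumptions for illustration. It uses difflib's SequenceMatcher.get_matching_blocks, which in CPython is built on repeated find_longest_match calls: text that survives from the previous revision keeps its earlier owner, and anything new is credited to whoever saved the revision.

```python
from collections import Counter
from difflib import SequenceMatcher


def surviving_letters(revisions):
    """Count how many characters of the final text each author contributed.

    `revisions` is a chronological list of (author, text) pairs; this input
    shape is an assumption made for the sketch.
    """
    owners = []    # owners[i] is the author credited with character i
    previous = ""
    for author, text in revisions:
        matcher = SequenceMatcher(None, previous, text, autojunk=False)
        # Start by crediting every character to the current editor...
        new_owners = [author] * len(text)
        for block in matcher.get_matching_blocks():
            # ...then hand back the characters that were simply kept.
            new_owners[block.b:block.b + block.size] = (
                owners[block.a:block.a + block.size]
            )
        owners, previous = new_owners, text
    return Counter(owners)
```

An editor who only deletes text ends up with no letters in the final version, however many edits they made. That is exactly the gap between counting edits and counting content.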

Imagine a scenario where three people will make contributions to a Wiki page at different points in time. Each person edits the page and then saves their changes to what becomes the latest version of that page.

History Flow animation

History Flow connects text that has been kept the same between consecutive versions. Pieces of text that do not have correspondence in the next (or previous) version are not connected and the user sees a resulting "gap" in the visualization; this happens for deletions and insertions.

It's very cool when applied to larger inputs; see history flow visualization of the Wikipedia entry on evolution.

Now, the differencing of text is, in itself, not exactly a trivial problem. I started by examining the Levenshtein Distance, but this algorithm is truly brute force. See if you can tell why, in this visualization of the Levenshtein distance between "puzzle" and "pzzel":

levenshtein distance example: puzzle and pzzel

The Levenshtein distance is a measure of how many insertions, deletions, or substitutions are required to transform string A into string B. The larger the number, the more different the strings are. We're comparing two strings essentially letter-by-letter, which means the typical cost is O(mn), where m and n are the lengths of the two strings we're comparing. That's why you typically see Levenshtein used for comparing words, not anything on the order of paragraphs or pages.
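For reference, here's a minimal sketch of the textbook dynamic-programming formulation, keeping only two rows of the table at a time. It isn't the implementation I benchmarked, just an illustration of where the O(mn) comes from: every letter of one string gets compared against every letter of the other.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # previous[j] holds the distance between the first i-1 letters of a
    # and the first j letters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i letters of a into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]
```

For "puzzle" and "pzzel" this returns 3: one deletion for the missing "u", plus two substitutions to swap the trailing "le" into "el".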

I played around with Levenshtein for a while, but even optimized implementations are brutally slow as the size of the input increases. I quickly realized that a line-based comparison was the only workable one. We used this C# implementation of An O(ND) Difference Algorithm and its Variations (pdf).

What I ended up implementing was nowhere near as thorough as IBM's history flow, although it's probably similar to the rough metrics Aaron used. I simply sum the total size of all line contributions (insertions or deletions) from any given author in a revision, with a small bonus multiplier of 2x for the original author. We report the highest percentage of authorship in the final revision.
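Here's a rough sketch of that metric in Python. It's an approximation for illustration, not the production code: it swaps in difflib for the C# O(ND) diff, the function names and the (author, text) input shape are made up, and it assumes the 2x bonus applies to everything the original author contributed.

```python
from collections import defaultdict
from difflib import SequenceMatcher

ORIGINAL_AUTHOR_BONUS = 2  # the 2x multiplier described above


def authorship_shares(revisions):
    """Map each author to their share of the total line-level changes.

    `revisions` is a chronological list of (author, text) pairs; this shape
    is an assumption for the sketch.
    """
    scores = defaultdict(float)
    original_author = revisions[0][0]
    previous_lines = []
    for author, text in revisions:
        lines = text.splitlines()
        matcher = SequenceMatcher(None, previous_lines, lines, autojunk=False)
        for tag, a1, a2, b1, b2 in matcher.get_opcodes():
            if tag == "equal":
                continue
            # Deleted lines and inserted lines both count as contribution.
            scores[author] += sum(len(line) for line in previous_lines[a1:a2])
            scores[author] += sum(len(line) for line in lines[b1:b2])
        previous_lines = lines
    scores[original_author] *= ORIGINAL_AUTHOR_BONUS
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()} if total else {}


def primary_author(revisions):
    """Return (author, share) for the largest contributor."""
    shares = authorship_shares(revisions)
    return max(shares.items(), key=lambda item: item[1])
```

Because the comparison is per line, a one-character fix is charged the full length of both the old and the new line, which is part of why the numbers are only an approximation.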

Alan Kay stackoverflow post wiki signature

The line-based diff approach for determining authorship is far from perfect; it'd be more accurate if it was per-word or per-sentence. But it's a fairly good approximation in my testing.

And most importantly, wiki posts by Alan Kay look like they're from Alan Kay.

Posted by Jeff Atwood
71 Comments

I've always hated the idea of anonymous authorship. Wiki is a cabal and no one has to be held accountable or responsible.

PaulG on February 3, 2009 1:06 AM

I still don't particularly like people changing my posts - It only takes a 1% edit to make an answer completely wrong.

HB on February 3, 2009 1:09 AM

Interesting analysis on authorship stats, but it's important to remember that Wikipedia encourages detailed references. Authorship really is unimportant for content that cites non-wiki references. Any content that does not should be taken with a grain of wiki salt, knowing that the vast majority of it is correct.

Matt Wiseley on February 3, 2009 1:35 AM

Agreed. Wikipedia tends to tout whatever is popular as fact. Sometimes it is... oftentimes it is not.

Practicality on February 3, 2009 1:36 AM

Isn't the problem of computing a very good approximation of the minimal set of differences on text, source code, and other data, already quite elegantly solved by the *nix-like diff tool? (and by rsync, which also uses a similar algorithm.)

Steven on February 3, 2009 1:55 AM

I simply sum the total size of all line contributions (insertions or deletions) from any given author in a revision, with a small bonus multiplier of 2x for the original author. We report the highest percentage of authorship in the final revision.

Yikes. So if I make a spelling correction in each line, I'm 100% the author?

And I'd hate to see my name (even with a percentage) next to something I didn't 100% write.

Crappy Taculus on February 3, 2009 2:03 AM

I wonder if the authorship percentage could be calculated by the weighted change in stemmed words in the article for each revision...

Reminds me of how one might 'fingerprint' audio streams...
the order isn't nearly as important as the frequency analysis...
(I always thought fingerprinting took into account how the song progressed, but it doesn't seem to)

Eric on February 3, 2009 2:22 AM

I really wish I could punch people that post first :)

Chris on February 3, 2009 2:25 AM

Nice feature, but I have to wonder about one thing: why is it inconsistent?

I search the term Alan Kay for examples.

Example 1: http://stackoverflow.com/questions/432922/significant-new-inventions-in-computing-since-1980
community wiki
5 revisions, 4 users
Alan Kay 76%

Example 2: http://stackoverflow.com/questions/58640/great-programming-quotes
community wiki
9 revisions, 7 users
epatel (82%)

Example 3: http://stackoverflow.com/questions/359877/are-there-famous-developers-using-stackoverflow
community wiki
9 revisions, 5 users

How come the first example has no parentheses, the second example has them, and the third example doesn't even have the user/percentage?

configurator on February 3, 2009 2:53 AM

I think this is the first time that I disagree with you.

Didi on February 3, 2009 3:09 AM

Please don't call SO a wiki. It's a forum with community editing features.

wiki does not imply community editing.

'From Wikipedia, the free encyclopedia' on every wiki page .. get it .. encyclopedia. The 'community editing' set it apart making it an encyclopedia by the people for the people.

SO is not an encyclopedia. It's a bunch of opinionated programmers. Don't get me wrong, it's fine with what it is, but it is NOT a wiki and will never be.

I for one thought there was going to be a q/a section and a wiki 'reference' section so that SO became the wiki for programmers. I was sorely disappointed. Allowing people to edit other people's posts is just that, allowing them to edit other people's posts.

Nothing on a wiki is personal, and that's the way it should be. As much as I like Alan Kay, I don't care if he said something interesting, or it was some kid in India, or a WoW addict.

SO is a game. Write some stuff, get rewarded, show off, etc.

pumpitup on February 3, 2009 3:14 AM

Looks good except for when it shows up in your search results:

http://stackoverflow.com/search?q=Significant+new+inventions+in+computing+since+1980

It is cut off.

Bryant Likes on February 3, 2009 3:21 AM

We at swarmforce are attempting to solve this problem with swarm ai. Our first product was debates, and it was tough, but we particalized data, handled revisions and corrections, edits, etc, and assigned each person a contribution percentage (and performance index we call karma) all using swarm ai. Our article product is in development and should be out soon (we also have a twitter product tackling tweet noise, called swatter). There are a bunch of companies popping up all trying to solve the same problem - too much noise on the net with not enough quality and authorship.

Court on February 3, 2009 3:23 AM

pumpitup, wiki doesn't mean encyclopaedia. It means something nearer to website with simple low-overhead collaborative editing. The fact that some people say wiki when they mean Wikipedia doesn't change that.

I don't know whether Stack Overflow is in fact a wiki; I've been there maybe twice ever. But if it isn't, the reason isn't because it's not an encyclopaedia.

g on February 3, 2009 3:48 AM

The problem with wikipedia is that many of the most active registered users (those with the most edits, not content) believe they own wikipedia. When a new person adds valuable content, these registered users come in and delete or modify what was written as if to take credit. Then the contributor has to fight to include valuable information and the registered users falsely say that the contributor is trying to claim ownership of the article.

It's a frustrating exercise and why there are so many contributors that never return.

Vorlath on February 3, 2009 5:34 AM

How unexpected. A genuinely interesting contribution.

James A. on February 3, 2009 6:53 AM

For those who might be interested in really efficient differencing algorithms, the strategy used by rsync is actually very interesting, and understandable with only a general background in hashing functions and some exposure to developing algorithms.

Here's the thesis written by Andrew Tridgell (the guy who put together rsync in the first place):

http://www.samba.org/~tridge/phd_thesis.pdf

A CS PhD thesis that a regular coder can read and understand!


I've used this idea of a rolling checksum in some of my own apps (differential backup, for example), and it's remarkable how well they work. Rsync uses large block sizes because of network latency, but you can get a very tight diff by using small block sizes if you have local access to the files.

Kevin on February 3, 2009 7:38 AM
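The rolling checksum idea Kevin mentions can be sketched in a few lines. This is a simplified illustration in the spirit of the weak checksum from the rsync thesis, not rsync's actual code; the function names and the 2^16 modulus are assumptions for the sketch. The point is that sliding the window one byte to the right costs constant time instead of re-summing the whole block.

```python
MOD = 1 << 16  # assumed modulus for this sketch


def weak_checksum(block: bytes):
    """Compute the (a, b) pair for a block from scratch."""
    a = sum(block) % MOD
    # Earlier bytes get larger weights: len(block) for the first, 1 for the last.
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte: drop out_byte on the left, add in_byte on the right."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b
```

A diff tool built on this would roll the window across one file and look up each (a, b) pair in a table of block checksums from the other file, confirming candidate matches with a stronger hash.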

The thing that differs between SO and a normal wiki is that the status of ownership has a completely different meaning.

In a place like Wikipedia you want to create listings of ideas and define them. Thus, repeated edits and changes to the original post are refinements on the original idea. However I would be very surprised to find that there is a strong correlation between the author of the idea and the original author of the article. Thus the original poster is just the first in a long string of refinements (at least one hopes so) that should converge on the most correct definition.

In SO however, the original poster asked a question, thus he has a vested strong interest in what will be the answers provided. Also, edits will be mostly to correct errors, or rephrase the question so it is better understood but must remain in the spirit of the original, otherwise it is a different question. Thus the original author, however badly worded his question was, should always be present as the author of the question. Not so much as a token of ownership.. but as a token of interest. Then, if you wish, you could create a metric as to the largest contributor to the question.

Thus I think changing the signature at the bottom when one does an edit (whether it be typos or a complete rephrase) will hide the original person who asked the question. If I can make a parallel for a classroom, where a student would ask a question, of course the teacher will address the whole class in answering this question as he/she knows full well that if one student asked it, 10 others are just burning to ask it as well. However, even if another student added to the original question, a good teacher will always return to the first one who asked and ensure that the question was answered to his satisfaction. I find that not doing so is a disrespect to the student who dared ask it.

The same goes for SO: although questions benefit the whole community, one must never lose sight of who asked the question in the first place; after all, of all the people interested in the answer, he surely is the one who wants the answer the most.

The answers, however, are a different game; they are more like wikis in some regard as the goal here is to provide the best possible answer. Thus it should be encouraged to modify an answer rather than creating a new one, thus creating the convergence effect of a wiki. Ownership tokens here are not as important and thus, the metric could simply be the person whose contribution was the largest according to some metric. Maybe have different metrics to measure different aspects of contributions, however the original person who answered is, in my opinion, more like the original poster of a wiki article, just the one who submitted a good draft to work on.

anyways... my 2 cents on the subject


Newtopian on February 3, 2009 7:39 AM

It doesn't make sense if you edit your own post

http://stackoverflow.com/questions/509580/multi-line-pl-sql-command-with-net-oraclecommand

56%?

Dave on February 3, 2009 8:50 AM

first!

gothael on February 3, 2009 9:07 AM

This site is almost completely unreadable without cleartype (windows XP/Opera10)
I hate cleartype, it makes everything look slightly blurred - it's bad enough that everything in software is slightly blurred without the text looking like that.

mgb on February 3, 2009 9:25 AM

mgb, uninstall the C fonts (typically installed with Vista or Office 2007) if you don't want ClearType. They're designed for ClearType and will *never* look good on any system without ClearType enabled.

The stylesheet defines fallback fonts but you aren't seeing them because you have these C fonts installed.

Jeff Atwood on February 3, 2009 9:28 AM

@mgb: just fix your ClearType settings using the Microsoft Clear Type Tuner:
http://www.microsoft.com/typography/cleartype/tuner/step1.aspx

It doesn't look blurry if you have it set up right.

Graham Stewart on February 3, 2009 9:40 AM

The history flow visualisation is quite interesting.

I assume it breaks down when large sections of text are being moved in an article though?
(e.g. if I decide to re-order the sections of an article, without re-wording any of it, then I am actually performing a fairly minor edit - but it would look massive on the history flow)

Graham Stewart on February 3, 2009 9:44 AM

Nice idea, but I question the reasoning behind it.

If you knew this question was from Turing Award winning computer scientist Alan Kay, would it change the way you reacted to it? Of course it would!

No, it wouldn't. A good question is a good question, and whether or not I answer it is not going to be influenced by who wrote it. The only exception I can think of is if I personally know the author - but squealing fanboyism for someone famous isn't going to play a part.

Put it another way - if Alan Kay asks a question that I have no interest in answering, I'm not going to change my mind just because it's Alan Kay. If Joe Bloggs from Mundaneshire asks a really interesting question, I'm not going to ignore it just because it's Joe Bloggs from Mundaneshire.

Russ on February 3, 2009 9:55 AM

I knew before you told us that the post was from Alan Kay...because he signed his name at the bottom.

hmmmm on February 3, 2009 10:00 AM

Thanks Jeff - I had to boot into safe mode to remove them/don't you just love .msi.

Graham - thanks that did help a little. But Proggyclean looks bad in visual studio, I'm trying a few cleartype programming fonts.

mgb on February 3, 2009 10:03 AM

Put it another way - if Alan Kay asks a question that I have no interest in answering, I'm not going to change my mind just because it's Alan Kay. If Joe Bloggs from Mundaneshire asks a really interesting question, I'm not going to ignore it just because it's Joe Bloggs from Mundaneshire.

That may be true, but important and powerful ideas probably means something different to Alan Kay than it does to Joe Bloggs. The identity of the asker has the potential to, in effect, change the question being asked.

Ian Menzies on February 3, 2009 10:09 AM

While I agree with the thought behind the article, that knowing who the author is is a good thing... I have to ask.. does it matter when the chances that you actually recognize the author are slim to none?
If Wikipedia told you that CoolKid21 contributed the majority of the content instead of last editor, Hottie84, would it really matter? Or even if they have real names... Doesn't matter, I would never recognize any of them, so it would all just be the same to me.

Bjarni Arnason on February 3, 2009 10:17 AM

Did Alan Kay ever get a satisfactory answer to his question? I like how he refuted most of the responses usually stating that a given invention was already invented at Xerox PARC in the 70s.

Ray Vega on February 3, 2009 10:18 AM

Why is it important to know who wrote/edited what?

If the answer/question/comment is good, I'm going to vote it up no matter who wrote it or who is the current owner of the message. If you react differently depending on who wrote something, maybe you'd be better off not knowing so you can decide by yourself if the information is good or not.

The only purpose of having access to the author is to go read more about him/her in its profile in case he/she said something meaningful.

Mike B. on February 3, 2009 10:19 AM

@mgb: Consolas is easily the best ClearType programming font I have ever used.
It was developed by Microsoft specifically for programming and it is very clear and easy to read.

Jeff did an article a while ago about it:
http://www.codinghorror.com/blog/archives/000969.html
http://www.codinghorror.com/blog/archives/000356.html

Graham Stewart on February 3, 2009 10:20 AM

Why is it important to know who wrote/edited what?
It might tell you if the answer is likely to be correct.
It is difficult on a forum to establish level of knowledge, the SO rep does this to some extent other forums have badges for long standing members.

mgb on February 3, 2009 10:22 AM

@Jim Anderson: of course, the guy signed it, so it's obviously him.

yours,
Barack Obama

TM on February 3, 2009 10:23 AM

Great work done here Jeff to work out a wiki ownership, but doesn't that completely undermine the purpose?

No doubt it will influence readership and people's reactions BUT when we say something is a 'community wiki', it means it's been written by the community and the highest contributor (even if he is the only contributor) does it altruistically.

So it comes down to altruism or egoism... being a common face in the community or illuminated by limelight.

**********************
--A food for thought--
**********************
Is it still relevant that I contributed 90% of the lines if someone just changed the entire point I stated by changing one of the 10 lines, and it is still endorsed by my name?

Mohit Nanda on February 3, 2009 10:28 AM

A very slightly related problem: as part of a system for automatically assigning crash bugs to engineers for investigation, I want to establish an 'owner' for each source file in a large code base. My solution: for each change to the file in the version control database, score N points where N is the revision number of the change. Thus more recent changes are weighted higher, but if person A creates the file (revision 1 for 1 point), person B makes two changes (rev 2 and 3 for 5 points), and person C makes the latest change (rev 4 for 4 points), B is the file owner. The script that computes the owner can also tell you the top-N owners; in this case it would say B 50%, C 40%, A 10%.

We applied this to a codebase inherited from an outside source, so weighting the initial checkins low makes sense (the day-1 import of 10000 files to the source control system wasn't a creative act), but newly created files might ought to get a bonus for first checkin.

Hamilton-Lovecraft on February 3, 2009 10:44 AM
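The revision-weighted scoring Hamilton-Lovecraft describes is simple enough to sketch. This is an illustration only; the function name and the input shape (a list of authors, one per revision, oldest first) are assumptions, and a real script would pull that list from the version control log.

```python
from collections import Counter


def file_owners(change_authors):
    """Rank authors of a file, weighting each change by its revision number.

    change_authors[i] is the author of revision i + 1 (assumed input shape).
    Returns (author, share) pairs, largest share first.
    """
    points = Counter()
    for revision, author in enumerate(change_authors, start=1):
        points[author] += revision  # later revisions score more points
    total = sum(points.values())
    return [(author, score / total) for author, score in points.most_common()]
```

With the example from the comment, file_owners(["A", "B", "B", "C"]) gives B 50%, C 40%, A 10%.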

How do you get those graphs from wikipedia? (I know it's in discover, but if this is a wikipedia tool it would be fascinating)

Practicality on February 3, 2009 10:54 AM

Nice new feature, too bad it's broken.

If you read the original post by Alan Kay and the current revision you'll notice that the text is identical. The only revisions made were a couple of re-taggings and making one word into a link. I'd still say that 100% of the text there is written by Alan Kay.

Shy on February 3, 2009 11:00 AM

@Hamilton-Lovecraft Neat idea. If your language supports exceptions, then you can probably find out who last modified the line/function/file of the function calls to do some more accurate scoring based on the code that actually generated the error.

This post makes me wonder what diffing algorithms are used by the various source control systems out there.

Tim on February 3, 2009 11:09 AM

I think the fact that most of the contributions to Wikipedia are from unregistered users may reflect badly on Stack Overflow. Since you need to pass a pretty high bar in order to edit posts (on the order of registering and being active for several weeks) you miss out on most of the brain power.

Motti on February 3, 2009 11:26 AM

The truth in the end is still that the content matters more than the author. Though of course we are more interested in the texts of famous authors and we get motivated by our heroes. But if the same content was provided by someone else, the content would be of course the same.

Now Alan Kay asked, how could we find the powerful new ideas? Well, new ideas come many times from new people and new people are not those that we already know like Alan Kay. So we need to try to give more attention to also new authors: If the content is brilliant, it doesn't matter who wrote it. We can track the author for giving credit and all, but the content should be managed.

In Stackoverflow there are some ways to manage content, but if powerful new ideas come, then those should be considered.

Silvercode on February 3, 2009 11:26 AM

Jeff, this is why, even though I disagree with some of your conclusions, I still read your blog.

AWESOME. Thanks for making such a cool tool (stackoverflow).

fREW Schmidt on February 3, 2009 11:44 AM

Authorship is only an indication of quality (and interest) to me...

If the post has an orange background, I know it's Jeff responding to something he felt was important, so I read it.

If the blog replies more than about 50, then I don't even bother reading them all (always have work to do).

I like the ability to rank a post up/down, and the natural filtering out of non-contributors which would introduce noise...
So in that sense, 'rank' is a better indicator than 'author'.

Eric on February 3, 2009 11:56 AM

When you visit Wikipedia's entry on asphalt, you get some reasonably reliable information about asphalt.

Wrong. If you visit Wikipedia, you get no reliable information. Cheap, fit for masses, if many think it is right you get urban legends instead of information.

Artor on February 3, 2009 12:37 PM

A very sophisticated effort to determine the authorship of each bit of text on a wiki page is WikiTrust, see http://wikitrust.soe.ucsc.edu/index.php/Main_Page - it goes one step further and calculates a trust value based on the author's reputation combined with the time a bit of text remains unchanged.

This is a very interesting project, and it has some nice features for detecting reverts, tracking paragraphs that get moved around the page, etc.

I hope to see this live on wikipedia in the not too far away future. I'll do what I can to make it happen.

brightbyte on February 4, 2009 1:28 AM

When you look at the imbalanced Wikipedia reporting regarding the plight of the Palestinians, you can draw your own conclusions...

Pardeep on February 4, 2009 1:38 AM

@Julian Radowsky: yeah that is a pretty big failing of ClearType. It is a system-wide setting, rather than monitor specific. To be honest though, I would rather replace one of the monitors than turn off ClearType entirely - without it I feel like I'm having a 90s flashback.

Also your point about the font-size being set to 90%: that is very odd, Tahoma (on XP) is a TrueType font, so it should scale perfectly to any size. It sounds like you have a bitmap font for some reason. Either that or Opera is being weird.

@Jeff: this sudden interest in authorship wouldn't have anything to do with Joel's recent trouble would it?
http://www.joelonsoftware.com/items/2009/01/29.html

Graham Stewart on February 4, 2009 2:24 AM

Your algorithm is still a little off. I wrote an answer then I edited it several times - no one else has. Stackoverflow rated the answer as 98% MarkJ, not 100%.

http://stackoverflow.com/questions/507291/should-we-select-vb-net-or-c-when-upgrading-our-legacy-apps/508823#508823

http://stackoverflow.com/revisions/508823/list

MarkJ on February 4, 2009 3:04 AM

Jeff,

Is the Levenshtein Distance formulated in terms of dynamic programming? If not, you would likely get a performance benefit from choosing a text distance based on a dynamic programming algorithm similar to those used for DNA sequence alignment, e.g. Smith-Waterman.

http://en.wikipedia.org/wiki/Sequence_alignment#Dynamic_programming

Such might be good enough for a character-wise or word-wise authorship measurement.

Aubrey Barnard on February 4, 2009 3:55 AM

Along the lines of revisions and diffs at StackOverflow, what are you using to process and display the diffs?

Todd on February 4, 2009 4:43 AM

This is an excellent idea; it's a shame that the accuracy is a little off, but that's something that I'm sure will be fixed given some time...


I'd imagine the expense of doing per word calculations is only there because of the sheer quantity of data to work on. If this was done early on (it might not be practical at all now) then maintaining it would be easier if the results were all stored... you would just have to update them on each edit/new post.

jheriko on February 4, 2009 5:35 AM

I see the usual Wikipedia is all rubbish comments are coming out, my experience must be really lucky because besides the occasional obvious vandalism (gone in an instant) most of the articles are mostly correct, or at least as correct as any printed encyclopedia I have ever read, they do contain common mistakes, but so do other sources ...

Did anyone come to a conclusion on who edits articles, I suspect it is a core who copyedit/spellcheck/correct, a larger number who contribute to a small number of articles, and a lot of anons who do small edits (both good and bad)

Jaster on February 4, 2009 7:01 AM

I think that it's interesting that you selected the Alan Kay question - I remember when it was initially on the site and it was closed (or very nearly closed - I can't remember). The only reason that it was allowed to remain on Stack Overflow was because it was a question by Alan Kay. If asked by nearly anyone else, it would have been closed as not a real question (or something). There were comments along the lines of, Hey, don't close this - let's not embarrass ourselves in front of Alan Kay.

I'm not sure where I'm going with this - I'm not saying that it was a bad question or even that it should have been closed (for myself, I tend to favor not closing questions unless they really, really have no value - I guess I tend toward being an inclusionist).

But I think it's an interesting observation.

mikeb on February 4, 2009 9:36 AM

Sounds like elitism, if it requires a Turing Award winner to be able to post something important and otherwise the post gets closed. I mean, the post would have been important even if it was posted by someone else. There really should be a category for more philosophic questions too. And this question wasn't even abstract but a concrete kind of request for proposals about how we could find the powerful new ideas.

Silvercode on February 4, 2009 11:36 AM

Interesting post, but your data is all wrong in terms of Wikipedia. You failed to notice that the stats about Wikipedia authorship you quote in the beginning are from 2006.

Last summer I heard a talk from PARC's (as in Xerox PARC) Augmented Social Cognition group about who edits Wikipedia, based on analysis of the most recent database dumps.

Wikipedia reached a huge peak, more than doubling in number of active contributors, by May 2007. It's between one and two thousand very active contributors (which is less than 1% of all registered editors) who contribute 50% of all the content. The other 50% of edits are made by all less active contributors and anonymous users combined. On average, very active community members added a significantly larger amount of content to the site, and on average anons took away more (e.g. in copyediting and other minor pruning).

As for your system of attaching names/faces to edits and learning who the most active contributors are, they did a project called Wikidashboard (wikidashboard.parc.com) that shows you just that. Overall, I don't know if very many people use it. For public-facing projects (instead of internal collaboration wikis), people just care about the information, by and large.

Your point about wiki and authorship being opposing goals is also more than slightly off in my experience. There are big wikis (like wikiHow) that quite successfully attach lists of authors to articles and maintain a strong sense of being a wiki. It's not authorship and wiki that are fundamentally opposing. It's _ownership_ and wiki that are fundamentally opposing.

Steven Walling on February 4, 2009 12:03 PM

I'm not sure what's significant about this work... Everyone knows that nothing significant has occurred in computing since XEROX PARC in the 70s.

Seriously - when will we see this great new metric on SO?

Cade Roux on February 4, 2009 12:43 PM

@Graham Stewart
Cleartype is no good on my dual monitor system, the monitors are not the same and the RGB/GRB sequence is not the same on the two monitors (even though they are the same make). If I tune cleartype to look good on one of the monitors, then it's blurry on the other.


@Jeff
I have deleted the C fonts (Consolas and Calibri), and there is no difference (using Opera on XP). It seems that your style sheet forcing the fonts to 90% causes spacing problems with Tahoma: the characters bunch up and overlap in Opera (if I zoom to 110% then the spacing is corrected). May I suggest that you remove the force to 90%?

Julian Radowsky on February 4, 2009 12:49 PM

This is great. I know I've refrained from making good edits to wiki posts on StackOverflow in the past because I didn't want to steal perceived ownership (emphasis on perceived). Now that's not really an issue.

But this post probably belonged on blog.stackoverflow.com.

Joel Coehoorn on February 5, 2009 2:31 AM

Ah, sweet whiff of youth! History Flow is pretty good, well chosen examples. They reminded me of that other, once-überpromising vaporware tech, Ted Nelson's Xanadu, or Xanalogical Storage system, that'd have provided always up-to-date credit to any part or span of interlinked docuverse (down to granular character level!). Who wrote what; who quoted whom; where and when did a transclusion originate, and so on. Looked like it'd be happening for a while, well before the WWW [info.cern.ch] made its first appearance and until its once-White Knight Autodesk Inc. pulled the plug on it around 1992 or so. Lingers in Internet limbo since then, aka Bithell. But two years before that, after meeting the inventor eye-to-eye, I wrote this account, complete with a leading declaration of my own word-authorship of it (=71%), and attribution for the rest, 29%, to my oft-quoted subject, the promoter of the concept Ted H. Nelson. So, for the historical declarative-authorship record, I give you this:

http://www.tidbits.com/tb-issues/TidBITS-030.html

Xanadu by Ian Feldman (71%)

First Xanadu stand opens Jan. 1993, El Camino Rd, Palo Alto CA. Be there. [...]

[ I know now there should've been El Camino Real up there, but no American editor ever corrected it. ]

Ianf on February 5, 2009 2:33 AM

That history flow is cool. Where can one find a tool to help generate such a thing in the open source arena? That would be really cool to try to apply to a code base, on high change rate files and such.

Daniel on February 5, 2009 5:33 AM

Hey Jeff,

Excellent idea for smoothing the transition to wiki-mode, and kudos for your care and attention over the imprecise art of fuzzy attribution!

Also, SNAP! on the reference to History Flow -- by coincidence I also referenced that yesterday in a book review about The Visual Display of Quantitative Information (a more interesting read than it sounds, honest):
http://www.danielfortunov.com/$daniel_fortunovs_blog/2009/02/04/the_visual_display_of_quantitative_information

Daniel Fortunov on February 5, 2009 12:06 PM

Wiki stats are enormously misleading. I could write a bot which formats dates into the WikiApproved(tm) fashion, and become a top contributor, even though I really didn't contribute anything of note.

I rewrote a major section of the article on Tae Kwon Do once, and that was... one edit.

I think a better statistic would be to color each word to see where it came from, and look at the authorship of THAT. Should be quite fascinating.

Bill on February 5, 2009 12:56 PM

“I expected to find something like an 80-20 rule: 80% of the work being done by 20% of the users, just because that seems to come up a lot. But it’s actually much, much tighter than that: it turns out over 50% of all the edits are done by just .7% of the users … 524 people.”

Under the 80-20 rule, 51.2% of the edits would be done by .8% of the users. More than 50% of the edits getting done by .7% of the users isn't too far off.
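A quick sketch of the arithmetic behind that figure, assuming the 80-20 rule is applied recursively, so that within each top 20% the top 20% again does 80% of that group's work. The function name `pareto_levels` is made up for illustration.

```python
def pareto_levels(depth, top_share=0.20, work_share=0.80):
    """Apply the 80-20 split to the top group `depth` times.

    Returns a list of (fraction_of_users, fraction_of_work) pairs,
    one per level of nesting.
    """
    return [(top_share ** k, work_share ** k) for k in range(1, depth + 1)]


if __name__ == "__main__":
    for users, work in pareto_levels(3):
        # Format as percentages, e.g. "20.0% of users -> 80.0% of edits".
        print(f"{users:.1%} of users -> {work:.1%} of edits")
```

Three levels of nesting give 0.2 × 0.2 × 0.2 = 0.8% of the users and 0.8 × 0.8 × 0.8 = 51.2% of the edits, which is where the comparison with Wales's ".7% of the users" doing "over 50%" comes from.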

Rudiger on February 8, 2009 7:17 AM

Dear lord, one prolific figure visits a web site and suddenly he's got Jeff Atwood kissing his ass.

Are we to expect every other slightly well-known programmer to have his or her bottom fondled by this drooling fanboy?

Rob on February 8, 2009 8:04 AM

Epistemic injustice.

vmb on February 10, 2009 2:33 AM

You should check out our project called WikiDashboard (http://wikidashboard.parc.com) that attributes the work of editors in Wikipedia to the articles they're heavily involved in.

Ed Chi on March 26, 2009 1:27 PM

Back up a second: a question is a different beast from a snippet of encyclopedic knowledge. I'd expect an analysis that gets closer to "know your customer" with respect to questions from a blog subtitled "human factors".

keith on May 7, 2009 2:32 AM

If you knew this question was from Turing Award winning computer
scientist Alan Kay, would it change the way you reacted to it? Of
course it would!

But you'd never know that,...

Why wouldn't you know that? His name is included. Which prompts the question: should posters include their name/signature on wiki posts?

Jim Anderson on February 6, 2010 11:13 PM

Cleartype tuned properly is bearable, but it's still mangling the characters.

Very often we read blogs because of the author, for their particular style and way of thinking. I know when I read a news article or opinion piece I like to know who's written it, especially if I'm not familiar with their work.

Wikipedia doesn't attribute authorship directly to the user (and I don't think I've ever looked at a revisions page on wikipedia.com) probably because that's the way nearly every other encyclopaedia does it, print or online. I admit that knowing who the author is in an about.com article isn't always important to me, but I'd like to see authors given more credit in Wikipedia. Anonymous might well stay anonymous for whatever reason, but seeing a name and a profile can give added authority because the author is attaching his (or her) reputation to the content.

Plus, the more I read stuff from wikipedia, the more I ask myself 'who wrote this crap?' Not because I disagree with the content, but because the style can be disjointed and sometimes just plain unreadable. Then again, if all the real contributors are outsiders, it would be no benefit to look out for certain authors or mentally block others because those authors may only have contributed a couple of articles. Then again, if I knew the author (or at least their reputation) from outside wikipedia it would add substantially to the utility of the article (or futility, depending on the author).

John Ferguson on February 6, 2010 11:13 PM

The comments to this entry are closed.