In Finally, a Definition of Programming I Can Actually Understand I marvelled at particularly strange and wonderful comment left on this blog. Some commenters wondered if that comment was generated through Markov chains. I considered that, but I had a hard time imagining a text corpus input that could possibly produce output so profoundly weird.
So what are these Markov chains we're talking about?
One example of Markov chains in action is Garkov, where the long running Garfield cartoon strip meets Markov chains. I present below, for your mild amusement, two representative strips I found on the Garkov hall of fame:
Garfield's an easy target, though:
So let's proceed to the "kov" part of Garkov. The best description of Markov chains I've ever read is in chapter 15 of Programming Pearls:
A generator can make more interesting text by making each letter a random function of its predecessor. We could, therefore, read a sample text and count how many times every letter follows an A, how many times they follow a B, and so on for each letter of the alphabet. When we write the random text, we produce the next letter as a random function of the current letter. The Order-1 text was made by exactly this scheme:
t I amy, vin. id wht omanly heay atuss n macon aresethe hired boutwhe t, tl, ad torurest t plur I wit hengamind tarer-plarody thishand.We can extend this idea to longer sequences of letters. The order-2 text was made by generating each letter as a function of the two letters preceding it (a letter pair is often called a digram). The digram TH, for instance, is often followed in English by the vowels A, E, I, O, U and Y, less frequently by R and W, and rarely by other letters.
Ther I the heingoind of-pleat, blur it dwere wing waske hat trooss. Yout lar on wassing, an sit." "Yould," "I that vide was nots ther.The order-3 text is built by choosing the next letter as a function of the three previous letters (a trigram).
I has them the saw the secorrow. And wintails on my my ent, thinks, fore voyager lanated the been elsed helder was of him a very free bottlemarkable,By the time we get to the order-4 text, most words are English, and you might not be surprised to learn that it was generated from a Sherlock Holmes story ( "The Adventure of Abbey Grange'').
His heard." "Exactly he very glad trouble, and by Hopkins! That it on of the who difficentralia. He rushed likely?" "Blood night that.
So the text in Garkov strips is generated in exactly this way, but using words instead of letters. The input corpus is, as you'd expect, the text of many old Garfield strips.
What's amazing to me about Markov chains is how unbelievably simple they are. A Markov chain has no memory of previous states: the next state (word, in our case) is chosen based on a random dice roll and a lookup into a table of the states that tend to historically follow the current state in the input corpus. Given an adequate input corpus, they work almost uncannily well, a testament to the broad power of rudimentary statistical inference. Garfield's been around since 1978, and still going str.. well, going, so there's no shortage of material to work with.
Now let's try it ourselves. I fed the text of the last twelve Paul Graham essays to this online Markov generator, using two word groupings -- what Bentley refers to as "Order-2". Here's what I got back:
You can feel the need to take advantage of increased cheapness, however. You're not all playing a zero-sum game. There's not some fixed number of startups; we fund startups we fund to work on matters of passing importance. But I'm uncomfortably aware that this is part of any illusions about the problem of overeating by stopping eating. I couldn't simply avoid the Internet had become, because the company is the new trend of worrying obsessively about what it meant for someone, usually an outsider, who deliberately stirred up fights in a startup than just start it. You know how the A List is selected. And even that is more work.
But Markov chains aren't just useful for automatically generating Paul Graham essay parodies. They're also quite practical. You might even say Markov chains are a large part of what powers today's internet. Most remarkably, to me at least, Markov chains underly Google's trillion dollar PageRank formula:
The [PageRank] formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. [PageRank] can be understood as a Markov chain in which the states are pages, and the transitions are all equally probable and are the links between pages.As a result of Markov theory, it can be shown that the PageRank of a page is the probability of being at that page after lots of clicks. This happens to equal t-1 where t is the expectation of the number of clicks (or random jumps) required to get from the page back to itself.
Incidentally, if you haven't read the original 1998 PageRank paper, titled The PageRank Citation Ranking: Bringing Order to the Web (pdf), you really should. It's remarkable how, ten years on, so many of the predictions in this paper have come to pass. It's filled with interesting stuff; the list of the top 15 PageRank sites circa 1996 in Table 1 is an eye-opening reminder of how far we've come. Plus, there are references to pornographic sites, too!
Markovian models -- specifically, hidden Markov Models -- are also related to our old friend, Bayesian spam filtering. They're even better! The most notable example is the CRM114 Discriminator, as outlined in this excellent presentation (pdf).
If you play with the Markov text synthesizer, you'll quickly find that Markov methods are only as good as their input corpus. Input a bunch of the same words, or random gibberish, and that's what you'll get back.
But it's sure tough to imagine a more ideal input corpus for Markovian techniques than the unimaginable vastness of web and email, isn't it?
| [advertisement] Read the largest case study ever published about lightweight peer code review in Best Kept Secrets of Peer Code Review. Free book, free shipping. |
Posted by Jeff Atwood View blog reactions
« Exploring Wide Finder ASCII Pronunciation Rules for Programmers »
Is it a coincidence that I wrote a chatbot with MegaHal(another 4th order markov chain) for a game lobby this morning or are their spy cameras in my cereal!?
Tom J Nowell on June 11, 2008 04:52 AMThis Markov-ized essay from Paul Graham does not yield a single instance of the word "Lisp" ? Wow, I'm amazed.
Leonel on June 11, 2008 04:52 AMI'd like to see this applied to music and see what happens if you feed it all the works of Mozart...
ChrisL on June 11, 2008 04:52 AMI have seen a lot of generated Morkovian Text in comments section of a lot of blogs and websites.
Somebody out there is trying hard to beat the system.. :)
The problem is that we have to read two/three sentenses from the text to even understand that we are reading an artificial prose.
If the internet starts to get saturated with websites containing generated text (Which can be used to get search engine traffic), we are in deep troube. The signal/noise ratio of the web will decrease further.
What is this post about? I don't want to read it through to get the point in case it is not worth reading.
Silvercode on June 11, 2008 04:54 AM@ChrisL,
Try it out and tell us how the results are!!
@Silvercode,
This post is about little cats doing funny stuff.
It's interesting to feed your homepage to it as well:
"When we produce output so many times they work done by generating Paul Graham essays from industry experts. Free shipping."
That's my IRC bot which uses a Markovian style algorithm. It sits in hundreds of IRC channels learning everything it hears, then generates replies based on that. Some of the best quotes are randomised on the front page (refresh to see more). You can find more at:
http://quotes.kookybot.org/top
Everything there is generated from its knowledge - it is not simply repeating back sentences it has heard. The database is 4.6 GB, almost 70 million rows.
phatmonkey on June 11, 2008 05:04 AMHave you tried getting Mr Spolsky started on garfield?
He's quite the opposite of a fan.
Just slip something into the conversation like, "Soooo, that Jim Davis programmed in C a lot you know. Great sense of humor he has too."
secretGeek on June 11, 2008 05:07 AMhttp://quotes.kookybot.org/top
That is GREAT. I am dying over here.
Jeff Atwood on June 11, 2008 05:07 AMAlso, this is something similar for Usenet:
http://en.wikipedia.org/wiki/Mark_V_Shaney
Jeff Atwood on June 11, 2008 05:09 AM@Silvercode
What is this post about? I don't want to read it through to get the point in case it through to read it is not worth reading. @Silvercode
What is not worth reading. @Silvercode
What is this post about? I don't want to get the point in case it is this post about? I don't want to read it through to read it through to read it through to get the point in case it is this post about? I don't want to read it is this post about? I don't want to get the point
> If you play with the Markov text synthesizer, you'll quickly find that > Markov methods are only as good as their input corpus. Input a bunch
> of the same words, or random gibberish, and that's what you'll get back.
That's more a feature than a bug :-) By using a specific text as an input you can produce a text sounding like the input. E.g. texts in old English or try poems etc.
David on June 11, 2008 05:19 AMAs ChrisL suggested, it will be more intersting to try Markov model on music than some text.
Generated music, if it is a function of the previous 5-6 notes will be great, I think.
Text anyway, does not make any sense.
One of my pet projects that I have yet to fully realise is building a markov-model chat bot out of the Web 1T 5-gram corpus: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
One barrier to implementation is that it's way too big to fit in memory on a single (reasonable) machine.
Also related is frac.notdot.net, an amusing free-association game that uses a 1st-order markov model to 'play' free association with a human player, expanding its corpus as it goes. The corpus is quite sizeable by now.
Nick Johnson on June 11, 2008 05:25 AM"That's more a feature than a bug :-) By using a specific text as an input you can produce a text sounding like the input. E.g. texts in old English or try poems etc."
Yes, that is a fantastic feature. If you start talking to Kooky in other languages (German and Danish are the ones I know about, there are probably others), he will reply in that language. You never see these languages show up in English channels simply because there are no relationships with the foreign words (apart from a few exceptions - die, bin in german etc).
phatmonkey on June 11, 2008 05:28 AMphatmonkey, there were some requests to get kookybot on Twitter ( http://twitter.com/orph/statuses/832119604 ), though I'm not sure how that would work, as it's a radically different environment than IRC.
Those top kookybot quotes are genius. I cannot remember the last time I have laughed so hard. WARNING, however, do not click on any links the KookyBot provides. You'll be sorry. :(
Jeff Atwood on June 11, 2008 05:31 AMt -- 1? Is that t + 1?
Kent on June 11, 2008 05:31 AMHow about Markov/Dilbert project - oh wait, it already works that way...
Goran on June 11, 2008 05:33 AMJeff, that should be simple enough, there's a Twitter Python API. I'll try it now.
phatmonkey on June 11, 2008 05:39 AMIs the same technique used for the order of the words?
i.e. Once you have used Markov Chains to randomly generate a big pool of words could you then also use a word-level Markov Chain to say "this word is typically followed by one of these words"?
Graham Stewart on June 11, 2008 05:44 AM@Graham Stuart,
I don't know. phatmonkey will be able to answer that.
But if it is possible, then we could improve the system by tolerating similar words.
Sorry Graham... I screwed up with the spelling...
Niyaz PK on June 11, 2008 05:52 AMIf only I could use Markov chains to automatically write code...
Weeble on June 11, 2008 05:58 AM"underlie", not "underly".
Never mind. Having played with the online Markov Generator I see they are doing it on a word-level there.
Like ChrisL I am now wondering what happens if you apply this to music.
Or maybe even analyse adjacent pixels in graphics/photos and see what kind of weird amorphous blob it produces.
Graham Stewart on June 11, 2008 06:02 AMInteresting...
I'm thinking tweetkov, make a twitter account, follow as many people as possible, use their random gibberish as input and see if tweetkov's tweets make any more sense than the average twitter user. ;)
...I just might do that.
Eric DeLabar on June 11, 2008 06:13 AMAny good sources for Markov chain generators which take binary input? It could be fun to see what kind of executables it would produce...
Victor on June 11, 2008 06:14 AMCould this be used to test a platform's stability? I guess Markov chains would generate something executable much more often than a simple random binary generator, if based on existing binaries.
Victor on June 11, 2008 06:19 AMinteresting. i never knew what a markov chain was before... even though i have implemented more than one... after all getting a distribution from data and then using it to produce a model by sampling the distribution isn't really a big leap of understanding. i'd always just labelled this as "a monte-carlo method" and never accorded it special distinction...
interesting that it is so effective for langauge too... although still, the sentences are gibberish. however... i wonder if a more comprehensive model of writing could not be made by analysing more than just the constituent letters of words or their order in text.
something for me to mess with coding whilst bored at work anyway... :)
"Or maybe even analyse adjacent pixels in graphics/photos and see what kind of weird amorphous blob it produces."
I'm wondering if the new game Spore uses something like Bayesian or Markovian chains for its procedurally generated artwork.
I would actually really like to see a procedurally generated computer game storyline based around Markov chains. I very much doubt anyone would notice...
Tom Clarke on June 11, 2008 06:23 AMIn response to your statement that "Lasagna Cat" is the one link we must click on I only have this to say (with compliments to Adam Sandler)...
"... what you've just said is one of the most insanely idiotic things I have ever heard. At no point in your rambling, incoherent response were you even close to anything that could be considered a rational thought. Everyone in this room is now dumber for having listened to it. I award you no points, and may God have mercy on your soul."
Jeff,
Some day you should try this site in IE8.
It looks weird.
I'm sorry, it can't be a paul graham essay parody without an excessive number of footnotes.
Nick on June 11, 2008 06:30 AMthere is a lot of talk about randomly generated code here...
genetic algorithms are good for this, and infact can independently solve or optimise many problems (nothing complicated sadly). you will probably want to write a language and compiler specially though...
the basic premise is to copy how evolution works, i.e. introduce random changes to the code and test it against some fitness function. keeping the most fit programs and culling the remainder, then repeating the randomisation process. it looks like markov chains could be applied to optimise this... (i.e. use that distribution instead of 'true' randomness) if someone hasn't already done this. :)
might be a way to avoid creating a language just for this purpose... since it could recreate good syntax for an existing language, at least some of the time.
Jheriko on June 11, 2008 06:30 AMWell, this may finaly explain the comic Zippy. It is a Markov chain.
DAdams on June 11, 2008 06:33 AMThe first time I came across Markov Chains was a blog post about creating fake Italian surnames.
http://doubtingtommaso.blogspot.com/2008/03/markov-chains.html
Pretty cool tech! Quite powerful given how broadly it can be applied.
Dennis Hotson on June 11, 2008 06:42 AMThese are those fun comments I see on some of my sites (ussually followed by a link to some spam site), I never knew quite how they were generated. If only we could somehow apply them to mad-libs...
Kris on June 11, 2008 06:44 AMWell hi! Thanks for, er, making an example of me, Jeff. Glad you like it.
- "Is it a coincidence that I wrote a chatbot with MegaHal(another 4th order markov chain) for a game lobby this morning or are their spy cameras in my cereal!?"
I can't speak for building security, Tom, but MegaHal is unquestionably what first turned me onto Markov theory back in the day.
- "I'd like to see this applied to music and see what happens if you feed it all the works of Mozart..."
It's been done. Heck, I did a very small version of it myself for a project in college -- analyzing only melodies, and looking at melodic interval and time intervals as two separate tables rather than at melodic+time as a single block. What I produced were melodies that sounded like Mozart teeteringly drunk and standing on one foot before the keyboard; but that's audibly different from truly randomly generated notes, so the project was considered a success.
One of the challenges with using Markov theory to synthesize creative output is that you have to decide how far-seeing your criteria is, and that leads to a tradeoff between novelty and coherence.
A 1-order markov model says, here's word A, which set of words {B} can follow it? Then you pick a single word B from {B}, and see what {C} contains, and so on. It turns out that 1-order chains lead to miserable nonsense. Garkov (and MegaHal, and I think probably most Markov language toys) is a 2-order model: given words A and B in sequence, look at possible followups {C}. The three-words-at-a-time process gets us a lot closer to coherence, but it also means that the sentences are less novel -- there are fewer places where [A, B] has multiple options in {C}, and there are fewer options in {C} when it *is* plural.
A high order markov model produces very coherent output but rarely produces novel output. But there's also a balance of the order of your model against the size of your corpus -- the collection of data you're feeding into it -- and as corpus size goes up, coherence creeps down. So choosing the order of your model depends on how much data you're working with as well.
Which is all to say, as it works for words, it also can work for music, but what the results are like depends on (among other things) how you parse the music on input, how much music you're processing, and the order of model you're using to analyze and synthesize new stuff.
- "That's my IRC bot which uses a Markovian style algorithm."
Markov IRC bots are one of the best ideas since sliced bread, yeah.
- "That's more a feature than a bug :-) By using a specific text as an input you can produce a text sounding like the input. E.g. texts in old English or try poems etc."
Exactly, David. Try it with arithmetic for some really bizarre stuff; try it with source code as well. Markov is naive, but very willing.
- "How about Markov/Dilbert project - oh wait, it already works that way..."
Heh. In principle, the Garkov code is actually Garfield agnostic -- it's just a bidirectional markov structure and some generic display code, paired up with a garfield display font, some garfield background strips, and a bunch of garfield transcripts. I'd very much like to do some other comics with it in the future -- the big trick is those transcripts, so if anybody wants to spend a saturday plowing through Dilbert (or, better, Mary Worth), let me know.
- "Or maybe even analyse adjacent pixels in graphics/photos and see what kind of weird amorphous blob it produces."
Ha! That could be fun. Difficult, but fun. Putting together a big corpus -- and preprocessing it somehow to get the image complexity to a nice middle area in terms of total number of colors -- would probably be the biggest challenge.
- "Is the same technique used for the order of the words?"
I know you a-righted yourself already, but Markov models done at the letter rather than the word level have been done before as well, and are rather fun. If you want to generate some new plausible non-words in the language of your choice, a 2-order model does a great job. There is a nice brief writeup that touches on that line of thinking (with a bilingual twist) : http://www.teamten.com/lawrence/projects/markov/ . I know I have seen others, but I can't google to save my life this morning.
- "Well, this may finaly explain the comic Zippy. It is a Markov chain."
Bill Griffith is a genius.
I'm going on at irresponsible length. This is fun stuff, and again, I'm glad you like the Garkov.
-j
Josh Millard on June 11, 2008 06:45 AMJosef became a novel expands on the south west side of energy). She names him as a children's advocacy group that he moved to the overbridge at the National Urban League; Alan Wolfe of the Kapelle, Reicha was served mainly by Sheffield Midland - Leeds stopping services. The line is markedly virtuosic, reflecting his own time, Miss Fellowes fights the Brookings Institution; Norman J. Ornstein, resident scholar Ludwig Schiedermair in Wallerstein, before his letters to return Timmie to liberate Timmie.
Josef Reicha wrote music publishing firm, who played in 'Der junge Beethoven' (Leipzig, 1925) gave specific examples taken from
Only because you brought it up (in a way), Jeff: Are you still using POPFile?
Manni on June 11, 2008 06:49 AMChapter 3 "Design and Implementation" of Kernighan and Pike's _The Practice of Programming_, http://books.google.com/books?id=to6M9_dbjosC, features the Markov Chain algorithm. The chapter also compares the algorithm's implementation in C, C++, Perl, Java, and Awk. Read it; it is elegant.
Karl on June 11, 2008 07:02 AMI found a svada generator some time ago, which probably works in the same way. It generates grammatically correct sentences without any meaning.
http://www.geekhouse.no/~anders/Svada/
Here is a sample output:
http://www.geekhouse.no/~anders/Svada/wrapper.cgi?lines=42&grammar=oopl
Machine intelligence music composers are getting better and better. "Vladimir" is one of the latest efforts, and it really does quite well with classical music; less so with more modern music--although I don't know that you could tell the difference between a 12 tone row based composition by a human or a robot. http://www.aaai.org/AITopics/pmwiki/pmwiki.php/AITopics/Music
khafra on June 11, 2008 07:09 AM@phatmonkey - you sir, are a freaking CompSci legend.
Andrew on June 11, 2008 07:37 AMJeff,
you can hear similar nonsense without Markov chains if you interview people for programming jobs on a daily basis.
"Never underestimate the power of human stupidity." (Robert Heinlein)
BugFree on June 11, 2008 07:39 AM@Victor,
"Any good sources for Markov chain generators which take binary input? It could be fun to see what kind of executables it would produce..."
That, my friend, would be wicked cool! Although I would run such a program inside a vault just as a precautionary act, in case our Markovian baby decides to write anything on the hard disk! :-)
Elvis Montero on June 11, 2008 07:46 AM@Victor:
It might be easier to randomly generate Assembly Language
Graham Stewart on June 11, 2008 08:01 AMMarius, owing specifically *to* it's strong gramaticallity, the Svada thing looks to be a different beast from a Markov model (at least at the word-by-word level).
Markov's greatest strength is it's simplicity -- it is a tiny whisp of a principle from which all kinds of exciting and surprising things can emerge, including e.g. novel, semi-coherent text. But you can never depend on anything more than semi-coherence on the whole.
The model's greatest weakness is it's lack of context. It knows nothing of sentence structure, or of the logical relationships between words in its tables beyond sheer proximity. Part of what makes Markov language output amusing in many cases is the tendency toward sudden surreal shift across an unexpected point of transition; that that transition is sometimes grammatical or even semantically plausible is just a chance of luck, not a feature of the model.
A pure Markov model is in that sense literally more context-free than even a Context-Free Grammar.
It's possible that the Svada code depends on a CFG for generating its sentences; or it may rely some more structurally complex model, or on templates. It's very neat work, regardless, but I don't believe at a glance that it is Markovian at all at the level of sentence-construction.
Josh Millard on June 11, 2008 08:08 AMAh, sweet Finnegan:
riverrun, past Eve and Adam's, from swerve of shore to bend
of bay, brings us by a commodius vicus of recirculation back to
Howth Castle and Environs.
Sir Tristram, violer d'amores, fr'over the short sea, had passen-
core rearrived from North Armorica on this side the scraggy
isthmus of Europe Minor to wielderfight his penisolate war: nor
had topsawyer's rocks by the stream Oconee exaggerated themselse
to Laurens County's gorgios while they went doublin their mumper
all the time: nor avoice from afire bellowsed mishe mishe to
tauftauf thuartpeatrick: not yet, though venissoon after, had a
kidscad buttended a bland old isaac: not yet, though all's fair in
vanessy, were sosie sesthers wroth with twone nathandjoe. Rot a
peck of pa's malt had Jhem or Shen brewed by arclight and rory
end to the regginbrow was to be seen ringsome on the aquaface.
The fall (bababadalgharaghtakamminarronnkonnbronntonner-
ronntuonnthunntrovarrhounawnskawntoohoohoordenenthur-
nuk!) of a once wallstrait oldparr is retaled early in bed and later
on life down through all christian minstrelsy. The great fall of the
offwall entailed at such short notice the pftjschute of Finnegan,
erse solid man, that the humptyhillhead of humself prumptly sends
an unquiring one well to the west in quest of his tumptytumtoes:
and their upturnpikepointandplace is at the knock out in the park
where oranges have been laid to rust upon the green since dev-
linsfirst loved livvy.
Very interesting, could also come in handy for unit testing scenarios
JohnM on June 11, 2008 08:21 AMRegarding Markov chains and music, didn't the old Sid Meier 3DO game "CPU Bach" use them? I loved that game/toy, wish Sid would put it out on XBLA or something.
James Gutierrez on June 11, 2008 08:29 AMMegaHal is quite impressive. I really enjoy what it produces.
Jeff Davis on June 11, 2008 08:36 AMUh, Jeffy? The rest of us heard about Markovs in like 1980-something. Really those "buzzword bingo" cards work on the same principle. Are you telling us this is your first introduction to them?
For more fun, try the Emacs dissociated text plugin (M-x dissociated-press). Run it with a buffer full of text, get delightful gibberish that almost makes sense back. Try it on a buffer with code pasted from two different languages, watch it invent a new programming language before your eyes!
This is really interesting. I work at a internet harvester/analyzer and we use Bayesian spam filtering. This makes me want to look more into Markovian.
Carter on June 11, 2008 08:52 AMA Markov generated CodingHorror parody ;) :
This year's strongest contender comes to multicore performance, choice of language is no silver bullet.
How else can we explain the massive disparity between the fastest and slowest versions of Garfield strips.
If you look at the same time, or in the same words, or random gibberish, and that's what you'll get back.
But it's sure tough to imagine a more ideal input corpus for Markovian techniques than the original article.
Of course, as with all other useful things, there is a program to count the most common HTTP GET URL entries in a dozen places rather than making all the key is the many-way communication of the code size, the code in this post, make it this one.
Everything old is new again: http://faculty.dwc.edu/wellman/Soluble-Fish.html#Sol
Terrier on June 11, 2008 09:14 AMPretty interesting. Like a couple others, I see spam comments on my site from time to time that sure look like they came out of something like this. Pretty tricky, and probably damned near impossible to filter out.
D. Lambert on June 11, 2008 09:31 AMI'm actually curious whether Markov-generated text would make for an effective attack on a Bayesian filter. I don't know much about the latter, but my general impression is that the strength of a Bayes spam filter is it's ability to evaluate the raw probability of a given set of words occuring in known-spam email vs. all email, independent of the *order* of those words in a given email.
In other words, a Bayes filter as I understand it wouldn't care about the sentence structure or order of clauses -- it's not set off by specific fixed phrases so much as by the occurance or not, period, of unlikely-to-be-legit individual words in a candidate email.
The reason I'm doubtful about Markov as a means to defeat Bayesian filtering in specific is that a Markov regurgitation of source text doesn't change the probability of any given word in that corpus appearing in the output. So if you feed "clean" text into Markov, you're getting clean text out; feed "dirty" text (containing spam-flavored tokens) and you'll get dirty text out. If Bayes just counts it all up as unsequenced tokens, you haven't gained anything.
That's not to say Markov spam can't fool some filters. A filter that depends on matching a known-bad fixed string of words to reject spam may well be fooled by a Markov model that churns out unique syntheses of Jane Austen instead of just quoting the same verbatim passages again and again. So it makes sense as a gambit (spambit?), but as I understand it Bayes is pretty much the strongest general anti-spam technique going right now and Markov shouldn't be a panacea in that sense.
Josh Millard on June 11, 2008 09:54 AMArbuckle is more than just strips that have been redrawn. They're strips that are redrawn from John Arbuckle's point of view!
They're a reminder that Garfield CANNOT TALK. When you remember that Arbuckle cannot know what Garfield is thinking, the strip gains a new level of ridiculousness.
You don't need to redraw them for this, of course. Most Garfield strips are improved greatly by simply removing Garfield's thought-balloons.
I just wanted to share some randomly written text I found on one of my websites --
like the malfunctioning, spastic, wet-nightmare of an artificial omni-intelligent acid fried cyborg, battle of the bits is this mother larva queen spewing out spawned offspring in the form of bit crunchy pockets of noise paintings. elegant pulp data paradigms and musical notions assigned to snippets of life going in and out of relevance. algorithmic musical vessels of meaning for that which we assign... and then sometimes robocop assigns it for us! because we all know in our hearts that if robocop made electronic music, he'd kick all our asses so badly. THINK ABOUT IT. he's pretty much an electronic man. if we let robocop learn how to use a drum machine, he's going to destroy what we love the most. help me fight against this rising threat. show your support by calling 1-800-328-4475. if we work together we can preserve our silly human notions of what we think music should be. love, blood mr.
My friends and I will often sit around and ramble nonsense at each other. Markov can replace beatnik ideals but it's not as much fun. I'm not going to say that engineered creativity doesn't have it's place. It's just sad, for me, how often people would rather believe that such a presence was not the work of a fellow human. ;D/
@Graham
Or Python/Java/etc. bytecode...
Kevin on June 11, 2008 10:29 AMMarkov chains we're talking about? One example of Garkov. The best description of material to Turn a Bayesian spam filtering. They're even better! The most notable example of the states that page back to work on the links between pages. As a hard time we produce output so many times every letter follows an A, how unbelievably simple they work almost uncannily well, a Definition of the text corpus for Markovian models -- are the original 1998 PageRank paper, titled The PageRank sites circa 1996 in this paper have come to a very free bottlemarkable, By
Paul Thrasher on June 11, 2008 10:30 AMI wonder if you could use Markovian chains to write a decent chess AI?!?! I can imagine it now, Grand Master kooky :-)
demallien on June 11, 2008 10:34 AMI once wrote a Markov chain program that could interpolate between two graphs:
http://www.teamten.com/lawrence/projects/markov/
Ha! Lawrence, I had forgotten midway through my first comment above that the comment box stripped html and actually linked to your project (as "over here"). Glad you showed up.
- "I wonder if you could use Markovian chains to write a decent chess AI?!?!"
Well, here you have a problem of putting together a corpus, and one of defining a game state. Finding a corpus shouldn't be too hard -- there are many chess sites, and loads of historical games out there as well, so let's call that a solved problem for the sake of argument.
(For the sake of counterargument, there *is* a corpus-size problem: do you want to feed it only excellent play, or are you willing to feed it a great deal more mediocre play? Markov does better with bigger corpuses, but is bigger and less-skilled worth the tradeoff?)
But, yes, so: you have your corpus of, say, many games of chess each with some dozen number of moves on either side. Call it a million total moves that you use to train the model.
Since a Markov table is essentially a series of state->move pairs, we need to define what a state is and what a move is in order to build the table form our corpus of moves.
The obvious answers, I think, are that a state is a given configuration of the board -- the location of (up to) 32 pieces, some black and some white, on the 8x8 grid of the chessboard; and a move is a name of a piece and a pair of coordinates, say black pawn from king 2 to king 3 (forgive my lay nomenclature, I'm not up on my chess lingo).
Which, okay, great. But here's a problem; among our million recorded moves, how much duplication of the board layout do we end up with? If (past the first seven or eight moves) we see very few duplications of a given state, there aren't any options available to the AI at most junctures after the opening. In which case you have an AI that can play a few moves and then has to give up and fall back to a different heuristic.
In other words, a more complicated state means less novelty, and the explicit organization of a chessboard is a very complex state indeed in this sense. (Doing this with Go would be worse yet.) To get novel behavior out of your Markov chess AI, you need to find some way to reduce the complexity of the state to guarantee interesting choices even well into a game. Probably not *good* choices, but interesting, surprising, novel.
One approach: treat state as, not a record of the explicit configuration of the board, but as, say, the type of piece that was last moved by an opponent. Then analyze your corpus of moves like this, perhaps, stocking a 2-order table that picks the next move based on the previous two:
State: (I moved King, he moved Queen)
Moves:
(I move Pawn: probability = 0.27)
(I move Bishop: p = 0.13)
(I move Rook: p = 0.10)
...
That may be *too* simple for an interesting AI; perhaps you would want to incorporate move-to-capture vs move-that-doesn't, or direction of movement, or other generalized aspects of a move. Finding a sweet spot that results in novel but not utterly random play would be the challenge and the fun, I think.
That's just one approach, and off the top of my head as well so there may be some truck-sized holes in the idea. My gut says that applying Markov to chess at a tactical level is probably a doomed notion, but I'm neither an AI researcher nor a chess wonk, and so I wouldn't be surprised to find that such models do have a role *somewhere* in existing AIs.
Josh Millard on June 11, 2008 11:00 AMI wrote a little command-line program to do this as part of a CS223 assignment. It's still good for a laugh. It's fun to run IM conversation logs through it :) Each line starts out with a prefix indicating who's talking, such as "Rick: blah blah blah\nAndy: blah blah blah". The generator always correctly predicts that a carriage return is followed by one of these prefixes, so the generated text still looks like an IM conversation.
You could take all of your blog posts and put them through a generator program like that. Crank up the 'order' until you generate a new blog post that makes sense and is also original ... then post it and see if anyone can tell :)
Rick Brewster on June 11, 2008 11:05 AMIs *that* what they have been doing?
For the last several years (about 2 years after baysean email filters were put into Thunderbird), I've been getting a lot of spam with what looks like randomly selected sections of various random works of literature. I collect some of my favrorites into what I like to call "spam Haikus" Is this what they have been doing?
It never works of course, as the words and phrases used in random literature never match very well with the vocabularly used by my real email partners.
T.E.D. on June 11, 2008 11:14 AMAwesome, wumpus! I'm working on a dissertation (distantly) involving MCs and this is just the perspective I need to lighten up a bit about the whole thing.
I haven't gotten many good ones lately, but here's a couple from a spam I got 2 months ago. The first one I reformatted to look more like a Haiku. The second came that way.
It is deserted,
writing, the maid had said at first,
but the inspector tempest.
and
the imperfect glimpse which the eye catches asserted his authority for once had done him good. Not brainstrust me!
said mrs jane. And an estate?.
- "...I've been getting a lot of spam with what looks like randomly selected sections of various random works of literature."
The dark side of Project Gutenberg (et al) -- piles of freely-available, ready-for-random-excerpting literature that is clean as a whistle as far as the filters are concerned. I was struck by it when I first noticed that happening to. I remember laughing out loud at the first one. It's a mixology thing, too: is 90% Dickens and 10% spam diluted enough to pass a filter? Etc.
Josh Millard on June 11, 2008 11:26 AMThis may be a little off the Markov path, but on the subject of altered comic strips I highly enjoy the altered Blondie comics "Removing His Speech Bubbles Turns Dagwood Into a Faulknerian Man-Child." Unfortunately this is only a facebook group at the moment: www.facebook.com/group.php?gid=2355753361&ref=nf
Chris on June 11, 2008 11:45 AMI ran in to a bug which caused it to continually reply, so I hit the 70 req/hour limit. Hopefully it should work again when the next hour rolls over! Kooky's in #kooky on irc.gamesurge.net until then...
phatmonkey on June 11, 2008 11:55 AMSo let's try it meant for automatically generating Paul Graham essays to include amusing bodily functions. Permanent Monday. Literary commentary on a sample text by exactly this online Markov chains. I the states that it can feel the problem of letters. The [PageRank] can feel the list of startups; we produce the Internet had become, because the heingoind of-pleat, blur it meant for your mild amusement, two word groupings -- what I found on of clicks. This happens to me about what I present below, for instance, is an sit.
roger on June 11, 2008 12:12 PM@phatmonkey those quotes are without a doubt the funniest thing I have seen allllllll year.
You rock!
I have never felt dumber; I have no idea what this entry is trying to explain to me
linnrose on June 11, 2008 12:34 PMShall I be the one to say it?
OK - look everyone - the emperor has no clothes on!
I see no value at all in any of this. I might have when I was 12, but probably not.
kenneth on June 11, 2008 12:47 PMI love the kookybot and having it on Twitter is convenient.
Vezquex on June 11, 2008 12:59 PMFor everyone whose interested in Markovian Mozart generators, take a look at these for specifics:
Modeling music as Markov chains - composer identification
http://ccrma.stanford.edu/~jacobliu/254report/
Analysis and Synthesis of Palestrina-Style Counterpoint Using Markov Chains
http://alumni.media.mit.edu/~schoner/papers/ICMC01_PALESTRINA.pdf
Just because it amuses you so much. This post adds at least 1 gallon back to the reservoir.
Chris on June 11, 2008 01:37 PMWhat if we use markov chains feeding it all of our basecode, hoping it will generate a program that can think and take over the world a la matrix, skynet, etc...
=)
Juan on June 11, 2008 01:58 PMThere are some references online to Musical Markov Chains:
Digital Music Programming II: Markov Chains
http://peabody.sapp.org/class/dmp2/lab/markov1/
and some of the references date back to the 1950's:
http://www.musiccog.ohio-state.edu/Music829D/Notes/Infotheory.html
There's some confusion here about Markov models and Bayesian statistics in this blog post and in the presentation to which it refers.
1. A Markov model is just a model where the current output probability depends on only on the previous output's value.
2. A hidden Markov model has an unobserved (aka latent or hidden) state for each output, with the state depending only on the previous state (that is, the states form a Markov model), and the output depending only on the state.
3. "Bayesian" spam filtering uses Bayes's rule and marginalization to to "invert" estimates:
p(ham|e-mail) = p(e-mail,ham) / [p(e-mail,ham) + p(e-mail,spam)]
where p(e-mail,C) = p(C) * p(e-mail|C) is defined by the chain rule. Note the "inversion" in moving from directly modeling p(e-mail|ham) and p(ham) [the likelihood a message is ham without knowing anything about it] to inferring p(ham|e-mail) through probability theory (aka algebra).
The most common model for the text portions of p(e-mail|ham) is a Markov model!
A more accurate spam filter can be built using discriminative estimation techniques like logistic regression (as opposed to generative techniques like Markov models). See this paper by Goodman (the organizer of the annual spam research conference and Bill Gates's technical assistant) and Yih:
http://scottyih.org/GoodmanYih-ceas06.pdf
4. "Fully" Bayesian statistics, as understood by statisticians, is a framework for inference. To be truly Bayesian, you need a full (joint) probability model of all variables of interest, and you need to propagate the uncertainty in estimates through inference.
Either read the intro to Gelman et al.'s Bayesian Data Analysis book or this nice intro by David MacKay, a leading Bayesian:
http://www.inference.phy.cam.ac.uk/mackay/itprnn/ps/457.466.pdf
or this one by Griffiths and Yulie:
http://cocosci.berkeley.edu/tom/papers/tutorial.pdf
Don't trust the Wikipedia, which confuses Bayes rule with Bayesian inference and misses the importance of uncertainty.
5. The best estimates for Markov models are truly Bayesian in sense 4, as described in this paper:
http://www.stanford.edu/~sgwater/papers/acl07-bhmm.pdf
Bob Carpenter on June 11, 2008 03:17 PM<blockquote>Uh, Jeffy? The rest of us heard about Markovs in like 1980-something.</blockquote>
Speak for yourself. Some of us were in diapers in 198x. Tiredest complaint on the net.
gex on June 11, 2008 05:06 PMOr, to put it another way: those who forget the joy of reimplementation are doomed to grouse about history.
I'm pretty sure that was Churchill.
Josh Millard on June 11, 2008 05:13 PM> Only because you brought it up (in a way), Jeff: Are you still using POPFile?
No, I've completely outsourced all my mail to Google / Microsoft / Yahoo now that they allow hosted email. I figure they're a lot smarter than I am. Fighting spam is interesting, but thankless and repetitive. Gmail does the best job, though occasionally (about once every two weeks) something will creep through.
> The generator always correctly predicts that a carriage return is followed by one of these prefixes, so the generated text still looks like an IM conversation.
That's the amazing thing about Markov chains. As Josh Millard, the author of Garkov, pointed out in this very thread: they're so ridiculously simple, yet they work almost eerily well.
Also thanks, Josh, for making an appearance and offering such useful insights on your Markov experience.
> They're a reminder that Garfield CANNOT TALK. When you remember that Arbuckle cannot know what Garfield is thinking, the strip gains a new level of ridiculousness.
Ah, yes. You're absolutely right. That's what I wasn't getting out of Arbuckle ( http://www.tailsteak.com/arbuckle/ ).
Jeff Atwood on June 11, 2008 05:26 PMJust to point it out, it looks like the Garfield Randomizer appears to be defunct no longer.
Griff on June 11, 2008 07:04 PMJeff, the thing I love about your site is I can spend two weeks in the Australian outback learning how to predict the weather by studying the entrails of hand-wrestled alligators and come back to civilization to find your blog still here, filled with new posts, each one of which is guaranteed to be interesting. You are one of the most consistent bloggers out there and considering how long you've been blogging that's saying a lot.
Now, any ideas on using Markov chains to simulate chat text at an online poker table? :)
www.codingthewheel.com on June 11, 2008 09:25 PMWhat's your conclusion on the original comment? Do you now believe it was created by a bot using Markov chains?
Harold on June 11, 2008 09:30 PM@Bob Carpenter: "Don't trust the Wikipedia, which confuses Bayes rule with Bayesian inference and misses the importance of uncertainty."
Well you sound like you know what you are talking about Bob, so go fix it then!
Graham Stewart on June 12, 2008 01:31 AMProgramming is all about knowing when to boil the orange sponge donkey across the phillipines with an orangutang gorilla crossed with a ham sandwich to the fourth power of twelve across the nile with an awful headache from the previous night when all of alfred's naughty jalapeno peppers frog-marched the nordic elves across the loom-lined geronimo induced swamp donkey over and above the fortran fortified kilomanjaro fence past the meticulously crafted anti disgusting sponge cake scenario where all the hats doth quoteth the milk which is not unlike my kingdom I was finally there, to sit on my throne as the Prince of Bel Air.
Bob Carpenter - thanks I was going to post something similar. It is probably a far stretch to think most readers are familiar with what a Markov chain is generally.
Generating random text is one of probably thousands of specific applications for Markov models. They have lots of nice mathematical properties that make them easy to solve. There are many serious applications that range from manufacturing, logistics, military, finance, etc.
Jason B on June 12, 2008 06:05 AMNor again is pain, but because it is pain, but because
to find fault with a man who
has no annoying consequences, or one who
can procure him some great pleasure. To take a man who
occasionally circumstances occur in which of us ever undertakes laborious physical
avoids a trivial
avoids a man who loves or one who
to obtain some great pleasure. To take a pain
to enjoy
occasionally circumstances occur in which toil and pain
has no annoying consequences, or desires
exercise, except to obtain pai
Im sitting at work with tears rolling down my cheecks at kooky's guotes... this is not helping the fact my co-workers think im a sociopath!
stEvil on June 12, 2008 06:13 AM> Also thanks, Josh, for making an appearance and offering such useful insights on your Markov experience.
My pleasure, Jeff. I've tripped across your site more than once, and I'm regretting not having been a regular reader before. Thanks again for the discussion; I enjoy a good chance to geek out.
> Just to point it out, it looks like the Garfield Randomizer appears to be defunct no longer.
Aye, it lives (linked along with the rest of the Garfieldiana of which I'm aware at the bottom of the Garkov page). And another fella wrote to me yesterday to point out his somewhat spruced up revision:
http://arc.malicelabs.com/garfield/
> where all the hats doth quoteth the milk which is not unlike my kingdom I was finally there, to sit on my throne as the Prince of Bel Air.
Trevor, I am hugging you *so hard* right now.
Josh Millard on June 12, 2008 07:45 AMAnd this is utterly silly, but something has been bugging me about that comment from "hello" since I first read it yesterday, and this morning it finally struck me:
It reads like bad high school poetry. Precocious proto-surrealism from someone who hasn't managed to work all the kinks out of their fundamentals.
I've throw a proper with-linebreaks treatment onto my blog:
http://www.joshmillard.com/2008/06/12/found-poetry/
Josh Millard on June 12, 2008 08:06 AMThe "value" of this is that it probably explains (functionally) how at least one layer of human memory for words works.
To minimize load on the higher decision making functions, we almost certainly have a "markov" strategy in our brains that reduces the number of possibilities following each word to a manageable number in the majority of cases. Only when the next word is a low probability do we have to think about it. There's probably a measurable pause too while we come up with the word. The rest, we probably pick from the short markov array.
Any useful language software which exhibits semantic comprehension will also probably use this strategy.
ThatGuyInTheBack on June 12, 2008 08:15 AMIn Finally, a random gibberish, and so profoundly weird. So the states that vide was generated through Markov text in Garkov hall of Markov chains. I found on selected strips. * Garfield strips. If you haven't read is an adequate input corpus. Given an outsider, who deliberately stirred up fights in this way, but using two representative strips I wit hengamind tarer-plarody thishand. So what Bentley refers to work with. Now let's proceed to the A List is often followed in this excellent presentation (pdf). How to pornographic sites, too! Markovian techniques than the input that the CRM114 Discriminator,
Dingbat on June 12, 2008 10:02 AMi was thinking... what would happen if you fed it song lyrics...
so i did... this is the firs Ramones album "Markovized"
Hey ho, let's dance.
Hey ho, let's dance; let's dance; let's go
Shoot'em in love
'Cause I was a punk
Judy is a punk
Judy is a baseball bat
Oh yeah, uh-oh.
What do the banana.
Now the basement
There's somethin' down to turn a trick
53rd and ready to the brat
Beat on the world Hey baby yeah you up
I'm ever thinking of
Now I'm tired of there
She'll never get out of there
She'll never get out my baby yeah you let me walk around with me?
I was a chance?
Say that you love me
I tried feeding it Jabberwocky and the result made just about as much sense:
"Beware the Jabberwock, with eyes of flame, came whiffling through and the slithy toves did gyre and shun the frumious Bandersnatch!"
http://en.wikipedia.org/wiki/Jabberwocky
Graham Stewart on June 12, 2008 11:01 AMHi there..
M on June 12, 2008 10:26 PMHi there..
I have made a whole website containing nothing but Markov generated content..
check www.s3.dk
M on June 12, 2008 10:27 PMWell, there is this old generator for computer science papers. It is a bit more complex than Markov models, though (I think they also use a parser in a ddition to a markov chain): http://pdos.csail.mit.edu/scigen/
oren on June 12, 2008 11:12 PMA little off-topic, but...
I have found a site that offers the actual Garfield strips free, with no advertising. It is simple to use, and seems to contain every strip ever published.
http://www.listen-project.de/garfield/index.php?date=13.06.2008
Hope you enjoy it as much as I do.
DDR on June 12, 2008 11:45 PMYou can save the Garfield Randomizer page down and view it, ucomics.com blocks requests from that site.
The TaBB citizens won't like this, but here is Garfield with thought bubbles removed. From page 19 on the thread is zombie-like.
hdh on June 13, 2008 12:51 AMI mean http://www.truthandbeautybombs.com/bb/viewtopic.php?t=4997
hdh on June 13, 2008 12:51 AMGreat article Jeff, I've surely spent all my free time stuffing around with Markov chains since you posted this up.
If you ever felt the inclination to discuss Computational Linguistics in more depth I'd really enjoy a Coding-Horrorized commentary on recursive transition networks and similar topics.
Elf of Jive on June 13, 2008 04:49 AMHaha, love this post, twas a good way to spend 2nd period.
Zoasterboy on June 13, 2008 10:20 AMthose who also visit <a href="http://www.thedailywtf.com">http://www.thedailywtf.com</a> regularly probably recognize this AI text:
The pig go. Go is to the fountain.
The pig put foot. Grunt. Foot in what? ketchup.
The dove fly. Fly is in sky.
The dove drop something. The something on the pig. The pig disgusting.
The pig rattle. Rattle with dove.
The dove angry. The pig leave.
The dove produce.
Produce is chicken wing. With wing bark. No Quack.
And this is the 'translation' according to a commenter:
The pig went to the fountain.
The pig tested the fountain water with its foot and found it was in fact ketchup.
The dove flew in the sky.
The dove shat on the pig.
In retaliation, the pig rapes the dove.
After doing his business and angering the dove, the pig makes his exit.
The dove becomes pregnant to the pig and gives birth.
The offspring is part chicken, part dog, no duck.
see <a href="http://thedailywtf.com/Articles/Classic-WTF-No-Quack.aspx">http://thedailywtf.com/Articles/Classic-WTF-No-Quack.aspx</a> for the article about this creation
sobani on June 13, 2008 10:34 AMJeff,
I'm starting to wonder if you're a combination between a search engine scraper and a markov chain.
The article that this post is based on is one of the #1 ranked results for many queries about markov chains. Your other recent article about pronounciation of ascii characters is also based on an site that comes up #1 for "ascii".
You're perceptive to bring up the connection between PageRank and Markov models, because a similar statistical process governs the growth of the web's link structure: once a site has a top ranking in Google, it gets more organic links than other sites -- regardless of the actual quality. It takes every trick in the book, and then some, to compete with a site that's already well established on the web. No wonder why so many of us are turning to black hat, grey hat, and blue hat SEO.
The Garfield Generator 2 isn't defunct.
http://arc.malicelabs.com/garfield/
atomicthumbs on June 13, 2008 11:14 PMMy gut says that applying Markov to chess at a internet harvester/analyzer and we use Bayesian spam filtering. They're even better! The most notable example of me, Jeff. Glad you showed up. I wonder if you interview people for programming jobs on a single block.
Genetic algorithms are good for a project in college -- analyzing only melodies, and looking at melodic interval and time intervals as two separate tables rather than the input that the comment box stripped html and actually linked to your statement that "Lasagna Cat" is the tendency toward sudden surreal shift across an unexpected point of view! They're a reminder that Garfield CANNOT TALK.
Thanks, commenters before me on June 22, 2008 06:42 PMThe slides are so wrong. Actually, SpamAssassin's naive bayes classifier outperforms CRM114 in every single test.
http://plg.uwaterloo.ca/~gvcormac/spamcormack06.pdf
There's a serious paper about it, including also information about other products.
tezeract on June 23, 2008 05:17 PM| Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved. |