All Abstractions Are Failed Abstractions

June 30, 2009

In programming, abstractions are powerful things:

Joel Spolsky has an article in which he states

All non-trivial abstractions, to some degree, are leaky.

This is overly dogmatic - for example, bignum classes are exactly the same regardless of the native integer multiplication. Ignoring that, this statement is essentially true, but rather inane and missing the point. Without abstractions, all our code would be completely interdependent and unmaintainable, and abstractions do a remarkable job of cleaning that up. It is a testament to the power of abstraction and how much we take it for granted that such a statement can be made at all, as if we always expected to be able to write large pieces of software in a maintainable manner.

But they can cause problems of their own. Let's consider a particular LINQ to SQL query, designed to retrieve the most recent 48 Stack Overflow questions.

var posts = 
  (from p in DB.Posts
  where 
  p.PostTypeId == PostTypeId.Question &&
  p.DeletionDate == null &&
  p.Score >= minscore
  orderby p.LastActivityDate descending
  select p).
  Take(maxposts);

The big hook here is that this is code the compiler actually understands. You get code completion, compiler errors if you rename a database field or mistype the syntax, and so forth. Perhaps best of all, you get an honest to goodness post object as output! So you can turn around and immediately do stuff like this:

foreach (var post in posts.ToList())
{
    Render(post.Body);
}

Pretty cool, right?

Well, that Linq to SQL query is functionally equivalent to this old-school SQL blob. More than functionally, it is literally identical, if you examine the SQL string that LINQ generates behind the scenes:

string query = 
  "select top 48 * from Posts
  where 
  PostTypeId = 1 and 
  DeletionDate is null and 
  Score >= -4
  order by LastActivityDate desc";

This text blob is of course totally opaque to the compiler. Fat-finger a syntax error in here, and you won't find out about it until runtime. Even if it does run without a runtime error, processing the output of the query is awkward. It takes row level references and a lot of tedious data conversion to get at the underlying data.

var posts = DB.ExecuteQuery(query);

foreach (var post in posts.ToList());
{
   Render(post["Body"].ToString());
}

So, LINQ to SQL is an abstraction -- we're abstracting away raw SQL and database access in favor of native language constructs and objects. I'd argue that Linq to SQL is a good abstraction. Heck, it's exactly what I asked for five years ago.

But even a good abstraction can break down in unexpected ways.

Consider this optimization, which is trivial in the old-school SQL blob code: instead of pulling down every single field in the post records, why not pull just the id number? Makes sense, if that's all I need. And it's faster -- much faster!

select top 48 * from Posts827 ms
select top 48 Id from Posts260 ms

Selecting all columns with the star (*) operator is expensive, and that's what LINQ to SQL always does by default. Yes, you can specify lazy loading, but not on a per-query basis. Normally, this is a non-issue, because selecting all columns for simple queries is not all that expensive. And you'd think pulling down 48 measly little post records would be squarely in the "not expensive" category!

So let's compare apples to apples. What if we got just the id numbers, then retrieved the full data for each row?

select top 48 Id from Posts260 ms
select * from Posts where Id = 123453 ms

Now, retrieving 48 individual records one by one is sort of silly, becase you could easily construct a single where Id in (1,2,3..,47,48) query that would grab all 48 posts in one go. But even if we did it in this naive way, the total execution time is still a very reasonable (48 * 3 ms) + 260 ms = 404 ms. That is half the time of the standard select-star SQL emitted by LINQ to SQL!

An extra 400 milliseconds doesn't sound like much, but slow pages lose users. And why in the world would you perform a slow database query on every single page of your website when you don't have to?

It's tempting to blame Linq, but is Linq really at fault here? These seem like identical database operations to me:

1. Give me all columns of data for the top 48 posts.

or

1. Give me just the ids for the top 48 posts.
2. Retrieve all columns of data for each of those 48 ids.

So why in the wide, wide world of sports would one of these seemingly identical operations be twice as slow as the other?

The problem isn't Linq to SQL. The problem is that we're attempting to spackle a nice, clean abstraction over a database that is full of highly irregular and unusual real world behaviors. Databases that:

  • may not have the right indexes
  • may misinterpret your query and generate an inefficient query plan
  • are trying to perform an operation that doesn't fit well in available memory
  • are paging data from disks which might be busy at that particular moment
  • might contain irregularly sized column datatypes

That's what's so frustrating. We can't just pretend all our data is formatted into neat, orderly data structures sitting there in memory, lined up in convenient little queues for us to reach out and casually scoop them up. As I've demonstrated, even trivial queries can have bizarre behavior and performance characteristics that are not at all clear.

To its credit, Linq to SQL is quite flexible: we can use strongly typed queries, or we can use SQL blob queries that we cast to the right object type. That flexibility is critical, because so much of our performance depends on these quirks of the database. We default to the built-in Linq language constructs, and drop down to hand-tuning ye olde SQL blobs where the performance traces tell us we need to.

Either way, it's clear that you've got to know what's happening in the database every step of the way to even begin understanding the performance of your application, much less troubleshoot it.

I think you could make a fairly solid case that Linq to SQL is, in fact, a leaky and failed abstraction. Exactly the kind of thing Joel was complaining about. But I'd also argue that virtually all good programming abstractions are failed abstractions. I don't think I've ever used one that didn't leak like a sieve. But I think that's an awfully architecture astronaut way of looking at things. Instead, let's ask ourselves a more pragmatic question:

Does this abstraction make our code at least a little easier to write? To understand? To troubleshoot? Are we better off with this abstraction than we were without it?

It's our job as modern programmers not to abandon abstractions due to these deficiencies, but to embrace the useful elements of them, to adapt the working parts and construct ever so slightly less leaky and broken abstractions over time. Like desperate citizens manning a dike in a category 5 storm, we programmers keep piling up these leaky abstractions, shoring up as best we can, desperately attempting to stay ahead of the endlessly rising waters of complexity.

As much as I may curse Linq to SQL as yet another failed abstraction, I'll continue to use it. Yes, I may end up soggy and irritable at times. But it sure as heck beats drowning.

Posted by Jeff Atwood
159 Comments

I tried a lil test of my own which goes

select top 1000000 * from table

select top 1000000 ID from table

select * from table where ID in (select top 1000000 ID from table)

These took

118 seconds

8 seconds

175 seconds on the third query.

Rather the opposite to us presuming that getting 1 row takes 1 second, so getting 48 rows with an IN clause takes 48 seconds.

Anonymous on July 1, 2009 7:48 AM

Where is the NOLOCK behind the sql if we are talking about performance?

Mischa Kroon on July 1, 2009 8:09 AM

If all abstractions are failed abstractions, then we shouldn't be using HLLs at all! We should all be programming in C! Or, better yet, Assembler! Or machine code!

All I'm trying to say is that abstraction is good to a degree. Too much is bad, but languages that abstract the creation of assembler instructions are Good Things.

Anonymous on July 1, 2009 8:51 AM

If all abstractions are failed abstractions, then we shouldn't be using HLLs at all! We should all be programming in C! Or, better yet, Assembler! Or machine code!

All I'm trying to say is that abstraction is good to a degree. Too much is bad, but languages that abstract the creation of assembler instructions are Good Things.

Anonymous on July 1, 2009 8:52 AM

Hey Jeff...loved the iPhone post, didn't love this post.

LINQ to SQL (or better, yet, call it "LINQ to SQL Server" as that's all it is, kicks ass. It's just a nice fluent DSL over SQL. That's it. It's the thinnest of abstractions.

You really need to correct your code. I'd be happy to do a code review of StackOverflow with you and your guys.

* You don't need to call ToList(). Stop doing that and let IQueryable do its job.
* Use more projections and select new {} anonymous type or select p.ID, rather than select p.

Thanks!

Scott Hanselman on July 1, 2009 9:01 AM

Jeff, frankly speaking for once, I think you are misguiding your readers. You ask for "Select P" and complain about that getting converted to "Select *". If you really wanted only 1 column or a few columns you would certainly use select p.Id or select new {p.Id,p.Name} etc.

Anonymous on July 1, 2009 9:08 AM

Ferdy wrote:

>However, I would expect that people put effort and
>competence into tuning, and that buying a server is
>not the default solution.

And you think there is not cost to the environment, energy and the human soul from doing that? Perhaps in the time saved debugging, and handling user complaints, the programmer could be out riding a boat to save the whales.

JessicaBoxer on July 1, 2009 9:09 AM

@JessicaBoxer: Perhaps the programmer could spend time becoming a better programmer!

Marcel Lamothe on July 1, 2009 9:35 AM

Given the fluid syntax, yet indirect control we have over the actual statements sent to the database, I approach LINQ to SQL with cautious optimism.

I see the relationship of LINQ to SQL vs. direct database programming as similar to C# vs. machine code. I trust that the C# compiler and the CLR JIT compiler will produce machine code from my C# code that is sufficiently performant such that the benefit of writing in C# far exceeds the cost of writing machine code directly.

And given that SQL statements are being compiled from my LINQ statements, the "select *" issue may very well be optimized out in the future, and likely with no change to the code I've written.

Craig Boland on July 1, 2009 9:42 AM

@JessicaBoxer

Your point is funny but from personal experience there is a quite small limit to how often you can just "add a server" and see a real increase.

I recently upgraded our ERP server from a quad-core with 3 gigs of RAM and 3 disks to an dual quad-core with 32 gigs of RAM, 17 hard drives all 50% faster, separate physical RAID 10 Arrays for the OS, Applications, Database, Log Files, and Temporary database files.

The net result? A 5% increase in performance. Now I know this ERP system like the back of my hand (but didn't write it) and I spent a long time making sure that my hardware configuration exactly follows the vendor's so-called "best practices".

It's simply that the system it runs uses a poorly designed and very leaky database abstraction which makes poor assumptions about its physical environment. (you know like one of those academic papers, "assume the network is of infinite bandwidth and hard drive access is instantaneous").

Bottom line, we paid 15,000 Euros for a 5% performance increase, how many more times do you think any given company could afford to do this? Plus, even if I wanted to, there is a physical limit on how fast a server I can get.
The only way to fix the problem is to fix the abstraction, i.e rewrite the shitty code.

Stephen on July 1, 2009 10:25 AM

I'll admit, I've never used LINQ yet, but I was interested in possibly adapting it in the future. From reading this, it sounds like a revamp of old-school VB6 "Open Recordset" for people who can't quite get their heads into set-based operations instead of stepping through rows? If that's the case, it's not for me but I see the use.

However:
In the original example, where did the "Top 48" come from to be introduced into the LINQ sql statement? Did LINQ determine there were 48 records in the set to be returned, because I don't see how Jeff's parameters would have otherwise indicated he only wanted 48. Or was this snippet a pseudocode with the limit set outside the example shown? I would find it interesting if LINQ pre-calcuated it's own statistics, and would have to consider that maybe Microsoft launched a second SQL optimizer team within LINQ - which I would explore.

keith on July 1, 2009 10:50 AM

snarf! Using Joel's argument, we may as well go back to using an abacus because all improvements afterwards "leak".

Bah!

Writing correctly executing "slow" code is more important than writing incorrectly or poorly executing "fast" code. Write correct first, optimize where necessary later. Abstractions help us write correct code more easily.

Andrew on July 1, 2009 10:59 AM

I don't like LINQ2SQL. If speed matters then the SqlDatareader will win every time. And why create all these new DSL:s - reason: because the old stuff is so boring to write!!! Good reason -( * NOT * !!!). I think a lot of these software methodlogies are crap, because one day you believe in Method A and the next the day it's Method B, you change your mind like you change underwear!
Now there's coming this Entity thing, and this NHibernate is here already and LINQ - but when you use the sqldatareader life is sweet and simple! It's fast, simple and reliable. Would you want it any other way?

code monkey on July 1, 2009 11:01 AM

I don't like LINQ2SQL. If speed matters then the SqlDatareader will win every time. And why create all these new DSL:s - reason: because the old stuff is so boring to write!!! Good reason -( * NOT * !!!). I think a lot of these software methodlogies are crap, because one day you believe in Method A and the next the day it's Method B, you change your mind like you change underwear!
Now there's coming this Entity thing, and this NHibernate is here already and LINQ - but when you use the sqldatareader life is sweet and simple! It's fast, simple and reliable. Would you want it any other way?

code monkey on July 1, 2009 11:02 AM

So many comments by so many programmers that don't know the first thing about Linq, much less LinqToSql.

These programmers are leakier than the abstractions they complain about.

David B on July 1, 2009 11:37 AM

I think LINQ sucks, always something that goes wrong so I use my Sqldatareader and push it in a generic list. Works everytime!!

code monkey on July 1, 2009 12:02 PM

@Andrew

Snarf? Isn't that the little guy from ThunderCats?

Practicality on July 1, 2009 1:05 PM

Hi,

as some of the commenters already pointed out there is a serious leakage in the reasoning behind the samples in this post, or to put it more bluntly THEY ARE WRONG. This serves as a trmendous disfavor to the creators of LINQ and all those developers who have been using it in their work (as in, "Hey, are you using this *crap* that Jeff denounced recently?")
So, I propose to you Jeff to create a new post and to apologise for your mistake and openly say that in fact LINQ is less leaky as you claimed.

Przemek (software developer at Goyello).

Anonymous on July 1, 2009 1:33 PM

Tight Coupling is not the perfect answer but is a "simple" solution that works. I may not know a lot but I'm humble at least, instead of those arrogant and ignorant bastards who haven't a clue, but believes they are "jedi masters". And playing WOW doesn't make you a computer genius if you thought so!!!

Wanko on July 2, 2009 5:43 AM

Not to pile on (which of course is exactly what I'm doing) but you can achieve select ID quite easily with Linq to SQL and I'm amazed you're using Linq to SQL and haven't realized that yet. There's a very strong, predictable correlation between your LINQ query and the resulting SQL.

Once you have a better understanding of Linq to SQL, then you might be able to comment intelligently on how leaky of an abstraction it is.

JohnOpincar on July 2, 2009 8:37 AM

@ Xiani

"Now, I expect this process is more than just a bit optimized - SELECT * is after all a pretty ubiquitous shortcut, but it is still a shortcut and hence has a cost, and it's one of the first things any performance-obsessed DBA (i.e all of 'em!) will tell you to stop doing."

IF you are actually getting all the columns, I'd argue that "SELECT *" may be (depending on the DBMS implementation) a MORE performant method than "SELECT COL1, COL2, COL3, COL4, COL5".

SELECT *:

App passes in simple string.
DBMS parses SQL query string
DBMS sees "*" and translates that to a (cached) array of column references

SELECT [list of named columns]

App constructs longer string (either from a cached array of column references or stored in-memory as the intact SQL string)
DBMS parses SQL query string
DBMS sees "COL1" and translates that to a column reference
DBMS sees "COL2" and translates that to a column reference
...

The upside in the latter is that the application *can* rely on column positioning to get specific column values, but that leads to highly fragile code and should generally be avoided by the application programmer.

Not trying to pick nits or anything, but just wanted to point out that optimizing one "layer" of your stack by imposing draconian edicts on other layers isn't always a net performance gain.

Tom Dibble on July 2, 2009 10:53 AM

To add: ubiquitous shortcuts are generally a GOOD thing in an environment, as they can be special-cased for performance. Caching the '*' column reference array is a quick and easy special case optimization.

In general, when allowing for highly expressive environments, the more "shortcuts" you provide your developers the more opportunities you have to increase performance when your developers take advantage of those shortcuts. On the opposite perspective, when developing in a well-written system with prominent shortcuts, it is often better to USE the shortcut and thus take advantage of that special casing instead of trying to outthink the system designer. Cases where the above is NOT true should be highly visible in the documentation.

Tom Dibble on July 2, 2009 11:00 AM

@janos erdelyi - "if i had my DBA hat on, i'd also be carrying a DBA Knife to kill you in your sleep."

They gave you a DBA Knife? All I got was the stupid Hat! I can see how the Knife would may so many things so much easier, though. Hmmm, I wonder if RedGate sells them...


(seriously, that's the funniest thing I've seen all week)

Chucklehead on July 2, 2009 11:13 AM

Great article (again). A willingness to consider abstractions to see if they make life easier, rather than a blanket "that abstraction is leaky so I won't use it" is essential.
See my post on this mindset in the context of business rules:
http://www.ebizq.net/blogs/decision_management/2009/07/business_rules_are_a_failed_ab.php

JT

James Taylor on July 2, 2009 11:39 AM

There is so much FUD regarding LINQ to SQL in the comments to this post that it would take Several paragraphs to correct them all. I won't try.

But - going unmentioned - is the use of lambda expressions rather than the pseudo SQL syntax. This allows the build up of complex queries in easy to understand syntax rather. Say goodbye to the typical joins from hell or procedural code in T-SQL - not the best place for it.

Also, the ability to use the SAME abstraction working over in memory collections, XML, and data from the database provide a comprehensive, consistent model for working with collections. This fact seems to be missing from the discussion.

Jeff - I would take up Mr Hanselman on his offer!

Jeff on July 3, 2009 4:50 AM

Linq2Sql is great for rapid prototyping and throw-away scripts.
But if you have to play the performance game on high volume webservers, you better handtune and use sqldatareaders and other forms of optimizations.
Sometimes you have to import sql-data in your own datastructures to gain the needed performance. An sql-server itself is a big layer of abstraction, which of course often leads to performance-bottlenecks.

CWissing on July 3, 2009 12:52 PM

@Hanselman, @JohnOpinCar & others:

You have either missed the point entirely or you are just taking a strawman cheapshot at Jeff. Get down from your high horses and re-read what Jeff actually wrote.

Select p vs select p.id doesn't at all solve Jeffs problem. He wants all the columns. p.id is useful in that once you have found the underlying issue that the abstraction tries to hide you can work around it, but that's totally beside the point.

The point is that you have to bypass the abstraction in order to find the issue in the first place.

Jeffs claim that lin2sql generates "select *" by default is a valid complaint (if it is accurate, I assume it is). Why doesn't select p translate into "select [col1], [col2],.." etc? This is a very minor part of what Jeff wrote, though.

That it is possible to work around that with Linq is good in a way, as it means that the leaky abstraction has a band-aid.

If I want to, Linq lets me have to think about every sql weirdness but uses a different syntax that I have to learn - hooray...no wait, that's just pointless!

However, Linq lets you work with sql in an easy consistent way by abstracting some things for you. This works fine in most cases, which is why linq isn't pointless.

But is a LEAKY abstraction, and I think Jeff has made a good, valid point. This time...;-)

Console on July 4, 2009 4:39 AM

"Linq to SQL is a good abstraction"
How so? I've never understood how mangling SQL into a new "Linq" structure is somehow an improvement. I love how it fits in to the program - but I don't see why the structure is so different to SQL.

It reminds me of back in 2002 when one of the VB.Net designers told me that the names of many object oriented features were re-named from C++/C#/standard terminology just so that it was "different". Their argument was if it wasn't different it wouldn't be a different language. Here we have it again with Linq where it is different from SQL just so that it is different. It would be great if they could have based it on SQL - that would have made sense.

So - is Linq's any good? That comes down to perspective. From a programmer's perspective it offers ease of programming. From database performance perspective it is a big failure. You point out only some of the performance issues - there are many other performance issues you don't list.

Linq doesn't allow cached server side execution, which can be huge for large data sets with complicated relationships. Index and other compiler hints, collation issues and more advanced statements can't be done (at least not to my knowledge).

I've found major slow downs specifically caused by Linq. I've done a couple of searches and the general rule of thumb seems to be that data access runs at 50% under Linq compared to not using Linq.

I'm programming for a system that has 4-500 concurrent users on a database with over 100,000 customers and many millions of transactions. So for me the difference can be substantial. A SQL stored procedure, properly tunned and cached, may run in 10 seconds while the same command issued through Linq may take a minute or more to run.

Maybe this wouldn't matter so much on a smaller system.

Philip on July 5, 2009 4:16 AM

@Stephen says...
> there is a quite small limit to how often you can
> just "add a server" and see a real increase.

Of course that is true. Generally speaking this problem is the inability to run operations in parallel due to interdependency. However, generally speaking database type applications are quite amenable to parallelization. This is due to the fact that reads are far more prevalent than writes (and the fact that SELECT is largely declarative rather than procedural.) Consequently, database performance can often be improved by adding read only servers behind a good load balancing scheme.

In any application there tends to be a few small bottlenecks, and you should apply your "fix the nasty code" process there. Note that often fixing bottlenecks introduces nasty code, of necessity. (For example, introducing code that is highly dependent on the implementation details of a particular server, or breaking the abstraction of the database with such things as nolock.)

> I recently upgraded our ERP server from a quad-core ...
> The net result? A 5% increase in performance.

Two things -- I would not have gone with a big iron server, but rather multiple smaller, simpler servers. Secondly, if that is the case your problem is most likely very nasty code with lots of unnecessary dependencies. This is the sort of code that is written by people who obsess about squeezing every ounce of performance out of a server, which usually means they do so with premature optimization, leading to horrible, nasty, unmaintainable, and tangled code.

I know nothing more about your situation that what you told me, but I would make an educated guess that that was your situation. I'd recommend that you build a quality test case suite, and then start refactoring the heck out of the code.

JessicaBoxer on July 6, 2009 2:20 AM

I think it's because the underlying thing being abstracted is really general (e.g., a pencil, a microprocessor) and the abstraction is less general and more oriented towards a particular way of doing something (e.g., word processor, which works in lines of letters instead of line drawings, or shading, or whatever; higher-level language, which may not let you grab those bits and do things with and to them like you can with assembler), so you end up using the abstraction like a waldo or it's like using a wrench as a hammer: it feels horribly inefficient and wrong. The abstraction is more specialized. So you have to understand the underlying thing really really well and the idea of the abstraction really really well in order to push the abstraction around to do what you could do pretty easily with the underlying thing. I am thinking of a particular circa 1980s Jovial compiler ...

Dorothy on July 6, 2009 4:00 AM

Microsoft WANTS developers to use SQL; thye just recommend that it be done in a controlled manner: through stored procedures. LINQ struck those of us in the DBA realm as being completely contradictory to Microsoft's prior recommendations, and apparently there's a bit of a scism in Redmond about it.

In my opinion, it falls in the same category as the "4GL" tools of the 80's: it's a tool to allow poor programmers to write bad code faster.

Kris Cook on July 6, 2009 7:04 AM

@Philip - LINQ2SQL != LINQ. LINQ is a generalised querying language for working with many different types of data in an identical manner, from in-memory collections to back-end databases. That's the whole point - make sure you know what you're criticising first.

LINQ2SQL allows for the use of Stored Procedures (see http://weblogs.asp.net/scottgu/archive/2007/08/16/linq-to-sql-part-6-retrieving-data-using-stored-procedures.aspx) if you want generation of query plans ahead of time etc.

Could you provide the links where a 50% speedup is proven for using raw data readers as opposed to LINQ? I'm betting that you can't - MS wouldn't release LINQ2SQL with anything like that sort of overhead.

Enough with the FUD. Learn about the technology before trying to shoot it down.

Oh and where's Jeff in this discussion? He seems to have gone very quiet. I would have thought he would at least have corrected the post after so many people pointed out his mistakes.

Dave R. on July 6, 2009 8:35 AM

@David R.

Re-reading my post I can see that I should have been a lot more explicit.

Seeing as the topic was specifically about Linq to SQL I thought it was self evident that this was the subject. I like linq, and I'm not saying anything negative about it in general, just the use of SQL.

As for use of stored procedures through linq, and the use of compiled statemetns - they simply don't work anywhere near as fast. BUT - I was refering to linq statements - compiled or not. (I probably should have stated that).

Please don't assume that we are random users making comments about things we don't understand. I've been doing DBA on SQL for a decade so I'm capable of listinging to what others say and then doing my own tests to see what is correct. I LIKE LINQ (both generally and for SQL) - but it simply doesn't perform for me. I HOPE future versions do.

So - no FUD involved - these are tests I have done. I simply can not get LINQ to perform anywhere near the speed of other methods. Please try it yourself before condemming me as a FUD maker. I even explained my situation to make it clear that it may only be upper end users have have to worry about this.

Also, before you condem that we haven't learnt the techology it would pay for you to do some quick google searches to find out what the actual state of play is.

For instance,I did a google search and turned up something right away
http://www.devtoolshed.com/content/performance-benchmarks-linq-vs-sqldatareader-dataset-selects-part-1

http://www.devtoolshed.com/node/12

Philip on July 6, 2009 11:25 AM

I know little about LINQ or SQL, but there seems to be a logical fallacy in this post.

The argument is that LINQ is less efficient than SQL, but to optimise the SQL you have to make 48 individual queries to extract 48 rows. Can this be done in a single SQL statement? If not, doesn't that make SQL as much of a "failed" abstraction as LINQ?

By the way, you never state what the time is for the actual 48 query option. Why not? You surely don't think the equation "(48 * 3 ms) + 260 ms = 404 ms", is realistic.

Phenwoods on July 6, 2009 12:35 PM

IMHO abstractions aren't fundamentally leaky, the humans who implement them are.

Eagius on July 7, 2009 1:24 PM

If "SELECT TOP 48 * " is really slower (instead of faster) than "SELECT TOP 48 id" plus 48* "SELECT * WHERE id=", something must be badly broken in your database. The second combination creates *by far* more work for the database. I can only guess the experienced difference is because of caching effects.

ammoQ on July 7, 2009 1:35 PM

@Dave R

I think we aren't entirely in disagreement.

I explicitly detailed the size of my data and stated that my findings may only be due to size of the data I have.

You accuse me of FUD in general, and I disagree. But I actually agree that I am generating FUD in relation to LINQ2SQL for large databases/data warehouses. You are right! I am clearly saying that it is an inappropriate technology for large databases (Doubt). I am clearly saying that it runs slower than other more appropriate technologies on larger databases (Fear). I am saying that the cost/benefit is unknown without the developer first doing benchmarking and testing (Uncertainty). YES - I am preaching FUD for Linq2sql for large databases!

I chose those links because they were some of the first ones I googled. I could probably find more - but I don't have the time to prove myself right when I have already done that ON MY DATA. For all in know LINQ2SQL may be faster in some situations, that's why I STILL test alternative options in order to get the desired performance.

You said the links only used a few rows, and that's a valid issue. So, did YOU repeat the tests on that page with extra data? Did you add in 100,000 to 100,000,000 records and then look at the results? If not then give it a go - all the code was there to do just that. You don’t have to believe the site author, you CAN do the test yourself on data that matches the size of tables you deal with.

Chances are you can't be bothered, and I don’t blame you. I only started benchmarking and testing when I first noticed massive slowdowns when using LINQ2SQL, which is what woke me up to the issue.

I actually agree that there is a balance between the benefits of LINQ2SQL. I DO use Linq2sql for small/quick tasks, the problem is that majority of my data tasks are over large volumes of data. If something takes at least 16 ms to run... you could make it 10 times slower and I'm not going to care. BUT if something takes at least 60 seconds to run, you can put money on the fact I will be doing everything I can to make sure that's all it takes to run.

You mentioned the “small” penalty of LINQ, and this is where the size of the data becomes important. For large data it isn't a small penalty to use LINQ, it is a large one.

To be very specific, I have one table where the current row count is 3,505,920, and another that is 22,381,736. These tables are accessed very frequently. I simply can't get the performance out of LINQ2SQL to justify its use.

I have several hundred concurrent users. I am about to have maybe 800 mobile vehicles that log their gps position every 30 seconds AND provide extra data and GPS positions during specific events - producing a mind blowing amount of daily data.

So for me the speed impact of LINQ matters. StackOverflow? I don't know - that's up to them to do their own technology comparisons and tests. For you? I don’t know.

If you take it that I am generating FUD – fair enough, but the fear, uncertainty and doubt can all be eliminated by some very simple benchmarks that data developers should be doing.

Philip on July 8, 2009 6:09 AM

@Philip - did you actually read the sites you linked to? The guy's tests are hardly real-world. He does 10000 SELECTs on a single 25-row table using three different methods. If you're re-running a single query 10000 times using LINQ without compiling your query, the setup costs are going to be repeated 10000 times and amplified.

If you're generalising that all LINQ2SQL queries exhibit a 50% slowdown based on this, then, yes, I'm going to accuse you of FUD, or at least lack of knowledge about the technology involved. And it really doesn't matter how many years you've done SQL - this argument is about LINQ2SQL, not T-SQL. I'm still waiting for the stats to support your argument - I'm guessing that you can't find them.

It would be interesting to see what sort of effect switching from LINQ2SQL to 'raw' datareaders would have on a site such as StackOverflow. But I'm betting that the developers are happy to pay a small penalty for the increase in productivity afforded to them.

Dave R. on July 8, 2009 6:19 AM

@Dave R

I think we aren't entirely in disagreement.

I explicitly detailed the size of my data and stated that my findings may only be due to size of the data I have.

You accuse me of FUD in general, and I disagree. But I actually agree that I am generating FUD in relation to LINQ2SQL for large databases/data warehouses. You are right! I am clearly saying that it is an inappropriate technology for large databases (Doubt). I am clearly saying that it runs slower than other more appropriate technologies on larger databases (Fear). I am saying that the cost/benefit is unknown without the developer first doing benchmarking and testing (Uncertainty). YES - I am preaching FUD for Linq2sql for large databases!

I chose those links because they were some of the first ones I googled. I could probably find more - but I don't have the time to prove myself right when I have already done that ON MY DATA. For all in know LINQ2SQL may be faster in some situations, that's why I STILL test alternative options in order to get the desired performance.

You said the links only used a few rows, and that's a valid issue. So, did YOU repeat the tests on that page with extra data? Did you add in 100,000 to 100,000,000 records and then look at the results? If not then give it a go - all the code was there to do just that. You don’t have to believe the site author, you CAN do the test yourself on data that matches the size of tables you deal with.

Chances are you can't be bothered, and I don’t blame you. I only started benchmarking and testing when I first noticed massive slowdowns when using LINQ2SQL, which is what woke me up to the issue.

I actually agree that there is a balance between the benefits of LINQ2SQL. I DO use Linq2sql for small/quick tasks, the problem is that majority of my data tasks are over large volumes of data. If something takes at least 16 ms to run... you could make it 10 times slower and I'm not going to care. BUT if something takes at least 60 seconds to run, you can put money on the fact I will be doing everything I can to make sure that's all it takes to run.

You mentioned the “small” penalty of LINQ, and this is where the size of the data becomes important. For large data it isn't a small penalty to use LINQ, it is a large one.

To be very specific, I have one table where the current row count is 3,505,920, and another that is 22,381,736. These tables are accessed very frequently. I simply can't get the performance out of LINQ2SQL to justify its use.

I have several hundred concurrent users. I am about to have maybe 800 mobile vehicles that log their gps position every 30 seconds AND provide extra data and GPS positions during specific events - producing a mind blowing amount of daily data.

So for me the speed impact of LINQ matters. StackOverflow? I don't know - that's up to them to do their own technology comparisons and tests. For you? I don’t know.

If you take it that I am generating FUD – fair enough, but the fear, uncertainty and doubt can all be eliminated by some very simple benchmarks that data developers should be doing.

This is my last post on this page - I'll come back to read if you want to post anything else, but I think I've made my case as clear as I'm able to.

Philip on July 8, 2009 6:42 AM

I wonder, for your sample table / query, if nesting it works better.

SELECT * from table where ID in ( select top 48 id from table where [clauses] )

Theoretically gives the sql server a chance to optimize that it doesn't get from your independent queries, and gives some additional hint that you don't actually want everything from all entries then filtered to the top 48, but just everything from the top 48, which are subtly different.

Chris on July 8, 2009 12:09 PM

LINQ Rules! I can't believe developers are still writing SQL Strings and using SQLCommands and SqlConnections. I've been able to develop two applications in the time it would take me to develop just one, if I weren't using LINQ.

sam on July 11, 2009 10:20 AM

a lot of people miss the entire point of encapsulating data access with stored procedures - it's (also) security. DBA's (myself included) don't want user app's to have direct access to the tables. Using Linq requires that, so security is out the window. Using stored procedures can not only provide more efficient data access, but provide for a designed security layer.
-dennis

Dennis Parks on July 11, 2009 10:25 AM

2 Dennis Parks: thank you! finally someone pointed this out! I was surprised it took so many posts..

Tomas B on July 15, 2009 6:00 AM

While I place importance in the performance of any piece of code, I must admit that one of the driving forces in using heavily abstracted technologies -- that by their very nature add heaps of overhead -- is the client.

Clients are less interested in how properly coded their product is than the time-to-production.

I just went through an application that was using a vanilla datagrid to display a a paged result set, but the paging was being done client side and as a result a quarter hundred thousand records were being pulled from the database each page, while onlyl a hundred or so were being displayed.

My guess is that the original developer had not thought how many records there were going to be, instead focusing on the business logic around the abstracting control (the datagrid).

This abstraction was causing intermittent problems with resource usage (although I have not explicitly proved this, I think it is most likely), it had finally leaked.

Now this abstraction may have saved a few hours of coding but in the end it caused more hours of support work, and some user dissatisfaction.

It's not always clear when your abstraction will bite you in the @$$ at the end of the day, but it is important to know the pitfalls your abstraction will have, and when they may occur.

EnderWiggin on July 16, 2009 5:26 AM

I just hate the phrase "shoring up." It is such a Palinism. And also (intended), I think you could have just linked to Joel's original article and been done with it.

And also (again, intended), there are many, many much better tools than LINQ-to-SQL. I think this stuff is just throwing stones at glass houses that both save more time and commit far less of these kinds of mistakes.

Charles on July 17, 2009 11:12 AM

@Dennis Parks - LINQ2SQL can use stored procedures:

http://weblogs.asp.net/scottgu/archive/2007/08/16/linq-to-sql-part-6-retrieving-data-using-stored-procedures.aspx

Dave R. on July 20, 2009 6:32 AM

I'm really very useful to follow a long-time see this as a blog here Thank you for your valuable information.

OYuN on July 26, 2009 8:20 AM

Why not take your abstraction to the next level and sit your application code over an object model by using an ORM like JDO or JPA?

Then you get to use inheritance, polymorphism and interfaces (all forms of abstraction provided by all good OO languages) which are the crucial tools used to prevent code bloat in large applications.

You'll also find the amount of SQL code you have to write will drop to virtually nothing as the ORM does most of the mind numbingly, 'boring as batsh*t' SQL 'glue code' for you automatically - and yes, it even performs well, very well!

blinko on August 29, 2009 11:20 AM

Why would anyone want to use LINQ. I refuse to use it. I'm going to select data using real TSQL or PLSQL so I can get EXACTLY what I want and be done with it. Unbelievable how people will jump on the MS bandwagon for any new thing that comes out. Just makes me sick actually. I'm telling every potential employer the facts about LINQ - how it sucks so bad - and how I refuse to use it in my applications.

Why do you want to abstract something simple like SQL? Just makes no sense.

David on September 16, 2009 9:41 AM

It was a very nice idea! Just wanna say thank you for the information you have shared. Just continue writing this kind of post. I will be your loyal reader. Thanks again.

Wow gold on September 29, 2009 12:47 PM

Jeff, what the hell?

You're getting a "select *" in your SQL because you used the IDENTICAL "select p" in Linq. Change the Linq to "select p.Id" and you'll get just a "select Id" in the SQL.

This isn't a leaky/failed abstraction at all. Please tell me that you actually knew this all along, but just chose a really terrible example to make your point?

I get the whole thing about indexing and datatype sizes and yaddayadda, but if that is your point, then you have the exact same problem writing actual SQL without the abstraction. You can only deal with those other issues if you have control over the database design itself.

Aaron G on February 6, 2010 11:16 PM

It's unfortunate that legions of developers will scan the post, not read the corrections in the comments, and will endlessly parrot "spolsky and atwood said not to use LINQ to SQL".

As several other people commented, if you only want certain fields, you should select them. I think the point you were getting at was that the ORM should figure out what fields you're eventually going to use, but that seems like it might set off some NP alarms.

Jon Galloway on February 6, 2010 11:16 PM

I question why Jeff is using Linq at Stack Overflow at all?

Is Stack Overflow ever going to run on a non-sql server platform?

I would only consider Linq if we needed to support more than one DB platform. Running your own site on a specific platform doesn't meet that criteria. In that case you are better off writing Stored Proc and maximizing database performance and dealing with SQL and the RDMS engine directly.

Jon Raynor on February 6, 2010 11:16 PM

Linq to SQL is an abstraction? Not by my definition of abstraction. It's a domain specific embedded language that allows you to write SQL in C# - the C# AST fragment that the LINQ expression is translated to has a one-to-one correspondence with the SQL that is generated.

Stuart Dootson on February 6, 2010 11:16 PM

Everything you do is an abstraction. C# is an abstraction of the CLI. The CLI is an abstraction of C. C is an abstraction of assembler. Assembler is an abstraction of the microcode. If you want no abstractions, the only option is a field programmable gate array. Yes, they are fast. The point is that you need to decide whether the performance hit and leaks are worth the reduced programming effort.

Grant Johnson on February 6, 2010 11:16 PM

The abstraction argument is so pathetic (IMHO).

What is sql, c#, filesystems, etc except one big fat abstraction after another on top of ones and zeros.

damon on February 6, 2010 11:16 PM

After seeing LINQ used for database abstraction in a large project, I think it was a huge mistake (although LINQ expressions are very useful).

A better abstraction, I think, is to have a RESTful interface over HTTP that applications use. (The actual RESTful queries would not need any abstraction between it and SQL, and would be closely coupled to the schema.) But all applications downstream would essentially have a simple, clean, domain specific interface for manipulating data objects.

This doesn't solve all data problems -- certain schema changes would require changes to the RESTful interface. But if done right, I think this abstraction is about as leak-proof as you can get, when dealing with databases.

Jon on February 6, 2010 11:16 PM

As a developer, I simply cannot do my job unless I understand not just database structures but how the database works. To be honest, I need a good understanding of the underlying OS and storage layouts as well. This needs to be at a more than rudimentary level of knowledge and employers are beginning to realise that the best developers are those people who have a decent amount of experience in all these areas.

Abstractions will be at their most useful when we break down the silos between developers, DBAs and Admins and we are all able to understand exactly what it is the abstraction is hiding.

At this point, abstractions will still be leaky, we will just understand the leak and probably either avoid it or take advantage of it.

Jon on February 6, 2010 11:16 PM

Abstractions tend to work better if one starts from concrete then abstract on-demand and refrain from going further until needed.

Anyhow, I think this LINQ problem could actually be a case of optimizing at the wrong level because moving a time consuming task out of time constrained path is simpler and longer lasting than making it fit now and having to revisit again when requirements change, usually negatively.

Don Park on February 6, 2010 11:16 PM

«Back

The comments to this entry are closed.