Thoughts on ORMs, 2016 Edition

This is a bit of a follow up to my 3 year old post about LINQ 2 SQL and my even older 2011 Thoughts on ORM post. I've changed my position several times over the last three years as I learned about issues with different approaches. TL;DR is that for reads, I'd prefer handwritten SQL and Dapper for .Net as a mapper to avoid having to deal with DataReaders. For Inserts and Updates, I'm a bit torn.

I still think that L2S is a vastly better ORM than Entity Framework if you're OK with solely targeting MS SQL Server, but once you go into even slightly more complex scenarios, you'll run into issues like the dreaded SELECT N+1 far too easily. This is especially true if you pass around entities through layers of code, because now virtually any part of your code (incl. your View Models or serialization infrastructure) might make a ton of additional SQL calls.

The main problem here isn't so much detecting the issue (Tools like MiniProfiler or L2S Prof make that easy) - it's that fixing the issue can result in a massive code refactor. You'd have to break up your EntitySets and potentially create new business objects, which then require a bunch of further refactorings.

My strategy has been this for the past two years:

  1. All Database Code lives in a Repository Class
  2. The Repository exposes Business-objects
  3. After the Repository returns a value, no further DB Queries happen without the repository being called again

All Database Code lives in a Repository Class

I have one or more classes that end in Repository and it's these classes that implement the database logic. If I need to call the database from my MVC Controller, I need to call a repository. That way, any database queries live in one place, and I can optimize the heck out of calls as long as the inputs and outputs stay the same. I can also add a caching layer right there.

The Repository exposes Business-objects

If I have a specialized business type (say AccountWithBalances), then that's what the repository exposes. I can write a complex SQL Query that joins a bunch of tables (to get the account balances) and optimize it as much as I want. There are scenarios where I might have multiple repositories (e.g., AccountRepository and TransactionsRepository), in which case I need to make a judgement call: Should I add a "Cross-Entity" method in one of the repositories, or should I go one level higher into a service-layer (which could be the MVC Controller) to orchestrate?

After the Repository returns a value, no further DB Queries happen without the repository being called again

But regardless how I decide on where the AccountWithBalances Getter-method should live, the one thing that I'm not going to do is exposing a hot database object. Sure, it sounds convenient to get an Account and then just do var balance = acc.Transactions.Sum(t => t.Amount); but that will lead to 1am debugging sessions because your application broke down once more than two people hit it at the same time.

At the end, long term maintainability suffers greatly otherwise. It seems that it's more productive (and it is for simple apps), but once you get a grip on T-SQL you're writing your SQL Queries anyway, and now you don't have to worry about inefficient SQL because you can look at the Query Execution Plan and tweak. You're also not blocked from using optimized features like MERGE, hierarchyid or Windowing Functions. It's like going from MS SQL Server Lite to the real thing.

So raw ADO.net? Nah. The problem with that is that after you wrote an awesome query, you now have to deal with SqlDataReaders which is not pleasant. Now, if you were using LINQ 2 SQL, it would map to business objects for you, but it's slow. And I mean prohibitively so. I've had an app that queried the database in 200ms, but then took 18 seconds to map that to .net objects. That's why I started using Dapper (Disclaimer: I work for Stack Overflow, but I used Dapper before I did). It doesn't generate SQL, but it handles parameters for me and it does the mapping, pretty fast actually.

If you know T-SQL, this is the way I'd recommend going because the long-term maintainability is worth it. And if you don't know T-SQL, I recommend taking a week or so to learn the basics, because long-term it's in your best interest to know SQL.

But what about Insert, Update and Delete?

Another thing that requires you to use raw SQL is Deletes. E.g., "delete all accounts who haven't visited the site in 30 days" can be expressed in SQL, but with an ORM you can fall into the trap of first fetching all those rows and then deleting them one by one, which is just nasty.

But where it gets really complicated is when it comes to foreign key relationships. This is actually a discussion we had internally at Stack Overflow, with good points on either side. Let's say you have a Bank Account, and you add a new Transaction that also has a new Payee (the person charging you money, e.g., your Mortgage company). The schema would be designed to have a Payees table (Id, Name) and a Transactions table (Id, Date, PayeeId, Amount) with a Foreign Key between Transactions.PayeeId and Payees.Id. In our example, we would have to insert the new Payee first, get their Id and then create a Transaction with that Id. In SQL, I would probably write a Sproc for this so that I can use a Merge statement:

CREATE PROCEDURE dbo.InsertTransaction
	@PayeeName nvarchar(100),
	@Amount decimal(19,4)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @payeeIdTable TABLE (Id INTEGER)
    DECLARE @payeeId INT = NULL

    MERGE dbo.Payees AS TGT
        USING (VALUES (@PayeeName)) AS SRC (Name)
        ON (TGT.Name = SRC.Name)
    WHEN NOT MATCHED BY TARGET THEN
        INSERT(Name) VALUES (SRC.Name)
    WHEN MATCHED THEN
        UPDATE SET @payeeId = TGT.Id
    OUTPUT Inserted.Id INTO @payeeIdTable
    ;

    IF @payeeId IS NULL
        SET @payeeId = (SELECT TOP 1 Id FROM @payeeIdTable)

    INSERT INTO dbo.Transactions (Date, PayeeId, Amount)
    VALUES (GETDATE(), @payeeId, @Amount)
END
GO

The problem is that once you have several Foreign Key relationships, possibly even nested, this can quickly become really complex and hard to get right. In this case, there might be a good reason to use an ORM because built-in object tracking makes this a lot simpler. But this is the perspective of someone who never had performance problems doing INSERT or UPDATE, but plenty of problems doing SELECTs. If your application is INSERT-heavy, hand rolled SQL is the way to go - if only just because all these dynamic SQL queries generated by an ORM don't play well with SQL Server's Query Plan Cache (which is another reason to consider stored procedures (sprocs) for Write-heavy applications)

Concluding this, there are good reasons to use ORMs for developer productivity (especially if you don't know T-SQL well) but for me, I'll never touch an ORM for SELECTs again if I can avoid it.