Why is MNsure so broken? A Computer Programmer’s perspective

Let’s face it: three years in, the MNsure website is still fatally broken, ill-designed, ill-conceived, ill-planned, and ill-operated.   This website alone is arguably the biggest stain on Mark Dayton’s governorship (aside from giving billions to the Wilf family, convicted of civil fraud in New Jersey… but that’s a whole other subject).

For 3 years in a row now, around this time of the year, my social networks light on fire with friends and colleagues complaining in the most colorful language possible about their terrible MNsure experiences.   The consensus for this year is that no one should bother even attempting to shop for insurance on the website, opting instead to go to a broker, “assister”, or to the MNsure offices directly and sign up in person.  It’s been online for 3 years, and I kinda wonder why the hell they can’t fix it.

My personal experience

This year, I tried to renew my coverage in December, got to a screen where I was supposed to digitally sign some documents… and then the computer complained that I didn’t enter my OWN NAME properly on the signature line.  I’m pretty sure I know my own name.

I went back to the main screen and tried to start over, but now all the links that I expected to be there to allow me to shop for coverage were missing.  The system wouldn’t even let me shop for plans.

I figured I’d give the system a couple of weeks to iron out the glitch, but in January, I tried again to no avail.  By late January, I was hearing horror stories about long hold times on the phone listening to the same Kenny G music over and over and over only to be hung-up on by the system an hour into the hold.

So I took my friends’ advice and looked around for a place to apply in-person… but I couldn’t get to any of them in time.  Next thing I knew it was Saturday, the 30th of January.

Since the first in-person place was closed, I called their help line.  To my surprise, my hold time was very short, just a couple of minutes really, and the girl on the other end was very patient with me and with the system (which was breaking on her end too as we tried to get through the process).  At one point the system was rejecting even her login while we were on the phone, and she would grumble and mutter profanities while she tried to get around the glitches, but she was very patient and even-tempered and kind.

She was unable to reset my online account and give me my shopping links back, so we had to put my application through “manually” (I’ll be waiting on some snail-mail I guess).

Why du’unt it work?

How do calamities like the MNsure website happen?  I can’t tell you exactly for sure, because I don’t work for them, but I can give you some speculative, educated guesses.

1) Never, ever, ever, contract out programming work that you cannot maintain yourself.

MNsure hired a consulting company called Maximus as the “lead contractor” to build the MNsure website.  In my years of experience in the field of software engineering, I have learned that no outside company or contractor should ever be given the lead on your project.  External contractors are good for temporary staff augmentation only.  The state of Minnesota would have to maintain its MNsure site, infrastructure, and computers in the event that contracts with Maximus were terminated or Maximus folded for whatever reason, so hiring an outside contractor to “lead” the project was a very bad idea.  According to GlassDoor.com, Maximus sounds like a terrible place to work, and I’m betting the executives are lining their own pockets with the government’s money.

In my 21 years of software engineering, I have yet to see such a business arrangement yield what I would consider a successful project.

The correct way to approach this would be to hire a team of core engineers at above market rate, and periodically hire consultants as peer reviewers of the system and as temporary staff.   For a project under such a tight deadline, they probably should have had peer-reviewers working full-time on the project.  For the budget they were given, they should’ve and could’ve certainly afforded it.

Budget?  What’s that?…  Let’s talk about…

2) Budget, Incentive,  and Opacity

Another problem with these kinds of contracting relationships is that the middle-men, typically the upper-management, siphon off huge chunks of the budget.   When I was 19, I worked as a consultant.  The company I worked for sold my services to a client with lines like “This kid is the best”.

Sales people are rarely truly honest about their company’s capabilities; they just want a sale.  “Our engineers are the best,” they’ll say.  “We already have all the knowledge and expertise to do exactly what you’re asking.”  It is in the best interests of those companies to buy engineers at low prices and sell them at high prices.

No.   No I was not “the best”.  I was not even the best 19-year-old programmer back then.  That company paid me around $17/hr and sold my services at $80/hr.  The money they pocketed went to pay for their fancy office in a glass high-rise and to pay for the gas in their CEO’s private jet.

When you do a business-to-business deal like this, the business typically sells you the time of the cheapest, youngest, most slave-driven and inexperienced engineers they can find at the price of a veteran engineer.  Then they take the profits and use them to buy their corporate jets and expensive cars.  Instead, just hire a veteran engineer or 3 or 10 and eliminate the expensive CEO salaries.

To make things worse, the consulting group, Maximus, hired 3-4 subcontractors to do much of the actual work, adding more potential for miscommunication, more cost overhead, and more vacations for overpaid executives.  So now you have one degree of separation between the government and their lead contractor, who turns around and doesn’t actually do the work they were contracted to do, instead farming it out to subcontractors, incurring additional overhead and opacity.  This whole thing was a recipe for disaster from the start.  Building a software product is not like building a house.  You don’t pound some nails into some pieces of wood and call it a day.  The timelines are less predictable, the flaws are sometimes difficult to hunt down, and the quality is difficult to ensure.

“B.b.b.but I can’t seem to lure veteran talent to my team, so how am I to hire engineers directly?! What am I supposed to do?”

Money, Stupid.  Money talks.  Give the money to your engineers, not your management.  Money, pension, family and financial security.  Offer those and the talent will come.  The MNsure project had a budget of over $22 million!  Paying 10 veteran engineers super handsome salaries would cost you $2 million… tops.  Where did the other $20 million go?  Good question.

56,000 people used the MNsure website to sign up for health insurance last year.  Let’s assume, worst case scenario, that all of them signed up on the same day (which was not the case).  Then let’s assume that it takes roughly 50 database queries (of their core database) to navigate each user from the start of the application process to the finish.  In this scenario you’re dealing with 2.8 million database queries in 1 day (about 32 per second on average) on an indexed table of just 56,000 users.   The seek time of finding a user in that table using a b-tree grows with the logarithm of the table size: log base 2 of 56,000 is about 16, so a lookup costs roughly 16 comparisons, and a real b-tree with wide nodes only has to visit 2 or 3 nodes to get there.  Most of the operations upon this table will be read-only, with writes only needing to occur when new users are added, so you could potentially even mirror this user table to multiple databases for a speed boost.  But with just 56,000 records, any average smartphone could handle that table completely in RAM.
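Here is that back-of-the-envelope math as a small Python sketch.  The 56,000 users and 50 queries per user are the guesses from above.  The fanout of 100 keys per b-tree node is purely my own assumption for illustration; real databases vary, and I have no idea what MNsure actually runs.

```python
USERS = 56_000            # sign-ups cited in the text
QUERIES_PER_USER = 50     # rough guess from the text
SECONDS_PER_DAY = 24 * 60 * 60


def total_queries(users: int, queries_per_user: int) -> int:
    """Total core-database queries if every user applies on the same day."""
    return users * queries_per_user


def binary_search_steps(rows: int) -> int:
    """Worst-case comparisons for a binary search over `rows` sorted keys.

    This is floor(log2(rows)) + 1, which is exactly rows.bit_length().
    """
    return rows.bit_length()


def btree_levels(rows: int, fanout: int) -> int:
    """Approximate b-tree height, assuming every node holds `fanout` keys."""
    levels, capacity = 1, fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels


queries = total_queries(USERS, QUERIES_PER_USER)
print(queries)                               # 2800000
print(round(queries / SECONDS_PER_DAY, 1))   # 32.4 queries per second
print(binary_search_steps(USERS))            # 16
print(btree_levels(USERS, fanout=100))       # 3 (fanout is an assumption)
```

Roughly 32 queries per second, each one a handful of comparisons, is nothing for a modern database server.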

The database tracks dependents of the applicant, so the dependents table may hold 5X as many records as the primary user table.  In this case we’re talking about worst-case 280,000 records (a lookup of about 18 comparisons).  Assuming each record is roughly 2000 bytes wide (generous), then we’re talking about 560 megabytes for that entire table and 112 megabytes for the user table.
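The sizing is easy to check yourself.  This is a minimal sketch using the same guesses as the text (56,000 users, 5 dependents per user, 2000 bytes per record); none of these are real MNsure figures, and the function name is just something I made up.

```python
USERS = 56_000
DEPENDENTS_PER_USER = 5    # worst-case guess from the text
BYTES_PER_RECORD = 2_000   # generous guess from the text


def table_megabytes(rows: int, bytes_per_row: int) -> float:
    """Raw table size in megabytes (10**6 bytes), ignoring index overhead."""
    return rows * bytes_per_row / 1_000_000


dependent_rows = USERS * DEPENDENTS_PER_USER
user_mb = table_megabytes(USERS, BYTES_PER_RECORD)
dependent_mb = table_megabytes(dependent_rows, BYTES_PER_RECORD)

print(dependent_rows)          # 280000
print(user_mb)                 # 112.0
print(dependent_mb)            # 560.0
print(user_mb + dependent_mb)  # 672.0
```

Indexes and row overhead would add to that, but even doubling it leaves you with something that fits comfortably in memory.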

This database, seriously guys, is tiny.  It could fit in the RAM on most smartphones, not that you’d want to host your website on a cell phone.   I’m sure they spent $100,000 on some serious database servers, or decided to pay a bunch of money to Amazon to host their shit on the elastic cloud.  Regardless, load on the database (which is the only place data really needs to get stored) should be pretty negligible compared to what databases are capable of, and their overpriced database servers should be virtually idle.

The web servers will do more work than the databases, but the web-tier is generally designed to scale out almost indefinitely.  Since that layer does not need to fetch or store any variable/user data on its own disk drives (instead relying on a separate database server to supply that data), the web-tier can consist of as many computers as you can throw at it.  All you basically need is a load-balancer on the front-end to distribute the load to the least-busy web server at the moment a request comes in and you’re golden.

The MNsure web pages are a bit bloaty.  Maybe 200K each.  If it takes 50 clicks to get through an application process (generous) then each user would consume 10MB, or 80 megabits, of bandwidth.  If 56,000 people signed up on the same day, the servers would dish out 4,480,000 megabits of data.  Not a small number.  Assuming they had a modest business internet connection at 100 megabits/sec, it would take 12.4 hours to serve that data to customers, clearly not good enough for peak times.  However, put a bit more money into bandwidth and they’d be fine.  You can probably get 2000 megabits/sec of bandwidth for $10,000/mo (enough to serve all customers for the year within a 40-minute window).  And since MNsure is only hit hard very briefly during the peak times of the year, if they hosted all or part of their web-tier on the cloud it would probably only cost them $20,000 a year in total bandwidth (worst case… not including instance pricing)…. so again… where’s this budget going?
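For anyone who wants to poke at the bandwidth numbers, here is the same arithmetic as a short Python sketch.  The page size, click count, and link speeds are the guesses from the text, not measurements, and I am treating 200K as 200 kilobytes and 1 MB as 1000 KB to keep the math round.

```python
PAGE_KB = 200     # guessed page weight from the text
CLICKS = 50       # guessed pages per application from the text
USERS = 56_000


def total_megabits(users: int, page_kb: int, clicks: int) -> float:
    """Megabits served if every user walks through every page once."""
    megabytes_per_user = page_kb * clicks / 1000   # 10 MB per user
    return users * megabytes_per_user * 8          # 8 bits per byte


def transfer_hours(megabits: float, link_mbps: float) -> float:
    """Hours needed to push `megabits` through a fully saturated link."""
    return megabits / link_mbps / 3600


total = total_megabits(USERS, PAGE_KB, CLICKS)
print(total)                                      # 4480000.0
print(round(transfer_hours(total, 100), 1))       # 12.4 hours at 100 Mbit/s
print(round(transfer_hours(total, 2000) * 60))    # 37 minutes at 2000 Mbit/s
```

This assumes the link runs flat out the whole time, which never happens in practice, so treat these as lower bounds.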

I’m sure they’d put about $200,000 into workstations for their call centers.  Pay $30,000/mo in office rent.  The number of employees required in the call centers would be drastically reduced if the system worked as it should, so you get a big cost savings there.  So in reality, the cost of rolling this thing out should be way, way, way less than $22 million.  Certainly the cost would be in the millions simply due to all the 3rd party integration involved, administration costs of reaching out to and signing up insurance companies, training, developer support, communication and documentation overhead.  But not $22 million, sorry.  Due diligence, people.

4) Design

From looking at the MNsure website, it is clear that the project managers just didn’t know jack about what it was that they were building, nor what they wanted to build.   The evidence: the cheesy use of Internet Explorer transitions, a totally incomplete user experience, buttons that do literally nothing but bring up blank screens, a complete lack of quality assurance testing for basic failures (so I can imagine that they probably don’t have load-test engineers), an application flow that has seemingly dozens of ways for users to get painted into a corner and locked out of the system permanently, basic API failures with integrated vendors, inability to handle traffic… I could go on and on.   The MNsure website looks like the work of amateurs, and functions like something built by the completely incompetent.

Do I blame the engineers for this?  Not entirely.  More than anything, I blame management for putting engineers in positions where they don’t belong.   It is clear that there are serious flaws in User Experience Design, Database Design, Protocols, and Integration.  Any time you’re going to build a system that has to handle serious load, you should only put people on the task who truly understand how systems behave under load, right down to the TCP/IP level.

But from a database design level, we’re really talking about a very simple core database here.  All the site really has to do is manage relationships between Vendors, Plans, and Citizens.  External systems would be required to report on tax-credit eligibility, citizenship, dependent eligibility, and payment processing… but at its core, MNsure’s only excuse for being shoddy is bad design and bad implementation.
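To show how small that core really is, here is a minimal sketch of it using Python’s built-in sqlite3 module.  Every table and column name here is my own invention for illustration, as is the `enrollments` join table; I have never seen the real MNsure schema, and a real one would carry far more fields plus hooks to the external systems.

```python
import sqlite3

# Hypothetical core schema: vendors sell plans, citizens enroll in plans.
SCHEMA = """
CREATE TABLE vendors (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE plans (
    id        INTEGER PRIMARY KEY,
    vendor_id INTEGER NOT NULL REFERENCES vendors(id),
    name      TEXT NOT NULL
);
CREATE TABLE citizens (
    id        INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL
);
CREATE TABLE enrollments (
    citizen_id INTEGER NOT NULL REFERENCES citizens(id),
    plan_id    INTEGER NOT NULL REFERENCES plans(id),
    PRIMARY KEY (citizen_id, plan_id)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one made-up row per table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO vendors (id, name) VALUES (1, 'Example Insurer')")
    conn.execute(
        "INSERT INTO plans (id, vendor_id, name) VALUES (1, 1, 'Example Silver Plan')"
    )
    conn.execute("INSERT INTO citizens (id, full_name) VALUES (1, 'Jane Doe')")
    conn.execute("INSERT INTO enrollments (citizen_id, plan_id) VALUES (1, 1)")
    conn.commit()
    return conn


def plans_for(conn: sqlite3.Connection, citizen_id: int) -> list:
    """Return (vendor name, plan name) pairs for one citizen."""
    return conn.execute(
        """
        SELECT v.name, p.name
        FROM enrollments AS e
        JOIN plans AS p ON p.id = e.plan_id
        JOIN vendors AS v ON v.id = p.vendor_id
        WHERE e.citizen_id = ?
        """,
        (citizen_id,),
    ).fetchall()


conn = build_demo_db()
print(plans_for(conn, 1))  # [('Example Insurer', 'Example Silver Plan')]
```

Four tables and one join.  The hard parts of a project like this are the integrations and the workflow around this core, not the core itself.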

5) Timeline

The MNsure website was built with a deadline pre-determined.  Not only was the deadline predetermined, but it required external vendors to meet their deadlines so as not to hold up MNsure’s own progress.  Many of the early bugs in the MNsure system were blamed on failures of the federal government’s systems.  Whenever 3rd parties are expected to cooperate on engineering software, the competing interests and timelines of those parties can bring all those projects to a screeching halt, and the executives are generally too busy vacationing in the Cayman Islands to give a crap.  The MNsure site involved integration with several outside vendors, including the Federal government.

But, the thing is, the site has been operational for 3+ years now and has had plenty of quiet times during which it could have been brought offline and improved.  That just hasn’t happened.

Anyway… I’m just going to cut off this article here because I’m getting hungry for a burrito.  But I’ll close with some favorite quotes from lunchtime with my old coworkers from Control Data:

“It takes 90% of the time to do 90% of the work, but it takes the other 90% of the time to do the remaining 10%”

This quote is totally relevant to the real world, as software engineering is virtually never a linear process.  Requirements change in mid-cycle, designs change, plans change, and testing uncovers issues that sometimes take weeks to fix.

“Fast, Reliable, and Easy to use — pick two.”

This quote is great because it illustrates that opposing interests in creating software sometimes clash and hinder each other.

Making software perform well often requires investing time in complex algorithms, which in turn can make the software unreliable.  The complexities of making the software fast may also cut into the time available for making a good user experience.

Making software reliable often requires taking the “safe” route.  Safe code is generally not fast.  Sticking to methods of implementation that are trusted and trustworthy will increase reliability, but those methods may be literally 1000s of times slower than optimal.  And writing code that is both fast and reliable will often take all of your time and focus, with little left for making pretty user interfaces.

Lay project managers often focus intensely on the user experience because it is the only thing tangible to them.  It isn’t uncommon to have a guy seemingly over-focused on what buttons should go where… words and fonts and colors and menu bars and bling.  If you want a great user experience on a system that is reliable, you’re probably going to have a slow-ass system, and if you want it fast as well, something else has to give.
