Mongo vs MySQL for WebApps

July 2017 (updated May 2020)

tl;dr: I'm not the biggest fan of MongoDB backing entire webapps. You can use it to solve some specific problems, but I wouldn't recommend storing all your data there (e.g. user profiles etc.). I think MongoDB is okay for prototyping, but when it comes to production systems, it wouldn't be my database of choice.


The following is a general argument against NoSQL in web apps, though the code examples are mostly MongoDB-related. Note that I'm not saying there is absolutely no use case for a NoSQL database. I'm saying your app should be backed by an SQL database that stores things like your registered users and so on. In the unlikely case that you really do have a special use case, use a separate NoSQL database for it, but don't put your relational data in there too! Now let's proceed by crushing some of the most common reasons people think they need MongoDB for their webapps.

 
"NoSQL can be faster than SQL because it doesn't have to do table-joins"

That's true, but there is a well-known and well-understood solution to this problem: caching. Let me elaborate.

Let's say we have to load the following view a gazillion times a day: a message with its author, followed by a list of comments, each with their own author. It's similar to the typical forum view most webapps have in some form.

Then the argument for NoSQL would be that you could store everything denormalized as

{
  content: "bla bla",
  user: {
    name: "Jeremy Barns",
    ...
  },
  comments: [
    {content: "bla", user: {name: "Jonas Blue", ...}},
    {content: "blarb", user: {name: "Jeremy Barns", ...}}
  ]
}

Now in our cool web app we could fetch all the information we needed without any table joins, just by extracting one document from the database. Super fast.

Of course, we've introduced a problem with this: the same user is now stored in multiple locations. So if Jeremy Barns decided to change his name, we'd have to update it on every message he ever made, a true nightmare. That's exactly why there are relational databases: you insert an id instead of the whole user, which solves a lot of consistency problems. Of course you could do the same with Mongo:

{
  content: "bla bla",
  user: "58e5ee14d37470005df49bcb",
  comments: [
    {content: "bla", user: "50e5ee14d36470005cd66waf"},
    {content: "blarb", user: "58e5ee14d37470005df49bcb"}
  ]
}

and then in your application code you query the message and, from the message, the users. But that already looks an awful lot like a relational database with extra steps. And you still haven't solved all the problems. What if you now need a feature like "delete a user and all of their related data"? Easy and consistent in a relational database: a single cascading delete. With MongoDB, you have to chase down every reference and embedded copy yourself, with no guarantee of consistency, as sketched below.
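
To make that concrete, here's a sketch of the two worlds, using the Node mysql2 and mongodb drivers (the table, collection and column names are assumptions for illustration):

import { Connection } from "mysql2/promise";
import { MongoClient, ObjectId } from "mongodb";

// Relational: one foreign key does all the work.
//   CREATE TABLE messages (
//     id      INT PRIMARY KEY AUTO_INCREMENT,
//     user_id INT NOT NULL,
//     content TEXT,
//     FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
//   );
async function deleteUserSql(conn: Connection, userId: number) {
  // The database cascades the delete to all of the user's messages.
  await conn.execute("DELETE FROM users WHERE id = ?", [userId]);
}

// MongoDB: we chase down every place the user appears ourselves.
async function deleteUserMongo(client: MongoClient, userId: ObjectId) {
  const db = client.db("app");
  await db.collection("users").deleteOne({ _id: userId });
  // Their top-level messages:
  await db.collection("messages").deleteMany({ user: userId });
  // Their comments embedded in other people's messages:
  await db.collection("messages").updateMany(
    { "comments.user": userId },
    { $pull: { comments: { user: userId } } }
  );
  // Three statements, nothing transactional spanning them, and no
  // guarantee we didn't forget a fourth place the user is referenced.
}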

So you say "well, we've got to choose between speed and consistency!". Wrong. Let's say we implement the following data model:

interface Message {
  content: string;
  user: User;
  comments: Message[];
}

interface User {
  name: string;
  // ...
}

where we use the same type "Message" for questions and replies. Now, to build our view above one million times, would we have to perform all of these table joins one million times?

message -> user
message -> message -> user (n times, where n = #comments)

First of all, joins on properly indexed columns aren't actually that bad; SQL databases have heavily optimised algorithms to make them fast. Second, we can still add caching if the joins really do become a problem. Now I'm not saying caching is all that simple (cache invalidation is a topic of its own), but at least it's a solution that holds water. Caching here basically means building something like the denormalised document above, as sketched below. Ah, so for that you could use Mongo, as your caching layer! And if something goes wrong with the consistency, you can just clear the cache. You can't do that if the cache is your database!
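
Here's a minimal sketch of that idea. A plain Map stands in for Redis or Memcached, and the parent_id column is one assumed way to map the recursive Message type to tables:

import { Connection, RowDataPacket } from "mysql2/promise";

// Denormalised view cache, keyed by message id.
const viewCache = new Map<string, string>();

async function getMessageView(conn: Connection, messageId: number) {
  const key = `message:${messageId}`;
  const cached = viewCache.get(key);
  if (cached) return JSON.parse(cached); // cache hit: no joins at all

  // Cache miss: do the (indexed) joins once, then store the result.
  const [rows] = await conn.execute<RowDataPacket[]>(
    `SELECT m.id, m.content, u.name AS user_name,
            c.content AS comment_content, cu.name AS comment_user_name
       FROM messages m
       JOIN users u    ON u.id = m.user_id
  LEFT JOIN messages c ON c.parent_id = m.id
  LEFT JOIN users cu   ON cu.id = c.user_id
      WHERE m.id = ?`,
    [messageId]
  );
  viewCache.set(key, JSON.stringify(rows));
  return rows;
}

// Invalidation: if a user renames themselves and precise invalidation is
// too fiddly, just flush everything. The normalised SQL data remains the
// source of truth, so nothing is lost.
function onUserRenamed() {
  viewCache.clear();
}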

And what if suddenly this view isn't important to your business at all anymore, and you'd rather display the data in a completely different shape?

Awww, snap! Those business people changed up everything, but you have a database full of data optimised for another case! Denormalising data means committing to certain joins. So the ideal strategy, giving you both performance and reliability, would be:

  1. Use a relational database to store data normalised
  2. Use caching if there is a bottleneck of joins somewhere

Furthermore: while business requirements will most certainly change, it'll probably take a while for your "new cool web-app" to actually reach a stage where this sort of caching is necessary.

"SQL doesn't scale horizontally"

Not out of the box, but once you reach a certain size you can easily switch to something like Amazon Aurora (compatible with PostgreSQL and MySQL), which gives you up to 15 read replicas. And once you outgrow this (and are probably a billionaire by now), you can still implement caching solutions (see above).
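
In application code, the read/write split is a few lines. The endpoints below are hypothetical; Aurora exposes one writer endpoint and one reader endpoint that load-balances across the replicas:

import { createPool } from "mysql2/promise";

// Writes go to the single writer...
const writer = createPool({
  host: "myapp.cluster-xyz.eu-west-1.rds.amazonaws.com",
  user: "app",
  database: "app",
});

// ...reads go to the replica fleet.
const reader = createPool({
  host: "myapp.cluster-ro-xyz.eu-west-1.rds.amazonaws.com",
  user: "app",
  database: "app",
});

export const write = (sql: string, params?: unknown[]) =>
  writer.execute(sql, params);

export const read = (sql: string, params?: unknown[]) =>
  reader.execute(sql, params);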

And what about writes? Well, in SQL as well as in NoSQL databases you will have one primary instance for writes. Classically, the first thing to do is to scale that instance up vertically. Once this fails, you check where the bottleneck is and implement a solution for that specific problem. For example, if you detect that you're logging too much analytics data to your primary MySQL, you could move the logs to another store (Mongo, Redis, a second MySQL DB...). This is called offloading. Another solution is sharding. While MongoDB has an automatic mechanism for this, with the SQL solutions you'll have to take care of the sharding logic yourself. However, that's not so bad: by the time this happens, you're probably also at a size where defining the sharding logic yourself brings performance gains.
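
Hand-rolled sharding can start out as simple as this sketch (hypothetical shard hosts, with the user id assumed as shard key):

import { createPool, Pool } from "mysql2/promise";

// Four shards; every user's rows live on exactly one of them.
const shards: Pool[] = [0, 1, 2, 3].map((i) =>
  createPool({ host: `mysql-shard-${i}.internal`, user: "app", database: "app" })
);

// You choose the shard key (here: the user id), because you know your
// access patterns better than any automatic mechanism does.
const shardFor = (userId: number): Pool => shards[userId % shards.length];

async function getMessagesOfUser(userId: number) {
  const [rows] = await shardFor(userId).execute(
    "SELECT * FROM messages WHERE user_id = ?",
    [userId]
  );
  return rows;
}
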
"In SQL, adding new columns to large tables can lock / slow down the database"

When you're at the multi-million-rows-per-table level, you can use a solution like Amazon Aurora, which doesn't slow down when columns are added. From the docs:

In Amazon Aurora, you can use fast DDL to execute an ALTER TABLE operation in place, nearly instantaneously. The operation completes without requiring the table to be copied and without having a material impact on other DML statements. Because the operation doesn't consume temporary storage for a table copy, it makes DDL statements practical even for large tables on small instance types.

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.Managing.html

Actually, you can start with Amazon Aurora right away, as it isn't really more expensive than a regular database server.
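
For example, a column addition that would have meant a table rebuild on older MySQL versions becomes a quick in-place operation (table and column names here are made up; at the time of writing, fast DDL covers adding a nullable column, without a default, to the end of a table):

import { Connection } from "mysql2/promise";

// Near-instant on Aurora MySQL even if `messages` has hundreds of
// millions of rows, thanks to fast DDL.
async function addEditedAtColumn(conn: Connection) {
  await conn.query("ALTER TABLE messages ADD COLUMN edited_at DATETIME NULL");
}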

 
"MongoDB is more reliable / available with it's Replica Sets than a classical SQL DB!"

Again, modern SQL systems come to the rescue. Amazon (or any other cloud RDB provider) takes care of availability for you, so you don't need to worry about that.

 
"MongoDB is free and you can run it on cheap linux systems"

The server costs probably don't factor in much next to the love, sweat and time you put into your project. Having said that, a simple MongoDB production setup on AWS runs around $65/month (the smallest MongoDB Atlas cluster), and you won't get much cheaper by setting it up yourself. A simple Aurora production setup is around $40/month.

 
"But I have no schema! I have truly unstructured data!"

Really? You don't have users logging in, connecting with something? Even if so, sooner or later you will have some relations in any meaningful web app. I'm still waiting for the use case where there is literally no structure in a web app's data.

 
"[With Mongo] you can choose what level of consistency you want depending on the value of the data (e.g. faster performance = fire and forget inserts to MongoDB, slower performance = wait til insert has been replicated to multiple nodes before returning)" - source

You probably don't have too much "fire and forget" data. Users interact with your system; that's what generates the data as well as the business value of your system. What data would be so worthless that losing it is okay? A user post? Certainly not. A user changing their privacy settings? Better not.
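
For reference, here's what that choice looks like with the Node mongodb driver (collection names assumed):

import { MongoClient } from "mongodb";

async function writeConcernDemo(client: MongoClient) {
  const db = client.db("app");

  // "Fire and forget": don't even wait for an acknowledgement.
  // Only fine for data you can afford to lose, if you have any.
  await db.collection("logs").insertOne(
    { event: "page_view", at: new Date() },
    { writeConcern: { w: 0 } }
  );

  // What user data actually needs: wait until a majority of the
  // replica set has the write before returning.
  await db.collection("privacy_settings").updateOne(
    { userId: 42 },
    { $set: { profilePublic: false } },
    { writeConcern: { w: "majority" } }
  );
}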

 
"With MongoDB you get full-text search"

While that may be true and will help you get your full-text search started, it's not that hard to set up a dedicated search engine like Lucene or Elasticsearch. We actually chose MongoDB for one project for exactly this reason, but we quickly outgrew its capabilities and switched to Lucene.
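
Getting started really is pleasant: a text index and the $text operator are all you need (collection name assumed; createIndex would normally run once at deploy time, it's shown inline here for brevity):

import { MongoClient } from "mongodb";

async function searchMessages(client: MongoClient, term: string) {
  const messages = client.db("app").collection("messages");
  await messages.createIndex({ content: "text" }); // idempotent
  return messages
    .find({ $text: { $search: term } })
    .project({ content: 1, score: { $meta: "textScore" } })
    .sort({ score: { $meta: "textScore" } })
    .toArray();
}

// The ceiling we hit on that project: no custom analysers and little
// relevance tuning compared to Lucene / Elasticsearch.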

So let's summarise some of the main points we've made:

Advantages of NoSQL / MongoDB:

  1. Reads need no joins when data is stored denormalised
  2. Automatic sharding out of the box
  3. Tunable write concern for data you can afford to lose
  4. Built-in full-text search to get you started

Disadvantages of NoSQL / MongoDB:

  1. Duplicated data must be kept consistent by your application code
  2. Denormalising means committing to certain joins, which hurts when business requirements change
  3. No cascading deletes: "delete a user and all their data" is on you
  4. You almost always have relational data (users!) anyway, and modern SQL offerings cover the scaling arguments

Conclusion

In my opinion, the use cases for a NoSQL DB are quite limited. SQL is powerful and should be leveraged; it doesn't make sense to write application logic when you could let SQL do the job. Denormalisation is not the best idea for your primary database storing user data. And you rarely truly have no schema. So all in all, for most web apps, prefer an SQL database over any NoSQL database.
