8

I noticed wordpress/rails etc do not use foreign keys constraint or cascade deletes features from database. Instead they handle this in PHP/Ruby/scripting level!

I have read this and this. Most arguments against foreign keys constraint talks about performance, multithreading, locking, scalability, etc.

Assuming arguments against foreign keys are valid, my question are:

  1. If foreign keys are bad, why WordPress/Rails/etc uses a sql-server which supports foreign keys? Will they benefit going away from MySQL to a NoSQL kind of server?
  2. On other hand, can applications coded in a way to make use of foreign keys feature without running into issues?
  3. Is noSQL/redis better if we are using database only for storage and managing "relations" at application/script layer?
rahul286
  • 215
  • 1
  • 3
  • 10
  • 5
    People who claim that foreign keys hurt performance have no idea what they are talking about. In any serious DBMS they can actually *improve* performance. And what's even more important: what use is a database that is (supposedly) fast but contains invalid/inconsistent data because it is not validated by the database. Claiming that the application can handle that is plain wrong. Any database that contains important or valuable data **will** be used by more than one application. – a_horse_with_no_name Jun 12 '13 at 07:16
  • 4
    Foreign Keys is one thing and Cascading effects is another. The second, yes, is avoided by many designers, as a delete of a few rows in a table can lead to millions of deletes in other tables. – ypercubeᵀᴹ Jun 12 '13 at 08:16

1 Answers1

14

Lets start with your first link. It says clearly:

Working on large databases with referential integrity just does not perform well.

And that is right. Just you likely have no clue that "large database" is terabyte size with billions of rows in a table. A simple select may cascade into hundreds of millions of related elements to be deleted, and then you have a performance problem.

This is a non-issue for regular small databases such as a wordpress log or most CMS - it turns into a problem if you do something like facebook, or handle financial simulation data. I regularly deal with multi billion row tables and deletes that work in stored procedures outside of transactions in batches of x - because the end delete may easily clean up some hundred million rows.

Will they benefit going away from MySQL to a NoSQL kind of server?

Hardly. They are useful when a professional uses them as appropriate.

On other hand, can applications coded in a way to make use of foreign keys feature without running into issues?

Yes.

Is noSQL/redis better if we are using database only for storage and managing "relations" at application/script layer?

I once did a application review for a technology upgrade in a bank that uses no referential integrity (to get the performance up). Loading the data into SQL Server (which was supposed to replace their aging Adabas installation) it failed with integrity constraint violations. Happens 40% of the historical records were invalid because some * had deleted lookup table values not in use any more (such as old customer classification codes, which got replaced for all active customers, just not the old ones). No referential integrity warning ever came up. The result was some fired people and a problem stuck in workaround and a partially useless data warehouse build on top.

So much for managing relations at application / script layer. Errors WILL happen. Data is valuable, applications change.

Most people complaining about SQL level features would be more advised to read a book about them, and try to understand them, instead of complaining. Sadly a lot of the advice on the internet is written by people that refuse to read even documentation. Always be careful with that. Most advice to NoSql is based strongly on ignorance.

Paul White
  • 67,511
  • 25
  • 368
  • 572
TomTom
  • 4,554
  • 1
  • 14
  • 19
  • 1
    Thanks for detailed answer. For cascade delete, I have assumption in mind. Lets take WordPress example. If you deleted a post, and then all comments for that post in PHP, why not let mysql do that using cascade-delete? Assuming you want to delete same set of data, won't it be faster using cascade delete, rather than running multiple mysql queries from php-code? – rahul286 Jun 12 '13 at 08:37
  • 2
    Exactly. THat is where cascading delete is good. but otoh imagine you delete one post and it takes hours because ther are 500 million comments on it ;) That is where the perforamnce issue arises. Like all SQL related stuff, it requires planning and thinking when scaling up. You may be better of using a stored procedure to slowly delete things more controlled when you hit large data amounts. – TomTom Jun 12 '13 at 09:00
  • 1
    Sorry to bother again but ain't I suppose to delete 500 millions comments anyway via PHP if I delete related post? Actually, somewhere I thought, if something can be done in mysql, its better done in mysql. I am trying to validate this thought in real-world before we start building one of our biggest app till date. For "deleting 500 million comment" example, at application-level we can do background processing. I think same can be done in mysql. Or app can delete post itself in background, so user won't be blocked while cascading delete in progress. – rahul286 Jun 12 '13 at 10:39
  • 2
    Yes, but you enter a delete 1 row sql, and wait for hours - with locks in the database that you did not expect. People then call support and curse on the slow database. It IS developer ignorance, but this is something you have to work around. Not every - even large - database can handle a 500 million row delete without service interruption. And the website doing the delete may block the page for hours until it is committed. Try thinking about such scenarios yourself - that is part of the job of a developer. – TomTom Jun 12 '13 at 10:40
  • As a php-developer, I can delete like 100 rows at a time, update user about progress bar, etc. For long-running operation, where we cannot reduce time its better to add progress bars etc to update user about whats going on behind the scene. I never used cascade-delete in sql but can't we control them (to show progress)? Is it this case that pushes for handling more things at application-level rather than leaving them to database? – rahul286 Jun 12 '13 at 12:14
  • 2
    Pretty exactly yes. Basically - cascades are perfect as long as they do not change the characteristics too much. If 1 row deletion turns into an hour long wait, that is such a case. And let's do not get into issues like database log size management, memory consumption etc. that likely need also be planne d out if a simple delete turns into a multi gigabyte change. – TomTom Jun 12 '13 at 13:34
  • 1
    "A simple select may cascade into hundreds of millions of related elements to be deleted, and then you have a performance problem." True, but if your job is to delete those millions of pieces of information then whatever approach you pick is going to give you a performance problem. Choosing whether or not to use Foreign Keys, or any other integrity protection mechanism, is a tradeoff between "how much speed am I going to get" against "how much risk of bad data and I willing to accept". Foreign Keys are not bad; they are just a tool with pros and cons. – Greenstone Walker Jun 12 '13 at 23:39
  • 1
    The point is that some dude then knows he has to use a stored procedure and it takes time. With a cascading delete that may be something that pops up in a test. In fact it may not b possible to make the delete in one transaction. – TomTom Jun 13 '13 at 04:15
  • I am revisiting this discussion after four years. None of the projects we talked four years ago manages to reach that scale. But I am still curious about this topic. Why SQL has something in its core that can't scale or handled nicely without a lot of efforts (at scale)! Is there a new frontier which takes care of this nicely? Made for billions of records from ground up? – rahul286 Dec 15 '16 at 15:13
  • 2
    @rahul286 The problem here is not that SQL needs more care to be more performant, it's just so that some problems are not able to be faster - by pure mathematical reasons. SQL is all about searching and ordering structured data. To search a heap you have to touch each item until found. If someone created an ordered list (roughly an index in SQL) for you then you will be much faster but still need time. But that's it. There's nothing you can do to drastically lower this "list-lookup time". Maybe small tweaks, but nothing major. SQL is in this state. It's fast, all limits are "by nature itself". – Foo Bar Apr 04 '18 at 16:03
  • 2
    @rahul286 NoSQL style storages are basically what an "index" is in SQL. They can not ensure data integrity by itself - that's why they are faster than SQL. Usually you use NoSQL storages to support relational/SQL storages. Best example is a search engine for a laaarge website. All the real data is in some SQL database. Doing a search on it is slow (especially with LIKE or REGEX). Then you take your data, prepare it for the NoSQL storage and then you are doing the search function on the NoSQL storage instead on the SQL storage. NoSQL "duplicates" the SQL storage in a usefull way to be fast. – Foo Bar Apr 04 '18 at 16:10
  • 2
    @rahul286 If the NoSQL storage crashes - no one cares. You can just rebuild it from the SQL storage (which ensures data integrity and always has the real data). NoSQL are usually used as "3rd party indexes" to support SQL engines where the SQL indexes are not enough. – Foo Bar Apr 04 '18 at 16:13
  • 1
    @FooBar I understand your point well. These days, we ourself use ElasticSearch often with a MySQL based apps (WordPress sites) to improve search. :-) – rahul286 Apr 05 '18 at 04:45