Ok, so I make no proclamation to be an expert on the matter, but having quite recently faced the issue of database schema for tagging myself, I wanted to share my thoughts.
So, as part of BlockLand, we’re going to need some way to store data about Articles. You know, the usual: author, timestamp, content, etc. It would also be nice to have a way to tag articles with uh… tags for all of that Web 2.0 nonsense about related posts, categorisation, indexing, etc.
Now, storing author, data, time, content, etc. is fairly simple to do in a database. After all, you just have a table with columns for each piece of information you want to store, and make a row for each article. We’re not talking rocket science here. Ok, so let’s make a column for tags. Hang on, but I want multiple tags. I want different tags on each article. I want to see tags and all the articles they belong to. I can’t just dump that into a column.
Well, I could, in a fairly kludgy approach. I could have a column that holds a list of all tags for the article. That would work fine for finding tags on a particular article, but if ever I wanted to do a search by tag or similar, I would run into phenomenal trouble. It means slowly trawling through every single tag in every single article and working out if the tag I want is there. No, this solution will not do.
The first idea I gave serious consideration to was having a one-to-many relation to a separate Tags table. This table would include a pointer to the relevant article. Finding all the tags for the article simply meant hitting this table and matching up the ID columns. Simple and easy enough. However, I began to realise the downside of this approach- many articles could have tags in common, which means there would be heaps of rows in the table that were almost identical to each other. I liked this solution for a while, because it seemed simple and effective to me, but mainly because I couldn’t think of a better way.
Now perhaps this is what you go to school for, but eventually I read about the proper database normalisation techniques. Apparently, many-to-many relationships work just fine if you put a linker table in the middle. So basically, you’ve got your same-old Articles table, a Tags table that basically says “I am a tag this is my name” and a magical table that goes in the middle that sticks everything together. The middle table just has a reference to article and to tag, nothing more. That way, the Articles table doesn’t have a big pile of smelly old tags rotting in a column, and the Tags table isn’t filled with more of the same.
So, to make sure that the Articles-Tags link works nice and quickly, it needs indexing. Making a primary key automatically makes an index, but this will only work in one direction. So you’ll need to make a second index pointing the other direction so that you can work the link both ways. I feel that I’ve ranted on quite long enough now, so I guess I’ll just deliver this parting blow for anyone who doesn’t understand but actually cares about that last paragraph. This is the single article that most influenced my design choice. I thought it was extremely useful in explaining what to do and why to do it.
Well, that’s all folks lights are up, you had better wake up and exit the theatre.