Bluemonkey’s blog

September 25, 2009

A brief think about tagging and database normalisation

Filed under: BlockLand — Tags: , , , , , , — bluem937 @ 4:00 pm

Ok, so I make no proclamation to be an expert on the matter, but having quite recently faced the issue of database schema for tagging myself, I wanted to share my thoughts.

So, as part of BlockLand, we’re going to need some way to store data about Articles. You know, the usual: author, timestamp, content, etc. It would also be nice to have a way to tag articles with uh… tags for all of that Web 2.0 nonsense about related posts, categorisation, indexing, etc.

Now, storing author, data, time, content, etc. is fairly simple to do in a database. After all, you just have a table with columns for each piece of information you want to store, and make a row for each article. We’re not talking rocket science here. Ok, so let’s make a column for tags. Hang on, but I want multiple tags. I want different tags on each article. I want to see tags and all the articles they belong to. I can’t just dump that into a column.

Well, I could, in a fairly kludgy approach. I could have a column that holds a list of all tags for the article. That would work fine for finding tags on a particular article, but if ever I wanted to do a search by tag or similar, I would run into phenomenal trouble. It means slowly trawling through every single tag in every single article and working out if the tag I want is there. No, this solution will not do.

The first idea I gave serious consideration to was having a one-to-many relation to a separate Tags table. This table would include a pointer to the relevant article. Finding all the tags for the article simply meant hitting this table and matching up the ID columns. Simple and easy enough. However, I began to realise the downside of this approach- many articles could have tags in common, which means there would be heaps of rows in the table that were almost identical to each other. I liked this solution for a while, because it seemed simple and effective to me, but mainly because I couldn’t think of a better way.

Now perhaps this is what you go to school for, but eventually I read about the proper database normalisation techniques. Apparently, many-to-many relationships work just fine if you put a linker table in the middle. So basically, you’ve got your same-old Articles table, a Tags table that basically says “I am a tag this is my name” and a magical table that goes in the middle that sticks everything together. The middle table just has a reference to article and to tag, nothing more. That way, the Articles table doesn’t have a big pile of smelly old tags rotting in a column, and the Tags table isn’t filled with more of the same.

So, to make sure that the Articles-Tags link works nice and quickly, it needs indexing. Making a primary key automatically makes an index, but this will only work in one direction. So you’ll need to make a second index pointing the other direction so that you can work the link both ways. I feel that I’ve ranted on quite long enough now, so I guess I’ll just deliver this parting blow for anyone who doesn’t understand but actually cares about that last paragraph. This is the single article that most influenced my design choice. I thought it was extremely useful in explaining what to do and why to do it.

Well, that’s all folks lights are up, you had better wake up and exit the theatre.

September 23, 2009

BlockLand

“Hang on, why am I reading this? Surely he hasn’t updated his blog?” Indeed, he has. Shock Horror. Anyway, I didn’t actually emerge from the murky clouds of the Blogosphere just to post some silly self-referential nonsense, so let’s just assume that all the fanfare is dealt with from this point onwards.

Nay, the real reason I started blogging was to tell you about the work I started doing on BlockLand. Those long in the tooth may have remembered my hype about BluemonkeyBlockLand some time ago, but for many reasons that never really got past a fanatical ideas stage. Finally, my development skills and motivation have improved enough that I am once again starting work on the project. I’ve decided to start from scratch rather than trying to sort out the mess from the original attempt.

So, yesterday and today I’ve been sorting out the rear-end of the application. So, after spending a few hours carefully considering storage schema for the database (I might blog about some of the more fascinating areas later) and an ORM to beautifully transform my database into object-oriented glory, it then came time to actually implement said schematics. And that’s when things got interesting.

Ok, so I’ve never actually used SQL before, apart from a few trivial SELECT statements in a Software Development project, so creating a complex database schema without guidance was going to be a challenge. Nevertheless, after a couple of one-on-one one-to-many (ouch, a SQL joke *cringe*) sessions with the Internet, I felt reasonably confident that I could create all the tables, columns, keys, and indexes required. That wasn’t so hard. I then proceeded to grind my head against the wall as I tried to make sense of Stored Procedures. Every time I tried to work something out, it was like I was being gently slapped in the face while someone was trying to compress a block of Smorgy’s jelly into my brain. The really solid, rubbery stuff.

Now, I said I had more motivation to develop BlockLand, but I didn’t mention anything about masochism, so I decided to temporarily jump ship and try a different solution. So, Visual Studio cranked up and it was time to make an Entity Framework. Frankly, I’m very impressed with these things, but I’m sure my readership won’t find then nearly as riveting. Anyway, I got some more work done, but realised I’ll need to make more Stored Procedures anyway. But that’s a job for tomorrow.

Anyway, I’ll try to think of something less computer-y to post about in the next installment of the blog, but meanwhile, just welcome my return.

Blog at WordPress.com.