Krista's Coding Corner

02.07.2012

Indexing, part 1

Databases - they are everywhere. If you store anything, you already have some sort of a database. The problem with databases / storages is that the more information / stuff you have stored, the harder it gets to find what you need.

That’s why libraries have all sorts of systems to make information seeking easier. Books (+ music, movies, magazines…) are -at least here- first divided by whether they can be borrowed or only read on site. Then they are sorted by genre and narrower categories, and finally by author and title. All this because otherwise finding anything and keeping the books in order would be impossible. Imagine yourself in a huge room filled with piles and piles of books, and you need to find Sun Tzu’s The Art of War. Not an easy task.

The thing is that someone has, at least at some point, read the book that lies on the shelf -if no one else, then the writer :) If the book is anything other than a novel, it probably has an index / table of contents. This is a great thing, especially if the index is good and useful. You never need to read the whole book to find the information you are interested in, which saves a lot of time (kiddos, always read your books through, you might just learn something useful even though it can seem useless at first! ;) ).

You can create this sort of index for any data. But do it wisely: if you index the wrong things, it won’t help you. I have one cookbook like this: if I try to find a recipe for cookies, I need to already know that there are, for example, "Aunt Hannah's cookies". Those can of course be found under H (where at least I won't be looking for cookies).
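Just to make the idea concrete, here is a tiny Python sketch (the recipes and keywords are invented, only the principle matters): if you only index by title, you still need to remember the title, but indexing by the word you actually search with makes the lookup trivial.

    # Toy example: which key you index by decides what you can find quickly.
    recipes = {
        "Aunt Hannah's cookies": "flour, butter, sugar ...",
        "Grandma's carrot cake": "carrots, flour, eggs ...",
    }

    # Index by title: useless unless you already remember "Aunt Hannah".
    by_title = dict(recipes)

    # Index by the word you would actually search with:
    by_keyword = {}
    for title in recipes:
        keyword = title.split()[-1]      # "cookies", "cake" (crude, but enough here)
        by_keyword.setdefault(keyword, []).append(title)

    print(by_keyword["cookies"])         # -> ["Aunt Hannah's cookies"]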

Why index? If you are not yet convinced, here is a real-life example from a case I was dealing with:

I had lots of data, and it would have been impossible for a human to go through it all -and actually get something out of the data points. As I wanted to calculate correlations between the different datasets the data contained, I needed to make my computer do the job. Everything was just fine, but the calculation took a LOT of time (we are talking about days…), and I was getting more and more data into my database over time, which made the whole thing even slower. It was unbearable.

I’m not that good at math, so there was nothing I could do about the correlation calculation itself. My only option to speed up the job was therefore to make my database faster, because I knew / guessed that reading the data took time. And the slowdown couldn’t be in the calculation alone: calculating the same things should always take the same amount of time, yet everything became slower and slower as more data came in. So the growing part had to be the data reading -without an index the database has to go through the whole pile to find the rows you ask for, and that pile kept growing.

So a new database was built, this time with indexing. Where reading one dataset from the database used to take about 14 minutes, it now took only 220 milliseconds (0.22 seconds). Huge difference, right? And it was achieved with just indexing (+ compression, to be honest).
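I can’t show the real setup, but here is a rough sketch with Python’s built-in sqlite3 of what adding an index looks like in practice (the table, columns and sizes are made up): the same query is run before and after CREATE INDEX, and with the index the database can jump straight to the matching rows instead of scanning the whole table.

    import sqlite3, time

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (dataset_id INTEGER, value REAL)")
    conn.executemany(
        "INSERT INTO measurements VALUES (?, ?)",
        ((i % 1000, float(i)) for i in range(500_000)),
    )

    def read_dataset(dataset_id):
        # Time how long it takes to pull one dataset out of the table.
        start = time.perf_counter()
        rows = conn.execute(
            "SELECT value FROM measurements WHERE dataset_id = ?",
            (dataset_id,),
        ).fetchall()
        return len(rows), time.perf_counter() - start

    print("without index:", read_dataset(42))

    # The index is what lets the database skip the full scan.
    conn.execute("CREATE INDEX idx_dataset ON measurements (dataset_id)")
    print("with index:   ", read_dataset(42))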

This post will get way too long if I don’t split it, so how indexing can and should be done is the topic of the next post.


PS. Correlation doesn’t mean causality! For example: two things can be caused by the same event -> the two will correlate with each other even though they may have nothing else in common than the cause. Like ice-cream sales and people drowning (this happens here! :P). They correlate because both have a huge increase in the summer, but they have nothing to do with each other. They are both just caused by the fact that in summer it is warmer, so it is more enjoyable to eat ice-cream, and the waters aren’t frozen -> people can swim and go sailing more easily, which puts more people in danger of drowning.
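If you want to play with the idea, here is a toy Python snippet (all the numbers are invented) where both series are driven by the same fake temperature cycle, and they still come out strongly correlated even though neither causes the other:

    import random
    from statistics import correlation   # Python 3.10+

    random.seed(0)
    # Fake monthly temperatures over four years: warm mid-year, cold otherwise.
    temps = [5 + 20 * (1 - abs((m % 12) - 6) / 6) for m in range(48)]
    ice_cream_sales = [t * 10 + random.gauss(0, 5) for t in temps]
    drownings = [t * 0.3 + random.gauss(0, 1) for t in temps]

    # Strong correlation, yet temperature is the common cause of both.
    print(round(correlation(ice_cream_sales, drownings), 2))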


PPS. I'll try to write the next post today but I might not have the time as I have to spend some quality time with my bunnies. <3 We are going outdoors and you can just imagine the amount of hilarity it causes when someone is “taking the bunnies for a walk”.
