Krista's Coding Corner

03.07.2012

Indexing, part 2

Imagine that you have 10 boxes that each stores one ball. You have to find all yellow balls that are in the boxes. Isn’t it easier to have boxes that have labels telling you whether there is yellow, red or violet ball? Indexing with groups does pretty much the same thing as labeling. Sure, you need to take a look at the label / index but that isn’t too big of an effort.

Database is normally a list which has columns and rows, a table.

Let’s imagine your database has data about balls. Every data point represents a ball that has a color and a size. In this case we could have a three-column-table: first column for ball ID, second for the color and third for the size. And rows just are there to include more and more data points == balls.

One way to index is to group all similar items. In this case we can make an index where we tell that yellow balls can be found from row numbers 2-6 and 8-10. If you now make a search of yellow balls, you don’t need to go thought the whole list, just the row numbers the index has stored.

How to search efficiently?

Binary search algorithm is the easiest and really commonly used way to go through stuff. You might see it in some shows where one needs to guess amount / price of something and if it isn’t correct, the host will say “real price / amount is higher” (or lower). General rule is that the fastest way to find out the right price –if we don’t count lucky guesses or knowledge- is to always guess the mean of your current interval. Here is an example, where we have first got a range (maybe it is told or you have made some guesses to get it):

Tags / keywords:

We quite often use tags to be able to search stuff. It used to be really popular to add keywords into webpages because search engines were quite bad and they didn’t find you and your page otherwise. In some services they are still used (I just read about an issue with one selling site where searches don’t work because some sellers add unnecessary and invalid keywords / wordlists to their selling items -> your search will generate massive amounts of unwanted and un-aimed results). These words work like index: it is faster and easier to use, and mostly these are used only when a human is doing the search (==typing the word to somewhere) because words come quite naturally to humans. :)

Trie / hash –table

If we still use the ball-example, we could give ball colors some other, easier and less memory-taking names like 1 and 2 (and then some math can be applied). Instead of going through some character mumbo-jumbo (letters are in this case went through individually because a computer can’t see the word, just line of characters), the computer just need to go thought couple of numbers.

(Okay, this is trie-/hash –table made super simple, read more from Wikipedia or somewhere else).

Data structure

When your searches are more complicated like 'find all the balls that are yellow and bigger than diameter 1,4 cm', you need more difficult data structures. Every time you add more data into a simple database (like in the example), it needs to be recreated because you can add data only into end of the table. This means that you always have to re-sort it afterwards (which is very unefficient). But there are thousands of different kind of other data structures, you could start your search for them here.

Gimmicks:

Do you know why it is faster to sort your list and then to do the search than just do the search? Because you processor does many things in secret that you don’t have a clue, those damn sneaky little things! That is: it tries to guess what you will do next and it actually calculates stuff to you beforehand based on its guesses. And the computers are not mind readers -> they may not guess right.

If your list is sorted, the processor can guess more easily what is coming than if there is really mixed up list. This means that the processor will be calculating the right things.

Processor is doing this guessing and calculating beforehand because otherwise it would just stand still. It's not true that processor only does one thing at a time: reading and writing memory takes time, meanwhile the processor just chills. So, they try to make our computers more efficient by these means.

I was able to find this page which tells more about it.


PS. This post is quite shallow (and loooong) but I think you can use it as a starting point :)

PPS. Indexing normally happens so that we have an extra file where indexing tables are found. This means that your program has to support indexing, otherwise indexing won't help you :P

blog comments powered by Disqus