Krista's Coding Corner

08.07.2012

Indexing, part 3

Sorry, I’m fixated on this, and the previous part still didn’t answer how to actually do the indexing... Because there are many different kinds of databases, I can’t give you the ultimate and only way to do it, but I can try to clarify what I did with the example presented in the first part. It is a really simple database, but there are some basics you should know.

I had a lot of data in CSV format, and it was a bit messy.

1. The data was cleaned up. The computer read it and picked out only the necessary parts, since there were also some overlapping data entries and so on. Know your data, know what to clean up.
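Roughly something like this Python sketch, for example (the file name and the column names "id", "timestamp" and "value" are just placeholders, not my real ones):

    import csv

    # Clean-up sketch: read a messy CSV, skip duplicate rows and keep only
    # the columns that are actually needed.
    seen = set()
    cleaned = []
    with open("raw_data.csv", newline="") as f:
        for row in csv.DictReader(f):
            key = (row["id"], row["timestamp"])   # whatever identifies a duplicate
            if key in seen:
                continue
            seen.add(key)
            cleaned.append({"timestamp": row["timestamp"], "value": row["value"]})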

2. The data had one hugely important feature that distinguished parts of it from others. Think of it as the colour from the previous part: I had data where some data points (== balls) were yellow, some red, and so on. The data was in many files, but one file always contained only one colour, like yellow. So it was already sorted quite well. Of course, you can do this kind of sorting yourself too; I just didn’t have to.
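If your data isn’t pre-split like mine was, grouping it yourself is straightforward. A small sketch (the "colour" key is just a placeholder for whatever your feature is called):

    from collections import defaultdict

    # Bucket rows by a feature value, so each bucket can become its own block/file.
    def group_by_feature(rows, feature="colour"):
        groups = defaultdict(list)
        for row in rows:
            groups[row[feature]].append(row)
        return groups   # e.g. {"yellow": [...], "red": [...]}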

3. Within each file the data was then also sorted by time stamps (time was a big deal with my data), from the earliest to the latest. Computers deal better with this kind of order than with one going from the latest to the earliest, because that is how we usually think about time -> systems and functions are generally built to handle time the same way.
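The sort itself is a one-liner. A tiny sketch, assuming each row is a dictionary with a "timestamp" string in some fixed format (both names are placeholders):

    from datetime import datetime

    # Sort rows from the earliest time stamp to the latest.
    def sort_by_time(rows):
        return sorted(rows, key=lambda r: datetime.strptime(r["timestamp"], "%Y-%m-%d %H:%M:%S"))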

I was also quite lucky, as each of my data files always covered just one specific time slot. So I didn’t have to divide the data into time slots myself, because the files had already done it. The time slots weren’t all the same length, because I simply used the existing file-based intervals; if this had been done more elegantly, they would have been.

4. The files were converted into JSON format (== I now had many JSON files) because JSON is faster and easier to work with than CSV. I used JSON because it was easy: it is one of the simplest ways to store data independently of the programming language used. You might well use some other data format :)
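In Python the conversion can be as simple as this (the file names are made up):

    import csv
    import json

    # Convert one cleaned CSV file into a JSON file holding a list of records.
    with open("yellow_2012-07.csv", newline="") as src:
        rows = list(csv.DictReader(src))

    with open("yellow_2012-07.json", "w") as dst:
        json.dump(rows, dst)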

5. The JSON files were then compressed with DEFLATE (using zlib). This way the data didn’t take up so much space, and because of the smaller size it was faster to read from disk.
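With Python’s zlib module this looks roughly like the following (the file name is a placeholder again):

    import json
    import zlib

    # Compress the JSON text with DEFLATE via zlib: the result is a smaller
    # bytes blob on disk, and decompressing it later is cheap.
    with open("yellow_2012-07.json") as f:
        json_text = f.read()

    compressed = zlib.compress(json_text.encode("utf-8"))

    # Reading it back is just the reverse:
    records = json.loads(zlib.decompress(compressed).decode("utf-8"))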

6. For every file, a 64-bit hash was calculated from the specific data (why 64-bit? I just felt like it...). Because of this, I can now just look at the hash and know what is inside the feature-sorted block, without having to go through all the data.
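There are many ways to make such a key; this sketch is one possibility, not necessarily the exact hash I used. It hashes the feature value (the "colour" of the block) and keeps the first 8 bytes of the digest as an unsigned 64-bit integer:

    import hashlib
    import struct

    # Turn a feature value into a 64-bit key; the same colour always gives the same key.
    def block_key(feature_value):
        digest = hashlib.sha1(feature_value.encode("utf-8")).digest()
        return struct.unpack(">Q", digest[:8])[0]

    key = block_key("yellow")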

The hash is like the headline of a book chapter: read it and you should know what the chapter is about. Here the hash tells me which colour the block contains. It encodes only this one piece of information because I wasn’t sure whether I would later need other key values, and baking them into the hash would have meant redoing everything once something was missing. Keeping it to one value meant everything else could be added later with indexing, which is simple. This was a design decision.

7. After getting the “headlines” for all the data blocks, every block was stored to disk in the following order, JSON file after JSON file, so that afterwards there was just one file:

the 64-bit hash key; the 64-bit size of the compressed JSON block; the compressed JSON block; the next 64-bit hash; the next 64-bit size of the compressed JSON block...
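A sketch of writing that layout (the big-endian byte order in ">QQ" is my choice here, not a requirement):

    import struct

    # Write the blocks into one file: an 8-byte hash key, an 8-byte size of the
    # compressed block, then the block bytes, repeated for every block.
    def write_blocks(path, blocks):
        """blocks: an iterable of (key, compressed_bytes) pairs."""
        with open(path, "wb") as out:
            for key, compressed in blocks:
                out.write(struct.pack(">QQ", key, len(compressed)))
                out.write(compressed)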

With this I can just read the hash and see if it is something I’m looking for. If not, I read the size of the compressed block so I can jump straight to the next hash, because I then know where it is. (See the next picture.)
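And the reading/skipping part as a sketch, using the same layout as above:

    import struct

    # Scan the packed file: read each 16-byte header, then either read the block
    # (if the key matches what we want) or seek past it to the next header.
    def find_blocks(path, wanted_key):
        results = []
        with open(path, "rb") as f:
            while True:
                header = f.read(16)
                if len(header) < 16:
                    break
                key, size = struct.unpack(">QQ", header)
                if key == wanted_key:
                    results.append(f.read(size))
                else:
                    f.seek(size, 1)   # 1 == seek relative to the current position
        return results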

8. Now it was time for the first real indexing: the computer made a list (== a file / an index) that stored where every hash was located == how many bytes from the beginning of the file it was. It’s like in some board games: you know you need to throw a 5 with the dice because it will take you to a specific place on the board.
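A sketch of building and saving that first index (storing it as JSON is just my assumption; any small format works):

    import json
    import struct

    # For every block, remember its hash key and the byte offset where its
    # header starts, then store the list as a small index file.
    def build_offset_index(data_path, index_path):
        index = []
        with open(data_path, "rb") as f:
            while True:
                offset = f.tell()
                header = f.read(16)
                if len(header) < 16:
                    break
                key, size = struct.unpack(">QQ", header)
                index.append({"key": key, "offset": offset})
                f.seek(size, 1)
        with open(index_path, "w") as out:
            json.dump(index, out)
        return index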

Now only this list had to be searched, not all the data or all the hashes in the data.

9. Time was an important thing with my data, but it still wasn’t really covered that well. So another index list was made to store information about which time slots were where. The list had the hash key, the place in the data (the byte the specific block started from), and the block’s earliest and latest dates. The dates were quite easy to get, as it was only necessary to unpack a little data from the beginning and the end of each data block.
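A sketch of this second, time-aware index. For simplicity it decompresses each whole block and takes the first and last record (I only needed to peek at the ends); it relies on the records being stored oldest-first, as in step 3, and on a "timestamp" field, which is a placeholder name:

    import json
    import struct
    import zlib

    # For every block, store the key, the byte offset, and the earliest and
    # latest time stamps found inside the block.
    def build_time_index(data_path):
        index = []
        with open(data_path, "rb") as f:
            while True:
                offset = f.tell()
                header = f.read(16)
                if len(header) < 16:
                    break
                key, size = struct.unpack(">QQ", header)
                records = json.loads(zlib.decompress(f.read(size)))
                index.append({
                    "key": key,
                    "offset": offset,
                    "earliest": records[0]["timestamp"],
                    "latest": records[-1]["timestamp"],
                })
        return index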

Now it was possible to get data for specific time slots and specific features without going through all the data. You can make more of these indexes if needed, of course :)


Was this the best way to do it? No idea, probably not, but it worked well enough for me. And even though there seem to be many steps, they were all quite simple, so don’t worry. You just need to do them one by one and everything will work out. Of course, always think about what you need and what your data is. This worked in this simple case, but you might need to find more tricks and more effective ways to deal with your data. Just start searching! :)

PS. It was faster to read the compressed data from disk and decompress it than to just read the non-compressed data.
