Only add indexes that you actually need….
I should probably start referring to database textbooks at this point. Lookups and indexing:
- Hashing is almost definitely quicker, given a decent hash bucket size…
- .. but doesn’t allow next/previous operations very well.
- The hash vales can be easily stored to disk, so it’s not like having to build or rebuild indices.
- Tree structures can also be stored to disk, but you need to distinguish between whether you’re storing a disk tree (pointers are blocks / offsets / etc), or storing a memory tree with has been indirected / translated to a stream datastructure.
- Unfortunately, if doing the latter, where the offsets are an indirection of some sort, if you’re not careful, reading the tree off disk can be more time consuming than building it from scratch: you need another datastructure to store the translation from disk blocks / offsets to memory addresses, which has a similar set of costs to a tree (unless you use another hash table).
- I suppose I could increase the branching factor, but even if you do, you still have the same set of costs to build the tree.
So, as you might guess, the datastructures I’ve got have good amortized cost, but there’s a still a startup cost to be minimised. I also have other techniques to help minimise the startup costs (background loading). This is not critical, and an optimisation to be made after completing basic functionality. Answers on a postcard please.
So, I’ve been working on my latest top secret project. In summary:
- Progress pretty good, but didn’t do much work on it over the hols.
- New bespoke database engine pretty good (definitely fast enough!)
- It does Instagram and filters posts and users quite nicely.
- A bit more work needed to present tweets & twitter data: They’re in the database engine, but more work needed on the display.
- After basic filtering is done, working on more advanced analytics: graphs of who posts / replies to who, volumes of posts, hashtags etc.