Posts Tagged ‘database’

bookmooch illustration
[Illustration credit Andrice Arp, courtesy of BookMooch.com]

Regarding my previous post about Twitter’s database architecture and scaling, here’s a post about bookmooch.com’s scaling troubles (bookmooch is a book swapping site).

Bookmooch had a search engine for books, using a single table that indexed each word in a book’s title/author/etc. This system was very efficient for searching (one query per word). The problem was that adding new books to the index was very costly (the more popular the word was, the longer it took to update). It was up to the point where the disk writes took as much as 20 seconds out of their alloted minute.

In short, the solution was to have two tables for the index instead of one. One table was optimized for queries (faster searches), while the other for updates (adding new books quickly). These tables would be synchronized periodically.

IMHO, the guy in charge did nothing wrong with the initial design of the search engine. There are certain characteristics of a system that are hard to predict and even then, one needs a lot of experience to correctly optimize things in advance (like the problem in the second paragraph). As in the Twitter story, profiling saved the day, but don’t forget to monitor your logs!

Read Full Post »

Project: RSS

My first project, which I will be starting in about a week, will be a web-based RSS/ATOM feed aggregator with the following properties:

  1. It will store a complete archive of every item it sees.
  2. Queries of the first ~50 items of each feed will be fast and take constant time.
  3. The interface will be responsive as possible, despite server/communication latency.

The most important question thus far is whether to use a fully-fledged database with server (mysql, postgres, etc.), or to use a file based database (Berkeley db, etc.). The crucial factor is the database’s ability to achieve both properties 1 & 2 above. In other words, how well does each approach scale?

I believe that the second property could be achieved by using a table of all feed items sorted by their dates and limiting the first query to the first 20 items, fetching the next 20 items whenever the user reaches the end of the list (a la Google Reader).

The third property may be implemented using asynchronous commands to the server. Since the javascript code will be mostly event driven, I don’t think it’d be a problem making it perform several simultaneous actions (server requests) at once, provided they don’t conflict (e.g., access the same data structures).
Another way is to use a queue of commands waiting to be issued to the server. This has the added benefit of being able to reissue any failed commands at a later time, and being able to work offline.

Read Full Post »

Twitter’s architecture

Rolling on Rails: Under the Hood at Twitter – talks about the “scramble to scale” and the importance of thought out database schemas and queries.

In order to scale your service, it is usually not necessary to change the entire framework. Profiling your code and queries can lead to small incremental changes that solve major bottlenecks (caches, query optimization, etc.).

Read Full Post »