December 8, 2009

New infrastructure of web servers on the boo-box system


When you work at a company like boo-box, you must have a team capable of developing extraordinary technological solutions in a short amount of time.

In the past few months, boo-box has undergone phenomenal growth:


As you can see, ad impressions rose by 85% in the last month alone! To handle this growth, we had to adapt and upgrade our systems.
Before going into further details, let me sum up our strategy: we constantly perform check-ups, use cache as much as possible, and take really good care of our database :)

Going into details

In order to stand out as an online advertisement genius, you have to follow a few basic precepts:

  1. Speed: users are not going to sit around waiting for an ad to load.
  2. Scalability: viral spikes, accelerated growth, etc. If your application is not highly scalable, you should either find another job or be ready to give up on sleep :)
  3. Monitoring: monitor everything! Automate your monitoring check-ups whenever possible.
  4. Procedures: create, and follow, procedures for various situations that may arise.

With this in mind, our Ninjas refactored the heart of the system: the Application Layer.

The goal was to push the limits of our language and platforms (nginx + Ruby + Merb + MySQL) with the most advanced algorithms and topology.

To this end, we changed the logic, synchronicity, and caching of the key parts. Parts of the topology have changed since our last post, “The infrastructure of web servers of the boo-box system”:


We also implemented monitoring methods that gave us greater control over the application.

Server Names and Release Versions
As we explained in our last post about our infrastructure, our servers are named after characters from the anime Dragon Ball Z, such as Korin, Kami, Cell, Raditz, Trunks, Goku, and Gohan, among others.

Lately, we’ve also started to name our releases — this time, after classic films. The first two releases were:

  1. Metropolis, named after the 1927 Fritz Lang sci-fi classic;
  2. Hannibal, named after the 2001 box-office hit.

Cluster Application
This layer comprises all the servers that process our business logic. These are the servers that decide which ads will appear in the shop windows, what happens when a user clicks on an ad, and what to do with the data registered by a new publisher in the system.

This layer underwent the most changes. The refactoring focused on these four essential points:

  1. Logic: we optimized the processing algorithm, making it more efficient for this new phase of the system.
  2. Memcached: we increased our use of memcached by 80%. While always respecting our business logic, we were able to cache large amounts of information, optimizing response time. This change directly affected the application database layer, because it allows the application to query the database considerably less.
  3. Log registry: we made log registration asynchronous. The application layer no longer communicates directly with our MySQL logs; instead, it pushes each log entry into a Beanstalkd queue, which is later consumed by a Ruby program responsible for storing it in the database.
  4. Number of workers: we ran a few benchmarks to find the ideal number of workers. Previously, we used a complex formula that measured the total processing capacity of each server, and the resulting number was very high. Our benchmarks, however, indicated that this number shouldn't be very high: the more workers there are, the more server cores stay busy, and the slower each worker's individual processing becomes. On the other hand, if the number of workers is too low, the load balancer may not find an available worker, even if each one processes rapidly. In our case, the ideal number is between 20 and 30 workers.
  5. We also tried Merb's run_later method, but concluded that Beanstalkd gives us greater management control. Besides, Beanstalkd also offers persistence.
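To illustrate the cache-aside pattern behind point 2, here is a minimal Ruby sketch. The class names and TTL are invented, and an in-memory stand-in replaces a real memcached client (such as the memcache-client gem, whose get/set interface this assumes), so the sketch runs on its own:

```ruby
# Cache-aside helper: look up data in the cache first, and fall back to
# the database (the block) only on a miss. The `cache` object can be any
# store responding to get/set, e.g. a real memcached client.
class AdCache
  TTL = 300 # seconds; hypothetical expiry for ad/product data

  def initialize(cache)
    @cache = cache
  end

  # Returns the cached value for `key`, computing it with the given
  # block (e.g. a MySQL query) only when the cache misses.
  def fetch(key)
    cached = @cache.get(key)
    return cached if cached
    value = yield
    @cache.set(key, value, TTL)
    value
  end
end

# Minimal in-process stand-in so the sketch runs without a memcached server.
class FakeCache
  def initialize; @store = {}; end
  def get(key); @store[key]; end
  def set(key, value, _ttl); @store[key] = value; end
end

cache = AdCache.new(FakeCache.new)
db_hits = 0
lookup = lambda do
  db_hits += 1                      # simulated database query
  { 'title' => 'Sample ad' }
end
2.times { cache.fetch('ad:42', &lookup) }
puts db_hits  # => 1 (the second fetch is served from the cache)
```

The point of the pattern is visible in the counter: repeated fetches of the same key hit the database only once until the entry expires.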
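The asynchronous logging in point 3 can be sketched as a producer/consumer pair. In production the queue would be a Beanstalkd tube (e.g. via the beanstalk-client gem) and the store would be MySQL; here a duck-typed in-memory queue and database keep the sketch self-contained, and all names are illustrative:

```ruby
require 'json'

# Producer side: called from the request path. Serializes the log event
# and enqueues it -- no database round-trip while serving the request.
class LogProducer
  def initialize(queue)
    @queue = queue
  end

  def record(event)
    @queue.put(JSON.generate(event))
  end
end

# Consumer side: runs as a separate Ruby worker, draining the queue
# into the logs database at its own pace.
class LogConsumer
  def initialize(queue, db)
    @queue, @db = queue, db
  end

  def drain
    while (job = @queue.reserve)
      @db.insert(JSON.parse(job))
    end
  end
end

# In-memory stand-ins for the Beanstalkd tube and the MySQL table.
class FakeQueue
  def initialize; @jobs = []; end
  def put(body); @jobs << body; end
  def reserve; @jobs.shift; end
end

class FakeDB
  attr_reader :rows
  def initialize; @rows = []; end
  def insert(row); @rows << row; end
end

queue, db = FakeQueue.new, FakeDB.new
LogProducer.new(queue).record('type' => 'click', 'ad_id' => 42)
LogConsumer.new(queue, db).drain
puts db.rows.first['type']  # => click
```

Because the producer and consumer share nothing but the queue, a slow database insert never delays an ad request, which is exactly the decoupling described above.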

Cluster Log and BI
All requests made on boo-box are registered in our Cluster Log. Shop windows viewed, ads clicked, activity on partner sites; everything is registered.

So as I said at the beginning of the post: take good care of your database! It could deprive you of sleep :)

In this layer, it was important to combine tuning with asynchronous processing:

  1. Tuning: pay close attention to these configuration variables: key_buffer, max_connections, table_cache, thread_concurrency, innodb_buffer_pool_size and, of course, to the index structures on your tables. We also introduced MySQL's new partitioning system, which optimizes the storage structure on disk (MySQL Reference Manual – Chapter 18. Partitioning)
  2. Asynchronous processing: as we explained under the Cluster Application, the application no longer communicates directly with the logs database, but instead with the beanstalk queue. Therefore, one works independently of the other, which greatly improved the response time in our shop windows.
    This optimization also affected our BI, which is responsible for processing all the data received that will be used at a later date (reports, measurements, projections, etc.)
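As a hypothetical illustration of the partitioning mentioned in point 1 (the table and column names are invented, not our actual schema), a log table can be partitioned by month so that old partitions are cheap to drop and range scans touch only the relevant partition:

```sql
-- Hypothetical impression log, RANGE-partitioned by date. The primary
-- key must include the partitioning column in MySQL.
CREATE TABLE impression_log (
  id        BIGINT NOT NULL AUTO_INCREMENT,
  ad_id     INT NOT NULL,
  logged_at DATE NOT NULL,
  PRIMARY KEY (id, logged_at)
)
PARTITION BY RANGE (TO_DAYS(logged_at)) (
  PARTITION p200910 VALUES LESS THAN (TO_DAYS('2009-11-01')),
  PARTITION p200911 VALUES LESS THAN (TO_DAYS('2009-12-01')),
  PARTITION pmax    VALUES LESS THAN MAXVALUE
);
```

With this layout, purging a month of old logs is a metadata-only `ALTER TABLE ... DROP PARTITION` instead of a massive `DELETE`.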

Cluster data source
This layer is responsible for storing specific system information.

With all the upgrades we performed on the other layers (especially the increased use of memcached), there was no need for major changes here. Quite the contrary: we reduced the number of servers in this layer.

Cluster products cache
A large part of the ads displayed in boo-box windows are products offered by partner e-commerce sites. Since product information does not have to be stored for long, we built a temporary cache for this data.
With the changes to the Cluster Application, we were able to considerably drop the amount of servers in this layer.

Remember: every time you use CouchDB or MySQL, it is important to have rapid storage machines to achieve a better I/O performance. Therefore, Cloud Computing may not be a good option for this layer.

Load balancer and Static Files
These servers receive user requests and direct the load to one of the application servers.

On the nginx web server, we used the round-robin algorithm. As we scaled up, it turned out this wasn't the best strategy for us: when one of our workers was slow to respond, it affected all incoming requests. So we switched to the Fair module, which lets nginx send requests to the application server with the lightest processing load at the moment.
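A minimal sketch of that upstream configuration (the server addresses and names are invented; the `fair` directive comes from the third-party nginx upstream-fair module, which must be compiled into nginx):

```nginx
# Pool of Merb application workers. With the upstream-fair module
# compiled in, "fair" replaces the default round-robin: nginx routes
# each request to the backend with the fewest active connections.
upstream merb_workers {
    fair;
    server 10.0.0.11:8000;
    server 10.0.0.12:8000;
}

server {
    listen 80;
    location / {
        proxy_pass http://merb_workers;
    }
}
```

Removing the single `fair;` line falls back to round-robin, which makes it easy to compare the two strategies under load.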

To guarantee the safety of the systems, it’s important to constantly monitor several layers and all the applications involved. However, this requires more than tools. It is essential to install manual and automatic procedures.

Right now, we use tools such as: Munin, Monit, Webalizer, Pingdom and our own tools.

We have hourly, daily, weekly and monthly checks, which generate data based upon which we make important decisions about possible changes to or maintenance of our structure.

It is important to note that automating these procedures and check-ups is the ideal, but shouldn't be a prerequisite. Many technicians and infrastructure managers neglect regular check-ups because they haven't automated them, causing far more problems than they solve. So if necessary, introduce constant manual check-ups. Discipline is your biggest ally.

To build a highly scalable system you need more than technological solutions.

Effective management will make all the difference. Without getting lost in bureaucracy, you must create well-defined work procedures and practices that will allow your team to combine effective long-term solutions (slow implementation) with short-term solutions (rapid implementation), and deal with any problems that might arise from such a complex system.
