Archive for the ‘Scalability’ Category

node.js 7-day Retrospective

Thursday, January 19th, 2012

A week ago I was walking the dog, thinking about how to handle a validation routine, when I got sidetracked by a different problem I had run into a few weeks earlier. I’ve been working with Ajax Push for a few things to test some parts of a larger concept.

I’m a big advocate of writing mini-projects to test pieces of an eventual solution. Writing two small projects in APE helped me refine the model of another project I had which is what triggered this project. CodeRetreat was also a very good experience – rewriting the same code six times in the same day. Each time you iterated, your code or methodology was better.

Now I have an idea and need a platform. I know I need Ajax Push and APE wasn’t suitable for my other project. I don’t like JavaScript, plain and simple. node.js uses server-side Javascript and this app would have plenty of client-side Javascript as well. The library I intended to use was socket.io as it supported the feature set I needed.

Within minutes, I had node.js up and running through installing one of their binary distributions for Debian. This turned out to be a mistake as they have an extremely old version packaged, but, it took five days before I ran into a package that required me to upgrade.

node.js is fairly straightforward and I had it serving content shortly after starting it. The first problem I ran into was routing. It seemed cumbersome to define everything and I started running through a number of route packages. I ended up using Express, a lightweight framework that includes routing as part of the framework. Express also wraps the connect package which I had used to handle uploads. Refactored code to use Express and I’m off and running.

Now, I’m serving pages, static files, my stylesheets are loading (with the proper content type) and the site is active. I actually had a problem with some JQuery because the content-type wasn’t being set to text/html for my index page.

Next up, image resizing. I used the gm wrapper around graphicsmagick which worked well and I didn’t look further. The methods used by GM are quite straightforward and I see very little difference in the output quality from it versus imagemagick. The command line interface is a bit more straightforward – not that you need that unless you’re debugging what GM is actually doing. I did run into a few issues with the async handling which required a bit of rework. I still have some corner cases to solve but, I’m shooting for an alpha release.

Redis was straightforward and I needed that for a counter. Again, the async paradigm makes you write code that an OO or functional programmer might find troubling.

What you expect:

io.sockets.on('connection', function (socket) {
  var counter = redis_client.incr('counter');
  socket.emit('stats', { 'counter':counter });
});

What you really mean:

io.sockets.on('connection', function (socket) {
  redis_client.incr('counter', function (err, res) {
    socket.emit('stats', { 'counter':res });
  });
});
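
The difference is easier to see without a Redis server. The sketch below uses a hypothetical stand-in called fakeRedis (an assumption for illustration, not the real redis package) that only mimics a callback-style incr:

```javascript
// Hypothetical stand-in for a Redis client; NOT the real redis package.
// It mimics the callback style: the result arrives later, never as a return value.
var fakeRedis = {
  counters: {},
  incr: function (key, callback) {
    this.counters[key] = (this.counters[key] || 0) + 1;
    var value = this.counters[key];
    // Deliver the result on a later turn of the event loop, as real I/O would.
    setImmediate(function () {
      if (callback) callback(null, value);
    });
    // Nothing useful is returned to the caller.
  }
};

// What you expect: with this stub the return value is undefined, not the counter.
var returned = fakeRedis.incr('counter');
console.log('return value:', returned);

// What you really mean: use the value inside the callback.
fakeRedis.incr('counter', function (err, res) {
  console.log('callback value:', res);
});
```

The first call has already returned before any result exists, which is why the value can only be used from inside the callback.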

Javascript doesn’t support classes, but, there are ways to emulate the behavior you’re after. This is something you learn when working with Sequelize – the ORM I am using for access to MySQL. I really debated whether to use Redis for everything, or, log to MySQL for the alpha. I know in the future I’ll probably migrate to CouchDB or possibly MongoDB so that I can still do sql-like queries to fetch data. Based on the stream-log data I expected to be collecting, I could see running out of RAM for Redis over time. Sequelize allows you to import your models from a model file which cleans up a bit of code. Most of the other ORMs I tried were very cumbersome and required a lot of effort to move models to an external file resulting in a very polluted app.js.
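
One common way to emulate classes is a constructor function with methods on its prototype. This is a minimal sketch of that general pattern, not how Sequelize defines its models; Counter and ResettableCounter are made-up names for illustration:

```javascript
// Constructor function: plays the role of a class.
function Counter(name) {
  this.name = name;
  this.value = 0;
}

// Methods live on the prototype, so all instances share one copy.
Counter.prototype.increment = function () {
  this.value += 1;
  return this.value;
};

// "Inheritance" by chaining prototypes.
function ResettableCounter(name) {
  Counter.call(this, name); // run the parent constructor on this instance
}
ResettableCounter.prototype = Object.create(Counter.prototype);
ResettableCounter.prototype.constructor = ResettableCounter;
ResettableCounter.prototype.reset = function () {
  this.value = 0;
};

var hits = new ResettableCounter('hits');
hits.increment();
hits.increment();
console.log(hits.name, hits.value);
hits.reset();
```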

Now I needed a template language and form library. I initially looked at Jade but wanted something closer to the Python templating languages I usually use. Second on the list was ejs which is fairly powerful. It defines a layout page and imports your other page into a <%- body %> section, but, that’s about as powerful as it gets. There is currently support for partial includes, allowing header templates, etc, but, that is being phased out and should be done through if branches in the layout.ejs file.

As for a form library, I never found anything satisfying. For most frameworks, a good form library is a necessity for me, but, I can hand-code validation routines for this.

Authentication. At this point, I tried to install Everyauth. During installation a traceback is displayed with a fairly cryptic message:

npm ERR! Error: Using '>=' with 1.x.x makes no sense. Don't do it.

Tracking this down, we find a very old packaged version of NPM which refuses to upgrade because node.js is too old. In Debian Sid, node.js version 0.4.12 is packaged, what? 0.7.0-pre1 is the latest recommended. Upgrading node.js to be able to install a newer version of npm allows us to move forward.

Note: before upgrading, make sure you commit all of your development changes. I didn’t and lost about fifteen minutes of code due to a sleepy rm. :)

So, now we’re running a newer version of node.js, npm upgrades painlessly and we’ve installed Everyauth.

Everyauth is, well, every auth you can think of. In reading their example code, it looked like they were doing more than they actually do, but, they wrap a set of routines and hand back a fairly normalized set of values. Within fifteen minutes I had Facebook and Twitter working, but, GoogleHybrid gave me some issues. I opted to switch to Google’s OAuth2, but, that failed in a different place. I’ll have to debug that, fork and do a pull request.

I need to write the backend logic for Everyauth, but, with Sequelize, that should be fairly quick.

Down to the basics

Web site performance is always on my mind. Search Engine Optimization becomes important for this site as well. Javascript built pages are somewhat difficult for Googlebot to follow and we don’t want to lose visibility because of that. However, we want to take advantage of a CDN and using Javascript and dom manipulation will allow us to output a static page that can be cached and use JQuery to modify the page to customize it for a logged in user. The one page that will probably see the heaviest utilization is completely Ajax powered, but, it is a short-lived page and probably wouldn’t be indexed anyhow.

node.js for serving static files

I debated this. node.js does really well for Ajax and long-polling but several articles recommend using something other than node.js for static media. I didn’t find it to be slow, but, other solutions did easily outserve it for static content. Since we’re putting all of our content behind Varnish, the alpha will serve the content to Varnish and Varnish will serve the content. It is possible I’ll change that architecture later.

socket.io

While I’ve just scratched the surface of the possibilities, socket.io is very easy to use and extremely powerful. I haven’t found a browser that had any problems, and, it abstracts everything so I don’t have to worry which method it is using to talk to the remote browser. You can associate data with the socket so that you can later identify the listener which is handy for writing chat applications.

Stumbles

At the end of seven days, I’m still stumbling over Javascript’s async behavior. At times, function precedence is cumbersome to work around when you’re missing a needed closure for a method in a library. I’ve also tested a number of published packages that obviously solved someone’s problem and looked good, but just weren’t generic enough.

248 hours

656 lines of code, most major functionality working, some test code written and the bulk of the alpha site at least fleshed in.

Overall Impression

node.js is very powerful. I think I could have saved time using Pyramid and used node.js purely for the Ajax Push, but, it was a good test and I learned quite a bit. If you have the time to implement a small project using new technology, I highly recommend it.

Software Used

* node.js
* Express
* gm
* redis
* sequelize
* ejs
* everyauth
* npm
* socket.io

Software Mentions

* APE
* jade

Additional Links

* Blazing fast node.js: 10 performance tips from LinkedIn Mobile

The Architecture of a New Project

Wednesday, January 11th, 2012

Yesterday I started working with Ajax Push, wrote a quick demo for a friend, and then stripped that and wrote a functional demo project with documentation. I did this to test if Ajax Push worked well enough for another concept project. As it turns out, using APE does work, but, it leaves a little to be desired.

While I was working with APE and tweaking the documentation and demo, a problem I had faced a few weeks back popped into my mind. Using Ajax Push for this application was perfect, it was all server push rather than client communication and the concept would work wonderfully.

What now?

We’re faced with a few dilemmas. This problem is 99% Ajax/Long Polling and 1% frontend. Android and iOS apps need to be developed to interface with the system, but, that is the simple part of the project.

Architecture

At first I considered Python/Pyramid as the frontend, Varnish for caching content and APE for handling the Ajax Push/Long Polling. I’ll need to write an API to handle the Android and iOS apps authenticating and communicating with the system. I suspect my app will become an OAuth2 endpoint for the apps which I’ll explain in a moment.

It was at this point that I realized, I could use node.js and socket.io to handle the long polling, but, the frontend requirements are so lightweight, I could do most of the web app in Node.js. Since I’m using node.js quite heavily, I’ll probably use Redis and CouchDB to do my storage – just in case.

Epiphany

Now, I had an epiphany. While I don’t really intend to open the API for the project initially, there’s a certain logic to making your own project utilize the same API that you will later make public. If anything, it makes designing your iOS and Android apps easier since they utilize an API rather than relying on separate methods for communications with the webapp. One single interface rather than two and later if Windows Mobile gets an app, we’ve already got the API designed. Since we’re an OAuth2 endpoint, our mobile apps can take advantage of numerous existing libraries – saving quite a bit of time.

Later, if the API is made public, we’re not facing a new engineering challenge and we’ve had some first-hand experience with the API.

Recently there has been a lot of discussion about using ‘the right tool for the job’ and why that is wrong. ‘Use the same language for every part of the project’ is the other school of thought. There are things I know Python does well, there are things I know it doesn’t do well. There are things Erlang can handle, and things it shouldn’t. While I’m not a fan of Javascript, for this project, it really does seem like the right tool for the job. The difference between APE and node.js was Spidermonkey versus V8. In both cases, I’m writing Javascript, so, why not choose the option that has a much larger installed base – and a demo that has a use case very similar to my final app.

Now what?

While I’ve not used node.js, I’m expecting the next few days to be a rapid iteration of development and testing.

…and I’ll be using git. :)

git init

XFS Filesystem Benchmarking thoughts due to real world situation

Tuesday, November 15th, 2011

I suspect I’m going to learn something about XFS today that is going to create a ton of work.

I’m writing a benchmark test (in bash) that uses bonnie++ along with two other real world scenarios. I’ve thought a lot about how to replicate some real-world testing, but, benchmarks usually stress a worst case scenario and rarely replicate real-world scenarios.

Benchmarks shouldn’t really be published as the end-all, be-all of testing and you have to make sure you’re testing the piece you’re looking at, not your benchmark tool.

I’ve tried to benchmark Varnish, Tux, Nginx multiple times and I’ve seen numerous benchmarks that claim one is insanely faster than the others for this workload or that workload, but, there are always potential issues in their tests. A benchmark should be published with every bit of information possible so that others can replicate the test in the same way and potentially point out configuration issues.

One benchmark I read lately showed Nginx reading a static file at 35krps, and Varnish flatlined at 8krps. My first thought was, is it caching or talking to the backend? There was a snapshot of varnishstat supporting the notion that it was indeed cached, but, was the shmlog mounted on a ram based tmpfs? Was varnishstat running while the benchmark was?

Benchmarks test particular workloads – workloads you may not see. What you learn from a benchmark is how this load is affected by that setting – so when you start to see symptoms, your benchmarking has taught you what knob to turn to fix things.

Based on my impressions of the filesystem issues we’re running into on a few production boxes, I am convinced lazy-count=0 is a problem. While I did benchmark it and received different results, logic dictates that lazy-count=1 should be enabled for almost all workloads. Another value I’m looking at is -i size=256 – which is the default for XFS. I believe this should be larger which would really assist directories with tens of thousands of files. -b size=8192 might be a good compromise since many of these sites are running small files, but, the average filesize is 5120 bytes – slightly over the 4096 byte block – meaning that each file written spans two blocks and needs two metadata updates. The log size should be increased on heavy write machines, and I believe the default setting is too low even for normal workloads.
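
The block arithmetic for the average file can be checked with a few lines. This is a simplified sketch that only counts whole data blocks and ignores extents, inline data and other XFS details; blockUsage is a made-up helper:

```javascript
// Blocks needed and space allocated for one file at a given filesystem block size.
function blockUsage(fileBytes, blockBytes) {
  var blocks = Math.ceil(fileBytes / blockBytes);
  var allocated = blocks * blockBytes;
  return { blocks: blocks, allocated: allocated, slack: allocated - fileBytes };
}

var avgFile = 5120; // average file size from the text, in bytes
console.log(blockUsage(avgFile, 4096)); // { blocks: 2, allocated: 8192, slack: 3072 }
console.log(blockUsage(avgFile, 8192)); // { blocks: 1, allocated: 8192, slack: 3072 }
```

Either way 8192 bytes are allocated, so the larger block size costs nothing in space for a file of this size while halving the number of blocks touched.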

With that in mind, I’ve got 600 permutations of filesystem tests, which need to be run four times to check each mount option, which again need to be run three times to check each IO scheduler.
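
To put a number on that test matrix, here is a quick sketch. The mount-option labels are placeholders, since the four sets are not named here, and the scheduler names are an assumption (the usual three Linux schedulers):

```javascript
// Figures from the text: 600 filesystem permutations, 4 mount-option sets, 3 I/O schedulers.
var mkfsPermutations = 600;
// Placeholder labels: the four mount-option sets are not named in the text.
var mountOptionSets = ['mount-set-1', 'mount-set-2', 'mount-set-3', 'mount-set-4'];
// Assumed scheduler names; the text only says there are three.
var schedulers = ['noop', 'deadline', 'cfq'];

var totalRuns = mkfsPermutations * mountOptionSets.length * schedulers.length;
console.log(totalRuns); // 7200
```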

I’ll use the same methodology to test ext4 which is going to be a lot easier due to fewer knobs, but, I believe XFS is still going to win based on some earlier testing I did.

In this quick test, I increased deletes from about 5.3k/sec to 13.4k/sec which took a little more than six minutes. I suspect this machine will be running tests for the next few days after I write the test script.

GFS2 Kernel Oops

Sunday, October 30th, 2011

For a few years I’ve run a system using DRBD replication between two machines with GFS2 running in dual primary mode to test a theory on a particular type of web hosting I’ve been developing.

For months the system will run fine, then, out of the blue, one of the nodes will drop from the cluster, reboot and we’ve never seen anything in the logs. It’ll run another 120-180 days without incident and then will reboot again with no real indication of the problem. We knew it was a kernel panic or kernel oops, but, the logs never were flushed to disk when the machine was rebooted.

Imagine our luck when two days in a row, at roughly the same time of day, the node rebooted. Even though we have remote syslog set up, we’ve never caught it.

/etc/sysctl.conf was changed so that panic_on_oops was set to 0, which tells the kernel to try to keep running after an oops instead of panicking and rebooting. A number of terminal sessions were opened from another machine tailing various logs, and we were hoping to have the problem occur again.

/etc/sysctl.conf:

kernel.panic=5
kernel.panic_on_oops=0

At 6:25am, coincidentally during log rotation, the GFS2 partition umounted, but, the machine didn’t reboot. Checking our terminal, we still had access to dmesg, and, we had some logs:

GFS2: fsid=gamma:gfs1.0: fatal: invalid metadata block
GFS2: fsid=gamma:gfs1.0:   bh = 1211322 (magic number)
GFS2: fsid=gamma:gfs1.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 401
GFS2: fsid=gamma:gfs1.0: about to withdraw this file system
GFS2: fsid=gamma:gfs1.0: telling LM to unmount
GFS2: fsid=gamma:gfs1.0: withdrawn
Pid: 18047, comm: gzip Not tainted 3.0.0 #1
Call Trace:
 [] ? gfs2_lm_withdraw+0xd9/0x10a
 [] ? gfs2_meta_check_ii+0x3c/0x48
 [] ? gfs2_meta_indirect_buffer+0xf0/0x14a
 [] ? gfs2_block_map+0x1a3/0x9fe
 [] ? drive_stat_acct+0xf3/0x12e
 [] ? do_mpage_readpage+0x160/0x49f
 [] ? pagevec_lru_move_fn+0xab/0xc1
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? mpage_readpages+0xd0/0x12a
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? bit_waitqueue+0x14/0x63
 [] ? gfs2_readpages+0x67/0xa8
 [] ? sd_prep_fn+0x2c1/0x902
 [] ? gfs2_readpages+0x3b/0xa8
 [] ? __do_page_cache_readahead+0x11b/0x1c0
 [] ? ra_submit+0x19/0x1d
 [] ? generic_file_aio_read+0x2b4/0x5e0
 [] ? do_sync_read+0xab/0xe3
 [] ? vfs_read+0xa3/0x10f
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
------------[ cut here ]------------
WARNING: at fs/buffer.c:1188 gfs2_block_map+0x2be/0x9fe()
Hardware name: PDSMi
VFS: brelse: Trying to free free buffer
Modules linked in:
Pid: 18047, comm: gzip Not tainted 3.0.0 #1
Call Trace:
 [] ? gfs2_block_map+0x2be/0x9fe
 [] ? warn_slowpath_common+0x78/0x8c
 [] ? warn_slowpath_fmt+0x45/0x4a
 [] ? gfs2_block_map+0x2be/0x9fe
 [] ? drive_stat_acct+0xf3/0x12e
 [] ? do_mpage_readpage+0x160/0x49f
 [] ? pagevec_lru_move_fn+0xab/0xc1
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? mpage_readpages+0xd0/0x12a
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? bit_waitqueue+0x14/0x63
 [] ? gfs2_readpages+0x67/0xa8
 [] ? sd_prep_fn+0x2c1/0x902
 [] ? gfs2_readpages+0x3b/0xa8
 [] ? __do_page_cache_readahead+0x11b/0x1c0
 [] ? ra_submit+0x19/0x1d
 [] ? generic_file_aio_read+0x2b4/0x5e0
 [] ? do_sync_read+0xab/0xe3
 [] ? vfs_read+0xa3/0x10f
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
---[ end trace 54fad1a4877f173c ]---
BUG: unable to handle kernel paging request at ffffffff813b8f0f
IP: [] __brelse+0x7/0x26
PGD 1625067 PUD 1629063 PMD 12001e1
Oops: 0003 [#1] SMP
CPU 0
Modules linked in:

Pid: 18047, comm: gzip Tainted: G        W   3.0.0 #1 Supermicro PDSMi/PDSMi+
RIP: 0010:[]  [] __brelse+0x7/0x26
RSP: 0018:ffff880185d85800  EFLAGS: 00010286
RAX: 00000000e8df8948 RBX: ffff8801f3fb6c18 RCX: ffff880185d857d0
RDX: 0000000000000010 RSI: 000000000002ccee RDI: ffffffff813b8eaf
RBP: 0000000000000000 R08: ffff880185d85890 R09: ffff8801f3fb6c18
R10: 00000000000029e0 R11: 0000000000000078 R12: ffff880147986000
R13: ffff880147986140 R14: 00000000000029c1 R15: 0000000000001000
FS:  00007fb7d320f700(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff813b8f0f CR3: 0000000212a73000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process gzip (pid: 18047, threadinfo ffff880185d84000, task ffff880147dc4ed0)
Stack:
 ffffffff811537f6 0000000000000000 00000000000022a1 0000000000000001
 000000000000ffff 0000000000000000 ffffffff8102a7ce 0000000000000001
 0000000100000008 00000000fffffffb 0000000000000001 ffff8801f3fb6c18
Call Trace:
 [] ? gfs2_block_map+0x2be/0x9fe
 [] ? warn_slowpath_common+0x7d/0x8c
 [] ? printk+0x43/0x48
 [] ? alloc_page_buffers+0x62/0xba
 [] ? block_read_full_page+0x141/0x260
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? do_mpage_readpage+0x49b/0x49f
 [] ? pagevec_lru_move_fn+0xab/0xc1
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? mpage_readpages+0xd0/0x12a
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? bit_waitqueue+0x14/0x63
 [] ? gfs2_readpages+0x67/0xa8
 [] ? sd_prep_fn+0x2c1/0x902
 [] ? gfs2_readpages+0x3b/0xa8
 [] ? __do_page_cache_readahead+0x11b/0x1c0
 [] ? ra_submit+0x19/0x1d
 [] ? generic_file_aio_read+0x2b4/0x5e0
 [] ? do_sync_read+0xab/0xe3
 [] ? vfs_read+0xa3/0x10f
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
Code: 31 00 45 31 f6 fe 85 88 00 00 00 48 89 df e8 a2 1a fc ff eb 03 45 31 f6 5b 4c 89 f0 5d 41 5c 41 5d 41 5e c3 8b 47 60 85 c0 74 05  ff 4f 60 c3 48 c7 c2 96 7b 4d 81 be a4 04 00 00 31 c0 48 c7
RIP  [] __brelse+0x7/0x26
 RSP 
CR2: ffffffff813b8f0f
---[ end trace 54fad1a4877f173d ]---

As I suspected, log rotation appeared to trigger the problem and handed us the above traceback. Running fsck.gfs2 resulted in:


# fsck.gfs2 -y /dev/drbd1
Initializing fsck
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Error: resource group 7339665 (0x6ffe91): free space (64473) does not match bitmap (64658)
The rgrp was fixed.
Error: resource group 7405179 (0x70fe7b): free space (64249) does not match bitmap (64299)
(50 blocks were reclaimed)
The rgrp was fixed.
Error: resource group 7470693 (0x71fe65): free space (65456) does not match bitmap (65464)
(8 blocks were reclaimed)
The rgrp was fixed.

...snip...

Ondisk and fsck bitmaps differ at block 133061348 (0x7ee5ae4)
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
Ondisk and fsck bitmaps differ at block 133061349 (0x7ee5ae5)
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
RG #133061031 (0x7ee59a7) free count inconsistent: is 65232 should be 65508
Inode count inconsistent: is 37 should be 0
Resource group counts updated
Inode count inconsistent: is 1267 should be 1266
Resource group counts updated
Pass5 complete
The statfs file is wrong:

Current statfs values:
blocks:  188730628 (0xb3fcd04)
free:    176443034 (0xa844e9a)
dinodes: 644117 (0x9d415)

Calculated statfs values:
blocks:  188730628 (0xb3fcd04)
free:    177426468 (0xa935024)
dinodes: 493059 (0x78603)
The statfs file was fixed.
Writing changes to disk
gfs2_fsck complete

Filesystem was remounted after a 7 minute fsck and we’ll see if it happens again tomorrow.

WordPress, Varnish and ESI Plugin

Sunday, June 5th, 2011

This post is a version of the slideshow presentation I did at Hack and Tell in Fort Lauderdale, Florida at The Whitetable Foundation on Saturday, June 4, 2011.

Briefly, I created a Plugin that enabled Fragment Caching with WordPress and Varnish. The problem we ran into with normal page caching methods was related to the fact that this particular client had people visiting many pages per visit, requiring the sidebar to be regenerated on uncached (cold) pages. By caching the sidebar and the page and assembling the page using Edge Side Includes, we can cache the sidebar which contains the most database intensive queries separately from the page. Thus, a visitor moving from one page to a cold page, only needs to wait for the page to generate and pull the sidebar from the cache.

What problem are we solving?

We had a high traffic site where surfers visited multiple pages, and, a very interactive site. Surfers left a lot of comments which meant we were constantly purging the page cache. This resulted in the sidebar having to be regenerated numerous times – even when it wasn’t truly necessary.

What are our goals?

First, we want that Time to First Byte to be as quick as possible – surfers hate to wait and if you have a site that takes 12 seconds before they see any visible indication that there is something happening, most will leave.

We needed to keep the site interactive, which meant purging pages from cache when posts were made.

We had to have fast pageloads – accomplished by caching the static version of the page and doing as few calculations as possible to deliver the content.

We needed fast static content loading. Apache does very well, but, isn’t the fastest webserver out there.

How does the WordPress front page work?

The image above is a simple representation of a page that has a header, an article section where three articles are shown and a sidebar. Each of those elements is built from a number of SQL queries, assembled and displayed to the surfer. Each plugin that is used, especially filter plugins that look at content and modify it before output, adds a little latency – resulting in a slower page display.

How does an Article page work?

An article page works very similarly to the frontpage except our content block now only contains the contents from one post. Sometimes additional plugins are called to display the post content dealing with comments, social media sharing icons, greetings based on where you’re visiting from (Google, Digg, Reddit, Facebook, etc) and many more. We also see the same sidebar on our site which contains the site navigation, advertisements and other content.

What Options do we Have?

There are a number of existing caching plugins that I have benchmarked in the past. Notably we have:

* WP-Varnish
* W3 Total Cache
* WP Super Cache
* WordPress-Varnish-ESI
* and many others

Page Caching

With Page Caching, you take the entire generated page and cache it either in ram or on disk. Since the page doesn’t need to be generated from the database, the static version of the page is served much more quickly.

Fragment Caching

With Fragment Caching, we’re able to cache the page and a smaller piece that is often repeated, but, perhaps doesn’t change as often as the page. When a websurfer comments on a post, the sidebar doesn’t need to be regenerated, but, the page does.
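
As a rough in-memory illustration of that idea (not the plugin’s code; the render functions are made-up stand-ins for the expensive database work), purging a page leaves the cached sidebar alone:

```javascript
// Count how often each expensive render runs.
var renderCalls = { page: 0, sidebar: 0 };
// Hypothetical stand-ins for the database-heavy rendering.
function renderPage(url) { renderCalls.page += 1; return '<article>' + url + '</article>'; }
function renderSidebar() { renderCalls.sidebar += 1; return '<aside>sidebar</aside>'; }

var pageCache = {};
var sidebarCache = null;

function fetchPage(url) {
  if (!(url in pageCache)) pageCache[url] = renderPage(url);  // cold page: regenerate
  if (sidebarCache === null) sidebarCache = renderSidebar();   // built once, shared by all pages
  return pageCache[url] + sidebarCache;
}

// A new comment purges only the page it belongs to; the sidebar stays cached.
function purgePage(url) { delete pageCache[url]; }

fetchPage('/post-1');
fetchPage('/post-2');
purgePage('/post-1');
fetchPage('/post-1');
console.log(renderCalls); // { page: 3, sidebar: 1 }
```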

WordPress and Varnish

Varnish doesn’t deal well with cookies, and WordPress uses a lot of cookies to maintain information about the current web surfer. Some plugins also add their own cookies to track things so that their plugin works.

Varnish can do domain name normalization which may be desired or not. Many sites redirect the bare domain to the www.domain.com. If you do this, you can modify your Varnish Cache Language (VCL) to make sure it always hands back the proper host header.

There are other issues with Varnish that affect how well it caches. There are a number of situations where Varnish doesn’t work as you would expect, but, this can all be addressed with VCL.

Purging – caching is easy, purging is hard once you graduate beyond a single server setup.

WordPress and Varnish with ESI

In this case, our plugin caches the page and the sidebar separately, and allows Varnish to assemble the page prior to sending it to the surfer. This is going to be a little slower than page caching, but, in the long run, if you have a lot of page to page traffic, having that sidebar cached will make a significant impact.
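
Conceptually, the assembly step swaps each ESI include tag in the cached page for the cached fragment it points to. The sketch below mimics that with a regular expression; Varnish does this internally and fetches missing fragments from the backend, and the fragment URL here is made up:

```javascript
// Hypothetical fragment cache keyed by the include URL.
var fragmentCache = {
  '/esi/sidebar': '<aside>cached sidebar</aside>'
};

// Replace each <esi:include src="..."/> tag with the matching cached fragment.
function assemble(pageHtml, cache) {
  return pageHtml.replace(/<esi:include\s+src="([^"]+)"\s*\/>/g, function (tag, src) {
    // Leave the tag in place if the fragment is missing, so the gap is visible.
    return Object.prototype.hasOwnProperty.call(cache, src) ? cache[src] : tag;
  });
}

var cachedPage = '<main>post body</main><esi:include src="/esi/sidebar"/>';
console.log(assemble(cachedPage, fragmentCache));
```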

Possible Solutions

You could hardcode templates and write modules to cache CPU or Database heavy widgets and in some cases, that is a good solution.

You could create a widget that handles the work to cache existing widgets. There is a plugin called Widget Cache, but, I didn’t find it to have much benefit when testing.

Many of the plugins could be rewritten to use client-side javascript. This way, caching would allow the javascript to be served and the actual computational work would be done on the client’s web browser.

Technical Problems

When the plugin was originally written, Varnish didn’t support compressing ESI assembled pages which resulted in a very difficult to manage infrastructure.

WordPress uses a lot of cookies which need to be dealt with very carefully in Varnish’s configuration.

What sort of Improvement?

Before the ESI Widget          | After the ESI Widget
12 seconds time to first byte  | .087 seconds time to first byte
.62 requests per second        | 567 requests per second
Huge number of elements        | Moved some elements to a ‘CDN’ url
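
For a sense of scale, the ratios implied by those figures work out as follows (simple arithmetic on the numbers above, nothing measured separately):

```javascript
// Figures taken from the before/after comparison.
var before = { ttfbSeconds: 12, requestsPerSecond: 0.62 };
var after = { ttfbSeconds: 0.087, requestsPerSecond: 567 };

var ttfbSpeedup = before.ttfbSeconds / after.ttfbSeconds;
var throughputGain = after.requestsPerSecond / before.requestsPerSecond;

console.log(Math.round(ttfbSpeedup));    // 138
console.log(Math.round(throughputGain)); // 915
```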

WordPress Plugin

In the above picture, we can see the ESI widget has been added to the sidebar, and we’ve added our desired widgets to the new ESI Widget Sidebar.

Varnish VCL – vcl_recv

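sub vcl_recv {
    # Let the plugin invalidate a cached URL by sending a BAN request.
    if (req.request == "BAN") {
       ban("req.http.host == " + req.http.host +
              " && req.url == " + req.url);
       error 200 "Ban added";
    }
    # Static assets: drop cookies and strip cache-breaking query strings.
    if (req.url ~ "\.(gif|jpg|jpeg|swf|css|js|flv|mp3|mp4|pdf|ico|png)(\?.*|)$") {
      unset req.http.cookie;
      set req.url = regsub(req.url, "\?.*$", "");
    }
    # Everything except the login and admin pages: drop cookies.
    if (!(req.url ~ "wp-(login|admin)")) {
      unset req.http.cookie;
    }
}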

In vcl_recv, we set up rules to allow the plugin to purge content, we do a little manipulation to cache static assets and ignore some of the cache breaking arguments specified after the ? and we aggressively remove cookies.
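
The static asset rule is easy to try outside Varnish. This sketch applies the same two regular expressions in JavaScript; normalizeUrl is a made-up helper and the sample URLs are only illustrative:

```javascript
// Same pattern as the VCL rule: a static asset extension, optionally followed by a query string.
var STATIC_ASSET = /\.(gif|jpg|jpeg|swf|css|js|flv|mp3|mp4|pdf|ico|png)(\?.*|)$/;

// Mirrors the vcl_recv rule: for static assets, drop everything from the first "?".
function normalizeUrl(url) {
  if (STATIC_ASSET.test(url)) {
    return url.replace(/\?.*$/, '');
  }
  return url;
}

console.log(normalizeUrl('/wp-content/style.css?ver=3.1')); // /wp-content/style.css
console.log(normalizeUrl('/?p=42'));                        // /?p=42
```

With the query string gone, every variant of a static asset URL maps to one cache entry instead of one entry per cache-breaking argument.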

Varnish VCL – vcl_fetch

sub vcl_fetch {
  if ( (!(req.url ~ "wp-(login|admin)")) || (req.request == "GET") ) {
    unset beresp.http.set-cookie;
  }
  set beresp.ttl = 12h;

  if (req.url ~ "\.(gif|jpg|jpeg|swf|css|js|flv|mp3|mp4|pdf|ico|png)(\?.*|)$") {
    set beresp.ttl = 365d;
  } else {
    set beresp.do_esi = true;
  }
}

Here, we remove cookies set by the backend. We set our timeout to 12 hours, overriding any expire time. Since the widget purges cached content, we can set this to a longer expiration time – eliminating additional CPU and database work. For static assets, we set a one year expiration time, and, if it isn’t a static asset, we parse it for ESI. The ESI parsing rule needs to be refined considerably as it currently parses objects that wouldn’t contain ESI.

Did Things Break?

Purging broke things and revealed a bug in PHP’s socket handling.

Posting Comments initially broke as a result of cookie handling that was a little too aggressive.

Certain plugins break that rely on being run on each pageload such as WP Greet Box and many of the Post Count and Statistics plugins.

Apache logs are rendered virtually useless since most of the queries are handled by Varnish and never hit the backend. You can log from varnishncsa, but, Google Analytics or some other webbug statistics program is a little easier to use.

End Result

Varnish 3.0, currently in beta, allows compression of ESI assembled pages, and, now can accept compressed content from the backend – allowing the Varnish server to exist at a remote location, possibly opening up avenues for companies to provide Varnish hosting in front of your WordPress site using this plugin.

Varnish ESI powered sites became much easier to deploy with 3.0. Before 3.0, you needed to run Varnish to do the ESI assembly, then pass the result through some other server like Nginx to compress the page before sending it to the surfer, or you would be stuck handing uncompressed pages to your surfers.

Other Improvements

* Minification/Combining Javascript and CSS
* Proper ordering of included static assets – i.e. include .css files before .js, use Async javascript includes.
* Spriting images – combining smaller images and using CSS to alter the display port resulting in one image being downloaded rather than a dozen tiny social media buttons.
* Inline CSS for images – if your images are small enough, they could be included inline in your CSS – saving an additional fetch for the web browser.
* Multiple sidebars – currently, the ESI widget only handles one sidebar.
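
The inline-image idea from the list above usually means a base64 data URI inside the stylesheet. A minimal sketch, assuming Node’s Buffer for the encoding; the selector and the three placeholder bytes are made up and do not form a real image:

```javascript
// Build a CSS rule that embeds a small image as a base64 data URI.
function inlineImageRule(selector, mimeType, imageBytes) {
  var encoded = Buffer.from(imageBytes).toString('base64');
  return selector + ' { background-image: url(data:' + mimeType + ';base64,' + encoded + '); }';
}

// Placeholder bytes (the ASCII letters "GIF"); a real build step would read the image file.
var placeholderBytes = [0x47, 0x49, 0x46];
console.log(inlineImageRule('.icon-share', 'image/gif', placeholderBytes));
```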

How can I get the code?

http://code.google.com/p/wordpress-varnish-esi/