Archive for February, 2010

Using Varnish to assist with AB Testing

Thursday, February 25th, 2010

While working with a recent client project, they mentioned AB Testing a few designs. While I enjoy statistics, we looked at Google’s Website Optimizer to track trials and conversions. After some internal testing, we opted to use Funnels and Goals rather than the AB or Multivariate test. I had little control over the origin server, but I did have control over the front-end cache.

Our situation reminded me of a situation I encountered years ago. A client had an inhouse web designer and a subcontracted web designer. I felt the subcontracted web designer’s design would convert better. The client wasn’t completely convinced, but agreed to running two designs head to head. However, their implementation of the test biased the results.

What went wrong?

Each design was run for a week, in series. While this provided ample time for gathering data, the inhouse designer’s design ran during a national holiday with a three day weekend, and the subcontractor’s design ran the following week. Internet traffic patterns, the holiday weekend, weather, sporting events, TV/Movie premieres, etc. added so many variables which should have invalidated the results.

Since Google’s AB Testing has session persistence and splits traffic between the AB tests, we need to emulate this behavior. When people run AB tests in series rather than parallel, or, switch pages with a cron job or some other automated method, I cringe. A test at 5pm EST and 6pm EST will yield different results. At 5pm EST, your target audience could be driving home from work. At 6pm EST they could be sitting down for dinner.

How can Varnish help?

If we allow Varnish to select the landing page/offer page outside the origin server’s control, we can run both tests run at the same time. An internet logjam in Seattle, WA would affect both tests evenly. Likewise, a national or worldwide event would affect both tests equally. Now that we know how to make sure the AB Test is fairly balanced, we have to implement it.

Redirection sometimes plays havoc on browsers and spiders, so, we’ll rewrite the URL within Varnish using some Inline C and VCL. Google uses javascript and a document.location call to send some visitors to the B/alternate page. Users that have javascript disabled, will only see the Primary page.

Our Varnish config file contains the following:

sub vcl_recv {
  if (req.url == "/") {
    C{
      char buff[5];
      sprintf(buff,"%d",rand()%2 + 1);
      VRT_SetHdr(sp, HDR_REQ, "\011X-ABtest:", buff, vrt_magic_string_end );
    }C
    set req.url = "/" req.http.X-ABtest "/" req.url;
  }
}

We’ve placed our landing pages in /1/ and /2/ directories on our origin server. The only page Varnish intercepts is the index page at the root of the site. Varnish randomly chooses to serve the index.html page from /1/ or /2/, internally rewrites our URL and serves it from the cache or the origin server. Since the URL rewriting is done within vcl_recv, subsequent requests for the page don’t hit the origin. The same method can be used to test landing pages that aren’t at the root of your site by modifying the if (req.url == “”) { condition.

You can test multipage offers by placing additional pages within the /1/ and /2/ directories on your origin along with the signup form. Unlike Google’s AB Test, Varnish does not support session persistence. Reloading the root page will result in the surfer alternating between both test pages. Subsequent pages need to be loaded from /1/ or /2/ based on which landing page was selected.

When doing any AB Test, change as few variables as possible, document the changes, and analyze the difference between the results. Running at least 1000 views of each is an absolute minimum. While Google’s Multivariate test provides a lot more options, a simple AB test between two pages or site tours can give some insight into what works rather easily.

If you cannot use Google’s AB Test or the Multivariate Test, using their Funnels and Goals tool will still allow you to do AB Testing.

Varnish VCL, Inline C and a random image

Thursday, February 18th, 2010

While working with the prototype of a site, I wanted to have a particular panel image randomly chosen when the page was viewed. While this could be done on the server side, I wanted to move this to Varnish so that Varnish’s cache would be used rather than piping the request through each time to the origin server.

At the top of /etc/varnish/default.vcl

C{
  #include <stdlib.h>
  #include <stdio.h>
}C

and our vcl_recv function gets the following:

  if (req.url ~ "^/panel/") {
    C{
      char buff[5];
      sprintf(buff,"%d",rand()%4);
      VRT_SetHdr(sp, HDR_REQ, "\010X-Panel:", buff, vrt_magic_string_end);
    }C
    set req.url = regsub(req.url, "^/panel/(.*)\.(.*)$", "/panel/\1.ZZZZ.\2");
    set req.url = regsub(req.url, "ZZZZ", req.http.X-Panel);
  }

The above code allows for us to specify the source code in the html document as:

<img src="/panel/random.jpg" width="300" height="300" alt="Panel Image"/>

Since we have modified the request uri in vcl_recv before the object is cached, subsequent requests for the same modified URI will be served from Varnish’s cache, without requiring another fetch from the origin server. Based on the other VCL and preferences, you can specify a long expire time, remove cookies, or do ESI processing. Since the regexp passes the extension through, we could also randomly choose .html, .css, .jpg or any other extension you desire.

In the directory panel, you would need to have

/panel/random.0.jpg
/panel/random.1.jpg
/panel/random.2.jpg
/panel/random.3.jpg

which would be served by Varnish when the url /panel/random.jpg is requested.

Moving that process to Varnish should cut down on the load from the origin server while making your site look active and dynamic.

SEOProfilerBot, Amazon ECS, and poor programming

Monday, February 8th, 2010

This morning a client’s machine alerted several times due to high load. As the machine runs roughly 50 wordpress powered sites and rarely has issues, we did some investigation. Evidently a bot called SEOProfiler was hitting the machine and causing problems.

From SEOProfiler’s page, http://www.seoprofiler.com/bot/,

The spbot is bandwidth-friendly. It tries to wait at least 5 minutes until it visits another page of your domain. In general, it takes days or weeks until spbot visits another page of your website.

Oh really?

In a three hour period on a machine with 50 domains:

# grep -l '+http://www.seoprofiler.com/bot/' *.log|wc -l
50
# grep '+http://www.seoprofiler.com/bot/' *.log|wc -l
375938

In a period of three and a half hours, I calculate that to be roughly two pages per second requested.

Let’s see how friendly they really are:

# grep seoprofiler.com xxxxxx.com-access.log | grep 'GET /robots.txt ' | wc -l
2005

2005 requests for robots.txt in three and a half hours, well, at least they are checking.

# grep seoprofiler.com xxxxxx.com-access.log | grep -v 'GET /robots.txt ' |wc -l
1883

1883 requests for documents in that same period. They actually requested robots.txt more frequently than pages on this particular domain. Here are the first 50 lines from one of the sites on this machine with robots.txt requests excluded:

67.202.41.44 - - [07/Feb/2010:06:38:13 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 11857 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
75.101.214.118 - - [07/Feb/2010:06:38:15 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 10214 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.65.79 - - [07/Feb/2010:06:38:41 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 71830 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.54.185 - - [07/Feb/2010:06:38:45 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 20829 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
67.202.48.58 - - [07/Feb/2010:06:38:48 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 19576 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.172.253 - - [07/Feb/2010:06:39:32 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 73199 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.65.79 - - [07/Feb/2010:06:39:47 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 60596 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.191.9 - - [07/Feb/2010:06:39:50 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 21406 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
204.236.242.36 - - [07/Feb/2010:06:39:51 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 24076 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.48.77 - - [07/Feb/2010:06:40:10 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 29957 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.65.79 - - [07/Feb/2010:06:40:15 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 9871 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
204.236.242.36 - - [07/Feb/2010:06:40:40 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 11748 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.172.253 - - [07/Feb/2010:06:40:43 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 10781 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
75.101.197.161 - - [07/Feb/2010:06:40:44 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 14995 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.93.177 - - [07/Feb/2010:06:40:45 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 72244 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
204.236.197.86 - - [07/Feb/2010:06:40:57 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 13103 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.172.253 - - [07/Feb/2010:06:40:58 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 12032 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
67.202.0.47 - - [07/Feb/2010:06:41:05 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 17798 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
75.101.254.111 - - [07/Feb/2010:06:41:22 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 38199 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.65.79 - - [07/Feb/2010:06:41:38 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 17484 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
204.236.197.86 - - [07/Feb/2010:06:41:41 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 23264 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.103.67 - - [07/Feb/2010:06:41:47 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 17145 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.42.173 - - [07/Feb/2010:06:41:48 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 23440 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
204.236.244.231 - - [07/Feb/2010:06:41:50 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 29496 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
75.101.214.118 - - [07/Feb/2010:06:41:52 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 69694 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.140.41 - - [07/Feb/2010:06:41:56 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 14958 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.48.77 - - [07/Feb/2010:06:42:41 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 12272 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.54.185 - - [07/Feb/2010:06:42:55 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 60345 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
67.202.16.163 - - [07/Feb/2010:06:43:03 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 16470 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
204.236.242.36 - - [07/Feb/2010:06:43:04 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 21739 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.103.67 - - [07/Feb/2010:06:43:05 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 59288 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.152.208 - - [07/Feb/2010:06:43:05 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 11407 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.42.173 - - [07/Feb/2010:06:43:09 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 14459 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
67.202.0.47 - - [07/Feb/2010:06:43:31 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 10561 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.93.177 - - [07/Feb/2010:06:43:46 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 14947 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.152.208 - - [07/Feb/2010:06:43:50 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 19598 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.140.41 - - [07/Feb/2010:06:43:55 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 12090 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.140.41 - - [07/Feb/2010:06:44:05 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 11853 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
75.101.254.111 - - [07/Feb/2010:06:44:16 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 11612 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
67.202.41.44 - - [07/Feb/2010:06:44:15 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 71920 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
67.202.0.47 - - [07/Feb/2010:06:44:22 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 14007 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.191.9 - - [07/Feb/2010:06:44:31 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 130288 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
75.101.254.111 - - [07/Feb/2010:06:45:01 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 21739 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
204.236.242.36 - - [07/Feb/2010:06:45:26 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 18281 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.65.79 - - [07/Feb/2010:06:45:32 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 59638 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.103.67 - - [07/Feb/2010:06:45:40 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 12372 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.65.79 - - [07/Feb/2010:06:46:04 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 14353 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.54.185 - - [07/Feb/2010:06:46:07 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 27416 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.152.208 - - [07/Feb/2010:06:46:13 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 22271 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
75.101.197.161 - - [07/Feb/2010:06:46:13 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 14548 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"

While we don’t see many duplicate IPs here, let’s analyze the one that has six hits:

174.129.65.79 - - [07/Feb/2010:06:38:41 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 71830 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.65.79 - - [07/Feb/2010:06:39:47 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 60596 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.65.79 - - [07/Feb/2010:06:40:15 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 9871 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.65.79 - - [07/Feb/2010:06:41:38 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 17484 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.65.79 - - [07/Feb/2010:06:45:32 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 59638 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
174.129.65.79 - - [07/Feb/2010:06:46:04 -0500] "GET /xxxxx/xxxxx/xxxxx.html HTTP/1.1" 200 14353 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"

The longest delay between page fetches is 3 minutes, 54 seconds, with a minimum of 28 seconds.

In that same period of time, you can see that they used a number of Amazon ECS instances:

  10655 67.202.0.47
  10454 204.236.242.36
  10353 174.129.103.67
  10343 75.101.254.111
  10295 204.236.197.86
  10128 174.129.65.79
   9908 174.129.191.9
   9883 75.101.214.118
   9835 72.44.54.185
   9833 72.44.42.173
   9769 174.129.136.94
   9718 75.101.197.161
   9290 174.129.106.91
   9063 72.44.48.77
   9017 174.129.152.208
   8850 204.236.212.138
   8712 174.129.93.177
   8423 174.129.140.41
   8415 67.202.41.44
   8302 67.202.16.163
   8116 72.44.57.92
   7923 204.236.245.5
   7633 75.101.219.131
   7519 67.202.48.58
   7510 174.129.72.66
   7429 67.202.2.164
   7356 174.129.155.12
   7335 174.129.172.253
   7036 75.101.214.102
   6998 67.202.42.161
   6835 174.129.159.143
   6109 204.236.244.231
   6002 174.129.127.87
   5961 75.101.168.14
   5841 174.129.84.116
   5201 174.129.163.50
   5114 72.44.49.238
   4744 174.129.153.52
   4654 75.101.241.159
   4615 204.236.241.141
   4585 75.101.179.97
   4463 174.129.61.74
   4387 75.101.179.141
   4379 72.44.56.37
   4332 75.101.187.208
   4169 67.202.56.227
   4106 204.236.211.119
   4075 174.129.93.123
   3722 204.236.242.141
   3332 67.202.11.26
   3276 67.202.0.31
   3097 174.129.171.75
   2360 75.101.234.148
   1837 174.129.136.47
   1689 67.202.56.158
    853 67.202.10.125
     67 75.101.204.87
     14 204.236.212.231
     12 174.129.144.34
      6 174.129.106.64

Even if we look at only one of the domains that was spidered:

    125 72.44.48.77
    123 174.129.140.41
    112 174.129.65.79
    109 75.101.254.111
    108 174.129.172.253
    104 75.101.197.161
    104 174.129.93.177
    104 174.129.103.67
    102 204.236.197.86
    102 174.129.136.94
    101 67.202.2.164
     99 75.101.214.118
     98 67.202.0.47
     96 67.202.48.58
     95 204.236.212.138
     93 174.129.106.91
     86 67.202.41.44
     85 72.44.54.185
     84 204.236.242.36
     82 75.101.219.131
     82 72.44.42.173
     76 67.202.42.161
     76 174.129.191.9
     75 174.129.152.208
     73 72.44.57.92
     73 67.202.16.163
     71 75.101.168.14
     71 174.129.159.143
     68 204.236.245.5
     68 174.129.72.66
     61 174.129.155.12
     60 204.236.244.231
     60 204.236.211.119
     59 174.129.153.52
     58 72.44.49.238
     54 72.44.56.37
     54 174.129.93.123
     54 174.129.61.74
     51 75.101.179.141
     51 174.129.163.50
     50 204.236.242.141
     47 174.129.127.87
     45 75.101.241.159
     44 75.101.214.102
     43 67.202.56.227
     42 174.129.171.75
     41 67.202.11.26
     40 67.202.0.31
     39 75.101.187.208
     39 204.236.241.141
     36 174.129.84.116
     32 75.101.179.97
     30 75.101.234.148
     22 174.129.136.47
     19 67.202.56.158
     12 67.202.10.125

While their goals stated on their page are admirable, it is clear that they lack some understanding of how ECS works. Writing code to run across distributed instances is not a simple process, so, I can see where handing out spider assignments to nodes could run into problems. But, looking at a single IP address, we can see that their bot probably doesn’t maintain state between fetches since it fetches robots.txt prior to each URL and then violates their ‘no more than one page every five minutes’.

72.44.48.77 - - [07/Feb/2010:06:40:10 -0500] "GET /robots.txt HTTP/1.1" 200 2631 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.48.77 - - [07/Feb/2010:06:40:10 -0500] "GET / HTTP/1.1" 200 29957 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.48.77 - - [07/Feb/2010:06:42:40 -0500] "GET /robots.txt HTTP/1.1" 200 2631 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.48.77 - - [07/Feb/2010:06:42:41 -0500] "GET /xxxxxx/xxxxxx/xxxxxx.html HTTP/1.1" 200 12272 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.48.77 - - [07/Feb/2010:06:49:26 -0500] "GET /robots.txt HTTP/1.1" 200 2631 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.48.77 - - [07/Feb/2010:06:49:26 -0500] "GET /xxxxxx/xxxxxx/xxxxxx.html HTTP/1.1" 200 16855 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.48.77 - - [07/Feb/2010:06:53:11 -0500] "GET /robots.txt HTTP/1.1" 200 2631 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"
72.44.48.77 - - [07/Feb/2010:06:53:11 -0500] "GET /xxxxxx/xxxxxx/xxxxxx.html HTTP/1.1" 200 68020 "-" "Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )"

Based on the times, I don’t believe they could have spun up a new ECS instance on the same IP address which leads me to believe that they are spidering links from the site and requesting robots.txt each time.

While I believe using cloud services is a good thing, companies like this that abuse it are going to cause problems for other people that adopt the same methods. Amazon’s ECS instances have already hit numerous anti-spam blacklists due to Amazon’s lax policy or inability to quickly track down spam. While I have resisted the temptation to block ECS instances for inbound email, this client requested that we block the IP addresses that SEOProfilerBot was coming from – which means that any other search engine that comes along that uses Amazon’s ECS will not be able to reach his sites.

Cuill did the same thing to his sites a while back and we altered the robots.txt file, but, that didn’t stop the constant pounding from their spiders that had already fetched the robots.txt.

At some point, Amazon ECS and other cloud vendors will be firewalled from large portions of the net — limiting the usefulness of writing applications that run on the cloud.