The Performance Impact of Uploaded Data

January 18, 2008 in Caching, HTTP, HttpWatch, Optimization

Web developers are becoming more aware of the performance penalties of page bloat, and as we covered in our previous posts, there are ways to mitigate it, compression being just one of them.

However, one often overlooked cause of poor performance is the time taken to upload data to the server. Although HTTP request messages are typically smaller than HTTP response messages, the performance cost per byte can be an order of magnitude higher. This is caused by the asymmetric nature of many consumer broadband connections.

For example, the results of a speed test on a UK broadband cable connection are shown here:

Broadband Speed Test

The upload speed is only about 6% of the download speed. That means, byte for byte, uploaded data takes about 16 times as long to transmit as the equivalent amount of downloaded data.

To put it another way, if you upload 4 KB of data in an HTTP request message it may take the same length of time as downloading a 64 KB page.

You can easily see the size of the HTTP request message by looking at the Sent column in HttpWatch:

 Sent Column

The value shown in the Sent column is made up of the sizes of the following items:

  • The HTTP GET or POST request line
  • HTTP request headers
  • Form fields and uploaded files sent with POST requests

Unfortunately, request data is never compressed, because there is no server-side equivalent of the Accept-Encoding request header that browsers use to indicate they support compression of downloaded content: the browser has no way of knowing, before it sends a request, that the server would accept a compressed body.

For a typical site, you might be surprised to know that the request data can be up to 50% of the size of the response data. Since many broadband connections are asymmetric, this can have a substantial impact on performance. Here’s an example of a flight search page on Expedia:

Upload / Download Ratio

The ratio increases as the downloaded content is cached by the browser, often making uploaded data the most significant factor in the performance of a web page.

So what can be done to reduce the amount of uploaded data?

Step 1: Minimize the size of Cookies

Cookies are simply part of the request headers. Expedia uses around 9 cookies, which are fortunately quite small, but it's easy to end up with a lot of cookie data, particularly if you're using third-party web frameworks. RFC 2109 specifies that browsers should support at least 20 cookies per domain and at least 4 KB of data per cookie.

The problem with cookies is that they are sent with every single HTTP request whose URL falls within the domain and path to which they apply. That includes requests for style sheets, images and scripts. So, in most cases, the amount of cookie data uploaded is effectively multiplied by the number of requests per page.

One way to reduce the amount of cookie data (apart from making them as small as possible and using them less) is to use different domains or paths for your content. For instance, you probably need cookies in your page processing code but you don’t need them for static content such as images. If you put static content in a different location, cookie data will not be sent because cookies are domain and path specific.
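
As a rough sketch of the path-based approach (the host, paths and cookie values here are hypothetical), a session cookie can be scoped so that the browser never attaches it to requests for static content:

```javascript
// Minimal Node.js sketch, hypothetical paths: scope the cookie to /app
// so requests for /static/* images, scripts and CSS carry no cookie data.
const http = require('http');

http.createServer((req, res) => {
  if (req.url.startsWith('/app')) {
    // The Path attribute restricts where the browser will send the cookie.
    res.setHeader('Set-Cookie', 'sessionid=abc123; Path=/app');
    res.end('dynamic page');
  } else {
    res.end('static content'); // requested with cookie-free headers
  }
}).listen(8080);
```

Putting static files on a separate hostname, such as a hypothetical static.example.com, achieves the same effect at the domain level.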

Another way to reduce cookie data is to look at server-side storage, such as the session state management of a framework like ASP.NET. You can then use a single cookie that contains just a session id, and look up any session-related data on the server when required. Of course, this may have an impact on server-side performance, but it is a useful way of minimizing the amount of cookie data that a site requires.
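
A minimal sketch of that idea, assuming an in-memory store and a hypothetical cookie name sid:

```javascript
// Sketch: the cookie carries only a short session id; the bulky,
// session-related data lives on the server.
const sessions = new Map(); // in production: a database or cache server

function getSessionData(req) {
  // Parse "sid=..." out of the Cookie request header.
  const match = /(?:^|;\s*)sid=([^;]+)/.exec(req.headers.cookie || '');
  return match ? sessions.get(match[1]) : undefined;
}
```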

Step 2: Avoid Excessive use of Hidden Form Fields

There are two sets of data in a typical HTML form:

  1. Fields you want the user to fill out with information.
  2. Hidden fields.

You can't really do much about the first type, except reduce the size of your field names, which may be difficult depending on your implementation framework.

Hidden fields are used to maintain page-scoped variables that will be required by the server when the form is submitted, e.g. a user ID. They may also be injected by various web frameworks or server-side controls. One such example is the __VIEWSTATE field used by ASP.NET as shown below:

ASP.NET hidden fields

There are two reasons for doing this. Firstly, it's easier to have the state to hand (so to speak) in your page logic; secondly, it often scales better across a web farm to keep page-scoped state within the page rather than fetching it from somewhere else, like a database.

One way to reduce the amount of data in hidden fields is to use only a single key in a hidden form field. The key is then used on the server to retrieve the data required to process the submitted page. This is exactly the approach used above to reduce the amount of cookie data, and as with cookies, reducing request transmission time in this way may have an impact on server-side performance and scalability.
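
A minimal sketch of the single-key approach, with a hypothetical in-memory store standing in for whatever the server would really use:

```javascript
// Sketch: replace many hidden fields with one opaque key.
const crypto = require('crypto');
const pageStates = new Map(); // hypothetical server-side store

// Called while rendering the page; the returned key becomes the only
// hidden field, e.g. <input type="hidden" name="statekey" value="...">.
function storePageState(state) {
  const key = crypto.randomBytes(8).toString('hex');
  pageStates.set(key, state);
  return key;
}

// Called when the form is submitted, to recover the page-scoped state.
function loadPageState(key) {
  return pageStates.get(key);
}
```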

Step 3: Avoid Verbose URLs

This is not the most important issue on the list, but it is still worth considering. In practice there are no universal limits on URL length, but most browsers will struggle with URLs longer than 4 KB, and some may struggle at 1 KB or less. Remember that your URL will also end up in the Referer header of requests for images and other embedded resources.

It's usually the query string parameters that make a URL overly long. Again using Expedia as an example, you can see the kind of query string variables that many sites use:

http://www.expedia.com/pub/agent.dll?qscr=fexp&flag=q&city1=lon&citd1=bos&date1=1/22/2008&time1=362&
date2=1/22/2008&time2=362&cAdu=1&cSen=&cChi=&cInf=&infs=2&
tktt=&trpt=2&ecrc=&eccn=&qryt=8&load=1&airp1=&dair1=&rdct=1&
rfrr=-429

In this case the URL is used to quickly pinpoint a results page for flights between 'city1' London (LON) and 'citd1' Boston (BOS) on specific dates.

Encoding data like this in URLs does have certain advantages. If you bookmark or share the URL the same results will be displayed when the URL is next used. However, if the URL contains unused or redundant data it may be causing a significant increase in the amount of uploaded data.
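
One cheap improvement, sketched below, is to omit parameters that have no value rather than sending empty name=value pairs (the parameter names are taken from the Expedia URL above):

```javascript
// Sketch: build a query string that drops unused parameters instead of
// sending them as empty "name=" pairs.
function buildQuery(params) {
  return Object.entries(params)
    .filter(([, value]) => value !== '' && value != null)
    .map(([name, value]) =>
      encodeURIComponent(name) + '=' + encodeURIComponent(value))
    .join('&');
}

console.log(buildQuery({ city1: 'lon', citd1: 'bos', cAdu: 1, cSen: '' }));
// -> "city1=lon&citd1=bos&cAdu=1"
```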

Two Simple Rules for HTTP Caching

December 10, 2007 in Caching, HTTP, HttpWatch

In practice, you only need two settings to optimize caching:

  1. Don’t cache HTML
  2. Cache everything else forever

“Wooah… hang on!”, we hear you say. “Cache all my scripts and images forever?”

Yes, that’s right. You don’t need anything else in between. Caching indefinitely is fine as long as you don’t allow your HTML to be cached.

“But what about if I need to issue code patches to my JavaScript? I can’t allow browsers to hold on to all my images either. I often need to update those as well.”

Simple – just change the URL of the item in your HTML and it will bypass the existing entry in the cache.

In practice, caching ‘forever’ typically means setting an Expires header value such as Sun, 17-Jan-2038 19:14:07 GMT, just short of the maximum date supported by the 32-bit Unix time format (which runs out at 03:14:07 GMT on 19 January 2038). If you’re using IIS 6 you’ll find that the UI won’t allow anything beyond 31-Dec-2035. The advantage of setting long expiry dates is that the content can be read from the local browser cache whenever the user revisits the web page or goes to another page that uses the same images, script or CSS files.
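
What this looks like in server code varies by platform; here is a minimal Node.js sketch using the far-future date discussed above:

```javascript
// Sketch: serve a long-lived static resource with a far-future Expires
// header so the browser can re-use it without any network round trip.
const http = require('http');

http.createServer((req, res) => {
  res.setHeader('Expires', 'Sun, 17 Jan 2038 19:14:07 GMT');
  res.setHeader('Content-Type', 'application/javascript');
  res.end('/* long-lived script content */');
}).listen(8080);
```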

You’ll see long expiry dates like this if you look at a Google web page with HttpWatch. For example, here are the response headers used for the main Google logo on the home page:

Google Expires header

If Google needs to change the logo for a special occasion like Halloween they just change the name of the file in the page’s HTML to something like halloween2007.gif.

The diagram below shows how a JavaScript file is loaded into the browser cache on the first visit to a web page:

Accessing page with empty cache

On any subsequent visits the browser only has to fetch the page’s HTML:

Read from cache

The JavaScript file can be read directly from the browser cache on the user’s hard disk. This avoids a network round trip and is typically 100 to 1000 times faster than downloading the file over a broadband connection.

The key to this caching scheme is to keep tight control over your HTML as it holds the references to everything else on your web site. One way to do this is to ensure that your pages have a Cache-Control: no-cache header. This will prevent any caching of the HTML and will ensure the browser requests the page’s HTML every time.
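
In server code that can be as simple as the following sketch (the handler shape is hypothetical):

```javascript
// Sketch: every dynamic HTML response is marked uncacheable, so the
// browser always re-requests the page and sees updated resource URLs.
function serveHtml(res, html) {
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Content-Type', 'text/html');
  res.end(html);
}
```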

If you do this, you can update any content on the page just by changing the URL that refers to it in the HTML. The old version will still be in the browser’s cache, but the updated version will be downloaded because of the modified URL.

For instance, if you had a file called topMenu.js and you fixed some bugs in it, you might rename the file to topMenu-v2.js to force it to be downloaded:

Force update with new file name

Now this is all very well, but whenever there’s a discussion of longer expiration times, the marketing people get very twitchy and concerned that they won’t be able to re-brand a site if stylesheets and images are cached for long periods of time.

In fact, choosing an expiration time of anything other than zero or infinite is inherently uncertain: an expiry of, say, one week tells you nothing about when a given user's copy will actually be refreshed, because that depends on when they last fetched it. The only way to know exactly when you can release a new version to all users simultaneously is to choose a specific time of day for your cache expiry, say midnight. It's better to set indefinite caching on all your page-linked items so that you get the maximum amount of caching, and then force updates as required.

Now, by this point, you might have the marketing types on board, but you'll be losing the developers. By now the developers are seeing all the extra work involved in changing the filenames of all their CSS, JavaScript and images, both in their source-controlled projects and in their deployment scripts.

So here's the icing on the cake: you don't actually need to change the filename, just the URL. A simple way to do this is to append a query string parameter to the end of the existing URL when the resource has changed.

Here’s the previous example that updated a JavaScript file. The difference this time is that it uses a query string parameter ‘v2’ to bypass the existing cache entry:

Force update with query string

The web server will simply ignore the query string parameter unless you choose to do something with it programmatically.
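
A small helper, sketched here with hypothetical file names and version numbers, keeps the versioned URLs in one place:

```javascript
// Sketch: emit resource URLs with a version parameter; bumping the
// number in this table bypasses every user's cached copy.
const assetVersions = { 'topMenu.js': 2, 'site.css': 1 };

function versionedUrl(file) {
  return '/' + file + '?v' + (assetVersions[file] || 1);
}

console.log(versionedUrl('topMenu.js')); // -> "/topMenu.js?v2"
```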

There's one final optimization you can make. The Cache-Control: no-cache response header works well for dynamic pages, as it ensures that pages are always refreshed from the server, even when the user presses the Back button. However, for HTML that changes less frequently it is better to use the Last-Modified header instead. This avoids a complete download of the page's HTML if it has not changed since it was last cached by the browser.

The Last-Modified header is added automatically by IIS for static HTML files and can be added programmatically in dynamic pages (e.g. ASPX and PHP). When this header is present, the browser will revalidate the local, cached copy of an HTML page in each new browser session. If the page is unchanged the web server returns a 304 Not Modified response indicating the browser can use the cached version of the page.
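
A minimal sketch of that revalidation logic, using Node.js and an illustrative timestamp:

```javascript
// Sketch: answer If-Modified-Since with 304 Not Modified when the
// page hasn't changed, avoiding a full download of the HTML.
const http = require('http');
const lastModified = new Date('2007-12-10T00:00:00Z'); // illustrative

http.createServer((req, res) => {
  const since = req.headers['if-modified-since'];
  if (since && new Date(since) >= lastModified) {
    res.writeHead(304); // browser re-uses its cached copy
    res.end();
  } else {
    res.writeHead(200, { 'Last-Modified': lastModified.toUTCString() });
    res.end('<html>...page content...</html>');
  }
}).listen(8080);
```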

So to summarize:

  1. Don’t cache HTML
    • Use Cache-Control: no-cache for dynamic HTML pages
    • Use the Last-Modified header with the current file time for static HTML
  2. Cache everything else forever
    • For all other file types set an Expires header to the maximum future date your web server will allow
  3. To ‘expire’ a page element immediately, modify its URL in the HTML, e.g. by appending a query string parameter.

Why is Google so Fast?

November 5, 2007 in HTTP, Optimization

It's no coincidence that the most successful search engine on the planet is also the fastest to return results. Here are some time charts from HttpWatch for Google and its two closest competitors, Yahoo and Live.com:

Google.com returns its results page in 0.155 seconds:

Timechart for Google results page

Live.com returns its results page in 0.619 seconds:

Timechart for Live.com results page

Yahoo returns its results page in 1.131 seconds:

Timechart for Yahoo results page

These screenshots were created by visiting the home page of each search engine with an empty cache and then entering a search term while recording with the free Basic Edition of HttpWatch.

After clicking the ‘Search’ button, the results of the keyword search are delivered by Google approximately four times faster than Live.com and seven times faster than Yahoo. How do they manage to do this?

Clearly, the time taken to look up the results for a keyword is crucial, and there's no denying that Google's distributed super-computer, reputedly running on a cluster of one hundred thousand servers, is at the heart of that. However, Google has also optimized the results page by applying two of the most important aspects of web site performance tuning:

  • Make fewer HTTP requests
  • Minimize the size of the downloaded data

The Google results page requires only one network round-trip, compared to the four and eight round-trips required by Live.com and Yahoo respectively. They have achieved this by ensuring that the results page has no external dependencies: all its style information and JavaScript code has been inlined with <style> and <script> tags.
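
One way to achieve the same effect on your own pages is a build step that folds small external files into the HTML; here's a rough sketch with hypothetical file names (this is not a description of Google's actual build process):

```javascript
// Sketch: inline a small external script into the page at build time,
// removing one HTTP round-trip per inlined file.
const fs = require('fs');

let html = fs.readFileSync('results.html', 'utf8');
html = html.replace(
  '<script src="results.js"></script>',
  '<script>' + fs.readFileSync('results.js', 'utf8') + '</script>');
fs.writeFileSync('results.inlined.html', html);
```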

You might be wondering how the Google logo and other images are rendered on the results page, since Internet Explorer does not support inlined image data. Well, that's a little more subtle. When the user visits the Google home page, the image nav_logo3.png is pre-loaded by some background JavaScript (hence the separate page group in HttpWatch):

Pre-loading of Nav_logo3.png

The image wasn’t actually displayed on the home page but it was forced into the browser’s cache. When the search results page is rendered by the browser, it doesn’t need to fetch the image from google.com because it already has a local copy. It didn’t even register in HttpWatch as a (Cache) result because Internet Explorer loaded the item directly from its in-memory image cache.
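
The classic way to force an image into the cache like this is a few lines of background JavaScript on the home page; a sketch, with an illustrative path:

```javascript
// Sketch: pre-load an image that the *next* page will need, so it can
// be rendered straight from the browser cache.
window.onload = function () {
  const img = new Image();
  img.src = '/images/nav_logo3.png'; // illustrative path
};
```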

As you can see from the screenshot, nav_logo3.png doesn't just contain the Google logo. It also has a set of arrows and the Google Checkout logo. This is because the results page uses a technique called CSS sprites: all the images used on the results page are carefully sliced out of this single aggregate image with the CSS background-position property. This technique has allowed Google to load the search page images in a single round-trip.
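
In case you haven't met CSS sprites before, the sketch below shows the idea: every icon shares one downloaded image, and each element displays a different slice of it (the sizes and offsets are illustrative):

```javascript
// Sketch: show one 16x16 slice of an aggregate sprite image by
// offsetting the background under a fixed-size element.
const icon = document.createElement('span');
icon.style.display = 'inline-block';
icon.style.width = '16px';
icon.style.height = '16px';
icon.style.background = "url('/images/nav_logo3.png') no-repeat";
icon.style.backgroundPosition = '-32px 0'; // slice 32px from the left edge
document.body.appendChild(icon);
```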

The other major advantage of the Google results page, over its competitors, is the amount of data that is downloaded. You can see this by looking at the highlighted values in the HttpWatch page summaries:

Google results page summary

Live.com results page summary

Yahoo results page summary

The Google results page requires only 6 KB of data to be downloaded, whereas Live.com requires 16 KB and Yahoo 57 KB. All three search engines use HTTP compression, but Google's results page requires less data because:

  1. Their page is simpler so it requires less HTML
  2. They've avoided extra round-trips for script and CSS. Each round trip requires HTTP response headers and adds to the total amount of data that has to be downloaded. In addition, HTTP compression tends to be more efficient on a single large response than on several smaller ones.
  3. The HTML is written to minimize size at the expense of readability. It contains very little white space, no comments and uses short variable names and ids.

Not only do these techniques improve the performance of the Google results page, they have the added benefit of reducing the load on the Google web servers.
