A Guide to Automating HttpWatch with PHP

March 11, 2011 in Automation, HttpWatch

Stoyan Stefanov, the creator of smush.it, architect of YSlow 2.0 and an engineer at Facebook, has written an excellent three-part guide to controlling HttpWatch from PHP.

He’s even published a class that wraps the HttpWatch API and makes it easier to use from PHP.

In the rest of this blog post we wanted to follow up on a few points mentioned in Stoyan’s blog posts. These items don’t just apply to PHP. You may find them useful when automating HttpWatch using other languages such as C# or Ruby.

Hiding the IE Window

The ‘A better experience in IE’ section of Automating HttpWatch with PHP shows how you can hide the IE browser window during tests by separately creating IE and attaching HttpWatch to it.
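In outline, that approach looks something like this (a minimal sketch; the Attach call follows the HttpWatch automation reference and error handling is omitted):

// Create IE directly so its window can be hidden from the start
$ie = new COM("InternetExplorer.Application");
$ie->Visible = false;

// Attach HttpWatch to the existing IE instance
$controller = new COM("HttpWatch.Controller");
$plugin = $controller->Attach($ie);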

It’s also possible to do this using the Container property without having to separately create and attach to IE:

$controller = new COM("HttpWatch.Controller");
$plugin = $controller->IE->New();
$browser = $plugin->Container; // Only works with IE
$browser->Visible = false;

If you do decide to do this to stop browser windows popping up during a test, please bear these points in mind:

  1. Orphaned instances of iexplore.exe will be left running in the background if your script ever terminates before calling CloseBrowser (see the sketch after this list).
  2. In IE 8 and earlier, HttpWatch will not record a Render Start event because the hidden IE window does not get updated by Windows. However, the event will be recorded in Firefox 3.5+ and IE 9.
  3. Your performance measurements may not directly match user experience. It’s possible that current or future browsers may avoid certain rendering and processing actions if they detect that the output will not be in a visible window. For that reason, we recommend running tests in visible browser windows on a normal interactive desktop either on a physical machine or VM. Also, viewing a browser test through Remote Desktop is likely to have a significant negative impact on performance as the graphics and text making up the page have to be transferred over the network.
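For example, one way to guard against orphaned iexplore.exe instances is to make sure CloseBrowser is always called, even if the test fails. Here's a minimal sketch using try/finally (which requires PHP 5.5+); the method names are from the HttpWatch automation API:

$controller = new COM("HttpWatch.Controller");
$plugin = $controller->IE->New();
$plugin->Container->Visible = false; // hide the IE window (IE only)

try {
    $plugin->Record();                          // start recording HTTP traffic
    $plugin->GotoURL("http://www.example.com");
    $controller->Wait($plugin, -1);             // wait for the page load to complete
    $plugin->Stop();
} finally {
    $plugin->CloseBrowser();                    // always close IE to avoid orphans
}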

Opening the HttpWatch Plugin Window

It’s often handy to open the HttpWatch window in the browser when you are developing an automation script so that you can check that it is working as expected.

The OpenWindow method allows you to do this and specify whether you want the HttpWatch window docked or undocked. For example, here is the PHP code to open HttpWatch as an embedded window in the browser:

...
// Open docked HttpWatch window in browser
$plugin->OpenWindow(false);
...

Handling Differences Between HttpWatch Basic and Professional Editions

In part 3, Stoyan mentions using try-catch to handle the errors that occur when attempting to access data that is restricted in HttpWatch Basic Edition. While this is a valid approach, it risks hiding other errors that are not caused by the Basic Edition restrictions.

There are a couple of properties in the HttpWatch automation interface that help you handle the differences. The first is the IsBasicEdition property on the Controller class.

For example, here’s a high level test in PHP:

$controller = new COM("HttpWatch.Controller");
if ( $controller->IsBasicEdition )
{
    echo "\nThis test requires HttpWatch Professional Edition";
}

At a lower level, you can also check each request to see if it has been restricted using the IsRestrictedURL property:

...
if ( $entry->IsRestrictedURL)
{
    // Goes here in HttpWatch Basic Edition for URLs outside Alexa Top 20
    echo "\nSome of the properties for this request are restricted";
}
else
{
    // Goes here in HttpWatch Basic Edition for URLs in Alexa Top 20
    // or in HttpWatch Professional Edition for any URL
    echo "\nAll the properties of this request are available";
}
...

6 Things You Should Know About Fragment URLs

March 1, 2011 in HttpWatch

1. A Fragment URL Specifies A Location Within A Page

Any URL that contains a # character is a fragment URL. The portion of the URL to the left of the # identifies a resource that can be downloaded by a browser, and the portion on the right, known as the fragment identifier, specifies a location within that resource. For example:

http://www.httpwatch.com/features.htm#print

In HTML documents, the browser looks for an anchor tag <a> with an id attribute matching the fragment. For example, with the URL shown above the browser finds a matching tag in the Printing Support heading, marked up with something like:

<h3><a id="print"></a>Printing Support</h3>

and scrolls the page to display that section:

2. Fragments Are Not Sent in HTTP Request Messages

If you try using fragment URLs in an HTTP sniffer like HttpWatch, you’ll never see the fragment IDs in the requested URL or Referer header. The reason is that the fragment identifier is only used by the browser – it doesn’t affect which resource is returned from the server.

Here’s a screen shot of HttpWatch showing the traffic generated by refreshing a fragment URL:

So don’t expect to see fragment identifiers in your server-side code.
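For example, in PHP neither $_SERVER['REQUEST_URI'] nor $_GET will ever contain the fragment:

// Page requested as http://example.com/page.php?x=1#section
echo $_SERVER['REQUEST_URI']; // prints "/page.php?x=1" - the #section part never arrives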

3. Anything After the First # is a Fragment Identifier

It doesn’t matter if the first # appears to be contained within the host name, path or query string – it always indicates where the fragment identifier starts.

For example, here’s a URL that attempts to encode an HTML color and shape into the query string:

http://example.com/?color=#ffff&shape=circle

Unfortunately, the # in the HTML color makes the rest of the URL a fragment identifier and the server will see a single, empty color parameter in the query string:
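You can check this split locally with PHP's parse_url() function:

$parts = parse_url('http://example.com/?color=#ffff&shape=circle');

echo $parts['query'];    // "color="            - the color value is empty
echo $parts['fragment']; // "ffff&shape=circle" - never sent to the server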

4. Changing A Fragment ID Doesn’t Reload a Page but Does Create History

Fragments have a couple of handy features. First, if you manually change a fragment URL from something like this:

http://www.httpwatch.com/features.htm#filter

to this:

http://www.httpwatch.com/features.htm#print

the browser scrolls the page to the new location but doesn’t reload the page.

However, it does add an entry in the browser’s history so that clicking the Back button will go back to the original location in the page.

These features are particularly useful when used with JavaScript (see below) to create linkable URLs and history for pages that either use top level HTML frames or update their content dynamically with Ajax calls.

5. JavaScript Can Use window.location.hash to Change Fragment IDs

The hash property of the browser’s location object allows JavaScript to read and change the current page’s fragment identifier. As described in 4), this can be used to add history entries for a page without forcing a complete reload.

We recently deployed the help and automation reference for HttpWatch on our web site using the frame-based HTML generated by the help authoring tool. Although the content was easily accessible in the browser, the URL in the location bar didn’t change as you moved between topics, making it practically impossible to share URLs for topics of interest.

The solution was to use fragment identifiers and JavaScript to create linkable URLs. The fragment identifier specifies the embedded help topic page:

6. Googlebot Ignores Fragments By Default

The Googlebot is responsible for crawling sites to find content and embedded links that will become part of the Google search index. It fetches and parses HTML, but it’s not a full-blown browser and doesn’t have a JavaScript engine. As a consequence, it will normally ignore fragment identifiers and just look at the resource returned from the web server. Any JavaScript used by your page to load or build content will not be executed.

This means it would be impossible for Ajax-driven sites to be indexed and have their fragment URLs returned directly in Google searches. To overcome this problem, Google supports a convention that allows the Googlebot to turn fragment identifiers into query string parameters.

To use this indexing scheme you would first need to change all your fragment identifiers to start with a ! symbol:

http://www.example.com/ajax.html#mystate

would need to change to:

http://www.example.com/ajax.html#!mystate

The presence of the leading ! indicates to Google that you support this scheme.

Also, your page needs to be able to supply the HTML for a given state in response to a query string parameter named _escaped_fragment_. When the Googlebot needs the content for a given state it supplies the fragment identifier using a simple GET request and a query string value:

http://www.example.com/ajax.html?_escaped_fragment_=mystate
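On the server side, a PHP page could detect this request and return a static HTML snapshot of the requested state (a minimal sketch; renderStateAsHtml() is a hypothetical helper):

if (isset($_GET['_escaped_fragment_'])) {
    // The Googlebot is asking for the content behind #!mystate
    $state = $_GET['_escaped_fragment_'];
    echo renderStateAsHtml($state); // hypothetical function that builds the HTML snapshot
    exit;
}
// ...otherwise serve the normal Ajax-driven page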

Top 7 Myths about HTTPS

January 28, 2011 in Firefox, HTTPS, HttpWatch

Myth #7 – HTTPS Never Caches

People often claim that HTTPS content is never cached by the browser; perhaps because that seems like a sensible idea in terms of security. In reality, HTTPS caching is controllable with response headers just like HTTP.

Eric Lawrence explains this succinctly in his IEInternals blog:

It comes as a surprise to many that by-default, all versions of Internet Explorer will cache HTTPS content so long as the caching headers allow it. If a resource is sent with a Cache-Control: max-age=600 directive, for instance, IE will cache the resource for ten minutes. The use of HTTPS alone has no impact on whether or not IE decides to cache a resource. (Non-IE browsers may have different default behavior for caching of HTTPS content, depending on which version you’re using, so I won’t be talking about them.)

The slight caveat is that Firefox will only cache HTTPS resources in memory by default. If you want persistent caching to disk you’ll need to add the Cache-Control: Public response header.
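For example, a PHP script serving a resource over HTTPS could allow disk caching like this:

// Allow this HTTPS response to be cached, including in Firefox's disk
// cache, for up to ten minutes
header('Cache-Control: public, max-age=600');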

This screenshot shows the contents of the Firefox disk cache and the Cache-Control: Public response header in HttpWatch:

Myth #6 – SSL Certificates are Expensive

If you shop around you can find SSL certificates for about $10 a year, or roughly the same cost as the registration of a .com domain for a year.

(UPDATE: you can get domain-validated SSL certificates for free. See comment #1.)

The cheapest certificates don’t have the level of company verification provided by the more expensive alternatives but they do work with nearly all mainstream browsers.

Myth #5 – Each HTTPS Site Needs its Own Public IP Address

With the pool of IPv4 addresses running low this is a valid concern, and it’s true that only one SSL certificate can be installed on a single IP address. However, if you have a wildcard SSL certificate (from about $125 a year) you can have as many sub-domains as you like on a single IP address. For example, we run https://www.httpwatch.com, http://www.httpwatch.com and https://store.httpwatch.com on the same public IP address:

There is a trick to making this work on IIS 7, though. After adding a certificate you need to find it in the certificate manager and rename it so that the name starts with a *. If you don’t do this you cannot edit the hostname field for an HTTPS binding:

UPDATE: UCC (Unified Communications Certificate) supports multiple domains in a single SSL certificate and can be used where you need to secure several sites that are not all sub-domains.

UPDATE #2: SNI (Server Name Indication) allows multiple certificates for different domains to be hosted on the same IP address. On the server side it’s supported by Apache and Nginx, but not IIS. On the client it’s supported by IE 7+, Firefox 2.0+, Chrome 6+, Safari 2.1+ and Opera 8.0+. See comment #4 and comment #5.

UPDATE #3: IIS 8 now supports SNI.

Myth #4 – New SSL Certificates Have to be Purchased When Moving Servers or Running Multiple Servers

Buying an SSL certificate involves:

  1. Creating a CSR (Certificate Signing Request) on your web server
  2. Purchasing the SSL certificate using the CSR
  3. Installing the SSL certificate by completing the CSR process

These steps are designed to ensure that the certificate is safely transferred to the web server, and they prevent anyone from using the certificate if they intercept any emails or downloads containing it in step 2).

The result is that you cannot just use the files from step 2) on another web server. If you want to do that you’ll need to export the certificate in another format.

In IIS you can create a transferable .pfx file that is protected by a password:

This file can be imported onto other web servers by supplying the password again.
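If you need to script this rather than use the IIS UI, PHP's OpenSSL extension can produce the same kind of file (a sketch; the file names and password are placeholders):

// Bundle an existing certificate and private key into a
// password-protected PKCS#12 (.pfx) file for transfer to another server
$cert = file_get_contents('server.crt'); // placeholder certificate file
$key  = file_get_contents('server.key'); // placeholder private key file
openssl_pkcs12_export_to_file($cert, 'server.pfx', $key, 'aStrongPassword');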

Myth #3 – HTTPS is Too Slow

Using HTTPS isn’t going to make your site faster (actually it can – see below) but the overhead is mostly avoidable by following the tips in our HTTPS Performance Tuning blog post.

The amount of CPU resource required to encrypt the data can be reduced by compressing textual content and is usually not significant on servers with modern CPUs.

Extra TCP-level round trips are required to set up an HTTPS connection and some additional bytes have to be sent and received. However, you can see in HttpWatch that this overhead is small once the HTTPS connection has been made:

The initial visit to an HTTPS site is somewhat slower than HTTP due to the longer connection times required to set up SSL. Here’s a time chart of the page load for an HTTP site recorded in HttpWatch:

And here’s the same site accessed over HTTPS:

The longer connection times caused the initial page load to be about 10% slower. However, once the browser has active keep-alive HTTPS connections, a subsequent refresh of the page shows very little difference between HTTP and HTTPS.

First, the page refresh with HTTP:

and then with HTTPS:

It’s possible that some users may even find that the HTTPS version of a web site is faster than HTTP. This can happen if they sit behind a corporate HTTP proxy that normally intercepts, examines and records web traffic. An HTTPS connection will often just be forwarded as a simple TCP connection through the proxy because HTTPS traffic cannot be intercepted. It’s this bypassing that can lead to improved performance.

UPDATE: A blog post by F5 challenges the claim that the CPU overhead of SSL is no longer significant, but most of their arguments are refuted in this follow-up.

Myth #2 – Anything Can Go in Cookies and Query Strings with HTTPS

Although a hacker cannot intercept a user’s HTTPS traffic on the network and read their cookie or query string values directly, you still need to ensure that those values can’t be easily predicted.

For example, one of the early UK banking sites used simple counter based numeric values for the session id:

A hacker could use a dummy account to see how this cookie worked and find a recent value. They could then try manipulating the cookie value in their own browser to hijack other sessions with nearby session id values.
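The fix is to use session values that can’t be guessed. In PHP, for example, a session token can be generated from a cryptographically secure source (a minimal sketch, PHP 7+; the cookie name is just for illustration):

// Generate an unpredictable 128-bit session token instead of a counter
$sessionId = bin2hex(random_bytes(16));
setcookie('session_id', $sessionId);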

Query string values are also protected on the network by HTTPS, but they can still leak in other ways. For more details see How Secure Are Query Strings Over HTTPS.

Myth #1 – My Site Only Needs HTTPS for the Login Page

This is a commonly held view. The theory is that HTTPS will protect the user’s password during login, but that HTTPS is not needed after that.

The recently released Firesheep add-on for Firefox demonstrated the fallacy of this approach and how easy it is to hijack someone else’s session on sites like Twitter and Facebook.

The free public WiFi in a coffee shop is an ideal environment for session hi-jacking because:

  • The WiFi network doesn’t normally use encryption, so it’s very easy to monitor all traffic
  • The WiFi network probably uses NAT through a single IP address to access the internet. This means that a hijacked session appears to come from the same network address as the original login

There are lots of examples of this approach to security. For example, by default the Twitter sign-in page uses HTTPS but it then switches to HTTP after setting up the session-level cookies:

HttpWatch warns that these cookies were set up on HTTPS but the Secure flag wasn’t used to prevent them being used with HTTP:

Potentially, someone in a coffee shop with Firesheep could intercept your Twitter session cookies and then hijack your session to start tweeting on your behalf.
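In PHP, for example, the Secure flag (and ideally HttpOnly as well) can be set when the cookie is created:

// Argument 6 (true) marks the cookie as Secure so the browser only sends it
// over HTTPS; argument 7 (true) adds HttpOnly to hide it from JavaScript
setcookie('session_id', $sessionId, 0, '/', '', true, true);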

You can check your SSL/TLS configuration with our new SSL test tool, SSLRobot. It will also look for potential issues with the certificates, ciphers and protocols used by your site. Try it now for free!
