
4.2 Acquiring the Information

[Figure 6: Comparison of Web Interaction Monitoring Strategies (comparison table omitted)]

In this subsection we examine five techniques for gathering the session information required by the decision engine, and discuss their pros and cons. A summary is presented in Figure 6. We expect that a combination of these techniques will be required in a DFP toolkit if it is to be deployed to support personalization for a broad variety of web sites. After presenting the techniques, we make some general remarks.


Content generation scripts send high-level semantics to the decision engine. Assuming that all pages that need to be tracked are generated via executable scripts/programs (a reasonable assumption for large sites), an obvious approach to obtaining meaningful semantic information is to create or modify these scripts to gather or create the desired information, and then pass it on to the decision engine. The primary advantage of this approach is that the people developing the web pages have the best idea of the intended semantics of the pages, and thus of what the decision engine should receive. For this reason, we expect this to be the approach of choice when creating new web sites. Further advantages are that the actual HTTP requests and responses do not need to be transformed or parsed, and that HTTPS connections can be handled. The primary disadvantage concerns legacy sites, where modifying all the existing scripts to generate the high-level semantic data would be quite expensive. Another disadvantage is that maintenance of the site becomes more cumbersome.
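For concreteness, the following sketch (in Java) shows one way a content generation script could report a high-level semantic event to the decision engine over HTTP. The endpoint URL, the field names, and the DecisionEngineClient class are purely illustrative assumptions; the wire protocol between the scripts and the Vortex-based decision engine is not prescribed here.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

/**
 * Hypothetical helper that a content generation script could call to
 * report a high-level semantic event (e.g., "customer viewed the cart,
 * current total is X") to the decision engine.
 */
public class DecisionEngineClient {

    // Assumed location of the decision engine's event endpoint.
    private static final String ENGINE_URL = "http://decision-engine.example.com/event";

    public static void reportEvent(String sessionId, String eventType,
                                   String pageId, String detail) throws Exception {
        String body = "session=" + URLEncoder.encode(sessionId, "UTF-8")
                + "&event=" + URLEncoder.encode(eventType, "UTF-8")
                + "&page=" + URLEncoder.encode(pageId, "UTF-8")
                + "&detail=" + URLEncoder.encode(detail, "UTF-8");

        HttpURLConnection conn =
                (HttpURLConnection) new URL(ENGINE_URL).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        conn.getResponseCode();   // read the status so the request completes
        conn.disconnect();
    }
}

A shopping-cart page, for example, might call DecisionEngineClient.reportEvent(...) with the session ID, an event type such as "cart-view", and the current cart total just after rendering its content.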

So how can the DFP approach be used with large legacy web sites? In such cases, the only solution might be to try to extract meaningful information from the raw HTTP requests/responses. There are various ways to do so, some of which we discuss below.


Content generation scripts send raw HTML to the decision engine. This is a variation of the previous approach; in this case, however, only the raw HTTP requests/responses are forwarded by the scripts. This can be done by injecting the same (small) block of code into the scripts that generate each page of the web site. The advantage is that converting a legacy site to this approach is straightforward (assuming it was implemented with a server-side scripting language such as JSP or ASP), since the only function the extra piece of code performs is to forward the appropriate data (HTTP request and/or response) to the decision engine. However, since the injected code is uniform and generic, it cannot extract high-level semantic information from each page. This means that detailed knowledge of what information to extract for specific (categories of) pages, and how to extract it, needs to be built either into the decision engine or into some other process. Depending on the level of information to be extracted, this can cause maintenance problems whenever the structure of the corresponding pages changes. Moreover, this approach is sensitive to the language and/or platform in which the web site is implemented; e.g., if the CGI scripts are based on C++ then it may be hard to know where to inject the code block, making the approach infeasible.
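A minimal sketch of such a uniform, injectable block is given below; the endpoint URL and class name are assumptions, and a real deployment would need to decide exactly which parts of the request and response to forward.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Hypothetical "uniform" code block that could be injected into every
 * content generation script of a legacy site. It simply forwards the raw
 * request URL and the generated HTML to the decision engine; all semantic
 * interpretation is left to the engine (or a separate extraction process).
 */
public class RawForwarder {

    private static final String ENGINE_URL = "http://decision-engine.example.com/raw";

    public static void forward(String requestUrl, String rawHtml) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(ENGINE_URL).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("X-Original-URL", requestUrl);
            conn.setRequestProperty("Content-Type", "text/html");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(rawHtml.getBytes("UTF-8"));
            }
            conn.getResponseCode();
            conn.disconnect();
        } catch (Exception e) {
            // Monitoring must never break page generation, so failures are ignored here.
        }
    }
}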


Wrapper scripts. The idea here is that the web server can be configured so that all web requests and responses (that need to be tracked) are filtered through executable scripts, which perform the task of extracting the relevant information and contacting the Vortex server to determine the appropriate response. Note that these wrapper scripts could reside on the web server(s) that are serving page requests, or on separate machines. In contrast to the previous approaches, the advantage in this scenario is that the actual content generation is not affected; this method is simply layered on top. Moreover, HTTPS connections can be handled, since the wrapper script receives the customer request after it is decrypted by the server, and parses the response before it is encrypted and sent to the client. A disadvantage is that the HTML pages being served need to be transformed, since the links/forms/frames in existing pages must now go through the wrapper, whereas other objects (e.g., images pre-loaded via Javascript) need to be accessed directly. It can be hard to automatically transform all underlying pages, especially if much of the destination URL computation is done inside client-side script code, which would require the wrapper to parse the corresponding scripting language. Also, the HTML pages input to the wrapper need to be parsed and translated into higher-level semantic information, either by the wrapper or by the decision engine. Finally, session tracking information may be lost under this approach if the web server is using a cookie-based scheme that tracks sessions for some but not all of the web site pages. In that case, replacing URLs so that they access the wrapper scripts may disrupt the web site's scheme for placing cookies at the customer site. To remedy this, some re-writing of the web site scripts would be required. However, this problem can be eliminated if the web server allows URL re-direction based on customizable rules; in that case, the customer sees the same URL, no HTML re-writing is required, and session tracking is not a problem.
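The HTML transformation step that this approach relies on can be sketched roughly as follows. The wrapper URL prefix and the regular expression are illustrative only; as noted above, a robust rewriter would need a real HTML parser and would still struggle with URLs computed in client-side script.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustrative sketch of the HTML transformation a wrapper script must
 * perform: rewriting href/action attributes so that subsequent requests
 * are routed back through the wrapper.
 */
public class LinkRewriter {

    // Assumed wrapper entry point; the original URL is carried as extra path info.
    private static final String WRAPPER_PREFIX = "/wrapper";

    // Naive pattern for attribute values; real pages need a proper HTML parser.
    private static final Pattern LINK =
            Pattern.compile("(href|action)=\"(/[^\"]*)\"", Pattern.CASE_INSENSITIVE);

    public static String rewrite(String html) {
        Matcher m = LINK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(
                    m.group(1) + "=\"" + WRAPPER_PREFIX + m.group(2) + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }
}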


Proxies. A proxy can be inserted between a company's web site and the end user. The proxy is responsible for tracking user requests, extracting the site responses, and contacting the decision server to determine the appropriate intervention strategy. An advantage is that HTML page transformation is not required. However, there are several disadvantages. First, if SSL tunneling is being used, then the proxy needs to serve as the receiving end of the tunnel and perform the encryption/decryption of the web traffic. Moreover, the proxy also needs to extract higher-level semantic information from the HTML. Lastly, the use of one or more proxies may impact scalability, because the proxy servers can become a bottleneck; it is important to have enough proxies to cover the anticipated load on the web site.
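A bare-bones monitoring proxy along these lines might look like the sketch below, built on the JDK's built-in HTTP server. The port, the origin host, and the point at which the decision engine would be contacted are assumptions; headers, POST bodies, error handling, and HTTPS are all omitted.

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

/**
 * Minimal illustrative proxy: forwards GET requests to the origin site,
 * returns the response to the client, and (as indicated by the comment)
 * would pass the request/response pair to the decision engine.
 */
public class MonitoringProxy {

    private static final String ORIGIN = "http://origin.example.com";  // assumed origin site

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", MonitoringProxy::handle);
        server.start();
    }

    private static void handle(HttpExchange exchange) throws IOException {
        URL target = new URL(ORIGIN + exchange.getRequestURI());
        HttpURLConnection conn = (HttpURLConnection) target.openConnection();
        conn.setRequestMethod("GET");

        byte[] body;
        try (InputStream in = conn.getInputStream()) {
            body = in.readAllBytes();
        }

        // Here the proxy would forward the request URL and `body`
        // to the decision engine before (or while) replying to the client.
        exchange.sendResponseHeaders(conn.getResponseCode(), body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
        conn.disconnect();
    }
}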


Web Server Extensions. Most popular web servers [4] (Apache, Netscape Enterprise, Microsoft IIS) have an API (Apache modules, Netscape's NSAPI, Microsoft's ISAPI) that can be used to extend the functionality provided by the server. In particular, these APIs can be used to attach monitoring hooks into the web server itself, thus gaining low-level access to all web interactions. The advantage is that no transformation of the generated HTML responses is required, and secure connections can be handled. The disadvantage of needing to extract higher-level semantic information from the HTML responses still applies. Moreover, writing server extensions is tricky (since they should not impact reliability or scalability) and server-specific.
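The APIs named above are C-level, server-specific interfaces, so we do not reproduce them here; as a rough analogue, the sketch below uses a Java servlet filter to illustrate the same idea of a monitoring hook that observes every interaction without altering content generation. The filter class and its notifyDecisionEngine call are hypothetical, not part of any of the named server APIs.

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

/**
 * Illustrative monitoring hook in the spirit of a server extension:
 * every request passing through the container is observed, but the
 * response itself is generated exactly as before.
 */
public class MonitoringFilter implements Filter {

    @Override
    public void init(FilterConfig config) { }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        if (request instanceof HttpServletRequest) {
            HttpServletRequest http = (HttpServletRequest) request;
            // Hypothetical call: hand the low-level interaction to the
            // decision engine (semantic extraction happens elsewhere).
            notifyDecisionEngine(http.getRequestURI(), http.getQueryString());
        }
        chain.doFilter(request, response);   // let the server produce the page as usual
    }

    @Override
    public void destroy() { }

    private void notifyDecisionEngine(String uri, String query) {
        // Placeholder; see the earlier sketches for one way to do this over HTTP.
    }
}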

We conclude this subsection with some general remarks about these techniques and our experience with two of them.

We first consider session tracking. Three techniques are commonly used for tracking a session in web sites: encoding the session ID into the URLs sent to and requested by the customer, placing cookies on the customer machine, and placing the session ID into a hidden form field. (The last technique requires that all pages transmitted to the user be generated via form submissions.) In order for the decision engine to know the session of a page request, the session ID must be passed to the engine along with the other page information. The session ID can be sent explicitly, or it can be sent as it occurs in the HTML of the requested page, in which case the encoding scheme used by the web site is used to extract the session ID.
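As a small illustration, a helper for recovering the session ID from the first two encodings might look as follows; the cookie name and the URL parameter name are assumptions about a particular site's scheme.

/**
 * Illustrative extraction of a session ID from the two most common
 * encodings: a cookie header and a URL query parameter.
 */
public class SessionIdExtractor {

    // Assumed names used by the site's session tracking scheme.
    private static final String COOKIE_NAME = "SESSIONID";
    private static final String URL_PARAM   = "sid";

    /** Looks for "SESSIONID=..." inside a raw Cookie header. */
    public static String fromCookieHeader(String cookieHeader) {
        if (cookieHeader == null) return null;
        for (String part : cookieHeader.split(";")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length == 2 && kv[0].equals(COOKIE_NAME)) return kv[1];
        }
        return null;
    }

    /** Looks for "sid=..." in the query string of a requested URL. */
    public static String fromQueryString(String query) {
        if (query == null) return null;
        for (String part : query.split("&")) {
            String[] kv = part.split("=", 2);
            if (kv.length == 2 && kv[0].equals(URL_PARAM)) return kv[1];
        }
        return null;
    }
}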

We now turn to the issue of scalability. In particular, how do the above techniques work when a web site is supported by a web server farm rather than a single web server? There are two main issues. First, in the case of a web server farm there may also need to be a farm of decision engines. Because the log of a given customer session will generally be maintained in the main memory of a single decision engine, it will be important that all decisions about that session be made by the same decision engine, even if different web servers are being used to serve the pages. This can be accomplished by encoding the decision engine ID inside the session ID. A load-balancing strategy can be implemented to distribute customer sessions across the decision engines. Furthermore, in applications such as MIHU, if all of the decision engines reach saturation then the system can decide for some customers that they will not receive any MIHU decisions. This permits a graceful degradation of service in the face of unexpectedly high load.
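One simple way to encode the decision engine ID inside the session ID, and to recover it later for routing, is sketched below; the ID format and the round-robin assignment are assumptions made for illustration, not necessarily the scheme used in the prototype.

import java.util.UUID;

/**
 * Illustrative scheme for tying a session to a particular decision engine:
 * the engine's index is embedded as a prefix of the session ID, so any web
 * server in the farm can route later requests to the right engine.
 */
public class EngineAffinity {

    /** Creates a session ID of the (assumed) form "<engineId>-<random>". */
    public static String newSessionId(int engineId) {
        return engineId + "-" + UUID.randomUUID();
    }

    /** Recovers the engine responsible for an existing session. */
    public static int engineFor(String sessionId) {
        return Integer.parseInt(sessionId.substring(0, sessionId.indexOf('-')));
    }

    /** Simple round-robin assignment of new sessions to engines. */
    public static int pickEngine(long sessionCounter, int numEngines) {
        return (int) (sessionCounter % numEngines);
    }
}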

The second scalability issue concerns how the added expense of transmitting information from web server to decision engine will impact performance. In all cases except for proxies, the processing involved in transmitting to the decision engine can be performed on the web server. Thus, each server will be more loaded, but no architectural problems arise. In the case of proxies (and wrapper scripts if they are implemented on separate machines) there is a possibility of the proxy becoming a bottleneck.

We have built two versions of the MIHU prototype at Bell Labs, which explored some of the issues discussed above. In the first case, we modified the content generation CGI scripts used by the web site. In the second case, we wrote a wrapper servlet. Here, the actual request URLs were passed to the wrapper servlet via its PATH_INFO environment variable. The servlet then performed the original request and parsed the generated HTML response to extract the relevant information. Before shipping the response to the customer, the servlet also modified the links/forms/frames in the page to go through the servlet, and inserted a BASE tag pointing to the original URL so that any relative accesses (e.g., images pre-loaded inside Javascript) would still work. None of the actual pages stored at the web site needed to be modified.
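A skeleton of such a wrapper servlet is sketched below. It reads the original URL from the extra path information, fetches the page from the origin site, and inserts a BASE tag before returning the page. The origin host is an assumption, the decision engine call and link rewriting are only indicated by comments, and the sketch is not the actual Bell Labs code.

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Skeleton of a wrapper servlet: the original URL arrives as extra path
 * info, the original page is fetched and parsed, links are rewritten to
 * come back through the servlet, and a BASE tag keeps relative accesses
 * (e.g., images pre-loaded from Javascript) pointing at the original site.
 */
public class WrapperServlet extends HttpServlet {

    private static final String ORIGIN = "http://origin.example.com";  // assumed site

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String originalPath = req.getPathInfo();          // e.g. "/catalog/item.jsp"
        URL target = new URL(ORIGIN + originalPath);

        HttpURLConnection conn = (HttpURLConnection) target.openConnection();
        String html;
        try (InputStream in = conn.getInputStream()) {
            html = new String(in.readAllBytes(), "UTF-8");
        }

        // (1) Extract the relevant information and contact the decision engine.
        // (2) Rewrite links/forms/frames to route through this servlet,
        //     e.g., with something like the LinkRewriter sketch given earlier.
        // (3) Insert a BASE tag so relative URLs still resolve against the origin.
        html = html.replaceFirst("(?i)<head>", Matcher.quoteReplacement(
                "<head><base href=\"" + ORIGIN + originalPath + "\">"));

        resp.setContentType("text/html");
        resp.getWriter().write(html);
        conn.disconnect();
    }
}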

Importantly, any of the monitoring methods presented above can be phased into a site; e.g., initially the tracking can focus on only part of the site, and on only part of the relevant data.

