Data Collection Part 1 - Single Methodologies

Originally, web analytics data was primarily collected through the use of log files. Later, the development of page tags and network data collectors allowed businesses to compile additional types of valuable data. Each of the three data collection methods has its own set of advantages and disadvantages.

 
To apply a more robust approach to tracking web visitor behavior, you can pursue a hybrid model that leverages the advantages of two or more data collection techniques while minimizing their disadvantages. For more information about hybrids, see our newsletter Data Collection Part 2 - Hybrid Methodologies.

Keeping track of the characteristics of each of the different techniques can be quite overwhelming. Below is a summary of each of the three techniques and the pros and cons associated with each.
 

Log Files

Log files, a form of server-side data collection, were originally created to help programmers, web developers and site administrators monitor visitor activity and debug problems within a website. The log files collect records of requests made to the web server by the visitor's web browser. When a website visitor requests information from the web server (e.g. a web page or graphic image), each request is recorded as an individual entry in the log file. In addition, log files typically record such things as error messages, status messages and transaction details.
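As a rough illustration of what one such record contains, the sketch below parses a single log entry. It assumes the widely used Combined Log Format, which this article does not specify, and the sample line, LOG_PATTERN and parse_line are hypothetical names chosen for the example.

```python
import re

# Assumed layout: Combined Log Format (host, identity, user, time, request,
# status, bytes, referrer, user agent). Real servers may be configured differently.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no body bytes were sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Made-up sample entry, for illustration only.
sample = (
    '192.0.2.10 - - [10/Oct/2005:13:55:36 -0700] '
    '"GET /products/index.html HTTP/1.1" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98)"'
)
entry = parse_line(sample)
print(entry["path"], entry["status"], entry["referrer"])
```

Metrics such as page views or referral sources are then a matter of counting and grouping these parsed fields over the reporting period.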

Often marketers utilize log files to monitor trends related to the number of visitors, unique visits, repeat visits, page views, referrals, referral source, time on site, keywords, keyword phrases, purchases, downloads, subscribers, form submissions, transactional data and the various related conversion rates for a given time period.

Log file software is normally licensed and hosted by the client or by an ISP rather than by a vendor, although a number of vendors support hosted log file based analysis. Log file reporting takes the log files accumulated over a set period of time and processes them as a batch, meaning the reports viewed are historical rather than live.

Log files have performance and collection costs. The collection and storage of log files demands processing cycles and memory from web servers. In the case of large websites with multiple-server configurations, the costs are compounded.

Log files - Advantages

Log files capture several valuable kinds of data that cannot be tracked via page tags, the most important of which are outlined below:

Search Engine Spider Reporting
Knowing the usage patterns of spiders can be valuable when engaging in search engine optimization. This data can be utilized to optimize the technology and content of the site for those spiders.

Complete download data
Log files make it possible to calculate the number of file downloads that completed successfully versus the number that were not fully completed.
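One way such a calculation might work is to compare the bytes sent for each request against the size of the file. The sketch below assumes the server logs the number of bytes actually sent and that the file sizes are known; FILE_SIZES, the sample entries and count_downloads are hypothetical.

```python
# Hypothetical file sizes in bytes, as they exist on disk.
FILE_SIZES = {"/downloads/whitepaper.pdf": 500_000}

# Hypothetical parsed log entries: (path, status, bytes_sent).
entries = [
    ("/downloads/whitepaper.pdf", 200, 500_000),
    ("/downloads/whitepaper.pdf", 200, 120_000),  # transfer ended early
    ("/downloads/whitepaper.pdf", 200, 500_000),
    ("/index.html", 200, 4_096),                  # not a tracked download
]


def count_downloads(entries, file_sizes):
    """Count completed and incomplete downloads for the tracked files."""
    completed = incomplete = 0
    for path, status, bytes_sent in entries:
        if path not in file_sizes or status != 200:
            continue
        if bytes_sent >= file_sizes[path]:
            completed += 1
        else:
            incomplete += 1
    return completed, incomplete


completed, incomplete = count_downloads(entries, FILE_SIZES)
print(f"completed={completed} incomplete={incomplete}")
```

This simple version ignores resumed or partial-range downloads, which a production report would need to handle separately.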

Server Error Code Reporting
Error code data is automatically recorded in most log files and can provide valuable insight into site functionality and design issues that would be difficult to detect through other means.
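A minimal sketch of an error report built from such data follows. It assumes the log entries have already been parsed into (path, status) pairs; the sample data and error_report are hypothetical.

```python
from collections import Counter

# Hypothetical parsed entries: (path, HTTP status code).
entries = [
    ("/index.html", 200),
    ("/old-page.html", 404),
    ("/old-page.html", 404),
    ("/checkout", 500),
    ("/images/logo.gif", 304),
]


def error_report(entries):
    """Tally requests that ended in a client (4xx) or server (5xx) error."""
    return Counter(
        (status, path) for path, status in entries if status >= 400
    )


report = error_report(entries)
for (status, path), hits in report.most_common():
    print(status, path, hits)
```

A page that repeatedly returns 404, for example, may point to a broken internal link or an outdated external referral.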

Log files - Data inaccuracies and limitations

Caching Servers, Browser Caching, and Proxy Servers
Proxy servers, used by most major companies and major ISPs (e.g. AOL), can create barriers to collecting data. For companies that rely on server-based measurements, proxy servers may prevent complete data from reaching the web server to be logged. For instance, if 3,000 people in a proxy group viewed a web page, the web server would log only one request, because the proxy server requests the web page only once and then distributes it to the 3,000 users in the proxy group. The result is an incomplete picture of visitor behavior.
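The proxy scenario can be simulated in a few lines. This is a simplified sketch in which the cache never expires; OriginServer and CachingProxy are hypothetical stand-ins, not real server software.

```python
class OriginServer:
    """Stands in for the web server whose log file is being analyzed."""

    def __init__(self):
        self.logged_requests = 0

    def get(self, path):
        self.logged_requests += 1  # every request that arrives is logged
        return f"<html>content of {path}</html>"


class CachingProxy:
    """Fetches each page from the origin once, then serves it from its cache."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def get(self, path):
        if path not in self.cache:
            self.cache[path] = self.origin.get(path)
        return self.cache[path]


origin = OriginServer()
proxy = CachingProxy(origin)

# 3,000 users behind the same proxy view the same page.
for _ in range(3000):
    proxy.get("/index.html")

print(origin.logged_requests)  # 1
```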

Similar issues may be caused by the use of browser navigation. When a visitor hits the "Back" or "Forward" button on their web browser, the web browser will use a locally cached copy of the web page that it saved from the last time the web page was visited. This results in a significant blind spot in the analytics, potentially masking site navigation and design issues that may be preventing visitors from accomplishing their goals on the site.

Robots and Spiders
Even though web server log analysis based systems sometimes go to extraordinary measures to filter out machine-generated traffic, the ever-changing landscape of machine-generated traffic requires an enormous ongoing investment to keep filters current. Machine-generated traffic places the same load on web servers as human-generated traffic and makes it difficult to understand what actual visitors are really doing and, therefore, whether marketing initiatives are truly effective.
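The simplest form of such a filter matches the user-agent string against a list of known robot markers, as sketched below. ROBOT_MARKERS is a hypothetical and deliberately short list; real filters need far longer lists and cannot catch robots that do not identify themselves, which is why the upkeep is costly.

```python
# Hypothetical, deliberately short list of user-agent fragments.
ROBOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_robot(user_agent):
    """Guess whether a request came from a robot based on its user-agent string."""
    agent = user_agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)


user_agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp)",
]
human_traffic = [ua for ua in user_agents if not is_robot(ua)]
print(len(human_traffic))
```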

Page/Data tags

Page tagging is a client-side data collection methodology. Page tags (also called data tags) are usually snippets of JavaScript code added to website pages. When executed by a visitor's browser, the tags collect information about the visitor's activity and transmit it to data-collection servers, typically through an image request with the parameters of interest appended to the query string.
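To make the shape of that image request concrete, the sketch below builds one and reads it back. In practice the request is issued by JavaScript in the browser; Python is used here only to show the data flow. The collector address and the parameter names (page, ref, vid) are assumptions made for the example.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical collection endpoint and parameter names.
COLLECTOR_URL = "http://collector.example.com/tag.gif"


def build_beacon_url(page, referrer, visitor_id):
    """Build the image request a page tag might send (client side)."""
    params = {"page": page, "ref": referrer, "vid": visitor_id}
    return f"{COLLECTOR_URL}?{urlencode(params)}"


def read_beacon(url):
    """Recover the tagged values from the query string (collection server side)."""
    query = parse_qs(urlparse(url).query)
    return {name: values[0] for name, values in query.items()}


url = build_beacon_url("/products/index.html", "http://www.example.com/", "abc123")
print(url)
print(read_beacon(url))
```

Because the tag decides which parameters go into the query string, custom values can be collected simply by adding more parameters.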

It is a common misconception that page tags are only offered by ASP vendors. Most leading software vendors also offer the option of page tag based data collection.

Page Tagging - Advantages

Page tags commonly produce accurate user data with less effort than log files and in some cases reduce the total amount of data that needs to be processed. Often more control can be exercised on what data elements are collected within page tags in comparison to log files.

Certain behavioral data such as form-filled entries and onsite dynamic variables such as discount levels, promotion info or custom variables are much more easily collected via page tags.

Page Tags - Limitations

Implementation Effort
Every page that is to be tracked needs to carry the tag, and placing it on each page individually may take a great deal of time and effort. To reduce that effort, tags may instead be embedded within a corporate template or added via "server side includes."

Error Codes
Most sites would require additional configuration to allow for a tag based solution to collect error codes.

File Download Information
Most tag based solutions only allow for the tracking of the start of the download so it is unknown whether or not the download was completed successfully.

JavaScript Disabled on Browser
If a user has JavaScript turned off in their browser (currently estimated at 2-3% of users), traffic from that segment of the population may be overlooked. However, the best client-side tracking technologies rely on JavaScript only to track unique users and set cookies, and will still record requests for web resources when JavaScript is turned off, meaning that data collection for this segment of users remains as accurate as with web server log analysis solutions.

Network Data Collection

In addition to server log files and page tags, web analytics data can be collected through a methodology referred to as "network data collection". Network data collection can take on many forms and possible configurations. However, in almost all cases, web analytics data is collected by some sort of packet sniffer that either resides on the web servers themselves or sits on an independent piece of hardware (hub, switch, proxy server, etc.) that is in front of the web servers or otherwise has access to the requests being made to them.
 

Network Data Collection - Advantages

In addition to the advantages that log files offer, network data collection has some strong advantages that should be seriously considered when formulating your data collection strategy.

Network Level Data
Network data collection provides access to a more granular level of technical data that can be used to determine server response times to requests and identify network related issues that could be interfering with user experience.

Data Consolidation
Network data collection often simplifies the process of consolidating and combining data from many servers, a task that is common when working with log files.

Additional Application Data
Some network data collectors are capable of collecting application server variables and other additional fields of data that are not captured in log files and would be difficult or impossible to capture with page tags.

First Time Visit Cookie Setting
Some network data collectors are capable of setting the visitor identification cookie themselves. This is a superior method of setting the cookie, because the first request the web server sees from a new visitor will not carry a visitor identification cookie.
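The idea can be sketched as follows: if a request arrives without the cookie, the collector assigns an identifier immediately, so even the first request can be attributed to the new visitor. The cookie name and identify_visitor are hypothetical, and this is an illustration of the logic rather than how any particular product works.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical cookie name


def identify_visitor(cookie_header):
    """Return (visitor_id, set_cookie); set_cookie is None for returning visitors."""
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    if COOKIE_NAME in cookies:
        return cookies[COOKIE_NAME].value, None
    # First request from a new visitor: assign an ID now so that this
    # request can already be attributed to the visitor.
    visitor_id = uuid.uuid4().hex
    new_cookie = SimpleCookie()
    new_cookie[COOKIE_NAME] = visitor_id
    return visitor_id, new_cookie[COOKIE_NAME].OutputString()


first_id, set_cookie = identify_visitor("")            # new visitor, no cookie yet
repeat_id, nothing = identify_visitor(set_cookie)      # browser sends the cookie back
print(first_id == repeat_id, nothing)
```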

Search Engine Spider Reporting
Knowing the usage patterns of spiders can be valuable when engaging in search engine optimization. This data can be utilized to optimize the technology and content of the site for those spiders.

Complete download data
Log files make it possible to calculate the number of file downloads that completed successfully versus the number that were not fully completed.

Server Error Code Reporting
Error code data is automatically recorded in most log files and can provide valuable insight into site functionality and design issues that would be difficult to detect through other means.

Network Data Collection - Limitations

Network data collection suffers from many of the same limitations as log files. Best practice for an implementation therefore normally involves some amount of page tagging to capture data that network data collection would otherwise miss.

Server Load / Network Latency
Network data collectors that are installed directly on web servers have to be carefully designed to minimize the amount of load that is introduced onto the servers. Additionally, when network data collectors are deployed on a hardware device, it is important to minimize any network latency that is introduced.

Data Loss Due to Overload
Some network data collectors, when overloaded with more requests than they can handle, are unable to capture data during those periods, which results in data loss.
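The effect can be illustrated with a collector that has a fixed-size buffer. This is a simplified sketch with hypothetical names (BoundedCollector, capture) and no consumer draining the buffer; real collectors differ in how they buffer and shed load.

```python
import queue


class BoundedCollector:
    """Collector with a fixed-size buffer; requests that do not fit are lost."""

    def __init__(self, capacity):
        self.buffer = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def capture(self, request):
        try:
            self.buffer.put_nowait(request)
        except queue.Full:
            self.dropped += 1  # counting drops at least makes the loss visible


collector = BoundedCollector(capacity=100)
for n in range(130):  # burst larger than the buffer, with nothing draining it
    collector.capture(f"GET /page/{n}")

print(collector.buffer.qsize(), collector.dropped)
```

Tracking the number of dropped requests, as done here, is one way to know how much data a report is missing for a given period.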

Additional Dependencies on IT Department to Implement
Because an additional component is inserted either into the network or onto the web servers, the IT department will often require additional resources to test, install, and maintain network data collectors.

Caching Servers, Browser Caching, and Proxy Servers
Proxy servers, used by most major companies and major ISPs (e.g. AOL), can create barriers to collecting data. For companies that rely on server-based measurements, proxy servers may prevent complete data from reaching the web server to be logged. For instance, if 3,000 people in a proxy group viewed a web page, the web server would log only one request, because the proxy server requests the web page only once and then distributes it to the 3,000 users in the proxy group. The result is an incomplete picture of visitor behavior.

Similar issues may be caused by the use of browser navigation. When a visitor hits the "Back" or "Forward" button on their web browser, the web browser will use a locally cached copy of the web page that it saved from the last time the web page was visited. This results in a significant blind spot in the analytics, potentially masking site navigation and design issues that may be preventing visitors from accomplishing their goals on the site.

Robots and Spiders
Even though web server log analysis based systems sometimes go to extraordinary measures to filter out machine-generated traffic, the ever-changing landscape of machine-generated traffic requires an enormous ongoing investment to keep filters current. Machine-generated traffic places the same load on web servers as human-generated traffic and makes it difficult to understand what actual visitors are really doing and, therefore, whether marketing initiatives are truly effective.

View Data Collection Part 2 - Hybrid Methodologies

 

Josh Manion
Chief Executive Officer
Stratigent, LLC

 

For more information please call 877-427-2900 or email info@stratigent.com.