Data Collection Part 1 - Single Methodologies
Originally, web analytics data was primarily collected through log files. Later, the development of page tags and network data collectors allowed businesses to compile additional types of valuable data. Each of the three data collection methods has its own set of advantages and disadvantages.
Keeping track of the characteristics of each of the different techniques can be quite overwhelming. Below is a summary of each of the three techniques and the pros and cons associated with each.
Log Files
Often marketers utilize log files to monitor trends related to the number of visitors, unique visits, repeat visits, page views, referrals, referral source, time on site, keywords, keyword phrases, purchases, downloads, subscribers, form submissions, transactional data and the various related conversion rates for a given time period.
Log file software is normally licensed and hosted by the client or by an ISP rather than by a vendor, although a number of vendors support hosted log-file-based analysis. Log file reporting takes the log files accumulated over a set period of time and processes them as a batch, meaning the reports viewed are historical rather than live.
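To illustrate the batch nature of log file reporting, here is a minimal sketch that processes a small batch of log lines and counts page views per day. It assumes the server writes the Common Log Format; the sample lines, the `page_views_per_day` helper, and the rule that only successful GET requests count as page views are illustrative assumptions, not any vendor's method.

```python
import re
from collections import Counter

# Common Log Format: host ident user [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def page_views_per_day(lines):
    """Count successful GET requests per calendar day in one batch of log lines."""
    views = Counter()
    for line in lines:
        match = CLF_PATTERN.match(line)
        if match and match["method"] == "GET" and match["status"] == "200":
            views[match["day"]] += 1
    return views

# Illustrative batch; a real report would read the accumulated log files.
sample_batch = [
    '192.0.2.1 - - [10/Oct/2008:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.2 - - [10/Oct/2008:14:02:10 -0700] "GET /pricing.html HTTP/1.0" 200 5120',
    '192.0.2.1 - - [11/Oct/2008:09:15:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.3 - - [11/Oct/2008:09:16:42 -0700] "GET /missing.html HTTP/1.0" 404 512',
]
daily_views = page_views_per_day(sample_batch)
print(dict(daily_views))
```

Because the whole batch is read after the fact, the resulting counts describe what already happened rather than what is happening now.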
Log files have performance and collection costs. The collection and storage of log files demands processing cycles and memory from web servers. In the case of large websites with multiple-server configurations, the costs are compounded.
Log files - Advantages
Search Engine Spider Reporting
Knowing the usage patterns of spiders can be valuable when engaging in search engine optimization. This data can be utilized to optimize the technology and content of the site for those spiders.
Complete download data
Log files make it possible to calculate the number of file downloads that completed successfully vs. downloads that were not fully completed.
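One way such a calculation might work is to compare the bytes the server reports sending with the known size of the file. The sketch below assumes the log records bytes sent per request and that file sizes are available; the `FILE_SIZES` table, the sample requests, and the `summarize_downloads` helper are all hypothetical.

```python
# Hypothetical file sizes in bytes; a real report would read these from disk.
FILE_SIZES = {"/files/whitepaper.pdf": 1_000_000}

# (path, bytes_sent) pairs as they might be extracted from a log file.
download_requests = [
    ("/files/whitepaper.pdf", 1_000_000),
    ("/files/whitepaper.pdf", 250_000),   # transfer stopped part-way
    ("/files/whitepaper.pdf", 1_000_000),
]

def summarize_downloads(requests, file_sizes):
    """Split download requests into completed and incomplete counts."""
    completed = incomplete = 0
    for path, bytes_sent in requests:
        if bytes_sent >= file_sizes[path]:
            completed += 1
        else:
            incomplete += 1
    return completed, incomplete

completed, incomplete = summarize_downloads(download_requests, FILE_SIZES)
print(f"completed={completed} incomplete={incomplete}")
```

A production report would also need to account for resumed or ranged transfers, which this sketch ignores.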
Server Error Code Reporting
Error code data is automatically recorded in most log files and can provide valuable insight into site functionality and design issues that would be difficult to detect through other means.
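As a rough sketch of error code reporting, the following counts requests that ended in a 4xx or 5xx status, grouped by status and page. The sample entries and the `error_report` helper are illustrative assumptions; the status codes would come from the parsed log file.

```python
from collections import Counter

# (path, status) pairs as they might be extracted from a log file (illustrative).
log_entries = [
    ("/index.html", 200),
    ("/old-promo.html", 404),
    ("/old-promo.html", 404),
    ("/checkout", 500),
    ("/pricing.html", 200),
]

def error_report(entries):
    """Count requests per (status, path) for 4xx and 5xx responses only."""
    return Counter((status, path) for path, status in entries if status >= 400)

errors = error_report(log_entries)
for (status, path), count in errors.most_common():
    print(status, path, count)
```

A page that repeatedly returns 404 may point to a broken link, while 500-level errors suggest an application problem on that page.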
Log files - Data inaccuracies and limitations
Proxy servers, used by most major companies and major ISPs (e.g. AOL), can create barriers to collecting data. For companies that rely on server-based measurements, proxy servers may prevent complete data from reaching the web server to be logged. For instance, if 3,000 people in a proxy group viewed a web page, the web server would log only one request, because the proxy server requests the web page once and then distributes it to the 3,000 users in the proxy group. The result is an incomplete picture of visitor behavior.
Similar issues may be caused by the use of browser navigation. When a visitor hits the "Back" or "Forward" button on their web browser, the web browser will use a locally cached copy of the web page that it saved from the last time the web page was visited. This results in a significant blind spot in the analytics, potentially masking site navigation and design issues that may be preventing visitors from accomplishing their goals on the site.
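The proxy example can be made concrete with a small simulation. This is a simplified sketch: the `OriginServer` and `CachingProxy` classes are hypothetical stand-ins, and the proxy is assumed to cache every page indefinitely, which real proxies do not necessarily do.

```python
class OriginServer:
    """Stand-in for the web server whose log file is being analyzed."""

    def __init__(self):
        self.logged_requests = 0

    def get(self, path):
        self.logged_requests += 1  # each request that arrives is logged
        return f"<html>content of {path}</html>"


class CachingProxy:
    """Simplified proxy that fetches each page once and serves it from cache."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def get(self, path):
        if path not in self.cache:
            self.cache[path] = self.origin.get(path)
        return self.cache[path]


origin = OriginServer()
proxy = CachingProxy(origin)

pages_delivered = 0
for _ in range(3000):  # 3,000 people in the proxy group view the same page
    proxy.get("/index.html")
    pages_delivered += 1

print(pages_delivered, origin.logged_requests)
```

The proxy delivers the page 3,000 times, yet the origin server's count stays at one, which is the gap that server-based measurement cannot see.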
Robots and Spiders
Even though systems based on web server log analysis sometimes go to extraordinary measures to filter out machine-generated traffic, the ever-changing landscape of machine-generated traffic requires an enormous ongoing investment to keep filters current. Machine-generated traffic places the same load on web servers as human-generated traffic and makes it difficult to understand what actual visitors are really doing and, therefore, whether marketing initiatives are truly effective.
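A very simple form of such a filter matches the user agent string against a list of known robot signatures. The sketch below is illustrative only: the `BOT_SIGNATURES` list and the sample requests are assumptions, and the need to keep extending such a list is exactly the maintenance burden described above.

```python
# Illustrative substrings only; real filter lists are much longer and need upkeep.
BOT_SIGNATURES = ("googlebot", "slurp", "crawler", "spider")

def is_machine_traffic(user_agent):
    """Return True if the user agent contains a known robot signature."""
    ua = user_agent.lower()
    return any(signature in ua for signature in BOT_SIGNATURES)

# (path, user_agent) pairs as they might be extracted from a log file.
requests = [
    ("/index.html", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0"),
    ("/index.html", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
    ("/pricing.html", "ExampleCrawler/1.0"),
]

human_requests = [r for r in requests if not is_machine_traffic(r[1])]
print(len(human_requests), "of", len(requests), "requests kept")
```

A robot that does not identify itself in the user agent would pass straight through this filter, so signature matching is only a first line of defense.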
Page/Data tags
It is a common misconception that page tags are only offered by ASP vendors. Most leading software vendors also offer the option of page tag based data collection.
Page Tagging - Advantages
Certain behavioral data such as form-filled entries and onsite dynamic variables such as discount levels, promotion info or custom variables are much more easily collected via page tags.
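One common way for a tag to pass such variables is as query string parameters on a request to a collection server. The sketch below shows both sides of that exchange under stated assumptions: the collector URL, the parameter names, and both helper functions are hypothetical, and in practice the request would be built by a script in the page rather than by Python.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

COLLECTOR_URL = "https://collector.example.com/tag.gif"  # hypothetical endpoint

def build_tag_request(page, custom_vars):
    """Build the URL a page tag might request, with custom variables attached."""
    params = {"page": page, **custom_vars}
    return f"{COLLECTOR_URL}?{urlencode(params)}"

def parse_tag_request(url):
    """Recover the variables on the collector side from the requested URL."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}

tag_url = build_tag_request(
    "/checkout", {"discount_level": "15", "promo": "SPRING SALE"}
)
collected = parse_tag_request(tag_url)
print(tag_url)
print(collected)
```

Because the variables travel with the tag request itself, values that exist only in the rendered page can reach the analytics system without appearing in the web server's own log.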
Page Tags - Limitations
All pages that are to be tracked need to have the tag placed on each individual page, which may take a great deal of time and effort. Tags may also be embedded within a corporate template or via "server side includes."
Error Codes
Most sites would require additional configuration to allow for a tag based solution to collect error codes.
File Download Information
Most tag based solutions only allow for the tracking of the start of the download so it is unknown whether or not the download was completed successfully.
JavaScript Disabled on Browser
If a user has JavaScript turned off in their browser (currently estimated at 2-3% of users), traffic from that segment of the population may be overlooked. However, the best client-side tracking technologies rely on JavaScript only to track unique users and set cookies, and will still record requests for web resources when JavaScript is turned off, so data collection for this segment of users remains as accurate as with web server log analysis solutions.
Network Data Collection
Network Data Collection - Advantages
Network Level Data
Network data collection provides access to a more granular level of technical data that can be used to determine server response times to requests and identify network related issues that could be interfering with user experience.
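As a sketch of how response times might be derived from network-level data, the following averages the gap between the moment a request is seen and the moment its response is seen. The observation tuples, the millisecond timestamps, and the `average_response_ms` helper are illustrative assumptions about what a collector records.

```python
from collections import defaultdict

# (path, request_seen_ms, response_seen_ms) as observed on the network (illustrative).
observations = [
    ("/index.html", 1000, 1120),
    ("/index.html", 2000, 2080),
    ("/search", 3000, 3900),
]

def average_response_ms(observed):
    """Average server response time per path, in milliseconds."""
    durations = defaultdict(list)
    for path, request_ms, response_ms in observed:
        durations[path].append(response_ms - request_ms)
    return {path: sum(values) / len(values) for path, values in durations.items()}

averages = average_response_ms(observations)
print(averages)
```

A path whose average is far above the others is a candidate for the kind of issue that could be interfering with user experience.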
Data Consolidation
Network data collection often simplifies the process of consolidating and combining data from many servers, a task that is common when working with log files.
Additional Application Data
Some network data collectors are capable of collecting application server variables and other additional fields of data that are not captured in log files and would be difficult or impossible to capture with page tags.
First Time Visit Cookie Setting
Some network data collectors are capable of setting a visitor identification cookie. This is a superior method of setting the cookie, because the first request the web server sees from a new visitor will not carry the visitor identification cookie.
Search Engine Spider Reporting
Knowing the usage patterns of spiders can be valuable when engaging in search engine optimization. This data can be utilized to optimize the technology and content of the site for those spiders.
Complete download data
As with log files, network data collection makes it possible to calculate the number of file downloads that completed successfully vs. downloads that were not fully completed.
Server Error Code Reporting
Error code data is automatically recorded in most log files and can provide valuable insight into site functionality and design issues that would be difficult to detect through other means.
Network Data Collection - Limitations
Server Load / Network Latency
Network data collectors that are installed directly on web servers have to be carefully designed to minimize the amount of load that is introduced onto the servers. Additionally, when network data collectors are deployed on a hardware device, it is important to minimize any network latency that is introduced.
Data Loss Due to Overload
Some network data collectors, when overloaded with more requests than they can handle, are unable to capture data during those periods, which results in data loss.
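The effect can be illustrated with simple arithmetic. In the sketch below, the per-second traffic figures and the collector capacity are made-up numbers, and the `capture` helper assumes that anything above capacity in a given second is simply dropped.

```python
def capture(requests_per_second, capacity_per_second):
    """Return (captured, lost) request totals for a collector with fixed capacity."""
    captured = lost = 0
    for arrived in requests_per_second:
        kept = min(arrived, capacity_per_second)
        captured += kept
        lost += arrived - kept  # requests above capacity are never recorded
    return captured, lost

traffic = [400, 900, 1500, 700]  # illustrative requests arriving in each second
captured, lost = capture(traffic, capacity_per_second=1000)
print(f"captured={captured} lost={lost}")
```

Losses are concentrated in the busiest periods, which are often the periods an analyst most wants to measure.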
Additional Dependencies on IT Department to Implement
Because an additional component is inserted either into the network or onto the web servers, the IT department will often require additional resources to test, install, and maintain network data collectors.
Caching Servers, Browser Caching, and Proxy Servers
Proxy servers, used by most major companies and major ISPs (e.g. AOL), can create barriers to collecting data. For companies that rely on server-based measurements, proxy servers may prevent complete data from reaching the web server to be logged. For instance, if 3,000 people in a proxy group viewed a web page, the web server would log only one request, because the proxy server requests the web page once and then distributes it to the 3,000 users in the proxy group. The result is an incomplete picture of visitor behavior.
Similar issues may be caused by the use of browser navigation. When a visitor hits the "Back" or "Forward" button on their web browser, the web browser will use a locally cached copy of the web page that it saved from the last time the web page was visited. This results in a significant blind spot in the analytics, potentially masking site navigation and design issues that may be preventing visitors from accomplishing their goals on the site.
Robots and Spiders
Even though systems based on web server log analysis sometimes go to extraordinary measures to filter out machine-generated traffic, the ever-changing landscape of machine-generated traffic requires an enormous ongoing investment to keep filters current. Machine-generated traffic places the same load on web servers as human-generated traffic and makes it difficult to understand what actual visitors are really doing and, therefore, whether marketing initiatives are truly effective.
View Data Collection Part 2 - Hybrid Methodologies
Josh Manion Chief Executive Officer
Stratigent, LLC
For more information please call 877-427-2900 or email info@stratigent.com.