What you can know from web statistics
The only things you
can know for certain are the number of requests made
to your server, when they were made, which files were
asked for, and which host asked you for them.
You can also know
what people told you their browsers were, and what the
referring pages were. You should be aware, though,
that many browsers lie deliberately about what sort of
browser they are, or even let users configure the
browser name. Also, a few browsers send incorrect
referrers, telling you the last page that the user was
on even if they weren't referred by that page. And
some people use "anonymizers", which
deliberately send false browser names and referrers.
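As a rough illustration of the fields listed above, here is a minimal sketch that pulls them out of one log line. It assumes the server writes the common "combined" log format; the sample line, the `PATTERN` regular expression and the `parse_line` helper are made up for this example and are not part of any particular program.

```python
import re
from datetime import datetime

# Hypothetical sample line, assuming the "combined" log format.
LINE = ('192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://example.com/start.html" "Mozilla/4.08 [en] (Win98; I)"')

# Host, time, requested file, status, size, then the two fields that
# are only what the browser chose to tell you: referrer and browser name.
PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it doesn't match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields

record = parse_line(LINE)
print(record["host"], record["time"], record["path"])
print(record["referrer"], record["agent"])  # reported by the browser, not verified
```

The host, time and path come from the server's own view of the request. The referrer and agent are simply copied from what the client sent, which is why they deserve less trust.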
What you can't know
1. You can't tell
the identity of your readers. Unless you explicitly
require users to provide a password, you don't know
who connected or what their email addresses are.
2. You can't tell
how many visitors you've had. You can guess by looking
at the number of distinct hosts that have requested
things from you. Indeed this is what many programs
mean when they report "visitors". But this
is not always a good estimate for three reasons.
First, if users get your pages from a local cache
server, you will never know about it. Secondly,
sometimes many users appear to connect from the same
host: either users from the same company or ISP, or
users using the same cache server. Finally, sometimes
one user appears to connect from many different hosts.
AOL now allocates users a different hostname for every
request. So if your home page contains 10 graphics and
an AOL user visits it, most programs will count that
as 11 different visitors!
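The distinct-hosts estimate from point 2 can be sketched in a few lines. The hostnames and the `count_distinct_hosts` function below are hypothetical, chosen only to reproduce the two failure modes described: one user seen as many hosts, and many users seen as one host.

```python
def count_distinct_hosts(requests):
    """Estimate 'visitors' as the number of distinct hosts, as many programs do."""
    return len({host for host, _path in requests})

# One user loads a home page with 10 graphics, but each request
# arrives from a different (hypothetical) proxy hostname.
one_user = [("proxy-0.example.net", "/index.html")]
one_user += [(f"proxy-{i}.example.net", f"/img/{i}.gif") for i in range(1, 11)]

# Three different users behind the same (hypothetical) cache server.
three_users = [("cache.example.org", "/index.html")] * 3

print(count_distinct_hosts(one_user))     # one real person
print(count_distinct_hosts(three_users))  # three real people
```

The first count overstates the single visitor elevenfold and the second understates three visitors as one, and nothing in the log tells you which kind of error you are looking at.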
3. You can't tell
how many visits you've had. Many programs, under
pressure from advertisers' organisations, define a
"visit" (or "session") as a
sequence of requests from the same host until there is
a half-hour gap. This is an unsound method for several
reasons. First, it assumes that each host corresponds
to a separate person and vice versa. This is simply
not true in the real world, as discussed in the previous
point. Secondly, it assumes that there is never a
half-hour gap in a genuine visit. This is also untrue.
I quite often follow a link out of a site, then step
back in my browser and continue with the first site
from where I left off. Should it really matter whether
I do this 29 or 31 minutes later? Finally, to make the
computation tractable, such programs also need to
assume that your logfile is in chronological order: it
isn't always, and analog will produce the same results
however you jumble the lines up.
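To make the half-hour rule from point 3 concrete, here is a minimal sketch of that kind of "visit" counting. The `count_visits` function and its data are assumptions for illustration, not the method of any specific program. It sorts the requests itself, which sidesteps the chronological-order assumption but none of the others.

```python
from datetime import datetime, timedelta

def count_visits(requests, gap=timedelta(minutes=30)):
    """Count 'visits': runs of requests from one host with no gap of `gap` or more."""
    last_seen = {}
    visits = 0
    # Sort by time so the result does not depend on the order of the log lines.
    for host, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(host)
        if previous is None or when - previous >= gap:
            visits += 1
        last_seen[host] = when
    return visits

start = datetime(2000, 1, 1, 12, 0)
back_after_29 = [("host.example.net", start),
                 ("host.example.net", start + timedelta(minutes=29))]
back_after_31 = [("host.example.net", start),
                 ("host.example.net", start + timedelta(minutes=31))]

print(count_visits(back_after_29))
print(count_visits(back_after_31))
```

The same reader doing the same thing is counted as one visit in the first case and two in the second, purely because of a two-minute difference.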
4. Cookies don't
solve these problems. Some sites try to count their
visitors by using cookies. This reduces the errors.
But it can't solve the problem unless you refuse to
let people read your pages who can't or won't take a
cookie. And you still have to assume that your
visitors will use the same cookie for their next
request.
5. You can't follow
a person's path through your site. Even if you assume
that each person corresponds one-to-one to a host, you
don't know their path through your site. It's very
common for people to go back to pages they've
downloaded before. You never learn about these
subsequent visits to those pages, because the browser
has cached them. So you can't track their path through
your site accurately.
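The effect of browser caching in point 5 can be simulated with a small sketch. It assumes an idealised browser that caches every page and never re-requests it; the `server_view` function and the page names are hypothetical.

```python
def server_view(true_path):
    """Return the requests the server sees if the browser caches every page."""
    cached = set()
    seen = []
    for page in true_path:
        if page not in cached:
            seen.append(page)   # first time: a real request reaches the server
            cached.add(page)    # later views are served from the browser cache
    return seen

# The reader goes back to the home page before moving on.
true_path = ["/index.html", "/a.html", "/index.html", "/b.html"]
logged = server_view(true_path)
print(logged)
```

From the log alone it looks as if the reader went straight from /a.html to /b.html, although the true path went back through the home page.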
6. You often can't
tell where they entered your site, or how they found
out about you. If they are using a cache server,
they will often be able to retrieve your home page
from their cache, but not all of the subsequent pages
they want to read. Then the first request you see
from them will be for a page in the middle of their
true visit.
7. You can't tell
how they left your site, or where they went next. They
never tell you about their connection to another site,
so there's no way for you to know about it.
8. You can't tell
how long people spent reading each page. Once again,
you can't tell which pages they are reading between
successive requests for pages. They might be reading
some pages they downloaded earlier. They might have
followed a link out of your site, and then come back
later. They might have interrupted their reading for a
quick game of Minesweeper. You just don't know.
9. You can't tell
how long people spent on your site. Apart from the
problems in the previous point, there is one other
complete show-stopper. Programs which report the time
on the site count the time between the first and the
last request. But they don't count the time spent on
the final page, and this is often the majority of the
whole visit.
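The arithmetic behind point 9 is easy to check with made-up numbers. In this sketch the request times are what a log would record, while the reading times are hypothetical values that only the reader could know.

```python
from datetime import datetime, timedelta

start = datetime(2000, 1, 1, 12, 0)

# What the log records: three page requests, one minute apart.
request_times = [start,
                 start + timedelta(minutes=1),
                 start + timedelta(minutes=2)]

# What really happened (never sent to the server): the reader skimmed
# two pages and then spent ten minutes on the last one.
true_reading_times = [timedelta(minutes=1),
                      timedelta(minutes=1),
                      timedelta(minutes=10)]

reported = max(request_times) - min(request_times)
actual = sum(true_reading_times, timedelta())

print("reported time on site:", reported)
print("actual time on site:  ", actual)
```

Here the reported figure is 2 minutes against a true 12 minutes, because the time spent on the final page leaves no trace in the log.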