NavList:

A Community Devoted to the Preservation and Practice of Celestial Navigation and Other Methods of Traditional Wayfinding

Rejecting outliers: was: Kurtosis.
From: George Huxtable
Date: 2010 Dec 31, 14:54 -0000
The threadname is changed once again, from "kurtosis" (a mathematician's
word far beyond the vocabulary of navigators, which displays Frank's
erudition) to the more familiar "Rejecting outliers", which is what the
discussion seems to be really about.

I was trying to discover exactly what Peter Fogg himself was actually
claiming his procedure could accomplish. Not what Frank Reed thought that
it might accomplish, though those views may also be of some interest.

And I used the word "magic" to describe that procedure, because nowhere,
that I can recall, has Peter Fogg explained, in numerical terms that we
might agree on (or otherwise) what his criteria are for accepting some
observations and rejecting others. Which brought this response, from Frank-
"Now come on, George. Magic?? I really believe that this attitude has made
it nearly impossible for you to see something simple and useful."

Oh? What is this "something simple and useful" that Frank believes my
attitude has made it nearly impossible for me to see? Is it, I wonder, the
virtue of plotting observations, to allow the practised eye to pick out
oddnesses? Well, I'm all in favour of that, and can not recall any
arguments I've made against it. As a one-time experimental physicist, such
procedures have played a large part in my working life. And I see no reason
why that should not apply to navigational procedures also. The human eye
and brain can work together powerfully, and often provide a workable
alternative to full mathematical analysis. What I have argued against are
spurious claims that ascribe some exceptional qualities to those procedures
that they do not, and cannot, possess.

Now let's get on to the real nub of this discussion, the separation of
"outliers" from what I will call useful data.

I am aware that a Gaussian distribution is no more than a convenient
approximation, representing observed scatter in measurements of many types,
that seems to work well in practice. And there are many reasons why some
observations might well lie outside an expected Gaussian error-band: they
are commonly ascribed to some sort of "blunder". Such blunders can come in
all sorts of unpredictable shapes and sizes, and it would seem impossible
to predict any frequency-distribution for errors of that type. They would
certainly corrupt any set of otherwise-valid measurements, and need to be
detected and discarded, to the extent that is possible. That is the
challenge that mariners face, to somehow distinguish the good from the bad.

Frank writes-"In the real world, at least from every practical set of
observations that I have seen, the probability of points "in the tails" of
the distribution is much higher. For example, you might get a 3.6 minute
of arc error one time out of a hundred observations or even one in fifty,
or in other words, with hundreds of times greater frequency than the
standard normal distribution would imply."

Is that comment intended to apply just to sextant altitudes, or generally,
to other fields of measurement as well, as seems to be implied? He appears
to be challenging the very basis of error-theory, rooted as it is in the
Gaussian distribution, which has provided a useful model for statisticians
for many years. He is perfectly entitled to do so, but to be taken
seriously will need to offer much firmer evidence than the anecdotal
statements provided above.

Frank then offers what he describes as "an easy way to model such
observations" by combining two Gaussian distributions; one with a suggested
standard deviation of 0.7', and another with a SD of 3.0'. How does anyone
use such a "model"? What is it based on? Where do any "blunders" fit in?
How were the parameters (0.7',3.0', 80%) derived? Is it intended to
represent real-life, perhaps Frank's own experience with measuring
altitudes? Or has it just been imagined, dreamed up out of nothing? Facts,
please.

For one thing, it depends on whether all his fifty or a hundred
observations have been made under comparable conditions. I imagine that
most, or perhaps all, of Frank's were made from on land, but let me provide
a maritime example which might well produce the sort of distribution he
describes. Take a five-week ocean passage, in which benign weather has
prevailed for four weeks of the five, resulting in a standard deviation in
altitudes of 0.7'. But for one week it's been stormy, and over that week
the SD has increased, to 3.0'. If we lump all observations together, over
the five weeks, we will get a non-Gaussian distribution of the overall
scatter. But that doesn't imply that on a calm day we are likely to see
scatter in the region of 4'. In the same way, if we are to analyse a
lifetime's experience of measuring altitudes, such measurements have to be
assessed with some care, taking like with like.
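
The arithmetic behind this passage example can be sketched as follows. This
is only an illustration, not anything either correspondent computed: it
assumes zero-mean Gaussian scatter within each kind of weather and an equal
number of sights in each week, so that four calm weeks out of five give an
80/20 split. The helper name `tail` is made up for the example.

```python
import math

def tail(x, sd):
    """P(|error| >= x) for a zero-mean Gaussian with standard deviation sd."""
    return math.erfc(x / (sd * math.sqrt(2)))

CALM_SD, STORM_SD = 0.7, 3.0  # minutes of arc, the figures used in the text
CALM_SHARE = 4 / 5            # assumption: sights spread evenly over five weeks
LIMIT = 4.0                   # size of error being asked about, minutes of arc

p_calm = tail(LIMIT, CALM_SD)    # chance of such an error on a calm day
p_storm = tail(LIMIT, STORM_SD)  # chance of such an error in the stormy week
p_pooled = CALM_SHARE * p_calm + (1 - CALM_SHARE) * p_storm

# Share of the pooled large errors that come from the stormy week.
storm_fraction = (1 - CALM_SHARE) * p_storm / p_pooled

print(f"calm day:  P(|error| >= {LIMIT}') = {p_calm:.2g}")
print(f"storm day: P(|error| >= {LIMIT}') = {p_storm:.2g}")
print(f"pooled:    P(|error| >= {LIMIT}') = {p_pooled:.2g}")
print(f"fraction of pooled large errors from the stormy week: {storm_fraction:.6f}")
```

Under these assumptions the pooled record shows errors of 4' a few times in
a hundred sights, yet essentially all of them come from the stormy week.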

As for the "observations where something has gone wrong but not at a level
that we immediately detect. They're the sort of observations that we might
occasionally mark down with a question mark or maybe just have a "funny
feeling" about but they're not the sorts of observations that you would
immediately throw out": if there's an observation that you have a "funny
feeling" about, or put a question mark against, the moment to discard it is
there and then, at the time of the "funny feeling". Not wait to see if it
fits in with your preconceptions or not, and then discard it if it doesn't.

George.

 contact George Huxtable, at george{at}hux.me.uk
or at +44 1865 820222 (from UK, 01865 820222)
or at 1 Sandy Lane, Southmoor, Abingdon, Oxon OX13 5HX, UK.


----- Original Message -----
From: "Frank Reed" 
To: 
Sent: Thursday, December 30, 2010 6:32 AM
Subject: [NavList] Kurtosis WAS: errors in plotting and a possible/partial
fix


George H, you wrote:
"Is Peter Fogg really claiming that he has a method which can reduce the
error resulting from random scatter to less than simple averaging will do?"

Yes. Of course, he is. SURELY that's obvious by now. And it's a simple
method. It differs only slightly from the usual navigators' technique of
omitting LOPs from a fix if they are too far out from a group of others.
When you have a series of closely-spaced observations (well away from the
meridian), the differences between the plotted observed altitudes and the
line with the required slope are no more and no less than a plot of the
intercepts of the sights. Of course any such method needs to be applied
with some fixed a priori standards. Otherwise the temptation to fit the
line will become too great.

And George, you wrote:
"If so, I can always produce sets of simulated data, which are affected
only by computer-generated random scatter, on which he can try his magic,
to substantiate that claim."

Now come on, George. Magic?? I really believe that this attitude has made
it nearly impossible for you to see something simple and useful.

You also wrote:
"I understood that his reason for declining such trials, when last offered,
was that his procedures could not be expected to improve on such
Gaussian scatter, but could only improve on non-Gaussian outliers. If I'm
wrong about that, the offer remains open."

Of course this is the issue. Gaussian distributions are only an approximate
model of real observational error, excellent as a starting point, in fact a
gold standard for a starting point, but only part of the story. What we
have here is "kurtosis".

Kurtosis (positive kurtosis, to be precise) is a ponderous name for a
simple phenomenon in observations: you get more outliers than a pure
Gaussian distribution would imply. And most people who have done
observations with manual instruments are familiar with this phenomenon
though they rarely have a name for it. For a navigation example, suppose
you have a navigator who has a standard deviation of Sun altitude sights of
0.9 minutes of arc. That's not an unreasonable number. It implies that
roughly two-thirds of observations (actually 68%) are within 0.9 minutes of
arc of the truth. But the standard normal distribution tails off very
rapidly. This means that the odds of finding an observation at three or
four standard deviations away from the truth are extremely low --by this
THEORETICAL model of the error distribution. Specifically, the odds of an
observation at 3 s.d. with an error of +/-2.7 minutes of arc, or more, are
about 1-in-370 --for a Gaussian normal distribution. The odds of an
observation at 4 s.d. with an error of +/-3.6 minutes of arc or more are
about 1-in-16,000. That number implies that you could shoot Sun altitudes
five times a day, every day of the year, for over eight years and still
only have an even-money chance of seeing an observation with an error of
3.6'. But that is not the reality of sextant observations. The normal
distribution is a model with zero kurtosis. In the real world, at least
from every practical set of observations that I have seen, the probability
of points "in the tails" of the distribution is much higher. For example,
you might get a 3.6 minute of arc error one time out of a hundred
observations or even one in fifty, or in other words, with hundreds of
times greater frequency than the standard normal distribution would imply.
That's called "kurtosis" (for those who like even more arcane terminology,
it is technically a "leptokurtic" distribution).
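
The Gaussian odds quoted above can be checked with a few lines of Python.
This is a minimal sketch that assumes a zero-mean normal error with the
0.9' standard deviation used in the example; the function name
`two_sided_tail` is invented for the illustration.

```python
import math

def two_sided_tail(k):
    """P(|error| >= k standard deviations) for a zero-mean Gaussian."""
    return math.erfc(k / math.sqrt(2))

SD = 0.9  # minutes of arc, the figure used in the text

for k in (1, 3, 4):
    p = two_sided_tail(k)
    print(f"{k} s.d. = {k * SD:.1f}': P = {p:.3g}, about 1 in {1 / p:,.0f}")
```

The 1 s.d. line corresponds to the 68% figure (about 32% of sights fall
outside it), and the 3 and 4 s.d. lines give the odds for 2.7' and 3.6'.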

If you want to model observations that have kurtosis, there is an easy way
to do it, and it has a direct relationship with the origins of these
"outliers" in the real world. Generate random variables as follows: with
some probability f (e.g. 80%) take random numbers from a Gaussian normal
distribution with a relatively small standard deviation. In the case here,
we might take 80% of numbers from a normal distribution with standard
deviation 0.7'. These correspond to normal "good" observations. For all
other simulated observations (necessarily with probability 1-f, of course),
take the observations from a Gaussian distribution with a significantly
larger standard deviation, perhaps 3.0' in the case described here. These
correspond to observations where something has gone wrong but not at a
level that we immediately detect. They're the sort of observations that we
might occasionally mark down with a question mark or maybe just have a
"funny feeling" about but they're not the sorts of observations that you
would immediately throw out. The random numbers you will get from this
"mixed" simulation will generally resemble normally distributed numbers
until you look more closely at the statistics, or until you employ some
graphing technique like the very simple and efficient one that Peter Fogg
has discussed many times. We can adopt a standard where we drop any
observations greater than perhaps 2.5 s.d. from the sloping line, and we
will get better results than a crude average of all points most of the
time.
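
The recipe above can be tried out with a short simulation. What follows is
a rough sketch under stated assumptions, not the procedure Peter Fogg uses:
the true error is taken as zero, so a constant stands in for the sloping
line; sights come in batches of 10; the cut is 2.5 times an a priori s.d.
of 0.7', measured from the batch median. The batch size, the reference
s.d., the use of the median, the seed, and all function names are choices
made for the illustration.

```python
import math
import random
import statistics

def mixed_error(rng, f=0.8, sd_good=0.7, sd_bad=3.0):
    """One simulated error (minutes of arc) from the two-Gaussian mixture."""
    sd = sd_good if rng.random() < f else sd_bad
    return rng.gauss(0.0, sd)

def excess_kurtosis(xs):
    """Sample excess kurtosis; zero for a pure Gaussian."""
    m = statistics.fmean(xs)
    var = statistics.fmean((x - m) ** 2 for x in xs)
    m4 = statistics.fmean((x - m) ** 4 for x in xs)
    return m4 / var ** 2 - 3.0

def culled_mean(batch, sd_ref=0.7, k=2.5):
    """Mean after dropping points more than k * sd_ref from the batch median.

    sd_ref is fixed beforehand (an assumption here), not fitted to the batch.
    """
    centre = statistics.median(batch)
    kept = [x for x in batch if abs(x - centre) <= k * sd_ref]
    return statistics.fmean(kept or batch)  # fall back if everything was cut

def rms(values):
    return math.sqrt(statistics.fmean(v * v for v in values))

rng = random.Random(2010)  # fixed seed so the run is repeatable

sample = [mixed_error(rng) for _ in range(100_000)]
print(f"excess kurtosis of the mixture sample: {excess_kurtosis(sample):.1f}")

# The true error is zero, so each batch mean is itself the error of the result.
batches = [[mixed_error(rng) for _ in range(10)] for _ in range(5_000)]
rms_plain = rms([statistics.fmean(b) for b in batches])
rms_culled = rms([culled_mean(b) for b in batches])
print(f"RMS error of plain average:  {rms_plain:.2f}'")
print(f"RMS error of culled average: {rms_culled:.2f}'")
```

Setting `f=1.0` in `mixed_error` gives pure Gaussian scatter, which allows
the same comparison to be run for the case George Huxtable raises.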

This isn't magic. It's good science. Whether it's useful for a navigator
depends on many factors: the type of observations (altitudes? lunars?), the
quality of the observation conditions (small boat? land observer?), the
time and calculating resources available (is a calculated plot available?),
and probably more. Of course, one could also argue that this was never used
historically so if we're only interested in the history of a dead skill,
it's irrelevant. If there's any life left in traditional navigation,
there's every reason to seek modern methods of analysis. There's nothing
wrong with trying to cull outliers in observational data when there is
significant kurtosis.

-FER


----------------------------------------------------------------
NavList message boards and member settings: www.fer3.com/NavList
Members may optionally receive posts by email.
To cancel email delivery, send a message to NoMail[at]fer3.com
----------------------------------------------------------------