NavList:
A Community Devoted to the Preservation and Practice of Celestial Navigation and Other Methods of Traditional Wayfinding
Re: Rejecting outliers
From: George Huxtable
Date: 2011 Jan 3, 11:46 -0000
Thanks to Peter Hakel for providing an unlocked version of his spreadsheet: I can now read it all, though without understanding exactly what it's doing.

Perhaps Peter can clarify for me a point about his earlier posting, on 31 December, in which he wrote-

Eq(1): weight = 1 / variance = 1 / (standard deviation squared)

It may only be a matter of words, but it seems to me that the weight has to be assessed individually for each member of the set. Isn't "standard deviation" a measure of the set as a whole, not of a single member? Shouldn't Eq(1) read something like "= 1 / (deviation squared)", not "= 1 / (standard deviation squared)"? Otherwise, I fail to follow it.

I think he is giving himself an unnecessarily hard time by allowing the slope to be a variable in his fitting routine. What's more, he is diverging from an important prior constraint of the problem, which is that the true slope must represent an altitude change of 32' over a 5-minute period, and NOTHING ELSE WILL DO. To that extent, his analysis is inappropriate to the problem.

Knowing that variation with time, we can eliminate time from the problem before we even start to tackle it, by subtracting from each altitude value an amount that increases linearly with time, with a slope of 32', from some arbitrary base value chosen for convenience. This then results in a set of nine simple numbers, of which the time, and even the ordering, is now unimportant. Peter's task then is to find some way of processing those numbers to determine a result that represents the true, unperturbed initial value better than a simple mean-of-9 does.

In the case we're presented with, there's no evidence that the distribution is anything other than a simple Gaussian, which makes his task more difficult. If there were obvious "outliers", it could be more straightforward.

Now for Peter's weighting function.
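The slope-subtraction described above can be sketched in a few lines of Python. The numbers below are invented for illustration; only the rate of 32' over 5 minutes is taken from the thread.

```python
# A minimal sketch of the detrending step: subtract the known altitude
# change of 32' per 5 minutes (6.4' per minute) from each observation,
# referenced to the first time. Times and altitudes are invented.

SLOPE = 32.0 / 5.0  # arcminutes per minute, known in advance

def detrend(times_min, alts_arcmin):
    """Remove the known linear trend, leaving numbers that should
    simply scatter about the unperturbed base value."""
    t0 = times_min[0]
    return [alt - SLOPE * (t - t0) for t, alt in zip(times_min, alts_arcmin)]

times = [0.0, 0.5, 1.0, 1.5, 2.0]        # minutes
alts = [10.0, 13.3, 16.2, 19.8, 22.7]    # arcminutes: trend plus noise
residuals = detrend(times, alts)
print(residuals)  # all values near 10'; time and ordering no longer matter
```

After this step the problem reduces to estimating a single central value from a set of scattered numbers, which is exactly the framing George argues for.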
In a simple least-squares analysis, each observation is given the same weight, so the weight factor is a constant = 1, whatever the deviation. If a limit is set, outside which data are excluded, it becomes a square distribution around a best estimate of a central value: within it the weighting is taken as 1, and outside it, either side of the centre beyond a specified deviation (for example, 3 standard deviations), the weighting is zero. It's somewhat unphysical and arbitrary, but at least the conditions can be clearly specified.

Peter modifies a square-box weighting function, as above, to add an inverse-square fall-off beyond its shoulders. Those sharp shoulders also seem somewhat unphysical. What I would like to follow is how the half-width between those shoulders relates to the standard deviation, and to his "scatter" parameter.

It seems that Peter wishes to leave it to the individual to choose the scatter parameter that's most appropriate to a particular data set, after viewing some results. If I understand that right, he hasn't yet eliminated all the "magic" from the operation.

George.

contact George Huxtable, at george{at}hux.me.uk
or at +44 1865 820222 (from UK, 01865 820222)
or at 1 Sandy Lane, Southmoor, Abingdon, Oxon OX13 5HX, UK.

----- Original Message -----
From: "P H"
To:
Sent: Monday, January 03, 2011 1:17 AM
Subject: [NavList] Re: Rejecting outliers

| George,
|
| I forgot to mention that you can unlock the spreadsheet by turning off its
| protection; there is no password. I attach the unlocked version, so you can
| skip that step. The PNG file is an image, a screenshot of the color-coded
| portion of the spreadsheet where input and output data are concentrated. This
| is the part with which a user would interact; I attached it for those readers
| who may be interested in the main points but don't want to bother with Excel in
| detail. My Excel is Office 2004 for Mac, so hopefully compatibility will not be
| a problem.
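The box-with-inverse-square-tails weighting George describes above can be sketched as follows. Treating the half-width as Peter's "Scatter" parameter is an assumption on my part; how the two relate is exactly the open question George raises.

```python
# Sketch of the weighting shape described above: full weight inside a
# "box" around the central estimate, inverse-square fall-off beyond its
# shoulders. The half_width argument is my stand-in for the "Scatter"
# parameter (an assumption, not confirmed in the thread).

def box_with_tails(deviation, half_width):
    d = abs(deviation)
    if d <= half_width:
        return 1.0                    # flat top: full weight
    return (half_width / d) ** 2      # inverse-square tails

print(box_with_tails(0.5, 1.0))   # inside the box: 1.0
print(box_with_tails(2.0, 1.0))   # twice the half-width out: 0.25
```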
|
| I gave the details of the procedure in:
|
| http://www.fer3.com/arc/m2.aspx?i=115086&y=201012
|
| Step 1: Calculate the standard (non-weighted) least-squares linear fit through
| the data.
|
| Now iterate:
| Step 2: Calculate altitude differences "diff" between the data and the latest
| available linear fit.
| Step 3: Calculate new weights as 1 / diff^2 for each data point.
| Step 4: Calculate a new linear fit using the weights from Step 3.
| Repeat until convergence.
|
| "diff" could turn up small, or even zero, which would cause numerical problems;
| that is why the weights have a ceiling controlled by the "Scatter" parameter.
| The weight=1.000 means that the data point has hit this ceiling and contributes
| to the result with the maximum influence allowed by the procedure. For
| Scatter=7.0', all nine data points reach this ceiling and we are stuck at Step 1.
|
| I have not gotten around to fitting Gaussian-scattered data, as you suggested.
| I may do so in the future, time permitting.
|
| The procedure of weighted least squares has a very solid theoretical background;
| see, e.g.,
|
| http://en.wikipedia.org/wiki/Least_squares#Weighted_least_squares
| http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares
|
| The one admittedly heuristic detail is Step 3; the weight should be 1/variance,
| but that is unknown in this case. Step 3 seems like a reasonable replacement,
| effectively substituting |diff| for the standard deviation of altitudes at the
| given UT.
|
| This procedure is capable of eliminating lopsidedness to a certain extent, as I
| have shown previously. However, if there are too many "lopsided" data points,
| the result will follow them to the new "middle" defined by them. I don't know
| how we would weed that out without additional information about where the
| correct "middle" really is. As I said earlier, in the absence of an independent
| check, we must rely on the assumption that a sufficient majority of data points
| are "good."
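Steps 1-4 above can be sketched as a small iteratively reweighted least-squares routine. This is an illustrative reconstruction, not Peter's spreadsheet: in particular, capping each weight at 1 once |diff| <= Scatter is my reading of how the ceiling works, and the data are invented.

```python
# Illustrative reconstruction of Steps 1-4 (not Peter's spreadsheet).
# Assumption: the "Scatter" ceiling means weight = min(1, (scatter/diff)^2),
# so points with |diff| <= scatter get full weight.

def wls_line(t, y, w):
    """Weighted least-squares fit of y = a + b*t; returns (a, b)."""
    sw = sum(w)
    st = sum(wi * ti for wi, ti in zip(w, t))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    stt = sum(wi * ti * ti for wi, ti in zip(w, t))
    sty = sum(wi * ti * yi for wi, ti, yi in zip(w, t, y))
    b = (sw * sty - st * sy) / (sw * stt - st * st)
    a = (sy - b * st) / sw
    return a, b

def irls(t, y, scatter, iters=20):
    w = [1.0] * len(t)                  # Step 1: plain least squares
    for _ in range(iters):
        a, b = wls_line(t, y, w)        # Steps 2 and 4: (re)fit the line
        # Step 3: weight = 1/diff^2, capped by the scatter parameter
        # (this also avoids division by a zero diff)
        w = [1.0 if abs(yi - (a + b * ti)) <= scatter
             else (scatter / (yi - (a + b * ti))) ** 2
             for ti, yi in zip(t, y)]
    return wls_line(t, y, w)

t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 1.0, 2.1, 2.9, 9.0]           # last point is a planted outlier
a, b = irls(t, y, scatter=0.5)
print(a, b)  # the fitted slope lands near 1, well below the unweighted fit
```

With scatter = 0.5 the planted outlier is heavily attenuated; raising scatter above the largest |diff| returns every weight to 1 and reproduces the plain fit, matching Peter's remark about being "stuck at Step 1" for Scatter = 7.0'.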
|
| My motivation was to see what information can be extracted from the data set
| alone, without any additional information such as DR. I think that Peter Fogg's
| approach of precomputing the slope is fine, and would most likely give a better
| practical result. After all, "position tracking" is preferable to "position
| establishing from scratch", and it is indeed what happens in real life. But I
| think you will agree that academic curiosity has its benefits, too. :-)
|
| Peter Hakel
|
| ________________________________
| From: George Huxtable
| To: NavList@fer3.com
| Sent: Sun, January 2, 2011 3:12:28 PM
| Subject: [NavList] Re: Rejecting outliers
|
| I'm failing to understand some aspects of Peter Hakel's approach, though at
| least it seems numerical, definable, and repeatable.
|
| First, though, I should say that problems arise when viewing his Excel
| output. It looks as though some columns need to be widened to see their
| contents, but it seems that the file is deliberately crippled to prevent me
| from taking many of the actions I am familiar with in Excel. But I should add
| that mine is an old 2000 version, which may be part of the problem. The
| alternative .png version opens with "Paint" but only allows me to see the
| top-left area of the sheet. Is that all I need?
|
| Now to more substantive matters-
|
| I don't understand how the various weighting factors have been derived, and
| why many of them are exactly 1.0000, when others are much less.
|
| Presumably, they depend on the divergence of each data point from some
| calculated line, which is then readjusted by iteration, but I have failed
| to follow how that initial straight-line norm was assessed, or what
| algorithm was used to obtain the weights. Answers in words rather than in
| statistical symbols would be most helpful to my simple mind.
|
| You seem to have ended up with a best-fit slope of about 24' in a 5-minute
| period, as I did when allowing Excel to make a best fit, when giving it
| freedom to alter the slope as it thought fit. But the slope can be
| pre-assessed with sufficient accuracy from known information, and unless
| there is some error in the information given, such as an (unlikely) star
| mis-identification, we can be sure that the actual slope is nearer to 32',
| and the apparent lower figure is no more than a product of scatter in the
| data. This is a point that Peter Fogg keeps reiterating, perhaps the only
| valid point in all he has written on this topic.
|
| As a result, we could, if we wished, subtract off that line of known
| constant slope from all the data, and end up with a set of numbers, all of
| which should be roughly equal, simply scattering around some mean value
| that we wish to discover. Then the statistical task of weeding outliers
| becomes somewhat simpler.
|
| =================
|
| If you apply your procedure to an artificially generated data set,
| scattering in a known Gaussian manner about a known mean (which could be
| zero), and known to contain no non-Gaussian outliers, what is the resulting
| scatter in the answer? How does it compare with the predicted scatter from
| simple averaging? I suspect (predict) that it can only be worse, though
| perhaps not by much.
|
| This is the way I picture it. If there is any lopsidedness in the
| distribution, then each added observation that is in the direction of the
| imbalance will be given the greatest weight, whereas any observation that
| would act to rebalance it, on the other side, will be attenuated, being
| further from the trend. So there will be some effect, however small, that
| acts to enhance any imbalance, though probably not to the extent of causing
| instability. Does that argument make sense to you? It could be checked out
| by some Monte Carlo procedure.
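The Monte Carlo check proposed above can be sketched as follows, under stated assumptions: the slope is taken as already removed, so the "fit" reduces to a location estimate, and a capped 1/diff^2 reweighting is applied to it. The sigma, scatter setting, and trial counts are arbitrary choices of mine.

```python
# Monte Carlo comparison on purely Gaussian data with no outliers:
# plain mean versus an iteratively reweighted mean using
# weight = min(1, (scatter/diff)^2). All parameters are illustrative.

import random

def reweighted_mean(y, scatter, iters=10):
    m = sum(y) / len(y)                 # start from the plain mean
    for _ in range(iters):
        w = [1.0 if abs(yi - m) <= scatter
             else (scatter / (yi - m)) ** 2
             for yi in y]
        m = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return m

random.seed(1)
trials, n = 500, 9
mse_plain = mse_rw = 0.0
for _ in range(trials):
    y = [random.gauss(0.0, 1.0) for _ in range(n)]   # true mean is 0
    mse_plain += (sum(y) / n) ** 2
    mse_rw += reweighted_mean(y, scatter=1.0) ** 2
print(mse_plain / trials, mse_rw / trials)  # compare the mean-square errors
```

If George's argument is right, the reweighted estimator's mean-square error should come out somewhat worse than the plain mean's (about 1/9 here) on purely Gaussian data; this sketch makes that prediction easy to test directly.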
|
| I presume that the proposed procedure is entirely empirical, and has no
| theoretical backing, though there may not be anything wrong with that, if
| it works.
|
| George.
|
| contact George Huxtable, at george@hux.me.uk
| or at +44 1865 820222 (from UK, 01865 820222)
| or at 1 Sandy Lane, Southmoor, Abingdon, Oxon OX13 5HX, UK.