NavList:
A Community Devoted to the Preservation and Practice of Celestial Navigation and Other Methods of Traditional Wayfinding
Re: Rejecting outliers
From: George Huxtable
Date: 2011 Jan 3, 11:46 -0000
Thanks to Peter Hakel for providing an unlocked version of his spreadsheet: I can now read it all, though without understanding exactly what it's doing.

Perhaps Peter can clarify for me a point about his earlier posting, on 31 December, in which he wrote-

Eq(1): weight = 1 / variance = 1 / (standard deviation squared)

It may only be a matter of words, but it seems to me that the weight has to be assessed individually for each member of the set. Isn't "standard deviation" a measure of the set as a whole, not of a single member? Shouldn't Eq(1) read something like "= 1 / (deviation squared)", not "= 1 / (standard deviation squared)"? Otherwise, I fail to follow it.

I think he is giving himself an unnecessarily hard time by allowing the slope to be a variable in his fitting routine. What's more, he is diverging from an important prior constraint of the problem, which is that the true slope must represent an altitude change of 32' over a 5-minute period, and NOTHING ELSE WILL DO. To that extent, his analysis is inappropriate to the problem.

Knowing that variation with time, we can eliminate time from the problem before we even start to tackle it, by subtracting from each altitude value an amount that increases linearly with time, with a slope of 32', from some arbitrary base value chosen for convenience. This then results in a set of nine simple numbers, of which the time, and even the ordering, is now unimportant. Peter's task then is to find some way of processing those numbers to determine a result that represents the true, unperturbed initial value better than a simple mean-of-9 does.

In the case we're presented with, there's no evidence that the distribution is anything other than a simple Gaussian, which makes his task more difficult. If there were obvious "outliers", it could be more straightforward.

Now for Peter's weighting function.
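The slope-subtraction described above can be sketched in a few lines of Python. The numbers below are invented for illustration; only the rate of 32' over 5 minutes is taken from the thread.

```python
# A minimal sketch of the detrending step: subtract the known altitude
# change of 32' per 5 minutes (6.4' per minute) from each observation,
# referenced to the first time. Times and altitudes are invented.

SLOPE = 32.0 / 5.0  # arcminutes per minute, known in advance

def detrend(times_min, alts_arcmin):
    """Remove the known linear trend, leaving numbers that should
    simply scatter about the unperturbed base value."""
    t0 = times_min[0]
    return [alt - SLOPE * (t - t0) for t, alt in zip(times_min, alts_arcmin)]

times = [0.0, 0.5, 1.0, 1.5, 2.0]        # minutes
alts = [10.0, 13.3, 16.2, 19.8, 22.7]    # arcminutes: trend plus noise
residuals = detrend(times, alts)
print(residuals)  # all values near 10'; time and ordering no longer matter
```

After this step the problem reduces to estimating a single central value from a set of scattered numbers, which is exactly the framing George argues for.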
In a simple least-squares analysis, each observation is given the same weight, so the weight factor is a constant = 1, whatever the deviation. If a limit is set, outside which data are excluded, it becomes a square distribution around a best estimate of a central value: within it the weighting is taken as 1, and outside it, either side of the centre beyond a specified deviation (for example, 3 standard deviations), the weighting is zero. It's somewhat unphysical and arbitrary, but at least the conditions can be clearly specified.

Peter modifies a square-box weighting function, as above, to add an inverse-square fall-off beyond its shoulders. Those sharp shoulders also seem somewhat unphysical. What I would like to follow is how the half-width between those shoulders relates to the standard deviation, and to his "scatter" parameter.

It seems that Peter wishes to leave it to the individual to choose the scatter parameter that's most appropriate to a particular data set, after viewing some results. If I understand that right, he hasn't yet eliminated all the "magic" from the operation.

George.

contact George Huxtable, at george{at}hux.me.uk
or at +44 1865 820222 (from UK, 01865 820222)
or at 1 Sandy Lane, Southmoor, Abingdon, Oxon OX13 5HX, UK.

----- Original Message -----
From: "P H"
To:
Sent: Monday, January 03, 2011 1:17 AM
Subject: [NavList] Re: Rejecting outliers

| George,
|
| I forgot to mention that you can unlock the spreadsheet by turning off its
| protection; there is no password. I attach the unlocked version, so you can
| skip that step. The PNG file is an image, a screenshot of the color-coded
| portion of the spreadsheet where input and output data are concentrated. This
| is the part with which a user would interact; I attached it for those readers
| who may be interested in the main points but don't want to bother with Excel in
| detail. My Excel is Office 2004 for Mac, so hopefully compatibility will not be
| a problem.
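The box-with-inverse-square-tails weighting George describes above can be sketched as follows. Treating the half-width as Peter's "Scatter" parameter is an assumption on my part; how the two relate is exactly the open question George raises.

```python
# Sketch of the weighting shape described above: full weight inside a
# "box" around the central estimate, inverse-square fall-off beyond its
# shoulders. The half_width argument is my stand-in for the "Scatter"
# parameter (an assumption, not confirmed in the thread).

def box_with_tails(deviation, half_width):
    d = abs(deviation)
    if d <= half_width:
        return 1.0                    # flat top: full weight
    return (half_width / d) ** 2      # inverse-square tails

print(box_with_tails(0.5, 1.0))   # inside the box: 1.0
print(box_with_tails(2.0, 1.0))   # twice the half-width out: 0.25
```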
|
| I gave the details of the procedure in:
|
| http://www.fer3.com/arc/m2.aspx?i=115086&y=201012
|
| Step 1: Calculate the standard (non-weighted) least-squares linear fit through
| the data.
|
| Now iterate:
| Step 2: Calculate altitude differences "diff" between the data and the latest
| available linear fit.
| Step 3: Calculate new weights as 1 / diff^2 for each data point.
| Step 4: Calculate a new linear fit using the weights from Step 3.
| Repeat until convergence.
|
| "diff" could turn up small, or even zero, which would cause numerical problems;
| that is why the weights have a ceiling controlled by the "Scatter" parameter.
| The weight=1.000 means that the data point has hit this ceiling and contributes
| to the result with the maximum influence allowed by the procedure. For
| Scatter=7.0', all nine data points reach this ceiling and we are stuck at Step 1.
|
| I have not gotten around to fitting Gaussian-scattered data, as you suggested.
| I may do so in the future, time permitting.
|
| The procedure of weighted least squares has a very solid theoretical background;
| see, e.g.,
|
| http://en.wikipedia.org/wiki/Least_squares#Weighted_least_squares
| http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares
|
| The one admittedly heuristic detail is Step 3; the weight should be 1/variance,
| but that is unknown in this case. Step 3 seems like a reasonable replacement,
| effectively substituting |diff| for the standard deviation of altitudes at the
| given UT.
|
| This procedure is capable of eliminating lopsidedness to a certain extent, as I
| have shown previously. However, if there are too many "lopsided" data points,
| the result will follow them to the new "middle" defined by them. I don't know
| how we would weed that out without additional information about where the
| correct "middle" really is. As I said earlier, in the absence of an independent
| check, we must rely on the assumption that a sufficient majority of data points
| are "good."
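Steps 1-4 above can be sketched as a small iteratively reweighted least-squares routine. This is an illustrative reconstruction, not Peter's spreadsheet: in particular, capping each weight at 1 once |diff| <= Scatter is my reading of how the ceiling works, and the data are invented.

```python
# Illustrative reconstruction of Steps 1-4 (not Peter's spreadsheet).
# Assumption: the "Scatter" ceiling means weight = min(1, (scatter/diff)^2),
# so points with |diff| <= scatter get full weight.

def wls_line(t, y, w):
    """Weighted least-squares fit of y = a + b*t; returns (a, b)."""
    sw = sum(w)
    st = sum(wi * ti for wi, ti in zip(w, t))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    stt = sum(wi * ti * ti for wi, ti in zip(w, t))
    sty = sum(wi * ti * yi for wi, ti, yi in zip(w, t, y))
    b = (sw * sty - st * sy) / (sw * stt - st * st)
    a = (sy - b * st) / sw
    return a, b

def irls(t, y, scatter, iters=20):
    w = [1.0] * len(t)                  # Step 1: plain least squares
    for _ in range(iters):
        a, b = wls_line(t, y, w)        # Steps 2 and 4: (re)fit the line
        # Step 3: weight = 1/diff^2, capped by the scatter parameter
        # (this also avoids division by a zero diff)
        w = [1.0 if abs(yi - (a + b * ti)) <= scatter
             else (scatter / (yi - (a + b * ti))) ** 2
             for ti, yi in zip(t, y)]
    return wls_line(t, y, w)

t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 1.0, 2.1, 2.9, 9.0]           # last point is a planted outlier
a, b = irls(t, y, scatter=0.5)
print(a, b)  # the fitted slope lands near 1, well below the unweighted fit
```

With scatter = 0.5 the planted outlier is heavily attenuated; raising scatter above the largest |diff| returns every weight to 1 and reproduces the plain fit, matching Peter's remark about being "stuck at Step 1" for Scatter = 7.0'.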
|
| My motivation was to see what information can be extracted from the data set
| alone, without any additional information such as DR. I think that Peter Fogg's
| approach of precomputing the slope is fine, and would most likely give a better
| practical result. After all, "position tracking" is preferable to "position
| establishing from scratch", and it is indeed what happens in real life. But I
| think you will agree that academic curiosity has its benefits, too. :-)
|
| Peter Hakel
|
| ________________________________
| From: George Huxtable
| To: NavList@fer3.com
| Sent: Sun, January 2, 2011 3:12:28 PM
| Subject: [NavList] Re: Rejecting outliers
|
| I'm failing to understand some aspects of Peter Hakel's approach, though at
| least it seems numerical, definable, and repeatable.
|
| First, though, I should say that problems arise when viewing his Excel
| output. It looks as though some columns need to be widened to see their
| contents, but it seems that the file is deliberately crippled to prevent me
| from taking many of the actions I am familiar with in Excel. But I should add
| that mine is an old 2000 version, which may be part of the problem. The
| alternative .png version opens with "Paint" but only allows me to see the
| top-left area of the sheet. Is that all I need?
|
| Now to more substantive matters-
|
| I don't understand how the various weighting factors have been derived, and
| why many of them are exactly 1.0000, when others are much less.
|
| Presumably, they depend on the divergence of each data point from some
| calculated line, which is then readjusted by iteration, but I have failed
| to follow how that initial straight-line norm was assessed, or what
| algorithm was used to obtain the weights. Answers in words rather than in
| statistical symbols would be most helpful to my simple mind.
|
| You seem to have ended up with a best-fit slope of about 24' in a 5-minute
| period, as I did when allowing Excel to make a best fit, when giving it
| freedom to alter the slope as it thought fit. But the slope can be
| pre-assessed with sufficient accuracy from known information, and unless
| there is some error in the information given, such as an (unlikely) star
| mis-identification, we can be sure that the actual slope is nearer to 32',
| and the apparent lower figure is no more than a product of scatter in the
| data. This is a point that Peter Fogg keeps reiterating, perhaps the only
| valid point in all he has written on this topic.
|
| As a result, we could, if we wished, subtract off that line of known
| constant slope from all the data, and end up with a set of numbers, all of
| which should be roughly equal, simply scattering around some mean value
| that we wish to discover. Then the statistical task of weeding outliers
| becomes somewhat simpler.
|
| =================
|
| If you apply your procedure to an artificially generated data set,
| scattering in a known Gaussian manner about a known mean (which could be
| zero), and known to contain no non-Gaussian outliers, what is the resulting
| scatter in the answer? How does it compare with the predicted scatter from
| simple averaging? I suspect (predict) that it can only be worse, though
| perhaps not by much.
|
| This is the way I picture it. If there is any lopsidedness in the
| distribution, then each added observation that is in the direction of the
| imbalance will be given the greatest weight, whereas any observation that
| would act to rebalance it, on the other side, will be attenuated, being
| further from the trend. So there will be some effect, however small, that
| acts to enhance any imbalance, though probably not to the extent of causing
| instability. Does that argument make sense to you? It could be checked out
| by some Monte Carlo procedure.
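The Monte Carlo check proposed above can be sketched as follows, under stated assumptions: the slope is taken as already removed, so the "fit" reduces to a location estimate, and a capped 1/diff^2 reweighting is applied to it. The sigma, scatter setting, and trial counts are arbitrary choices of mine.

```python
# Monte Carlo comparison on purely Gaussian data with no outliers:
# plain mean versus an iteratively reweighted mean using
# weight = min(1, (scatter/diff)^2). All parameters are illustrative.

import random

def reweighted_mean(y, scatter, iters=10):
    m = sum(y) / len(y)                 # start from the plain mean
    for _ in range(iters):
        w = [1.0 if abs(yi - m) <= scatter
             else (scatter / (yi - m)) ** 2
             for yi in y]
        m = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return m

random.seed(1)
trials, n = 500, 9
mse_plain = mse_rw = 0.0
for _ in range(trials):
    y = [random.gauss(0.0, 1.0) for _ in range(n)]   # true mean is 0
    mse_plain += (sum(y) / n) ** 2
    mse_rw += reweighted_mean(y, scatter=1.0) ** 2
print(mse_plain / trials, mse_rw / trials)  # compare the mean-square errors
```

If George's argument is right, the reweighted estimator's mean-square error should come out somewhat worse than the plain mean's (about 1/9 here) on purely Gaussian data; this sketch makes that prediction easy to test directly.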
|
| I presume that the proposed procedure is entirely empirical, and has no
| theoretical backing, though there may not be anything wrong with that, if
| it works.
|
| George.
|
| contact George Huxtable, at george@hux.me.uk
| or at +44 1865 820222 (from UK, 01865 820222)
| or at 1 Sandy Lane, Southmoor, Abingdon, Oxon OX13 5HX, UK.