Every so often, I see someone get irritated that “can I see that?” tends to mean “may I hold that while I look at it?” Given how common this is, and how natural it seems to me to want to hold something while I examine it, I wonder if there is an underlying reason behind it.
Seeing some of the pictures in a recent blog post by Google’s research team, I wonder if that reason may be related to how “quickly” we learn to recognise new objects — quickly in quotes, because we make “one observation” while a typical machine-learning system may need thousands of examples to learn from — what if we also need a lot of examples, but we don’t realise that we need them because we’re seeing them in a continuous sequence?
Human vision isn’t as straightforward as a video played back on a computer, but it’s not totally unreasonable to say we see things “more than once” when we hold them in our hands — and, crucially, if we hold them while we do so we get to see those things with additional information: the object’s distance and therefore size comes from proprioception (which tells us where our hand is), not just from binocular vision; we can rotate it and see it from multiple angles, or rotate ourselves and see how different angles of light changes its appearance; we can bring it closer to our eyes to see fine detail that we might have missed from greater distance; we can rub the surface to see if markings on the surface are permanent or temporary.
So, the hypothesis (conjecture?) is this: humans need to hold things to look at them properly, just to gather enough information to learn what it looks like in general rather than just from one point of view. Likewise, machine learning systems seem worse than they are for lack of capacity to create realistic alternative perspectives of the things they’ve been tasked with classifying.
Not sure how I’d test both parts of this idea. A combination of robot arm, camera, and machine learning system that manipulates an object it’s been asked to learn to recognise is the easy part; but when testing the reverse in humans, one would need to show them a collection of novel objects, half of which they can hold and the other half of which they can only observe in a way that actively prevents them from seeing multiple perspectives, and then test their relative abilities to recognise the objects in each category.