Pinch zoom - it feels intuitive. You just do it. Often the best tools are like that - ballpoint pens, Velcro, Post-it notes, LEDs - all so intuitive to use that we take them for granted. But behind that ease of use sits a lot of hard work.

Let’s go back to pinch zoom. It’s really easy to describe to someone: just put your finger and thumb on the screen, widen the gap between them to zoom in, and bring them closer to zoom out.

Recently I needed to implement it in code. I know there are existing implementations, but for ‘reasons’ I needed to write it myself.

The problem

From a code perspective there is one thing that defines how the viewed item is drawn: the conversion from logical coordinates (such as a map’s own coordinate system) to physical coordinates (such as the phone screen’s pixels). It’s usually called the transform. It can be thought of as doing two jobs: it translates the view (moves it on the screen) and it scales the view (moves points closer together or further apart). There’s a clue to a complication here. The view can pan, which means we need to understand how that works as well. We actually need to handle a combined zoom-and-pan operation, rather than just zooming - especially as we can’t separate the two. After you’ve panned the view, you expect the zoom to focus on where you are looking, so the centre of the zoom needs to be panned as well.
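
To make that concrete, here’s a minimal sketch of such a transform. This is just my illustration of the shape - plain {x, y} points, a scale factor and a translation - not the actual transform API used later in the post:

// a transform is essentially a scale factor k and a translation (tx, ty)
const transform = {
    k: 2, tx: 100, ty: 50,                                                            // example values
    apply(p){  return { x: p.x * this.k + this.tx, y: p.y * this.k + this.ty } },     // logical -> physical
    invert(p){ return { x: (p.x - this.tx) / this.k, y: (p.y - this.ty) / this.k } }  // physical -> logical
}

Both directions matter: drawing goes from logical to physical, while making sense of a touch needs the inverse.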

The input

There’s also the input to consider. Browsers were initially written around mouse and keyboard events, and early touchscreen devices wanted to keep things simple, so ‘fake’ mouse events were synthesised for touch - even wheel events for pinch zoom. There are also touch events if you need more detail, but these are different again. There’s even extra logic to avoid sending the ‘fake’ wheel events if the touch has already been handled - all in an effort to preserve backwards compatibility while adding new features.

Recently both mouse and touch events have been unified into the Pointer Events API, which provides really detailed information such as tangential pressure (I’m looking forward to finding a good use for that).

For the purposes of this discussion, however, there are three events we need to work with: down, up and move. We get a down event when a new ‘pointer’ is pressed or touched, and an up when it’s released. Hopefully it’s obvious that we get a move when it changes position. To use these we need to store some sort of state, so that we know what to do. For example, if a pointer moves, we need to know how many other pointers are pressed, so we know whether to zoom or pan.

As a note, the code in this post isn’t intended to run as-is - it’s designed to explain the principle and so elides details such as the definition of vectorAdd.

const state = {};

function pointer_down(event){
    state[event.pointerId] = true
}
function pointer_up(event){
    state[event.pointerId] = false
}
function pointer_move(event){
    // count the number of pointers that are currently down.
    let pointersDown = 0
    for ( const id in state ){
        if ( state[id] ) pointersDown = pointersDown + 1
    }
    // decide what to do
    if ( pointersDown == 1 ){
        do_pan()
    }else if ( pointersDown == 2 ){
        do_zoom()
    }
}
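
For completeness, here’s roughly how these handlers might be wired up using the Pointer Events API. The element is a placeholder, and setting touch-action to none stops the browser applying its own built-in pan and pinch handling (in a real handler, the event.position used later in this post would be built from the event’s clientX and clientY):

const view = document.querySelector('#view')       // placeholder for whatever element shows the image
view.style.touchAction = 'none'                     // we handle pan/pinch, not the browser
view.addEventListener('pointerdown',   pointer_down)
view.addEventListener('pointermove',   pointer_move)
view.addEventListener('pointerup',     pointer_up)
view.addEventListener('pointercancel', pointer_up)  // treat a cancelled pointer like a release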

There are already some code smells in pointer_move - if more than two pointers are down, it does nothing. What should it do? And this is before we’ve actually worked out how much to pan or zoom by.

The pan initially looks pretty simple. We want to create a new transform that shifts the image in the same way as the pointer has moved - requiring us to store the previous position of the pointer…

The zoom is a bit trickier. We really want to know how much the distance between the two pointers has changed, and scale by that ratio - meaning we need to track the previous distance as well.
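
A rough sketch of that first approach might look like the following. None of this is the final code: vectorSubtract, distance, getTransform and setTransform are assumed helpers, the down/up handlers would have to keep previousPositions and previousDistance up to date, and pointer_move would have to find and pass in the two current positions. That extra bookkeeping is rather the point:

// extra state the 'obvious' approach forces on us
const previousPositions = {}   // physical position of each pointer, keyed by pointerId
let previousDistance = 0       // distance between the two pointers last time we looked

function do_pan(event){
    const previous = previousPositions[event.pointerId]
    const delta = vectorSubtract(event.position, previous)   // how far the pointer moved on screen
    setTransform( getTransform().translate(delta) )          // shift the view by the same amount
    previousPositions[event.pointerId] = event.position
}
function do_zoom(positionA, positionB){
    const newDistance = distance(positionA, positionB)
    const ratio = newDistance / previousDistance              // >1 means the fingers moved apart
    setTransform( getTransform().scale(ratio) )               // ...but scaled around which point?
    previousDistance = newDistance
}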

It gets worse when you consider that we might already have moved the view by the time the second pointer goes down. In fact we almost certainly have, given the imprecision of humans. We need to handle what happens after the pinch, when we start panning again, and the fact that once we’ve zoomed in the panning needs to take account of the new scale. 😓

Maybe it’s just me, but that seems like a lot of work. I fundamentally value elegance in coding - if there’s a way that requires less code then I’ll go for it.

Taking a different view

I think it’s helpful to look at this problem from a different perspective. What we want to happen is that the view updates so that the logical point I touched moves with my finger as I interact. So when we get a pointer down, we really need to store where in the underlying coordinate space it was. Thankfully transforms work in both directions, so it’s just a matter of inverting it. I like the simplicity of that.

When something moves, we want to create a new transform so that the logical points (that we recorded when the pointer went down) line up with the current/new physical points. However with only one point we have no information to constrain the scale (so it probably makes sense to keep it the same), with two points we get a rotation as well (which we don’t want), and with more than two points we will need even more complex transforms (I think with three points you gain skew). 😩

We could just ignore all pointers after the first two, but that seems counter-intuitive. I’m not sure a user would expect that.

Perhaps the answer is to concentrate on just two derived points - the average (mean) logical position and the average physical position. We want a transform where that pair of positions coincides, which gives us our translation. We can determine the scale by summing the distances from each average position to the actual positions (both logical and physical) and taking the ratio of the two sums. This gives a nice geometric interpretation of what we’re doing - it’s also rather elegant that all of these properties can be computed with a map and reduce operation.

The nasty part is handling the scaling when we’ve got only one point: both distance sums are then zero, so the division gives us NaN rather than a usable scale. We can use JavaScript’s ternary operator to check we have more than one point, and fall back to the current scale otherwise.

const state = {};

function pointer_down(event){
    const logicalPosition = getTransform().invert(event.position)
    state[event.pointerId] = { logical:logicalPosition , physical:event.position }
}
function pointer_up(event){
    delete state[event.pointerId]
}
function pointer_move(event){
    if ( !state[event.pointerId] ) return
    // update the state for this event
    state[event.pointerId].physical = event.position
    // get the current pointer details
    const activePointers = Object.keys(state).map( id=>state[id] )
    const activePointersCount = activePointers.length
    // compute the central points
    const logicalCenter = vectorScale(
        activePointers.map( a=>a.logical ).reduce( vectorAdd ), 1/activePointersCount )
    const physicalCenter = vectorScale(
        activePointers.map( a=>a.physical ).reduce( vectorAdd ), 1/activePointersCount )
    // measure how spread out the points are around each centre - this gives us the scale
    const logicalDistances = activePointers
        .map( a=>distance(a.logical,logicalCenter) )
        .reduce( add )
    const physicalDistances = activePointers
        .map( a=>distance(a.physical,physicalCenter) )
        .reduce( add )
    // configure the transform
    const scale = (activePointersCount>1)?(physicalDistances / logicalDistances):getTransform().scale()
    
    // the translation is in physical coordinates: it must move the scaled
    // logical centre onto the physical centre
    const scaledLogicalCenter = vectorScale(logicalCenter, scale)
    setTransform( identity
        .translate( vectorSubtract(physicalCenter, scaledLogicalCenter) )
        .scale(scale))
}
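
If you want to flesh the code above out, here’s one possible shape for the elided helpers. This is my guess at them - points as plain {x, y} objects - and vectorSubtract isn’t mentioned elsewhere, it’s simply what the translation step needs:

// one possible shape for the elided helpers - points are assumed to be {x, y} objects
const vectorAdd      = (a, b) => ({ x: a.x + b.x, y: a.y + b.y })
const vectorSubtract = (a, b) => ({ x: a.x - b.x, y: a.y - b.y })   // assumed helper for the translation step
const vectorScale    = (v, s) => ({ x: v.x * s,   y: v.y * s })
const add            = (a, b) => a + b                              // plain scalar addition, for the distance sums
const distance       = (a, b) => Math.hypot(a.x - b.x, a.y - b.y)

The remaining pieces - getTransform, setTransform and identity - are whatever transform abstraction your drawing code already has, along the lines of the sketch near the top of this post.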

I know there are many ways to judge code, and I’m sure people will have issues with parts of this; however, I think there’s something elegant about it. I’d be keen to know what people think.

Live demo

Conclusions

There’s an irony in the fact that I really hope the user never notices how this works. A truly intuitive interface does what the user expects so effectively that it fades into the background. That’s what we mean by true cognitive digital integration: the interface disappears, allowing the user to interact directly and intuitively with the object of the code - the image, the text, the data.

Writing this kind of code is definitely not easy. The more compact and elegant it is, the longer it takes to write - but the better the result is for the user. And that’s critical for integrating the cognitive with the digital.