To me, a speaker of U.S. English, "straight down" implies movement, making the instructions confusing. A non-motive indication of position would be "straight below".
I would maybe phrase it as "press into the screen with another finger". Common screen coordinates (to me) are left/right (x), up/down (y), and in/out (z). The instruction is trying to refer to the z coordinate, but it uses the common word for the y coordinate.
Of course it implies movement. You have to move your finger to touch the screen. If it said "straight down" with no qualifier, it would be confusing, because "down" could then have two common meanings. However, it does have a qualifier within the same statement that makes it extremely clear that the motion described is not a motion across the screen.
Humans don't parse language the way computers do. When they see a phrase commonly associated with movement, the concept of movement will enter their minds, even if a precise analysis of the sentence does not indicate movement.