For an input string of finite size and with certain commands restricted, I don't believe that proving harmlessness is equivalent to the halting problem.
I'm not sure if NP-Hard is the right classification for the more restricted case though.
It's still Turing complete. In general, proving pretty much any non-trivial property about a general-purpose (and I mean this term very loosely: you have to strip away a lot of things from your favorite language before this is no longer true) programming language is.
Turing complete.