Then an attacker can't break out of the attribute (and consequently can't do anything bad). htmlspecialchars also accepts a flag, ENT_QUOTES, which allows it to sanitize values for single-quoted attributes properly. If you don't surround an attribute's value at all, then you're still vulnerable: you also have invalid HTML.
The real takeaway here can be found in the middle of the post: "you must be aware of the context in which you’re working with user input."
Right, htmlspecialchars() works perfectly as long as you use it for an attribute value surrounded by quotes, or if you escape a whole string. The only way you can get into trouble is if you use it on an attribute value that doesn't have quotes.
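To make the quoted-versus-unquoted difference concrete, here is a minimal sketch. The input strings are hypothetical attacker values (the first is the username quoted elsewhere in this thread), and the surrounding markup is only an illustration, not anyone's real template.

```php
<?php
// Hypothetical attacker-controlled value, borrowed from the thread.
$input = 'foo" onmouseover="alert(1)';

// ENT_QUOTES escapes both " and ', so the value is safe in either quote style.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Quoted attribute: the payload stays inside the attribute value.
$quoted = '<a title="' . $escaped . '">profile</a>';

// Unquoted attribute: whitespace ends the value, and spaces are not escaped,
// so this payload passes through htmlspecialchars() untouched.
$spacey = htmlspecialchars('x onmouseover=alert(1)', ENT_QUOTES, 'UTF-8');
$unquoted = '<a title=' . $spacey . '>profile</a>';

echo $quoted, "\n", $unquoted, "\n";
```

In the second case the escaping function did nothing wrong; there was simply nothing in the payload for it to escape.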
If you are using PHP and want to accept any user input that should be interpreted as HTML, you basically need to be using http://htmlpurifier.org/. If you are going to be accepting text input, clean the input value to ensure proper character encoding is being used (important for multi-byte encoding such as UTF-8) and then use htmlspecialchars(), and make sure to specify your encoding as the third parameter.
    // Specify your encoding to the browser
    header('Content-Type: text/html; charset=utf-8');

    // Return the named GET parameter with invalid UTF-8 dropped and HTML escaped
    function get_escape($field) {
        $value = iconv(
            'UTF-8',
            'UTF-8//IGNORE',
            isset($_GET[$field]) ? $_GET[$field] : ''
        );
        return htmlspecialchars($value, ENT_COMPAT, 'UTF-8');
    }

    // Safely output text user input
    echo '<html><body><p title="' . get_escape('title') . '">' . get_escape('content') . '</p></body></html>';
That's actually not true, and htmlspecialchars() will automatically do the UTF-8 validation for you as long as you have set your charset correctly. But even if you do that, you are still vulnerable.
Inside on* handlers and style attributes, the rules are different. Take something like this:
htmlspecialchars() does its job here. It turns a single quote into &#039;; however, inside on* and style attributes the &#039; entity is treated as a raw single quote. You need to double-escape in this particular case to be safe, or better yet, don't use raw on* handlers and style attributes. There are much cleaner ways to do those, but if you have to, don't ever put user data in them because you will mess up the escaping.
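A rough sketch of why the entity does not help inside an event handler. The greet() function and the input are hypothetical, and html_entity_decode() is only standing in for the entity decoding a browser performs on an attribute value before the JavaScript engine sees it.

```php
<?php
// Hypothetical user input meant to land inside a JavaScript string literal.
$name = "'); alert(1); //";

// HTML layer only: the single quote becomes the entity &#039;.
$htmlOnly = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');
$markup = '<a onclick="greet(\'' . $htmlOnly . '\')">hi</a>';

// A browser decodes entities in the attribute value before handing it to the
// JavaScript engine; html_entity_decode() stands in for that step here.
$seenByJs = html_entity_decode("greet('" . $htmlOnly . "')", ENT_QUOTES, 'UTF-8');

echo $markup, "\n", $seenByJs, "\n";
```

The markup looks escaped, but the script that actually runs has the raw quote back, so the string literal is closed early and alert(1) executes.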
It is not possible to write a single generic html escaping function that will work in all contexts. If it was, I would have written htmlspecialchars() differently.
There are more examples of how you can mess up even if you always quote your attributes if your escaping function isn't smart. The UTF-7 hack was mentioned, which is good, but the invalid UTF-8 hack wasn't explained. That is, if you send an invalid UTF-8 sequence, like %E0 then certain browsers (well, just IE) will lose their minds unless you make sure you don't display that invalid UTF-8 sequence back to the user. So htmlspecialchars() does more than just escape the set of chars you mentioned, it also validates the characters and makes sure it never outputs an invalid UTF-8 byte sequence.
0xE0 by itself is the first byte of a 3-byte UTF-8 char and IE will simply eat the following 2 bytes to make up the char. So if you output: "<e0>"> even though the byte is inside quotes, IE will eat the following "> and replace those 3 bytes with the dreaded (?) char, but more disastrously it will think it is still inside the quoted attribute so the next raw quote it sees will end the attribute and you have yourself another quoted xss hole.
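A small sketch of that validation, assuming PHP 5.4 or later, where the documented behaviour is to return an empty string for invalid input unless ENT_IGNORE or ENT_SUBSTITUTE is set. The input is a made-up string starting with a lone 0xE0 byte.

```php
<?php
// A lone 0xE0 lead byte followed by the characters that would close a tag.
$bad = "\xE0\">";

// Without ENT_IGNORE or ENT_SUBSTITUTE, invalid input yields an empty string.
$strict = htmlspecialchars($bad, ENT_QUOTES, 'UTF-8');

// ENT_SUBSTITUTE replaces the invalid sequence with U+FFFD instead.
$substituted = htmlspecialchars($bad, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($strict === '');                         // nothing is echoed back
var_dump(preg_match('//u', $substituted) === 1);  // checks for valid UTF-8
var_dump(strpos($substituted, '"') === false);    // checks for a raw quote
```

Either way the dangling lead byte never reaches the browser, so there is nothing for it to swallow the closing quote with.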
I'm not trying to be overly pedantic with this reply, but I think when it comes to security related topics it is good to be clear and so I think the conversation should continue until it is clear to readers what is being discussed.
For instance, the original article said:
Please don’t assume that, having read this post, you now know
everything there is to know about HTML escaping. I can
guarantee that you don’t, because I don’t.
Here I can say that using PHP's htmlspecialchars() with a proper encoding and clean data is all you need to know about escaping HTML. In the example of talking about on* attributes, you are now discussing escaping of JavaScript. The reason htmlspecialchars() fails here is not because it is broken, but because you are using it on the wrong language.
In terms of the UTF-7 issue with IE, the example code I posted properly handles those situations since it uses iconv() to clean the UTF-8. Really we've now changed from talking about escaping HTML output to cleaning user input.
I think the most important thing for people to understand is that there is a lot of knowledge required to write a secure PHP application. You don't just need to worry about escaping HTML and SQL injection.
If anyone is interested in learning more about the various aspects, I gave a talk at BostonPHP and the Boston Security Meetup last year that was a survey of PHP security. If anyone is interested in seeing the slides, you can find them at http://wbond.net/security. The slides are HTML-based and contain links to learn even more.
> Inside on* handlers and style attributes, the rules are different.
That may be so, but I'm not sure I'd call that an “HTML” escaping problem per se. An attribute always has an additional syntax, and you have to account for the subsyntaxes of whatever attributes are in place—but those aren't at the HTML layer proper, only implied by it. E.g., a's href attribute takes a URI. onfoo attributes take JavaScript code. style attributes take CSS. So in order to make HTML “safe” (FSVO “safe”) when it contains those attributes, you have to make those values “safe” recursively according to their subsyntaxes—e.g., if you want to allow CSS with url(), then you have a URI inside CSS inside an HTML attribute inside an HTML document inside (for instance) a UTF-8 string, and you have to take all the layers into account.
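As a sketch of escaping layer by layer for the JavaScript-inside-an-attribute case: the helper name js_in_attr() and the greet() call are hypothetical, and this assumes the value is valid UTF-8 and is only ever used as a JavaScript string literal.

```php
<?php
// Hypothetical helper: escape a value for a JS string inside an HTML attribute.
function js_in_attr($value) {
    // Inner layer (JavaScript): emit a JSON string literal; the JSON_HEX_*
    // flags turn <, >, &, ' and " inside the value into \uXXXX escapes.
    $js = json_encode(
        (string) $value,
        JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT
    );
    // Outer layer (HTML attribute): escape whatever the literal still contains,
    // which here is the pair of double quotes that delimit it.
    return htmlspecialchars($js, ENT_QUOTES, 'UTF-8');
}

$name = "'); alert(1); //";
echo '<a onclick="greet(' . js_in_attr($name) . ')">hi</a>', "\n";
```

The order matters: the innermost syntax is escaped first, then each enclosing layer in turn, which mirrors the order in which the browser unwraps them.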
It may be that a lot of people don't realize there are several potential layers of syntax involved, think of it as a monolithic and simple thing, and then get confused when it is not. Things like PHP's htmlspecialchars can inadvertently encourage this kind of inaccurate view.
(I'm not disagreeing with you exactly, just describing from a variant perspective. A bit of redundancy in discussion can create an antialiasing-like effect.)
That certainly wasn't my intention specifically. I mentioned htmlspecialchars because it was the example being used upthread, but any API with analogous functionality is potentially subject to similar provisos, and if any surrounding cultural element encourages its use without thinking through the syntax layers, it can have a similar effect. I assumed this was implicit.
When generating output for a browser what you're really doing is writing an HTML serializer. Kind of tricky to do right by concatenating a bunch of strings together. Some template systems (such as Genshi for Python) actually parse the template as HTML or XML so they understand how to encode all of your outputs correctly for their context.
We use unquoted attributes at Google sometimes. It's for bandwidth/latency-saving reasons. They're usually limited to literal template text (i.e. class names, width/height attributes, etc.) and not user-generated text.
I use single-quoted attributes on a regular basis when editing XML (and HTML, which I edit as XML) documents by hand, mainly because it conserves a noticeable amount of Shift keying on the US keyboard layout. If necessary I mechanically translate them into double quotes ex post facto, but I don't bother if I can assume a style of parser that will handle the single quotes.
I debated whether or not to mention this, and in the end decided I didn't want an in-depth discussion of edge cases to overwhelm the basic message I was trying to get across, which is that context is key.
As far as I know, ` is only an issue when using user input in innerHTML with IE. Are there other situations where it can be harmful?
My understanding (and I tested to confirm) is that IE only treats ` as an attribute delimiter when it's assigned to an element's innerHTML value dynamically. So this is important when working with client-side code, but not so much when generating HTML on the server.
Personally I’d just consider quoting attribute values to be best practice. And BTW, escaping < and > is not even necessary if you are using quotes with HTML attributes, unless you need compatibility with browsers prior to Netscape 2.
> Personally I’d just consider quoting attribute values to be best practice. And BTW, escaping < and > is not even necessary if you are using quotes with HTML attributes, unless you need compatibility with browsers prior to Netscape 2.
<a href="...">Label with evil angle brackets</a><script>...
I put this example into my browser and there was no attack. Replace the second mouseover with alert(2) and it's clear that it is never parsed as javascript. Am I missing something? [Edit: yes, I am missing something]
I was concerned about this because I wrote an open source Python HTML whitelist that escapes or eliminates everything that isn't on the whitelist, but it sounds like my filter works as long as you quote attributes. I guess I need to add a warning for that.
> If your system accepts 'foo" onmouseover="alert(1)' as a username, you've got bigger problems.
Technically that shouldn't be a problem. I can put that in HTML, in a URL, in the database. I can even make a directory with that name and use it in shell scripts, as long as every one of them uses correct escaping.
The example I gave in the blog post was a real-world example: a link to a user's profile page that includes the username in the URL. This is extremely common.
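For that profile-link case, a minimal sketch of escaping for both contexts. The profile_link() helper and the /users/ path are assumptions made up for illustration, not taken from the blog post.

```php
<?php
// Hypothetical helper building a profile link; the /users/ path is an assumption.
function profile_link($username) {
    // URL layer: percent-encode the username as a single path segment.
    $href = '/users/' . rawurlencode($username);
    // HTML layer: escape the finished URL for the attribute, and the label
    // for the element content.
    return '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">'
        . htmlspecialchars($username, ENT_QUOTES, 'UTF-8') . '</a>';
}

echo profile_link('foo" onmouseover="alert(1)'), "\n";
```

With both layers handled, the awkward username is just data: it survives intact in the URL and in the label without ever closing the attribute.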
Naturally, only a fool would allow usernames to contain unsafe characters, but there are a lot of fools writing web apps.
PHP's htmlspecialchars function, which is called out specifically in this post, is actually fine for most use-cases. If you write code like this:
Then an attacker can't break out of the attribute (and consequently can't do anything bad). htmlspecialchars also accepts a flag, ENT_QUOTES, which allows it to sanitize values for single-quoted attributes properly. If you don't surround an attribute's value at all, then you're still vulnerable: you also have invalid HTML.
The real takeaway here can be found in the middle of the post: "you must be aware of the context in which you’re working with user input."