There's more to HTML escaping than &, <, >, and " (wonko.com)
138 points by jamesbritt on May 8, 2011 | 30 comments


...except when there isn't. ;-)

PHP's htmlspecialchars function, which is called out specifically in this post, is actually fine for most use-cases. If you write code like this:

    <input type="text" name="foo" value="<?php echo htmlspecialchars($_GET['bar']) ?>" />
Then an attacker can't break out of the attribute (and consequently can't do anything bad). htmlspecialchars also accepts a flag, ENT_QUOTES, which makes it escape single quotes as well, so values are also safe inside single-quoted attributes. If you don't quote an attribute's value at all, then you're still vulnerable (and you also have invalid HTML).
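
A minimal sketch of the quoted versus unquoted case, written in Python rather than PHP so it can be run standalone. Python's `html.escape` stands in for `htmlspecialchars` with ENT_QUOTES here; that substitution and the `render_input` helper are assumptions for illustration, not code from the post.

```python
import html

def render_input(value: str, quoted: bool = True) -> str:
    """Render an <input> whose value is HTML-escaped, quoted or unquoted."""
    escaped = html.escape(value, quote=True)  # escapes & < > " '
    if quoted:
        return f'<input type="text" name="foo" value="{escaped}" />'
    # Unquoted on purpose, to show the failure mode described above.
    return f"<input type=text name=foo value={escaped} />"

# Inside a quoted attribute the double quote becomes &quot; and cannot end the value.
print(render_input('x" onmouseover="alert(1)'))

# Without quotes, a plain space is enough to start a new attribute,
# and escaping never touches spaces.
print(render_input("x onmouseover=alert(1)", quoted=False))
```

In the second call nothing was escaped because the payload contains none of the five special characters, which is why escaping alone cannot rescue an unquoted attribute.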

The real takeaway here can be found in the middle of the post: "you must be aware of the context in which you’re working with user input."


Right, htmlspecialchars() works perfectly as long as you use it for an attribute value surrounded by quotes, or if you escape a whole string. The only way you can get into trouble is if you use it on an attribute value that doesn't have quotes.

If you are using PHP and want to accept any user input that should be interpreted as HTML, you basically need to be using HTML Purifier (http://htmlpurifier.org/). If you are going to be accepting text input, clean the input value to ensure it is validly encoded (important for multi-byte encodings such as UTF-8), then use htmlspecialchars(), and make sure to specify your encoding as the third parameter.

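  // Specify your encoding to the browser
  header('Content-Type: text/html; charset=utf-8');

  // Return the named GET parameter, stripped of invalid UTF-8 and HTML-escaped
  function get_escape($field) {
    // iconv() with //IGNORE drops byte sequences that are not valid UTF-8
    $value = iconv(
      'UTF-8',
      'UTF-8//IGNORE',
      isset($_GET[$field]) ? $_GET[$field] : ''
    );
    // ENT_COMPAT escapes & < > and ", enough for double-quoted attributes
    return htmlspecialchars($value, ENT_COMPAT, 'UTF-8');
  }
  
  // Safely output text user input
  echo '<html><body><p title="' . get_escape('title') . '">' . get_escape('content') . '</p></body></html>';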


That's actually not true, and htmlspecialchars() will automatically do the UTF-8 validation for you as long as you have set your charset correctly. But even if you do that, you are still vulnerable.

Inside on* handlers and style attributes, the rules are different. Take something like this:

    <?php $foo = htmlspecialchars($_GET['foo'], ENT_QUOTES); ?>
    <a href="" onmouseover="a='Fantas<?php echo $foo ?>tic';">Mouse Over Me</a>

htmlspecialchars() does its job here: it turns a single quote into &#039;. However, inside on* and style attributes the browser decodes &#039; back into a raw single quote before the script or style is interpreted. You need to double-escape in this particular case to be safe, or better yet, don't use raw on* handlers and style attributes. There are much cleaner ways to do those, but if you have to use them, don't ever put user data in them because you will mess up the escaping.

You can try a live example here:

http://talks.php.net/show/flux/14
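
A rough sketch of what "double-escape" means for the on* example above, in Python instead of PHP. The helper names are hypothetical, `json.dumps` is used as one way to produce a JavaScript string literal, and `html.unescape` only approximates the entity decoding a browser applies to an attribute value before running it as script.

```python
import html
import json

def naive_handler(user_value: str) -> str:
    """HTML-escape only, as in the example above (the unsafe pattern)."""
    return "a='Fantas" + html.escape(user_value, quote=True) + "tic';"

def layered_handler(user_value: str) -> str:
    """Escape for JavaScript first, then for the HTML attribute."""
    js_literal = json.dumps("Fantas" + user_value + "tic")   # JavaScript string layer
    return html.escape("a=" + js_literal + ";", quote=True)  # HTML attribute layer

payload = "';alert(1);//"

# What the JavaScript engine would see after the browser decodes the attribute:
print(html.unescape(naive_handler(payload)))    # a='Fantas';alert(1);//tic';
print(html.unescape(layered_handler(payload)))  # a="Fantas';alert(1);//tic";
```

In the naive version the decoded quote ends the string and `alert(1)` runs as code; in the layered version the payload stays inside the string literal.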

It is not possible to write a single generic HTML escaping function that will work in all contexts. If it were, I would have written htmlspecialchars() differently.

There are more examples of how you can mess up, even if you always quote your attributes, if your escaping function isn't smart. The UTF-7 hack was mentioned, which is good, but the invalid UTF-8 hack wasn't explained. That is, if you send an invalid UTF-8 sequence, like %E0, then certain browsers (well, just IE) will lose their minds unless you make sure you don't display that invalid UTF-8 sequence back to the user. So htmlspecialchars() does more than just escape the set of chars you mentioned; it also validates the characters and makes sure it never outputs an invalid UTF-8 byte sequence.

0xE0 by itself is the first byte of a 3-byte UTF-8 char, and IE will simply eat the following 2 bytes to make up the char. So if you output "<e0>"> then, even though the byte is inside quotes, IE will eat the following "> and replace those 3 bytes with the dreaded (?) char. More disastrously, it will think it is still inside the quoted attribute, so the next raw quote it sees will end the attribute and you have yourself another quoted XSS hole.
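
A small sketch of the validation step described above, in Python as a stand-in for what htmlspecialchars() or iconv() does in PHP. The helper names are made up for illustration, and this version substitutes U+FFFD for invalid bytes rather than dropping them as `//IGNORE` would.

```python
import html

def clean_utf8(raw: bytes) -> str:
    """Decode as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")

def attr_bytes(raw: bytes) -> bytes:
    """Validate, HTML-escape, and re-encode a value for a quoted attribute."""
    return html.escape(clean_utf8(raw), quote=True).encode("utf-8")

# A lone 0xE0 lead byte never reaches the page: it becomes a complete,
# valid replacement character, so the closing "> is not swallowed.
page = b'<input value="' + attr_bytes(b"\xe0") + b'">'
print(page)

# A strict decode now succeeds, confirming the output is valid UTF-8.
page.decode("utf-8")
```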


I'm not trying to be overly pedantic with this reply, but I think when it comes to security related topics it is good to be clear and so I think the conversation should continue until it is clear to readers what is being discussed.

For instance, the original article said:

  Please don’t assume that, having read this post, you now know
  everything there is to know about HTML escaping. I can
  guarantee that you don’t, because I don’t.
Here I can say that using PHP's htmlspecialchars() with a proper encoding and clean data is all you need to know about escaping HTML. In the example of talking about on* attributes, you are now discussing escaping of JavaScript. The reason htmlspecialchars() fails here is not because it is broken, but because you are using it on the wrong language.

In terms of the UTF-7 issue with IE, the example code I posted properly handles those situations since it uses iconv() to clean the UTF-8. Really we've now changed from talking about escaping HTML output to cleaning user input.

I think the most important thing for people to understand is that there is a lot of knowledge required to write a secure PHP application. You don't just need to worry about escaping HTML and SQL injection.

If anyone is interested in learning more about the various aspects, I gave a talk at BostonPHP and the Boston Security Meetup last year that was a survey of PHP security. You can find the slides at http://wbond.net/security. They are HTML-based and contain links to learn even more.


> Inside on* handlers and style attributes, the rules are different.

That may be so, but I'm not sure I'd call that an “HTML” escaping problem per se. An attribute always has an additional syntax, and you have to account for the subsyntaxes of whatever attributes are in place—but those aren't at the HTML layer proper, only implied by it. E.g., a's href attribute takes a URI. onfoo attributes take JavaScript code. style attributes take CSS. So in order to make HTML “safe” (FSVO “safe”) when it contains those attributes, you have to make those values “safe” recursively according to their subsyntaxes—e.g., if you want to allow CSS with url(), then you have a URI inside CSS inside an HTML attribute inside an HTML document inside (for instance) a UTF-8 string, and you have to take all the layers into account.
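
One way to picture the layering, sketched in Python: a username placed in a URI path, inside an href attribute, with the same username as link text. The `profile_link` helper and the `/user/` path are illustrative assumptions; each layer gets its own escaping, applied from the innermost syntax outward.

```python
import html
from urllib.parse import quote

def profile_link(username: str) -> str:
    """Build a profile link, escaping each syntax layer separately."""
    path = "/user/" + quote(username, safe="")   # URI layer: percent-encode
    href = html.escape(path, quote=True)         # HTML attribute layer
    label = html.escape(username, quote=False)   # HTML text layer
    return f'<a href="{href}">{label}</a>'

print(profile_link('foo" onmouseover="alert(1)'))
# <a href="/user/foo%22%20onmouseover%3D%22alert%281%29">foo" onmouseover="alert(1)</a>
```

The raw quotes in the link text are harmless because they sit in text content, not in an attribute value.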

It may be that a lot of people don't realize there are several potential layers of syntax involved, think of it as a monolithic and simple thing, and then get confused when it is not. Things like PHP's htmlspecialchars can inadvertently encourage this kind of inaccurate view.

(I'm not disagreeing with you exactly, just describing from a variant perspective. A bit of redundancy in discussion can create an antialiasing-like effect.)


Is this thread a parody of how no one should use PHP?


That certainly wasn't my intention specifically. I mentioned htmlspecialchars because it was the example being used upthread, but any API with analogous functionality is potentially subject to similar provisos, and if any surrounding cultural element encourages its use without thinking through the syntax layers, it can have a similar effect. I assumed this was implicit.


When generating output for a browser what you're really doing is writing an HTML serializer. Kind of tricky to do right by concatenating a bunch of strings together. Some template systems (such as Genshi for Python) actually parse the template as HTML or XML so they understand how to encode all of your outputs correctly for their context.
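
The serializer idea can be sketched with Python's standard library instead of Genshi: build the element as a tree and let the serializer pick the escaping for attribute values and text. This uses `xml.etree.ElementTree`, which is not a template system; the `profile_link` helper and the `/user/x` path are made up for the example.

```python
import xml.etree.ElementTree as ET

def profile_link(username: str) -> str:
    """Build the link as a tree; the serializer escapes per context."""
    a = ET.Element("a", {"href": "/user/x", "title": username})
    a.text = username
    return ET.tostring(a, encoding="unicode", method="html")

print(profile_link('foo" onmouseover="alert(1)'))
# <a href="/user/x" title="foo&quot; onmouseover=&quot;alert(1)">foo" onmouseover="alert(1)</a>
```

No escaping function is called by hand here: the quote is encoded in the attribute value and left alone in the text, because the serializer knows which context it is writing.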


Indeed, if you're generating the HTML/XML stream as tokens instead of as plain text, the context-sensitive quoting can be done automatically.

When I do have to generate HTML as text I usually go with escaping &<>"' and I double quote all attribute values. Isn't this best practice?

Is there anyone using 's or (eek) unquoted HTML attributes at all?


We use unquoted attributes at Google sometimes. It's for bandwidth/latency-saving reasons. They're usually limited to literal template text (i.e. class names, width/height attributes, etc.) and not user-generated text.


I use single-quoted attributes on a regular basis when editing XML (and HTML, which I edit as XML) documents by hand, mainly because it conserves a noticeable amount of Shift keying on the US keyboard layout. If necessary I mechanically translate them into double quotes ex post facto, but I don't bother if I can assume a style of parser that will handle the single quotes.


Can't believe the backtick/grave accent ( ` ) isn't mentioned anywhere in the article.

That bugger can really do some damage.


I debated whether or not to mention this, and in the end decided I didn't want an in-depth discussion of edge cases to overwhelm the basic message I was trying to get across, which is that context is key.

As far as I know, ` is only an issue when using user input in innerHTML with IE. Are there other situations where it can be harmful?


` can be used in place of single or double quotes around attribute values in IE.


My understanding (and I tested to confirm) is that IE only treats ` as an attribute delimiter when it's assigned to an element's innerHTML value dynamically. So this is important when working with client-side code, but not so much when generating HTML on the server.

Am I wrong?


I just tried the following HTML:

    <input type="text" value=`asdf` />
In IE, the input box contained the string asdf. In other browsers, it contained the string `asdf`
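
If backticks are a concern, one possible mitigation is to encode them as a numeric character reference in addition to the usual five characters. This is only a sketch in Python: the helper name is hypothetical, and whether it covers every IE behavior discussed here is not something this example demonstrates.

```python
import html

def escape_attr_with_backtick(value: str) -> str:
    """Escape & < > " ' and additionally encode the backtick as &#96;."""
    return html.escape(value, quote=True).replace("`", "&#96;")

print(escape_attr_with_backtick("`asdf`"))  # &#96;asdf&#96;
```

The replacement runs after `html.escape`, so the ampersand in `&#96;` is not escaped a second time.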


You're right. I was mistakenly testing only a limited case (described at http://html5sec.org/#59). Thanks!


Personally I’d just consider quoting attribute values to be best practice. And BTW, escaping < and > is not even necessary if you are using quotes with HTML attributes, unless you need compatibility with browsers prior to Netscape 2.


> Personally I’d just consider quoting attribute values to be best practice. And BTW, escaping < and > is not even necessary if you are using quotes with HTML attributes, unless you need compatibility with browsers prior to Netscape 2.

<a href="...">Label with evil angle brackets</a><script>...


Sorry, I forgot to say within attribute values.


That's exactly the point of the blog post: it's all about context.


    <a href="/user/foo" onmouseover="alert(1)">foo" onmouseover="alert(1)</a>
I put this example into my browser and there was no attack. Replace the second mouseover with alert(2) and it's clear that it is never parsed as javascript. Am I missing something? [Edit: yes, I am missing something]

EDIT: Blerg! I misread it! Disregard!


The first alert() should work - note that it's not part of the template.

The second alert() is just garbage, of course.


I was concerned about this because I wrote an open source Python HTML whitelist filter that escapes or eliminates everything that isn't on the whitelist, but it sounds like my filter works as long as you quote attributes. I guess I need to add a warning for that.

Will someone double check me? https://sourceforge.net/projects/htmlfilterfacto/files/



I'll fix it. Thank you.


I can't find a use case for the example the article is working with. Who in their right mind would take user input and put it in an HTML attribute?

If your system accepts 'foo" onmouseover="alert(1)' as a username, you've got bigger problems.


> If your system accepts 'foo" onmouseover="alert(1)' as a username, you've got bigger problems.

Technically that shouldn't be a problem. I can put that in HTML, in a URL, in the database. I can even make a directory with that name and use it in shell scripts, as long as every one of them uses correct escaping.

Bobby Tables is welcome on my systems.
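
A short Python sketch of "correct escaping in every context" for two of the places named above, the database and the shell. The table name and variable names are illustrative assumptions; `sqlite3` parameter binding and `shlex.quote` are the standard-library mechanisms used.

```python
import shlex
import sqlite3

name = 'foo" onmouseover="alert(1)'

# Database layer: pass the value as a bound parameter, never by string pasting.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
stored = conn.execute("SELECT name FROM users").fetchone()[0]

# Shell layer: quote the value so a POSIX shell sees it as one argument.
shell_arg = shlex.quote(name)

print(stored == name)  # the database returns the name unchanged
print(shell_arg)
```

The stored value is byte-for-byte the original; only its representation changes at each boundary.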


> Who in their right mind would take user input and put it in an HTML attribute?

Hacker News, for example :) HN allows you to enter a color code in your preferences, which is inserted as a bgcolor attribute on an HTML table.


The example I gave in the blog post was a real-world example: a link to a user's profile page that includes the username in the URL. This is extremely common.

Naturally, only a fool would allow usernames to contain unsafe characters, but there are a lot of fools writing web apps.



