
Same here. Also, I don't use Atom, but looking at the expression I'm pretty sure it fails to account for strings with parentheses. The way the matching is done, it looks like it will happily count:

    func("some call :)")
as an extra closing paren. (regex101 agrees.)
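I don't know Atom's exact pattern, but the failure mode is easy to reproduce with any counter that ignores string context; a minimal sketch:

```javascript
// Naive paren counting that ignores string context, the same class of
// mistake a regex-based highlighter makes.
function parenBalance(line) {
  const opens = (line.match(/\(/g) || []).length;
  const closes = (line.match(/\)/g) || []).length;
  return opens - closes;
}

// The smiley's ")" is counted as a real closing paren:
parenBalance('func("some call :)")');  // → -1, reported as an extra close
```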


Sometimes I feel like the difference between junior, intermediate, and senior developers is that the junior hasn't yet figured out how to use regexps, the intermediate dev has, and the senior dev has figured out not to.

I kid, but...really. The number of times I've seen people burnt by non-trivial regexps in production code is absurd.

"Oh, I need to mangle this CSV that's in the wrong format"? Sure, write a one-off regexp.

"Hm, there might be a security vulnerability in this URL parameter, I need some way to filter out bad inputs..." Oh hell no.

Literally two weeks ago I tracked down a crazy bug to a regexp someone had written to try to correct mistyped email addresses in a signup form (?!) by deleting "invalid" characters. Like a "+". So many facepalms for one short line of code...
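I never saw the original line, but a hypothetical reconstruction of that kind of "helpful" cleanup shows how it goes wrong: the allowlist quietly drops legal address characters.

```javascript
// Hypothetical reconstruction of the bug: delete anything that isn't
// a "normal" email character. "+" is perfectly legal, but not listed.
const fixEmail = email => email.replace(/[^A-Za-z0-9@._-]/g, '');

fixEmail('bob+news@example.com');  // → 'bobnews@example.com', a different mailbox
```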


The rule I've developed for myself is: never clean up data for security purposes. There are some cases where it is valid to examine the data for validity; for instance, there are many places where you'd like to filter null characters or "all of the ASCII control characters except the whitespace", and it's valid to look for those things. But if you find them, reject the input. In theory you'd think you could clean it up, but in practice it turns out to be just too dangerous. Too many concrete instances of real-world failure, and reality always trumps theory.
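As a sketch of that reject-don't-repair rule (the character policy here is an example, not a universal one):

```javascript
// Reject input containing ASCII control characters other than
// ordinary whitespace (\t, \n, \r); never try to strip them out.
function checkText(input) {
  if (/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/.test(input)) {
    throw new Error('rejected: control characters in input');
  }
  return input;  // passed through untouched, never "cleaned"
}
```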


The famous "Gruber URL regexp", before an update in 2014, would run in exponential time given some URLs that contained parens.

I noticed this when a web app I worked on froze on certain pages. Runaway regexp matching is one of the easiest ways to really lock a JavaScript thread.
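Gruber's pattern is long, but the blow-up mechanism can be shown with a textbook evil regex (not his actual one); nested quantifiers make the engine try every partition of the input before it can fail:

```javascript
// Classic catastrophic-backtracking shape. On "aaa...a!" the engine
// explores roughly 2^n ways to split the "a"s between the two "+"s.
const evil = /^(a+)+$/;

evil.test('aaaa');          // true, instantly
evil.test('aaaaaaaaaa!');   // false, but every extra "a" on a failing
                            // input roughly doubles the running time
```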


I was also hit by this one in my pet IRC bot. :) URL matching would fail (occasionally) and the whole bot would freeze up. :<


I completely agree. Honestly, regular expressions are hard to get a handle on anyway. I've done software development for about 12 years now, I still run into regular expressions pretty regularly, and I still find them hard to read and understand. Almost every time, they do things that could easily be accomplished with a loop and a tiny bit of string manipulation; and even if the regex is a little faster, it's rarely so much faster that the micro-optimization of using a regex over a loop is worth it.

I try to avoid regex where possible and in the event that I do need to use it I try to document it the best I can and to make sure it's as simple as possible to avoid weird issues like this.


Out of curiosity: what would you use for URL validation?

(but i get this is difficult: https://gist.github.com/dperini/729294 )


Well, it's easy to see why people are led astray. In this case, RFC 3986, which defines what a URI actually is, itself proposes a regex[0]. It even claims, and I quote, 'the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions'. Of course, it doesn't actually validate anything... it just performs rudimentary field extraction. Trying to do anything beyond that with regex is madness.
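For reference, here is the Appendix B regex (copied from the RFC) and what "first-match-wins" extraction actually gives you:

```javascript
// The regex from RFC 3986, Appendix B. It always matches: it splits
// a string into components but validates nothing.
const rfc3986 = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

const m = 'http://example.com/a/b?q=1#frag'.match(rfc3986);
// scheme: m[2] → 'http', authority: m[4] → 'example.com',
// path: m[5] → '/a/b', query: m[7] → 'q=1', fragment: m[9] → 'frag'

// But garbage "matches" just as happily:
'definitely not a URI'.match(rfc3986) !== null;  // true
```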

The ABNF grammar is fairly simple though and, I think, free of ambiguities. I've had luck converting it straight into PEG form. It's not a trivial or wholly useful endeavour though, and, if you do so, remember to check the errata. For HTTP you'll also have to add in the changes from RFC 7230[1].

Oh, and of course, none of this validates DNS names, their labels, etc. for length, the "LDH rule", or the public suffix list[2], or IP addresses to check whether they have publicly routable prefixes.

Bottom line is, if you want to validate a URL, the best thing to do, much like e-mail, is to just try and GET it.

[0] https://tools.ietf.org/html/rfc3986#appendix-B

[1] https://tools.ietf.org/html/rfc7230#section-2.7.1

[2] https://publicsuffix.org/


The general rule is that any tricky security code should have as many eyeballs on it as possible, so I'd probably start by seeing if my framework had solid support for this, or failing that, see what popular libraries for my platform exist.

In this particular case, the Google Caja project[1] is a good starting place for most HTML/JS/CSS sanitization needs (although the project has a much larger scope than just that). I think the 'sanitizer' package on npm is a fairly popular wrapper/port of its basic sanitizing code, and I believe Ruby/PHP/Python have their own, but I couldn't name them offhand. It would depend on the exact attack vector you're trying to stop, e.g. XSS, remote shell via filename params, etc.

If I had to write it myself, I'd probably go for something as braindead as possible; probably a bunch of nested loops backed by some thorough tests. Regexps are great for magic one liners, but magic one liners are antithetical to good security.
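In that spirit, a deliberately braindead check (hypothetical policy, plain loop, trivially testable character by character):

```javascript
// Allowlist a filename parameter one character at a time: no regex,
// and anything outside the list is rejected rather than stripped.
function isSafeFilename(name) {
  if (name.length === 0 || name.length > 255) return false;
  if (name.startsWith('.')) return false;  // blocks dotfiles and ".."
  for (const ch of name) {
    const ok = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
               (ch >= '0' && ch <= '9') || '-_.'.includes(ch);
    if (!ok) return false;
  }
  return true;
}
```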

[1]: https://developers.google.com/caja/


I prefer straight up imperative code (or, better yet, whatever Url/Uri API that is provided by the platform).
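For example, in JavaScript the WHATWG URL class (built into browsers and Node) already does the parsing:

```javascript
// Let the platform's URL parser decide instead of a hand-rolled regex.
function parseUrlOrNull(s) {
  try {
    return new URL(s);   // throws a TypeError on unparsable input
  } catch (e) {
    return null;
  }
}

parseUrlOrNull('http://example.com/a?b=1').hostname;  // → 'example.com'
parseUrlOrNull('not a url');                          // → null
```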


Just tried this, you are correct... http://imgur.com/J9tP7lk (note the incorrect bracket highlighting)


Jesus, how did nobody catch this before? The least a code editor should do is to match parentheses correctly.


I've never used a language mode that worked properly in all cases. That's why I've stopped using them, aside from very clever and extraordinarily useful ones like paredit.

It terrifies me to think about how these language modes contain regexps for matching regexp syntax. Of course those are wrong.


> I've never used a language mode that worked properly in all cases.

You need a full parser for that to happen, most mode authors don't bother[0]. JS2-mode (https://github.com/mooz/js2-mode/) does that (or attempts to).

Most syntactic colorisers are souped-up tokenisers, not actual parsers. It suffices in 99% of cases, but breaks badly elsewhere.

[0] it would help if more languages made fast parsing machinery externally available as a convenient API. Alas, that's rarely the case, even less so for fast parsing machinery that doesn't throw the information you need for syntactic coloration out the window.


I'm fairly sure that Emacs's go-mode, which doesn't use a full-blown parser, won't make the mistake of treating a paren inside a string as if it were outside either. Or if it did, that would be easy to fix.

Emacs has a generic lightweight parsing facility that can tell you if you're inside a string, or a comment, or how to skip over a string, or a pair of matching parens/braces/brackets, and so on.


The issue tracker for js2-mode illustrates that such a parsing approach is no guarantee of real-world accuracy either.

https://github.com/mooz/js2-mode/issues

The goal just shifts from having a semi-accurate regexp tokenizer, to having the language mode's complex parser match the complex parser of the language it targets—which will pretty much never happen, unless the language is exquisitely simple, like maybe restricted dialects of Scheme... or Brainfuck.


js2-mode supports ES6 fairly accurately. But of course, that's not enough, people keep asking for new things: JavaScript mixed with HTML, Facebook Flow support, JSX, new ES7 (still unstable) syntax extensions, and so on.

It would be easier with a more stable language. I.e. not JavaScript, which is still going through growing pains.

Even so, "25 Open, 204 Closed" seems pretty good to me.


fundamental-mode hasn't failed me even once, is what I mean. :) I'm not criticizing js2-mode, it's just that I have personally given up hope for accurate language modes in general, and realized that they are a lot of effort and IMO relatively little use.


Does Atom have an API for syntax highlighting with full parsers? I'm pretty sure it only supports big trees of regexes for that.


[0] - TypeScript does this, I believe.


The Textadept editor uses PEGs for its language modes. I haven't tested them but that sounds like it has a chance of being pretty good.


Apparently that's easier said than done.


... with regex.


With anything less than a full-featured lexer and parser of the latest edition of the language in question, I'd say. Including understanding of whatever metaprogramming features or preprocessors that are in play.


For a code editor it's even more than full-featured, really: you want a language parser that can handle partial statements more gracefully than even the core language parser itself does. von Neumann help you if your language was complicated to parse even before you added that requirement.


Visual Studio doesn't even do a full-blown AST - even though it has a full-blown internal parser. It frequently messes up auto-indenting when using macros involving blocks.

Lexing and producing tokens with a loop is probably enough given that auto-indentation works on a subset of states of the parser, if any. As my comment alluded to, using regex is harder to both write and read than a simple loop; it's also most likely slower.


This could be solved by using regexes to tokenise instead, then detecting brackets. Tokenising is something regular expressions can actually do well. Directly parsing source code is not.
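Something like this: a single tokenising regex whose string and comment alternatives consume their contents, so a bracket inside them never reaches the counter.

```javascript
// Tokenise first, then count: string and comment tokens swallow any
// parens they contain, so only code-level brackets affect the depth.
function parenBalance(src) {
  const token = /"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\/\/[^\n]*|[()]/g;
  let depth = 0;
  for (const [t] of src.matchAll(token)) {
    if (t === '(') depth++;
    else if (t === ')') depth--;
  }
  return depth;  // 0 means balanced
}

parenBalance('func("some call :)")');  // → 0, unlike a naive character count
```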



