Against using metaphors too much: "I was like,
'why don't you just say it in terabytes of RAM?'" -- ekate
== String is not a type ==
The basics of XSS prevention
Which also happen to cover header splitting
and SQL injection
and a whole host of other nasties
Outline:
1. What is XSS?
2. Why is XSS bad?
3. How do we stop XSS?
4. Why is (3) a bad approach?
also, dirty details of escaping in PHP
5. A conceptual model of strings in web applications
also, a generalization to other forms
(maybe also present the conceptual model for XSS)
6. The "right way": Make the tools do it for you
7. The "right way" in practice
8. Bonus! The UTF-8 story
== What is XSS? ==
XSS = Cross-Site Scripting
A fundamental part of the security of the web stems from the
"Same-Origin policy", which restricts the ability of web sites
to access information from other web sites (on different
domain names).
XSS means that an attacker is able to get *his* script running
on *your* domain, bypassing the Same-Origin policy.
Basic XSS scenario:
What if:
$description = '">';
then we get:
" />
Presto, instant JavaScript execution.
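A minimal sketch of the scenario in Python. The template, the file name,
and the payload are made up for illustration; the only assumption that
matters is that the description is dropped into an attribute as-is.

```python
# Hypothetical template: the description is interpolated without escaping.
def render_image(description: str) -> str:
    return '<img src="photo.png" alt="' + description + '" />'

# Attacker-supplied description: closes the attribute and the tag,
# then opens a script element of its own.
payload = '"><script>alert(1)</script>'

html_out = render_image(payload)
print(html_out)
```

The leftover `" />` at the end of the output is the tail of the original
tag, now stranded after the injected script element.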
== Why is XSS bad? ==
If you actually run the previous code, you'll get an undefined function
error. What might an attacker do?
A silent attack might do nothing more than gain access to your account
using your session cookie or remember-me cookie.
A web worm might take over your account immediately, and attempt to
spread the XSS vector to other places and infect other people.
Example: Samy.
If you're a high profile target with lots of users or if you're an
open-source project with lots of users, the stakes are high. Even
an XSS attack on a small scripts.mit site undermines trust and
puts your users at risk.
== How do we stop XSS? ==
A first impulse of many people is "Nuke the scripty parts": str_replace()
out any JavaScript, script tags, and so on. Don't do this.
[a full explanation would involve whitelists and a case where those
things might actually be wanted to be kept]
The next response is to "escape" the data.
In PHP, this means using htmlspecialchars() or htmlentities()
In Ruby, this means using CGI.escapeHTML()
In Python, this means html.escape() (cgi.escape() in older versions)
(depending on the modules you're using)
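A small sketch of the escaping approach in Python, using html.escape()
(the current standard-library counterpart of cgi.escape()). The template
and the payload are hypothetical.

```python
import html

def render_image(description: str) -> str:
    # html.escape() turns &, < and > into character references, and by
    # default both quote characters too, so the value stays in the attribute.
    return '<img src="photo.png" alt="' + html.escape(description) + '" />'

escaped = render_image('"><script>alert(1)</script>')
print(escaped)
```

The payload still appears in the page, but only as inert text inside the
alt attribute.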
== String is not a type ==
That is not the whole story. And in many cases, knowing just that
is going to get you in trouble.
We're going to be a little more sophisticated than that. We're actually
going to build a conceptual model of strings, so that you understand
why and when to apply escaping, and why the best practices work.
The very simplest story is that of plain-text: this is what a string
would like you to believe it is.
This text, however, can be structured in interesting ways, much like
the text in a novel is structured into chapters and sections. Anything
that is a "plaintext format" is eligible. Some notable web examples:
- HTML
- XML (yes, it's distinct)
- SQL
- JSON
And some not-so-webby examples:
- Rich Text Format
- LaTeX
- Shell code
- Any other programming language
When we *escape* some text, we are actually performing a format shift.
Example:
Plaintext -> HTML
<bob@example.com> -> &lt;bob@example.com&gt;
Since plaintext to X is the easiest type of format shift, we're dealing
with just this case in our first section. (We'll consider what happens
if you want X to Y, such as Wikitext -> HTML, or even HTML -> safe HTML
later)
Thus, the code we discussed earlier now means this:
$html = htmlspecialchars($plaintext);
$sql = mysql_real_escape_string($plaintext);
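The same idea can be sketched in Python, where the standard library has
one "plaintext -> X" function per target format. The sample string is made
up, and the choice of functions is an illustration rather than a complete
list.

```python
import html
import json
import shlex
from urllib.parse import quote, unquote

plaintext = "Tom & Jerry's <cat>"

as_html = html.escape(plaintext)     # plaintext -> HTML
as_json = json.dumps(plaintext)      # plaintext -> JSON string literal
as_shell = shlex.quote(plaintext)    # plaintext -> one POSIX shell word
as_url = quote(plaintext, safe="")   # plaintext -> URL component

# Each format has its own decoder that shifts back to the same plaintext.
round_trips = [
    html.unescape(as_html),
    json.loads(as_json),
    shlex.split(as_shell)[0],
    unquote(as_url),
]
print(all(value == plaintext for value in round_trips))
```

Four different outputs from one input: the string alone does not tell you
which of them you are holding.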
The story is not complete, though. One of the uncanny things about
HTML is the fact that it is actually an amalgamation of many different
formats.
Regular text
There's three different forms in the code above:
1. http://example.com => URL (technically URI)
2. Double-quoted/single-quoted text => Text inside an HTML attribute
3. Regular text => Regular HTML text
And they have DIFFERENT rules.
1. javascript:alert(1); => this is active JavaScript when treated
as a URL, but is ordinary text for an attribute or regular text.
2. He said, "Oh no!" => Valid regular text and single-quoted text,
but not double-quoted text
3. You're funny. => Valid regular text and double-quoted text, but
not single-quoted text
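A rough Python sketch of the three rule sets. The helper names are
hypothetical, and the list of allowed URL schemes is an assumption made for
this example.

```python
import html
from urllib.parse import urlsplit

def text_node(s: str) -> str:
    # Regular HTML text: only &, < and > are special.
    return html.escape(s, quote=False)

def attr_value(s: str) -> str:
    # Attribute value: both kinds of quote must be escaped as well.
    return html.escape(s, quote=True)

def is_safe_link(url: str) -> bool:
    # URL context: escaping cannot help, the scheme itself is the problem.
    return urlsplit(url).scheme.lower() in {"http", "https"}

print(text_node('He said, "Oh no!"'))
print(attr_value('He said, "Oh no!"'))
print(is_safe_link("javascript:alert(1);"))
```

Note that `javascript:alert(1);` passes through both escaping helpers
unchanged; only the URL check treats it as a problem.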
So, let's look at our original example again:
Plaintext -> HTML (regular text)
<bob@example.com> -> &lt;bob@example.com&gt;
This can happen in other languages too, although a gaffe is usually much
more obvious. Example, SQL:
Plaintext -> SQL (string)
bob -> bob
But if you then put it in the SQL query without quotes (which signify
that the inner text is a string), you'll (usually) get an error:
SELECT * FROM users WHERE name=bob;
ERROR 1054 (42S22): Unknown column 'bob' in 'where clause'
MySQL thinks that the input data is a column, not a string.
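The same gaffe can be reproduced with Python's built-in sqlite3 module.
This swaps SQLite in for MySQL, so the wording of the error differs, and
the table is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('bob')")

try:
    # No quotes: SQLite reads the bare word as a column name.
    conn.execute("SELECT * FROM users WHERE name=bob")
    error_message = None
except sqlite3.OperationalError as err:
    error_message = str(err)

# With quotes, the same word is a string literal.
rows = conn.execute("SELECT * FROM users WHERE name='bob'").fetchall()
print(error_message, rows)
```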
== How to apply this conceptual model ==
First, a few quick questions: secure or not?
First thing to point out is that you don't actually know
if this is an HTML document or a text document. Let's say
it's an HTML document. This is secure.
Insecure.
Insecure.
Insecure. As a general rule, an escaping function that's secure
in one context may be insecure in another. The correct function
you should have used was json_encode(). (side-note: under the original
JSON RFC a single string is not a valid JSON document, but you can
abuse it that way)
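A sketch of the difference in Python, assuming the value is headed for a
JavaScript string literal inside a script block. json.dumps() plays the role
of json_encode() here; the helper name and the extra escaping of "<" are
additions for this example, because json.dumps() leaves "</script>" intact.

```python
import html
import json

def js_string_literal(value: str) -> str:
    # Quoted literal with JavaScript-compatible escapes for quotes,
    # backslashes and newlines; "<" is escaped so that "</script>"
    # cannot end the surrounding script block.
    return json.dumps(value).replace("<", "\\u003c")

value = "line1\nline2\\"

# HTML escaping does nothing to the characters that matter in JavaScript.
html_escaped = html.escape(value)

# The JSON form is a complete, quoted literal.
statement = "var name = " + js_string_literal(value) + ";"
print(statement)
```

The HTML-escaped value still contains a raw newline and a trailing
backslash, either of which breaks a hand-quoted JavaScript string.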
Now, what are some practical considerations?
1. You need to know what format your strings are in. If you're dealing
with plaintext, this boils down to "Is this text escaped or not?"
2. You need to know where your string is going. If it's going into
an SQL query, it should be SQL escaped; if it's going into HTML,
it should be HTML escaped.
These questions can be as hard or as easy to answer as you want them to be.
THE HARD WAY: (read fast)
PHP magic quotes automatically escapes your data for insertion
into the database. If you want to do anything to this data, you
un-escape it first, do some processing, then escape it again.
Sometimes you HTML escape it before you put it in the database,
sometimes you don't. When stuff comes out of the database, you
have to remember to HTML escape only the stuff you didn't already
HTML escape, otherwise you'll have double-escaped.
THE EASY WAY:
Turn off PHP magic quotes. Never escape anything until it's
absolutely necessary; that means SQL escaping right before you
put it in a database, and HTML escaping right before you
output it to the web.
Answers to our questions are trivial, then:
1. Unescaped.
2. It should be obvious from the string concatenation right below
what you should be escaping it as.
Important Note: Data in the database should be UNESCAPED. When your
database receives the SQL data, it internally de-SQL-escapes it, and
returns a "pure" version of the data. If this seems confusing, think
about the difference between a browser's regular view, and the source
code view. You only see the HTML escaped version when you view source
(which is interpreting HTML text as plaintext); it looks "normal" otherwise.
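A sketch of the easy way in Python, using the built-in sqlite3 and html
modules. The table and the comment text are made up for the example.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")

comment = "I <3 O'Reilly & co."

# Into the database: plaintext, handed over as a bound parameter.
conn.execute("INSERT INTO comments (body) VALUES (?)", (comment,))

# Out of the database: the same plaintext comes back, nothing to undo.
(stored,) = conn.execute("SELECT body FROM comments").fetchone()

# Into the page: HTML-escape at the moment of output, and not before.
page_fragment = "<p>" + html.escape(stored, quote=False) + "</p>"
print(page_fragment)
```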
Another important note: There's no distinction between user
data or environment variables or internally generated data. If it's
plaintext, it needs to be format-shifted before you use it for something
else. A cautionary tale:
This page
Is XSS-vulnerable, if the page is: index.php?">Whee!
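A sketch of the fix in Python, assuming the page links to itself using the
request URI. The function name and the link markup are hypothetical.

```python
import html

def self_link(request_uri: str) -> str:
    # The request URI is chosen by whoever made the request, so it gets
    # the same format shift as any other plaintext.
    return '<a href="' + html.escape(request_uri) + '">This page</a>'

link = self_link('index.php?">Whee!')
print(link)
```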
== Jack-Of-All-Trades-Escaping ==
Let's look at one common problem under the light of our new
conceptual model.
$safe = htmlspecialchars(mysql_real_escape_string($input));
We used the SQL and HTML escaping functions, so won't this be
"safe" in both contexts? Something odd has happened: our string
was first packaged up in a format suitable for databases... and then
packaged up again in a format suitable for a browser. The only
case when this would work properly is if a browser decoded this
data, and then passed the result directly into an SQL query. Which
is ridiculous.
This is the story of PHP magic quotes, which automatically performs
DB escaping on all input data. This is why you should not use it.
You can imagine the opposite process, however:
$safe = mysql_real_escape_string(htmlspecialchars($input));
Decoded by the database, and then decoded by the browser. But as we've
discussed earlier, there's no guarantee that a browser is going to
be handling the $data: it might be placed in a plaintext variable or
an external API handled in JSON.
Let this be known: MORE does not mean BETTER when it comes to escaping functions.
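A sketch of both orderings in Python. sql_escape() is an illustrative
stand-in for a database escaping function (it only doubles single quotes)
and is not meant for real use.

```python
import html
import json

def sql_escape(s: str) -> str:
    # Stand-in for a real database escaping function.
    return s.replace("'", "''")

name = "O'Reilly"

# SQL first, HTML second: the doubled quote is hidden behind character
# references, so the database never undoes it.
sql_then_html = html.escape(sql_escape(name))
shown_by_browser = html.unescape(sql_then_html)

# HTML first, SQL second: fine for a browser, wrong for everything else.
html_then_sql = sql_escape(html.escape(name))
sent_to_json_api = json.dumps(html_then_sql)

print(shown_by_browser)
print(sent_to_json_api)
```

In the first ordering the browser ends up showing a doubled quote; in the
second, an API consumer receives HTML character references it never asked
for.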
== The "right way": DOM building, SQL bound queries ==
Up until this point, we've been considering escaping strictly in the
context of string concatenation. Now we're going to ask, "Does
string concatenation really make sense?" There are two considerations:
EASE OF USE / COMPLEXITY OF THE FORMAT
If I asked you to make a bitmap file, you wouldn't concatenate
byte sequences to draw the file pixel by pixel. You'd use a high
level function for drawing lines or type-setting text (and you'd
probably convert it to JPEG or PNG before outputting it).
HTML straddles the critical point where it's simple enough for people
to get by using string concatenation, but is complex enough for tools
for generating HTML to be popular (e.g. Frontpage or a form building
library). Why not use a tool to generate ALL your HTML?
MAKE IT REALLY HARD TO "DO THE WRONG THING"
Unescaped string concatenation is unfortunate because it is both wrong
and the simplest way of doing things. Even a veteran programmer may
forget an escaping function if they've done it five hundred times
already today. This is also why bulk-escaping inputs beforehand
is so attractive: that means you don't have to repeat htmlspecialchars()
ad nauseam in the output code.
What if you used something other than string concatenation, which
in its simplest form didn't require you to call the escape function
because it already knows what you did?
SQL - QUERY BUILDER
Instead of:
mysql_query("SELECT * FROM users WHERE name='" . mysql_real_escape_string($name) . "'");
Use:
$sth = $dbh->prepare('SELECT * FROM users WHERE name = ?');
$sth->execute(array($name));
In the second case (for most databases), the value of $name is
never concatenated into the SQL query; it's sent separately in a
data format that you don't have to care about. Since you don't need
to escape anything now, there's no worry about forgetting! Also,
doesn't that code look so much nicer...
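A sketch of the difference in Python with the built-in sqlite3 module
(SQLite rather than MySQL; the table and the payload are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

name = "x' OR '1'='1"

# Concatenation with a forgotten escape: the payload rewrites the WHERE clause.
leaky = conn.execute(
    "SELECT name FROM users WHERE name='" + name + "'"
).fetchall()

# Bound parameter: the payload is compared as one plain string.
bound = conn.execute(
    "SELECT name FROM users WHERE name = ?", (name,)
).fetchall()

print(sorted(leaky), bound)
```

The concatenated query returns every user; the bound query returns none,
because nobody is actually named `x' OR '1'='1`.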
HTML, XML - DOM BUILDER
Instead of:
$html = '<b>' . htmlspecialchars($text) . '</b>';
Use:
$b = $doc->createElement('b');
$b->appendChild($doc->createTextNode($text));
Extra benefit: your code is guaranteed to be well-formed, and you
can easily run a validator on it natively.
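A comparable sketch in Python, using the standard library's ElementTree as
the builder. The sample text is made up.

```python
import xml.etree.ElementTree as ET

text = "<script>alert(1)</script> & friends"

b = ET.Element("b")
b.text = text  # stored as data; escaping happens when serializing

markup = ET.tostring(b, encoding="unicode")
print(markup)

# The output parses back to the same text, so it is well-formed.
reparsed = ET.fromstring(markup)
```

There is no escaping call anywhere in the code, so there is nothing to
forget.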
A sidenote on newlines: Plaintext newlines are not displayed in
HTML. As a result, you've probably used a function like nl2br() to
preserve newlines. With a DOM builder, you probably want to create a
helper function that adds a stream of text nodes and empty <br> tags
as children to an element, using your neighborhood "explode" or "split"
function.
// explode() takes the separator first, then the string
foreach (explode("\n", $text) as $i => $part) {
    if ($i !== 0) $b->appendChild($doc->createElement('br'));
    $b->appendChild($doc->createTextNode($part));
}
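The same helper sketched in Python with xml.dom.minidom, whose API mirrors
the DOM calls above. The function name is hypothetical.

```python
from xml.dom.minidom import Document

def append_text_with_breaks(doc, parent, text):
    """Append text to parent, with a <br> element between lines."""
    for i, part in enumerate(text.split("\n")):
        if i != 0:
            parent.appendChild(doc.createElement("br"))
        parent.appendChild(doc.createTextNode(part))

doc = Document()
p = doc.createElement("p")
append_text_with_breaks(doc, p, "first line\nsecond <line>")

markup = p.toxml()
print(markup)
```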
SHELL CODE - MULTI-ARGUMENT EXEC (PYTHON)
Instead of:
# shlex.quote() is the standard library's shell-escaping function
os.system("stella " + shlex.quote(name))
Use:
subprocess.call(["stella", name])
(kind of bad example)
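A runnable sketch of the multi-argument form. The "stella" program is not
assumed to be installed, so the Python interpreter itself stands in as the
child process and simply echoes back the argument it received.

```python
import subprocess
import sys

name = "stars; echo injected"

# Each list item becomes exactly one argument. No shell ever parses the
# string, so the semicolon has no special meaning.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", name],
    capture_output=True,
    text=True,
    check=True,
)
received = result.stdout.rstrip("\n")
print(received == name)
```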
URLs - URL BUILDER
Instead of:
$url = 'index.php?name=' . urlencode($foo);
Use:
$url = 'index.php?' . http_build_query(array('name' => $foo));
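The Python counterpart of http_build_query() is urllib.parse.urlencode().
A short sketch with a made-up value:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

foo = "Tom & Jerry=friends?"

url = "index.php?" + urlencode({"name": foo})
print(url)

# Parsing the query string recovers the original plaintext.
recovered = parse_qs(urlsplit(url).query)
```

The builder handles both the separators between parameters and the
escaping inside each value.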
== Practical considerations of "the right way" ==
1. Verbosity/Ease of use
Compare:
Welcome .
Here's a link
With (DOM):
$p = $doc->createElement('p');
$p->appendChild($doc->createTextNode('This is a paragraph of text '));
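$em = $doc->createElement('em');
// createTextNode() escapes its argument, including ampersands
$em->appendChild($doc->createTextNode($username));
$p->appendChild($em);
$p->appendChild($doc->createTextNode('. Here\'s a '));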
$a = $doc->createElement('a', 'link');
$a->setAttribute('href', 'http://example.com');
$p->appendChild($a);
The DOM-builder version is substantially longer. There ought to be
a shorter version, but many languages simply don't have it. This is
something someone should write!
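As a rough idea of what a shorter version could look like, here is a sketch
of a tiny helper on top of Python's xml.dom.minidom. The el() function is
hypothetical and not part of any library.

```python
from xml.dom.minidom import Document

def el(doc, tag, attrs=None, *children):
    """Build an element; plain strings become text nodes."""
    node = doc.createElement(tag)
    for key, value in (attrs or {}).items():
        node.setAttribute(key, value)
    for child in children:
        if isinstance(child, str):
            child = doc.createTextNode(child)
        node.appendChild(child)
    return node

doc = Document()
username = "<bob>"

p = el(doc, "p", None,
       "Welcome ",
       el(doc, "em", None, username),
       ". Here is a ",
       el(doc, "a", {"href": "http://example.com"}, "link"))

markup = p.toxml()
print(markup)
```

The nesting of the calls follows the nesting of the markup, which brings
the builder version closer to the length of the concatenated one.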
Another option is XSLT, but somewhere along the line an XML document
still has to be programmatically generated.
One last thing: these tools are very much XML oriented. In order to
get HTML output, a little bit of coaxing is necessary. (Talk to me
later if you want to know more about it)
2. Native or not? (external libraries)
Concatenation is universal. You can use it anywhere, and not have
to worry about dependencies. Some builder APIs are also built into
a language: most languages have SQL bindings by default (obviating the
need for a small wrapper class to emulate bindings).
If a builder API is not native, however, you have to go get one. And
that brings with it all of the pains of an external library (not
going to go into that here). Or you may choose to code one yourself,
a fine example of yak shaving.
3. Performance/Memory issues
Concatenation works great with the input/output model. Once you've
printed that bit of HTML, it gets handed to the output buffer, and
eventually the client; there is no intrinsic need for your application
to store it in memory.
A DOM structure, however, needs to be kept in its entirety in memory.
It as a whole weighs more than a string version of the HTML, because
each node is a full-fledged object and carries the associated overhead.
It needs to be serialized at the end, an extra cost not associated
with concatenation (you could say with concatenation you're serializing
from the very beginning).
These issues can be offset by a robust caching infrastructure (Squid,
for example), even with heavily dynamic content.
== The UTF-8 story ==
Up until now, we've assumed that we've been dealing with text. This
isn't strictly true: I can send null bytes and other nonsensical
byte sequences over HTTP requests. In some cases, this can
lead to security vulnerabilities, but at its basic level, it's just
another form of validation you have to perform on user input--a
username should not contain null bytes.
Why not properly encoding data is bad
- Breaks in strict contexts, like XML
- XSS multibyte attack
To deal with this issue effectively, however, you need to know a
little bit about character encodings.
[basic explanation of what encoding is]
The bottom line is that, whatever encoding you're using (preferably
UTF-8), you should ensure that any text you deal with is in that
encoding. Additionally, you should make sure that characters forbidden
by the HTML specification are stripped out.
How do you do that?
In an 8-bit encoding, that means stripping out bytes:
- 0x00 to 0x1F, excluding 0x09 (the tab) and 0x0A (the newline)--and
possibly 0x0D (carriage return) if you're feeling charitable.
- 0x7F
- If you're using an ISO 8859 charset (usually 8859-1, known as Latin-1),
0x80 to 0x9F
In a Unicode based encoding, that means stripping out:
- U+0000, U+0001 to U+0008, U+000B, U+000E to U+001F, U+007F to
U+009F, U+D800 to U+DFFF, U+FDD0 to U+FDEF, and characters U+FFFE,
U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF,
U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE,
U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF,
U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE,
U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF
Since UTF-8 is variable width, you also have to check whether or not
it's well-formed.
Don't write that down.
Depending on your language, various parts of these steps will have
already been done for you. Many others will not have. For PHP, you
have to do it all yourself. You can find a list of these codepoints
in the HTML 5 specification under the Parser pre-processing
section; for codepoints you 100% must remove, check the XML specification's
Char designation. Any language with Unicode support will normally
check well-formedness automatically; codepoint stripping will usually
have to be done manually.
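A sketch of both steps in Python, where the strict UTF-8 decoder does the
well-formedness check and the codepoint stripping is done by hand. The
function names are hypothetical; the ranges follow the list above.

```python
def is_forbidden(cp: int) -> bool:
    return (
        cp <= 0x08
        or cp == 0x0B
        or 0x0E <= cp <= 0x1F
        or 0x7F <= cp <= 0x9F
        or 0xD800 <= cp <= 0xDFFF
        or 0xFDD0 <= cp <= 0xFDEF
        # U+FFFE and U+FFFF, and their counterparts in every other plane
        or (cp & 0xFFFE) == 0xFFFE
    )

def clean_utf8(raw: bytes) -> str:
    # Strict decoding raises UnicodeDecodeError on malformed input.
    text = raw.decode("utf-8")
    return "".join(ch for ch in text if not is_forbidden(ord(ch)))

print(repr(clean_utf8(b"a\x00b\tc\n")))
```

Malformed input is rejected outright rather than repaired; whether to
reject or substitute a replacement character is a policy choice this
sketch does not settle.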
One last note: htmlentities() versus htmlspecialchars().
htmlentities() never makes sense. If the input encoding is
capable of expressing certain characters, there's no point in
entity-izing them. If the input encoding is not (which is when
you'd want to allow entities of those characters), there's still
no way of entity-izing them, because they're not supported in the
first place. [rewrite this]