PyCon 2015: Serialization formats are not toys

I am on my way back from this year's PyCon in Montreal. You should definitely have a look at the PyCon YouTube channel, where you can find all the talks that were given over the weekend. Too many to choose from? Check out this session by one of my co-workers on how to build a recommendation engine with Python, NumPy and Pandas.

One presentation I want to quickly recap is "Serialization formats are not toys" by @tveastman. You can check out the slides here.

Eastman looks at the security risks of data serialization formats like XML, YAML and JSON. Pretty much all of these problems occur when the parser tries to be too smart for its own good, or when the serialization format itself includes questionable features that are useful for a handful of problems but will also gladly help you shoot yourself in the foot.

YAML

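YAML lets a document do more than describe data. With PyYAML, special tags can tell the loader to construct arbitrary Python objects and call functions. While this might be useful to insert timestamps or something similar into your config files, it can also be used to call system commands on the server, which means an attacker can execute arbitrary code. When dealing with YAML coming from untrusted sources, make sure to use the safer yaml.safe_load function, which disables these features, instead of the default yaml.load. Newer PyYAML releases warn about or refuse a yaml.load call without an explicit Loader, but the advice stays the same.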
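To make the difference concrete, here is a minimal sketch using PyYAML. The two documents and the variable names are made up for illustration, and os.getpid() is only a harmless stand-in for whatever command an attacker would really want to run.

```python
import os

import yaml  # PyYAML, a third-party package

# An ordinary config document: plain mappings, strings and numbers.
config_text = "name: demo\nworkers: 4\n"

# A hostile document (hypothetical). The tag asks the loader to call
# os.getpid(), a harmless stand-in for something like os.system(...).
hostile_text = "!!python/object/apply:os.getpid []"

# safe_load only builds basic types (dict, list, str, int, ...).
config = yaml.safe_load(config_text)

# safe_load refuses the Python-specific tag instead of calling anything.
try:
    yaml.safe_load(hostile_text)
    rejected = False
except yaml.YAMLError:
    rejected = True

# The full-featured loader runs the call embedded in the document.
result = yaml.load(hostile_text, Loader=yaml.Loader)
```

With the safe loader the hostile document ends in an exception. With the full loader, simply loading the document is enough to execute the call.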

XML

XML, being an inherently complex markup format, also has some built-in features that can be used for attacks. The Billion Laughs attack, for example, is a very simple and straightforward way to DoS a server that accepts and parses XML. Another attack vector is XML entities that point to files or URLs:

<!ENTITY password SYSTEM "file:///etc/passwd">

This entity pulls the system's password file into the XML document. Reading the file works because the parser runs with the permissions of the server process, and all too often that process runs as root. If the document is malformed and rejected, a common way to handle the situation is to send the original document back to the client. The document sent back will now include the contents of the system's password file. A similar attack works with URLs instead of file paths, giving the attacker access to resources on other servers inside the internal network.

When handling XML coming from untrusted sources, you should dumb down your parser as much as possible. Instead of using lxml, consider using defusedxml, which allows you to disable all the unnecessary features that increase your attack surface.
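To illustrate what "dumbing down" can look like, here is a rough sketch that uses only the standard library and rejects any document carrying a DOCTYPE, which rules out entity tricks altogether. This is not defusedxml's API; parse_untrusted and DTDForbidden are names invented for this example, and for real code a maintained package like defusedxml is the better choice.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DTDForbidden(ValueError):
    """Raised when a document carries a DOCTYPE declaration."""


def parse_untrusted(xml_text):
    """Parse XML, but refuse any document that declares a DTD."""

    def reject_doctype(name, system_id, public_id, has_internal_subset):
        raise DTDForbidden(f"DOCTYPE {name!r} is not allowed")

    # First pass: a bare expat parser that aborts at the first DOCTYPE.
    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = reject_doctype
    checker.Parse(xml_text, True)

    # Second pass: without a DTD the document cannot define entities.
    return ET.fromstring(xml_text)


hostile = (
    '<!DOCTYPE doc [<!ENTITY password SYSTEM "file:///etc/passwd">]>'
    "<doc>&password;</doc>"
)

try:
    parse_untrusted(hostile)
    blocked = False
except DTDForbidden:
    blocked = True

# A plain document without a DTD parses as usual.
harmless = parse_untrusted("<doc><user>alice</user></doc>")
```

Parsing twice is wasteful, but it keeps the sketch short. The point is that the hostile document is turned away before any entity is ever looked at.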

JSON

JSON is pretty safe when used with a decent parser. You will be fine as long as you do not use eval(). At this point it is worth mentioning that the same principle applies to marshal and pickle, which should never be used with untrusted input.
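A small sketch of both points, using only the standard library. The payload and the Exploit class are made up for illustration, and os.getpid() again stands in for something far less friendly.

```python
import json
import os
import pickle

# Something an attacker might send where JSON is expected (hypothetical).
payload = '__import__("os").getpid()'

# A real JSON parser only accepts data and rejects the payload.
try:
    json.loads(payload)
    json_rejected = False
except json.JSONDecodeError:
    json_rejected = True

# eval() treats the same string as Python code and runs it.
eval_result = eval(payload)


class Exploit:
    """Tells pickle to call os.getpid() when the data is loaded."""

    def __reduce__(self):
        return (os.getpid, ())


blob = pickle.dumps(Exploit())

# Loading the bytes calls the function chosen by whoever built them.
pickle_result = pickle.loads(blob)
```

Neither eval() nor pickle.loads() hands back harmless data here. Both return the result of a function call that the sender picked.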

Nicolas Neu

dies & ditt.
