pretty_print on a tree parsed with StringIO

Asked by Schnito

Hello everybody,

I'm seeing something strange with pretty_print and StringIO.
If I read a file first and then parse its contents through StringIO: tree = lxml.html.parse(StringIO(value))

and then call the tostring method like this: lxml.html.tostring(tree, pretty_print=True, include_meta_content_type=False,
             encoding=None, method="xml")

==> the result is not pretty-printed.

Whereas if I parse the file directly, like this: tree = lxml.html.parse(file)
==> the result of tostring is pretty-printed.

Do you have any idea why?

Thanks a lot !

Question information

Language: English
Status: Answered
For: lxml
Assignee: No assignee
Schnito (schnito1) said :
#1

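Here is the test code:

import lxml.html
from StringIO import StringIO  # Python 2 module, as used elsewhere in this thread

# Read the whole file into a string first
with open('data.html', 'r') as f:
    value = f.read()

# Parse from the in-memory string rather than from the file itself
tree = lxml.html.parse(StringIO(value))

# Serialise with pretty_print and write the result out
with open('TestPrettyPrintedoutputValue.html', 'w') as out:
    out.write(lxml.html.tostring(tree, pretty_print=True, include_meta_content_type=False,
                                 encoding=None, method="xml"))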

Schnito (schnito1) said :
#2

OK, after checking, this is not a StringIO problem (that would have been strange).

But my tree is not pretty-printed when I parse this file:

<html>
    <head>
    </head>
    <body>
        <div><span> ceci est le span du 1ier div </span><p id="p1">test test</p></div><div><span>ceci est le span du 2ieme div</span></div>
    </body>
</html>

scoder (scoder) said :
#3

Yes, libxml2's serialiser isn't always smart enough to figure out a good-looking indentation.

http://codespeak.net/lxml/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
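
As a rough illustration of what that FAQ entry describes: the serialiser will not add indentation where the tree already contains whitespace text, so one option is to drop whitespace-only text nodes at parse time with the parser's remove_blank_text option. The sketch below assumes Python 3, a current lxml, and that the document is well-formed enough for the XML parser; the SOURCE and pretty names are only for illustration.

```python
from lxml import etree

# The document from this thread (it happens to be well-formed XML).
SOURCE = """<html>
    <head>
    </head>
    <body>
        <div><span> ceci est le span du 1ier div </span><p id="p1">test test</p></div><div><span>ceci est le span du 2ieme div</span></div>
    </body>
</html>"""

# Ask the parser to discard ignorable whitespace-only text between elements,
# so the serialiser is free to insert its own indentation.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(SOURCE, parser)

# encoding="unicode" makes tostring() return text instead of bytes.
pretty = etree.tostring(root, pretty_print=True, encoding="unicode")
print(pretty)
```

Note that throwing away whitespace can change how inline HTML renders, so this is not always safe. Parsed without that option, the whitespace-only text nodes stay in the tree and the serialiser leaves the existing layout alone.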

I get this, for example:

>>> print et.tostring(h)
<html>
    <head>
    </head>
    <body>
        <div><span> ceci est le span du 1ier div </span><p id="p1">test test</p></div><div><span>ceci est le span du 2ieme div</span></div>
    </body>
</html>
>>> print et.tostring(h, pretty_print=True),
<html>
    <head>
    </head>
    <body>
        <div><span> ceci est le span du 1ier div </span><p id="p1">test test</p></div><div><span>ceci est le span du 2ieme div</span></div>
    </body>
</html>
>>> print et.tostring(h, pretty_print=True, method="html"),
<html>
    <head>
    </head>
    <body>
        <div>
<span> ceci est le span du 1ier div </span><p id="p1">test test</p>
</div>
<div><span>ceci est le span du 2ieme div</span></div>
    </body>
</html>

Close, but not quite perfect. You can certainly do better with hand-crafted code. The indent() function on the page below should get you started:

http://effbot.org/zone/element-lib.htm
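
A minimal sketch in the spirit of that indent() helper, not a copy of it: it walks the tree and rewrites whitespace-only text and tail values so each child element starts on its own indented line. It assumes Python 3 and uses the standard library's xml.etree.ElementTree so that it runs without lxml; lxml elements expose the same text/tail interface, so the same function should work there too. The step parameter and the SOURCE and result names are illustrative.

```python
import xml.etree.ElementTree as ET


def indent(elem, level=0, step="    "):
    """Indent elem in place by rewriting whitespace-only text and tails."""
    pad = "\n" + level * step
    if len(elem):
        # Whitespace between our opening tag and the first child.
        if not elem.text or not elem.text.strip():
            elem.text = pad + step
        for child in elem:
            indent(child, level + 1, step)
        # The last child's tail sits just before our closing tag,
        # so it gets our own indentation level.
        if not child.tail or not child.tail.strip():
            child.tail = pad
    # Whitespace after our closing tag (never touched for the root).
    if level and (not elem.tail or not elem.tail.strip()):
        elem.tail = pad


# The body of the document from this thread, on a single line.
SOURCE = (
    '<body><div><span> ceci est le span du 1ier div </span>'
    '<p id="p1">test test</p></div>'
    '<div><span>ceci est le span du 2ieme div</span></div></body>'
)

root = ET.fromstring(SOURCE)
indent(root)
result = ET.tostring(root, encoding="unicode")
print(result)
```

Text that contains anything other than whitespace is left untouched, so the leading and trailing spaces inside the first span survive. In mixed content the added newlines can still change rendering, so treat this as a starting point. Python 3.9 and later also ship a built-in ET.indent() with a similar effect.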
