
Thread: DreamMessage and UTF

  1. #1
    Join Date
    Jun 2009
    Posts
    69

    DreamMessage and UTF

    I've been playing around with the MediaWiki Converter and finally came to the root of an issue I've been having for a while now.

    When I send a page with certain UTF-8 characters such as “, it gets garbled somewhere. I thought it was happening in the bridge, but when I loaded up Fiddler and watched what was being passed, everything comes in and goes out just fine. It's sent as %E2%80%9C in the POST data and returns as the actual UTF-8 bytes 0xE2 0x80 0x9C. A quick test with curl confirms this.

    However, when I look at the DreamMessage that is returned from Post() it shows up as

    â??

    Which, after the ToDocument() comes out as

    â??
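
    This symptom is consistent with the three UTF-8 bytes being decoded under a single-byte charset. Below is a minimal sketch of that effect using only System.Text; `MojibakeDemo` and its members are hypothetical names for illustration and are not part of Dream.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // UTF-8 encoding of U+201C (left double quotation mark), as seen on the wire.
    public static readonly byte[] LeftQuoteUtf8 = { 0xE2, 0x80, 0x9C };

    // Decode the same bytes under a caller-chosen charset.
    public static string Decode(byte[] body, Encoding charset)
    {
        return charset.GetString(body);
    }

    // Render a string as U+XXXX values so invisible characters show up.
    public static string Describe(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.AppendFormat("U+{0:X4}", (int)c);
        }
        return sb.ToString();
    }
}
```

    Decoding those bytes with Encoding.UTF8 gives the single character U+201C. Encoding.ASCII gives "???", and ISO-8859-1 gives "â" followed by two invisible control characters, which many viewers also show as "?".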

    I did find a brute-force workaround, which is to pass the text through a function that just does a bunch of string.Replace calls for all of these characters (see http://intertwingly.net/stories/2004/04/14/i18n.html). The trick is to send each character as its HTML entity. What's even better is that when the entities reach the XDoc, it kindly restores them to their UTF glory.
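
    A variation on that workaround, sketched here rather than taken from the actual code: instead of a table of named entities, every non-ASCII character is replaced with a numeric character reference, so no per-character list is needed. `EntityEscaper` is a hypothetical name, and whether the converter treats numeric references the same way as the named ones is an assumption.

```csharp
using System.Text;

public static class EntityEscaper
{
    // Replace every non-ASCII character with a numeric character reference,
    // e.g. U+201C becomes &#8220;. ASCII characters pass through unchanged.
    // Throws ArgumentException if the input contains an unpaired surrogate.
    public static string EscapeNonAscii(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] < 128)
            {
                sb.Append(text[i]);
                continue;
            }
            // ConvertToUtf32 combines a surrogate pair into one code point.
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsHighSurrogate(text[i])) i++; // skip the low surrogate
            sb.Append("&#").Append(codePoint).Append(';');
        }
        return sb.ToString();
    }
}
```

    The result is pure ASCII, so it should survive a decoder that only handles ASCII.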

    However, I can see this may come up later as an issue for someone else.

    Also, I'm writing a guide to customizing the MW conversion process now that I've actually done it myself. It may be a good while until it's ready though.

  2. #2
    Join Date
    Jul 2006
    Location
    San Diego, CA
    Posts
    5,450

    That's a good start to get to the bottom of this issue. Can you capture the request that you send into the bridge and the response? I'm especially interested in the response headers. If we can reproduce the issue, we should be able to address it.

    A converter guide would be AWESOME!!! Please do it!
    Steve G. Bjorg - Chief Architect
    Did you check the MindTouch FAQ?
    Found a bug? Report it.
    Follow me on Twitter
    Find us on IRC: irc.freenode.net #mindtouch

  3. #3
    Join Date
    Jun 2009
    Posts
    69

    Here's a quick proof of concept I built. I'm using Dream v1.7.1.16671.
    Code:
    using System;
    using System.Collections.Generic;
    using System.Text;
    
    using MindTouch.Dream;
    
    namespace dreammessage_utf
    {
        class Program
        {
            static void Main(string[] args)
            {
                string text = "“€2”";
                string htmlText = "&ldquo;&euro;2&rdquo;";
    
                Plug p = Plug.New("http://mwc.mindtouch.com/");
                DreamMessage plain = p.With("title", "Test").With("text", text ).Post();
                DreamMessage escaped = p.With("title", "Test").With("text", htmlText).Post();
    
                XDoc plainDoc = plain.ToDocument();
                XDoc escapedDoc = escaped.ToDocument();
            }
        }
    }
    I sent the data to the public bridge, but if you look at the Fiddler data below, you'll see the issue is in the DreamMessage.

    This is the conversation from "plain"
    Code:
    POST /?title=Test&text=%e2%80%9c%e2%82%ac2%e2%80%9d HTTP/1.0
    Content-Type: text/plain; charset=us-ascii
    User-Agent: Dream/1.7.1.16671
    Host: mwc.mindtouch.com
    Content-Length: 0
    
    HTTP/1.1 200 OK
    Date: Wed, 02 Dec 2009 20:32:17 GMT
    Server: Apache/2.2.6 (Debian) PHP/5.2.6-1+lenny3 with Suhosin-Patch mod_ssl/2.2.6 OpenSSL/0.9.8g
    X-Powered-By: PHP/5.2.6-1+lenny3
    Vary: Accept-Encoding
    Content-Length: 105
    Connection: close
    Content-Type: text/html
    
    <html xmlns:mediawiki="#mediawiki"><head><title>Test</title></head><body><p>“€2”
    </p></body></html>
    And here's from "escaped"
    Code:
    POST /?title=Test&text=%26ldquo;%26euro;2%26rdquo; HTTP/1.0
    Content-Type: text/plain; charset=us-ascii
    User-Agent: Dream/1.7.1.16671
    Host: mwc.mindtouch.com
    Content-Length: 0
    
    HTTP/1.1 200 OK
    Date: Wed, 02 Dec 2009 20:32:20 GMT
    Server: Apache/2.2.6 (Debian) PHP/5.2.6-1+lenny3 with Suhosin-Patch mod_ssl/2.2.6 OpenSSL/0.9.8g
    X-Powered-By: PHP/5.2.6-1+lenny3
    Vary: Accept-Encoding
    Content-Length: 116
    Connection: close
    Content-Type: text/html
    
    <html xmlns:mediawiki="#mediawiki"><head><title>Test</title></head><body><p>&ldquo;&euro;2&rdquo;
    </p></body></html>
    Here are the debugger values from Visual Studio for the DreamMessages
    Code:
    plain	{
    <message>
    	<status>200</status>
    	<headers>
    		<Vary>Accept-Encoding</Vary>
    		<Connection>close</Connection>
    		<Content-Length>105</Content-Length>
    		<Content-Type>text/html</Content-Type>
    		<Date>Wed, 02 Dec 2009 20:32:17 GMT</Date>
    		<Server>Apache/2.2.6 (Debian) PHP/5.2.6-1+lenny3 with Suhosin-Patch mod_ssl/2.2.6 OpenSSL/0.9.8g</Server>
    		<X-Powered-By>PHP/5.2.6-1+lenny3</X-Powered-By>
    	</headers>
    	<body format="text">
    &lt;html xmlns:mediawiki=&quot;#mediawiki&quot;&gt;&lt;head&gt;&lt;title&gt;Test&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;p&gt;??????2???
    &lt;/p&gt;&lt;/body&gt;&lt;/html&gt;
    	</body>
    </message>
    }	MindTouch.Dream.DreamMessage
    Code:
    escaped	{
    <message>
    	<status>200</status>
    	<headers>
    		<Vary>Accept-Encoding</Vary>
    		<Connection>close</Connection>
    		<Content-Length>116</Content-Length>
    		<Content-Type>text/html</Content-Type>
    		<Date>Wed, 02 Dec 2009 20:32:20 GMT</Date>
    		<Server>Apache/2.2.6 (Debian) PHP/5.2.6-1+lenny3 with Suhosin-Patch mod_ssl/2.2.6 OpenSSL/0.9.8g</Server>
    		<X-Powered-By>PHP/5.2.6-1+lenny3</X-Powered-By>
    	</headers>
    	<body format="text">
    &lt;html xmlns:mediawiki=&quot;#mediawiki&quot;&gt;&lt;head&gt;&lt;title&gt;Test&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;p&gt;&amp;ldquo;&amp;euro;2&amp;rdquo;
    &lt;/p&gt;&lt;/body&gt;&lt;/html&gt;
    	</body>
    </message>
    }	MindTouch.Dream.DreamMessage
    And the XDocs...
    Code:
    plainDoc	{
    <html xmlns:mediawiki="#mediawiki">
    	<head>
    		<title>Test</title>
    	</head>
    	<body>
    		<p>??????2???</p>
    	</body>
    </html>
    }	MindTouch.Dream.XDoc
    Code:
    escapedDoc	{
    <html xmlns:mediawiki="#mediawiki">
    	<head>
    		<title>Test</title>
    	</head>
    	<body>
    		<p>“€2”</p>
    	</body>
    </html>
    }	MindTouch.Dream.XDoc
    Hope this helps. I'll definitely keep working on the converter guide, but it's a massive topic. You guys did a lot of work (and I see a lot of comments starting with SteveB).

  4. #4
    Join Date
    Jul 2006
    Location
    San Diego, CA
    Posts
    5,450

    Thanks a bunch! That made it really easy to track down.

    The problem was that the response only had 'text/html' as content type, but it needed to have 'text/html; charset=utf-8'. I have a local fix for it that will be checked in shortly. In case you have a local copy of the converter, you'll need to apply the following patch:

    Code:
    Index: index.php
    ===================================================================
    --- index.php	(revision 17161)
    +++ index.php	(working copy)
    @@ -137,6 +137,9 @@
     // parse request text
     $result = $p->parse($text, $wgTitle, new ParserOptions);
     
    +// set content-type for response
    +header( "Content-type: text/html; charset=utf-8" );
    +
     // emit converted request text
     echo('<html xmlns:mediawiki="#mediawiki">');
     echo('<head>');
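
    To illustrate why the header matters on the client side, here is a minimal sketch of picking a decoder from the Content-Type value. `CharsetPicker` is a hypothetical helper, not Dream code, and the fallback encoding is passed in explicitly because what Dream itself falls back to is an assumption here.

```csharp
using System.Net.Mime;
using System.Text;

public static class CharsetPicker
{
    // Pick a decoder from a Content-Type header value.
    // When the header carries no charset parameter, the caller's fallback is used.
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        string charset = new ContentType(contentType).CharSet;
        return string.IsNullOrEmpty(charset) ? fallback : Encoding.GetEncoding(charset);
    }
}
```

    With an ASCII fallback, 'text/html' selects US-ASCII and the bytes 0xE2 0x80 0x9C decode to "???", while 'text/html; charset=utf-8' selects UTF-8 and the same bytes decode to “.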
    Steve G. Bjorg - Chief Architect

  5. #5
    Join Date
    Jun 2009
    Posts
    69

    That did it! Thanks.
