Lucene and Japanese
Hello - I got Deki Wiki working today on a Linode running Ubuntu 7.10. I first tried it on a VPS running CentOS, but that was a time-consuming disaster. I then provisioned a Linode and loaded everything without too much trouble, with two exceptions: php5-mcrypt was only available from the universe repository, and I needed to add a line to php.conf to stop the browser from asking whether I want to download index.php.
At any rate, the problem I hope you can help with is that Lucene does not seem to be able to index documents containing Japanese. I have tested only PDFs so far, and I have all the requisite packages installed.
Here is what is happening -
* View a PDF with a Japanese filename and some Japanese content on the local OS - the filename and file content are well-formed Japanese.
* Upload the file with the Japanese filename - the filename displays correctly on the Deki Wiki page as well-formed Japanese.
* Look at /var/www/deki-hayes/attachments/32 - it has a few files under it, but the Japanese filenames show up as ?????????? marks. Is this a function of Ubuntu or of my terminal?
* Add some Japanese to an existing page, like special:Admin - it does not show up in search results for quite a while.
* Add some Japanese to a NEW page - it shows up as a search result very quickly.
* Cannot search on the Japanese contents of an uploaded PDF - no search results are returned.
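On the ?????????? question, one way to tell the terminal apart from the filesystem is to look at the raw bytes of the names instead of how they are displayed. This is only a rough Python sketch: the helper names describe_name and scan are made up for illustration, and it assumes the names should be stored as UTF-8, which I have not confirmed for Deki Wiki.

```python
import os


def describe_name(raw: bytes) -> str:
    """Classify one raw filename as returned by the filesystem.

    Hypothetical helper; it assumes names are expected to be UTF-8.
    """
    # Literal 0x3F bytes mean the name was already replaced with '?'
    # when it was written, so the terminal is not at fault.
    if b"?" in raw:
        return "literal-question-marks"
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes are present but in some other encoding (e.g. Shift_JIS).
        return "not-utf8"
    # Valid UTF-8 with non-ASCII characters: the name is intact on disk,
    # so '?' on screen would point at the terminal or locale settings.
    if any(ord(ch) > 127 for ch in text):
        return "utf8-non-ascii"
    return "ascii"


def scan(directory: str):
    """Return (raw_name, classification) for every entry in a directory."""
    # Passing a bytes path makes os.listdir return undecoded bytes.
    return [(raw, describe_name(raw)) for raw in os.listdir(os.fsencode(directory))]
```

Calling scan("/var/www/deki-hayes/attachments/32") and printing the result would show which case applies: "utf8-non-ascii" suggests a display issue only, while "literal-question-marks" suggests the names were damaged when the files were saved.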
This behavior can be observed on both the current wik.is and my fresh install.
How can I make sure the various filter programs are working with Japanese, and how can I ensure Lucene will, too?
Please advise. I would like to head off any negativity that could hurt the momentum we have gained in Japan, and find a workaround for this if one is possible.
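To narrow down whether the filter or Lucene is at fault, I could capture the text a filter produces for a Japanese PDF and check it before it ever reaches the index. The sketch below is an assumption-laden Python example: I do not know which filter program Deki Wiki actually runs or what encoding it emits, so it simply takes the captured bytes, assumes UTF-8 is expected, and counts Japanese code points. The function names are hypothetical.

```python
# Unicode blocks used as a rough test for Japanese text:
# Hiragana, Katakana, and CJK Unified Ideographs (kanji).
JAPANESE_RANGES = ((0x3040, 0x309F), (0x30A0, 0x30FF), (0x4E00, 0x9FFF))


def count_japanese(text: str) -> int:
    """Count characters that fall inside the ranges above."""
    return sum(
        1
        for ch in text
        if any(lo <= ord(ch) <= hi for lo, hi in JAPANESE_RANGES)
    )


def check_extracted_bytes(data: bytes) -> str:
    """Classify filter output captured as raw bytes.

    Assumes the indexer expects UTF-8; that is a guess, not a
    documented Deki Wiki or Lucene requirement.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"
    return "japanese-present" if count_japanese(text) > 0 else "no-japanese"
```

If the captured output comes back as "japanese-present", the filter is probably fine and the problem is more likely on the indexing or query side; "not-utf8" or "no-japanese" would point at the filter step instead.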
Thanks,
Rick
Rick Cogley
Tokyo, Japan