Lucene and Japanese

# 1 11-24-2007, 12:45 PM
RickCogley » Community Member
Join Date: Nov 2007 Posts: 110
Hello - I got Deki Wiki working today on a Linode running Ubuntu 7.10. I first tried it on a VPS with CentOS, but that was a time-consuming disaster. Then I provisioned a Linode and loaded everything without too much trouble, with two exceptions: php5-mcrypt was available only from the universe repository, and I needed to add a line to php.conf to stop the browser from asking whether I wanted to download index.php.

At any rate, the problem I hope you can help with is that Lucene does not seem to be able to index documents containing Japanese. I have tested only PDFs so far, and I have all the requisite packages installed.

Here is what is happening -

* View PDF with Japanese filename and some Japanese content on local OS - filename and file content are well-formed Japanese

* Upload Japanese filename - file displays correctly in Deki Wiki page - well-formed Japanese

* Look at /var/www/deki-hayes/attachments/32 - this has a few files under it, but the Japanese filenames show up as ?????????? marks. Is this a function of Ubuntu or of my terminal?

* Add some Japanese to an existing page, like special:Admin - it does not show up in search for quite a while.

* Add some Japanese to a NEW page, and it comes out as a search result very quickly.

* CANNOT search on the Japanese contents of an uploaded PDF. No search results are returned.
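
On the ?????????? filenames in the list above: one way to tell a terminal or locale display problem from names that were actually damaged on disk is to look at the raw bytes. The following is only a sketch; the function name raw_names is made up for illustration and is not part of Deki Wiki.

```shell
#!/bin/sh
# raw_names DIR: print each entry name in DIR, followed by its bytes in hex.
# (Hypothetical helper for illustration; not shipped with Deki Wiki.)
raw_names() {
    for path in "$1"/*; do
        [ -e "$path" ] || continue          # skip the unexpanded glob in an empty directory
        name=${path##*/}                    # strip the directory part
        printf '%s\n' "$name"               # the name as the terminal renders it
        printf '%s' "$name" | od -An -tx1   # the same name as raw hex bytes
    done
}
```

Running `locale` and then `raw_names /var/www/deki-hayes/attachments/32` would show what is stored. Multi-byte sequences such as e6 97 a5 mean the names are intact UTF-8 and only the terminal or locale is failing to display them; literal 3f bytes mean real question marks were written to disk.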

This behavior can be observed in both the current wik.is, and my fresh install.

How can I make sure the various filter programs work with Japanese, and how can I ensure Lucene will, too?

Please advise. I would like to preempt any negativity with the momentum we have gained in Japan, and try to find a workaround for this if it is possible.

Thanks,
Rick
Rick Cogley
Tokyo, Japan
# 2 11-24-2007, 01:00 PM
RickCogley » Community Member
Join Date: Nov 2007 Posts: 110
Looks like XPDF is available in Japanese.

ftp://ftp.foolabs.com/pub/xpdf/xpdf-japanese.tar.gz

I wonder if this could be made to work?
Rick Cogley
Tokyo, Japan
# 3 11-24-2007, 11:06 PM
RickCogley » Community Member
Join Date: Nov 2007 Posts: 110
I created some files and did some testing here:

http://wiki.opengarden.org/index.php...82%B9%E3%83%88

With the default installation, Japanese search of wiki page content and of PowerPoint content succeeds. Filename searches always succeed as well, no matter what the file type.

Word, Excel, and PDF content searches all fail.
Rick Cogley
Tokyo, Japan
# 4 11-25-2007, 06:24 PM
SteveB » MindTouch Team
Join Date: Jul 2006 Location: San Diego, CA Posts: 4,949
Indexing works as follows: we use an external filter app/script to convert the binary file to text (UTF-8), and we then index the resulting text with Lucene.

To test the various filter apps (or add new ones), have a look at the mindtouch.deki.startup.xml file, which lists all filters under <indexer>.

Log into your installation and run the filter app, like so:
/opt/deki/bin/filters/pdf2text < FILE.PDF > FILE.TXT

Now check out FILE.TXT and make sure it contains the text contents of the file in UTF-8 encoding.
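
The UTF-8 check described above can also be done mechanically rather than by eye. This is a rough sketch, not part of Deki Wiki: the helper names is_utf8 and has_non_ascii are made up here, and it assumes the standard iconv, tr and wc utilities are installed.

```shell
#!/bin/sh
# is_utf8 FILE: succeed only if FILE decodes cleanly as UTF-8.
# iconv exits non-zero when it meets an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# has_non_ascii FILE: succeed if FILE contains any byte above 0x7f.
# An empty or ASCII-only file is technically valid UTF-8, but for a
# Japanese document it would mean the filter dropped the Japanese text.
has_non_ascii() {
    [ $(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c) -gt 0 ]
}
```

For example: `is_utf8 FILE.TXT && has_non_ascii FILE.TXT && echo "looks like usable UTF-8"`. If the first check fails, the filter is emitting some other encoding; if only the second fails, the Japanese text is being lost before it reaches the indexer.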
Steve G. Bjorg - Chief Architect
Did you check the MindTouch Deki FAQ?
Found a bug? Report it.
Follow me on Twitter
Find us on IRC: irc.freenode.net #mindtouch
# 5 11-25-2007, 11:11 PM
RickCogley » Community Member
Join Date: Nov 2007 Posts: 110
Thanks, Steve. I get an "Unknown character collection" error when I try it. What can I try next?

root@fire:/var/www/deki-hayes/bin/filters # ./pdf2text < /var/www/deki-hayes/attachments/33/Deki\ Search\ Test\ Japanese.pdf > jppdf.txt
Error: Unknown character collection 'Adobe-Japan1'
root@fire:/var/www/deki-hayes/bin/filters #
Rick Cogley
Tokyo, Japan
# 6 11-25-2007, 11:14 PM
RickCogley » Community Member
Join Date: Nov 2007 Posts: 110
Also, Steve, while I am asking, if you do not mind: if I wanted to use a different filter, say one localized for Japanese, and that filter needed command-line arguments, where would I specify them?

Are the files in your bin/filters directory copies of the binaries that get installed when you install the packages?
Rick Cogley
Tokyo, Japan
# 7 11-25-2007, 11:21 PM
SteveB » MindTouch Team
Join Date: Jul 2006 Location: San Diego, CA Posts: 4,949
It appears that our pdf2text filter is not compatible with Japanese characters. That sucks.

The files under bin/filters are mostly scripts (except for the jxl.jar file, which appears to be used by xsl2text).

You can edit pdf2text and see how it works. The key will be to find something that can extract a text version from a PDF so that it can be indexed.

By the way, do you have to run under Linux, or can you also run under Windows? I'm asking b/c we have a new adapter for Windows that uses ifilters and might not have the issue.
Steve G. Bjorg - Chief Architect
# 8 11-26-2007, 12:35 AM
RickCogley » Community Member
Join Date: Nov 2007 Posts: 110
Hi Steve - yes, true. Well, this is in conjunction with the translation effort. All that work will be for nought if the basics don't work, eh?

It looks like xpdf has PDF text extraction and Japanese language files.

http://www.foolabs.com/xpdf/download.html

Besides PDFs, Word and Excel files don't index either with the default install (PowerPoint works). Anyway, step by step.

For us, Windows is not an option at this time due to some firewall constraints. I have ours set up on a Linode running Ubuntu, so hopefully I can get most of this stuff working.
Rick Cogley
Tokyo, Japan
# 9 11-26-2007, 08:37 AM
RickCogley » Community Member
Join Date: Nov 2007 Posts: 110
pdftohtml is based on xpdf, I found out. Xpdf has a Japanese language package called xpdf-japanese, which on Ubuntu and Debian is available via the multiverse. Xpdf can read a resource file called xpdfrc that specifies fonts and encodings, and it consults this file when it runs. Xpdf also comes with a utility called pdftotext. That said, the pdftohtml used here comes from poppler-utils, so apparently it is not running xpdf proper and ignores the xpdf resource file.

After I installed xpdf-japanese, I could get pdftotext to write Japanese output to a text file. Your pdf2text script runs the PDFs through pdftohtml, then html2text, then through sed to strip some stuff. I have not yet tried swapping it out, but I am getting clean output from pdftotext.

My question is: are all the acrobatics (pushing the HTML to text and then through sed) essential to Deki Wiki's functioning? Or can I just try feeding a Japanese text file produced by pdftotext to the indexer?

root@fire:/rcutils # cat /var/www/deki-hayes/bin/filters/pdf2text
#!/bin/sh
# save stdin to a file since pdftohtml doesn't work on streams
TEMP=`mktemp`
dd of=$TEMP 2> /dev/null
pdftohtml -stdout -i -noframes -enc UTF-8 "$TEMP" | html2text -nobs - - | sed '/^[\=]\+/ d' |sed '$d' | sed '1d' | sed '/^$/d' # trim first, last and blank lines
rm $TEMP
Rick Cogley
Tokyo, Japan
# 10 11-26-2007, 08:52 AM
SteveB » MindTouch Team
Join Date: Jul 2006 Location: San Diego, CA Posts: 4,949
Rick,

Great work!

While I didn't write the pdf2text script, I'm pretty sure the acrobatics aren't necessary, but were just the means to an end. Just replace the pdf2text script with yours and let me know how it goes. Thanks!
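
For reference, a replacement filter along those lines might look roughly like the sketch below. It is untested against Deki Wiki and rests on assumptions: the function name pdf_filter is made up, and it presumes a pdftotext on the PATH that has Japanese support (for example, xpdf with xpdf-japanese installed, as described earlier in the thread). Like the original script, it saves stdin to a temporary file first.

```shell
#!/bin/sh
# Sketch of a simpler bin/filters/pdf2text: PDF on stdin, UTF-8 text on stdout.
# pdf_filter is a hypothetical name; pdftotext must be installed separately.
pdf_filter() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"                      # save stdin, as the original script does
    pdftotext -enc UTF-8 "$tmp" -     # "-" as the output file writes to stdout
    status=$?
    rm -f "$tmp"                      # clean up even when extraction fails
    return $status
}
```

To use it as the filter, the script file would end with a bare `pdf_filter` call, and could then be tested the same way as before: `./pdf2text < FILE.PDF > FILE.TXT`. The pdftohtml, html2text and sed stages are dropped on the assumption that pdftotext output is already clean enough to index.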
Steve G. Bjorg - Chief Architect