Searching Portuguese strings not working



denakitan
06-18-2007, 01:01 PM
Hi!

I'm a Brazilian user of Dekiwiki, running it on a Windows XP machine and hoping to migrate to a Linux box soon. In Portuguese, we use characters such as 'á', 'ç', 'ê'. When I try to search for words containing these characters, Dekiwiki doesn't return any results. Does anybody know what I have to do to make it work?

Thanks,
Dennis

PeteE
06-18-2007, 03:23 PM
Lucene uses a "Tokenizer" to break strings into words and index those words.

The standard tokenizer is built specifically for the English language. I searched the Lucene source code and the Lucene mailing list for a Portuguese analyzer but wasn't able to find an implementation.

To implement a Portuguese analyzer, at the very least we would need a set of stop words. Stop words are common words that shouldn't be indexed. For example, in English, we use stop words like "a", "an", "and", "are", "as", "at", "be", "but", and "by".

If you could provide a list of stop words, I could try to put together an analyzer for you. I'd need some help testing, though, since I don't know Portuguese! :)
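
To make the tokenizer and stop-word idea concrete, here is a minimal Python sketch (not Lucene code and not DekiWiki's actual analyzer). The stop-word set is a small illustrative handful of common Portuguese function words, not a complete or official list, and the function names are hypothetical.

```python
import re

# Illustrative only: a few common Portuguese function words,
# NOT a complete or official stop-word list.
STOP_WORDS = {"a", "o", "as", "os", "de", "do", "da", "e", "em", "que", "um", "uma", "para"}

def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    # For str patterns in Python 3, \w is Unicode-aware,
    # so letters such as 'ç' and 'ã' stay inside a word.
    return re.findall(r"\w+", text.lower())

def index_terms(text):
    """Return the tokens that would be indexed, with stop words dropped."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

terms = index_terms("Apresentação sobre o DekiWiki realizada em 25/06/2007")
print(terms)
# ['apresentação', 'sobre', 'dekiwiki', 'realizada', '25', '06', '2007']
```

Note that the accented word survives as a single token here; a tokenizer that only recognizes ASCII letters would split it at 'ç' and 'ã', which is one way searches for such words can fail.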

denakitan
06-18-2007, 04:38 PM
Hi PeteE! Thanks for the reply.

I thought Dekiwiki used Lucene just for searching attachments. So it is used for regular searching too. OK, I can try getting some Portuguese stop words.

I'd also like to confirm a few other things, to find out whether the following items could be causing this problem.

1) When I toggle the HTML source, Portuguese characters turn into codes like "&#231". But when I run the demos on the Xinha web site, the characters appear exactly as I typed them.

2) I looked at my local database too, and the characters look different there as well. For example, á appears in the editor's HTML view as "&#225" and in the database as "á".

Is it possible that the items above are influencing this problem?

Thanks,
Dennis

SteveB
06-18-2007, 04:45 PM
Hi denakitan,

Welcome to the forums!

The "&#231" is a numerically encoded HTML entity, which is required since standard HTML only supports ASCII. The database, on the other hand, uses UTF-8 (btw, make sure you specify UTF8 as encoding when connecting to your db or you might see garbage). Just to make life a little more fun, DekiWiki processes everything in Unicode. :)

It would help immensely if you provided us with a simple page written in Portuguese that uses special characters. We can then make sure that it gets processed the right way. Thanks!
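
As a small illustration of the three representations mentioned above, the following Python sketch shows the same character as a numeric HTML entity, as a Unicode character, and as UTF-8 bytes, plus the kind of "garbage" that appears when UTF-8 bytes are read with the wrong charset. It uses only the Python standard library and says nothing about how DekiWiki itself handles the conversion.

```python
import html

entity = "&#231;"                       # numeric character reference (code point 231)
char = html.unescape(entity)            # the Unicode character 'ç'
utf8_bytes = char.encode("utf-8")       # how UTF-8 stores it: two bytes

# Reading UTF-8 bytes as latin1 produces mojibake instead of 'ç'.
garbled = utf8_bytes.decode("latin-1")

print(char, utf8_bytes, garbled)
# ç b'\xc3\xa7' Ã§
```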

PeteE
06-18-2007, 04:57 PM
Dennis - Thanks for the info. The Gooseberry version of DekiWiki did some "magic" and converted some characters incorrectly. In the upcoming Hayes release, we've merged in the newest Xinha changes, and Portuguese characters look like they're being saved properly now.

You might want to wait a week or so for the next Hayes beta to be released and see if that gives you better results.

thanks,
pete

denakitan
06-18-2007, 04:58 PM
Hi SteveB!

I don't know much about encoding, but I will look for some material on the net to learn more. OK, I will check whether I'm using UTF8 as the encoding when connecting to the database.

OK, I will send a text in Portuguese soon.

Oh, just to add some more detail that may help to identify the problem:

- I don't have any problem writing or displaying Portuguese. The articles look correct. The numerically encoded HTML only appears when I toggle to the HTML source and in some search results, where HTML tags and comments like "<!-- Tidy found serious XHTML errors: -->" appear too.

- I did have a problem once when I migrated the DekiWiki server. I made a backup of the database and loaded it on the new server. The articles looked correct, but their titles in the left menu displayed encoded characters.

Well, that's it. Thanks,

SteveB
06-18-2007, 05:09 PM
Shoot me a screenshot and some Portuguese to: steveb {at} mindtouch {dot} com.

Thanks!

denakitan
06-26-2007, 02:34 PM
Hi!

I have some updated information about my problems. Over the last few days I installed DekiWiki on a Linux box (Ubuntu 7.04). Almost all the problems I faced in the Windows installation were solved. I will list them at the end of the message.

As for my problem searching Portuguese strings, it still happens on Linux, but I discovered some interesting things.

1 - I noticed that I can find an article when the searched string is in its title. If the searched string is present only in the body of the article, the search reports that no results were found.

2 - Looking at the database, I ran a query on the cur table and noticed that titles were in the following format:

Eventos/Apresentação_DekiWiki

and the article bodies in this format (I replaced ; with ' so the entities display literally):

Apresenta&#231'&#227'o sobre o DekiWiki realizada em 25/06/2007.

I found that the tables were using the following configuration:

CHARACTER SET latin1 COLLATE latin1_swedish_ci

3 - Then I changed the configuration to:

CHARACTER SET utf8 COLLATE utf8_bin

The titles of new articles kept displaying the same way:

TÃ*tulo_com_acentuação

But the article bodies did not:

<p>Corpo do artigo com acentua&ccedil;&atilde;o.</p>

But I still was not able to find the article by its body text, only by its title.

4 - I copied the string from a title and updated the body field of the article with the copied value. After that, I could find the article even when searching for a string containing Portuguese special characters that appears in the body.

5 - In conclusion: I can only find a string when it is stored in the database in the same format as the titles, like:

Eventos/Apresentação_DekiWiki/Roteiro_de_demonstração_do_DekiWiki

Well, about the issue, that's all. I will keep trying to solve it.
Thanks,
Dennis
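
The observations above are consistent with the body text being indexed while still entity-encoded, so that a query containing 'ç' or 'ã' never matches a stored "&ccedil;" or "&#231;". That is only a guess at the cause, not a description of DekiWiki's indexer. A minimal Python sketch of the difference, with a hypothetical searchable_terms helper that strips tags and decodes entities before tokenizing:

```python
import html
import re

def searchable_terms(stored_body):
    """Hypothetical helper: strip tags, decode entities, then tokenize."""
    text = re.sub(r"<[^>]+>", " ", stored_body)   # crude tag removal, enough for a sketch
    text = html.unescape(text)                    # '&ccedil;' -> 'ç', '&#227;' -> 'ã'
    return set(re.findall(r"\w+", text.lower()))

raw = "<p>Corpo do artigo com acentua&ccedil;&atilde;o.</p>"

# Tokenizing the stored markup directly breaks the word apart at each entity.
naive = set(re.findall(r"\w+", raw.lower()))
decoded = searchable_terms(raw)

print("acentuação" in naive)    # False
print("acentuação" in decoded)  # True
```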

denakitan
06-26-2007, 02:55 PM
Windows issues that were solved:

- Searching in attachments
- Comparing articles - Always returned that versions were the same
- There's no log
- PDF exporting
- Log changing through the interface
- Search result showing html tags

denakitan
06-26-2007, 03:43 PM
When I search for the string 'Makefile', DekiWiki returns some articles. One of them is (I replaced ; with ' to show it the way it appears here):

#Desenvolvimento/MakeFile/Introdução ao Makefile
29.0KB (4367 words) - 19:02, 20 Apr 2007

Introdu&#231'&#227'o ao Makefile
make, makefile, GNUmakefile

The title displays correctly, but the article excerpt does not. When I open the article, both display correctly, so it is just an issue with how the search results are displayed.

Looking at the page source, I noticed that an 'amp' string is added to each entity, which causes the wrong display:

<blockquote>Introdu&amp;#231;&amp;#227;o ao <span class="searchmatch">Makefile</span>

Well, I think that's it. Thank you again.
Dennis
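
The extra 'amp' is what typically happens when text that already contains entities is HTML-escaped a second time. The Python sketch below reproduces the symptom with the standard library; it is an illustration of double escaping in general, and whether DekiWiki's search page does exactly this is an assumption.

```python
import html

stored = "Introdu&#231;&#227;o ao Makefile"   # text as stored, already entity-encoded

# Escaping again turns every '&' into '&amp;', so the browser shows the entity literally.
escaped_again = html.escape(stored)

# Decoding first and escaping once avoids the problem.
fixed = html.escape(html.unescape(stored))

print(escaped_again)  # Introdu&amp;#231;&amp;#227;o ao Makefile
print(fixed)          # Introdução ao Makefile
```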

SteveB
06-27-2007, 09:43 PM
Dennis,

We'll be looking into these issues after Beta2 comes out. Sorry for the delay. The next Beta will focus on internationalization issues such as localization and searching for text with international characters.

Thanks again for bringing this issue to the forefront!

SteveB
07-02-2007, 07:21 AM
Quick update on international searching: it appears that our recent changes to how the indexer works have also taken care of the issues found in searching for international text. There is still the issue of stop words, but that's much less severe.

Dennis, thanks again for providing some sample text. It came in really useful in checking for the bug you reported. Thanks!