<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>
<channel>
	<title>apt-blog.net   PT-san, the Unlicensed Programmer &#187; ibus</title>
	<atom:link href="http://apt-blog.net/tag/ibus/feed" rel="self" type="application/rss+xml" />
	<link>http://apt-blog.net</link>
	<description>On the run.</description>
	<lastBuildDate>Fri, 18 May 2012 11:25:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>A Script to Fix High-Frequency Word Errors in the ibus Database</title>
		<link>http://apt-blog.net/a_script_fixing_errors_in_ibus_pinyin</link>
		<comments>http://apt-blog.net/a_script_fixing_errors_in_ibus_pinyin#comments</comments>
		<pubDate>Fri, 24 Apr 2009 11:24:18 +0000</pubDate>
		<dc:creator>PT</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[ibus]]></category>
		<category><![CDATA[Database]]></category>
		<category><![CDATA[Input Method]]></category>
		<guid isPermaLink="false">http://apt-blog.net/archives/299.html</guid>
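		<description><![CDATA[After using ibus for a long time, I often find that characters or words that used to be the first choice, or at least near the top, have suddenly dropped down the candidate list, sometimes onto the second page. They were not pushed out by other words; more likely the ibus user database has become corrupted. I don't know whether this is a bug in ibus itself or a problem with the SQLite database it uses. Normally, when the user types a pinyin string, ibus reads the user input frequency of the matching characters from the user database and uses it to decide where each candidate goes. The first time the user selects a character, a record for it is added to the user database, and on later input that record moves the character forward. In theory an entry should appear at most once in the user database (each reading of a polyphonic character counts as a separate entry). In practice, however, a frequently used character is sometimes, for reasons unknown, treated as a first-time input and added to the database again; the next time it is typed it is ranked as a low-frequency character and lands far down the list, which is quite inconvenient. This Python script finds such entries, deletes the records that were added later, and thereby restores the entries' frequencies. Script download: http://code.google.com/p/ptcoding/source/browse/trunk/ibus_fix (the file ibux_db_fix.py in the svn directory; the other two are test scripts) What the program does: automatically backs up the user dictionary; finds entries that appear twice in the user database but are not polyphonic; deletes the entry that was added later. SQL for finding the bad entries: SELECT * FROM py_phrase WHERE phrase IN (SELECT phrase FROM py_phrase GROUP BY phrase HAVING COUNT(*) = 2) Known limitations: if the same entry has three or more records, the program cannot identify it (this is very unlikely, and the SQL statement in the script can be modified to find such cases); if a character is itself polyphonic and one of its readings is affected, the program cannot identify it (the probability also seems quite low); if the two records have the same user input frequency, both are deleted (not a bad thing, and the impact is small). Python source: [...]]]></description>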
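			<content:encoded><![CDATA[<p>After using ibus for a long time, I often find that characters or words that used to be the first choice, or at least near the top, have suddenly dropped down the candidate list, sometimes onto the second page. They were not pushed out by other words; more likely the ibus user database has become corrupted.</p>
<p>I don't know whether this is a bug in ibus itself or a problem with the SQLite database it uses. Normally, when the user types a pinyin string, ibus reads the user input frequency of the matching characters from the user database and uses it to decide where each candidate goes. The first time the user selects a character, a record for it is added to the user database, and on later input that record moves the character forward. In theory an entry should appear at most once in the user database (each reading of a polyphonic character counts as a separate entry). In practice, however, a frequently used character is sometimes, for reasons unknown, treated as a first-time input and added to the database again; the next time it is typed it is ranked as a low-frequency character and lands far down the list, which is quite inconvenient.</p>
<p>This Python script finds such entries, deletes the records that were added later, and thereby restores the entries' frequencies.<br />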
<img src="http://apt-blog.net/wp-content/uploads/2009/04/ibusdbfix.png" /></p>
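<p>Script download: <a href="http://code.google.com/p/ptcoding/source/browse/trunk/ibus_fix">http://code.google.com/p/ptcoding/source/browse/trunk/ibus_fix</a><br />
(the file ibux_db_fix.py in the svn directory; the other two are test scripts)</p>
<p>What the program does:</p>
<ol>
<li>Automatically backs up the user dictionary</li>
<li>Finds entries that appear twice in the user database but are not polyphonic
</li>
<li>Deletes the entry that was added later</li>
</ol>
<p>SQL for finding the bad entries:</p>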
<blockquote><p>SELECT * FROM py_phrase<br />
            WHERE phrase IN<br />
                (SELECT phrase<br />
                    FROM py_phrase<br />
                    GROUP BY phrase<br />
                    HAVING COUNT(*) = 2)</p></blockquote>
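<p>The query can be tried out on a throwaway in-memory database. The following is only a sketch: the three-column py_phrase table is a simplified, hypothetical stand-in for the real ibus schema, the sample rows are made up, and the ORDER BY clause is added here just to keep the two copies of a phrase next to each other.</p>

```python
import sqlite3

# Hypothetical, simplified stand-in for the ibus user table (assumption:
# the real py_phrase table has more columns than these three).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE py_phrase (pinyin TEXT, phrase TEXT, user_freq INTEGER)")
con.executemany(
    "INSERT INTO py_phrase VALUES (?, ?, ?)",
    [
        ("hao", "好", 300),   # original, frequently used record
        ("hao", "号", 12),
        ("hao", "好", 1),     # same entry added again by mistake
        ("ni", "你", 50),
    ],
)

# The detection query from the post: phrases that occur exactly twice.
dupes = con.execute(
    """SELECT * FROM py_phrase
       WHERE phrase IN
           (SELECT phrase FROM py_phrase
            GROUP BY phrase
            HAVING COUNT(*) = 2)
       ORDER BY phrase, rowid"""
).fetchall()

for row in dupes:
    print(row)
con.close()
```

<p>Only the two “好” rows come back; the second one, with user_freq 1, is the record the script would delete.</p>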
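<p>Known limitations:</p>
<ol>
<li>If the same entry has three or more records, the program cannot identify it (this is very unlikely, and the SQL statement in the script can be modified to find such cases)</li>
<li>If a character is itself polyphonic and one of its readings is affected as described above, the program cannot identify it (the probability also seems quite low)</li>
<li>If the two records have the same user input frequency, both are deleted (not a bad thing, and the impact is small)</li>
</ol>
<p>Python source:<br />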
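<p>The limitations listed above could in principle be avoided by grouping on pinyin and phrase together and deleting rows by rowid, keeping only the highest-frequency row of each group. This is a minimal sketch of that idea, not the script's method: the simplified table, its pinyin column, the sample data, and the helper name dedupe are all assumptions made for illustration.</p>

```python
import sqlite3

def dedupe(con):
    """Keep one row per (pinyin, phrase): the one with the highest user_freq.

    Rows are deleted by rowid, so groups of three or more and groups whose
    rows share the same user_freq are handled as well.
    Returns the number of deleted rows.
    """
    rows = con.execute(
        "SELECT rowid, pinyin, phrase, user_freq FROM py_phrase "
        "ORDER BY pinyin, phrase, user_freq DESC, rowid"
    ).fetchall()
    seen = set()
    doomed = []
    for rowid, pinyin, phrase, _freq in rows:
        key = (pinyin, phrase)
        if key in seen:
            doomed.append((rowid,))   # a lower-ranked copy of a kept row
        else:
            seen.add(key)             # first row of the group is kept
    con.executemany("DELETE FROM py_phrase WHERE rowid = ?", doomed)
    con.commit()
    return len(doomed)

# Hypothetical sample data on a simplified table (not the real ibus schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE py_phrase (pinyin TEXT, phrase TEXT, user_freq INTEGER)")
con.executemany(
    "INSERT INTO py_phrase VALUES (?, ?, ?)",
    [
        ("hao", "好", 300), ("hao", "好", 1), ("hao", "好", 1),  # three copies
        ("le", "了", 7), ("le", "了", 7),                        # equal frequency
        ("le", "乐", 9), ("yue", "乐", 4),                       # polyphonic, kept
    ],
)
removed = dedupe(con)
remaining = con.execute(
    "SELECT pinyin, phrase, user_freq FROM py_phrase ORDER BY rowid"
).fetchall()
print(removed, remaining)
```

<p>Because the two readings of “乐” form separate groups, neither is touched, while exactly one copy of each duplicated entry survives.</p>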
<span id="more-299"></span></p>
<div class="wp_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span>
<span style="color: #808080; font-style: italic;"># -*- coding: utf-8 -*-</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">os</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">shutil</span>
<span style="color: #ff7700;font-weight:bold;">import</span> sqlite3
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">time</span>
&nbsp;
DB = <span style="color: #dc143c;">os</span>.<span style="color: black;">getenv</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;HOME&quot;</span><span style="color: black;">&#41;</span> + <span style="color: #483d8b;">&quot;/.ibus/pinyin/user.db&quot;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #dc143c;">os</span>.<span style="color: black;">path</span>.<span style="color: black;">exists</span><span style="color: black;">&#40;</span>DB<span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Oops... ibus does not seem to be installed... PT sends his regards...&quot;</span><span style="color: black;">&#41;</span>
    <span style="color: #dc143c;">sys</span>.<span style="color: black;">exit</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># ------ Back up the database file --------</span>
nowtime = <span style="color: #dc143c;">time</span>.<span style="color: black;">strftime</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;_%Y-%m-%d-%H_%M_%S&quot;</span>, <span style="color: #dc143c;">time</span>.<span style="color: black;">localtime</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
DB_BK = DB + nowtime
<span style="color: #808080; font-style: italic;"># Copy the file directly instead of building a shell command from the path</span>
<span style="color: #dc143c;">shutil</span>.<span style="color: black;">copy2</span><span style="color: black;">&#40;</span>DB, DB_BK<span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;The ibus user database has been backed up to %s&quot;</span> <span style="color: #66cc66;">%</span> DB_BK<span style="color: black;">&#41;</span>
&nbsp;
&nbsp;
<span style="color: #808080; font-style: italic;"># ------ Connect to the database ---------</span>
con = sqlite3.<span style="color: black;">connect</span><span style="color: black;">&#40;</span>DB<span style="color: black;">&#41;</span>
c = con.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
<span style="color: #808080; font-style: italic;"># ORDER BY keeps the two rows of each phrase adjacent, older row first,</span>
<span style="color: #808080; font-style: italic;"># which the pairwise loop below relies on</span>
c.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;&quot;&quot;SELECT * FROM py_phrase
            WHERE phrase IN
                (SELECT phrase
                    FROM py_phrase
                    GROUP BY phrase
                    HAVING COUNT(*) = 2)
            ORDER BY phrase, rowid&quot;&quot;&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
rows = c.<span style="color: black;">fetchall</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
badphrase = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># ------ Determine bad phrases -------</span>
<span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>rows<span style="color: black;">&#41;</span>, <span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span>:
    phrase = rows<span style="color: black;">&#91;</span>i:i+<span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span>
    <span style="color: #808080; font-style: italic;"># The pair is a true duplicate only if columns 1-4 all match;</span>
    <span style="color: #808080; font-style: italic;"># otherwise it is treated as a polyphonic entry and left alone</span>
    <span style="color: #ff7700;font-weight:bold;">if</span> phrase<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span> == phrase<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span>:
        badphrase.<span style="color: black;">append</span><span style="color: black;">&#40;</span>phrase<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
&nbsp;
&nbsp;
<span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff7700;font-weight:bold;">not</span> badphrase:
    <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;No bad entries found... PT sends his regards... http://apt-blog.net&quot;</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">else</span>:
    <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Found the following bad entries, %d in total:&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>badphrase<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> row <span style="color: #ff7700;font-weight:bold;">in</span> badphrase:
        <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;**[%s]**&quot;</span> <span style="color: #66cc66;">%</span> row<span style="color: black;">&#91;</span>-<span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>Cleaning up...&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #808080; font-style: italic;"># ------ Clean up the database ------</span>
    <span style="color: #ff7700;font-weight:bold;">try</span>:
        <span style="color: #ff7700;font-weight:bold;">for</span> row <span style="color: #ff7700;font-weight:bold;">in</span> badphrase:
            <span style="color: #808080; font-style: italic;"># Parameterized query, so a phrase containing quotes cannot break the SQL</span>
            c.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;DELETE FROM py_phrase WHERE phrase = ? AND user_freq = ?&quot;</span>,
                      <span style="color: black;">&#40;</span>row<span style="color: black;">&#91;</span>-<span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span>, row<span style="color: black;">&#91;</span>-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
        con.<span style="color: black;">commit</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Cleanup finished... PT sends his regards... http://apt-blog.net&quot;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">except</span> sqlite3.<span style="color: black;">OperationalError</span>:
        <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Cleanup could not be completed; please quit ibus first...&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
con.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></td></tr></table></div>
]]></content:encoded>
			<wfw:commentRss>http://apt-blog.net/a_script_fixing_errors_in_ibus_pinyin/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Testing How Well the Default ibus Dictionary Covers Popular Words</title>
		<link>http://apt-blog.net/testing_ibus_pinyin</link>
		<comments>http://apt-blog.net/testing_ibus_pinyin#comments</comments>
		<pubDate>Tue, 31 Mar 2009 16:03:37 +0000</pubDate>
		<dc:creator>BOYPT</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[ibus]]></category>
		<category><![CDATA[Dictionary]]></category>
		<category><![CDATA[Input Method]]></category>
		<guid isPermaLink="false">http://apt-blog.net/archives/214.html</guid>
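		<description><![CDATA[For the past few days I have been wondering how to expand the ibus input method's dictionary, even though it feels fine in everyday use. Online I found that Sogou provides an “Internet dictionary” containing more than 150,000 words extracted by its search engine. I had intended to import it into ibus, but first used Python to test how many of its words are already in the default ibus dictionary. It turned out that only a little over 200 of the 150,000-odd popular words are missing, so the ibus dictionary really is quite good. Program output (the test code follows): Searched: 157200 times. 215 phrases not in the database, written in file 'notexist' Looking through the notexist file, I found that apart from a big batch of idioms with a frequency of 1 in the second half, only about 20 high-frequency words are missing from the default dictionary: (-_-| So even “裸体” (naked) is missing? How “harmonious”! I suggest the broadcasting authority (广滇驹) recommend ibus as the national input method of choice.) 乾坤 3561275 N, 乾隆 3088184 N, 乾净 1533219 夥伴 1052393 瞭望 984469 宏碁 979267 乾脆 953204 乾燥 624377 清乾隆 480337 乾隆皇帝 380252 N, 阿房宫 235461 乾隆年间 214986 定乾坤 210477 乾隆帝 149133 乾坤袋 143966 著色 111072 萧乾 84647 [...]]]></description>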
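			<content:encoded><![CDATA[<p>For the past few days I have been wondering how to expand the ibus input method's dictionary, even though it feels fine in everyday use. Online I found that Sogou provides an “Internet dictionary” containing more than 150,000 words extracted by its search engine. I had intended to import it into ibus, but first used Python to test how many of its words are already in the default ibus dictionary. It turned out that only a little over 200 of the 150,000-odd popular words are missing, so the ibus dictionary really is quite good.</p>
<p>Program output (the test code follows):</p>
<p>Searched: 157200 times. 215 phrases not in the database,<br />
 written in file 'notexist'</p>
<p>Looking through the notexist file, I found that apart from a big batch of idioms with a frequency of 1 in the second half, only about 20 high-frequency words are missing from the default dictionary:<br />
(-_-| So even “裸体” (naked) is missing? How “harmonious”! I suggest the broadcasting authority (广滇驹) recommend ibus as the national input method of choice.)</p>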
<blockquote><p>乾坤	3561275	N,<br />
乾隆	3088184	N,<br />
乾净	1533219<br />
夥伴	1052393<br />
瞭望	984469<br />
宏碁	979267<br />
乾脆	953204<br />
乾燥	624377<br />
清乾隆	480337<br />
乾隆皇帝	380252	N,<br />
阿房宫	235461<br />
乾隆年间	214986<br />
定乾坤	210477<br />
乾隆帝	149133<br />
乾坤袋	143966<br />
著色	111072<br />
萧乾	84647<br />
小夥子	79076<br />
瞭望台	71630<br />
寒伧	50780	V,ADJ,<br />
祼体	46797
</p></blockquote>
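<p>To pick out the high-frequency entries, the lines of the notexist file can be sorted by their frequency column. This is a small sketch, assuming each line is tab-separated as phrase, frequency, and an optional part-of-speech field, as in the excerpt above. The sample lines are copied from that excerpt, the helper name top_missing and the threshold are made up, and the real file is GBK-encoded, so it would need decoding first.</p>

```python
def top_missing(lines, min_freq):
    """Return (phrase, freq) pairs with freq >= min_freq, most frequent first."""
    entries = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip lines that do not have a numeric frequency column
        if len(fields) >= 2 and fields[1].isdigit():
            entries.append((fields[0], int(fields[1])))
    kept = [e for e in entries if e[1] >= min_freq]
    return sorted(kept, key=lambda e: e[1], reverse=True)

# Sample lines in the assumed format (already decoded to text)
sample = [
    "瞭望\t984469\n",
    "乾坤\t3561275\tN,\n",
    "萧乾\t84647\n",
    "乾隆\t3088184\tN,\n",
]
for phrase, freq in top_missing(sample, 100000):
    print(phrase, freq)
```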
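<p>So the ibus dictionary doesn't really need much more expansion, heh. Of course, brand-new coinages such as <big>萌萌的草泥马</big>, 雅篾蝶, and 法克鱿 still have to be typed in by the user once, unless a dedicated “mythical creatures” dictionary can be found...<br />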
<span id="more-214"></span></p>
<div class="wp_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">import</span> sqlite3
&nbsp;
con = sqlite3.<span style="color: black;">connect</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'/usr/share/ibus-pinyin/engine/py.db'</span><span style="color: black;">&#41;</span>
c = con.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Work on raw bytes: the Sogou dictionary is GBK-encoded and is decoded per line below</span>
diclib = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Freq/SogouLabDic.dic&quot;</span>, <span style="color: #483d8b;">'rb'</span><span style="color: black;">&#41;</span>
rec_notexist = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'notexist'</span>, <span style="color: #483d8b;">'wb'</span><span style="color: black;">&#41;</span>
&nbsp;
searchCounter = <span style="color: #ff4500;">0</span>
notExistCounter = <span style="color: #ff4500;">0</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">for</span> line <span style="color: #ff7700;font-weight:bold;">in</span> diclib:
    data = line.<span style="color: black;">split</span><span style="color: black;">&#40;</span>b<span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\t</span>'</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">try</span>:
        phrase = data<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>.<span style="color: black;">decode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;gbk&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #808080; font-style: italic;"># Pass the decoded text itself; sqlite3 takes care of the encoding</span>
        c.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;select *<span style="color: #000099; font-weight: bold;">\</span>
            from py_phrase<span style="color: #000099; font-weight: bold;">\</span>
            where phrase = ?&quot;</span>, <span style="color: black;">&#91;</span>phrase<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
&nbsp;
        rows = c.<span style="color: black;">fetchall</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
        searchCounter += <span style="color: #ff4500;">1</span>
    <span style="color: #ff7700;font-weight:bold;">except</span> <span style="color: #008000;">UnicodeDecodeError</span> <span style="color: #ff7700;font-weight:bold;">as</span> e:
        <span style="color: #808080; font-style: italic;"># Undecodable lines are reported and recorded, but not counted</span>
        <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span>e<span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span>data<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
        rec_notexist.<span style="color: black;">write</span><span style="color: black;">&#40;</span>line<span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">continue</span>
    <span style="color: #ff7700;font-weight:bold;">except</span> <span style="color: #008000;">Exception</span> <span style="color: #ff7700;font-weight:bold;">as</span> e:
        <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span>e<span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">break</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff7700;font-weight:bold;">not</span> rows:
        notExistCounter += <span style="color: #ff4500;">1</span>
        rec_notexist.<span style="color: black;">write</span><span style="color: black;">&#40;</span>line<span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span>phrase<span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Searched: %d times. %d phrases not in the database, <span style="color: #000099; font-weight: bold;">\n</span> <span style="color: #000099; font-weight: bold;">\</span>
written in file 'notexist'&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>searchCounter, notExistCounter<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
rec_notexist.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
diclib.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
con.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></td></tr></table></div>
]]></content:encoded>
			<wfw:commentRss>http://apt-blog.net/testing_ibus_pinyin/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>About the ibus Input Method Dictionary</title>
		<link>http://apt-blog.net/exporing_the_ibus_pinyin_word_database</link>
		<comments>http://apt-blog.net/exporing_the_ibus_pinyin_word_database#comments</comments>
		<pubDate>Mon, 30 Mar 2009 08:46:03 +0000</pubDate>
		<dc:creator>BOYPT</dc:creator>
				<category><![CDATA[Unix/Linux]]></category>
		<category><![CDATA[ibus]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Dictionary]]></category>
		<category><![CDATA[Input Method]]></category>
		<guid isPermaLink="false">http://apt-blog.net/archives/168.html</guid>
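		<description><![CDATA[At the moment the pinyin input methods on Linux are all at an early stage of development, and it is hard to call any of them particularly mature. Besides the veteran Fcitx, the SCIM platform offers its default Smart Pinyin, “Python” (巨蟒), and SunPinYin, and of course there is ibus, which I use. SunPinYin is a project inside Sun's OpenSolaris and is based on a “statistical language model”; the technology is solid and it is said to respond extremely fast, so although it still lacks features it is something to look forward to. The largest default dictionary seems to be that of “Python”, which reportedly uses an early Sogou dictionary, but its dictionary-handling algorithm seems a bit crude, while the Fcitx dictionary is simply too small... ibus is middle of the road: its dictionary is not small and not new, but it is easy for users to get started with. ibus is of course not perfect either. For example, the delete-word function (Ctrl + num) often fails. A while ago the first candidate for hao suddenly became “号”, although “好” is obviously more common. After being annoyed for a few days I installed sqlitebrowser, opened the user dictionary, found “号”, and set its user_freq back to a single digit (it claimed I had typed it several hundred times! Perhaps the program glitched once and looped a few extra times.) Staring at the dictionary is rather fun. I thought how nice it would be to import the Sogou dictionary (ibus is rather short of idiom-type entries), and while I was at it I followed a textbook example and tried reading the ibus database with Python. It doesn't do anything useful; consider it the Hello World of database programming. #!/usr/bin/python &#160; import sqlite3 &#160; con = sqlite3.connect&#40;'/home/pentie/.ibus/pinyin/user.db'&#41; c = con.cursor&#40;&#41; c.execute&#40;&#34;&#34;&#34;select phrase,user_freq from py_phrase where user_freq = 1 &#34;&#34;&#34;&#41; &#160; rows = c.fetchall&#40;&#41; f = open&#40;'one','w'&#41; for record in rows: l = u&#34;%s,%s\n&#34; % record f.writelines&#40;l.encode&#40;&#34;utf-8&#34;&#41;&#41; &#160; f.close&#40;&#41; con.close&#40;&#41; I searched online and found that someone really has written a script for importing dictionaries: http://forum.ubuntu.org.cn/viewtopic.php?f=8&#38;t=188685 The instructions in the thread are fairly detailed. I downloaded the “idiom entries” from the Sogou cell dictionaries, made a few small changes, imported them successfully, and tried them out. Not bad.]]></description>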
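			<content:encoded><![CDATA[<p>At the moment the pinyin input methods on Linux are all at an early stage of development, and it is hard to call any of them particularly mature. Besides the veteran Fcitx, the SCIM platform offers its default Smart Pinyin, “Python” (巨蟒), and SunPinYin, and of course there is ibus, which I use. SunPinYin is <a href="http://www.opensolaris.org/os/project/input-method/" target="_blank">a project inside Sun's OpenSolaris</a> and is based on a <span class="bold">“statistical language model”</span>; the technology is solid and it is said to respond extremely fast, so although it still lacks features it is something to look forward to.</p>
<p>The largest default dictionary seems to be that of “Python”, which reportedly uses an early Sogou dictionary, but its dictionary-handling algorithm seems a bit crude, while the Fcitx dictionary is simply too small... ibus is middle of the road: its dictionary is not small and not new, but it is easy for users to get started with.</p>
<div id="attachment_175" class="wp-caption aligncenter" style="width: 391px"><a href="http://apt-blog.net/wp-content/uploads/2009/03/sqlitebrowser.png"><a href="http://apt-blog.net/wp-content/uploads/2009/03/sqlitebrowser.png" rel="lightbox[168]" title="sqlitebrowser"><img class="aligncenter size-full wp-image-175" title="sqlitebrowser" src="http://apt-blog.net/wp-content/uploads/2009/03/sqlitebrowser.png" alt="sqlitebrowser" width="381" height="463" /></a></a><p class="wp-caption-text">Practicing some SQL along the way</p></div>
<p>ibus is of course not perfect either. For example, the delete-word function (Ctrl + num) often fails. A while ago the first candidate for hao suddenly became “号”, although “好” is obviously more common. After being annoyed for a few days I installed sqlitebrowser, opened the user dictionary, found “号”, and set its user_freq back to a single digit (it claimed I had typed it several hundred times! Perhaps the program glitched once and looped a few extra times.)</p>
<p>Staring at the dictionary is rather fun. I thought how nice it would be to import the Sogou dictionary (ibus is rather short of idiom-type entries), and while I was at it I followed a textbook example and tried reading the ibus database with Python. It doesn't do anything useful; consider it the Hello World of database programming.<br />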
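<p>The same kind of correction can also be made without a GUI by issuing an UPDATE statement. The following is only a sketch on an in-memory table: the simplified two-column layout, the sample frequencies, and the new value 5 are illustrative assumptions. On a real user.db it would be wise to back up the file and quit ibus first.</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical, simplified table; the real py_phrase has more columns.
con.execute("CREATE TABLE py_phrase (phrase TEXT, user_freq INTEGER)")
con.executemany(
    "INSERT INTO py_phrase VALUES (?, ?)",
    [("号", 873), ("好", 120)],   # made-up frequencies
)

# Reset the runaway frequency of one phrase to a small value.
con.execute("UPDATE py_phrase SET user_freq = ? WHERE phrase = ?", (5, "号"))
con.commit()

# Candidates ranked by user frequency, highest first.
ranking = con.execute(
    "SELECT phrase, user_freq FROM py_phrase ORDER BY user_freq DESC"
).fetchall()
print(ranking)
```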
<span id="more-168"></span></p>
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">io</span>
<span style="color: #ff7700;font-weight:bold;">import</span> sqlite3
&nbsp;
con = sqlite3.<span style="color: black;">connect</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'/home/pentie/.ibus/pinyin/user.db'</span><span style="color: black;">&#41;</span>
c = con.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
<span style="color: #808080; font-style: italic;"># Select every phrase that has been entered exactly once</span>
c.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;&quot;&quot;select phrase,user_freq
    from py_phrase
    where user_freq = 1
    &quot;&quot;&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
rows = c.<span style="color: black;">fetchall</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
<span style="color: #808080; font-style: italic;"># Open the output as UTF-8 text so each line can be written directly</span>
f = <span style="color: #dc143c;">io</span>.<span style="color: black;">open</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'one'</span>, <span style="color: #483d8b;">'w'</span>, encoding=<span style="color: #483d8b;">'utf-8'</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">for</span> record <span style="color: #ff7700;font-weight:bold;">in</span> rows:
    f.<span style="color: black;">write</span><span style="color: black;">&#40;</span>u<span style="color: #483d8b;">&quot;%s,%s<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span> <span style="color: #66cc66;">%</span> record<span style="color: black;">&#41;</span>
&nbsp;
f.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
con.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>
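<p>I searched online and found that someone really has written a script for importing dictionaries: <a href="http://forum.ubuntu.org.cn/viewtopic.php?f=8&amp;t=188685" target="_blank">http://forum.ubuntu.org.cn/viewtopic.php?f=8&amp;t=188685</a></p>
<p>The instructions in the thread are fairly detailed. I downloaded the “<a href="http://pinyin.sogou.com/dict/cell.php?id=178" target="_blank">idiom entries</a>” from the Sogou cell dictionaries, made a few small changes, imported them successfully, and tried them out. Not bad.</p>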
]]></content:encoded>
			<wfw:commentRss>http://apt-blog.net/exporing_the_ibus_pinyin_word_database/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

