<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>
<channel>
	<title>apt-blog.net   PT-san, the Unlicensed Programmer &#187; Word Database</title>
	<atom:link href="http://apt-blog.net/tag/%e8%af%8d%e5%ba%93/feed" rel="self" type="application/rss+xml" />
	<link>http://apt-blog.net</link>
	<description>On the run.</description>
	<lastBuildDate>Fri, 18 May 2012 11:25:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Testing how well the default ibus dictionary covers popular words</title>
		<link>http://apt-blog.net/testing_ibus_pinyin</link>
		<comments>http://apt-blog.net/testing_ibus_pinyin#comments</comments>
		<pubDate>Tue, 31 Mar 2009 16:03:37 +0000</pubDate>
		<dc:creator>BOYPT</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[ibus]]></category>
		<category><![CDATA[Word Database]]></category>
		<category><![CDATA[Input Method]]></category>
		<guid isPermaLink="false">http://apt-blog.net/archives/214.html</guid>
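		<description><![CDATA[For the past few days I have been wondering how to extend the ibus input method's dictionary, even though it feels fine in everyday use. I found that Sogou publishes an "Internet lexicon" online, containing more than 150,000 words extracted by its search engine. I had planned to import it into ibus, but first I used Python to check how many of those words are already in the default ibus dictionary. It turned out that only a little over 200 of the 150,000 popular words are missing, so the ibus dictionary really is quite good. Program output (the test code follows below): Searched: 157200 times. 215 phrases not in the database, written in file 'notexist' Looking at the notexist file, apart from a large batch of idioms with a frequency of 1 in the second half, only about 20 high-frequency words are missing from the default dictionary: (- -| So even “裸体” (naked) is missing? How very “harmonious”! I suggest that 广滇驹 (a pun on 广电局, the broadcasting regulator) recommend ibus as the national input method of choice.) 乾坤 3561275 N, 乾隆 3088184 N, 乾净 1533219 夥伴 1052393 瞭望 984469 宏碁 979267 乾脆 953204 乾燥 624377 清乾隆 480337 乾隆皇帝 380252 N, 阿房宫 235461 乾隆年间 214986 定乾坤 210477 乾隆帝 149133 乾坤袋 143966 著色 111072 萧乾 84647 [...]]]></description>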
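			<content:encoded><![CDATA[<p>For the past few days I have been wondering how to extend the ibus input method's dictionary, even though it feels fine in everyday use. I found that Sogou publishes an "Internet lexicon" online, containing more than 150,000 words extracted by its search engine. I had planned to import it into ibus, but first I used Python to check how many of those words are already in the default ibus dictionary. It turned out that only a little over 200 of the 150,000 popular words are missing, so the ibus dictionary really is quite good.</p>
<p>Program output (the test code follows below):</p>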
<p>Searched: 157200 times. 215 phrases not in the database,<br />
 written in file 'notexist'</p>
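<p>Looking at the notexist file, apart from a large batch of idioms with a frequency of 1 in the second half, only about 20 high-frequency words are missing from the default dictionary:<br />
(- -| So even “裸体” (naked) is missing? How very “harmonious”! I suggest that 广滇驹 (a pun on 广电局, the broadcasting regulator) recommend ibus as the national input method of choice.)</p>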
<blockquote><p>乾坤	3561275	N,<br />
乾隆	3088184	N,<br />
乾净	1533219<br />
夥伴	1052393<br />
瞭望	984469<br />
宏碁	979267<br />
乾脆	953204<br />
乾燥	624377<br />
清乾隆	480337<br />
乾隆皇帝	380252	N,<br />
阿房宫	235461<br />
乾隆年间	214986<br />
定乾坤	210477<br />
乾隆帝	149133<br />
乾坤袋	143966<br />
著色	111072<br />
萧乾	84647<br />
小夥子	79076<br />
瞭望台	71630<br />
寒伧	50780	V,ADJ,<br />
祼体	46797
</p></blockquote>
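<p>A minimal sketch of how the most frequent entries can be pulled out of such a list, assuming each record is tab-separated with the phrase first and the frequency second, as in the excerpt above. The helper name top_missing and the frequency-1 sample entry are made up for illustration; a real run would first decode the lines of the notexist file from GBK.</p>
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">def top_missing(lines, n):
    """Return the n most frequent (phrase, freq) pairs, highest first."""
    entries = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            entries.append((fields[0], int(fields[1])))
        except (IndexError, ValueError):
            # Skip blank or malformed records instead of failing on them.
            continue
    return sorted(entries, key=lambda e: e[1], reverse=True)[:n]


# Three records copied from the list above, plus one hypothetical
# frequency-1 idiom standing in for the long tail.
sample = ["瞭望\t984469\n", "乾坤\t3561275\tN,\n", "某成语\t1\n", "宏碁\t979267\n"]
for phrase, freq in top_missing(sample, 2):
    print(phrase, freq)</pre></div></div>
<p>Sorting before cutting the list keeps the long tail of frequency-1 idioms out of the way without having to pick a threshold by hand.</p>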
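<p>So the ibus dictionary does not really need much extending after all, heh. Of course, brand-new coinages such as <big>萌萌的草泥马</big>, 雅篾蝶, and 法克鱿 still have to be typed in once by the user, unless you can find a dedicated "mythical beasts" dictionary...<br />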
<span id="more-214"></span></p>
<div class="wp_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;">#!/usr/bin/env python3
# Count how many phrases from the Sogou word list are missing from the
# default ibus-pinyin database, and record the missing ones in 'notexist'.

import sqlite3

con = sqlite3.connect('/usr/share/ibus-pinyin/engine/py.db')
c = con.cursor()

searchCounter = 0
notExistCounter = 0

# The word list is GBK-encoded with tab-separated fields. Reading it as bytes
# keeps one undecodable phrase from aborting the whole run.
with open('Freq/SogouLabDic.dic', 'rb') as diclib, \
        open('notexist', 'wb') as rec_notexist:
    for line in diclib:
        data = line.split(b'\t')
        try:
            phrase = data[0].decode('gbk')
            # sqlite3 accepts str parameters directly.
            c.execute("select * from py_phrase where phrase = ?", [phrase])
            rows = c.fetchall()
            searchCounter += 1
        except UnicodeDecodeError as e:
            # Log the undecodable phrase and record its raw line.
            print(e)
            print(data[0])
            rec_notexist.write(line)
            continue
        except sqlite3.Error as e:
            print(e)
            break

        if not rows:
            notExistCounter += 1
            rec_notexist.write(line)
            print(phrase)

print("Searched: %d times. %d phrases not in the database,\n"
      " written in file 'notexist'" % (searchCounter, notExistCounter))

con.close()</pre></td></tr></table></div>
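<p>A sketch of an alternative to running one query per word: load every phrase into a set once and test membership in memory. It assumes, as in the script above, a table named py_phrase with a phrase column; the helper name missing_phrases and the in-memory stand-in database are made up for illustration and are not part of ibus.</p>
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">import sqlite3


def missing_phrases(con, candidates):
    """Return the candidates that have no row in py_phrase, in input order."""
    # One query loads every known phrase; membership tests then stay in memory.
    known = {row[0] for row in con.execute("select phrase from py_phrase")}
    return [p for p in candidates if p not in known]


# Stand-in for py.db: an in-memory table with only the column the check needs.
con = sqlite3.connect(":memory:")
con.execute("create table py_phrase (phrase text)")
con.executemany("insert into py_phrase values (?)", [("你好",), ("输入法",)])
print(missing_phrases(con, ["你好", "乾坤", "输入法"]))
con.close()</pre></div></div>
<p>For a list of about 150,000 words this trades some memory for far fewer round trips to SQLite; whether that is actually faster on the real py.db is something to measure rather than assume.</p>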
]]></content:encoded>
			<wfw:commentRss>http://apt-blog.net/testing_ibus_pinyin/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>About the ibus input method dictionary</title>
		<link>http://apt-blog.net/exporing_the_ibus_pinyin_word_database</link>
		<comments>http://apt-blog.net/exporing_the_ibus_pinyin_word_database#comments</comments>
		<pubDate>Mon, 30 Mar 2009 08:46:03 +0000</pubDate>
		<dc:creator>BOYPT</dc:creator>
				<category><![CDATA[Unix/Linux]]></category>
		<category><![CDATA[ibus]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Word Database]]></category>
		<category><![CDATA[Input Method]]></category>
		<guid isPermaLink="false">http://apt-blog.net/archives/168.html</guid>
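		<description><![CDATA[The pinyin input methods currently available on Linux are all at an early stage of development, and it is hard to call any of them particularly mature. Besides the veteran Fcitx, the SCIM platform offers its default Smart Pinyin, Python (巨蟒), and SunPinYin, and of course there is ibus, which I use. SunPinYin is a project within Sun's OpenSolaris, based on a "statistical language model". The technology is solid and it is said to be extremely responsive; although it currently lacks features, it is really something to look forward to. The largest default dictionary seems to belong to Python (巨蟒), which reportedly uses an early Sogou dictionary, but its algorithm for handling the dictionary seems a bit crude, while Fcitx's dictionary is simply too small... ibus is middle-of-the-road: its dictionary is not small and not especially new, but it is easy for users to get started with. ibus is not perfect either, of course. For example, the word-deletion feature (Ctrl + num) often fails. A while ago the first candidate for "hao" suddenly became “号”, even though “好” is obviously far more common. After being annoyed for a few days, I installed sqlitebrowser, opened the user dictionary, found “号”, and set its user_freq back to a single digit. (It claimed I had typed it several hundred times! Perhaps the program hit a bug once and looped a few extra times.) Staring at the dictionary is rather fun. It made me think how nice it would be to import the Sogou dictionary (ibus is rather short on idioms), and while I was at it I followed an example from a book and tried reading the ibus database with Python. It serves no real purpose; think of it as the "Hello World" of database programming. #!/usr/bin/python &#160; import sqlite3 &#160; con = sqlite3.connect&#40;'/home/pentie/.ibus/pinyin/user.db'&#41; c = con.cursor&#40;&#41; c.execute&#40;&#34;&#34;&#34;select phrase,user_freq from py_phrase where user_freq = 1 &#34;&#34;&#34;&#41; &#160; rows = c.fetchall&#40;&#41; f = open&#40;'one','w'&#41; for record in rows: l = u&#34;%s,%s\n&#34; % record f.writelines&#40;l.encode&#40;&#34;utf-8&#34;&#41;&#41; &#160; f.close&#40;&#41; con.close&#40;&#41; A quick search online showed that someone really has written a script for importing dictionaries: http://forum.ubuntu.org.cn/viewtopic.php?f=8&#38;t=188685 The instructions in that thread are fairly detailed. I downloaded the "idiom entries" list from Sogou's cell dictionaries (细胞词库), and after some minor modifications the import succeeded. I tried it out and it works well.]]></description>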
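			<content:encoded><![CDATA[<p>The pinyin input methods currently available on Linux are all at an early stage of development, and it is hard to call any of them particularly mature. Besides the veteran Fcitx, the SCIM platform offers its default Smart Pinyin, Python (巨蟒), and SunPinYin, and of course there is ibus, which I use. SunPinYin is <a href="http://www.opensolaris.org/os/project/input-method/" target="_blank">a project within Sun's OpenSolaris</a>, based on a <span class="bold">"statistical language model"</span>. The technology is solid and it is said to be extremely responsive; although it currently lacks features, it is really something to look forward to.</p>
<p>The largest default dictionary seems to belong to Python (巨蟒), which reportedly uses an early Sogou dictionary, but its algorithm for handling the dictionary seems a bit crude, while Fcitx's dictionary is simply too small... ibus is middle-of-the-road: its dictionary is not small and not especially new, but it is easy for users to get started with.</p>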
<div id="attachment_175" class="wp-caption aligncenter" style="width: 391px"><a href="http://apt-blog.net/wp-content/uploads/2009/03/sqlitebrowser.png" rel="lightbox[168]" title="sqlitebrowser"><img class="aligncenter size-full wp-image-175" title="sqlitebrowser" src="http://apt-blog.net/wp-content/uploads/2009/03/sqlitebrowser.png" alt="sqlitebrowser" width="381" height="463" /></a><p class="wp-caption-text">Practising a little SQL along the way</p></div>
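<p>ibus is not perfect either, of course. For example, the word-deletion feature (Ctrl + num) often fails. A while ago the first candidate for "hao" suddenly became “号”, even though “好” is obviously far more common. After being annoyed for a few days, I installed sqlitebrowser, opened the user dictionary, found “号”, and set its user_freq back to a single digit. (It claimed I had typed it several hundred times! Perhaps the program hit a bug once and looped a few extra times.)</p>
<p>Staring at the dictionary is rather fun. It made me think how nice it would be to import the Sogou dictionary (ibus is rather short on idioms), and while I was at it I followed an example from a book and tried reading the ibus database with Python. It serves no real purpose; think of it as the "Hello World" of database programming.<br />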
<span id="more-168"></span></p>
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">#!/usr/bin/env python3
# Dump every phrase whose user_freq is 1 from the ibus user dictionary.

import sqlite3

con = sqlite3.connect('/home/pentie/.ibus/pinyin/user.db')
c = con.cursor()
c.execute("""select phrase,user_freq
    from py_phrase
    where user_freq = 1
    """)

rows = c.fetchall()
# Write one "phrase,frequency" pair per line, encoded as UTF-8.
with open('one', 'w', encoding='utf-8') as f:
    for record in rows:
        f.write("%s,%s\n" % record)

con.close()</pre></div></div>
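<p>The manual user_freq fix described earlier could also be scripted instead of done in sqlitebrowser. What follows is only a sketch, assuming the py_phrase table with the phrase and user_freq columns used in the query above. The helper name reset_user_freq, the in-memory stand-in database, and the sample frequencies are made up for illustration.</p>
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">import sqlite3


def reset_user_freq(con, phrase, new_freq):
    """Set user_freq for one phrase and return the number of rows changed."""
    cur = con.execute(
        "update py_phrase set user_freq = ? where phrase = ?",
        (new_freq, phrase),
    )
    con.commit()
    return cur.rowcount


# Stand-in for user.db with just the two columns used above; the
# frequencies are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("create table py_phrase (phrase text, user_freq integer)")
con.executemany("insert into py_phrase values (?, ?)", [("号", 300), ("好", 12)])
print(reset_user_freq(con, "号", 5))
print(con.execute("select phrase, user_freq from py_phrase").fetchall())
con.close()</pre></div></div>
<p>Returning the row count makes it easy to notice when a phrase matched nothing, or matched more rows than expected. Working on a copy of user.db first would be the cautious way to try this.</p>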
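<p>A quick search online showed that someone really has written a script for importing dictionaries: <a href="http://forum.ubuntu.org.cn/viewtopic.php?f=8&amp;t=188685" target="_blank">http://forum.ubuntu.org.cn/viewtopic.php?f=8&amp;t=188685</a></p>
<p>The instructions in that thread are fairly detailed. I downloaded the "<a href="http://pinyin.sogou.com/dict/cell.php?id=178" target="_blank">idiom entries</a>" list from Sogou's cell dictionaries (细胞词库), and after some minor modifications the import succeeded. I tried it out and it works well.</p>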
]]></content:encoded>
			<wfw:commentRss>http://apt-blog.net/exporing_the_ibus_pinyin_word_database/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

