Introduction
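Analyses show that letter frequencies, like word frequencies, tend to vary from author to author and from subject to subject. An article about x-rays will contain far more instances of the letter X than usual, and an essay about using x-rays to treat zebras in Qatar will be full of the otherwise rare letters X, Q, and Z. An author's letter frequencies also reveal something of that author's writing habits; Hemingway's style, for example, is visibly different from Faulkner's. Counts of letters, bigrams, trigrams, word frequencies, word lengths, and sentence lengths can be compiled and used as evidence for or against attributing a work to a particular author, even when the disputed work and the candidate author's known writing are similar in style.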
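Accurate average letter frequencies can only be obtained by analyzing a large amount of representative text. With modern computers and large text corpora, such counts are easy to make. The Deafandblind website lists letter-frequency orders for several kinds of text (press reporting, religious texts, scientific texts, and general fiction). The positions of "h" and "i" differ most in general fiction, where the Linotype order "etao in s hrdlu" becomes "etao hn isrdlu".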
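As a rough illustration of how such a count is made, the following Python sketch tallies the letters a to z in a text and reports each as a share of all letters. The function name `letter_frequencies` and the sample sentence are invented for this example; a real study would run the same count over a large, representative corpus.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return {letter: share of all a-z letters} for the given text."""
    # Keep only the 26 unaccented letters, folded to lower case.
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    if total == 0:
        return {}
    return {ch: counts[ch] / total for ch in sorted(counts)}


# A made-up sample; it is far too short to give meaningful averages.
sample = "X-rays of zebras in Qatar"
freqs = letter_frequencies(sample)
# Print the letters from most to least frequent, breaking ties alphabetically.
for ch, share in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{ch}: {share:.1%}")
```

Accented letters and digits are simply ignored here, which is one of several possible conventions; the tables for other languages in this article count accented letters separately.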
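In his classic introductory cryptography book Codes and Secret Writing, Herbert S. Zim gives the English letter-frequency order as ETAON RISHD LFCMU GYPWB VKJXQ Z, the most common letter pairs as TH HE AN RE ER IN ON AT ND ST ES EN OF TE ED OR TI HI AS TO, and the most common doubled letters as LL EE SS OO TT FF RR NN PP CC.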
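Lists of letter pairs like these come from counting adjacent letters. The sketch below shows one way to do that in Python, assuming that pairs are counted only inside words and never across a space; the function names and the sample sentence are hypothetical, and this is not necessarily how Zim's lists were compiled.

```python
from collections import Counter
import re


def bigram_counts(text):
    """Count adjacent letter pairs inside words (pairs never span a space)."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for first, second in zip(word, word[1:]):
            counts[first + second] += 1
    return counts


def doubled_letters(counts):
    """Keep only pairs made of the same letter twice, such as 'll' or 'ee'."""
    return Counter({pair: n for pair, n in counts.items() if pair[0] == pair[1]})


# A made-up sample sentence, used only to exercise the functions.
sample = "The letter in the bottle tells three little secrets"
pairs = bigram_counts(sample)
print(pairs.most_common(5))
print(doubled_letters(pairs).most_common())
```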
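The 12 most frequent letters account for about 80% of all letter use, and the 8 most frequent for about 65%. Several rank functions fit letter frequency as a function of rank well, and the two-parameter Cocho/Beta rank function is the best among them. Another rank function, with no adjustable parameters, also fits the letter-frequency distribution reasonably well; the same function fits the frequencies of amino acids in protein sequences.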
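The two shares can be checked against the English letter-frequency table given later in this article. The sketch below sums the largest entries of that table; the names `ENGLISH_PERCENT` and `top_share` are chosen for this example. With these figures the sums come to roughly 63.6% and 80.6%, close to the rounded 65% and 80% quoted above, which presumably come from a different count.

```python
# Percentages copied from the English letter-frequency table in this article.
ENGLISH_PERCENT = {
    "a": 8.167, "b": 1.492, "c": 2.782, "d": 4.253, "e": 12.702, "f": 2.228,
    "g": 2.015, "h": 6.094, "i": 6.966, "j": 0.153, "k": 0.772, "l": 4.025,
    "m": 2.406, "n": 6.749, "o": 7.507, "p": 1.929, "q": 0.095, "r": 5.987,
    "s": 6.327, "t": 9.056, "u": 2.758, "v": 0.978, "w": 2.360, "x": 0.150,
    "y": 1.974, "z": 0.074,
}


def top_share(percent, n):
    """Sum the shares of the n most frequent letters."""
    ranked = sorted(percent.values(), reverse=True)
    return sum(ranked[:n])


print(f"top 8:  {top_share(ENGLISH_PERCENT, 8):.1f}%")
print(f"top 12: {top_share(ENGLISH_PERCENT, 12):.1f}%")
```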
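When using the VIC cipher or any other cipher based on a straddling checkerboard, spies commonly remember the 8 most frequent letters with a mnemonic such as "a sin to err" (dropping the final r). Letter frequencies and frequency analysis are needed to solve cryptograms and to play word games such as Hangman, Scrabble, Bananagrams, and the television game show Wheel of Fortune. An early description in literature appears in Edgar Allan Poe's famous story "The Gold-Bug", which shows how knowledge of English letter frequencies can be used to solve the substitution cipher in the story and locate the treasure buried by Captain Kidd.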
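Frequency analysis can be illustrated on the simplest kind of substitution cipher, a Caesar shift, in which every letter is moved the same number of places along the alphabet. The Python sketch below tries all 26 shifts and keeps the one whose letter counts are closest, by a chi-squared score, to the English table in this article. This is a deliberately simplified stand-in: the cipher in "The Gold-Bug" is a general substitution and needs more work than this. All function names and the sample plaintext are assumptions made for the example.

```python
import string

# Percentages copied from the English letter-frequency table in this article.
ENGLISH_PERCENT = {
    "a": 8.167, "b": 1.492, "c": 2.782, "d": 4.253, "e": 12.702, "f": 2.228,
    "g": 2.015, "h": 6.094, "i": 6.966, "j": 0.153, "k": 0.772, "l": 4.025,
    "m": 2.406, "n": 6.749, "o": 7.507, "p": 1.929, "q": 0.095, "r": 5.987,
    "s": 6.327, "t": 9.056, "u": 2.758, "v": 0.978, "w": 2.360, "x": 0.150,
    "y": 1.974, "z": 0.074,
}


def shift_text(text, shift):
    """Lower-case the text and move each letter a-z forward by `shift` places."""
    out = []
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)  # spaces and punctuation are left unchanged
    return "".join(out)


def chi_squared(text):
    """Score how far the letter counts of `text` are from typical English."""
    letters = [ch for ch in text if ch in string.ascii_lowercase]
    total = len(letters)
    if total == 0:
        return float("inf")
    score = 0.0
    for ch, percent in ENGLISH_PERCENT.items():
        expected = total * percent / 100
        observed = letters.count(ch)
        score += (observed - expected) ** 2 / expected
    return score


def guess_shift(ciphertext):
    """Return the shift whose decryption looks most like English."""
    return min(range(26), key=lambda k: chi_squared(shift_text(ciphertext, -k)))


plaintext = ("it was the best of times it was the worst of times "
             "it was the age of wisdom it was the age of foolishness")
ciphertext = shift_text(plaintext, 3)
shift = guess_shift(ciphertext)
print(shift, shift_text(ciphertext, -shift))
```

The method needs enough ciphertext for the counts to be representative; on very short messages the lowest score can belong to a wrong shift.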
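Letter frequencies played an important role in the design of some keyboard layouts. The Blickensderfer typewriter places the most frequent letters on the bottom row. The Dvorak keyboard places them on the home (middle) row, where the eight fingers other than the thumbs rest and which is therefore the easiest row to type on.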
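The effect of such a design can be estimated from a frequency table. The sketch below adds up the English percentages from this article for the letters on the Dvorak home row. The QWERTY home row is not discussed in the text and is included only as a familiar point of comparison; the names `HOME_ROWS` and `row_share` are invented for the example.

```python
# Percentages copied from the English letter-frequency table in this article.
ENGLISH_PERCENT = {
    "a": 8.167, "b": 1.492, "c": 2.782, "d": 4.253, "e": 12.702, "f": 2.228,
    "g": 2.015, "h": 6.094, "i": 6.966, "j": 0.153, "k": 0.772, "l": 4.025,
    "m": 2.406, "n": 6.749, "o": 7.507, "p": 1.929, "q": 0.095, "r": 5.987,
    "s": 6.327, "t": 9.056, "u": 2.758, "v": 0.978, "w": 2.360, "x": 0.150,
    "y": 1.974, "z": 0.074,
}

# Letters on the home row of each layout (punctuation keys are left out).
# QWERTY is an addition for comparison and is not part of the text above.
HOME_ROWS = {
    "Dvorak": "aoeuidhtns",
    "QWERTY": "asdfghjkl",
}


def row_share(letters):
    """Sum the English percentages of the given letters."""
    return sum(ENGLISH_PERCENT[ch] for ch in letters)


for layout, letters in HOME_ROWS.items():
    print(f"{layout}: {row_share(letters):.1f}% of letters on the home row")
```

By this count roughly 70% of English letters fall on the Dvorak home row, against roughly 34% for QWERTY. The figures depend on the frequency table used and ignore spaces and punctuation.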
In English
Letter frequency
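The table below is taken from the Algoritmy website. It differs slightly from other published tables: the Math Explorer's Project at Cornell University, for example, obtained a broadly similar table from a count of 40,000 words, and Oxford University Press obtained slightly different percentages from an analysis of the entries in the Concise Oxford Dictionary.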
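In English text, the space occurs slightly more often than the most frequent letter, e (about 107% as often). Non-alphabetic characters (digits, punctuation, and so on) taken together rank fourth (after the space, e, and t), that is, between the letters "T" and "A".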
Letter | Frequency in English |
a | 8.167% |
b | 1.492% |
c | 2.782% |
d | 4.253% |
e | 12.702% |
f | 2.228% |
g | 2.015% |
h | 6.094% |
i | 6.966% |
j | 0.153% |
k | 0.772% |
l | 4.025% |
m | 2.406% |
n | 6.749% |
o | 7.507% |
p | 1.929% |
q | 0.095% |
r | 5.987% |
s | 6.327% |
t | 9.056% |
u | 2.758% |
v | 0.978% |
w | 2.360% |
x | 0.150% |
y | 1.974% |
z | 0.074% |
First-letter frequency
The frequency with which each letter begins a word is as follows:
First letter | Share of words |
a | 11.602% |
b | 4.702% |
c | 3.511% |
d | 2.670% |
e | 2.007% |
f | 3.779% |
g | 1.950% |
h | 7.232% |
i | 6.286% |
j | 0.597% |
k | 0.590% |
l | 2.705% |
m | 4.374% |
n | 2.365% |
o | 6.264% |
p | 2.545% |
q | 0.173% |
r | 1.653% |
s | 7.755% |
t | 16.671% |
u | 1.487% |
v | 0.649% |
w | 6.753% |
x | 0.037% |
y | 1.620% |
z | 0.034% |
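A table of this kind can be produced by counting only the first letter of each word. The Python sketch below does so for a short made-up sentence; the function name `first_letter_frequencies` is hypothetical, and the rule used to split words (runs of the letters a to z, so an apostrophe splits a word in two) is a simplifying assumption.

```python
from collections import Counter
import re


def first_letter_frequencies(text):
    """Return {letter: share of words that begin with that letter}."""
    # A "word" here is any run of the letters a-z after lower-casing.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word[0] for word in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


# A made-up sample; a real table needs a large corpus of running text.
sample = "The time that the author takes over a text is often what sets it apart"
shares = first_letter_frequencies(sample)
for ch, share in sorted(shares.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{ch}: {share:.1%}")
```

Because the count runs over words as they occur in text, very common words weigh heavily, which is consistent with t and a heading the table above.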
Other languages
Table 1
Letter | French | German | Spanish | Portuguese | Esperanto | Italian | Turkish | Swedish | Polish | Dutch | Toki Pona |
a | 7.636% | 6.516% | 12.525% | 14.634% | 12.117% | 11.745% | 11.680% | 9.341% | 11.503% | 7.486% | 17.2% |
b | 0.901% | 1.886% | 2.215% | 1.043% | 0.980% | 0.927% | 2.952% | 1.254% | 1.740% | 1.584% | 0 |
c | 3.260% | 2.732% | 4.139% | 3.882% | 0.776% | 4.501% | 0.970% | 1.213% | 3.895% | 1.242% | 0 |
d | 3.669% | 5.076% | 5.860% | 4.992% | 3.044% | 3.736% | 4.871% | 4.521% | 4.225% | 5.933% | 0 |
e | 14.715% | 17.396% | 13.681% | 12.570% | 8.995% | 11.792% | 9.007% | 9.647% | 8.352% | 18.914% | 7.4% |
f | 1.066% | 1.656% | 0.692% | 1.023% | 1.037% | 1.153% | 0.444% | 1.931% | 0.143% | 0.805% | 0 |
g | 0.866% | 3.009% | 1.768% | 1.303% | 1.171% | 1.644% | 1.340% | 3.269% | 1.731% | 3.403% | 0 |
h | 0.737% | 4.757% | 0.703% | 0.781% | 0.384% | 0.636% | 1.145% | 2.103% | 1.015% | 2.380% | 0 |
i | 7.529% | 7.550% | 6.247% | 6.186% | 10.012% | 11.283% | 8.274% | 7.190% | 9.328% | 6.499% | 14.8% |
j | 0.545% | 0.268% | 0.443% | 0.397% | 3.501% | 0.011% | 0.046% | 0.652% | 1.836% | 1.461% | 3.0% |
k | 0.049% | 1.417% | 0.011% | 0.015% | 4.163% | 0.009% | 4.715% | 3.214% | 2.753% | 2.248% | 5.1% |
l | 5.456% | 3.437% | 4.967% | 2.779% | 6.145% | 6.510% | 5.752% | 5.229% | 3.064% | 3.568% | 10.2% |
m | 2.968% | 2.534% | 3.157% | 4.738% | 2.994% | 2.512% | 3.745% | 3.460% | 2.515% | 2.213% | 4.4% |
n | 7.095% | 9.776% | 6.71% | 5.046% | 7.955% | 6.883% | 7.231% | 8.796% | 6.737% | 10.032% | 11.6% |
o | 5.378% | 2.594% | 8.683% | 10.735% | 8.779% | 9.832% | 2.653% | 4.317% | 7.167% | 6.063% | 7.7% |
p | 2.521% | 0.670% | 2.510% | 2.523% | 2.745% | 3.056% | 0.788% | 1.437% | 2.445% | 1.370% | 3.7% |
q | 1.362% | 0.018% | 0.877% | 1.204% | 0 | 0.505% | 0 | 0.007% | 0 | 0.009% | 0 |
r | 6.553% | 7.003% | 6.871% | 6.530% | 5.914% | 6.367% | 6.948% | 8.309% | 5.743% | 6.411% | 0 |
s | 7.948% | 7.273% | 7.977% | 7.805% | 6.092% | 4.981% | 2.950% | 6.374% | 6.224% | 3.733% | 4.1% |
t | 7.244% | 6.154% | 4.632% | 4.736% | 5.276% | 5.623% | 3.049% | 8.693% | 2.475% | 6.923% | 4.6% |
u | 6.311% | 4.346% | 3.927% | 4.634% | 3.183% | 3.011% | 3.430% | 2.066% | 2.062% | 2.192% | 3.2% |
v | 1.628% | 0.846% | 1.138% | 1.665% | 1.904% | 2.097% | 0.977% | 2.289% | 0 | 1.854% | 0 |
w | 0.074% | 1.921% | 0.017% | 0.037% | 0 | 0.033% | 0.016% | 2.107% | 6.313% | 1.821% | 2.8% |
x | 0.427% | 0.034% | 0.215% | 0.253% | 0 | 0 | 0.007% | 0.103% | 0 | 0.036% | 0 |
y | 0.128% | 0.039% | 1.008% | 0.006% | 0 | 0.020% | 3.371% | 0.601% | 3.206% | 0.035% | 0 |
z | 0.326% | 1.134% | 0.517% | 0.470% | 0.494% | 1.181% | 1.497% | 0.020% | 5.852% | 1.374% | 0 |
à | 0.486% | 0 | 0 | 0.072% | 0 | 0.635% | 0 | 0 | 0 | 0 | 0 |
â | 0.051% | 0 | 0 | 0.562% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
á | 0 | 0 | 0.502% | 0.118% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
å | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.221% | 0 | - | 0 |
ä | 0 | 0.447% | 0 | 0 | 0 | 0 | 0 | 1.809% | 0 | 0 | 0 |
ã | 0 | 0 | 0 | 0.733% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ą | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0.699% | - | 0 |
œ | 0.018% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | - | 0 |
ç | 0.085% | 0 | 0 | 0.530% | 0 | 0 | 0.825% | 0 | 0 | - | 0 |
ĉ | 0 | 0 | 0 | 0 | 0.657% | 0 | 0 | 0 | 0 | - | 0 |
ć | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0.743% | - | 0 |
è | 0.271% | 0 | 0 | 0 | 0 | 0.263% | 0 | 0 | 0 | 0 | 0 |
é | 1.504% | 0 | 0.433% | 0.337% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ê | 0.225% | 0 | 0 | 0.450% | 0 | 0 | 0 | 0 | 0 | - | 0 |
ë | 0.001% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ę | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 1.035% | - | 0 |
ĝ | 0 | 0 | 0 | 0 | 0.691% | 0 | 0 | 0 | 0 | - | 0 |
ğ | 0 | 0 | 0 | 0 | 0 | 0 | 1.129% | 0 | 0 | - | 0 |
ĥ | 0 | 0 | 0 | 0 | 0.022% | 0 | 0 | 0 | 0 | - | 0 |
î | 0.045% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | - | 0 |
ì | 0 | 0 | 0 | 0 | 0 | 0.030% | 0 | 0 | 0 | 0 | 0 |
í | 0 | 0 | 0.725% | 0.132% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ï | 0.005% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ı | 0 | 0 | 0 | 0 | 0 | 0 | 5.199% | 0 | 0 | - | 0 |
ĵ | 0 | 0 | 0 | 0 | 0.055% | 0 | 0 | 0 | 0 | - | 0 |
ł | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 2.109% | - | 0 |
ñ | 0 | 0 | 0.311% | 0 | 0 | 0 | 0 | 0 | 0 | - | 0 |
ń | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0.362% | - | 0 |
ò | 0 | 0 | 0 | 0 | 0 | 0.002% | 0 | 0 | 0 | 0 | 0 |
ö | 0 | 0.573% | 0 | 0 | 0 | 0 | 0.270% | 0.514% | 0 | 0 | 0 |
ô | 0.023% | 0 | 0 | 0.635% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ó | 0 | - | 0.827% | 0.296% | 0 | 0 | 0 | 0 | 1.141% | 0 | 0 |
ŝ | 0 | 0 | 0 | 0 | 0.385% | 0 | 0 | 0 | 0 | - | 0 |
ş | 0 | 0 | 0 | 0 | 0 | 0 | 1.938% | 0 | 0 | - | 0 |
ś | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0.514% | - | 0 |
ß | 0 | 0.307% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | - | 0 |
ù | 0.058% | 0 | 0 | 0 | 0 | 0.166% | 0 | 0 | 0 | 0 | 0 |
ú | 0 | 0 | 0.168% | 0.207% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ŭ | 0 | 0 | 0 | 0 | 0.520% | 0 | 0 | 0 | 0 | - | 0 |
ü | 0 | 0.995% | 0.012% | 0.026% | 0 | 0 | 1.992% | 0 | 0 | 0 | 0 |
ź | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0.078% | - | 0 |
ż | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0.706% | - | 0 |
Table 2
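According to the tables above, the ten most frequent letters in English are etaoi nshrd (the first ten letters of the traditional "etaoin shrdlu" sequence). The corresponding orders for the other languages are as follows: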
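Language | Order | Language family and notes |
French | esait nrulo | Indo-European, Romance; traditionally the more pronounceable order esartinulop is used. |
Spanish | eaosr nidlt | Indo-European, Romance |
Portuguese | aeosr indmt | Indo-European, Romance |
Italian | eaion lrtsc | Indo-European, Romance |
Esperanto | aieon lsrtk | Constructed language based on Indo-European; its vocabulary is drawn mostly from Romance sources, its phonology is essentially Slavic, and it has a few Germanic features. |
German | enisr atdhu | Indo-European, Germanic |
Swedish | eantr isldo | Indo-European, Germanic |
Turkish | aeinr ldkmu | Altaic, Turkic |
Dutch | enati rodsl | Indo-European, Germanic |
Polish | aoien wszrd | Indo-European, Slavic |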
The languages above use broadly similar alphabets of 25 or more letters. Toki Pona, whose order is ainlo ektms, differs from them in using only 14 letters.
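Orders like these follow directly from a frequency column: sort the letters by descending frequency and keep the first ten. The Python sketch below does this for the Toki Pona column of Table 1; the names `TOKI_PONA_PERCENT` and `frequency_order` are invented for the example, and ties (there are none in this column) are broken alphabetically as an arbitrary choice.

```python
# Non-zero entries of the Toki Pona column of Table 1, in percent.
TOKI_PONA_PERCENT = {
    "a": 17.2, "e": 7.4, "i": 14.8, "j": 3.0, "k": 5.1, "l": 10.2, "m": 4.4,
    "n": 11.6, "o": 7.7, "p": 3.7, "s": 4.1, "t": 4.6, "u": 3.2, "w": 2.8,
}


def frequency_order(percent, n=10, group=5):
    """Return the n most frequent letters, written in groups of `group`."""
    # Sort by descending frequency; equal frequencies fall back to a-z order.
    ranked = sorted(percent, key=lambda ch: (-percent[ch], ch))
    top = "".join(ranked[:n])
    return " ".join(top[i:i + group] for i in range(0, len(top), group))


print(frequency_order(TOKI_PONA_PERCENT))  # prints "ainlo ektms"
```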