Natural Language Processing with Python study notes (19): 3.3 Text Processing with Unicode

 

3.3 Text Processing with Unicode

 

Our programs will often need to deal with different languages, and different character sets. The concept of “plain text” is a fiction. If you live in the English-speaking world you probably use ASCII, possibly without realizing it. If you live in Europe you might use one of the extended Latin character sets, containing such characters as “ø” for Danish and Norwegian, “ő” for Hungarian, “ñ” for Spanish and Breton, and “ň” for Czech and Slovak. In this section, we will give an overview of how to use Unicode for processing texts that use non-ASCII character sets.

 

What Is Unicode?

 

Unicode supports over a million characters. Each character is assigned a number, called a code point. In Python, code points are written in the form \uXXXX, where XXXX is the number in four-digit hexadecimal form.
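The relationship between a character and its code point can be sketched in a few lines. This is a minimal illustration that assumes Python 3 (the transcripts in these notes are Python 2), where every str is already a Unicode string; the variable names are only illustrative.

```python
# Python 3 sketch: converting between a character and its code point.
ch = '\u0144'                   # LATIN SMALL LETTER N WITH ACUTE, written as a \uXXXX escape
code_point = ord(ch)            # character -> integer code point
label = 'U+%04X' % code_point   # the conventional U+XXXX notation
same_char = chr(0x0144)         # integer code point -> character

print(code_point)        # 324
print(label)             # U+0144
print(same_char == ch)   # True
```

The four hexadecimal digits in the escape are simply the code point written in base 16, which is why 0x0144 and 324 name the same character.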

 

Within a program, we can manipulate Unicode strings just like normal strings. However, when Unicode characters are stored in files or displayed on a terminal, they must be encoded as a stream of bytes. Some encodings (such as ASCII and Latin-2) use a single byte per code point, so they can support only a small subset of Unicode, enough for a single language. Other encodings (such as UTF-8) use multiple bytes and can represent the full range of Unicode characters.
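The difference between single-byte and multi-byte encodings shows up in the byte counts. The following is a small sketch, assuming Python 3; the sample word is taken from the Polish text used later in this section.

```python
# Python 3 sketch: the same string under three encodings.
text = 'Pa\u0144stwowa'                    # 9 characters, one of them non-ASCII

latin2_bytes = text.encode('iso-8859-2')   # single-byte encoding: 1 byte per character
utf8_bytes = text.encode('utf-8')          # variable width: U+0144 needs 2 bytes

try:
    text.encode('ascii')                   # ASCII has no slot for U+0144
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(text), len(latin2_bytes), len(utf8_bytes))   # 9 9 10
print(ascii_ok)                                        # False
```

Latin-2 can store this word in one byte per character only because it reserves a byte value for ń; a character outside its repertoire would fail in the same way as the ASCII attempt.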

 

Text in files will be in a particular encoding, so we need some mechanism for translating it into Unicode; translation into Unicode is called decoding. Conversely, to write out Unicode to a file or a terminal, we first need to translate it into a suitable encoding; this translation out of Unicode is called encoding, and is illustrated in Figure 3-3.

 

        Figure 3-3. Unicode decoding and encoding
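The two directions can be tried out on a byte string directly. This is a sketch assuming Python 3, where bytes and str are separate types; the byte string stands in for data read from a Latin-2 file.

```python
# Python 3 sketch: decode bytes into Unicode, then encode into a different encoding.
raw_latin2 = b'Niemc\xf3w'               # bytes as they would be stored in a Latin-2 file
text = raw_latin2.decode('iso-8859-2')   # decoding: bytes -> Unicode string
raw_utf8 = text.encode('utf-8')          # encoding: Unicode string -> bytes

print(ascii(text))   # 'Niemc\xf3w'
print(raw_utf8)      # b'Niemc\xc3\xb3w'
```

Decoding with the wrong codec usually does not raise an error for single-byte encodings; it silently yields the wrong characters, which is why the encoding of a file has to be known in advance.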

 

From a Unicode perspective, characters are abstract entities that can be realized as one or more glyphs. Only glyphs can appear on a screen or be printed on paper. A font is a mapping from characters to glyphs.

 

Extracting Encoded Text from Files

 

Let’s assume that we have a small text file, and that we know how it is encoded. For example, polish-lat2.txt, as the name suggests, is a snippet of Polish text (from the Polish Wikipedia; see http://pl.wikipedia.org/wiki/Biblioteka_Pruska ). This file is encoded as Latin-2, also known as ISO-8859-2. The function nltk.data.find() locates the file for us.

 

  >>> path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

 

The Python codecs module provides functions to read encoded data into Unicode strings, and to write out Unicode strings in encoded form. The codecs.open() function takes an encoding parameter to specify the encoding of the file being read or written. So let’s import the codecs module, and call it with the encoding 'latin2' to open our Polish file as Unicode:

 

  >>> import codecs
  >>> f = codecs.open(path, encoding='latin2')

 

 

For a list of encoding parameters allowed by codecs, see http://docs.python.org/lib/standard-encodings.html . Note that we can write Unicode-encoded data to a file using f = codecs.open(path, 'w', encoding='utf-8'). Text read from the file object f will be returned in Unicode. As we pointed out earlier, in order to view this text on a terminal, we need to encode it, using a suitable encoding.
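Writing and reading back with an explicit encoding might look like the sketch below. It assumes Python 3, where the built-in open() accepts the same encoding parameter as codecs.open(); the file name sample-utf8.txt is hypothetical and the file lives in a temporary directory.

```python
# Python 3 sketch: write a Unicode string as UTF-8, then read it back.
import os
import tempfile

text = 'Dolny \u015al\u0105sk'   # 11 characters, two of them non-ASCII

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample-utf8.txt')   # hypothetical file name

    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write(text)                      # encoded to UTF-8 on the way out
    with open(sample_path, encoding='utf-8') as f:
        read_back = f.read()               # decoded to Unicode on the way in
    with open(sample_path, 'rb') as f:
        raw = f.read()                     # the bytes actually stored on disk

print(read_back == text)     # True
print(len(text), len(raw))   # 11 13
```

The file object does the encoding and decoding, so the program only ever handles Unicode strings; the two extra bytes on disk come from the two non-ASCII characters.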

 

The Python-specific encoding unicode_escape is a dummy encoding that converts all non-ASCII characters into their \uXXXX representations. Code points above the ASCII 0–127 range but below 256 are represented in the two-digit form \xXX.

 

  >>> for line in f:
  ...     line = line.strip()
  ...     print line.encode('unicode_escape')
  Pruska Biblioteka Pa\u0144stwowa. Jej dawne zbiory znane pod nazw\u0105
  "Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
  Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y
  odnalezione po 1945 r. na terytorium Polski. Trafi\u0142y do Biblioteki
  Jagiello\u0144skiej w Krakowie, obejmuj\u0105 ponad 500 tys. zabytkowych
  archiwali\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.

 

 

The first line in this output illustrates a Unicode escape string preceded by the \u escape string, namely \u0144. The relevant Unicode character will be displayed on the screen as the glyph ń. In the third line of the preceding example, we see \xf3, which corresponds to the glyph ó, and is within the 128–255 range.
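The same escaping can be reproduced without the corpus file. This sketch assumes Python 3, where encode('unicode_escape') returns a bytes object rather than a printable string, so it is decoded as ASCII before printing.

```python
# Python 3 sketch: unicode_escape on a fragment of the Polish text.
line = 'Niemc\u00f3w pod koniec II wojny \u015bwiatowej'

escaped = line.encode('unicode_escape')   # bytes containing only ASCII characters
printable = escaped.decode('ascii')       # back to str so it prints without the b'' wrapper

print(printable)   # Niemc\xf3w pod koniec II wojny \u015bwiatowej
```

U+00F3 falls below 256 and therefore appears in the two-digit \xf3 form, while U+015B needs the four-digit \u015b form.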

 

In Python, a Unicode string literal can be specified by preceding an ordinary string literal with a u, as in u'hello'. Arbitrary Unicode characters are defined using the \uXXXX escape sequence inside a Unicode string literal. We find the integer ordinal of a character using ord(). For example:

 

  >>> ord('a')
  97

 

 

The hexadecimal four-digit notation for 97 is 0061, so we can define a Unicode string literal with the appropriate escape sequence:


  >>> a = u'\u0061'
  >>> a
  u'a'
  >>> print a
  a

 

Notice that the Python print statement is assuming a default encoding of the Unicode character, namely ASCII. However, ń is outside the ASCII range, so it cannot be printed unless we specify an encoding. In the following example, we have specified that print should use the repr() of the string, which outputs the UTF-8 escape sequences (of the form \xXX) rather than trying to render the glyphs.

 

  >>> nacute = u'\u0144'
  >>> nacute
  u'\u0144'
  >>> nacute_utf = nacute.encode('utf8')
  >>> print repr(nacute_utf)
  '\xc5\x84'

 

 

If your operating system and locale are set up to render UTF-8 encoded characters, you ought to be able to give the Python command print nacute_utf and see ń on your screen.

Note: the Windows command line does not support UTF-8; there, print nacute_utf.decode('utf-8') displays the character.

 

There are many factors determining what glyphs are rendered on your screen. If you are sure that you have the correct encoding, but your Python code is still failing to produce the glyphs you expected, you should also check that you have the necessary fonts installed on your system.

 

The module unicodedata lets us inspect the properties of Unicode characters. In the following example, we select all characters in the third line of our Polish text outside the ASCII range and print their UTF-8 escaped value, followed by their code point integer using the standard Unicode convention (i.e., prefixing the hex digits with U+), followed by their Unicode name.

 


  
  >>> import unicodedata
  >>> lines = codecs.open(path, encoding='latin2').readlines()
  >>> line = lines[2]
  >>> print line.encode('unicode_escape')
  Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y\n
  >>> for c in line:
  ...     if ord(c) > 127:
  ...         print '%r U+%04x %s' % (c.encode('utf8'), ord(c), unicodedata.name(c))
  '\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE
  '\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE
  '\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE
  '\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK
  '\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE

If you replace the %r (which yields the repr() value) by %s in the format string of the preceding code sample, and if your system supports UTF-8, you should see an output like the following:

 

  ó U+00f3 LATIN SMALL LETTER O WITH ACUTE
  ś U+015b LATIN SMALL LETTER S WITH ACUTE
  Ś U+015a LATIN CAPITAL LETTER S WITH ACUTE
  ą U+0105 LATIN SMALL LETTER A WITH OGONEK
  ł U+0142 LATIN SMALL LETTER L WITH STROKE

 

Alternatively, you may need to replace the encoding 'utf8' in the example by 'latin2', again depending on the details of your system.

The next examples illustrate how Python string methods and the re module accept Unicode strings.


 
  
  >>> line.find(u'zosta\u0142y')
  54
  >>> line = line.lower()
  >>> print line.encode('unicode_escape')
  niemc\xf3w pod koniec ii wojny \u015bwiatowej na dolny \u015bl\u0105sk, zosta\u0142y\n
  >>> import re
  >>> m = re.search(u'\u015b\w*', line)
  >>> m.group()
  u'\u015bwiatowej'
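For readers on Python 3, a rough counterpart of the transcript above is sketched here. It assumes the third line of the Polish file is the string shown (copied from the escaped output earlier in this section) rather than reading it from the corpus, and it relies on the fact that \w in a str pattern matches Unicode word characters by default.

```python
# Python 3 sketch: string methods and re on a Unicode string.
import re

line = ('Niemc\u00f3w pod koniec II wojny \u015bwiatowej '
        'na Dolny \u015al\u0105sk, zosta\u0142y\n')

position = line.find('zosta\u0142y')        # index counted in characters, not bytes
lowered = line.lower()                      # lowercases U+015A to U+015B as well
match = re.search('\u015b\\w*', lowered)    # first word starting with U+015B

print(position)               # 54
print(ascii(match.group()))   # '\u015bwiatowej'
```

No u prefix is needed because Python 3 string literals are Unicode by default.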

  

NLTK tokenizers allow Unicode strings as input, and correspondingly yield Unicode strings as output.

 

  >>> nltk.word_tokenize(line)
  [u'niemc\xf3w', u'pod', u'koniec', u'ii', u'wojny', u'\u015bwiatowej',
  u'na', u'dolny', u'\u015bl\u0105sk', u'zosta\u0142y']

 

 

Using Your Local Encoding in Python

 

If you are used to working with characters in a particular local encoding, you probably want to be able to use your standard methods for inputting and editing strings in a Python file. In order to do this, you need to include the string '# -*- coding: <coding> -*-' as the first or second line of your file. Note that <coding> has to be a string like 'latin-1', 'big5', or 'utf-8' (see Figure 3-4). Figure 3-4 also illustrates how regular expressions can use encoded strings.
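Figure 3-4 is not reproduced in these notes. As a stand-in, and not a reconstruction of that figure, here is a minimal sketch of a source file with an encoding declaration. It assumes Python 3 and assumes the file is actually saved as UTF-8; Python 3 already treats source files as UTF-8 by default, so the declaration only changes behavior when another encoding is named.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch. The declaration above must be on line 1 or 2 of the file and
# must name the encoding the file is really saved in.
import re

sent = 'Pruska Biblioteka Państwowa'      # non-ASCII character typed directly
matches = re.findall(r'\w*ń\w*', sent)    # the pattern itself contains a non-ASCII character

print(len(matches))          # 1
print(ascii(matches[0]))     # 'Pa\u0144stwowa'
```

If the declared encoding and the encoding used by the editor disagree, the literals are decoded incorrectly, so the two need to be kept in step.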

 

Reposted from: https://www.cnblogs.com/yuxc/archive/2011/08/06/2129447.html
