
Getting Started with Python: NLTK (Part 1), Installation and Tokenizers

Source: 东饰资讯网

Preface

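I want to use NLTK because I have recently grown fond of writing code in Jupyter (running Jupyter on a server really is a pleasure). Unless I specifically need to handle temporal expressions, I would rather not keep switching between Java and Python for simple natural language processing tasks.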

NLTK Overview and Installation

python
>>> import nltk
>>> nltk.download()

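On a Mac this pops up a dialog box; on CentOS it stays a command-line prompt. Follow the prompts: choose download, then all. Note that you may need to choose config to change settings such as the download directory.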
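If you would rather skip the interactive prompt (on a headless server, say), the download can also be scripted. Below is a minimal sketch, not the only way to do it: the helper name `ensure_nltk_resource` is my own invention, and the resource names `tokenizers/punkt` and `punkt` are assumptions that match the tokenizers used later in this post. Newer NLTK releases may expect `punkt_tab` instead.

```python
def ensure_nltk_resource(resource_path, package_id, download_dir=None):
    """Download an NLTK data package only if it is not already installed.

    Hypothetical helper for illustration. Returns True if a download was
    triggered, False if the resource was already present.
    """
    # Imported lazily so this file still loads where NLTK is not installed.
    import nltk

    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find(resource_path)
        return False
    except LookupError:
        # download_dir=None lets NLTK pick its default data directory.
        # A custom directory must also be listed in nltk.data.path
        # (or in the NLTK_DATA environment variable) to be found later.
        nltk.download(package_id, download_dir=download_dir)
        return True


# Example call (assumed resource names, needs network access):
# ensure_nltk_resource("tokenizers/punkt", "punkt")
```

Downloading a single package this way is much smaller than choosing all, which is handy when you only need the tokenizers.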

Common Operations

  1. Sentence Tokenize
>>> from nltk.tokenize import sent_tokenize
>>> text = "This is the first sentence. This is the second one."  # sample input
>>> sent_tokenize_list = sent_tokenize(text)
>>> # Alternatively, load a pretrained Punkt sentence tokenizer for a given language
>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text)
>>> spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
>>> spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')
  2. Word Tokenize
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('Hello World.')
['Hello', 'World', '.']
>>> word_tokenize("this's a test")
['this', "'s", 'a', 'test']

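word_tokenize is a wrapper function around TreebankWordTokenizer (recent NLTK releases split the text into sentences first and then apply a Treebank-style tokenizer), so the code below gives the same result as the call above.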

>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize("this's a test")
['this', "'s", 'a', 'test']
>>> # PunktWordTokenizer exists only in older NLTK releases; current NLTK 3 releases no longer ship it
>>> from nltk.tokenize import PunktWordTokenizer
>>> punkt_word_tokenizer = PunktWordTokenizer()
>>> punkt_word_tokenizer.tokenize("this's a test")
['this', "'s", 'a', 'test']

And also:

>>> from nltk.tokenize import WordPunctTokenizer
>>> word_punct_tokenizer = WordPunctTokenizer()
>>> word_punct_tokenizer.tokenize("This's a test")
['This', "'", 's', 'a', 'test']
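To see why WordPunctTokenizer breaks the apostrophe off as its own token, it helps to know that it is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. The sketch below reproduces that behavior with only the standard library `re` module; the function name `word_punct_tokenize` is made up for illustration and is not part of NLTK.

```python
import re

# The pattern WordPunctTokenizer is built on: either a run of word
# characters, or a run of characters that are neither word characters
# nor whitespace (that is, punctuation).
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Stand-in for WordPunctTokenizer().tokenize (illustrative only)."""
    return WORD_PUNCT_PATTERN.findall(text)


print(word_punct_tokenize("This's a test"))
# ['This', "'", 's', 'a', 'test']
```

Because punctuation is always cut into a separate token, contractions come out in three pieces here, whereas the Treebank-style tokenizers above keep `'s` together.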