Environment setup | NLP libraries: installation, usage examples, how they work, and troubleshooting
Author: site founder
2024-04-07 01:12:34

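1. Learning spaCy

1.1. Introduction

spaCy is a text-processing library for advanced natural language processing, written in Python and Cython. It builds on recent research and was designed from the start for use in real products. It ships with pretrained statistical models and word vectors and, according to the description cited here, supports tokenization for more than 20 languages. The same description presents its syntactic parser as among the fastest available, and lists convolutional neural network models for tagging, parsing, and named entity recognition, along with integration with deep learning frameworks. It is commercial open-source software released under the MIT license. [1]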

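1.2. Installation

Environment: Windows 10, PyCharm, and an Anaconda virtual environment (take care not to install the same package with both pip and conda).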

pip install spacy -i https://pypi.tuna.tsinghua.edu.cn/simple

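1.3. Usage examples

1.3.1. English tokenization

import spacy

# ---------- English tokenization ----------
# Load the English model (install it first: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
# Run the model by passing it a sentence
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Collect the tokenization result
print([token.text for token in doc])

Result (shown as a screenshot in the original post)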

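1.3.2. Chinese tokenization and word embeddings

# ---------- Chinese tokenization and word embeddings ----------
import spacy

# Load the model and exclude the components that are not needed
nlp1 = spacy.load("zh_core_web_sm", exclude=("tagger", "parser", "senter", "attribute_ruler", "ner"))
# Process the sentences
doc = nlp1("自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。")
# Loop over the tokens and print each one with its vector
for token in doc:
    # Only the first 5 values are shown for readability;
    # the model actually encodes each Chinese word as a 96-dimensional vector
    print(token.text, token.tensor[:5])

Result (shown as a screenshot in the original post)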

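1.3.3. Korean tokenization with part-of-speech and dependency labels

# ---------- Dependency parsing for Korean ----------
# Download the Korean model inside the virtual environment first:
#   python -m spacy download ko_core_news_sm
import spacy
from spacy.lang.ko.examples import sentences

nlp2 = spacy.load("ko_core_news_sm")
doc = nlp2(sentences[0])
print(doc.text)
# Print each token with its part-of-speech tag and dependency label
for token in doc:
    print(token.text, token.pos_, token.dep_)

Result (shown as a screenshot in the original post)

See also [2].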

1.3.4. Detecting entities and entity types in English text

import spacy

# Load the English NLP model
nlp = spacy.load('en_core_web_sm')
# The text we want to examine
text = """London is the capital and most populous city of England and
the United Kingdom. Standing on the River Thames in the south east
of the island of Great Britain, London has been a major settlement
for two millennia. It was founded by the Romans, who named it Londinium.
"""
# Parse the text with spaCy. This runs the entire pipeline.
doc = nlp(text)
# 'doc' now contains a parsed version of text. We can use it to do anything we want!
# For example, this will print out all the named entities that were detected:
for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")

This prints the named entities detected in the document, each followed by its entity type.

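1.3.5. Word and text similarity

import spacy

# Download the large model first: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

# Semantic similarity (relatedness) between words
banana = nlp.vocab['banana']
dog = nlp.vocab['dog']
fruit = nlp.vocab['fruit']
animal = nlp.vocab['animal']
# The numbers in the trailing comments are the values reported in the original post
print(dog.similarity(animal), dog.similarity(fruit))  # 0.6618534 0.23552845
print(banana.similarity(fruit), banana.similarity(animal))  # 0.67148364 0.2427285

# Semantic similarity (relatedness) between texts
target = nlp("Cats are beautiful animals.")
doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")
# Compare the target sentence with each of the other three
print(target.similarity(doc1), target.similarity(doc2), target.similarity(doc3))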
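By default, the similarity scores above are cosine similarities between vectors, and a document's vector is the average of its token vectors. The following is a minimal pure-Python sketch of that computation. The three-dimensional vectors are made-up values for illustration, not real spaCy embeddings, and cosine_similarity and mean_vector are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise average, used here as a stand-in for a document vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Made-up toy vectors (assumption: not taken from any trained model)
toy = {
    "dog": [0.9, 0.1, 0.0],
    "animal": [0.8, 0.2, 0.1],
    "fruit": [0.1, 0.9, 0.2],
}

# Word-to-word comparison
print(cosine_similarity(toy["dog"], toy["animal"]))
print(cosine_similarity(toy["dog"], toy["fruit"]))

# "Document"-to-word comparison: average the word vectors first
doc_vector = mean_vector([toy["dog"], toy["animal"]])
print(cosine_similarity(doc_vector, toy["fruit"]))
```

Because the score depends only on the angle between vectors, it says how related two items are in the embedding space, not whether they mean the same thing.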

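1.4. How it works

Components: tok2vec, tagger, morphologizer, parser, lemmatizer (trainable_lemmatizer), senter, ner.

spaCy's processing pipeline

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps, which together are called the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser, and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.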
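The hand-off between components, where each one receives the document and returns it for the next, can be sketched in plain Python. This is a toy illustration of the idea only, not spaCy's implementation; ToyDoc, ToyPipeline, and the two components are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Stand-in for a document: raw text plus annotations added along the way."""
    text: str
    tokens: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def whitespace_tokenizer(text):
    """First step: turn raw text into a ToyDoc (real tokenizers are far more careful)."""
    return ToyDoc(text=text, tokens=text.split())


def length_tagger(doc):
    """Example component: record the length of every token, then return the doc."""
    doc.annotations["lengths"] = [len(token) for token in doc.tokens]
    return doc


def capital_finder(doc):
    """Example component: record the tokens that start with an uppercase letter."""
    doc.annotations["capitalized"] = [t for t in doc.tokens if t[:1].isupper()]
    return doc


class ToyPipeline:
    def __init__(self, tokenizer, components):
        self.tokenizer = tokenizer
        self.components = list(components)  # (name, callable) pairs, run in order

    @property
    def pipe_names(self):
        return [name for name, _ in self.components]

    def __call__(self, text):
        doc = self.tokenizer(text)            # tokenize first
        for _, component in self.components:  # then pass the doc down the line
            doc = component(doc)
        return doc


nlp = ToyPipeline(
    whitespace_tokenizer,
    [("length_tagger", length_tagger), ("capital_finder", capital_finder)],
)
doc = nlp("London is the capital of England")
print(nlp.pipe_names)
print(doc.tokens)
print(doc.annotations)
```

In spaCy itself, the names of the components in a loaded pipeline can be inspected through nlp.pipe_names.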

tok2vec:

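1.5. Troubleshooting

Error 1

After pip install spacy, running the script fails because spacy.load() cannot be found (typically an AttributeError saying that module 'spacy' has no attribute 'load').

Uninstall spaCy: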

pip uninstall spacy

Then reinstall it:

pip install spacy -i https://pypi.tuna.tsinghua.edu.cn/simple

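Cause: the error comes from naming the script file spacy.py. The statement import spacy then resolves to that local file instead of the installed library, which is a naming conflict. If this is the cause, reinstalling alone will not resolve it.

Fix: rename spacy.py; a script must not share its name with the spacy library.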
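One way to confirm this kind of name clash is to look for a local file or folder with the library's name and to check which file the import would resolve to. The sketch below uses only the standard library; local_shadow and module_origin are hypothetical helper names introduced for illustration.

```python
import importlib.util
from pathlib import Path


def local_shadow(module_name, directory="."):
    """Return a local file or folder that would shadow module_name, or None."""
    base = Path(directory)
    for candidate in (base / f"{module_name}.py", base / module_name):
        if candidate.exists():
            return candidate
    return None


def module_origin(module_name):
    """Return the file an import of module_name would load, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


# Run this from the folder that contains the failing script
print("local file shadowing spacy:", local_shadow("spacy"))
print("import spacy resolves to:", module_origin("spacy"))
```

If the second line prints a path inside the project folder rather than inside site-packages, the local file is the one being imported.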

Error 2

Running python -m spacy download en_core_web_sm produces the following error:

E:\Anaconda3\envs\tf24\lib\site-packages\h5py\__init__.py:39: UserWarning: h5py is running against HDF5 1.10.5 when it was built against 1.10.6, this may cause problems
  '{0}.{1}.{2}'.format(*version.hdf5_built_version_tuple)
Warning! ***HDF5 library version mismatched error***
The HDF5 header files used to compile this application do not match
the version used by the HDF5 library to which this application is linked.
Data corruption or segmentation faults may occur if the application continues.
This can happen when an application was compiled by one version of HDF5 but
linked with a different version of static or shared HDF5 library.
You should recompile the application or check your shared library related
settings such as 'LD_LIBRARY_PATH'.
You can, at your own risk, disable this warning by setting the environment
variable 'HDF5_DISABLE_VERSION_CHECK' to a value of '1'.
Setting it to 2 or higher will suppress the warning messages totally.
Headers are 1.10.6, library is 1.10.5

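Likely cause: PyCharm upgraded the library to a newer version, so the installed h5py no longer matches the HDF5 version it was built against.

Fix (my versions: h5py 2.10.0, tensorflow 2.4.0, Python 3.7):

Uninstall: pip uninstall h5py

Install: pip install h5py==2.10.0

After this change the command succeeded.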
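To verify that the environment really has the pinned versions, the installed package versions can be compared with the expected ones. This is a minimal sketch using importlib.metadata, which requires Python 3.8 or later; on Python 3.7, as used above, pip show h5py gives the same information. The helper names installed_version and check_pins are hypothetical, and the check covers only Python package versions, not the HDF5 C library itself.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins):
    """Return (package, expected, found) for every package that does not match its pin."""
    mismatches = []
    for package, expected in pins.items():
        found = installed_version(package)
        if found != expected:
            mismatches.append((package, expected, found))
    return mismatches


# Versions taken from the fix described above
expected_versions = {"h5py": "2.10.0", "tensorflow": "2.4.0"}
for package, want, found in check_pins(expected_versions):
    print(f"{package}: expected {want}, found {found or 'not installed'}")
```

An empty output means every listed package matches its pinned version.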

2. Learning textacy

textacy is a Python library for a variety of natural language processing tasks. It is built on the high-performance spaCy library and implements several common information-extraction algorithms on top of it.

Example

import spacy
import textacy.extract

# Load the small English NLP model
nlp = spacy.load('en_core_web_sm')
# The text we want to examine
text = """London is the capital and most populous city of England and the United Kingdom.
Standing on the River Thames in the south east of the island of Great Britain,
London has been a major settlement for two millennia. It was founded by the Romans,
who named it Londinium.
"""
# Parse the document with spaCy
doc = nlp(text)
# Extract semi-structured statements
# Note: with the textacy version used here, this positional call raises the TypeError listed under Error 1 below
statements = textacy.extract.semistructured_statements(doc, "London")
# Print the results
print("Here are the things I know about London:")
for statement in statements:
    subject, verb, fact = statement
    print(f" - {fact}")

Error 1

 Traceback (most recent call last):
  File "G:/NLP/bert-master/bert-master/nlpbase/textacypre.py", line 18, in
    statements = textacy.extract.semistructured_statements(doc, "London")
TypeError: semistructured_statements() takes 1 positional argument but 2 were given
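Python raises the message "takes 1 positional argument but 2 were given" when a second value is passed positionally to a function whose remaining parameters are keyword-only. The sketch below reproduces this with a hypothetical stand-in function, statements_stub, and does not call textacy itself. As far as I know, recent textacy releases declare entity and cue as keyword-only parameters, so a call along the lines of textacy.extract.semistructured_statements(doc, entity="London", cue="be") is the likely fix. Treat that as an assumption and check it against the documentation for the installed version.

```python
def statements_stub(doclike, *, entity, cue):
    """Hypothetical stand-in: every parameter after `*` must be passed by keyword."""
    return [(entity, cue, f"fact extracted from {doclike!r}")]


# Positional call, shaped like the one in the traceback: this fails
message = ""
try:
    statements_stub("some doc", "London")
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# Keyword call: this works
result = statements_stub("some doc", entity="London", cue="be")
for subject, verb, fact in result:
    print(subject, verb, fact)
```

The stub returns three-item tuples only so that the unpacking loop from the example above still applies; what the real function yields should be checked in the textacy documentation.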

References

[1] Trained Models & Pipelines · spaCy Models Documentation

[2] eunjeon / mecab-ko / README.md - Bitbucket (bitbucket.org)

[3] English text-processing toolkit: spaCy - Jianshu (jianshu.com)
