
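# A Study of the Effect of Sentiment Data on an LSTM Stock Prediction Model

Author: 丁纪翔

Published: 06/28/2021

**Abstract:** This study examines how structured sentiment features affect an LSTM stock prediction model. The supplied data is preprocessed with Pandas (loading, cleaning and preparation, wrangling, time-series handling, aggregation) [1]. Unstructured text is scored for sentiment with NLTK and the LM financial word lists, and the resulting structured features are merged into stock data that otherwise contains only technical indicators. The correlation between the stock indicators is analysed to reduce dimensionality. A Keras-based LSTM model, evaluated with MSE, predicts the closing price (Close). The conclusion is that, when training samples are sufficient, adding sentiment features moderately improves prediction accuracy.

**Assignment description:** Design a method for predicting stock prices and demonstrate its effectiveness with a real example. All of the supplied data must be used. Note that the data needs cleaning and the features should be combined; additional resources or data may be added.

**Description of the supplied data:**

全标题 (full titles)

- a) Analysis articles about individual companies, published on a stock platform
- b) 标题: article title
- c) 字段1_链接_链接: URL of the original article
- d) ABOUT: companies the article targets, abbreviated; multiple companies are separated by commas
- e) TIME: publication time
- f) AUTHOR: author
- g) COMMENTS: number of comments at collection time

摘要 (abstracts)

- a) The summary part of the analysis articles, corresponding to the entries in 全标题
- b) 标题: article title
- c) 字段2: publication time
- d) 字段5: companies targeted and companies mentioned
  - i. About: targeted companies, extracted as upper-case abbreviations, comma-separated
  - ii. include: other companies mentioned, extracted as upper-case abbreviations, comma-separated
- e) 字段1: full text of the summary

回帖 (replies)

- a) Replies posted by users under each article
- b) Title: article title; an empty title takes the nearest non-empty title below it
- c) Content: full text of the reply

论坛 (forum)

- a) Posts in which users comment on a company on that company's forum page
- b) 字段1: author
- c) 字段2: posting date
- d) 字段3: post content
- e) 字段4_链接: URL of the company page

股票价格 (stock prices)

- a) Business-day stock prices of each company
- b) PERMNO: company number
- c) Date: date
- d) TICKER: company abbreviation
- e) COMNAM: full company name
- f) BIDLO: lowest price
- g) ASKHI: highest price
- h) PRC: closing price
- i) VOL: volume
- j) OPENPRC: opening price

**Contents**

- 1 LSTM
  - 1.1 What is LSTM
  - 1.2 Why LSTM was chosen
- 2 Deep learning terminology
  - 2.1 Why use more than one epoch
  - 2.2 Batch and Batch_Size
  - 2.3 Iterations
  - 2.4 Why not shuffle
- 3 Experiment
  - 3.1 Library imports
  - 3.2 Core pandas settings
  - 3.3 Data loading, cleaning and preparation, wrangling, time-series handling
    - 3.3.1 股票价格.csv
    - 3.3.2 论坛.csv
    - 3.3.3 全标题.xlsx
    - 3.3.4 摘要.xlsx
    - 3.3.5 回帖
  - 3.4 Sentiment analysis
    - 3.4.1 Approach
    - 3.4.2 Importing the word lists and adding stop words
    - 3.4.3 Function definitions
    - 3.4.4 Running the sentiment analysis
    - 3.4.5 Aggregating the sentiment features
  - 3.5 \* Correlation analysis of stock indicators with sentiment data
    - 3.5.1 Joining the data
    - 3.5.2 pairplot
    - 3.5.3 Correlation analysis of the stock indicators
  - 3.6 LSTM prediction on stock data with sentiment features
    - 3.6.1 Defining the time-series-to-supervised function
    - 3.6.2 Normalising the stock data with sentiment
    - 3.6.3 Building the supervised dataset from the time series
    - 3.6.4 Train/validation split
    - 3.6.5 Building the Keras LSTM model
    - 3.6.5 (1) Reshaping the LSTM input X
    - 3.6.5 (2) Building the LSTM model and plotting the loss
    - 3.6.6 Predicting and inverting the normalisation
    - 3.6.7 Model evaluation
  - 3.7 Comparison experiment: predicting from technical indicators only
    - 3.7.1 General-purpose functions for the comparison workflow
    - 3.7.2 Analysis of the comparison results
    - 3.7.3 Conclusion of the comparison
  - 3.8 Supplementary comparison: predicting with a larger AAPL technical-indicator sample
    - 3.8.1 Data acquisition
    - 3.8.2 Data processing
    - 3.8.3 Prediction
    - 3.8.4 Result analysis
  - 3.9 Prediction on full-year 2018 stock data with sentiment features
    - 3.9.1 Aggregating the sentiment features
    - 3.9.2 Prediction
    - 3.9.3 Result analysis
- 4 Conclusion and summary
- 5 References

**Core idea:** Use an LSTM model for the time-series prediction of stock data, and the NLTK library for sentiment analysis of text.

**Basic premise:** History repeats itself. All work here rests on the assumption that stock movements are not completely random: they are constrained by regularities of human psychology, so that people facing similar situations react similarly, guided by past experience. Historical data can therefore be used to predict the future trend of a stock. Among the technical indicators, the closing price is the price at the end of the day and the reference point for the next day's opening, linking two consecutive days, and is therefore treated as the most important one. [2]

**Influencing factors:** Besides the basic technical indicators, the stock price is closely related to the mood of investors and to the sentiment of analysis articles about the stock.

**Method:** Combine the technical indicators with the sentiment expressed by the investing public [3], and predict the stock price (the closing price) of the single stock AAPL. Two LSTM models are built, one on samples with technical indicators only and one on samples with technical indicators plus sentiment, both trained with MSE (mean squared error) as the loss function, and their predictions are compared.

## 1 LSTM

### 1.1 What is LSTM

LSTM networks (Long Short-Term Memory, Hochreiter 1997) are a special kind of RNN that can learn long-range dependencies and retain information from further back in the history.

### 1.2 Why LSTM was chosen

- Deep Neural Networks (DNN) take several inputs and produce one output. In the simplest case a linear relation between inputs and output is learned and then passed through a neuron activation function, giving a result of 1 or -1. DNNs do not handle time-series data well.
- Recurrent Neural Networks (RNN) handle sequential information better, but they cannot remember long time series, and standard RNNs are hard to train: from a given initialisation, convergence is difficult.
- LSTM addresses these shortcomings. Compared with an RNN, it adds a Forget Gate Layer that selectively forgets the input passed on from the previous node, and then selects the important input to remember; in short, "forget the unimportant, remember the important". This largely mitigates the vanishing and exploding gradient problems that RNNs suffer from on long sequences, so LSTM performs better when trained on long sequences.

For these reasons LSTM is used as the training model for the stock time-series data.

## 2 Deep learning terminology

| Words | Definitions |
| --- | --- |
| Epoch | One complete pass of training over all the data in the training set, called "one generation of training". It includes forward propagation and back-propagation over every sample. |
| Batch | A small subset of the training samples used for one back-propagation update of the model weights, called "one batch of data". |
| Iteration | The process of updating the model parameters once with one Batch, called "one iteration". |

[Source1] https://www.jianshu.com/p/22c50ded4cf7?fromgroupmessage

### 2.1 Why use more than one epoch

Passing the complete dataset through the network once is not enough; it needs to be passed through several times. As the number of epochs grows, so does the number of weight updates, and the fitted curve moves from underfitting towards overfitting. Normally the full sample is shuffled after each epoch before the next round of training (this experiment does not shuffle). The right number of epochs differs from dataset to dataset.

### 2.2 Batch and Batch_Size

Most deep learning frameworks today use Mini-batch Gradient Descent: the data is split into several batches of Batch_Size samples each, the weights are updated batch by batch, and the samples in one batch jointly determine the direction of that gradient step.

$$
\text{Number of Batches} = \frac{\text{Training Set Size}}{\text{Batch Size}}
$$

On large datasets, mini-batch gradient descent avoids both the high computational cost and slowness of Batch Gradient Descent and the randomness and poor convergence of Stochastic Gradient Descent.

[Source2] https://blog.csdn.net/dancing_power/article/details/97015723

### 2.3 Iterations

One iteration performs one forward propagation and one back-propagation. Forward propagation computes the prediction y from the attributes X. Back-propagation updates the parameter weights according to the given loss function. Within one epoch:

$$
\text{Number of Iterations} = \text{Number of Batches}
$$

### 2.4 Why not shuffle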
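Shuffling prevents the order in which data is fed in from influencing training, adds randomness, and improves generalisation. For this stock price prediction, however, the LSTM model relies on temporal order, so `shuffle=False` is set and the batches are used in chronological order to update the parameters.

## 3 Experiment

All experiments below predict and analyse the stock of Apple, Inc. (AAPL).

```python
CORPORATIONABBR = 'AAPL'
```

### 3.1 Library imports

```python
# Core data-analysis libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# Time-series handling
from datetime import datetime
from dateutil.parser import parse as dt_parse
# Regular expressions
import re
# os
from os import listdir
# NLTK natural language processing
import nltk
from nltk.corpus import stopwords
# seaborn pair-plot matrix
from seaborn import pairplot
# sklearn normalisation and train/test split
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# Keras LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
# sklearn MSE
from sklearn.metrics import mean_squared_error
```

### 3.2 Core pandas settings

```python
# Maximum number of displayed rows and columns, and maximum column width
pd.set_option('display.max_rows', 6)
pd.set_option('display.max_columns', 999)
pd.set_option('display.max_colwidth', 50)
```

### 3.3 Data loading, cleaning and preparation, wrangling, time-series handling

#### 3.3.1 股票价格.csv

```python
sharePrices = pd.read_csv('股票价格.csv')
sharePrices
```

| | PERMNO | date | TICKER | COMNAM | BIDLO | ASKHI | PRC | VOL | OPENPRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10026 | 20180702 | JJSF | J J SNACK FOODS CORP | 150.70000 | 153.27499 | 152.92000 | 100388.0 | 152.17999 |
| 1 | 10026 | 20180703 | JJSF | J J SNACK FOODS CORP | 151.35001 | 153.73000 | 153.32001 | 55547.0 | 153.67000 |
| 2 | 10026 | 20180705 | JJSF | J J SNACK FOODS CORP | 152.46001 | 156.00000 | 155.81000 | 199370.0 | 153.95000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 941515 | 93436 | 20181227 | TSLA | TESLA INC | 301.50000 | 322.17169 | 316.13000 | 8575133.0 | 319.84000 |
| 941516 | 93436 | 20181228 | TSLA | TESLA INC | 318.41000 | 336.23999 | 333.87000 | 9938992.0 | 323.10001 |
| 941517 | 93436 | 20181231 | TSLA | TESLA INC | 325.26001 | 339.20999 | 332.79999 | 6302338.0 | 337.79001 |

941518 rows × 9 columns

**Filtering:** keep the rows whose TICKER (company abbreviation) is AAPL.

```python
# .copy() avoids modifying a view of sharePrices in the in-place steps below
sharePricesAAPL = sharePrices[sharePrices['TICKER'] == CORPORATIONABBR].copy()
```

**Dropping columns:** PERMNO (company number), COMNAM (full name) and TICKER (abbreviation) are not needed.

```python
sharePricesAAPL.drop(['PERMNO', 'COMNAM', 'TICKER'], axis=1, inplace=True)
```

**Data type check:** make sure the price and volume columns are float.

```python
sharePricesAAPL.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 126 entries, 163028 to 163153
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date     126 non-null    int64
 1   BIDLO    126 non-null    float64
 2   ASKHI    126 non-null    float64
 3   PRC      126 non-null    float64
 4   VOL      126 non-null    float64
 5   OPENPRC  126 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 6.9 KB
```

**Uniqueness check:** verify that date contains no duplicates.

```python
sharePricesAAPL['date'].is_unique
```

```
True
```

**Time series:** convert date into a time-series index and sort by it in ascending order.

```python
# Convert the date column to datetime
sharePricesAAPL['date'] = sharePricesAAPL['date'].apply(lambda dt: datetime.strptime(str(dt), '%Y%m%d'))
# Use date as the index
sharePricesAAPL.set_index('date', inplace=True)
# Sort by date, ascending
sharePricesAAPL.sort_values(by='date', inplace=True, ascending=True)
```

| date | BIDLO | ASKHI | PRC | VOL | OPENPRC |
| --- | --- | --- | --- | --- | --- |
| 2018-07-02 | 183.42000 | 187.30 | 187.17999 | 17612113.0 | 183.82001 |
| 2018-07-03 | 183.53999 | 187.95 | 183.92000 | 13909764.0 | 187.78999 |
| 2018-07-05 | 184.28000 | 186.41 | 185.39999 | 16592763.0 | 185.25999 |
| ... | ... | ... | ... | ... | ... |
| 2018-12-27 | 150.07001 | 156.77 | 156.14999 | 53117005.0 | 155.84000 |
| 2018-12-28 | 154.55000 | 158.52 | 156.23000 | 42291347.0 | 157.50000 |
| 2018-12-31 | 156.48000 | 159.36 | 157.74001 | 35003466.0 | 158.53000 |

126 rows × 5 columns

**Missing values:** the per-column missing ratio of the AAPL technical indicators shows no missing data. Had there been any, rows with missing BIDLO (low), ASKHI (high), PRC (close) or VOL (volume) could simply be deleted, and missing OPENPRC (open) values could be filled by Lagrange interpolation. In fact, later analysis of 股票价格.csv shows that missing entries always occur together in the same row, so `df.dropna()`, which removes every row with any missing entry, is sufficient.

```python
sharePricesAAPL.isnull().mean()
```

```
BIDLO      0.0
ASKHI      0.0
PRC        0.0
VOL        0.0
OPENPRC    0.0
dtype: float64
```

**Renaming and reordering:** rename the columns for later convenience (BIDLO to low, ASKHI to high, PRC to close, VOL to vol, OPENPRC to open) and reorder them as open, high, low, vol, close.

```python
# rename
AAPL_newIndex = {'BIDLO': 'low', 'ASKHI': 'high', 'PRC': 'close', 'VOL': 'vol', 'OPENPRC': 'open'}
sharePricesAAPL.rename(columns=AAPL_newIndex, inplace=True)
# reindex
AAPL_newColOrder = ['open', 'high', 'low', 'vol', 'close']
sharePricesAAPL = sharePricesAAPL.reindex(columns=AAPL_newColOrder)
```

**Outlier check:** no outliers were found.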
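The outlier check mentioned above can also be made explicit with a rule. The sketch below uses an interquartile-range (IQR) test; it is only an illustration under assumptions that are not in the original experiment: the 1.5×IQR threshold, the helper name `find_outliers`, and the sample prices (including the injected anomaly) are all hypothetical.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Illustrative closing prices only; the last value is an injected anomaly.
sample_close = [183.9, 185.4, 187.2, 156.1, 156.2, 157.7, 1000.0]
print(find_outliers(sample_close))
```

A rule like this is only a first filter: for price series, a large move can be genuine, so flagged values would still need to be inspected before being removed.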
```python
sharePricesAAPL.describe()
```

| | open | high | low | vol | close |
| --- | --- | --- | --- | --- | --- |
| count | 126.000000 | 126.000000 | 126.000000 | 1.260000e+02 | 126.000000 |
| mean | 201.247420 | 203.380885 | 198.893344 | 3.510172e+07 | 201.106033 |
| std | 21.368524 | 21.499932 | 21.596966 | 1.577876e+07 | 21.663971 |
| ... | ... | ... | ... | ... | ... |
| 50% | 207.320000 | 209.375000 | 205.785150 | 3.234006e+07 | 207.760005 |
| 75% | 219.155000 | 222.172503 | 216.798175 | 4.188390e+07 | 219.602500 |
| max | 230.780000 | 233.470000 | 229.780000 | 9.624355e+07 | 232.070010 |

8 rows × 5 columns

**Storage:** save the processed data as AAPL股票价格.csv in the 补充数据1925102007 folder for later use.

```python
sharePricesAAPL.to_csv('补充数据1925102007/AAPL股票价格.csv')
```

#### 3.3.2 论坛.csv

| | 字段1 | 字段2 | 字段3 | 字段4_链接 |
| --- | --- | --- | --- | --- |
| 0 | ComputerBlue | 31-Dec-18 | Lets create a small spec POS portfolio $COTY ... | https://seekingalpha.com/symbol/COTY |
| 1 | Darren McCammon | 31-Dec-18 | $RICK Now that weve reported results, well ... | https://seekingalpha.com/symbol/RICK |
| 2 | Jonathan Cooper | 31-Dec-18 | Do any $APHA shareholders support the $GGB tak... | https://seekingalpha.com/symbol/APHA |
| ... | ... | ... | ... | ... |
| 25114 | Power Hedge | 1-Jan-18 | USD Expected to Collapse in 2018 https://goo.g... | https://goo.gl/RG1CDd |
| 25115 | Norman Tweed | 1-Jan-18 | Happy New Year everyone! Im adding to $MORL ... | https://seekingalpha.com/symbol/MORL |
| 25116 | User 40986305 | 1-Jan-18 | Jamie Diamond says Trump is most pro business ... | NaN |

25117 rows × 4 columns

**Missing values:** delete the rows in which 字段4 (the URL of the company page) is missing.

```python
forum = pd.read_csv('论坛.csv')
forum.dropna(inplace=True)
```

**String operations and regular expressions:** in 字段4, the part of the URL after seekingalpha.com/symbol/ is the company abbreviation. It is extracted with pandas string operations and a regular expression; rows for which extraction fails get a missing value and are removed by the filter below. The content of 字段4 is replaced by the abbreviation.

```python
forum_regExp = re.compile(r'seekingalpha\.com/symbol/([A-Z]+)')

def forumAbbr(link):
    # Return the abbreviation if found, otherwise a missing value
    res = forum_regExp.search(link)
    return np.nan if res is None else res.group(1)

forum['字段4_链接'] = forum['字段4_链接'].apply(forumAbbr)
```

- **Filtering:** keep all comments whose company abbreviation is AAPL.
- **Dropping columns:** 字段1 (author name) carries no useful information and is deleted, together with 字段4_链接 once the filter has been applied.
- **Renaming:** 字段3 (post content) becomes remark.
- **Time series:** 字段2 is converted to datetime and named date.

```python
# Filter
forum = forum[forum['字段4_链接'] == CORPORATIONABBR]
# Drop columns
forum.drop(['字段1', '字段4_链接'], axis=1, inplace=True)
# Rename
AAPL_newIndex_forum = {'字段2': 'date', '字段3': 'remark'}
forum.rename(columns=AAPL_newIndex_forum, inplace=True)
# Time series
forum['date'] = forum['date'].apply(lambda dt: datetime.strptime(str(dt), '%d-%b-%y'))
```

**Filtering URLs out of the comments:** some comments contain URLs; they are removed with a regular expression so that they do not affect the later sentiment analysis.

```python
forum_regExp_linkFilter = re.compile(r'(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?')
forum['remark'] = forum['remark'].apply(lambda x: forum_regExp_linkFilter.sub('', x))
forum
```

| | date | remark |
| --- | --- | --- |
| 204 | 2018-12-26 | Many Chinese companies are encouraging their e... |
| 418 | 2018-12-21 | This Week in Germany \| Apple Smashed $AAP... |
| 471 | 2018-12-21 | $AAPL gets hit with another partial ban in Ger... |
| ... | ... | ... |
| 24702 | 2018-01-05 | $AAPL. Claims by GHH is 200 billion repatriati... |
| 24902 | 2018-01-03 | $AAPL Barclays says battery replacement could ... |
| 25083 | 2018-01-02 | 2018 will be the year for $AAPL to hit the 1 t... |

330 rows × 2 columns

The stop word AAPL should also be added for the sentiment analysis.

**Storage:** save as 补充数据1925102007/AAPL论坛.csv.

```python
# Save
forum.to_csv('补充数据1925102007/AAPL论坛.csv', index=False)
```

#### 3.3.3 全标题.xlsx

| | 标题 | 字段1_链接_链接 | ABOUT | TIME | AUTHOR | COMMENTS | Unnamed: 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Micron Technology: Insanely Cheap Stock Given ... | https://seekingalpha.com/article/4230920-micro... | MU | Dec. 31, 2018, 7:57 PM | Ruerd Heeg | 75 Comments | NaN |
| 1 | Molson Coors Seems Attractive At These Valuations | https://seekingalpha.com/article/4230922-molso... | TAP | Dec. 31, 2018, 7:44 PM | Sanjit Deepalam | 16 Comments | NaN |
| 2 | Gerdau: The Brazilian Play On U.S. Steel | https://seekingalpha.com/article/4230917-gerda... | GGB | Dec. 31, 2018, 7:10 PM | Shannon Bruce | 1 Comment | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 17925 | Big Changes For Centurylink, ATT And Verizon ... | https://seekingalpha.com/article/4134687-big-c... | CTL, T, VZ | Jan. 1, 2018, 5:38 AM | EconDad | 32 Comments | NaN |
| 17926 | UPS: If The Founders Were Alive Today | https://seekingalpha.com/article/4134684-ups-f... | UPS | Jan. 1, 2018, 5:11 AM | Roger Gaebel | 15 Comments | NaN |
| 17927 | U.S. Silica - Buying The Dip Of This Booming C... | https://seekingalpha.com/article/4134664-u-s-s... | SLCA | Jan. 1, 2018, 12:20 AM | The Value Investor | 27 Comments | NaN |

17928 rows × 7 columns

- **Filtering:** keep the title rows whose ABOUT is AAPL.
- **Dropping columns:** delete 字段1_链接_链接, ABOUT, AUTHOR, COMMENTS and Unnamed: 6.
- **Renaming:** 标题 becomes title and TIME becomes date.
- **Time series:** convert date into a time-series index and sort by it in ascending order.
- **Storage:** save as 补充数据1925102007/AAPL全标题.csv.

```python
allTitles = pd.read_excel('全标题.xlsx')
# Filter
allTitles = allTitles[allTitles['ABOUT'] == CORPORATIONABBR]
# Drop columns
allTitles.drop(['字段1_链接_链接', 'ABOUT', 'AUTHOR', 'COMMENTS', 'Unnamed: 6'], axis=1, inplace=True)
# Rename
AAPL_newIndex_allTitles = {'标题': 'title', 'TIME': 'date'}
allTitles.rename(columns=AAPL_newIndex_allTitles, inplace=True)
# Time series
# The date formats are not uniform, so dateutil's parser.parse is used to recognise them
allTitles['date'] = allTitles['date'].apply(lambda dt: dt_parse(dt))
# Use date as the index
allTitles.set_index('date', inplace=True)
# Sort by date, ascending
allTitles.sort_values(by='date', inplace=True, ascending=True)
# Save
allTitles.to_csv('补充数据1925102007/AAPL全标题.csv')
allTitles
```

| date | title |
| --- | --- |
| 2018-01-04 10:12:00 | Apple Ia Above A Golden Cross And Has A Posi... |
| 2018-01-08 10:59:00 | Apple Cash: What Would Warren Buffett Say? |
| 2018-01-16 06:34:00 | Apples iPhone Battery Replacement Could Consu... |
| ... | ... |
| 2018-12-31 08:52:00 | Will Apple Beat Its Guidance? |
| 2018-12-31 17:12:00 | How Much Stock Could Apple Have Repurchased In... |
| 2018-12-31 17:36:00 | Will Apple Get Its Mojo Back? |

204 rows × 1 columns

#### 3.3.4 摘要.xlsx

| | 标题 | 字段2 | 字段5 | 字段1 |
| --- | --- | --- | --- | --- |
| 0 | HealthEquity: Strong Growth May Be Slowing Hea... | Apr. 1, 2019 10:46 PM ET | \| About: HealthEquity, Inc. (HQY) | SummaryHealthEquity’s revenue and earnings hav... |
| 1 | Valero May Rally Up To 40% Within The Next 12 ... | Apr. 1, 2019 10:38 PM ET | \| About: Valero Energy Corporation (VLO) | SummaryValero is ideally positioned to benefit... |
| 2 | Apple Makes A China Move | Apr. 1, 2019 7:21 PM ET | \| About: Apple Inc. (AAPL) | SummaryCompany cuts prices on many key product... |
| ... | ... | ... | ... | ... |
| 10128 | Rubicon Technology: A Promising Net-Net Cash-B... | Jul. 24, 2018 2:16 PM ET | \| About: Rubicon Technology, Inc. (RBCN) | SummaryRubicon is trading well below likely li... |
| 10129 | Stamps.com: A Cash Machine | Jul. 24, 2018 1:57 PM ET | \| About: Stamps.com Inc. (STMP) | SummaryThe Momentum Growth Quotient for the co... |
| 10130 | Can Heineken Turn The Mallya Drama In Its Ow... | Jul. 24, 2018 1:24 PM ET | \| About: Heineken N.V. (HEINY), Includes: BUD,... | SummaryMallya, United Breweries chairman, can... |

10131 rows × 4 columns

A check shows that 摘要.xlsx has no missing values. Only 标题 and 字段1 (the full text of the summary) are needed; the other columns are dropped. The columns are renamed: 标题 becomes title, 字段1 becomes abstract.

```python
abstracts = pd.read_excel('摘要.xlsx')
abstracts.drop(['字段2', '字段5'], axis=1, inplace=True)
newIndex_abstracts = {'标题': 'title', '字段1': 'abstract'}
abstracts.rename(columns=newIndex_abstracts, inplace=True)
```

**Intersection:** the rows whose title also appears in AAPL全标题.csv are the summaries of articles about AAPL; only these are needed.

```python
abstracts = abstracts.merge(allTitles, on=['title'], how='inner')
```

**Storage:** save as 补充数据1925102007/AAPL摘要.csv.

```python
abstracts.to_csv('补充数据1925102007/AAPL摘要.csv', index=False)
abstracts
```

| | title | abstract |
| --- | --- | --- |
| 0 | Will Apple Get Its Mojo Back? | SummaryApple has been resting on a reputation ... |
| 1 | How Much Stock Could Apple Have Repurchased In... | SummaryApples stock plummeted from $227.26 to... |
| 2 | Will Apple Beat Its Guidance? | SummaryApple has sold fewer iPhones, which gen... |
| ... | ... | ... |
| 83 | Apple: Still The Ultimate Value Growth Stock T... | SummaryApple reported superb earnings on Tuesd... |
| 84 | Apple In 2023 | SummaryWhere can the iPhone go from here?The A... |
| 85 | Apples Real Value Today | SummaryApple has reached new highs this week.W... |

86 rows × 2 columns

#### 3.3.5 回帖

```python
pd.read_excel('回帖/SA_Comment_Page131-153.xlsx')
```

| | 字段 | 标题1 |
| --- | --- | --- |
| 0 | you should all switch to instagram | NaN |
| 1 | Long Facebook and Instagram. They will recover... | NaN |
| 2 | Personally, I think people will be buying FB a... | NaN |
| ... | ... | ... |
| 19968 | Thank you for the article.If you really think ... | Qiwi: The Current Sell-Off Was Too Emotional |
| 19969 | Isnt WRK much better investment than PKG? Thanks | NaN |
| 19970 | GuruFocus is also showing a Priotroski score o... | Packaging Corporation Of America: Target Retur... |

19971 rows × 2 columns

```python
pd.read_csv('回帖/SA_Comment_Page181-255(1).csv')
```

| | 字段1 | 标题 |
| --- | --- | --- |
| 0 | I bought at $95 and holding strong. Glad I did... | NaN |
| 1 | The price rally you are referring to is not be... | Michael Kors: Potential For Further Upside Ahead |
| 2 | only a concern if you own it.... | NaN |
| ... | ... | ... |
| 19997 | What can Enron Musk do legally to boost balan... | NaN |
| 19998 | The last two weeks feels like a short squeeze.... | NaN |
| 19999 | Tesla is no longer a growth or value proposi... | NaN |

20000 rows × 2 columns

- **Renaming:** the reply-content column becomes content and the title column becomes title. Note that the column names differ between the .csv files (字段1, 标题) and the .xlsx files (字段, 标题1).
- **Missing values:** by the definition of the title column in 回帖 (an empty title takes the nearest non-empty title below it), each missing value is filled with the next non-missing value, i.e. `df.fillna(method='bfill')` (equivalently `df.bfill()`).
- **Reading the data files:** `os.listdir()` returns the list of file names in the given folder. Files ending in .xlsx or .csv are data files; after reading, the missing-value handling and renaming above are applied.
- **Filtering the replies:** iterate over all data files, collect every reply row whose title appears in AAPL全标题.csv, check for missing data, and save to 补充数据1925102007/AAPL回帖.csv.

```python
# Read the data files
repliesFiles = listdir('回帖')
allAALPReplies = []
newIndex_replies_csv = {'字段1': 'content', '标题': 'title'}
newIndex_replies_xlsx = {'字段': 'content', '标题1': 'title'}
# Walk through every reply file and pick out the replies related to AAPL
for file in repliesFiles:
    path = '回帖/' + file
    if file.endswith('.csv'):
        replies = pd.read_csv(path)
        newIndex_replies = newIndex_replies_csv
    elif file.endswith('.xlsx'):
        replies = pd.read_excel(path)
        newIndex_replies = newIndex_replies_xlsx
    else:
        # Skip unknown files instead of aborting the whole loop
        print('Wrong file format,', file)
        continue
    # Rename columns
    replies.rename(columns=newIndex_replies, inplace=True)
    # Back-fill missing titles
    replies.bfill(inplace=True)
    # Keep the replies whose title matches an AAPL article
    allAALPReplies.extend(replies.merge(allTitles, on=['title'], how='inner').values)
# All replies that belong to AAPL article titles
allAALPReplies = pd.DataFrame(allAALPReplies, columns=['content', 'title'])
# Save
allAALPReplies.to_csv('补充数据1925102007/AAPL回帖.csv', index=False)
# Display
allAALPReplies
```

| | content | title |
| --- | --- | --- |
| 0 | Understood. But let me ask you. 64GB of pics i... | iPhone XR And XS May Be Apples Most Profitabl... |
| 1 | Just upgraded from 6 to XS, 256G. Love it. Il... | iPhone XR And XS May Be Apples Most Profitabl... |
| 2 | Yup, AAPL will grow profits 20% per year despi... | iPhone XR And XS May Be Apples Most Profitabl... |
| ... | ... | ... |
| 4503 | With all due respect, never have paid for and ... | Gain Exposure To Apple Through Berkshire Hathaway |
| 4504 | This ones easy - own both! | Gain Exposure To Apple Through Berkshire Hathaway |
| 4505 | No Thanks! I like my divys,and splits too much... | Gain Exposure To Apple Through Berkshire Hathaway |

4506 rows × 2 columns

### 3.4 Sentiment analysis

**Third-party NLP library: NLTK (Natural Language Toolkit)**

> NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries.

After installing the nltk library, the corresponding corpora have to be downloaded with `nltk.download()`. Because the download was too slow, the nltk_data package was installed directly; the core data package is placed in the supplementary folder. To improve the efficiency and accuracy of the sentiment analysis, the stop words are extended with `['!', ',', '.', '?', '-s', '-ly', '</s>', 's', 'AAPL', 'apple', '$', '%']`. Since `stopwords.words('english')` returns a plain list, the extra words are appended by list concatenation.

[Source3] http://www.nltk.org

**Financial sentiment lexicon: LM (LoughranMcDonald) sentiment word lists 2018**

> [Loughran-McDonald Sentiment Word Lists](https://sraf.nd.edu/textual-analysis/resources/#LM Sentiment Word Lists) is an Excel file containing each of the LM sentiment words by category (Negative, Positive, Uncertainty, Litigious, Strong Modal, Weak Modal, Constraining).
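The stop-word extension described above can be sketched without NLTK installed. This is a minimal illustration, not the article's code: `base_stopwords` is a tiny stand-in for `stopwords.words('english')`, `drop_stopwords` is a hypothetical helper, and lowercasing the stop list (so that the entry `AAPL` also matches a lowercased token) is an assumption of this sketch.

```python
# Stand-in for nltk.corpus.stopwords.words('english'); the real list is much longer.
base_stopwords = ["are", "the", "is", "a", "to"]

# Extra tokens named in the article: punctuation, suffix debris, and the ticker itself.
extra_stopwords = ["!", ",", ".", "?", "-s", "-ly", "</s>", "s", "AAPL", "apple", "$", "%"]

# Plain lists are combined by concatenation; lowercasing makes the match case-insensitive.
stop_ws = {w.lower() for w in base_stopwords + extra_stopwords}


def drop_stopwords(tokens):
    """Keep tokens whose lowercase form is not a stop word (hypothetical helper)."""
    return [t for t in tokens if t.lower() not in stop_ws]


tokens = ["$", "AAPL", "gains", "are", "strong", "!"]
print(drop_stopwords(tokens))
```

Using a set rather than a list keeps each membership test constant-time, which matters when several thousand comments are tokenised.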
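Word list path: /补充数据1925102007/LoughranMcDonald_SentimentWordLists_2018.xlsx

[Source4] https://sraf.nd.edu/textual-analysis/resources

#### 3.4.1 Approach

- **Tokenisation:** use NLTK to tokenise the text (here, the comment data).
- **Stop words:** remove the stop words.
- **Structuring:** use the Positive and Negative sheets of the LM financial sentiment lexicon to compute pos and neg values as structured features of the unstructured text, i.e. the shares of positive and negative words in a comment.
- **Aggregation:** aggregate the resulting data and resample it by business day (stocks trade on business days).

$$
pos = \frac{\text{Num of PosWords}}{\text{Total Words}}
$$

$$
neg = \frac{\text{Num of NegWords}}{\text{Total Words}}
$$

#### 3.4.2 Importing the word lists and adding stop words

```python
# Import the word lists
wordListsPath = '补充数据1925102007/LoughranMcDonald_SentimentWordLists_2018.xlsx'
posWords = pd.read_excel(wordListsPath, header=None, sheet_name='Positive').iloc[:, 0].values
negWords = pd.read_excel(wordListsPath, header=None, sheet_name='Negative').iloc[:, 0].values
# Add stop words
extraStopwords = ['!', ',', '.', '?', '-s', '-ly', '</s>', 's', 'AAPL', 'apple', '$', '%']
stopWs = stopwords.words('english') + extraStopwords
```

#### 3.4.3 Function definitions

```python
def structComment(sentence, posW, negW, stopW):
    """
    Turn a sentence into structured features.
    :param sentence: the comment to structure
    :param posW: positive words
    :param negW: negative words
    :param stopW: stop words
    :return: shares of positive and negative words after stop-word removal, (pos, neg)
    """
    # Tokenise
    tokenizer = nltk.word_tokenize(sentence)
    # Remove stop words; upper-case the rest to match the LM word lists
    tokenizer = [w.upper() for w in tokenizer if w.lower() not in stopW]
    # Positive words
    posWs = [w for w in tokenizer if w in posW]
    # Negative words
    negWs = [w for w in tokenizer if w in negW]
    # Number of remaining tokens
    len_token = len(tokenizer)
    # An empty sentence would give a zero denominator
    if len_token <= 0:
        return 0, 0
    else:
        return len(posWs) / len_token, len(negWs) / len_token


def NLProcessing(fileName, colName):
    """
    Structure the text column colName of the file fileName (.csv) and save the result.
    :param fileName: file name, looked up in the folder 补充数据1925102007/
    :param colName: the text column to structure
    :return: DataFrame with the new pos and neg columns
    """
    pathNLP = '补充数据1925102007/' + fileName + '.csv'
    data = pd.read_csv(pathNLP)
    # Build the structured pos and neg values
    posAndneg = [structComment(st, posWords, negWords, stopWs) for st in data[colName].values]
    # Turn them into a DataFrame
    posAndneg = pd.DataFrame(posAndneg, columns=['pos', 'neg'])
    # Concatenate along the columns
    data = pd.concat([data, posAndneg], axis=1)
    # Drop the text column
    data.drop([colName], axis=1, inplace=True)
    # Save the structured data
    data.to_csv(pathNLP)
    return data
```

#### 3.4.4 Running the sentiment analysis

```python
# AAPL论坛.csv
forum = NLProcessing('AAPL论坛', 'remark')
# AAPL摘要.csv
abstracts = NLProcessing('AAPL摘要', 'abstract')
# AAPL回帖.csv
allAALPReplies = NLProcessing('AAPL回帖', 'content')
```

#### 3.4.5 Aggregating the sentiment features

The steps above yield the structured data of AAPL回帖.csv and AAPL摘要.csv, both with a title column. The replies and the abstracts are first concatenated vertically with `concat`, then outer-merged with AAPL全标题.csv (`allTitles`) on title, after which the no longer needed title column is dropped. The structured forum data is then concatenated vertically with the result. Finally the data is resampled by day to obtain the daily mean of the pos and neg features.

```python
# Concatenate abstracts and allAALPReplies
allEssaysComment = pd.concat([abstracts, allAALPReplies], ignore_index=True)
# Join with the titles; reset_index() turns the date index back into a column,
# because merging on a column would otherwise discard it
allEssaysComment = allTitles.reset_index().merge(allEssaysComment, how='outer', on='title')
# Drop rows with missing values
allEssaysComment.dropna(inplace=True)
# Drop the title column
allEssaysComment.drop('title', axis=1, inplace=True)
# Concatenate with the forum sentiment data
allEssaysComment = pd.concat([allEssaysComment, forum], ignore_index=True)
# Remove rows in which both pos and neg are 0
allEssaysComment = allEssaysComment[(allEssaysComment['pos'] + allEssaysComment['neg']) > 0]
# Use date as the time-series index
allEssaysComment['date'] = pd.to_datetime(allEssaysComment['date'])
allEssaysComment.set_index('date', inplace=True)
# Resample by business day, averaging pos and neg; days without data are filled with 0
allEssaysComment = allEssaysComment.resample('B').mean()
allEssaysComment.fillna(0, inplace=True)
# Save
allEssaysComment.to_csv('补充数据1925102007/allPosAndNeg.csv')
# Display
allEssaysComment
```

| date | pos | neg |
| --- | --- | --- |
| 2018-01-05 | 0.041667 | 0.043478 |
| 2018-01-08 | 0.000000 | 0.000000 |
| 2018-01-09 | 0.000000 | 0.090909 |
| ... | ... | ... |
| 2018-12-24 | 0.000000 | 0.000000 |
| 2018-12-25 | 0.000000 | 0.000000 |
| 2018-12-26 | 0.090909 | 0.090909 |

254 rows × 2 columns

### 3.5 \* Correlation analysis of stock indicators with sentiment data

- **Method:** use seaborn's `pairplot` to draw pairwise scatter plots of the indicators in AAPL股票价格.csv (`sharePricesAAPL`), with histograms of the variables on the diagonal, and so explore the relations between the indicators.
- **Purpose:** analyse the relations between the stock indicators, and check whether any indicators are so strongly linearly correlated that they can be dropped to reduce the LSTM training time.

pairplot documentation: http://seaborn.pydata.org/generated/seaborn.pairplot.html

#### 3.5.1 Joining the data

The time-series sentiment data obtained in 3.4.5 (allPosAndNeg.csv) is merged with AAPL股票价格.csv (`sharePricesAAPL`) on date. The comment data spans enough time to cover the AAPL price data, so missing values are not a concern here.

```python
# Read the files
sharePricesAAPL = pd.read_csv('补充数据1925102007/AAPL股票价格.csv')
allPosAndNeg = pd.read_csv('补充数据1925102007/allPosAndNeg.csv')
# Merge
sharePricesAAPLwithEmotion = sharePricesAAPL.merge(allPosAndNeg, how='inner', on='date')
```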
sharePricesAAPLwithEmotion[date] pd.DatetimeIndex(sharePricesAAPLwithEmotion[date]) sharePricesAAPLwithEmotion.set_index(date, inplaceTrue) # reindex AAPL_newColOrder_emotionPrices [open, high, low, vol, pos, neg, close] sharePricesAAPLwithEmotion sharePricesAAPLwithEmotion.reindex(columnsAAPL_newColOrder_emotionPrices) # 保存 sharePricesAAPLwithEmotion.to_csv(补充数据1925102007/AAPL股票价格融合情感.csv)3.5.2 pairplot绘图留下必要的OHLC技术指标对剩余的vol、pos、neg进行相关性分析绘图实验时我也绘制了OHLC技术指标的轴线网格图可以发现其两两间具有较高的线性相关性。 # Parameters: # data: pandas.DataFrame [Tidy (long-form) dataframe where each column is a variable and each row is an observation.] # diag_kind: {‘auto’, ‘hist’, ‘kde’, None} [Kind of plot for the diagonal subplots.] # kind: {‘scatter’, ‘kde’, ‘hist’, ‘reg’} [Kind of plot to make.] fig1 pairplot(sharePricesAAPLwithEmotion[[vol, pos, neg]], diag_kindhist, kindreg) # save the fig1 to 补充数据1925102007/ fig1.savefig(补充数据1925102007/fig1_a_Grid_of_Axes.png)3.5.3 股票指标相关性分析观察所得Fig1: a Grid of Axes不难发现指标vol、pos、neg之间线性相关性较弱所以均保留作为LSTM预测指标。 3.6 LSTM预测融合情感特征的股票数据依赖的库Keras、Sklearn、Tensorflow [4] 预测目标close收盘价引用函数series_to_supervised(data, n_in1, n_out1, dropnanTrue) 来源Time Series Forecasting With Python 用途Frame a time series as a supervised learning dataset. 将输入的单变量或多变量时间序列转化为有监督学习数据集。参数Arguments data: Sequence of observations as a list or NumPy array. n_in: Number of lag observations as input (X). n_out: Number of observations as output (y). dropnan: Boolean whether or not to drop rows with NaN values. # 因为LSTM已经具有记忆功能了所以我的n_in和n_out参数直接使用默认的1即可也就是构造[t-1]现态列和[t]次态列。返回值Returns Pandas DataFrame of series framed for supervised learning. 3.6.1 时间序列转有监督函数定义 def series_to_supervised(data, n_in1):# 默认参数n_out1dropnanTrue# 对该函数进行微调注意data为以close列需要预测的列结尾的DataFrame时间序列股票数据n_vars 1 if type(data) is list else data.shape[1]df pd.DataFrame(data)cols, names list(), list()# input sequence (t-n, ... 
t-1)for i in range(n_in, 0, -1):cols.append(df.shift(i))names [(var%d(t-%d) % (j1, i)) for j in range(n_vars)]# forecast sequence (t, t1, ... tn)for i in range(0, n_out):cols.append(df.shift(-i))if i 0:names [(var%d(t) % (j1)) for j in range(n_vars)]else:names [(var%d(t%d) % (j1, i)) for j in range(n_vars)]# put it all togetheragg pd.concat(cols, axis1)agg.columns names# 删除无关的次态[t]列只留下需要预测的close[t]列和上一时刻状态特征[t-1]列agg.drop(agg.columns[[x for x in range(data.shape[1], 2*data.shape[1]-1)]], axis1, inplaceTrue)# drop rows with NaN valuesif dropnan:agg.dropna(inplaceTrue)return agg3.6.2 融合情感的股票数据归一化 # 读取数据 sharePricesAAPLwithEmotion pd.read_csv(补充数据1925102007/AAPL股票价格融合情感.csv, parse_dates[date], index_coldate).values # 生成归一化容器 # feature_range参数沿用默认(0,1) scaler MinMaxScaler() # 训练模型 scaler scaler.fit(sharePricesAAPLwithEmotion) # 归一化 sharePricesAAPLwithEmotion scaler.fit_transform(sharePricesAAPLwithEmotion) # 部分结果展示 sharePricesAAPLwithEmotion[:5,:]array([[0.4316836 , 0.43640137, 0.44272148, 0.06118638, 0. ,0. , 0.47336914],[0.47972885, 0.44433594, 0.44416384, 0.01698249, 0. ,0. , 0.4351243 ],[0.44911044, 0.42553711, 0.45305926, 0.04901593, 0. ,0. , 0.45248692],[0.4510469 , 0.45024426, 0.46411828, 0.05954544, 0. ,0. , 0.4826372 ],[0.50042364, 0.47766101, 0.51340305, 0.08659896, 0. ,0. 
, 0.51325663]])3.6.3 时间序列构建有监督数据集 # 使用series_to_supervised函数构建有监督数据集 sharePricesAAPLwithEmotion series_to_supervised(sharePricesAAPLwithEmotion) sharePricesAAPLwithEmotionvar1(t-1)var2(t-1)var3(t-1)var4(t-1)var5(t-1)var6(t-1)var7(t-1)var7(t)10.4316840.4364010.4427210.0611860.00.00.4733690.43512420.4797290.4443360.4441640.0169820.00.00.4351240.45248730.4491100.4255370.4530590.0490160.00.00.4524870.482637...........................1200.1482510.1289060.1047000.6242520.00.00.1173160.0457531210.1054100.0806880.0365430.9940590.00.00.0457530.0000001220.0000000.0000000.0000000.2956430.00.00.0000000.121305 122 rows × 8 columns 3.6.4 训练集验证集划分 # 必须规定ndarray的dtype为float32默认float64否则后续输入LSTM模型报错 sharePricesAAPLwithEmotion sharePricesAAPLwithEmotion.values.astype(np.float32) # 训练集:验证集7:3 X_train, X_test, y_train, y_test train_test_split(sharePricesAAPLwithEmotion[:,:-1], sharePricesAAPLwithEmotion[:,-1], test_size0.3, shuffleFalse)3.6.5 基于Keras的LSTM模型搭建参考文档 Keras core: Dense and Dropout Keras Activation relu Keras Losses mean_squared_error Keras Optimizer adam Keras LSTM Layers Keras Sequential Model 3.6.5 (一)、重塑LSTM的输入X LSTM的输入格式为**shape [samples,timesteps,features]** samples样本数量 timesteps时间步长 features (input_dim)每一个时间步上的维度重塑X_train和X_test # reshape input to be 3D [samples, timesteps, features] X_train X_train.reshape((X_train.shape[0], 1, X_train.shape[1])) X_test X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))3.6.5 (二)、搭建LSTM模型并绘制损失图建立Sequential模型添加LSTM层64个隐藏层神经元1个输出层神经元指定多层LSTM模型第一层的input_shape参数回归模型设定Dropout在每次训练时的丢弃比rate为0.4设定Dense全连接层的输出空间维度units为1激活函数activation为relu整流线性单元设定Sequential的损失函数loss为MSEMean-Square Error均方误差优化器optimizer为adam模型训练设置epochs50; batch_size30 def LSTMModelGenerate(Xtrain, Xtest, ytrain, ytest):LSTM模型搭建函数:param Xtrain: 训练集属性:param Xtest: 测试集属性:param ytrain: 训练集标签:param ytest: 测试集标签:return: history,model模型编译记录和模型# 搭建LSTM模型_model Sequential()_model.add(LSTM(64, input_shape(Xtrain.shape[1], 
Xtrain.shape[2])))_model.add(Dropout(0.4))_model.add(Dense(1, activationrelu))# 模型编译_model.compile(lossmse, optimizeradam)# 模型训练_history _model.fit(Xtrain, ytrain, epochs50, batch_size30, validation_data(Xtest, ytest), shuffleFalse, verbose0)return _history,_modelhistory, model LSTMModelGenerate(X_train, X_test, y_train, y_test)损失图绘制 def drawLossGraph(_history, title, num):损失图绘制寻找最优epochs:param _history: 训练历史:param title: 图表标题:param num: 图表编号:return: 无plt.plot(_history.history[loss], colorg, labeltrain)plt.plot(_history.history[val_loss], colorr, labeltest)plt.title(Fignum. title)plt.xlabel(epochs)plt.ylabel(loss)plt.legend()# 保存于补充数据1925102007/savingPath 补充数据1925102007/fignum_title.replace( , _).pngplt.savefig(savingPath, dpi400, bbox_inchestight)# 展示plt.show()drawLossGraph(history, titleLSTM Loss Graph for Stock Prices with Emotions, num2)损失图分析由Fig2含情感的股票价格LSTM损失图可以看出MSE随迭代次数增加而减小在大约30次迭代后其趋于稳定收敛。 3.6.6 预测结果并反归一化 # 因为只要对结果列进行反归一化操作 # 故不用inverse_transform函数 # 这里自定义对某列的反归一化函数 inverse_transform_col def inverse_transform_col(_scaler, y, n_col):对某个列进行反归一化处理的函数:param _scaler: sklearn归一化模型:param y: 需要反归一化的数据列:param n_col: y在归一化时所属的列编号:return: y的反归一化结果y y.copy()y - _scaler.min_[n_col]y / _scaler.scale_[n_col]return y# 模型预测结果绘图函数 def predictGraph(yTrain, yPredict, yTest, timelabels, title, num):预测结果图像绘制函数:param yTrain: 训练集结果:param yPredict: 验证集的预测结果:param yTest: 验证集的真实结果:param timelabels: x轴刻度标签:param title: 图表标题:param num: 图标编号:return: 无len_yTrain yTrain.shape[0]len_y len_yTrainyPredict.shape[0]# 真实曲线绘制plt.plot(np.concatenate([yTrain,yTest]), colorr, labelsample)# 预测曲线绘制plt.plot([x for x in range(len_yTrain,len_y)],yPredict, colorg, labelpredict)# 标题和轴标签plt.title(Fignum. 
title)plt.xlabel(date)plt.ylabel(close)plt.legend()# 刻度和刻度标签xticks [0,len_yTrain,len_y-1]xtick_labels [timelabels[x] for x in xticks]plt.xticks(ticksxticks, labelsxtick_labels, rotation30)# 保存于补充数据1925102007/savingPath 补充数据1925102007/fignum_title.replace( , _).pngplt.savefig(savingPath, dpi400, bbox_inchestight)# 展示plt.show()# 由X_test前日股票指标预测当天股票close值 # 注predict生成的array需降维成 shape(n_samples, ) y_predict model.predict(X_test)[:,0]# 反归一化 # 重新读取 AAPL股票价格融合情感.csv sharePricesAAPLwithEmotion pd.read_csv(补充数据1925102007/AAPL股票价格融合情感.csv) col_n sharePricesAAPLwithEmotion.shape[1]-2 # 预测结果反归一化 inv_yPredict inverse_transform_col(scaler, y_predict, col_n) # 真实结果反归一化 inv_yTest inverse_transform_col(scaler, y_test, col_n) # 训练集结果反归一化以绘制完整图像 inv_yTrain inverse_transform_col(scaler, y_train, col_n) # 绘图 predictGraph(inv_yTrain, inv_yPredict, inv_yTest, timelabelssharePricesAAPLwithEmotion[date].values, titlePrediction Graph of Stock Prices with Emotions, num3)3.6.7 模型评估误差评价方法MSE # sklearn.metrics.mean_squared_error(y_true, y_pred) mse mean_squared_error(inv_yTest, inv_yPredict) print(带有情感特征的股票数据预测结果的均方误差MSE为 , mse)带有情感特征的股票数据预测结果的均方误差MSE为 160.42007分析观察Fig3可知用含有情感特征的股票数据训练的LSTM模型预测结果绿色曲线和真实结果红色曲线的后段总体变化趋势一致即真实值下降或上升时预测值跟着下降或上升。在模型预测的开始阶段拟合效果较好但随着时间推移预测值和真实值的结果差距愈发增大。 3.7 对比实验预测纯技术指标的股票数据作为对比导入补充数据1925102007/AAPL股票价格.csv具体操作和上述一致对不含情感特征的纯技术指标股票数据进行预测分析。操作基本一致故不作详细注释 3.7.1 对比实验流程通用函数构造 def formatData(sharePricesData):模式化样本数据的函数:param sharePricesData: 样本数据的DataFrame:return: X_train, X_test, y_train, y_test, scaler# 归一化_scaler MinMaxScaler()_scaler _scaler.fit(sharePricesData)sharePricesData _scaler.fit_transform(sharePricesData)# 构建有监督数据集sharePricesData series_to_supervised(sharePricesData)# dtype为float32sharePricesData sharePricesData.values.astype(np.float32)# 训练集和验证集的划分_X_train, _X_test, _y_train, _y_test train_test_split(sharePricesData[:,:-1], sharePricesData[:,-1], test_size0.3, shuffleFalse)# reshape input_X_train _X_train.reshape((_X_train.shape[0], 1, 
_X_train.shape[1]))_X_test _X_test.reshape((_X_test.shape[0], 1, _X_test.shape[1]))return _X_train, _X_test, _y_train, _y_test, _scalerdef invTransformMulti(_scaler, _y_predict, _y_test, _y_train, _col_n):# 批量反归一化_inv_yPredict inverse_transform_col(_scaler, _y_predict, _col_n)_inv_yTest inverse_transform_col(_scaler, _y_test, _col_n)_inv_yTrain inverse_transform_col(_scaler, _y_train, _col_n)return _inv_yPredict, _inv_yTest, _inv_yTrain# 读取数据 sharePricesAAPL pd.read_csv(补充数据1925102007/AAPL股票价格.csv, parse_dates[date], index_coldate).values # 标准化数据输入 X_train, X_test, y_train, y_test, scaler formatData(sharePricesAAPL) # 建模 history, model LSTMModelGenerate(X_train, X_test, y_train, y_test)# 损失函数绘图 drawLossGraph(history, titleLSTM Loss Graph for Stock Prices without Emotions, num4)# 预测 y_predict model.predict(X_test)[:,0] # 反归一化 sharePricesAAPL pd.read_csv(补充数据1925102007/AAPL股票价格.csv) col_n sharePricesAAPL.shape[1]-2 inv_yPredict, inv_yTest, inv_yTrain invTransformMulti(scaler, y_predict, y_test, y_train, col_n) # 绘图 predictGraph(inv_yTrain, inv_yPredict, inv_yTest, timelabelssharePricesAAPL[date].values, titlePrediction Graph of Stock Prices without Emotions, num5)# 均方误差 mse mean_squared_error(inv_yTest, inv_yPredict) print(无情感特征的纯技术指标股票数据预测结果的均方误差MSE为 , mse)无情感特征的纯技术指标股票数据预测结果的均方误差MSE为 142.502273.7.2 对比实验结果分析对比Fig3和Fig5含情感和不含情感均方误差通过去除情感信息用LSTM模型得出的纯技术指标的股票close预测结果单就误差来看要优于含情感特征的股票数据预测结果纯技术指标预测的精度更高总体上更接近于真值。 MSE (含情感特征) 160.42007 MSE (纯技术指标) 142.50227 曲线特征显然含有情感数据信息的预测结果曲线较无情感的预测曲线更灵敏。Fig3含情感特征的预测曲线随真值曲线的升降而涨跌真值曲线的变化突变趋势较为完整地体现在预测曲线中而Fig5纯技术指标的预测曲线随真值曲线的波动并不明显。 Fig3. Prediction Graph of Stock Prices with Emotions Fig5. 
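Prediction Graph of Stock Prices without Emotions

3.7.3 Conclusion of the comparison experiment

With the data currently available, the technical-indicator-only data gives higher overall prediction accuracy, while the data with sentiment features is more responsive locally. The results are broadly in line with expectations. They suggest that stock price movements are not a patternless random walk but are closely tied to investor sentiment. Adding the sentiment of retail investors on internet forums to stock prediction can help judge the direction of price movements over the near future, which in turn helps identify good entry and exit points and assess investment risk. Sentiment data can therefore support investors and data analysts in making decisions in quantitative investing.

3.8 Supplementary comparison experiment: predicting with a larger sample of AAPL technical indicators

During the data-joining step it became clear that the provided 补充数据1925102007/AAPL股票价格.csv does not cover all of the comment data in allPosAndNeg.csv. The sample is also small: after the 7:3 split into training and validation sets, the training set has only 88 rows. For these reasons, full-year 2018 technical-indicator data for AAPL on trading days, provided by the 英为财情 stock-quote website, is used to predict the closing price with the method above, and the result is compared with the comparison experiment in section 3.7.

In fact, AAPL股票价格.csv covers 2018-07-02 to 2018-12-31, while allPosAndNeg.csv covers 2018-01-05 to 2018-12-31.

3.8.1 Data acquisition

The last five years of AAPL technical-indicator data were downloaded from the AAPL page on 英为财情 and stored in 补充数据1925102007/AAPLHistoricalData_5years.csv.

3.8.2 Data processing

```python
# Read the data
allYearAAPL = pd.read_csv('补充数据1925102007/AAPLHistoricalData_5years.csv',
                          parse_dates=['Date'], index_col='Date')
# Slice by the time-series index (the file is ordered newest first)
allYearAAPL = allYearAAPL['2018-12-31':'2018-01-01']
# Sort into ascending date order
allYearAAPL.sort_index(inplace=True)
# Display
allYearAAPL
```

| Date | Close/Last | Volume | Open | High | Low |
|---|---|---|---|---|---|
| 2018-01-02 | $43.065 | 101602160 | $42.54 | $43.075 | $42.315 |
| 2018-01-03 | $43.0575 | 117844160 | $43.1325 | $43.6375 | $42.99 |
| 2018-01-04 | $43.2575 | 89370600 | $43.135 | $43.3675 | $43.02 |
| ... | ... | ... | ... | ... | ... |
| 2018-12-27 | $39.0375 | 206435400 | $38.96 | $39.1925 | $37.5175 |
| 2018-12-28 | $39.0575 | 166962400 | $39.375 | $39.63 | $38.6375 |
| 2018-12-31 | $39.435 | 137997560 | $39.6325 | $39.84 | $39.12 |

251 rows × 5 columns

```python
# Strip the leading "$" with pandas string slicing and convert the columns to float32
allYearAAPL[['Close/Last', 'Open', 'High', 'Low']] = \
    allYearAAPL[['Close/Last', 'Open', 'High', 'Low']].apply(lambda x: (x.str[1:]).astype(np.float32))
# Reorder the columns so the target (close) comes last
allAAPL_newColOrder = ['Open', 'High', 'Low', 'Volume', 'Close/Last']
allYearAAPL = allYearAAPL.reindex(columns=allAAPL_newColOrder)
# Save as AAPL2018allYearData.csv
allYearAAPL.to_csv('补充数据1925102007/AAPL2018allYearData.csv')
# Display
allYearAAPL
```

| Date | Open | High | Low | Volume | Close/Last |
|---|---|---|---|---|---|
| 2018-01-02 | 42.540001 | 43.075001 | 42.314999 | 101602160 | 43.064999 |
| 2018-01-03 | 43.132500 | 43.637501 | 42.990002 | 117844160 | 43.057499 |
| 2018-01-04 | 43.134998 | 43.367500 | 43.020000 | 89370600 | 43.257500 |
| ... | ... | ... | ... | ... | ... |
| 2018-12-27 | 38.959999 | 39.192501 | 37.517502 | 206435400 | 39.037498 |
| 2018-12-28 | 39.375000 | 39.630001 | 38.637501 | 166962400 | 39.057499 |
| 2018-12-31 | 39.632500 | 39.840000 | 39.119999 | 137997560 | 39.435001 |

251 rows × 5 columns

3.8.3 Prediction and analysis

```python
# Format the model input
X_train, X_test, y_train, y_test, scaler = formatData(allYearAAPL)
# Build and train the model
history, model = LSTMModelGenerate(X_train, X_test, y_train, y_test)
# Plot the loss curve
drawLossGraph(history, title='LSTM Loss Graph for 2018 All Year AAPL Stock Prices', num=6)

# Predict
y_predict = model.predict(X_test)[:, 0]
# Inverse normalization
allYearAAPL = pd.read_csv('补充数据1925102007/AAPL2018allYearData.csv')
col_n = allYearAAPL.shape[1] - 2
inv_yPredict, inv_yTest, inv_yTrain = invTransformMulti(scaler, y_predict, y_test, y_train, col_n)
# Plot
predictGraph(inv_yTrain, inv_yPredict, inv_yTest,
             timelabels=allYearAAPL['Date'].values,
             title='Prediction Graph of 2018 All Year AAPL Stock Prices', num=7)

# Mean squared error
mse = mean_squared_error(inv_yTest, inv_yPredict)
print('MSE of the prediction on 2018 full-year technical-indicator AAPL data: ', mse)
```

3.8.4 Result analysis

Comparing Fig7.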
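Prediction Graph of 2018 All Year AAPL Stock Prices and the MSE of the 2018 full-year technical-indicator prediction with the sentiment-free AAPL comparison experiment in section 3.7 shows that, once the time series is extended (from 2018-07-02 to 2018-12-31, out to 2018-01-01 to 2018-12-31), the accuracy of the technical-indicator-only prediction improves substantially and the LSTM model fits very well. This suggests that the low accuracy in Fig3 and Fig5 (the predictions with sentiment features and with technical indicators only, before the data was extended), and the way those predictions drift far from the actual values over time, come from having too few samples, which leaves the LSTM model undertrained. To test this inference, the extended 2018 full-year AAPL data is next merged with the sentiment features and used for prediction.

3.9 Prediction experiment on 2018 full-year stock data with sentiment features

3.9.1 Aggregating the sentiment feature data

```python
# Read the files
allYearAAPL_withEmos = pd.read_csv('补充数据1925102007/AAPL2018allYearData.csv')
allPosAndNeg = pd.read_csv('补充数据1925102007/allPosAndNeg.csv')
# Merge on the date columns
allYearAAPL_withEmos = allYearAAPL_withEmos.merge(
    allPosAndNeg, how='inner', left_on='Date', right_on='date').drop('date', axis=1)
# Convert Date to a DatetimeIndex and use it as the index
allYearAAPL_withEmos['Date'] = pd.DatetimeIndex(allYearAAPL_withEmos['Date'])
allYearAAPL_withEmos.set_index('Date', inplace=True)
# Reorder the columns so the target (close) comes last
allYearAAPLwithEmos_newColOrder = ['Open', 'High', 'Low', 'Volume', 'pos', 'neg', 'Close/Last']
allYearAAPL_withEmos = allYearAAPL_withEmos.reindex(columns=allYearAAPLwithEmos_newColOrder)
# Save
allYearAAPL_withEmos.to_csv('补充数据1925102007/AAPL2018allYearData_withEmos.csv')
# Display
allYearAAPL_withEmos
```

| Date | Open | High | Low | Volume | pos | neg | Close/Last |
|---|---|---|---|---|---|---|---|
| 2018-01-05 | 43.3600 | 43.8425 | 43.2625 | 94359720 | 0.041667 | 0.043478 | 43.7500 |
| 2018-01-08 | 43.5875 | 43.9025 | 43.4825 | 82095480 | 0.000000 | 0.000000 | 43.5875 |
| 2018-01-09 | 43.6375 | 43.7650 | 43.3525 | 86128800 | 0.000000 | 0.090909 | 43.5825 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2018-12-21 | 39.2150 | 39.5400 | 37.4075 | 381991600 | 0.000000 | 0.000000 | 37.6825 |
| 2018-12-24 | 37.0375 | 37.8875 | 36.6475 | 148676920 | 0.000000 | 0.000000 | 36.7075 |
| 2018-12-26 | 37.0750 | 39.3075 | 36.6800 | 232535400 | 0.090909 | 0.090909 | 39.2925 |

245 rows × 7 columns

3.9.2 Prediction and analysis

```python
# Format the model input
X_train, X_test, y_train, y_test, scaler = formatData(allYearAAPL_withEmos)
# Build and train the model
history, model = LSTMModelGenerate(X_train, X_test, y_train, y_test)
# Plot the loss curve
drawLossGraph(history, title='LSTM Loss Graph for 2018 All Year AAPL Stock Prices with Emotions', num=8)

# Predict
y_predict = model.predict(X_test)[:, 0]
# Inverse normalization
allYearAAPL_withEmos = pd.read_csv('补充数据1925102007/AAPL2018allYearData_withEmos.csv')
col_n = allYearAAPL_withEmos.shape[1] - 2
inv_yPredict, inv_yTest, inv_yTrain = invTransformMulti(scaler, y_predict, y_test, y_train, col_n)
# Plot
predictGraph(inv_yTrain, inv_yPredict, inv_yTest,
             timelabels=allYearAAPL_withEmos['Date'].values,
             title='Prediction Graph of 2018 All Year AAPL Stock Prices with Emotions', num=9)

# Mean squared error
mse = mean_squared_error(inv_yTest, inv_yPredict)
print('MSE of the prediction on 2018 full-year AAPL data with sentiment features: ', mse)
```

MSE of the prediction on 2018 full-year AAPL data with sentiment features:  1.5526791

3.9.3 Result analysis

Training loss curves: compare Fig2. LSTM Loss Graph for Stock Prices with Emotions with Fig8. LSTM Loss Graph for 2018 All Year AAPL Stock Prices with Emotions. The LSTM model trained on the 2018 full-year AAPL data with sentiment features converges after about 10 epochs, whereas the model trained on the partial AAPL data with sentiment features needs about 20 epochs. As the number of training samples grows, the loss converges in fewer epochs and the fit is better.

Prediction curves: compare Fig7. Prediction Graph of 2018 All Year AAPL Stock Prices with Fig9. Prediction Graph of 2018 All Year AAPL Stock Prices with Emotions, that is, the 2018 full-year AAPL predictions from technical indicators only and with sentiment features added. The two plots differ very little. Their MSE values, however, show that MSE (2018 full-year AAPL data with sentiment features) < MSE (2018 full-year technical-indicator AAPL data). So when the overall sample is enlarged and the sentiment data spans the same period as the technical indicators, adding sentiment features to the technical-indicator data improves the accuracy of the predicted closing price.

MSE (2018 full-year AAPL data with sentiment features) = 1.5526791
MSE (2018 full-year technical-indicator AAPL data) = 1.7402486

4. Conclusion and summary

This experiment examined how structured sentiment features affect an LSTM stock prediction model. Pandas was used to preprocess the provided data (loading, cleaning and preparation, wrangling, time-series handling, aggregation) so that it was usable. NLTK and the LM financial lexicon were then used to run sentiment analysis on the unstructured text, and the resulting structured data was merged into the technical-indicator stock data. The correlations between the stock indicators were analyzed to reduce dimensionality and speed up training. A Keras LSTM model, evaluated with MSE, was used to predict the closing price (Close) on both the partial stock data and the 2018 full-year stock data, each with and without sentiment features.

The results show that when the LSTM model predicts the closing price from a small training sample, the predictions drift far from the actual values over time whether or not sentiment data is included, so accuracy is low. Including sentiment data makes the predictions more responsive, with rises and falls that track the actual values more closely, but accuracy drops somewhat. When the training sample is sufficient, accuracy improves substantially, and the added sentiment features make the predictions moderately more responsive, which raises overall accuracy further.

5. References

[1] Wes McKinney. 利用Python进行数据分析[M]. 机械工业出版社. 2013.
[2] 洪志令, 吴梅红. 股票大数据挖掘实战——股票分析篇[M]. 清华大学出版社. 2020.
[3] 杨妥, 李万龙, 郑山红. 融合情感分析与SVM_LSTM模型的股票指数预测. 软件导刊, 2020(8):14-18.
[4] Francesca Lazzeri. Machine Learning for Time Series Forecasting with Python[M]. Wiley. 2020.

Dataset download:
百度云 - https://pan.baidu.com/s/1tC1AFx0kMHPUGobvqf47pg
华大云盘 - https://pan.hqu.edu.cn/share/a474d56c6b6557f7a7fd0e0eb7
Password - ued8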
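The MSE comparisons in sections 3.7.2 and 3.9.3 can also be read as relative changes within each experiment pair. The snippet below is a minimal sketch of that arithmetic, not part of the original code: `mse` and `relative_change` are hypothetical helper names introduced here for illustration, and the only data used are the four MSE values reported in the article.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_change(baseline, candidate):
    """Signed change of `candidate` relative to `baseline` (negative means lower error)."""
    return (candidate - baseline) / baseline


# MSE values as reported in sections 3.7.2 and 3.9.3
partial = {"technical": 142.50227, "with_sentiment": 160.42007}
full_year = {"technical": 1.7402486, "with_sentiment": 1.5526791}

for label, scores in (("partial sample", partial), ("2018 full year", full_year)):
    change = relative_change(scores["technical"], scores["with_sentiment"])
    print(f"{label}: {change:+.1%}")
# prints:
# partial sample: +12.6%
# 2018 full year: -10.8%
```

Read this way, adding sentiment features raises the error by roughly 12.6% on the partial sample and lowers it by roughly 10.8% on the 2018 full-year data. Each ratio is taken within one experiment pair, so it only compares two models trained on the same date range.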

