Hidden Gems: Finding the Best Secret Trails in America

There are plenty of reasons why one would want to find solitude in the wilderness, from the therapeutic effects of being immersed in nature, to not wanting to contribute to trail degradation and soil erosion on busier trails.

Now more than ever, the reprieve of the outdoors is greatly needed. But in a post-COVID-19 world, where it can be practically impossible to maintain proper social distancing when passing hikers on a narrow trail, it is especially important to find less frequented trails to hike.

I set out on a mission to use data science and machine learning to find the best little-known trails in America. You can check out the code on my GitHub if you want to jump into the nitty-gritty, or read on for analysis and a list of the hidden gems in your state!

The Approach

If you’re anything like me, before you go anywhere or buy anything, you’re going to read all the reviews. When looking for trails to hike, a popular medium for discovering where to go is AllTrails.com.

When I first approached this project, I wanted to answer the question, “What makes a trail good?” That is, what combination of features and statistics about a trail would lead to it having a high overall rating?

What I found out pretty quickly, though, is that across the 35,000 trails I scraped and analyzed, basically all of them were rated “pretty good”. With an average user rating of 4.2 out of 5 stars and a standard deviation of less than 0.6, it was really hard to distinguish which trails were excellent and which were just okay from their 5-star rating alone.

Where there was huge variation across the trails, though, was in their popularity, as represented by the total number of reviews each trail had. While the vast majority of trails had only 100 or so reviews, a select few had several thousand! What was making these trails so popular?

I thus pivoted: rather than predicting a trail’s rating, I would use a data-driven model to determine the relationship between the various features of a given trail and its popularity. Having found those commonalities, I could then apply that model to unpopular trails to find which ones check all the same boxes and are likely to be great, even though they haven’t been discovered yet.

Methodology

1) With Selenium and Beautiful Soup, scrape AllTrails.com to obtain data on 35,000 trails in the United States. This included the length of each hike, its elevation gain, its location, and a list of the trail’s features (such as waterfalls, wildflowers, or paving).
2) Clean this data and create a Pandas DataFrame. This included one-hot encoding dummy variables for all of the categorical feature columns.
3) Utilize the VADER Sentiment Analysis module to analyze the text reviews for each trail via simple Natural Language Processing and determine a mean composite score.
4) Use linear regression modeling methodologies, including Statsmodels OLS, to determine the relationship between a trail’s features and its popularity.
5) Perform feature engineering and regularization via LassoCV to reduce multicollinearity amongst those features and optimize the model.
6) Apply that model to trails that are described as “lightly trafficked”, to find trails which would be expected to be popular based on their combination of features, but just haven’t been discovered yet.
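
The one-hot encoding in the second step can be sketched as follows. This is a minimal standard-library illustration of the idea, not the project’s actual code (which, per the text, worked in a Pandas DataFrame, where `pandas.get_dummies` is the usual tool). The trail names and tags are made up for illustration.

```python
# Hypothetical trails and tags, invented for illustration only.
trails = [
    {"name": "Trail A", "tags": ["waterfall", "wild flowers"]},
    {"name": "Trail B", "tags": ["paved"]},
    {"name": "Trail C", "tags": ["waterfall", "paved"]},
]


def one_hot_encode(trails):
    """Turn each trail's tag list into 0/1 indicator columns."""
    # Collect every distinct tag so that all rows share the same columns.
    columns = sorted({tag for trail in trails for tag in trail["tags"]})
    rows = []
    for trail in trails:
        present = set(trail["tags"])
        rows.append([1 if col in present else 0 for col in columns])
    return columns, rows


columns, rows = one_hot_encode(trails)
print(columns)
for trail, row in zip(trails, rows):
    print(trail["name"], row)
```

Each tag becomes its own column, so a regression can assign a separate coefficient to “has a waterfall”, “is paved”, and so on.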

Findings

A linear regression model was fit to the trails’ stats, with the number of reviews (and hence, popularity) serving as the target variable. The model yielded a list of the features with the most influence on a trail’s popularity. These included charging a fee, having a high sentiment analysis score, being rocky, involving a scramble, and having no shade, amongst others.

I interpret those important features like this:

A fee: If the most popular trails have a fee to use, this indicates they are likely located inside National Parks. As many National Parks are closed due to COVID, or may be very busy, it is even more important to find alternatives.
Sentiment analysis score: Since all trails have roughly the same score out of 5 stars, it’s hard to gather much reliable information about their quality from this rating alone. By using natural language processing to analyze the written text reviews themselves, I was able to obtain a genuinely useful metric for how people actually feel about the trail. The higher the score (on a scale of -1 = very negative to +1 = very positive), the more positively people felt toward the trail, which was super useful in finding hidden gems.
Rocky/scramble/no shade: What this says to me is that the very popular trails are above the tree line! It’s on those more difficult hikes with higher elevation gain that you encounter these features. And with higher elevation, you’ll likely get better views! As it turns out, people love these tougher trails.

The R² of this model was optimized to 0.19. Though this isn’t a very high score, you can see below that this is because the relationship between trail features and popularity simply isn’t linear. The residuals plot below, showing the difference between the predicted popularity values and the actual values, demonstrates this pretty clearly (if the relationship were linear, the residuals would all fall in a fairly horizontal band around 0!). So what’s actually determining a trail’s popularity, if not having all the right features of a popular trail?
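
To make the R² and residuals discussion concrete, here is a small pure-Python sketch. The numbers are synthetic, not the trail data: `y` depends on `x` quadratically, so a straight-line fit cannot explain it. The helper names `fit_line` and `r_squared` are my own for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: y ~ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(ys, predictions):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic, deliberately non-linear data (y = x squared).
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
predictions = [intercept + slope * x for x in xs]
# Residual = actual value minus predicted value.
residuals = [y - p for y, p in zip(ys, predictions)]

print("R^2:", r_squared(ys, predictions))
print("residuals:", residuals)
```

In this toy case the best straight line is flat, R² comes out at 0, and the residuals trace a U shape rather than a horizontal band. That kind of visible structure in a residual plot is the sign that a linear model is missing part of the relationship.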

My key finding was that AllTrails’ algorithm shows the trails with the most reviews first and foremost, which leads to a form of recursive confirmation bias. If all trails have roughly the same rating, users will turn to the reviews to determine whether a trail is good and will choose to hike one with a lot of reviews, hence feeding into the loop of making the very few busiest trails even busier. Meanwhile, other similar trails may have plenty to offer but go neglected.

So What Makes a Trail Popular?

There are tens of thousands of hikes listed on AllTrails.com, but their search algorithm always offers viewers the most popular hikes first. Trails with the most reviews get the most hikers, and hence even more reviews; meanwhile, lesser-known trails may be just as good, but are harder to find on the website, and with so few ratings it is hard to know for sure whether they’ll be good.
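
The feedback loop described here can be illustrated with a toy simulation. This is only a sketch under loud assumptions: AllTrails’ real ranking is not known to me, and the rule that list position determines how many new reviews a trail receives (`NEW_REVIEWS_BY_POSITION`) is invented for illustration, as are the starting review counts.

```python
# Assumed number of new reviews per round for list positions 1 to 4.
# The top result gets the most attention; these numbers are made up.
NEW_REVIEWS_BY_POSITION = [5, 3, 1, 1]


def simulate_ranked_listing(reviews, rounds):
    """Simulate a listing that always shows the most-reviewed trails first."""
    reviews = list(reviews)
    for _ in range(rounds):
        # Rank trail indices by current review count, most reviewed first.
        ranking = sorted(range(len(reviews)), key=lambda i: reviews[i], reverse=True)
        for position, trail in enumerate(ranking):
            reviews[trail] += NEW_REVIEWS_BY_POSITION[position]
    return reviews


start = [12, 10, 10, 9]  # four hypothetical trails with near-identical popularity
final = simulate_ranked_listing(start, rounds=10)

print("start:", start)
print("final:", final)
print("leader share: %.0f%% -> %.0f%%" % (100 * start[0] / sum(start), 100 * final[0] / sum(final)))
```

Nothing in this toy model depends on trail quality. A small early lead in review count is enough to lock in the top spot, and two trails that start with identical counts end up far apart purely because of where they happened to sit in the list.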

So what makes a trail popular? Ultimately, AllTrails does.

It’s time we break out of that feedback loop, and find some amazing alternative hikes where we can avoid the crowds. But how will you know if a trail is going to be worth your time? Well, I used Machine Learning to do that work for you.

I fit the best model on a subset of trails which were designated as being “lightly trafficked”, and the R² for these trails was 0.08. This was actually encouraging: these are specifically trails which aren’t popular but which, according to the model and given their features, should be.
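
One plausible way to turn such a model into a list of hidden gems is to rank lightly trafficked trails by how far their predicted popularity exceeds their actual review count. The article does not spell out its exact ranking rule, so treat this as a sketch: the coefficients, trail names, and feature values below are all hypothetical, not the fitted values from this project.

```python
# Hypothetical linear-model coefficients, NOT the fitted values from the article.
INTERCEPT = 50.0
COEFFICIENTS = {"fee": 120.0, "sentiment": 200.0, "rocky": 40.0, "no_shade": 30.0}

# Made-up lightly trafficked trails with their features and actual review counts.
trails = [
    {"name": "Trail C", "fee": 0, "sentiment": -0.5, "rocky": 0, "no_shade": 1, "reviews": 15},
    {"name": "Trail A", "fee": 0, "sentiment": 0.5, "rocky": 1, "no_shade": 1, "reviews": 20},
    {"name": "Trail B", "fee": 1, "sentiment": 0.25, "rocky": 0, "no_shade": 0, "reviews": 90},
]


def predicted_reviews(trail):
    """Linear prediction: intercept plus coefficient-weighted feature values."""
    return INTERCEPT + sum(coef * trail[feature] for feature, coef in COEFFICIENTS.items())


def rank_hidden_gems(trails):
    """Sort by how far predicted popularity exceeds actual popularity."""
    return sorted(trails, key=lambda t: predicted_reviews(t) - t["reviews"], reverse=True)


ranked = rank_hidden_gems(trails)
for trail in ranked:
    gap = predicted_reviews(trail) - trail["reviews"]
    print(trail["name"], "predicted:", predicted_reviews(trail), "actual:", trail["reviews"], "gap:", gap)
```

Trails at the top of this ranking have the features of a popular trail without the crowds, which is the sense of “hidden gem” used in this post.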

A potential area of future work for this project could be fitting a polynomial features model instead of a linear one. Early exploration of this method yielded a promising R² improvement to 0.26, but it also induced some feature collinearity by duplicating features, which would need to be engineered out. I’m looking forward to continuing this work once I have more machine learning tools at my disposal! But I’m absolutely thrilled to present you with this list of the best lesser-known trails in America as my very first end-to-end data science project.
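
One plausible source of the duplicated features (an assumption on my part; the article does not say) is that squaring a one-hot column reproduces the same column, since 0² = 0 and 1² = 1. The sketch below builds degree-2 terms by hand for two made-up 0/1 columns and flags the new terms that are exact copies of an original column.

```python
from itertools import combinations_with_replacement

# Made-up one-hot columns (1 = trail has the feature), for illustration only.
features = {
    "waterfall": [1, 0, 1, 0],
    "rocky": [1, 1, 0, 0],
}


def degree_two_terms(features):
    """Build every degree-2 product of the input columns (squares and pairs)."""
    terms = {}
    for a, b in combinations_with_replacement(features, 2):
        terms[f"{a}*{b}"] = [x * y for x, y in zip(features[a], features[b])]
    return terms


def duplicated_columns(features, terms):
    """List new terms whose values are identical to an original column."""
    return [
        (term, name)
        for term, values in terms.items()
        for name, column in features.items()
        if values == column
    ]


terms = degree_two_terms(features)
print(duplicated_columns(features, terms))
```

Both squared terms come back as duplicates, while the `waterfall*rocky` interaction is genuinely new. Dropping the squares of binary columns and keeping only the interaction terms would be one way to engineer this collinearity out.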

Hike the Trails

Check out the hidden gems in your state below!

Translated from: https://towardsdatascience.com/hidden-gems-finding-the-best-secret-trails-in-america-d9203e8ad073
