Hidden Gems: Finding the Best Secret Trails in America

There are plenty of reasons to seek solitude in the wilderness, from the therapeutic effects of being immersed in nature to not wanting to contribute to trail degradation and soil erosion on busier trails.

Now more than ever, the reprieve of the outdoors is greatly needed. But in a post-COVID-19 world, where it can be practically impossible to maintain proper social distancing when passing hikers on a narrow trail, it is especially important to find less-frequented trails to hike.

I set out on a mission to use data science and machine learning to find the best little-known trails in America. You can check out the code on my GitHub if you want to jump into the nitty-gritty, or read on for the analysis and a list of the hidden gems in your state!

The Approach

If you’re anything like me, before you go anywhere or buy anything, you’re going to read all the reviews. For hikers looking for trails, a popular place to discover where to go is AllTrails.com.

When I first approached this project, I wanted to answer the question, “What makes a trail good?” That is, what combination of features and statistics about a trail leads to a high overall rating?

What I found out pretty quickly, though, is that across the 35,000 trails I scraped and analyzed, basically all of them were rated “pretty good.” With an average user rating of 4.2 out of 5 stars and a standard deviation of less than 0.6, it was really hard to tell which trails were excellent and which were just okay from the star rating alone.

What did vary hugely across trails, though, was popularity, as measured by each trail’s total number of reviews. While the vast majority of trails had only 100 or so reviews, a select few had several thousand! What was making these trails so popular?

I thus pivoted: rather than predicting a trail’s rating, I would use a data-driven model to determine the relationship between a trail’s features and its popularity. Having found what popular trails have in common, I could then apply that model to unpopular trails to find the ones that check all the same boxes and are likely to be great, even though they haven’t been discovered yet.

Methodology

  1. With Selenium and Beautiful Soup, scrape AllTrails.com to obtain data on 35,000 trails in the United States. This included each hike’s length, elevation gain, and location, plus a list of all the features the trail had (such as waterfalls, wildflowers, or paving).
  2. Clean this data and load it into a Pandas DataFrame. This included one-hot encoding all of the categorical feature columns as dummy variables.
  3. Use the VADER Sentiment Analysis module to analyze each trail’s text reviews via simple natural language processing and compute a mean compound score per trail.
  4. Use linear regression, including Statsmodels OLS, to determine the relationship between a trail’s features and its popularity.
  5. Perform feature engineering and regularization via LassoCV to reduce multicollinearity among those features and optimize the model.
  6. Apply that model to trails described as “lightly trafficked” to find trails that would be expected to be popular based on their combination of features but just haven’t been discovered yet.
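
To make the encoding step concrete, here is a minimal stdlib-only sketch of turning each trail’s list of feature tags into 0/1 dummy columns. The records, field names, and the `one_hot_features` helper are all hypothetical; the original project used Pandas for this, where `pandas.get_dummies` is the usual tool for a single categorical column.

```python
from typing import Dict, List

# Hypothetical scraped records; field names are illustrative only.
TRAILS = [
    {"name": "Quiet Ridge", "length_miles": 4.2, "features": ["waterfall", "wildflowers"]},
    {"name": "Granite Spur", "length_miles": 7.5, "features": ["rocky", "scramble"]},
    {"name": "Pine Loop", "length_miles": 2.1, "features": ["wildflowers", "paved"]},
]


def one_hot_features(trails: List[Dict]) -> List[Dict]:
    """Expand each trail's list of feature tags into 0/1 indicator columns."""
    # Collect every tag seen in the data so all rows share the same columns.
    vocabulary = sorted({tag for trail in trails for tag in trail["features"]})
    encoded = []
    for trail in trails:
        row = {"name": trail["name"], "length_miles": trail["length_miles"]}
        for tag in vocabulary:
            row[f"feature_{tag}"] = 1 if tag in trail["features"] else 0
        encoded.append(row)
    return encoded


encoded = one_hot_features(TRAILS)
print(encoded[0])
```

Every row ends up with the same set of `feature_*` columns, which is what a regression model needs as input.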

Findings

A linear regression model was fit to the trails’ stats, with the number of reviews (and hence popularity) as the target variable. The model yielded a list of the features with the most influence on a trail’s popularity. These included charging a fee, having a high sentiment analysis score, being rocky, and having a scramble and no shade, among others.
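
As a rough illustration of how a linear model can rank features by influence, the sketch below fits ordinary least squares by solving the normal equations in plain Python and then sorts the coefficients by magnitude. The data is synthetic and noise-free, the feature names and weights are made up, and the hand-rolled solver is only a stand-in for the Statsmodels OLS fit used in the project.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: features are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Return [intercept, coef_1, ..., coef_k] minimising squared error."""
    design = [[1.0, *row] for row in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic example: every combination of three hypothetical binary features.
FEATURES = ["fee", "rocky", "shade"]
rows = [[f, r, s] for f in (0, 1) for r in (0, 1) for s in (0, 1)]
reviews = [100 + 400 * f + 150 * r - 80 * s for f, r, s in rows]  # made-up weights

coefs = fit_ols(rows, reviews)
ranked = sorted(zip(FEATURES, coefs[1:]), key=lambda p: abs(p[1]), reverse=True)
for name, weight in ranked:
    print(f"{name}: {weight:+.1f}")
```

Sorting by absolute value treats strongly negative features (here, shade) as influential too, which matches how “no shade” shows up among the important features.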

I interpret those important features like this:

  • A fee: If the most popular trails charge a fee, they are likely located inside National Parks. As many National Parks are closed due to COVID, or may be very busy, it is even more important to find alternatives.

  • Sentiment analysis score: Since all trails have roughly the same score out of 5 stars, it’s hard to gather much reliable information about their quality from this rating alone. By using natural language processing to analyze the written reviews themselves, I gained a genuinely useful metric for how people actually feel about a trail. The higher the score (on a scale from -1, very negative, to +1, very positive), the more positively people felt about the trail, which was super useful in finding hidden gems.

  • Rocky/scramble/no shade: What this says to me is that the very popular trails are above the tree line! It’s on those more difficult hikes with higher elevation gain that you encounter these features. And with higher elevation, you’ll likely get better views! As it turns out, people love these tougher trails.
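
The per-trail sentiment score described above is an average of per-review scores. The sketch below shows that aggregation with a tiny, hypothetical word lexicon as the scorer; it is a toy stand-in and not VADER, which uses a much larger lexicon plus rules for negation, punctuation, and intensifiers. The trail names and reviews are invented.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"amazing": 1.0, "beautiful": 0.8, "great": 0.6,
               "crowded": -0.5, "muddy": -0.4, "boring": -0.7}


def toy_score(text: str) -> float:
    """Return a crude sentiment score in [-1, 1] from the toy lexicon."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return mean(hits) if hits else 0.0


def mean_sentiment(reviews_by_trail: Dict[str, List[str]],
                   score: Callable[[str], float] = toy_score) -> Dict[str, float]:
    """Average the per-review scores for each trail that has reviews."""
    return {trail: mean(score(r) for r in reviews)
            for trail, reviews in reviews_by_trail.items() if reviews}


REVIEWS = {
    "Quiet Ridge": ["Amazing views, beautiful wildflowers!", "Great hike."],
    "Busy Falls": ["Crowded and muddy.", "Boring."],
}
scores = mean_sentiment(REVIEWS)
print(scores)
```

With the `vaderSentiment` package installed, the same function could take `score=lambda t: SentimentIntensityAnalyzer().polarity_scores(t)["compound"]` in place of the toy scorer.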

The R² of this model was optimized to 0.19. Though this isn’t a very high score, you can see below that this is because the relationship between trail features and popularity simply isn’t linear. The residuals plot below, which shows the difference between the predicted and actual popularity values, demonstrates this pretty clearly: if the relationship were linear, the residuals would all fall in a fairly horizontal band around 0! So what’s actually determining a trail’s popularity, if not having all the right features of a popular trail?
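
To see why a nonlinear relationship produces a low R² and patterned residuals, consider this small sketch. It fits a straight line to deliberately curved, synthetic data (not the trail data) and computes the residuals and R² by hand; the function names are illustrative.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals_and_r2(x: Sequence[float], y: Sequence[float]) -> Tuple[List[float], float]:
    """Residuals (actual minus predicted) and R² of the straight-line fit."""
    slope, intercept = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    y_bar = mean(y)
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return residuals, 1 - ss_res / ss_tot


# Synthetic, perfectly quadratic data: y depends on x, but not linearly.
x = [float(v) for v in range(-5, 6)]
y = [v ** 2 for v in x]

residuals, r2 = residuals_and_r2(x, y)
print(f"R^2 = {r2:.2f}")
print([round(r, 1) for r in residuals])
```

Here y is completely determined by x, yet the linear fit explains none of the variance, and the residuals trace a U shape instead of scattering evenly around 0.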

[Figure: residuals plot of predicted versus actual popularity]

My key finding was that AllTrails’ algorithm shows the trails with the most reviews first and foremost, which leads to a form of recursive confirmation bias. If all trails have roughly the same rating, users turn to the reviews to decide whether a trail is good and choose one with a lot of reviews, feeding the loop that makes the few busiest trails even busier. Meanwhile, other similar trails may have plenty to offer but go neglected.
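
The feedback loop can be illustrated with a deliberately simplified, deterministic simulation. It assumes that each round the site lists trails by review count and that only the top few listings attract new reviews; the numbers and the `simulate_ranking_loop` helper are hypothetical and say nothing about how AllTrails actually ranks results.

```python
from typing import List, Sequence


def simulate_ranking_loop(reviews: Sequence[int], rounds: int,
                          new_reviews_by_rank: Sequence[int] = (5, 3, 2)) -> List[int]:
    """Each round, list trails by review count and credit the top slots."""
    counts = list(reviews)
    for _ in range(rounds):
        # Trail indices ordered by current review count, most reviewed first.
        # sorted() is stable, so ties keep their original order.
        order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
        for rank, gained in enumerate(new_reviews_by_rank):
            if rank < len(order):
                counts[order[rank]] += gained
    return counts


# Six trails that start out almost equally reviewed.
start = [12, 11, 10, 10, 10, 10]
final = simulate_ranking_loop(start, rounds=100)
print(final)  # → [512, 311, 210, 10, 10, 10]
```

A head start of one or two reviews is enough to lock in the ordering, while the bottom three trails never gain a single review.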

So What Makes a Trail Popular?

There are tens of thousands of hikes listed on AllTrails.com, but its search algorithm always offers viewers the most popular hikes first. Trails with the most reviews get hiked the most, and hence get even more reviews, while lesser-known trails may be just as good but are harder to find on the website, and with so few ratings it’s hard to know for sure whether they’ll be worth hiking.

So what makes a trail popular? Ultimately, AllTrails does.

It’s time we break out of that feedback loop and find some amazing alternative hikes where we can avoid the crowds. But how will you know if a trail is going to be worth your time? Well, I used machine learning to do that work for you.

I fit the best model on the subset of trails designated as “lightly trafficked,” and the R² for these trails was 0.08. This was actually encouraging, considering that these are specifically trails that aren’t popular but, according to the model and given their features, should be.
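
One way to turn this into a shortlist is to score only the lightly trafficked trails with the fitted model and sort by predicted popularity. The sketch below assumes a linear model with hypothetical coefficients and invented trail records; it does not reproduce the project’s actual model or selection rule.

```python
from typing import Dict, List

# Hypothetical coefficients, as if taken from a fitted linear model.
COEFFICIENTS = {"intercept": 50.0, "sentiment": 300.0, "rocky": 120.0, "shade": -60.0}

# Invented trail records for illustration.
TRAILS = [
    {"name": "Granite Spur", "traffic": "light", "sentiment": 0.8, "rocky": 1, "shade": 0},
    {"name": "Pine Loop", "traffic": "light", "sentiment": 0.3, "rocky": 0, "shade": 1},
    {"name": "Angel Steps", "traffic": "heavy", "sentiment": 0.9, "rocky": 1, "shade": 0},
    {"name": "Mesa Rim", "traffic": "light", "sentiment": 0.6, "rocky": 1, "shade": 1},
]


def predicted_reviews(trail: Dict) -> float:
    """Predicted review count from the linear model's coefficients."""
    return COEFFICIENTS["intercept"] + sum(
        weight * trail[name]
        for name, weight in COEFFICIENTS.items() if name != "intercept")


def hidden_gems(trails: List[Dict], top_n: int = 3) -> List[str]:
    """Lightly trafficked trails, ordered by predicted popularity."""
    quiet = [t for t in trails if t["traffic"] == "light"]
    quiet.sort(key=predicted_reviews, reverse=True)
    return [t["name"] for t in quiet[:top_n]]


print(hidden_gems(TRAILS, top_n=2))  # → ['Granite Spur', 'Mesa Rim']
```

Trails that are already heavily trafficked are filtered out before ranking, so the list only surfaces quiet trails that resemble popular ones.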

A potential area of future work for this project is fitting a polynomial features model instead of a linear one. Early exploration of this method yielded a promising R² improvement to 0.26, but it also induced some feature collinearity by duplicating features, which would need to be engineered out. I’m looking forward to continuing this work once I have more machine learning tools at my disposal! But I’m absolutely thrilled to present you with this list of the best lesser-known trails in America as my very first end-to-end data science project.
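
One plausible way a polynomial expansion duplicates features (an assumption here, not something the write-up spells out) is that squaring a 0/1 dummy column returns the same column, since 0² = 0 and 1² = 1. The sketch below builds a degree-2 expansion by hand on made-up binary data and reports which columns are identical; scikit-learn’s `PolynomialFeatures` likewise includes squared terms unless `interaction_only=True` is set.

```python
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple


def degree2_columns(rows: List[Dict[str, int]]) -> Dict[str, Tuple[int, ...]]:
    """Original columns plus every pairwise product, including squares."""
    names = sorted(rows[0])
    columns = {name: tuple(row[name] for row in rows) for name in names}
    for a, b in combinations_with_replacement(names, 2):
        label = f"{a}^2" if a == b else f"{a}*{b}"
        columns[label] = tuple(row[a] * row[b] for row in rows)
    return columns


def duplicated_columns(columns: Dict[str, Tuple[int, ...]]) -> List[Tuple[str, str]]:
    """Pairs of distinct labels whose values are identical in every row."""
    labels = list(columns)
    return [(p, q) for i, p in enumerate(labels) for q in labels[i + 1:]
            if columns[p] == columns[q]]


# Made-up one-hot data for two binary trail features.
ROWS = [
    {"rocky": 1, "shade": 0},
    {"rocky": 0, "shade": 1},
    {"rocky": 1, "shade": 1},
    {"rocky": 0, "shade": 0},
]

duplicates = duplicated_columns(degree2_columns(ROWS))
print(duplicates)  # → [('rocky', 'rocky^2'), ('shade', 'shade^2')]
```

Dropping the squared terms for binary columns, or keeping only interaction terms, would remove this particular source of perfect collinearity.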

Hike the Trails

Check out the Hidden Gems in your State below!

[Images: hidden-gem trail lists by state]

Translated from: https://towardsdatascience.com/hidden-gems-finding-the-best-secret-trails-in-america-d9203e8ad073
