How to Implement a Successful Data Cleaning Process

Clean data is the foundation of discovery and insight. The extreme effort your team puts forth to analyze, cultivate and visualize data is a complete waste of time if the data is dirty. Of course, dirty data isn’t new. It plagued decisions long before computers became commonplace. And now that computer technology is pervasive in everyday life, the problem has only compounded.

The first thing a company needs to determine is whether or not they have dirty data in their midst. Fortunately, this is easily done. The answer is ‘Yes.’ Everyone has dirty data and that means you have dirty data. Now that we’re over that hurdle, the next two questions we must pose are more challenging to answer, “Which data is dirty?” and “How do we clean our data?”

Over the years I’ve seen many companies and teams mount a herculean effort to clean their data. These efforts involve large dedicated teams, major project schedules, and months or years of work. All of them have ended in much the same way: failure. At least, the original goal of having the company’s data clean by the end of the project was not met.

The truth is that the shape of the project tends to change. It morphs from a linear project, with a beginning, an end, and an expected deliverable of clean data, into a cyclical process that persists indefinitely. Even so, the ultimate goal of clean data for analysis and insight is achieved.

An example of failure…

Before I explain the steps involved in a successful cyclical process for cleaning data, let’s take a moment to explore why gargantuan data cleaning projects like those described above will always fail. The linear project approach to cleaning data rests on an inherent assumption that leads to repeated failure: it assumes the data you are cleaning is largely static. New data will enter your system, but it will be entered correctly and will not be dirty. And if dirty data is somehow introduced into the source systems again, the processes you have put in place are assumed to account for every form of dirty data that could come your way in the future.

This is not a realistic picture of how source systems work. For example, let’s say we have software used by the human resources department. It allows the HR staff to enter employee names and track their progress through all the on-boarding requirements from initial date of hire to the employee being fully trained for their position.

This software has been written (or at least customized) by a 3rd party software development team your company used to tailor the software to its particular on-boarding processes. The software saves all data into a SQL transactional database, as we would expect. One of the fields in the database stores each employee’s score for each of their assessments throughout the on-boarding process. Of course, if a particular assessment has not been completed, no score would be entered. Your initial review of the data reveals the incomplete assessments contain an empty string in the score field whereas the completed assessments contain an actual score value in the field.

Your team implements cleaning transformations to ensure these empty string values are changed into NULL values in your analytics data tables so that you can easily find and ignore them during your calculations. Problem solved! Or so you think…

Months after your data cleaning efforts are complete, one of the HR employees finds an issue in the HR software itself and they raise the concern with the software development team. The 3rd party team chooses to resolve the problem by changing the behavior of their software to no longer use an empty string value for incomplete assessment scores, but they now use a zero as the incomplete placeholder. At first, this seems to solve the problems for HR, but soon their reports from the analysts begin reflecting extremely low score averages across the onboarding groups.

When your team is finally asked to investigate, you realize the issue comes from the averaging method being used. Because empty strings were originally the incomplete indicators and you converted them to NULLs, the scores could be averaged with a simple average function, which ignores NULLs. It does not ignore zeros. The averages therefore now include the zeros from the incomplete assessments.
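The failure above can be reproduced in a few lines. The sketch below is only an illustration, not the HR system's actual code: the function names and sample scores are made up, and Python's `None` stands in for a database NULL.

```python
from statistics import mean


def clean_scores(raw_scores):
    """Original cleaning rule: empty string -> None; anything else -> float."""
    return [None if s == "" else float(s) for s in raw_scores]


def average_score(scores):
    """Average the scores, skipping None the way a SQL AVG skips NULL."""
    completed = [s for s in scores if s is not None]
    return mean(completed) if completed else None


# Hypothetical data: two completed assessments and two incomplete ones.
before_change = clean_scores(["90", "80", "", ""])    # vendor writes ""
after_change = clean_scores(["90", "80", "0", "0"])   # vendor now writes 0

print(average_score(before_change))  # 85.0
print(average_score(after_change))   # 42.5
```

The cleaning rule itself never changed. The source system changed underneath it, and the rule silently stopped covering the incomplete case.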

Back to the drawing board!

A cyclical cleaning process will work

While the above example is a simple one and real life problems are much more complex, it clearly demonstrates the problems inherent with linear data cleaning approaches.

In reality, we constantly change and add to the processes that affect data throughout our business, and each change introduces new opportunities for problematic data. To make matters worse, we not only change the processes, systems, sources, fields, and numerous other elements involved in collecting data; we are also adding new data at an ever-increasing rate.

By the time you’ve created a method for cleaning the data and implemented it, many or all of these factors have changed — many times over.

The linear data cleaning path is doomed to failure and should be abandoned. Instead, let’s begin with a systematized version of the cyclical process everyone ends up adopting anyway. Rather than wasting time, energy, and resources on the linear version, only to land on a disorganized cyclical method compounded by your team’s feeling of failure, you can structure the cyclical process correctly from the beginning. Everyone will be successful, know why they are successful, and do their work with a feeling of success. Doesn’t that sound better?

Think of Your Cleaning Process like ER Triage

The cyclical approach to data cleaning is rather simple. It’s composed of five stages similar to those of a hospital emergency room. While hospitals vary in their exact implementation of ER procedures, the same basic stages apply. The first phase is triage. In this phase, medical professionals assess the patient, assign the patient a priority based on the severity of their condition, and assign them to the proper group for treatment. The second phase is treatment. It consists of actually treating the patient and assigning any follow-up that needs to occur both before and after the patient’s discharge.

Using these ER and triage processes as a guide, let’s consider what cyclical data cleaning looks like.

Analyze: A Patient Walks into the ER

People don’t just show up to the ER to hang out or get a cup of coffee. They are there for a reason. The context of the visit is unmistakable even if the cause of their ailment is yet to be diagnosed. The same is true for data cleaning.

One of the difficulties with the more traditional linear approach to data cleaning is the abstract nature of the situation. Let’s think about this for a minute. If we’re looking at data in our database but not actually trying to analyze anything for the company (e.g., monthly sales trends or attrition), dirty data may or may not be obvious. But once we begin calculating revenue, profit, churn, and the other typical items our business is asking about, dirty data rears its ugly head. Data, out of context, can easily pass for clean data. So, in the linear approach, we often miss many data fields that actually contain dirty data. The resulting “clean” data at the end of a linear cleaning project will need to be revisited the first time an analyst discovers the numbers in the profit column simply do not add up.

Using the cyclical process of data cleaning, we begin with analysis. Just go ahead and turn the analysts loose in the data. Tell them to make requests of the data engineers, work their analytical magic, and not be bashful about raising questions when something in the data doesn’t look correct. This is help and context our cleaning process desperately needs in order to correctly and holistically clean the data we currently have.

Whenever an analyst finds something concerning the data, this becomes a reason to send the patient to the ER. It’s the stomach pain or shortness of breath or high fever that brings the data to you for cleaning. Of course, once an analyst sends a “data” patient to your ER, your data triage staff must be ready to perform.

Assess: Stick out your tongue and say ‘Ahhh.’

When someone walks into the emergency room, the first thing the medical team does is assess the patient. They take their temperature, blood pressure, itemize a list of the medications the patient is taking, get a description of the symptoms, and so on. The same is true when beginning to assess a “data” patient.

Of course in a medical situation, most of this information is compared to the known range of normal. Nurses and doctors know if your temperature or blood pressure is higher than the acceptable range. Those normals have already been established. But in a data situation, the normals may need to be established before you can verify the patient’s health.
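One way to start establishing those normals is a quick profile of the suspect column. The sketch below is a minimal illustration under stated assumptions: `profile_column` and the sample values are hypothetical, and a real assessment would be agreed with the analyst and the business rather than inferred from counts alone.

```python
from collections import Counter


def profile_column(values):
    """Summarize a column so the team can discuss what 'normal' looks like."""
    non_null = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "empty_strings": sum(1 for v in non_null if v == ""),
        "zeros": sum(1 for v in numeric if v == 0),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "types": dict(Counter(type(v).__name__ for v in non_null)),
    }


# Hypothetical score column mixing numbers, an empty string, and a NULL.
vitals = profile_column([88, 92.5, "", None, 0, 75])
print(vitals)
```

A sudden change in any of these counts between runs, such as zeros appearing where empty strings used to be, is the kind of symptom worth raising with the analyst.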

Your “data” triage team will assess the patient sent to them by the analyst (the referring physician). In this stage of our cyclical process, your team will work alongside the analyst and possibly other business team members to verify the data is indeed dirty. You will also want to assess the business cases using this data and its impact on the company or subsections of the company.

A first example of a question to ask in this phase is, “What should this data look like: summed, averaged, used as a dimension, or something else?” Without a proper understanding of what the data should be, we have a very high likelihood of making changes that do not clean the data. We may merely change the data into another form of dirty data.

Next, you will need to understand the impact and importance of the data. Does the CEO or CFO use this data to make market decisions, product decisions, report company progress to the street? Is this data used by customer care to better your clients’ experience? Does marketing use this data to plan their next ad strategy? Is this data stored and used for trend analysis later this year for the board of directors? There are endless possibilities, but the impact and importance of the data will help you properly prioritize this patient.

Assign Priority: Even emergencies have degrees of severity

Even patients in the ER have a varying degree of urgency. One patient may be feeling nauseous and another having extreme chest and arm pain. Yet another may be unconscious or suffering from burns. While they all need medical attention, there are only so many medical professionals present to assist them and the worst situation will merit the top priority. Of course, many factors are used to determine the severity of someone’s ailment and the consequential priority assigned. The same is true for “data” patients.

Once you understand, in context, what the data should be and its importance and impact, you will need to assign it a priority. All teams have limited resources and your data cleaning team is no different. This exposes yet another problem with the linear cleaning approach. When you linearly clean data every identified occurrence of dirty data tends to get equal priority because it becomes part of a large project that will consider the data clean at the end of the project. In today’s world, most teams have very limited resources. Undertaking all problems with equal priority significantly delays the team releasing the most important and impactful results.

Your data team will need to establish its own rules for prioritization. Perhaps anything coming from a customer’s request or trouble ticket jumps to the top of the list. Or maybe C-suite requests receive top priority. Every company culture has its own dynamic, and your team should work with company leaders to determine the best course of action for prioritizing requests. Once those criteria are defined, your team should use the context, impact, and importance gathered in the assessment, along with the established criteria, to assign each request a priority.
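As a rough illustration of combining the assessment with agreed criteria, the sketch below scores each request and sorts the queue. Everything in it is an assumption made for the example: the requester weights, the 1-to-5 scales, and the scoring formula are placeholders that your team and company leaders would replace with their own rules.

```python
from dataclasses import dataclass

# Hypothetical weights; here customer-originated requests outrank the C-suite.
REQUESTER_WEIGHT = {"customer": 3, "c_suite": 2, "internal": 1}


@dataclass
class CleaningRequest:
    name: str
    requester: str
    impact: int      # 1 (low) to 5 (high), from the assessment
    importance: int  # 1 (low) to 5 (high), from the assessment

    def priority(self) -> int:
        """Higher means more urgent; unknown requesters get weight 1."""
        weight = REQUESTER_WEIGHT.get(self.requester, 1)
        return weight * (self.impact + self.importance)


def triage(requests):
    """Return requests ordered from most to least urgent."""
    return sorted(requests, key=lambda r: r.priority(), reverse=True)


queue = triage([
    CleaningRequest("onboarding scores", "internal", impact=4, importance=4),
    CleaningRequest("billing totals", "customer", impact=3, importance=4),
    CleaningRequest("board trend report", "c_suite", impact=5, importance=4),
])
print([r.name for r in queue])
# ['billing totals', 'board trend report', 'onboarding scores']
```

The point of writing the rule down, even this crudely, is that every request gets ranked the same way regardless of who is on shift.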

Determine the Proper Processes and Team(s) to Clean the Data

Hospitals often employ an ER doctor for each shift. This doctor will see every patient and address every condition coming into the ER during her shift. However, should an ER patient come in who requires a specialist, the shift doctor will work to stabilize the patient and then assign him to a department. The doctor on call for that department will take over the case. If a patient requires multiple specialists, multiple departments can be assigned to assist in the overall treatment plan.

The practice of assigning a generalist, or possibly multiple specialists, is the same when cleaning data. You must determine the best teams to involve for the best outcome. To do this, your team of data specialists (perhaps the generalists) needs to learn more about the dirty data in question.

Begin by determining the sources for this data and any transformations the data undergoes before it’s ultimately saved in the database for the analyst to use. Again, your team may need to secure the assistance of data engineers, data wranglers, database administrators, or more business team members to properly assess the patient. You might even enlist the help of a software developer if you determine the application related to the source data is incorrectly saving the data to the database. Do not be afraid to work with others and ask for assistance.

By outlining the source systems for the data and any ETL, you can more easily and quickly acquire the necessary resources from the correct teams. If it’s accounting data, you may need to engage the data team for those servers. Or perhaps if it’s the geoscience data, you need to reach out to some of the geo-scientists to understand the ETL process they helped design with the data engineers. You get the idea.

Performing this upfront work places your “data” patient in the best possible care and ensures the quickest “recovery” or cleaning time.

Determine the Necessary Steps to Clean the Data

With the proper team in place, the treatment of the patient can now begin. The assembled team will need to decide how to best clean the data for use and storage in the database used by the analysts. Of course, this could be a fast process or it could take some time depending on the complexity of existing ETL or the dirty nature of the source systems involved.

Once the necessary steps have been identified, tested, and agreed upon, they need to be documented and, of course, implemented. It’s always best to use a development, test, production environment architecture and first implement the change into development. Then promote to the testing environment and only once everything is verified as correct, promote the final solution to the production environment. But these environments and steps to deployment differ from company to company and you’ll need to follow your organization’s outlined process.

Automate the Cleaning Steps You Just Implemented

In a hospital, once the patient has been treated and is considered well enough for discharge, they still may have follow-up appointments or tasks they need to perform like taking medication. With data cleaning, multiple follow-up tasks may also be necessary, but the one follow-up task that is always required involves automation.

No matter what your company’s change process is, the one thing you must do is make the changes persistent through automation. Automation comes in many forms. You may need to change the existing ETL process or introduce an automated process that cleans the data after ETL. Automation can be achieved through any language or system that works for you and your company: SQL, Python, C#, SAS, and the list goes on. A common automation system for companies using Microsoft products is SQL Server Integration Services (SSIS). Scheduling these tasks can be as simple as cron, Windows Task Scheduler, or SQL Server Agent. It doesn’t necessarily need to be sophisticated. But it needs to be automated.
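As a minimal sketch of what an automated cleaning step might look like, the example below applies the empty-string-to-NULL rule from the HR example as a repeatable function. The table name `assessment_scores`, its columns, and the use of an in-memory SQLite database are all assumptions made so the example runs on its own; in practice the function would connect to your analytics database and be invoked by whatever scheduler you already use.

```python
import sqlite3


def clean_assessment_scores(conn):
    """Convert empty-string scores to NULL. Safe to run on every schedule tick."""
    cur = conn.execute(
        "UPDATE assessment_scores SET score = NULL WHERE score = ''"
    )
    conn.commit()
    return cur.rowcount  # number of rows changed by this run


def build_demo_db():
    """Create a throwaway in-memory table with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE assessment_scores (employee_id INTEGER, score TEXT)"
    )
    conn.executemany(
        "INSERT INTO assessment_scores VALUES (?, ?)",
        [(1, "90"), (2, ""), (3, "")],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    print(clean_assessment_scores(demo))  # 2 rows cleaned on the first run
    print(clean_assessment_scores(demo))  # 0 rows on the second run
```

Because the step is idempotent, running it again changes nothing, which makes it safe to schedule at whatever interval new data arrives. Logging the returned row count on each run also gives an early signal when the source system's behavior changes.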

If you allow the cleaning process to remain manual, you will very quickly overwhelm your team with recurring manual work and hope of taking on new data cleaning efforts will be forfeited.

I’ve said this elsewhere, but the quickest way to render your team useless is to overwhelm them with recurring manual work. All exploratory, cleaning and initial data wrangling work is manual. But once the process is cleanly defined, it must be automated if your team has any hope of continuing to impact your company’s insights and decisions.

Repeat

Now that you’ve taken the cleaning process all the way through analyzing, identifying, assessing, prioritizing, assigning a team, establishing a cleaning process, and automating that process, it’s time to repeat the cycle. Analysts will confirm the result of your team’s work and be thankful, but they will also send you new “data” patients, and the process starts again for these new patients who are ill and need your healing touch.

A final reminder to formally establish your cyclical data cleaning process…

The doctors and nurses saving lives in the hospital emergency room aren’t just winging it. They’ve been training for years. They can practically perform their work in their sleep because it has been ingrained in them through hours of formalized training. Not only have they been diligently trained in medical procedures and knowledge; the policies and procedures the ER uses to triage, admit, prioritize, and ultimately treat a patient have also been thoroughly studied and formalized, in order to give any patient coming through the doors the best chance of survival.

Medical personnel and hospital administrators know that a strong, formalized plan leaves much less chance of error and that leads to greater success. The exact same benefits from formalizing a plan are true for data cleaning.

As outlined in the opening of this article, large linear projects for data cleaning typically end up with this same cyclical process. The linear process failed them, and because the work must still be accomplished, the team tackles each problem as it comes to light. But arriving at the process after a failed linear attempt often leaves it in an informal state, performed in an ad hoc manner.

Avoid this pitfall!

Without formalization, the process outcomes will ebb and flow between success and failure. The successful cleaning of each newly discovered dirty data issue will always be up for grabs. Your team will work inconsistently and their motivation for the job will come and go. Even if your ad hoc approach begins to work well over time as your team gels, it’s upended every time a team member leaves or a new member is hired. And here’s why…there’s no plan to follow.

Some days your team will knock it out of the park when they are feeling good about what they do. But other days the recurrent work and recollection of failure from the initial data cleaning project will push them to question their work and the nature of their job.

“Why can’t we create a process to fix all of this instead of constantly facing a fire every day?”

“Why doesn’t management care that we are already overworked? They just keep sending us more requests.”

“When will we ever catch up?”

If you take the time and make the effort to formalize the cyclical process of cleaning your data, your team will have a roadmap to follow, and the other departments in your organization will have a guide for interacting with your team. In many ways it’s a matter of perspective, but formalizing the process removes ambiguity and gives purpose to the work being done. Formalizing is a necessary step toward a consistently successful data cleaning outcome for your team.

Rod Castor helps companies Get Analytics Right! He works with both international organizations and small businesses to start or improve their efforts in data analytics, data science, tech strategy, and tech leadership. In addition to consulting, Rod also enjoys public speaking, teaching, and writing. You can discover more about Rod and his work at rodcastor.com and appliedai.us.

Translated from: https://towardsdatascience.com/how-to-implement-a-successful-data-cleaning-process-701e565e6575
