Nothing Fancy, but 16 Essential Operations to Get You Started With Pandas

Python has become the go-to programming language for many data scientists and machine learning researchers, and one essential data processing tool behind that choice is the pandas library. The pandas library is versatile enough to handle almost all of the initial data manipulation needed to get data ready for statistical analysis or for building machine learning models.

However, that same versatility can make it overwhelming to get comfortable with. If you’re struggling to get started, this article is for you. Rather than covering so much that the key points get lost, the goal of this article is to provide an overview of the operations you’ll want in your daily data processing tasks. Each operation comes with highlights of the essential parameters to consider.

Certainly, the first step is to get pandas installed, which you can do with pip or conda by following the instructions here. In terms of coding environment, I recommend Visual Studio Code, JupyterLab, or Google Colab, all of which require little effort to set up. When the installation is done, you should be able to import pandas into your project in your preferred coding environment.

import pandas as pd

If you run the above code without encountering any errors, you’re good to go.

1. Read External Data

In most cases, we read data from external sources. If our data are in a spreadsheet-like format, the following functions should serve the purpose.

# Read a comma-separated file
df = pd.read_csv("the_data.csv")

# Read an Excel spreadsheet
df = pd.read_excel("the_data.xlsx")
  • The header should be handled correctly. By default, reading assumes the first line of the data to be the column names. If there are no headers, you have to say so explicitly (e.g., header=None).

  • If you’re reading a tab-delimited file, you can use read_csv by specifying the tab as the delimiter (e.g., sep="\t").

  • When you read a large file, it’s a good idea to start by reading a small portion of the data. In this case, you can set the number of rows to be read (e.g., nrows=1000).

  • If your data involve dates, you can consider setting arguments to get the dates right, such as parse_dates and infer_datetime_format (see the sketch after this list).

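As a rough sketch of how these parameters can fit together (the file name, column names, and date column below are hypothetical placeholders, not from the original article):

# A hypothetical tab-delimited file with no header row and a date column;
# read only the first 1000 rows and parse the last column as dates
df = pd.read_csv(
    "the_data.tsv",
    sep="\t",
    header=None,
    names=["id", "value", "recorded_on"],
    nrows=1000,
    parse_dates=["recorded_on"],
    infer_datetime_format=True,
)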

2. Create Series

During the process of cleaning up your data, you may need to create Series yourself. In most cases, you’ll simply pass an iterable to create a Series object.

# Create a Series from an iterable
integers_s = pd.Series(range(10))

# Create a Series from a dictionary object
squares = {x: x*x for x in range(1, 5)}
squares_s = pd.Series(squares)
  • You can assign a name to the Series object by setting the name argument. This name will become the column name if the Series becomes part of a DataFrame object.

  • You can also assign an index to the Series (e.g., by setting the index argument) if you find it more useful than the default 0-based index. Note that the index’s length should match your data’s length.

  • If you create a Series object from a dict, the keys will become the index (see the sketch after this list).

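A small sketch of these arguments (the values and labels here are made up for illustration):

# A Series with a custom index and a name
temperatures = pd.Series(
    [21.5, 23.0, 19.8],
    index=["mon", "tue", "wed"],  # length must match the data
    name="temperature",           # becomes the column name inside a DataFrame
)
temperatures.to_frame()           # the resulting DataFrame has a "temperature" column

# When built from a dict, the keys become the index
roman = pd.Series({"I": 1, "V": 5, "X": 10}, name="value")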

3. Construct DataFrame

Oftentimes, you need to create DataFrame objects using Python built-in objects, such as lists and dictionaries. The following code snippet highlights two common use scenarios.

# Create a DataFrame from a dictionary of lists as values
data_dict = {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}
data_df0 = pd.DataFrame(data_dict)

# Create a DataFrame from a list of lists
data_list = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
data_df1 = pd.DataFrame(data_list, columns=tuple('abc'))
  • The first one uses a dict object. Its keys will become column names while its values will become values for the corresponding columns.

  • The second one uses a list object. Unlike the previous method, the constructed DataFrame will use the data row-wise, which means that each inner list will become a row of the created DataFrame object.

4. Overview of DataFrame

When we have the DataFrame to work with, we may want to take a look at the dataset at the 30,000-foot level. There are several common ways to do this, as shown below.

# Find out how many rows and columns the DataFrame has
df.shape

# Take a quick peek at the beginning and the end of the data
df.head()
df.tail()

# Get a random sample
df.sample(5)

# Get the information of the dataset
df.info()

# Get the descriptive stats of the numeric values
df.describe()
  • It’s important to check both the head and the tail especially if you work with a large amount of data, because you want to make sure that all of your data have been read completely.

  • The info() function will give you an overview of the columns in terms of data types and item counts.

  • It’s also useful to get a random sample of your data to check your data’s integrity (i.e., the sample() function).

5. Rename Columns

You may notice that some columns of your data don’t make much sense, or that their names are too long to work with, and you want to rename these columns.

# Rename columns using a mapping
df.rename({'old_col0': 'col0', 'old_col1': 'col1'}, axis=1)

# Rename columns by specifying the columns argument directly
df.rename(columns={'old_col0': 'col0', 'old_col1': 'col1'})
  • You need to specify axis=1 to rename columns if you simply provide a mapping object (e.g., dict).

  • Alternatively, you can specify the mapping object to the columns argument explicitly.

  • The rename function will create a new DataFrame by default. If you want to rename the DataFrame in place, you need to specify inplace=True.

6. Sort Data

To make your data more structured, you need to sort the DataFrame object.

# Sort data
df.sort_values(by=['col0', 'col1'])
  • By default, the sort_values function will sort your rows (axis=0). In most cases, we use columns as the sorting keys.

  • The sort_values function will create a sorted DataFrame object by default. To change it, use inplace=True.

  • By default, the sorting is based on ascending for all sorting keys. If you want to use descending, specify ascending=False. If you want to have mixed orders (e.g., some keys are ascending and some descending), you can create a list of boolean values to match the number of keys, like by=['col0', 'col1', 'col2'], ascending=[True, False, True].

  • The original index values will go with their old data rows, so in many cases you need to re-index. Instead of calling the reset_index function afterwards, you can set ignore_index to True, which will reset the index for you after the sorting is completed (see the sketch after this list).

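A minimal sketch combining these options (the column names are placeholders):

# Sort by col0 ascending and col1 descending, reset the index, and modify in place
df.sort_values(
    by=['col0', 'col1'],
    ascending=[True, False],
    ignore_index=True,
    inplace=True,
)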

7. Deal With Duplicates

It’s a common scenario for real-life datasets to contain duplicate records, whether from human mistakes or database glitches. We want to remove these duplicates because they can cause unexpected problems later on.

# To examine whether there are duplicates using all columns
df.duplicated().any()

# To examine whether there are duplicates using particular columns
df.duplicated(['col0', 'col1']).any()

The above calls return a boolean value telling you whether any duplicate records exist in your dataset. To find out the exact number of duplicate records, you can apply the sum() function to the boolean Series returned by duplicated() (Python treats True as the value 1), as shown below. One additional thing to note is the keep argument. When keep=False, every record involved in a duplication is marked as True; for instance, if two records duplicate each other, both are marked as True. When keep="first" or keep="last", the first or the last occurrence is kept (marked False) while the remaining occurrences are marked as True.

# To find out the number of duplicates
df.duplicated().sum()
df.duplicated(keep=False).sum()

To actually view the duplicated records, you need to select data from the original dataset using the generated duplicated Series object, as shown below.

# Get the duplicate records
duplicated_indices = df.duplicated(['col0', 'col1'], keep=False)
duplicates = df.loc[duplicated_indices, :].sort_values(by=['col0', 'col1'], ignore_index=True)
  • We get all duplicate records by setting the keep argument to be False.

  • To better view the duplicate records, you may want to sort the generated DataFrame using the same set of keys.

Once you have a good idea about the duplicate records in your dataset, you can drop them as shown below.

# Drop the duplicate records
df.drop_duplicates(['col0', 'col1'], keep="first", inplace=True, ignore_index=True)
  • By default, the kept record will be the first of the duplicates.

  • You’ll need to specify inplace=True if you want to update the DataFrame in place. BTW: many other functions have this option too; from here on, it will mostly go without saying.

  • As with the sort_values() function, you may want to reset index afterwards by specifying the ignore_index argument (a new feature in Pandas 1.0).

8. Handle Missing Data

Missing data are common in real-life datasets, whether because a measure wasn’t available or because human data-entry mistakes produced meaningless values that are treated as missing. To get an overall idea of how many missing values your dataset has, you’ve seen that the info() function tells us how many non-null values each column has. We can get information about data missingness in a more structured fashion, as shown below.

# Find out how many missing values for each column
df.isnull().sum()

# Find out how many missing values for the entire dataset
df.isnull().sum().sum()
  • The isnull() function creates a DataFrame of the same shape as your original DataFrame, with each value indicating whether the corresponding original value is missing (True) or not (False). As a related note, you can use the notnull() function if you want to generate a DataFrame indicating the non-null values.

  • As mentioned previously, a True value in Python is arithmetically equal to 1. The sum() function will compute the sum of these boolean values for each column (by default, it calculates the sum column-wise), which reflects the number of missing values in that column.

Once we have some idea about the missingness of the dataset, we usually want to deal with it. Possible solutions include dropping the records with any missing values or filling them with applicable values.

# Drop the rows with any missing values
df.dropna(axis=0, how="any")

# Drop the rows that do not have at least 2 non-null values
df.dropna(thresh=2)

# Drop the columns whose values are all missing
df.dropna(axis=1, how="all")
  • By default, the dropna() function operates on rows (i.e., axis=0). If you specify how="any", rows with any missing values will be dropped.

  • When you set the thresh argument, it requires that the row (or the column when axis=1) have at least that many non-missing values to be kept.

  • As with many other functions, when you set axis=1, you’re performing the operation column-wise. In this case, the above call will remove the columns that have all of their values missing.

Besides the operation of dropping data rows or columns with missing values, it’s also possible to fill the missing values with some values, as shown below.

# Fill missing values with 0 (or any other applicable value)
df.fillna(value=0)

# Fill the missing values with a customized mapping for columns
df.fillna(value={"col0": 0, "col1": 999})

# Fill missing values with the next valid observation
df.fillna(method="bfill")

# Fill missing values with the last valid observation
df.fillna(method="ffill")
  • To fill the missing values with specified values, you set the value argument either to a fixed value used everywhere or to a dict object that instructs the filling for each column.

  • Alternatively, you can fill the missing values using existing observations around the gaps, with either a backward fill or a forward fill.

9. Descriptive Statistics by Group

When you conduct machine learning research or data analysis, it’s often necessary to perform particular operations with some grouping variables. In this case, we need to use the groupby() function. The following code snippet shows you some common scenarios that apply.

# Get the count by group, a 2 by 2 example
df.groupby(['col0', 'col1']).size()

# Get the mean of all applicable columns by group
df.groupby(['col0']).mean()

# Get the mean for a particular column
df.groupby(['col0'])['col1'].mean()

# Request multiple descriptive stats
df.groupby(['col0', 'col1']).agg({
    'col2': ['min', 'max', 'mean'],
    'col3': ['nunique', 'mean']
})
  • By default, the groupby() function returns a GroupBy object, and aggregations on it use the grouping keys as the index. If you want the result as a regular DataFrame, you can call reset_index() on the aggregated result, or specify as_index=False in the groupby() call to create a DataFrame directly (see the sketch after this list).

  • The size() function is useful if you want to know the frequency of each group.

  • The agg() function allows you to generate multiple descriptive statistics. You can simply pass a set of function names, which will apply to all columns. Alternatively, you can pass a dict object with functions to apply to specific columns.

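For instance, these two hypothetical calls produce the same grouped means as a regular DataFrame rather than an index-labeled result:

# Convert the aggregated result back to a plain DataFrame
means_a = df.groupby(['col0']).mean().reset_index()

# Or ask groupby() for a DataFrame directly
means_b = df.groupby(['col0'], as_index=False).mean()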

10. Wide to Long Format Transformation

Depending on how the data are collected, the original dataset may be in the “wide” format — each row represents a data record with multiple measures (e.g., different time points for a subject in a research study). If we want to convert the “wide” format to the “long” format (e.g., each time point becomes a data row and thus a subject has multiple rows), we can use the melt() function, as shown below.

Wide to Long Transformation

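The original code snippet isn’t preserved in this copy, so here is a minimal sketch of such a melt() call; the subject, before_meds, and after_meds columns are hypothetical, chosen to match the df_wide and df_long names used later in this article:

# Hypothetical wide-format data: one row per subject, one column per measurement
df_wide = pd.DataFrame({
    'subject': [100, 101, 102],
    'before_meds': [65, 70, 58],
    'after_meds': [61, 64, 59],
})

# Unpivot: each measurement becomes its own row in the long format
df_long = pd.melt(
    df_wide,
    id_vars=['subject'],                       # identifier column(s)
    value_vars=['before_meds', 'after_meds'],  # columns holding the values
    var_name='measurement',                    # defaults to 'variable'
    value_name='value',
)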
  • The melt() function is essentially “unpivoting” a data table (we’ll talk about pivoting next). You specify the id_vars to be the columns that are used as identifiers in the original dataset.

  • The value_vars argument is set using the columns that contain the values. By default, those column names become the values of the variable (var_name) column in the melted dataset.

11. Long to Wide Format Transformation

The opposite operation to the melt() function is called pivoting, which we can realize with the pivot() function. Suppose that the “long” format DataFrame created above is called df_long. The following code shows how we can convert the long format back to the wide format, basically reversing the process in the previous section.

Long to Wide Transformation

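The original snippet isn’t preserved either; a sketch using the hypothetical df_long from the previous section could look like this:

# Pivot back: one row per subject, one column per measurement
df_wide_again = df_long.pivot(
    index='subject',        # rows
    columns='measurement',  # new column headers
    values='value',         # cell values
).reset_index()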

Besides the pivot() function, a closely related function is the pivot_table() function, which is more general than the pivot() function by allowing duplicate index or columns (see here for a more detailed discussion).

12. Select Data

When we work with a complex dataset, we need to select a subset of the dataset for particular operations based on some criteria. If you want to select some columns, the following code shows you how to do it. The selected data will include all the rows.

# Select a column
df_wide['subject']

# Select multiple columns
df_wide[['subject', 'before_meds']]

If you want to select certain rows with all columns, do the following.

# Select rows with a specific condition
df_wide[df_wide['subject'] == 100]

What if you want to select certain rows and certain columns? In that case, we should consider using the iloc or loc methods. The major difference between these methods is that the iloc method uses the 0-based position, while the loc method uses labels.

Data Selection

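The original snippet isn’t preserved here; the bullets below refer to a sketch along these lines (again using the hypothetical df_wide):

# Position-based selection: first two rows and first two columns
# (the stop index is excluded, like a regular Python slice)
df_wide.iloc[0:2, 0:2]

# Label-based selection of the same data
# (the stop label IS included with loc)
df_wide.loc[0:1, ['subject', 'before_meds']]

# Boolean selection: iloc needs the underlying numpy array of booleans
mask = df_wide['before_meds'] > 60
df_wide.iloc[mask.values, :]
df_wide.loc[mask, :]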
  • Each iloc/loc pair of calls above produces the same output.

  • When you use slice objects with iloc, the stop index isn’t included, just as with regular Python slice objects. However, slice objects do include the stop index with the loc method.

  • When you use a boolean array with iloc, you need to use the actual values (the values attribute, which returns the underlying numpy array). If you don’t do that, you’ll probably encounter the following error: NotImplementedError: iLocation based boolean indexing on an integer type is not available.

  • In this example, using labels with loc happens to select the same rows as positions would, because the default integer index makes the labels coincide with the positions. In other words, iloc will always use 0-based positions, regardless of the numeric values of the index.

13. New Columns Using Existing Data (map and apply)

Existing columns don’t always present the data in the format we want, so we often need to generate new columns from existing data. Two functions are particularly useful in this case: map() and apply(). There are many possible ways to use them to create new columns; for instance, the apply() function can take a more complex mapping function and can create multiple columns. I’ll just show the two most common use cases with the following rules of thumb. Let’s keep the goal simple: create one new column in either case.

  • If your data conversion involves just one column, simply use the map() function on the column (in essence, it’s a Series object).

  • If your data conversion involves multiple columns, use the apply() function.

Map and Apply

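The original snippet isn’t preserved; a minimal sketch of the two cases, still using the hypothetical df_wide columns, might be:

# map(): a conversion involving a single column (a Series)
df_wide['before_level'] = df_wide['before_meds'].map(
    lambda x: 'high' if x > 60 else 'low'
)

# apply(): a conversion involving multiple columns, accessed row-wise (axis=1)
df_wide['change'] = df_wide.apply(
    lambda row: row['before_meds'] - row['after_meds'],
    axis=1,
)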
  • In both cases, I used lambda functions, but you can use regular functions too. It’s also possible to provide a dict object to the map() function, which maps old values (the keys) to new values (the values).

  • For the apply() function, when we create a new column, we need to specify axis=1, because we’re accessing data row-wise.

  • For the apply() function, the example shown is intended for demonstration purposes, because I could’ve used the original columns to do a simpler arithmetic subtraction like this: df_wide['change'] = df_wide['before_meds'] - df_wide['after_meds'].

14. Concatenation and Merging

When we have multiple datasets, it’s necessary to put them together from time to time. There are two common scenarios. The first scenario is when you have datasets of similar shape, either sharing the same index or same columns, you can consider concatenating them directly. The following code shows you some possible concatenations.

# When the data have the same columns, concatenate them vertically
dfs_a = [df0a, df1a, df2a]
pd.concat(dfs_a, axis=0)

# When the data have the same index, concatenate them horizontally
dfs_b = [df0b, df1b, df2b]
pd.concat(dfs_b, axis=1)
  • By default, the concatenation performs an “outer” join, which means that if there are any non-overlapping index or columns, all of them will be kept. In other words, it’s like creating a union of two sets.

  • Another thing to remember is that if you need to concatenate multiple DataFrame objects, it’s recommended to collect them in a list and concatenate just once, rather than concatenating sequentially and generating intermediate DataFrame objects along the way.

  • If you want to reset the index for the concatenated DataFrame, you can set the ignore_index=True argument.

The other scenario is to merge datasets that have one or two overlapping identifiers. For instance, one DataFrame has id number, name and gender, and the other has id number and transaction records. You can merge them using the id number column. The following code shows you how to merge them.

# Merge DataFrames that have the same merging keys
df_a0 = pd.DataFrame(dict(), columns=['id', 'name', 'gender'])
df_b0 = pd.DataFrame(dict(), columns=['id', 'name', 'transaction'])
merged0 = df_a0.merge(df_b0, how="inner", on=["id", "name"])

# Merge DataFrames that have different merging keys
df_a1 = pd.DataFrame(dict(), columns=['id_a', 'name', 'gender'])
df_b1 = pd.DataFrame(dict(), columns=['id_b', 'transaction'])
merged1 = df_a1.merge(df_b1, how="outer", left_on="id_a", right_on="id_b")
  • When both DataFrame objects share the same key or keys, you can simply specify them (either one or multiple is fine) using the on argument.

  • When they have different names, you can specify the key for the left DataFrame (left_on) and the key for the right DataFrame (right_on).

  • By default, the merging will use the inner join method. When you want to have other join methods (e.g., left, right, outer), you set the proper value for the how argument.

15. Drop Columns

Although you can keep all the columns in the DataFrame by renaming them without any conflict, sometimes you’d like to drop some columns to keep the dataset clean. In this case, you should use the drop() function.

# Drop the unneeded columns
df.drop(['col0', 'col1'], axis=1)
  • By default, the drop() function uses labels to refer to columns or index, and thus you may want to make sure that the labels are contained in the DataFrame object.

  • To drop rows by index, you use axis=0. To drop columns, which I find to be more common, you use axis=1.

  • Again, this operation creates a new DataFrame object, and if you prefer changing the original DataFrame, you specify inplace=True.

16. Write to External Files

When you want to communicate data with your collaborators or teammates, you need to write your DataFrame objects to external files. In most cases, the comma-delimited files should serve the purposes.

# Write to a csv file, which will keep the index
df.to_csv("filename.csv")

# Write to a csv file without the index
df.to_csv("filename.csv", index=False)

# Write to a csv file without the header
df.to_csv("filename.csv", header=False)
  • By default, the generated file will keep the index. You need to specify index=False to remove the index from the output.

  • By default, the generated file will keep the header (e.g., column names). You need to specify header=False to remove the headers.

Conclusion

In this article, we reviewed the basic operations that you’ll find useful for getting started with the pandas library. As indicated by the article’s title, these techniques aren’t intended to handle data in a fancy way. Instead, they’re all basic techniques that let you process the data the way you want. Later on, you can probably find fancier ways to get some operations done.

Translated from: https://towardsdatascience.com/nothing-fancy-but-16-essential-operations-to-get-you-started-with-pandas-5b0c2f649068
