Installing and using EukCC, a genome-quality assessment tool for microbial eukaryotes

Introduction:

GitHub: EBI-Metagenomics/EukCC, a tool to estimate genome quality of microbial eukaryotes (https://github.com/EBI-Metagenomics/EukCC)

 

Installation:

Docker:

docker pull microbiomeinformatics/eukcc

Recommended: a conda environment:

conda install -c conda-forge -c bioconda "eukcc>=2"
# mamba is faster
mamba install -c conda-forge -c bioconda "eukcc>=2"
# or install via pip
pip install eukcc
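As a quick smoke test after installation, the sketch below creates a dedicated environment and checks that the eukcc entry point is on the PATH (the environment name eukcc is arbitrary):

# create a clean environment and confirm the eukcc command works
mamba create -n eukcc -c conda-forge -c bioconda "eukcc>=2"
conda activate eukcc
eukcc --help    # should list the subcommands, e.g. single and folder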

Database setup (if you run EukCC via Docker, remember to mount this directory into the container):

mkdir eukccdb
cd eukccdb
wget http://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc2_db_ver_1.1.tar.gz
tar -xzvf eukcc2_db_ver_1.1.tar.gz

Database download index: https://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/

 

Pay attention to the database version when downloading; in general, go with version 2:

Link: https://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc2_db_ver_1.2.tar.gz

 https://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc_db_v1.1.tar.gz

There is also a companion DIAMOND database, although it is not obvious which DIAMOND version was used to build it; if it does not work with your DIAMOND install, rebuild it from the downloaded database.

https://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/uniref50_20200213_tax.dmnd

If you are not sure where the database lives, or EukCC cannot find it, the simplest fix is to point an environment variable at the database directory:

export EUKCC2_DB=/path/to/.../eukcc2_db_ver_1.1
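When running from the Docker image instead, the database directory has to be mounted into the container. A hypothetical invocation is sketched below; the host paths are placeholders, and it assumes the image exposes the eukcc executable on its PATH:

docker run --rm \
    -v /path/to/eukccdb/eukcc2_db_ver_1.1:/db \
    -v "$PWD":/data \
    -e EUKCC2_DB=/db \
    microbiomeinformatics/eukcc \
    eukcc single --out /data/outfolder --threads 8 /data/bin.fa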

 

Quick start

# EukCC on a single MAG
# We assume that you have set $EUKCC2_DB to the correct location. If not, please use the --db flag to pass the database to EukCC.
eukcc single --out outfolder --threads 8 bin.fa
# EukCC will then run on 8 threads. You can pass nucleotide FASTAs or proteomes to EukCC; it will automatically detect whether it needs to predict proteins first.
# By default it will never use more than a single thread for placing the genomes in the reference tree, to save memory.

# EukCC on a folder of bins
eukcc folder --out outfolder --threads 8 bins
# EukCC will assume that the folder contains files with the suffix .fa. If that is not the case, please adjust the parameter.
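For example, if your bins end in .fasta rather than .fa and you prefer to pass the database explicitly, a call along these lines should work (the --suffix flag name is an assumption based on current EukCC 2 releases; confirm with eukcc folder --help):

eukcc folder --out outfolder --threads 8 \
    --db /path/to/eukcc2_db_ver_1.1 \
    --suffix .fasta \
    bins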

Bin merging workflow (paired-end read linking)

Paired-end reads must first be aligned against the bins and the resulting BAM file sorted and indexed:

cat binfolder/*.fa > pseudo_contigs.fasta
bwa index pseudo_contigs.fasta
bwa mem -t 8 pseudo_contigs.fasta reads_1.fastq.gz reads_2.fastq.gz \
    | samtools view -q 20 -Sb - \
    | samtools sort -@ 8 -O bam -o alignment.bam -
samtools index alignment.bam
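If several read sets were used for binning, you can produce one BAM per sample with the same pipeline and later hand all of them to binlinks.py. A sketch, assuming paired FASTQs named sample_1.fastq.gz / sample_2.fastq.gz in a reads/ directory:

for r1 in reads/*_1.fastq.gz; do
    r2=${r1/_1.fastq.gz/_2.fastq.gz}
    sample=$(basename "$r1" _1.fastq.gz)
    bwa mem -t 8 pseudo_contigs.fasta "$r1" "$r2" \
        | samtools view -q 20 -Sb - \
        | samtools sort -@ 8 -O bam -o "${sample}.bam" -
    samtools index "${sample}.bam"
done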

Generate the link table with the binlinks.py script (listed further down):

binlinks.py --ANI 99 --within 1500 \
    --out linktable.csv binfolder alignment.bam

If you have multiple bam files, pass all of them to the script (e.g. *.bam).

You will obtain a three column file (bin_1,bin_2,links).
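To eyeball the result, you can sort the table by the number of links, skipping the header row:

tail -n +2 linktable.csv | sort -t, -k3,3nr | head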

Merging bins

eukcc folder \
    --out outfolder \
    --threads 8 \
    --links linktable.csv \
    binfolder

EukCC first runs on each bin individually. It then identifies medium-quality bins that are at least 50% complete but not yet above 100 - improve_percent. Next, it looks for bins connected to these medium-quality bins by at least 100 paired-end read pairs. If the quality score improves after merging, the merged bin is kept.
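That candidate selection is EukCC's internal logic, but you can preview which bin pairs clear the 100-read-pair threshold yourself from the link table (an illustration only, not what EukCC runs internally):

awk -F, 'NR > 1 && $1 != $2 && $3 >= 100' linktable.csv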

Merged bins can be found in the output folder.

Note

binlinks.py

#!/usr/bin/env python3
import pysam
from Bio import SeqIO
from collections import defaultdict
import os
import argparse
import logging
import csv


def is_in(read, contig_map, within=1000):
    if read.reference_name not in contig_map.keys():
        return False
    if read.reference_start <= within or read.reference_end <= within:
        return True
    elif read.reference_start > (contig_map[read.reference_name] - within) or read.reference_end > (
        contig_map[read.reference_name] - within
    ):
        return True
    else:
        return False


def keep_read(read, contig_map, within=1000, min_ANI=98, min_cov=0):
    ani = (
        (read.query_alignment_length - read.get_tag("NM"))
        / float(read.query_alignment_length)
        * 100
    )
    cov = read.query_alignment_length / float(read.query_length) * 100

    if ani >= min_ANI and cov >= min_cov and is_in(read, contig_map, within) is True:
        return True
    else:
        return False


def contig_map(bindir, suffix=".fa"):
    m = {}
    for f in os.listdir(bindir):
        if f.endswith(suffix) is False:
            continue
        path = os.path.join(bindir, f)
        with open(path, "r") as handle:
            for record in SeqIO.parse(handle, "fasta"):
                m[record.name] = len(record.seq)
    return m


def bin_map(bindir, suffix=".fa"):
    contigs = defaultdict(str)
    contigs_per_bin = defaultdict(int)
    for f in os.listdir(bindir):
        if f.endswith(suffix) is False:
            continue
        path = os.path.join(bindir, f)
        binname = os.path.basename(f)
        with open(path, "r") as handle:
            for record in SeqIO.parse(handle, "fasta"):
                contigs[record.name] = binname
                contigs_per_bin[binname] += 1
    return contigs, contigs_per_bin


def read_pair_generator(bam):
    """
    Generate read pairs in a BAM file or within a region string.
    Reads are added to read_dict until a pair is found.
    From: https://www.biostars.org/p/306041/
    """
    read_dict = defaultdict(lambda: [None, None])
    for read in bam.fetch():
        if not read.is_paired or read.is_secondary or read.is_supplementary:
            continue
        qname = read.query_name
        if qname not in read_dict:
            if read.is_read1:
                read_dict[qname][0] = read
            else:
                read_dict[qname][1] = read
        else:
            if read.is_read1:
                yield read, read_dict[qname][1]
            else:
                yield read_dict[qname][0], read
            del read_dict[qname]


def read_bam_file(bamf, link_table, cm, within, ANI):
    samfile = pysam.AlignmentFile(bamf, "rb")
    # generate link table
    logging.info("Parsing Bam file. This can take a few moments")
    for read, mate in read_pair_generator(samfile):
        if keep_read(read, cm, within, min_ANI=ANI) and keep_read(mate, cm, within, min_ANI=ANI):
            # fill in the table
            link_table[read.reference_name][mate.reference_name] += 1
            if read.reference_name != mate.reference_name:
                link_table[mate.reference_name][read.reference_name] += 1
    return link_table


def main():
    # set arguments
    # arguments are passed to classes
    parser = argparse.ArgumentParser(description="Evaluate completeness and contamination of a MAG.")
    parser.add_argument("bindir", type=str, help="Run script on these bins")
    parser.add_argument(
        "bam",
        type=str,
        help="Bam file(s) with reads aligned against all contigs making up the bins",
        nargs="+",
    )
    parser.add_argument(
        "--out",
        "-o",
        type=str,
        required=False,
        help="Path to output table (Default: links.csv)",
        default="links.csv",
    )
    parser.add_argument("--ANI", type=float, required=False, help="ANI of matching read", default=99)
    parser.add_argument(
        "--within",
        type=int,
        required=False,
        help="Within this many bp we need the read to map",
        default=1000,
    )
    parser.add_argument(
        "--contigs",
        "-c",
        action="store_true",
        default=False,
        help="Instead of bins print contigs",
    )
    parser.add_argument(
        "--quiet",
        "-q",
        dest="quiet",
        action="store_true",
        default=False,
        help="Silence most output",
    )
    parser.add_argument(
        "--debug",
        "-d",
        action="store_true",
        default=False,
        help="Debug and thus ignore safety",
    )
    args = parser.parse_args()

    # define logging
    logLevel = logging.INFO
    if args.quiet:
        logLevel = logging.WARNING
    elif args.debug:
        logLevel = logging.DEBUG
    logging.basicConfig(
        format="%(asctime)s %(message)s",
        datefmt="%d-%m-%Y %H:%M:%S: ",
        level=logLevel,
    )

    bindir = args.bindir
    cm = contig_map(bindir)
    bm, contigs_per_bin = bin_map(bindir)
    logging.debug("Found {} contigs".format(len(cm)))

    link_table = defaultdict(lambda: defaultdict(int))
    bin_table = defaultdict(lambda: defaultdict(int))

    # iterate all bam files
    for bamf in args.bam:
        link_table = read_bam_file(bamf, link_table, cm, args.within, args.ANI)
    logging.debug("Created link table with {} entries".format(len(link_table)))

    # generate bin table
    for contig_1, dic in link_table.items():
        for contig_2, links in dic.items():
            bin_table[bm[contig_1]][bm[contig_2]] += links
    logging.debug("Created bin table with {} entries".format(len(bin_table)))

    out_data = []
    logging.debug("Constructing output dict")
    if args.contigs:
        for contig_1, linked in link_table.items():
            for contig_2, links in linked.items():
                out_data.append(
                    {
                        "bin_1": bm[contig_1],
                        "bin_2": bm[contig_2],
                        "contig_1": contig_1,
                        "contig_2": contig_2,
                        "links": links,
                        "bin_1_contigs": contigs_per_bin[bm[contig_1]],
                        "bin_2_contigs": contigs_per_bin[bm[contig_2]],
                    }
                )
    else:
        for bin_1, dic in bin_table.items():
            for bin_2, links in dic.items():
                out_data.append({"bin_1": bin_1, "bin_2": bin_2, "links": links})
    logging.debug("Out data has {} rows".format(len(out_data)))

    # results
    logging.info("Writing output")
    with open(args.out, "w") as fout:
        if len(out_data) > 0:
            cout = csv.DictWriter(fout, fieldnames=list(out_data[0].keys()))
            cout.writeheader()
            for row in out_data:
                cout.writerow(row)
        else:
            logging.warning("No rows to write")


if __name__ == "__main__":
    main()
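The script above needs pysam and biopython installed. When several BAM files are available, pass them all; with --contigs the table is reported per contig instead of per bin (BAM file names below are placeholders):

./binlinks.py --ANI 99 --within 1500 --out linktable.csv binfolder sample1.bam sample2.bam
./binlinks.py --contigs --out contig_links.csv binfolder sample1.bam sample2.bam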

scripts/filter_euk_bins.py

#!/usr/bin/env python3
#
# This file is part of the EukCC (https://github.com/openpaul/eukcc).
# Copyright (c) 2019 Paul Saary
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, version 3.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# provides all file operation functions
# used in this package
import os
import argparse
import subprocess
import logging
import tempfile
import gzip
from multiprocessing import Pool


# backup fasta handler, so we can use readonly directories
class fa_class:
    def __init__(self, seq, name, long_name):
        self.seq = seq
        self.name = name
        self.long_name = long_name

    def __str__(self):
        return self.seq

    def __len__(self):
        return len(self.seq)


def Fasta(path):
    """Iterator for fasta files"""
    entry = False
    with open(path) as fin:
        for line in fin:
            if line.startswith(">"):
                if entry is not False:
                    entry.seq = "".join(entry.seq)
                    yield entry
                # define new entry
                long_name = line.strip()[1:]
                name = long_name.split()[0]
                entry = fa_class([], name, long_name)
            else:
                entry.seq.append(line.strip())
        # yield last one
        entry.seq = "".join(entry.seq)
        yield entry


def gunzip(path, tmp_dir):
    """Gunzip a file for EukRep"""
    if path.endswith(".gz"):
        fna_path = os.path.join(tmp_dir, "contigs.fna")
        logging.debug("Going to unzip fasta into {}".format(fna_path))
        with gzip.open(path, "r") as fin, open(fna_path, "w") as fout:
            for line in fin:
                fout.write(line.decode())
        path = fna_path
        logging.debug("Done unzipping {}".format(fna_path))
    return path


class EukRep:
    """Class to call and handle EukRep data"""

    def __init__(self, fasta, eukout, bacout=None, minl=1500, tie="euk"):
        self.fasta = fasta
        self.eukout = eukout
        self.bacout = bacout
        self.minl = minl
        self.tie = tie

    def run(self):
        # command list will be called
        cmd = [
            "EukRep",
            "--min",
            str(self.minl),
            "-i",
            self.fasta,
            "--seq_names",
            "-ff",
            "--tie",
            self.tie,
            "-o",
            self.eukout,
        ]
        if self.bacout is not None:
            cmd.extend(["--prokarya", self.bacout])
        subprocess.run(cmd, check=True, shell=False)
        self.read_result()

    def read_result(self):
        self.euks = self.read_eukfile(self.eukout)
        self.bacs = set()
        if self.bacout is not None:
            self.bacs = self.read_eukfile(self.bacout)

    def read_eukfile(self, path):
        lst = set()
        with open(path) as infile:
            for line in infile:
                lst.add(line.strip())
        return lst


class bin:
    def __init__(self, path, eukrep):
        self.e = eukrep
        self.bname = os.path.basename(path)
        self.path = os.path.abspath(path)

    def __str__(self):
        return "{} euks: {} bacs: {}".format(self.bname, self.table["euks"], self.table["bacs"])

    def stats(self):
        """read bin content and figure genomic composition"""
        logging.debug("Loading bin")
        fa_file = Fasta(self.path)
        stats = {"euks": 0, "bacs": 0, "NA": 0, "sum": 0}
        # loop and compute stats
        logging.debug("Make per bin stats")
        for seq in fa_file:
            if seq.name in self.e.euks:
                stats["euks"] += len(seq)
            elif seq.name in self.e.bacs:
                stats["bacs"] += len(seq)
            else:
                stats["NA"] += len(seq)
        stats["sum"] = sum([v for k, v in stats.items()])
        self.table = stats

    def decide(self, eukratio=0.2, bacratio=0.1, minbp=100000, minbpeuks=1000000):
        """rule to handle decision logic"""
        keep = True
        allb = self.table["sum"]
        if self.table["euks"] < minbpeuks:
            keep = False
            logging.info(f"Eukaryotic DNA amount only {self.table['euks']} instead of target {minbpeuks}")
        elif self.table["euks"] / allb <= eukratio:
            keep = False
            logging.info(f"Eukaryotic DNA ratio not higher than {eukratio}")
        elif self.table["bacs"] / allb >= bacratio:
            keep = False
            logging.info(f"Bacterial DNA content higher than {bacratio}")
        elif self.table["sum"] < minbp:
            keep = False
            logging.info("We did not find at least %d bp of DNA", minbp)
        self.keep = keep


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", help="path for the output table", default="assignment.csv", type=str)
    parser.add_argument("bins", nargs="+", help="all bins to classify", type=str)
    parser.add_argument(
        "--threads",
        "-t",
        type=int,
        help="How many bins should be run in parallel (Default: 1)",
        default=1,
    )
    parser.add_argument(
        "--minl",
        type=int,
        help="define minimal length of contig for EukRep to classify (default: 1500)",
        default=1500,
    )
    parser.add_argument(
        "--eukratio",
        type=float,
        help="This ratio of eukaryotic DNA to all DNA has to be found at least (default: 0, ignore)",
        default=0,
    )
    parser.add_argument(
        "--bacratio",
        type=float,
        help="discard bins with bacterial ratio of higher than (default: 1, ignore)",
        default=1,
    )
    parser.add_argument(
        "--minbp",
        type=float,
        help="Only keep bins with at least n bp of dna (default: 8000000)",
        default=8000000,
    )
    parser.add_argument(
        "--minbpeuks",
        type=float,
        help="Only keep bins with at least n bp of Eukaryotic dna (default: 5000000)",
        default=5000000,
    )
    parser.add_argument("--rerun", action="store_true", help="rerun even if output exists", default=False)
    parser.add_argument("--quiet", action="store_true", help="supress information", default=False)
    parser.add_argument("--debug", action="store_true", help="Make it more verbose", default=False)
    args = parser.parse_args()

    # define logging
    logLevel = logging.INFO
    if args.quiet:
        logLevel = logging.WARNING
    elif args.debug:
        logLevel = logging.DEBUG
    logging.basicConfig(format="%(asctime)s %(message)s", datefmt="%m/%d/%Y %H:%M:%S: ", level=logLevel)

    def evaluate_bin(path):
        if not os.path.exists(path):
            logging.error("Can not find file {}".format(path))
            exit(1)
        logging.info("Launch on {}".format(path))
        with tempfile.TemporaryDirectory(prefix="filter_EukRep_") as tmp_dir:
            logging.debug("Using tmp folder: {}".format(tmp_dir))
            eukfile = os.path.join(tmp_dir, "euks.fna")
            bacfile = os.path.join(tmp_dir, "bacs.fna")
            # EukRep can not deal with Gzipped Fasta files, so we unzip it in case it is a Gzip file
            path = gunzip(path, tmp_dir)
            # Launching EukRep
            logging.debug(f"Starting EukRep on {path}")
            eukrep_result = EukRep(path, eukfile, bacfile, minl=args.minl)
            eukrep_result.run()
            b = bin(path, eukrep_result)
            b.stats()
            b.decide(eukratio=args.eukratio, bacratio=args.bacratio, minbp=args.minbp, minbpeuks=args.minbpeuks)
        return b

    # multithreading pool
    pool = Pool(processes=args.threads)
    results = pool.map(evaluate_bin, args.bins)
    pool.close()
    pool.join()

    with open(args.output, "w") as outfile:
        outfile.write("path,binname,passed,bp_eukaryote,bp_prokaryote,bp_unassigned,bp_sum\n")
        for b in results:
            outfile.write(
                f"{b.path},{b.bname},{b.keep},{b.table['euks']},{b.table['bacs']},{b.table['NA']},{b.table['sum']}\n"
            )
