Rapid Average Nucleotide Identity Calculation With FastANI
Maybe your like
Background
The Average Nucleotide Identity (ANI) is a convenient way to express the similarity of two microbial genomes in a single number. It’s the mean nucleotide identity of orthologous gene pairs shared between two genomes. This value is used to delineate species. A typical percentage threshold for species boundary is 95% ANI. It can be calculated by aligning the matching segments and finding the average identity. The alignment can be done quickly with MUMmer (ANIm method) but there are other methods too.
Say you want to compare many genomes together to make a distance matrix and species tree. You could have hundreds of samples and you need to perform an all vs. all comparison. For increasing numbers of genomes the number of pairwise comparisons goes up dramatically. If n is the number of samples, then n*(n-1)/2 comparisons must be done. So even if your method only takes 10 seconds per comparison, for 200 genomes you will have to do 19,900 comparisons and that would take 55 hours!
FastANI
FastANI is a program for fast alignment-free computation of whole-genome Average Nucleotide Identity (ANI). FastANI supports complete and draft genome assemblies. It avoids explicit sequence alignments altogether and uses k-mer sampling. This is much faster and therefore scales better. It is also still quite accurate. You can run the program as follows. The reference and query contain lists of all the files to compare and are the same for all vs. all comparisons.
fastANI --ql reference.txt --rl query.txt -o fastani.out -t 10
FastANI can be parallelized for multi-core processors. It’s done by splitting the reference genomes in several equal-size parts and searching query genome(s) against each part of the reference database independently.
Example of scaling of FastANI with the number of genomes is shown below. Bacterial genome assemblies were used. Each run was executed in parallel with 10 threads on a Ryzen 2600X cpu. In this case the time increased approximately with no. genomes1.5. This is not a thorough benchmark but is certainly much faster than the ANIm method.
Python code
The Python code used to perform this test is given below for reference.
import os, glob, time, subprocess, shutil def run_test(n): #names are extracted from a dataframe here and used to make query #and reference lists names = list(asm.Assembly_nover)[:n] l=[] for f in glob.glob('assemblies/*.fa.gz'): n=os.path.basename(f).split('.')[0] if n in names: l.append(f) with open('query.txt', 'w') as infile: infile.write('\n'.join(l)) shutil.copyfile('query.txt','reference.txt') #run cmd = 'fastANI --ql reference.txt --rl query.txt -o fastani.out -t 10' st=time.time() subprocess.check_output(cmd,shell=True) t = time.time()-st return t times=[] step=5 n=2 #run test with increasing increments of genomes while n<=100: t=run_test(n) print (n,t) times.append((n,t)) n+=step step+=2Read in FastANI result file to get a distance matrix of the identities.
def get_fastani(filename): """Get fastANI results into pairwise matrix""" import re df = pd.read_csv(filename,sep='\t',names=['query','ref','ident','x','y']) #these two lines are only needed to extract the label from the file names df['query'] = df['query'].apply(lambda x: re.split(r"[\./]+",x)[1]) df['ref'] = df['ref'].apply(lambda x: re.split(r"[\./]+",x)[1]) x = pd.pivot_table(df,index='query',values='ident',columns=['ref']) return x get_fastani('fastani.out')Links
- FastANI
- High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries
- pyani
- category:
- bioinformatics 55
- tags:
- genomics 28
- microbiology 12
- ani 1
Tag » What Is Average Nucleotide Identity
-
Taxonomic Affiliation Of New Genomes Should Be Verified Using ...
-
[PDF] Average Nucleotide Identity (ANI)
-
A Large-scale Evaluation Of Algorithms To Calculate ... - PubMed
-
ANI Calculator
-
Whole-Genome Analyses: Average Nucleotide Identity - ScienceDirect
-
Using Average Nucleotide Identity To Improve Taxonomic ...
-
OrthoANI: An Improved Algorithm And Software For Calculating ...
-
High Throughput ANI Analysis Of 90K Prokaryotic Genomes Reveals ...
-
ANI Calculator - Kostas Lab
-
A Large-scale Evaluation Of Algorithms To Calculate ... - Springer
-
Improving Phylogenies Based On Average Nucleotide Identity ...
-
IMG Webinar: ANI (Average Nucleotide Identity) In IMG - YouTube
-
Using Average Nucleotide Identity (ANI) To Evaluate Microsporidia ...