A tool to analyse gene family evolution from orthoxml

Project description


Family-Analyzer is outdated and has been replaced by pyHam available at .

Family-Analyzer: summarize gene family evolution from orthoxml


Family-Analyzer is a tool to further analyze the hierarchical orthologous groups from an orthoXML file. More information on the orthoXML schema and some examples are available at
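
To make the input format more concrete, here is a minimal sketch that reads a tiny, made-up hierarchical orthoXML document with Python's standard library and lists the species, the taxonomic levels, and a gene id mapping. It is not Family-Analyzer's own parser (which builds on lxml). The element and attribute names (`species`, `gene`, `protId`, `orthologGroup`, `paralogGroup`, the `TaxRange` property) follow the orthoXML 0.3 schema as we understand it, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of orthoXML documents (assumed: schema version 0.3).
NS = {"ox": "http://orthoXML.org/2011/"}

# A tiny, invented document used only for illustration.
ORTHOXML = """<orthoXML xmlns="http://orthoXML.org/2011/" version="0.3"
          origin="example" originVersion="1">
  <species name="HUMAN" NCBITaxId="9606">
    <database name="exampleDB" version="1">
      <genes>
        <gene id="1" protId="HUMAN_A"/>
        <gene id="2" protId="HUMAN_B"/>
      </genes>
    </database>
  </species>
  <species name="MOUSE" NCBITaxId="10090">
    <database name="exampleDB" version="1">
      <genes>
        <gene id="3" protId="MOUSE_A"/>
      </genes>
    </database>
  </species>
  <groups>
    <orthologGroup id="1">
      <property name="TaxRange" value="Euarchontoglires"/>
      <geneRef id="3"/>
      <paralogGroup>
        <geneRef id="1"/>
        <geneRef id="2"/>
      </paralogGroup>
    </orthologGroup>
  </groups>
</orthoXML>
"""


def species_names(root):
    """Return the species names in document order."""
    return [sp.get("name") for sp in root.findall("ox:species", NS)]


def tax_levels(root):
    """Return the sorted set of TaxRange values annotated on orthologGroups."""
    path = ".//ox:orthologGroup/ox:property[@name='TaxRange']"
    return sorted({prop.get("value") for prop in root.iterfind(path, NS)})


def gene_labels(root, xreftag="protId"):
    """Map the internal numerical gene ids to the chosen xref attribute."""
    return {g.get("id"): g.get(xreftag) for g in root.iterfind(".//ox:gene", NS)}


root = ET.fromstring(ORTHOXML)
print(species_names(root))
print(tax_levels(root))
print(gene_labels(root))
```

The `xreftag` parameter of the hypothetical `gene_labels` helper plays a role similar to the `--xreftag` option: it selects which gene attribute is reported instead of the internal numerical id.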

Family-Analyzer reports to the user a summary of the evolutionary history of the gene families. The summary is given with respect to one or two taxonomic levels: it describes what happened after the specified level or, if two levels are given, between them, i.e. which genes were maintained, lost, duplicated, or gained in that period.
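
The four outcomes can be illustrated with a deliberately simplified sketch. It assumes that, for each family and one species, we already know whether the family existed at the chosen taxonomic level and how many genes the species has in it. The function name, the family names, and the input data are all hypothetical, and Family-Analyzer's actual logic works on the HOG structure and may differ.

```python
from collections import Counter


def classify_family(present_at_level, n_genes_in_species):
    """Classify what happened to one family in one species since the level.

    present_at_level: True if the family already existed at the level.
    n_genes_in_species: number of genes of the species in this family.
    """
    if not present_at_level:
        # The family arose after the level of interest.
        return "gained" if n_genes_in_species > 0 else "absent"
    if n_genes_in_species == 0:
        return "lost"
    if n_genes_in_species == 1:
        return "maintained"
    return "duplicated"


# Invented toy input: family -> (present at level?, gene count in the species)
families = {
    "fam1": (True, 1),
    "fam2": (True, 0),
    "fam3": (True, 3),
    "fam4": (False, 1),
}

events = {name: classify_family(*info) for name, info in families.items()}
summary = Counter(events.values())
print(events)
print(summary)
```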


Family-Analyzer is written in Python 3 and has few external dependencies; currently only the lxml library is required. The setup script should resolve these dependencies automatically. Consider using pip to install the package directly from a checked-out git repository:

pip install -e </path/to/family-analyzer-repo/>

Running Family-Analyzer

Running Family-Analyzer on a specific dataset is relatively easy. The main entry point is the ‘main’ section in familyanalyzer/

If this script is called with -h as an argument, it gives a short description of the required and optional arguments and what they are used for. Here is what the usage output reports as of now. Since this is still work in progress, check that the current usage has not changed.

python familyanalyzer/ -h

usage: [-h] [--xreftag XREFTAG] [--show_levels] [-r]
                         [--taxonomy TAXONOMY] [--propagate_top]
                         [--store_augmented_xml STORE_AUGMENTED_XML]
                         [--compare_second_level COMPARE_SECOND_LEVEL]
                         orthoxml level species [species ...]

Analyze Hierarchical OrthoXML families.

positional arguments:
  orthoxml              path to orthoxml file to be analyzed
  level                 taxonomic level at which analysis should be done
  species               (list of) species to be analyzed. Note that only genes
                        of the selected species are reported. In order for the
                        output to make sense, the selected species all must be
                        part of the lineages specified in 'level' (and

optional arguments:
  -h, --help            show this help message and exit
  --xreftag XREFTAG     xref tag of genes to report. OrthoXML allows to store
                        multiple ids and xref annotations per gene as
                        attributes in the species section. If not set, the
                        internal (purely numerical) ids are reported.
  --show_levels         print the levels and species found in the orthoXML
                        file and quit
  -r, --use-recursion   DEPRECATED: Use recursion to sample families that are
                        a subset of the query
  --taxonomy TAXONOMY   Taxonomy used to reconstruct intermediate levels. Has
                        to be either 'implicit' (default) or a path to a file
                        in Newick format. The taxonomy might be
                        multifurcating. If set to 'implicit', the taxonomy is
                        extracted from the input OrthoXML file. The orthoXML
                        levels do not have to cover all the levels for all
                        families. In order to infer gene losses Family-
                        Analyzer needs to infer these skipped levels and
                        reconcile each family with the complete taxonomy.
  --propagate_top       propagate taxonomy levels up to the toplevel. As an
                        illustration, consider a gene family in an eukaryotic
                        analysis that has only mammalian genes. Its topmost
                        taxonomic level will therefore be 'Mammalia' and an
                        ancestral gene was gained at that level. However, if
                        '--propagate_top' is set, the family is assumed to
                        have already been present in the topmost taxonomic
                        level, i.e. Eukaryota in this example, and non-
                        mammalian species have all lost this gene.
  --show_taxonomy       write the taxonomy used to standard out.
  --store_augmented_xml STORE_AUGMENTED_XML
                        filename to which the input orthoxml file with
                        augmented annotations is written. The augmented
                        annotations include for example the additional
                        taxonomic levels of orthologGroup and unique HOG IDs.
  --compare_second_level COMPARE_SECOND_LEVEL
                        Compare secondary level with primary one, i.e. report
                        what happened between the secondary and primary level
                        to the individual histories. Note that the Second
                        level needs to be younger than the primary.
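
The effect described for `--propagate_top` can be sketched on a toy taxonomy. This is an illustration of the idea only, not Family-Analyzer's implementation: the taxonomy, the `PARENT` mapping, and all function names are hypothetical.

```python
# Toy taxonomy as a child -> parent mapping (invented for illustration).
PARENT = {
    "Mammalia": "Eukaryota",
    "Fungi": "Eukaryota",
    "HUMAN": "Mammalia",
    "MOUSE": "Mammalia",
    "YEAST": "Fungi",
}


def lineage(node):
    """Return the path from a node up to the root, both included."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def species_below(level):
    """Return all leaf species whose lineage passes through the level."""
    internal = set(PARENT.values())
    return {n for n in PARENT if n not in internal and level in lineage(n)}


def propagate_top(family_top_level, family_species):
    """Assume the family existed at the root and list the implied losses."""
    root = lineage(family_top_level)[-1]
    lost_in = sorted(species_below(root) - set(family_species))
    return root, lost_in


# A family whose topmost level is 'Mammalia', with genes in human and mouse.
top, lost_in = propagate_top("Mammalia", {"HUMAN", "MOUSE"})
print(top, lost_in)
```

Without propagation, this family would be counted as gained at 'Mammalia'. With propagation, it is placed at 'Eukaryota' and the non-mammalian species (here only the toy species YEAST) are counted as having lost the gene.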

Code organisation

OrthoXMLParser: class which holds the orthoxml file, gives access to its data, and keeps internal mappings to speed up lookups.

Taxonomy: class which provides basic navigation through the species taxonomy. Objects are constructed using the TaxonomyFactory and can be based either on the orthoxml or on a Newick tree.

