Manipulate and mine Relion star files.

Project description

starparser

Use this package to manipulate Relion star files, including counting, modifying, plotting, and sifting the data. At the very least, this is a useful alternative to awk commands, which can get awkward. Below is a description of the command-line options with some examples. Alternatively, use the starparser modules in your own Python scripts or within Relion.

Installation
Important notes
Command-line options
Limitations
Relion GUI usage
Scripting
Examples
License

Installation

Set up a fresh conda environment with Python >= 3.6: conda create -n sp python=3.6 and activate it with conda activate sp.
Install starparser: pip install starparser

Important notes

Your input file needs to be a standard Relion .star file with an optics table followed by another data table (e.g. a particle table) whose rows are listed in tab-delimited columns (i.e. it does not work on *_model.star files). Typical files include run_data.star, run_itxxx_data.star, movies.star, etc.
If the star file lacks an optics table, such as those from Relion 3.0, just add the --opticsless option to parse it.
The term particles here refers to rows in a star file, which may represent objects other than particles, such as movies in a movies.star file.
Some of the options below are already available in Relion with "relion_star_handler".

Command-line options

Usage:

starparser --i input.star [options]

Input

--i filename

Name of the input star file.

--f filename

Name of a second star file, if necessary.

Data mining

--extract

Find particles that match a column header (--c) and query (--q) and write them to a new star file (default output.star, or specified with --o).

--limit column/comparator/value

Extract particles whose value in a column satisfies a comparison (lt for less than, gt for greater than). The argument to pass is "column/comparator/value" (e.g. _rlnDefocusU/lt/40000 for defocus values less than 40000).
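As a rough illustration of how a column/comparator/value argument can be interpreted, the sketch below splits the argument and filters a list of rows. The function name apply_limit and the dictionary-based rows are hypothetical stand-ins for illustration only; they are not starparser's internals.

```python
import operator

# Map the two documented comparators onto Python's comparison functions.
COMPARATORS = {"lt": operator.lt, "gt": operator.gt}

def apply_limit(rows, argument):
    """Keep rows whose column value satisfies 'column/comparator/value'."""
    column, comparator, value = argument.split("/")
    compare = COMPARATORS[comparator]
    threshold = float(value)
    return [row for row in rows if compare(float(row[column]), threshold)]

# Toy rows standing in for particles (made-up values).
rows = [
    {"_rlnDefocusU": "12000.5"},
    {"_rlnDefocusU": "45000.0"},
    {"_rlnDefocusU": "39999.9"},
]
kept = apply_limit(rows, "_rlnDefocusU/lt/40000")
print(len(kept), "of", len(rows), "rows kept")
```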

--count

Count particles and display the result. Optionally, this can be used with --c and --q to only count a subset of particles that match the query (see the Querying options), otherwise counts all.

--count_mics

Count the number of unique micrographs. Optionally, this can be used with --c and --q to only count a subset of particles that match the query (see the Querying options), otherwise counts all.

--list_column column-name(s)

Write all values of a column to a file. For example, passing _rlnMicrographName will write all values to MicrographName.txt. To output multiple columns, separate the column names with a slash (for example, _rlnMicrographName/_rlnCoordinateX outputs MicrographName.txt and CoordinateX.txt). Optionally, this can be used with --c and --q to only consider values that match the query (see the Querying options), otherwise it lists all values.

--find_shared column-name

Find particles that are shared between the input star file and the one provided by --f based on the column provided here. Two new star files will be output, one with the shared particles and one with the unique particles.
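The shared/unique partition can be pictured as a set-membership test on the chosen column. This is only a sketch under that assumption; find_shared below is a hypothetical helper working on plain lists of column values, not the function starparser uses.

```python
def find_shared(input_values, second_values):
    """Return (shared, unique) row indices of the input, judged by one column."""
    second = set(second_values)
    shared = [i for i, value in enumerate(input_values) if value in second]
    unique = [i for i, value in enumerate(input_values) if value not in second]
    return shared, unique

# Toy _rlnMicrographName columns from two star files (made-up names).
input_column = ["a.mrc", "b.mrc", "c.mrc", "b.mrc"]
second_column = ["b.mrc", "d.mrc"]
shared, unique = find_shared(input_column, second_column)
print("shared rows:", shared)
print("unique rows:", unique)
```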

--extract_if_nearby distance

For every particle in the input star file, check the nearest particle in a second star file provided by --f; particles that have a neighbor closer than the distance (in pixels) provided here will be written to particles_close.star, and those that don't will be written to particles_far.star. Particles that couldn't be matched to a neighbor will be skipped (i.e. if the second star file lacks particles in that micrograph). It will also output a histogram of nearest distances to Particles_distances.png (use --t to change filetype; see the Output options).
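The nearest-neighbor test described above can be sketched in a few lines. This is a brute-force illustration, assuming particles are compared only within the same micrograph and using (micrograph, x, y) tuples as a stand-in for star file rows; the helper name split_by_distance is hypothetical.

```python
import math
from collections import defaultdict

def split_by_distance(particles, reference, max_distance):
    """Return (close, far); particles with no reference micrograph are skipped."""
    reference_by_mic = defaultdict(list)
    for mic, x, y in reference:
        reference_by_mic[mic].append((x, y))
    close, far = [], []
    for mic, x, y in particles:
        neighbors = reference_by_mic.get(mic)
        if not neighbors:
            continue  # the second file has no particles in this micrograph
        nearest = min(math.hypot(x - nx, y - ny) for nx, ny in neighbors)
        (close if nearest < max_distance else far).append((mic, x, y))
    return close, far

# Toy coordinates in pixels (made-up values).
particles = [("m1.mrc", 100, 100), ("m1.mrc", 900, 900), ("m2.mrc", 50, 50)]
reference = [("m1.mrc", 103, 104)]
close, far = split_by_distance(particles, reference, 10)
print(len(close), "close,", len(far), "far")
```

For large datasets a spatial index would scale better than this pairwise search.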

--extract_clusters threshold-distance/minimum-number

Extract particles that have a minimum number of neighbors within a given radius. For example, passing 400/4 extracts particles with at least 4 neighbors within 400 pixels.

--extract_indices

Extract particles with indices that match a list in a second file (specified by --f). The second file must be a single column list of numbers with values between 1 and the last particle index of the star file. The result is written to output.star (or specified with --o).

--extract_random number-of-particles

Get a random set of particles totaling the number provided here. Optionally, use --c and --q to extract a random set of each passed query in the specified column (see the Querying options); in this case, the output star files will have the name(s) of the query(ies). Otherwise, a random set from all particles will be written to output.star (or specified with --o).

--split number-of-files

Split the input star file into the number of star files passed here, making sure not to separate particles that belong to the same micrograph. The files will have the input file name with the suffix "_split-#". Note that they will not necessarily contain exactly the same number of particles.
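To see why the resulting files are only roughly equal in size, consider the following sketch, which keeps each micrograph's particles together and fills the emptiest file first. The greedy strategy and the helper name split_by_micrograph are assumptions made for illustration; the source does not state which strategy starparser uses.

```python
from collections import defaultdict

def split_by_micrograph(micrograph_names, number_of_files):
    """Assign particle indices to chunks without splitting any micrograph."""
    by_mic = defaultdict(list)
    for index, name in enumerate(micrograph_names):
        by_mic[name].append(index)
    chunks = [[] for _ in range(number_of_files)]
    # Place the largest micrographs first, always into the smallest chunk so far.
    for name in sorted(by_mic, key=lambda n: len(by_mic[n]), reverse=True):
        smallest = min(chunks, key=len)
        smallest.extend(by_mic[name])
    return chunks

# Toy _rlnMicrographName column: 4, 3, 2, and 1 particles per micrograph.
names = ["a.mrc"] * 4 + ["b.mrc"] * 3 + ["c.mrc"] * 2 + ["d.mrc"]
chunks = split_by_micrograph(names, 2)
print([len(chunk) for chunk in chunks])
```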

--split_classes

Split the input star file into independent star files for each class. The files will have the names "Class_#.star".

--split_optics

Split the input star file into independent star files for each optics group. The files will have the names of the optics group.

--sort_by column-name

Sort the particles in ascending order according to the values in the column passed here. Outputs a new file to output.star (or specified with --o). Add a slash followed by "n" if the column contains numeric values (e.g. _rlnClassNumber/n); otherwise, it will sort the values as text.

Modifying

--operate column-name[operator]value

Perform operation on all values of a column. The argument to pass is column[operator]value (without the brackets and without any spaces); operators include "*", "/", "+", and "-" (e.g. _rlnHelicalTrackLength*0.25). The result is written to a new star file (default output.star, or specified with --o). If your terminal throws an error, try surrounding the argument with quotations (e.g. "_rlnHelicalTrackLength*0.25").
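The column[operator]value syntax is compact, so it may help to see one way such an argument could be taken apart. The regular expression and the names parse_operation and OPERATIONS below are illustrative assumptions, not the parser starparser uses. The "*" in the argument is also why quoting can be needed: an unquoted asterisk may be expanded by the shell before starparser sees it.

```python
import re

# The four documented operators mapped to plain arithmetic.
OPERATIONS = {
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
}

def parse_operation(argument):
    """Split 'column[operator]value' into (column, operator, value)."""
    match = re.fullmatch(r"(_\w+)([*/+\-])(.+)", argument)
    if match is None:
        raise ValueError(f"Cannot parse operation: {argument!r}")
    column, symbol, value = match.groups()
    return column, symbol, float(value)

column, symbol, value = parse_operation("_rlnHelicalTrackLength*0.25")
track_lengths = [100.0, 40.0]  # made-up column values
scaled = [OPERATIONS[symbol](length, value) for length in track_lengths]
print(column, symbol, value, scaled)
```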

--operate_columns column1[operator]column2=newcolumn

Perform operation between two columns and write to a new column. The argument to pass is column1[operator]column2=newcolumn (without the brackets and without any spaces); operators include "*", "/", "+", and "-" (e.g. _rlnCoordinateX+_rlnOriginX=_rlnShiftedX). If your terminal throws an error, try surrounding the argument with quotations (e.g. "_rlnCoordinateX+_rlnOriginX=_rlnShiftedX").

--remove_column column-name(s)

Remove column, renumber headers, and write to a new star file (default output.star, or specified with --o). E.g. _rlnMicrographName. To enter multiple columns, separate them with a slash: _rlnMicrographName/_rlnCoordinateX. Note that "relion_star_handler --remove_column" also does this.

--remove_particles

Remove particles that match a query (specified with --q) within a column header (specified with --c; see the Querying options), and write to a new star file (default output.star, or specified with --o).

--remove_duplicates column-name

Remove duplicate particles based on the column provided here (e.g. _rlnImageName).

--remove_mics_fromlist

Remove particles that belong to micrographs that have a match in a second file provided by --f, and write to a new star file (default output.star, or specified with --o). You only need to have the micrograph names and not necessarily the full paths in the second file.
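Matching on micrograph names rather than full paths amounts to comparing file basenames. The sketch below shows that idea on plain lists; the helper name remove_listed_micrographs and the example paths are hypothetical, and the exact matching rule inside starparser may differ.

```python
import os

def remove_listed_micrographs(micrograph_paths, listed_names):
    """Return indices of particles whose micrograph is not in the list."""
    listed = {os.path.basename(name.strip()) for name in listed_names if name.strip()}
    return [
        index
        for index, path in enumerate(micrograph_paths)
        if os.path.basename(path) not in listed
    ]

# Toy _rlnMicrographName column (made-up paths) and a list with bare file names.
paths = [
    "MotionCorr/job002/Movies/mic_001.mrc",
    "MotionCorr/job002/Movies/mic_002.mrc",
    "MotionCorr/job002/Movies/mic_001.mrc",
]
to_remove = ["mic_001.mrc", ""]
kept = remove_listed_micrographs(paths, to_remove)
print("rows kept:", kept)
```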

--insert_column column-name

Insert a new column that doesn't already exist with the values found in the file provided by --f. The file should be a single column and should have the same number of entries as there are particles in the star file. The result is written to a new star file (default output.star, or specified with --o).

--replace_column column-name

Replace all entries of a column with a list of values found in the file provided by --f. The file should be a single column and should have the same number of entries as there are particles in the star file. This is useful when used in conjunction with --list_column, which outputs column values for easy editing before reinsertion with --replace_column. The result is written to a new star file (default output.star, or specified with --o).

--copy_column source-column/target-column

Replace all entries of a target column with those of a source column in the same star file. If the target column does not exist, a new column will be made. The argument to pass is source-column/target-column (e.g. _rlnAngleTiltPrior/_rlnAngleTilt). The result is written to a new star file (default output.star, or specified with --o).

--reset_column column-name/new-value

Change all values of a column to the one provided here. The argument to pass is column-name/new-value (e.g. _rlnOriginX/0). The result is written to a new star file (default output.star, or specified with --o).

--swap_columns column-name(s)

Swap columns from another star file (specified with --f). For example, pass _rlnMicrographName to swap that column. To enter multiple columns, separate them with a slash: _rlnMicrographName/_rlnCoordinateX. Note that the total number of particles should match. The result is written to a new star file (default output.star, or specified with --o).

--fetch_from_nearby distance/column-name(s)

Find the nearest particle in a second star file (specified with --f) and if it is within a threshold distance, retrieve its column value to replace the original particle column value. The argument to pass is distance/column-name(s) (e.g. 300/_rlnClassNumber or 100/_rlnAnglePsi/_rlnHelicalTubeID). Outputs to output.star (or specified with --o). Particles that couldn't be matched to a neighbor will be skipped (i.e. if the second star file lacks particles in that micrograph). The micrograph paths from _rlnMicrographName do not necessarily need to match, just the filenames need to.

--import_mic_values column-name(s)

For every particle, find the micrograph that it belongs to in a second star file (specified with --f) and replace the original column value with that of the second star file (e.g. _rlnOpticsGroup). This requires that the second star file only has one instance of each micrograph name (e.g. a micrographs_ctf.star file). To import multiple columns, separate them with a slash. The result is written to a new star file (default output.star, or specified with --o).

--import_particle_values column-name(s)

For every particle in the input star file, find the equivalent particle in a second star file (specified with --f) (i.e. those with equivalent _rlnImageName values) and replace the original column value with the one from the second star file. To import multiple columns, separate them with a slash.

--regroup particles-per-group

Regroup particles such that those with similar defocus values are in the same group (the number of particles per group is specified here) and write to a new star file (default output.star, or specified with --o). Any value can be entered. This is useful if there aren't enough particles in each micrograph to make meaningful groups. This only works if _rlnGroupNumber is being used in the star file rather than _rlnGroupName. Note that Subset selection in Relion should be used for regrouping if possible (which groups on the *_model.star intensity scale factors).
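One simple way to put particles with similar defocus into the same group is to rank them by defocus and cut the ranking into consecutive blocks. The sketch below does exactly that; it is an assumption about the general approach, not a copy of starparser's implementation, and regroup_by_defocus is a hypothetical name.

```python
def regroup_by_defocus(defocus_values, particles_per_group):
    """Return a 1-based group number per particle, grouping similar defocus values."""
    # Indices of the particles ordered from lowest to highest defocus.
    order = sorted(range(len(defocus_values)), key=lambda i: defocus_values[i])
    groups = [0] * len(defocus_values)
    for rank, index in enumerate(order):
        groups[index] = rank // particles_per_group + 1
    return groups

# Toy _rlnDefocusU column (made-up values).
defocus = [21000.0, 9000.0, 15000.0, 9500.0, 20500.0, 15200.0]
group_numbers = regroup_by_defocus(defocus, 2)
print(group_numbers)
```

With this scheme the last group can end up smaller than the others when the particle count is not a multiple of the group size.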

--new_optics optics-group-name

Provide a new optics group name. Use --c and --q to specify which particles belong to this optics group (see the Querying options). The optics values from the last entry of the optics table will be duplicated. The result is written to a new star file (default output.star, or specified with --o).

--relegate

Remove optics table and optics column and write to a new star file (default output.star, or specified with --o) so that it is compatible with Relion 3.0. Note that in some cases this will not be sufficient to be fully compatible with Relion 3.0 and you may have to use --remove_column to remove other bad columns (e.g. helix-specific columns). Note that to use starparser on Relion 3.0 star files, you need to pass the --opticsless option.

Plotting

--histogram column-name

Plot values of a column as a histogram. Optionally, use --c and --q to only plot a subset of particles (see the Querying options), otherwise it will plot all. The filename will be that of the column name. Use --t to change the filetype (see the Output options). The number of bins is calculated using the Freedman-Diaconis rule. Note that "relion_star_handler --hist_column" also does this.
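For reference, the Freedman-Diaconis rule sets the bin width to 2 × IQR / n^(1/3), where IQR is the interquartile range and n the number of values. The sketch below computes the width and the resulting number of bins using linearly interpolated quartiles; the helper names are hypothetical, and starparser's own calculation may differ in details such as how the quartiles are estimated.

```python
import math

def percentile(sorted_values, fraction):
    """Linearly interpolated percentile of an already sorted list."""
    position = fraction * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight

def freedman_diaconis_bins(values):
    """Return (bin_width, number_of_bins) from the Freedman-Diaconis rule."""
    ordered = sorted(values)
    iqr = percentile(ordered, 0.75) - percentile(ordered, 0.25)
    width = 2 * iqr / len(ordered) ** (1 / 3)
    if width == 0:
        return 0.0, 1  # degenerate case: the central half of the data is constant
    bins = math.ceil((ordered[-1] - ordered[0]) / width)
    return width, max(bins, 1)

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 9.0]  # made-up column values
width, bins = freedman_diaconis_bins(values)
print(f"bin width {width:.2f}, {bins} bins")
```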

--plot_orientations

Plot the particle orientations based on the _rlnAngleRot and _rlnAngleTilt columns on a Mollweide projection (longitude and latitude, respectively). Optionally, use --c and --q to only plot a subset of particles, otherwise it will plot all. The result will be saved to Particle_orientations.png. Use --t to change filetype (see the Output options).

--plot_class_iterations classes

Plot the number of particles per class for all iterations up to the one provided in the input (skips iterations 0 and 1). Pass "all" to plot all classes, or separate the classes that you want with a slash (e.g. 1/2/5). It can successfully handle filenames that have "_ct" in them if you've continued from intermediate jobs (only tested on a single continue). Use --t to change filetype (see the Output options).

--plot_class_proportions

Find the proportion of particle sets that belong to each class. At least two queries (--q, separated by slashes) must be provided along with the column to search in (--c) (See the Querying options). It will display the proportions in percentages and plot the result to Class_proportion.png. Use --t to change filetype (see the Output options).

--plot_coordinates number-of-micrographs

Plot the particle coordinates for the input star file for each micrograph in a multi-page pdf (red circles). The argument to pass is the total number of micrographs to plot (pass "all" to plot all micrographs, but it might take a long time if there are many). Make sure you are running it in the Relion directory so that the micrograph .mrc files can be properly sourced (or change the _rlnMicrographName column to absolute paths). Use --f to overlay the coordinates of a second star file (larger blue circles); in this case, the micrograph names should match between the two star files. Optionally, pass the desired size of the circle after a slash (e.g. 1/250 for 1 micrograph and a circle size of 250 pixels). The plots are written to Coordinates.pdf.

Querying

--c column-name(s)

Column query term(s). E.g. _rlnMicrographName. This is used to look for a specific query specified with --q. In cases where you can enter multiple columns, separate them with a slash: _rlnMicrographName/_rlnCoordinateX.

--q query(ies)

Particle query term(s) to look for in the values within the specified column. To enter multiple queries, separate them with a slash: 20200101/20200203. Use --e if the query(ies) should exactly match the values in the column.

--e

Pass this if you want an exact match of the values to the query(ies) provided by --q. For example, you must pass this if you want just to look for "1" and ignore "15" (which has a "1" in it).
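The difference between the default matching and --e can be illustrated with a substring test versus an equality test. This is a sketch of the behavior described above; the function name matches is hypothetical.

```python
def matches(value, queries, exact=False):
    """Return True if the value matches any query (substring unless exact)."""
    if exact:
        return any(value == query for query in queries)
    return any(query in value for query in queries)

class_numbers = ["1", "15", "2", "10"]  # made-up _rlnClassNumber values
loose = [v for v in class_numbers if matches(v, ["1"])]
strict = [v for v in class_numbers if matches(v, ["1"], exact=True)]
print(loose)   # ['1', '15', '10']
print(strict)  # ['1']
```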

Other

--opticsless

Pass this if the input star file lacks an optics table (more specifically: the star file has exactly one table), such as with Relion 3.0 files. It will create a dummy optics table before moving on. This option does not work with --plot_class_proportions or commands that require parsing a second file.

Output

--o filename

Output file name. Default is output.star.

--t filetype

File type of the plot that will be written. Choose between png, jpg, svg, and pdf. The default is png.

Limitations

The Freedman-Diaconis rule for histogram binning is not always appropriate.
Star files that lack a version header, which sometimes occurs with those generated outside of Relion, cannot be parsed. Temporary fix: add a "# version 30001" line, with a blank line above and below it, before each data table.
The --plot_coordinates circle size does not exactly match the requested value. If you need it to be exact, save the file as pdf with --t pdf and open the plot in Illustrator to modify the circle size.
--opticsless does not work when the second star file (--f) lacks an optics table or when multiple star files are being read. There is little incentive to fix this since few still use Relion 3.0.
Data mining options do not check if the subset that was created has rendered one of the optics groups void; they retain all optics groups.
--split_optics does not renumber the optics groups that were greater than 1 back to 1, although this does not affect any behavior downstream in Relion and elsewhere.
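The temporary fix for missing version headers listed above can be applied with a few lines of Python rather than by hand. This is a minimal sketch, assuming every data table starts with a line beginning with "data_" and that none of them already has a version header; add_version_headers is a hypothetical helper, and it is worth keeping a backup of the original file.

```python
def add_version_headers(star_text, version="30001"):
    """Insert a '# version' line, padded by blank lines, before each data_ block."""
    output = []
    for line in star_text.splitlines():
        if line.startswith("data_"):
            output.extend(["", f"# version {version}", ""])
        output.append(line)
    return "\n".join(output) + "\n"

# Toy star file content without version headers (made-up values).
example = (
    "data_optics\n\nloop_\n_rlnOpticsGroup #1\n1\n\n"
    "data_particles\n\nloop_\n_rlnImageName #1\n1@stack.mrcs\n"
)
fixed = add_version_headers(example)
print(fixed)
```

To use it on a file, read the text with open(...).read(), pass it through the function, and write the result to a new file.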

Relion GUI Usage

Use the External commands tab to run starparser within Relion. You don't need the double dash -- in this case.


Scripting

To parse a star file for downstream use in a Python script:

from starparser import fileparser
particles, metadata = fileparser.getparticles("file.star")

The particles DataFrame can be manipulated with pandas functions (see the example below). However, here are some examples of starparser options that are also available to use:

#Import the modules used below (assumed to be importable from starparser in the same way as fileparser)
from starparser import columnplay, particleplay

#Remove columns with delcolumn(particles, columns, metadata)
new_particles, new_metadata = columnplay.delcolumn(particles, ["_rlnMicrographName", "_rlnOpticsGroup"], metadata)

#Remove particles with delparticles(particles, columns, queries, queryexact)
new_particles = particleplay.delparticles(particles, ["_rlnMicrographName"], ["0207"], False)

#Remove duplicates with delduplicates(particles, column)
new_particles = particleplay.delduplicates(particles, "_rlnMicrographName")

#Operate on a column with operate(particles, column, operator, value) where operator is one of "multiply", "divide", "add", or "subtract"
new_particles = columnplay.operate(particles, "_rlnHelicalTrackLength", "multiply", 0.25)

#Limit values with limitparticles(particles, column, limit, operator) where operator is one of "lt" (less than) or "gt" (greater than)
new_particles = particleplay.limitparticles(particles, "_rlnDefocusU", 3000, "lt")

After manipulating the particles, you can write the star file:

fileparser.writestar(new_particles, metadata, "output.star")

A simple example showing how to iterate through micrographs and keep only one of every three particles of each helix:

from starparser import fileparser

#import data to a pandas dataframe
particles, metadata = fileparser.getparticles("particles.star")

#group by micrographs
micrographs = particles.groupby(["_rlnMicrographName"])

keeplist = []

#iterate through the micrographs
for idm, micrograph in micrographs:

    #get the helices for the current micrograph
    helices = micrograph.groupby(["_rlnHelicalTubeID"])

    #iterate through the helices for this micrograph
    for idh, helix in helices:

        #get the indices for the particles
        indices = helix.index.tolist()

        #get the indices for one of every three particles in the helix
        keeplist.append(indices[::3])

#flatten the list; this is now the list of particles to keep
keeplist = [item for sublist in keeplist for item in sublist]

#write out a star file only containing those particles to keep
fileparser.writestar(particles[particles.index.isin(keeplist)], metadata, "particles_purged.star")

Examples

Plotting

Plot a histogram of defocus values.

starparser --i run_data.star --histogram _rlnDefocusU

→ Output figure to DefocusU.png.

Plot the particle orientation distribution.

starparser --i run_data.star --plot_orientations

→ Output figure to Particle_orientations.png.

Plot the number of particles per class for the 25 iterations of a Class3D job.

starparser --i run_it025_data.star --plot_class_iterations all

→ Output figure to Class_distribution.png.

Plot the proportion of particles in each class that belong to particles with the term 200702 versus those with the term 200826 in the _rlnMicrographName column.

starparser --i run_it025_data.star --plot_class_proportions --c _rlnMicrographName --q 200702/200826

→ The percentage in each class will be displayed in terminal.

→ Output figure to Class_proportion.png.

Overlay the coordinates of two star files.

starparser --i particles.star --f select_particles.star --plot_coordinates 1

→ Plotting coordinates from the star file (red circles) and second file (blue circles) for 1 micrograph.

→ Output figure to Coordinates.pdf.

Modifying

Remove columns

starparser --i run_data.star --o run_data_del.star --remove_column _rlnCtfMaxResolution/_rlnCtfFigureOfMerit

→ A new star file named run_data_del.star will be identical to run_data.star except that it will be missing those two columns. The headers in the particles table will be renumbered.

Remove a subset of particles

starparser --i run_data.star --o run_data_del.star --remove_particles --c _rlnMicrographName --q 200702/200715

→ A new star file named run_data_del.star will be identical to run_data.star except that it will be missing any particles that have the term 200702 or 200715 in the _rlnMicrographName column. In this case, this was useful to remove particles from specific data-collection days that had the date in the filename.

Replace values in a column with those of a text file

starparser --i particles.star --replace_column _rlnOpticsGroup --f newoptics.txt --o particles_newoptics.star

→ A new star file named particles_newoptics.star will be output that will be identical to particles.star except for the _rlnOpticsGroup column, which will have the values found in newoptics.txt.

Swap columns

starparser --i run_data.star --f run_data_2.star --o run_data_swapped.star --swap_columns _rlnAnglePsi/_rlnAngleRot/_rlnAngleTilt/_rlnNormCorrection/_rlnLogLikeliContribution/_rlnMaxValueProbDistribution/_rlnNrOfSignificantSamples/_rlnOriginXAngst/_rlnOriginYAngst

→ A new star file named run_data_swapped.star will be output that will be identical to run_data.star except for the columns in the input, which will instead be swapped in from run_data_2.star. This is useful for sourcing alignments from early global refinements.

Regroup a star file

starparser --i run_data.star --o run_data_regroup200.star --regroup 200

→ A new star file named run_data_regroup200.star will be output that will be identical to run_data.star except for the _rlnGroupNumber or _rlnGroupName columns, which will be renumbered to have 200 particles per group.

Create a new optics group for a subset of particles

starparser --i run_data.star --o run_data_newoptics.star --new_optics myopticsname --c _rlnMicrographName --q 10090

→ A new star file named run_data_newoptics.star will be output that will be identical to run_data.star except that a new optics group called myopticsname will be created in the optics table and particles with the term 10090 in the _rlnMicrographName column will have modified _rlnOpticsGroup and/or _rlnOpticsName columns to match the new optics group.

Relegate a star file to be compatible with Relion 3.0

starparser --i run_data.star --o run_data_3p0.star --relegate

→ A new star file named run_data_3p0.star will be output that will be identical to run_data.star except that it will be missing the optics table and _rlnOpticsGroup column. The headers in the particles table will be renumbered accordingly.

Data mining

Extract a subset of particles

starparser --i run_data.star --o run_data_c1.star --extract --c _rlnClassNumber --q 1 --e

→ A new star file named run_data_c1.star will be output with only particles that belong to class 1. The --e option was passed to avoid extracting any class with the number 1 in it, such as "10", "11", etc.

Extract particles with limited defoci

starparser --i run_data.star --o run_data_under4um.star --limit _rlnDefocusU/lt/40000

→ A new star file named run_data_under4um.star will be output with only particles that have defocus estimations below 4 microns.

Count specific particles

starparser --i particles.star --o output.star --count --c _rlnMicrographName --q 200702/200715

→ There are 7726 particles that match ['200702', '200715'] in the specified columns (out of 69120, or 11.2%).

Count the number of micrographs

starparser --i run_data.star --count_mics

→ There are 7994 unique micrographs in this dataset.

Count the number of micrographs for specific particles

starparser --i run_data.star --count_mics --c _rlnMicrographName --q 200826

→ Creating a subset of 2358 particles that match ['200826'] in the columns ['_rlnMicrographName'] (or 3.4%)

→ There are 288 unique micrographs in this dataset.

List all items from a column in a text file

starparser --i run_data.star --list_column _rlnMicrographName

→ All entries of _rlnMicrographName will be written to MicrographName.txt in a single column.

List all items from multiple columns in independent text files

starparser --i run_data.star --list_column _rlnDefocusU/_rlnCoordinateX

→ All entries of _rlnDefocusU will be written to DefocusU.txt and all entries of _rlnCoordinateX will be written to CoordinateX.txt.

List all items from a column that match specific particles

starparser --i run_data.star --list_column _rlnDefocusU --c _rlnMicrographName --q 200826

→ Only _rlnDefocusU entries that have 200826 in _rlnMicrographName will be written to DefocusU.txt.

Compare particles between star files and extract those that are shared and unique

starparser --i run_data1.star --find_shared _rlnMicrographName --f run_data2.star

→ Two new star files named shared.star and unique.star will be created. Based on the _rlnMicrographName column, shared.star will have only the particles that are shared between run_data1.star and run_data2.star, and unique.star will have only the particles that are unique to run_data1.star relative to run_data2.star.

Extract a random set of specific particles

starparser --i run_it025_data.star --extract_random 10000 --c _rlnMicrographName --q DOA3/OAA2

→ Two new star files will be created named DOA3_10000.star and OAA2_10000.star that will have a random set of 10000 particles that match DOA3 and OAA2 in the _rlnMicrographName column, respectively.

Split a star file

starparser --i particles.star --split 3

→ Three new star files called split_1.star, split_2.star, and split_3.star will be created with roughly equal numbers of particles. In this example, particles.star has 69120 particles and the split star files have 23053, 23042, and 23025 particles, respectively.

License

This project is licensed under the MIT License - see the LICENSE.txt file for details.

Project details


This version

1.27

Aug 7, 2021

