Manipulate and mine Relion star files.
starparser
Use this package to manipulate Relion star files, including counting, modifying, plotting, and sifting the data. At the very least, this is a useful alternative to awk commands, which can get awkward. Below is a description of the command-line options with some examples. Alternatively, use the starparser modules in your own python scripts or within Relion.
- Installation
- Important notes
- Command-line options
- Limitations
- Relion GUI usage
- Scripting
- Examples
- License
Installation
- Set up a fresh conda environment with Python >= 3.6 and activate it:
conda create -n sp python=3.6
conda activate sp
- Install starparser:
pip install starparser
Important notes
- Your input file needs to be a standard Relion .star file with an optics table, followed by another data table (e.g. particle table), followed by a list with tab-delimited columns (i.e. it does not work on *_model.star files). Typical files include run_data.star, run_itxxx_data.star, movies.star, etc.
- If the star file lacks an optics table, such as those from Relion 3.0, just add the --opticsless option to parse it.
- The term particles here refers to rows in a star file, which may represent objects other than particles, such as movies in a movies.star file.
- Some of the options below are already available in Relion with "relion_star_handler".
Command-line options
Usage:
starparser --i input.star [options]
Input
--i
filename
Name of the input star file.
--f
filename
Name of a second star file, if necessary.
Data mining
--extract
Find particles that match a column header (--c) and query (--q) and write them to a new star file (default output.star, or specified with --o).
--limit
column/comparator/value
Extract particles whose value in a column is below or above a threshold (lt for less than, gt for greater than). The argument to pass is "column/comparator/value" (e.g. _rlnDefocusU/lt/40000 for defocus values less than 40000).
--count
Count particles and display the result. Optionally, this can be used with --c and --q to only count a subset of particles that match the query (see the Querying options); otherwise, it counts all particles.
--count_mics
Count the number of unique micrographs. Optionally, this can be used with --c and --q to only count within the subset of particles that match the query (see the Querying options); otherwise, it counts all micrographs.
--list_column
column-name(s)
Write all values of a column to a file. For example, passing _rlnMicrographName will write all values to MicrographName.txt. To output multiple columns, separate the column names with a slash (for example, _rlnMicrographName/_rlnCoordinateX outputs MicrographName.txt and CoordinateX.txt). Optionally, this can be used with --c and --q to only consider values that match the query (see the Querying options); otherwise, it lists all values.
--find_shared
column-name
Find particles that are shared between the input star file and the one provided by --f based on the column provided here. Two new star files will be output, one with the shared particles and one with the unique particles.
--extract_if_nearby
distance
For every particle in the input star file, check the nearest particle in a second star file provided by --f; particles that have a neighbor closer than the distance (in pixels) provided here will be written to particles_close.star, and those that don't will be written to particles_far.star. Particles that couldn't be matched to a neighbor will be skipped (i.e. if the second star file lacks particles in that micrograph). It will also output a histogram of nearest distances to Particles_distances.png (use --t to change filetype; see the Output options).
--extract_clusters
threshold-distance/minimum-number
Extract particles that have a minimum number of neighbors within a given radius. For example, passing 400/4 extracts particles with at least 4 neighbors within 400 pixels.
--extract_indices
Extract particles with indices that match a list in a second file (specified by --f). The second file must be a single-column list of numbers with values between 1 and the last particle index of the star file. The result is written to output.star (or specified with --o).
--extract_random
number-of-particles
Get a random set of particles totaling the number provided here. Optionally, use --c and --q to extract a random set of each passed query in the specified column (see the Querying options); in this case, the output star files will have the name(s) of the query(ies). Otherwise, a random set from all particles will be written to output.star (or specified with --o).
--split
number-of-files
Split the input star file into the number of star files passed here, making sure not to separate particles that belong to the same micrograph. The files will have the input file name with the suffix "_split-#". Note that they will not necessarily contain exactly the same number of particles.
--split_classes
Split the input star file into independent star files for each class. The files will have the names "Class_#.star".
--split_optics
Split the input star file into independent star files for each optics group. The files will have the names of the optics group.
--sort_by
column-name
Sort the particles in ascending order according to the column passed here. Outputs a new file to output.star (or specified with --o). Add a slash followed by "n" if the column contains numeric values (e.g. _rlnClassNumber/n); otherwise, it will sort the values as text.
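The difference between the default text sort and the "/n" numeric sort of --sort_by can be seen with plain Python. This is only an illustration of the two orderings on made-up values, not starparser's own code.

```python
# Hypothetical class-number values as they would be read from a star file (as text).
values = ["10", "9", "2", "1"]

# Text sort: compares character by character, so "10" lands before "2".
as_text = sorted(values)

# Numeric sort: converts each value to a number before comparing.
as_numbers = sorted(values, key=float)

print(as_text)
print(as_numbers)
```

If a numeric column comes out in an unexpected order, a missing "/n" is a likely cause.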
Modifying
--operate
column-name[operator]value
Perform an operation on all values of a column. The argument to pass is column[operator]value (without the brackets and without any spaces); operators include "*", "/", "+", and "-" (e.g. _rlnHelicalTrackLength*0.25). The result is written to a new star file (default output.star, or specified with --o). If your terminal throws an error, try surrounding the argument with quotations (e.g. "_rlnHelicalTrackLength*0.25").
--operate_columns
column1[operator]column2=newcolumn
Perform operation between two columns and write to a new column. The argument to pass is column1[operator]column2=newcolumn (without the brackets and without any spaces); operators include "*", "/", "+", and "-" (e.g. _rlnCoordinateX+_rlnOriginX=_rlnShiftedX). If your terminal throws an error, try surrounding the argument with quotations (e.g. "_rlnCoordinateX+_rlnOriginX=_rlnShiftedX").
--remove_column
column-name(s)
Remove a column, renumber the headers, and write to a new star file (default output.star, or specified with --o). E.g. _rlnMicrographName. To enter multiple columns, separate them with a slash: _rlnMicrographName/_rlnCoordinateX. Note that "relion_star_handler --remove_column" also does this.
--remove_particles
Remove particles that match a query (specified with --q) within a column header (specified with --c; see the Querying options), and write to a new star file (default output.star, or specified with --o).
--remove_duplicates
column-name
Remove duplicate particles based on the column provided here (e.g. _rlnImageName).
--remove_mics_fromlist
Remove particles that belong to micrographs that have a match in a second file provided by --f, and write to a new star file (default output.star, or specified with --o). You only need to have the micrograph names and not necessarily the full paths in the second file.
--insert_column
column-name
Insert a new column that doesn't already exist with the values found in the file provided by --f. The file should be a single column and should have as many values as there are particles in the star file. The result is written to a new star file (default output.star, or specified with --o).
--replace_column
column-name
Replace all entries of a column with a list of values found in the file provided by --f. The file should be a single column and should have as many values as there are particles in the star file. This is useful when used in conjunction with --list_column, which outputs column values for easy editing before reinsertion with --replace_column. The result is written to a new star file (default output.star, or specified with --o).
--copy_column
source-column/target-column
Replace all entries of a target column with those of a source column in the same star file. If the target column does not exist, a new column will be made. The argument to pass is source-column/target-column (e.g. _rlnAngleTiltPrior/_rlnAngleTilt). The result is written to a new star file (default output.star, or specified with --o).
--reset_column
column-name/new-value
Change all values of a column to the one provided here. The argument to pass is column-name/new-value (e.g. _rlnOriginX/0). The result is written to a new star file (default output.star, or specified with --o).
--swap_columns
column-name(s)
Swap columns from another star file (specified with --f). For example, pass _rlnMicrographName to swap that column. To enter multiple columns, separate them with a slash: _rlnMicrographName/_rlnCoordinateX. Note that the total number of particles should match. The result is written to a new star file (default output.star, or specified with --o).
--fetch_from_nearby
distance/column-name(s)
Find the nearest particle in a second star file (specified with --f) and, if it is within a threshold distance, retrieve its column value to replace the original particle column value. The argument to pass is distance/column-name(s) (e.g. 300/_rlnClassNumber or 100/_rlnAnglePsi/_rlnHelicalTubeID). Outputs to output.star (or specified with --o). Particles that couldn't be matched to a neighbor will be skipped (i.e. if the second star file lacks particles in that micrograph). The micrograph paths from _rlnMicrographName do not necessarily need to match; only the filenames need to.
--import_mic_values
column-name(s)
For every particle, find the micrograph that it belongs to in a second star file (specified with --f) and replace the original column value with that of the second star file (e.g. _rlnOpticsGroup). This requires that the second star file only has one instance of each micrograph name (e.g. a micrographs_ctf.star file). To import multiple columns, separate them with a slash. The result is written to a new star file (default output.star, or specified with --o).
--import_particle_values
column-name(s)
For every particle in the input star file, find the equivalent particle in a second star file (specified with --f) (i.e. those with equivalent _rlnImageName values) and replace the original column value with the one from the second star file. To import multiple columns, separate them with a slash.
--regroup
particles-per-group
Regroup particles such that those with similar defocus values are in the same group (the number of particles per group is specified here) and write to a new star file (default output.star, or specified with --o). Any value can be entered. This is useful if there aren't enough particles in each micrograph to make meaningful groups. This only works if _rlnGroupNumber is being used in the star file rather than _rlnGroupName. Note that Subset selection in Relion should be used for regrouping if possible (which groups on the *_model.star intensity scale factors).
--new_optics
optics-group-name
Provide a new optics group name. Use --c and --q to specify which particles belong to this optics group (see the Querying options). The optics values from the last entry of the optics table will be duplicated. The result is written to a new star file (default output.star, or specified with --o).
--relegate
Remove the optics table and optics column and write to a new star file (default output.star, or specified with --o) so that it is compatible with Relion 3.0. Note that in some cases this will not be sufficient to be fully compatible with Relion 3.0 and you may have to use --remove_column to remove other bad columns (e.g. helix-specific columns). Note that to use starparser on Relion 3.0 star files, you need to pass the --opticsless option.
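The nearest-neighbor lookup described for --fetch_from_nearby can be pictured with a short standard-library sketch. It is not starparser's implementation: the dictionary keys ("mic", "x", "y"), the function name, the use of "less than or equal" for the threshold, and the reading of "skipped" as "left out of the result" are all assumptions made for illustration.

```python
import math

def fetch_from_nearby(particles, references, max_distance, column):
    """Copy `column` from the nearest reference particle on the same micrograph."""
    updated = []
    for p in particles:
        # Only compare against reference particles from the same micrograph.
        candidates = [r for r in references if r["mic"] == p["mic"]]
        if not candidates:
            continue  # no neighbor available: skip this particle (assumed behavior)
        nearest = min(
            candidates,
            key=lambda r: math.hypot(r["x"] - p["x"], r["y"] - p["y"]),
        )
        distance = math.hypot(nearest["x"] - p["x"], nearest["y"] - p["y"])
        new_p = dict(p)
        if distance <= max_distance:
            new_p[column] = nearest[column]  # take the neighbor's value
        updated.append(new_p)
    return updated

# Made-up coordinates: one close neighbor, one distant, one unmatched micrograph.
particles = [
    {"mic": "m1", "x": 0, "y": 0, "cls": 1},
    {"mic": "m1", "x": 100, "y": 100, "cls": 1},
    {"mic": "m2", "x": 0, "y": 0, "cls": 1},
]
references = [{"mic": "m1", "x": 3, "y": 4, "cls": 7}]
result = fetch_from_nearby(particles, references, max_distance=10, column="cls")
```

The brute-force search is quadratic per micrograph, which is acceptable for a sketch; a spatial index would be the usual choice for large datasets.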
Plotting
--histogram
column-name
Plot values of a column as a histogram. Optionally, use --c and --q to only plot a subset of particles (see the Querying options); otherwise, it will plot all. The filename will be that of the column name. Use --t to change the filetype (see the Output options). The number of bins is calculated using the Freedman-Diaconis rule. Note that "relion_star_handler --hist_column" also does this.
--plot_orientations
Plot the particle orientations based on the _rlnAngleRot and _rlnAngleTilt columns on a Mollweide projection (longitude and latitude, respectively). Optionally, use --c and --q to only plot a subset of particles; otherwise, it will plot all. The result will be saved to Particle_orientations.png. Use --t to change filetype (see the Output options).
--plot_class_iterations
classes
Plot the number of particles per class for all iterations up to the one provided in the input (skips iterations 0 and 1). Pass "all" to plot all classes, or separate the classes that you want with a slash (e.g. 1/2/5). It can successfully handle filenames that have "_ct" in them if you've continued from intermediate jobs (only tested on a single continue). Use --t to change filetype (see the Output options).
--plot_class_proportions
Find the proportion of particle sets that belong to each class. At least two queries (--q, separated by slashes) must be provided along with the column to search in (--c) (see the Querying options). It will display the proportions in percentages and plot the result to Class_proportion.png. Use --t to change filetype (see the Output options).
--plot_coordinates
number-of-micrographs
Plot the particle coordinates for the input star file for each micrograph in a multi-page pdf (red circles). The argument to pass is the total number of micrographs to plot (pass "all" to plot all micrographs, but it might take a long time if there are many). Make sure you are running it in the Relion directory so that the micrograph .mrc files can be properly sourced (or change the _rlnMicrographName column to absolute paths). Use --f to overlay the coordinates of a second star file (larger blue circles); in this case, the micrograph names should match between the two star files. Optionally, pass the desired size of the circle after a slash (e.g. 1/250 for 1 micrograph and a circle size of 250 pixels). The plots are written to Coordinates.pdf.
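For reference, the Freedman-Diaconis rule that --histogram uses sets the bin width to 2 × IQR / n^(1/3), where IQR is the interquartile range and n is the number of values. The sketch below computes a bin count from that formula with the standard library only (Python 3.8 or later for statistics.quantiles). The helper name and the quartile method are choices made for this illustration, so the count may differ slightly from what starparser or a plotting library would pick.

```python
import math
import statistics

def freedman_diaconis_bins(values):
    """Return a bin count from the Freedman-Diaconis rule (illustrative helper)."""
    data = sorted(values)
    n = len(data)
    # Quartiles; the quantile method used here is an assumption of this sketch.
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    if iqr == 0:
        return 1  # no spread in the middle half of the data: use a single bin
    width = 2 * iqr / n ** (1 / 3)
    return max(1, math.ceil((data[-1] - data[0]) / width))

# Made-up values standing in for a numeric star file column.
bins = freedman_diaconis_bins(list(range(1, 101)))
```

Because the width depends on the interquartile range, columns with long tails or strongly clustered values can end up with very many or very few bins.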
Querying
--c
column-name(s)
Column query term(s). E.g. _rlnMicrographName. This is used to look for a specific query specified with --q. In cases where you can enter multiple columns, separate them with a slash: _rlnMicrographName/_rlnCoordinateX.
--q
query(ies)
Particle query term(s) to look for in the values within the specified column. To enter multiple queries, separate them with a slash: 20200101/20200203. Use --e if the query(ies) should exactly match the values in the column.
--e
Pass this if you want an exact match of the values to the query(ies) provided by --q. For example, you must pass this if you want just to look for "1" and ignore "15" (which has a "1" in it).
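The difference between the default matching and --e can be illustrated in a few lines of Python. This is a sketch of the behavior as described above (substring match by default, whole-value match with --e); the function name is hypothetical and the code is not taken from starparser.

```python
def matches(value, queries, exact=False):
    """Return True if `value` matches any query, by substring or exactly."""
    value = str(value)
    if exact:
        return any(value == q for q in queries)  # whole value must equal a query
    return any(q in value for q in queries)      # query may appear anywhere

# Made-up class numbers, stored as text as they would be in a star file.
classes = ["1", "15", "2", "10"]

loose = [c for c in classes if matches(c, ["1"])]
strict = [c for c in classes if matches(c, ["1"], exact=True)]
```

Here the loose query picks up "15" and "10" as well as "1", which is why the class extraction example in this README passes --e.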
Other
--opticsless
Pass this if the input star file lacks an optics group (more specifically: the star file has exactly one table), such as with Relion 3.0 files. It will create a dummy optics table before moving on. This option does not work with --plot_class_proportions or commands that require parsing a second file.
Output
--o
filename
Output file name. Default is output.star.
--t
filetype
File type of the plot that will be written. Choose between png, jpg, svg, and pdf. The default is png.
Limitations
- The Freedman-Diaconis rule for histogram binning is not always appropriate.
- Star files that lack a version header, which sometimes occurs with those generated outside of Relion, cannot be parsed. Temporary fix: add a "# version 30001" line, with a blank line above and below it, before each data table.
- The --plot_coordinates circle size does not exactly match the requested value. If you need it to be exact, save the file as pdf with --t pdf and open the plot in Illustrator to modify the circle size.
- --opticsless does not work when the second star file (--f) lacks an optics table or when multiple star files are being read. There is little incentive to fix this since few still use Relion 3.0.
- Data mining options do not check if the subset that was created has rendered one of the optics groups void; they retain all optics groups.
- --split_optics does not renumber the optics groups that were greater than 1 back to 1, although this does not affect any behavior downstream in Relion and elsewhere.
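The temporary fix for missing version headers mentioned in this list can be scripted. The sketch below assumes that every data table starts with a line beginning with "data_" and that "# version 30001" is the header wanted; the function name is hypothetical and the approach should be checked against your own files before use.

```python
def add_version_headers(text, version="30001"):
    """Insert a '# version' comment, padded by blank lines, before each data_ block."""
    out = []
    for line in text.splitlines():
        if line.startswith("data_"):
            # Look at the last non-blank line written so far to avoid duplicates.
            previous = [l for l in out if l.strip()]
            if not (previous and previous[-1].startswith("# version")):
                out.extend(["", f"# version {version}", ""])
        out.append(line)
    return "\n".join(out) + "\n"

# Made-up minimal star file content without version headers.
star_text = (
    "data_optics\n\nloop_\n_rlnOpticsGroup #1\n1\n\n"
    "data_particles\n\nloop_\n_rlnImageName #1\n000001@stack.mrcs\n"
)
fixed = add_version_headers(star_text)
```

Running the function on an already fixed file leaves it unchanged, so it is safe to apply more than once.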
Relion GUI Usage
- Use the External commands tab to run starparser within Relion. You don't need the double dash (--) in this case.
Scripting
- To parse a star file for downstream use in a python script:
from starparser import fileparser
particles, metadata = fileparser.getparticles("file.star")
- The particles DataFrame can be manipulated with pandas functions (see the example below). However, here are some examples of starparser options that are also available to use:
from starparser import columnplay, particleplay
#Remove columns with delcolumn(particles, columns, metadata)
new_particles, new_metadata = columnplay.delcolumn(particles, ["_rlnMicrographName", "_rlnOpticsGroup"], metadata)
#Remove particles with delparticles(particles, columns, queries, queryexact)
new_particles = particleplay.delparticles(particles, ["_rlnMicrographName"], ["0207"], False)
#Remove duplicates with delduplicates(particles, column)
new_particles = particleplay.delduplicates(particles, "_rlnMicrographName")
#Operate on a column with operate(particles, column, operator, value) where operator is one of "multiply", "divide", "add", or "subtracts"
new_particles = columnplay.operate(particles, "_rlnHelicalTrackLength", "multiply", 0.25)
#Limit values with limitparticles(particles, column, limit, operator) where operator is one of "lt" (less than) or "gt" (greater than)
new_particles = particleplay.limitparticles(particles, "_rlnDefocusU", 3000, "lt")
- After manipulating the particles, you can write the star file:
fileparser.writestar(new_particles, metadata, "output.star")
- A simple example showing how to iterate through micrographs and keep only one of three particles of a helix.
from starparser import fileparser
#import data to a pandas dataframe
particles, metadata = fileparser.getparticles("particles.star")
#group by micrographs
micrographs = particles.groupby(["_rlnMicrographName"])
keeplist = []
#iterate through the micrographs
for idm, micrograph in micrographs:
    #get the helices for the current micrograph
    helices = micrograph.groupby(["_rlnHelicalTubeID"])
    #iterate through the helices for this micrograph
    for idh, helix in helices:
        #get the indices for the particles
        indices = helix.index.tolist()
        #get the indices for one of every three particles in the helix
        keeplist.append(indices[::3])
#flatten the list; this is now the list of particles to keep
keeplist = [item for sublist in keeplist for item in sublist]
#write out a star file only containing those particles to keep
fileparser.writestar(particles[particles.index.isin(keeplist)], metadata, "particles_purged.star")
Examples
Plotting
- Plot a histogram of defocus values.
starparser --i run_data.star --histogram _rlnDefocusU
→ Output figure to DefocusU.png.
- Plot the particle orientation distribution.
starparser --i run_data.star --plot_orientations
→ Output figure to Particle_orientations.png.
- Plot the number of particles per class for the 25 iterations of a Class3D job.
starparser --i run_it025_data.star --plot_class_iterations all
→ Output figure to Class_distribution.png.
- Plot the proportion of particles in each class that belong to particles with the term 200702 versus those with the term 200826 in the _rlnMicrographName column.
starparser --i run_it025_data.star --plot_class_proportions --c _rlnMicrographName --q 200702/200826
→ The percentage in each class will be displayed in terminal.
→ Output figure to Class_proportion.png.
- Overlay the coordinates of two star files.
starparser --i particles.star --f select_particles.star --plot_coordinates 1
→ Plotting coordinates from the star file (red circles) and second file (blue circles) for 1 micrograph.
→ Output figure to Coordinates.pdf.
Modifying
Remove columns
starparser --i run_data.star --o run_data_del.star --remove_column _rlnCtfMaxResolution/_rlnCtfFigureOfMerit
→ A new star file named run_data_del.star will be identical to run_data.star except that it will be missing those two columns. The headers in the particles table will be renumbered.
Remove a subset of particles
starparser --i run_data.star --o run_data_del.star --remove_particles --c _rlnMicrographName --q 200702/200715
→ A new star file named run_data_del.star will be identical to run_data.star except that it will be missing any particles that have the term 200702 or 200715 in the _rlnMicrographName column. In this case, this was useful to remove particles from specific data-collection days that had the date in the filename.
Replace values in a column with those of a text file
starparser --i particles.star --replace_column _rlnOpticsGroup --f newoptics.txt --o particles_newoptics.star
→ A new star file named particles_newoptics.star will be output that will be identical to particles.star except for the _rlnOpticsGroup column, which will have the values found in newoptics.txt.
Swap columns
starparser --i run_data.star --f run_data_2.star --o run_data_swapped.star --swap_columns _rlnAnglePsi/_rlnAngleRot/_rlnAngleTilt/_rlnNormCorrection/_rlnLogLikeliContribution/_rlnMaxValueProbDistribution/_rlnNrOfSignificantSamples/_rlnOriginXAngst/_rlnOriginYAngst
→ A new star file named run_data_swapped.star will be output that will be identical to run_data.star except for the columns in the input, which will instead be swapped in from run_data_2.star. This is useful for sourcing alignments from early global refinements.
Regroup a star file
starparser --i run_data.star --o run_data_regroup200.star --regroup 200
→ A new star file named run_data_regroup200.star will be output that will be identical to run_data.star except for the _rlnGroupNumber or _rlnGroupName columns, which will be renumbered to have 200 particles per group.
Create a new optics group for a subset of particles
starparser --i run_data.star --o run_data_newoptics.star --new_optics myopticsname --c _rlnMicrographName --q 10090
→ A new star file named run_data_newoptics.star will be output that will be identical to run_data.star except that a new optics group called myopticsname will be created in the optics table and particles with the term 10090 in the _rlnMicrographName column will have modified _rlnOpticsGroup and/or _rlnOpticsName columns to match the new optics group.
Relegate a star file to be compatible with Relion 3.0
starparser --i run_data.star --o run_data_3p0.star --relegate
→ A new star file named run_data_3p0.star will be output that will be identical to run_data.star except that it will be missing the optics table and _rlnOpticsGroup column. The headers in the particles table will be renumbered accordingly.
Data mining
Extract a subset of particles
starparser --i run_data.star --o run_data_c1.star --extract --c _rlnClassNumber --q 1 --e
→ A new star file named run_data_c1.star will be output with only particles that belong to class 1. The --e option was passed to avoid extracting any class with the number 1 in it, such as "10", "11", etc.
Extract particles with limited defoci
starparser --i run_data.star --o run_data_under4um.star --limit _rlnDefocusU/lt/40000
→ A new star file named run_data_under4um.star will be output with only particles that have defocus estimations below 4 microns.
Count specific particles
starparser --i particles.star --o output.star --count --c _rlnMicrographName --q 200702/200715
→ There are 7726 particles that match ['200702', '200715'] in the specified columns (out of 69120, or 11.2%).
Count the number of micrographs
starparser --i run_data.star --count_mics
→ There are 7994 unique micrographs in this dataset.
Count the number of micrographs for specific particles
starparser --i run_data.star --count_mics --c _rlnMicrographName --q 200826
→ Creating a subset of 2358 particles that match ['200826'] in the columns ['_rlnMicrographName'] (or 3.4%)
→ There are 288 unique micrographs in this dataset.
List all items from a column in a text file
starparser --i run_data.star --list_column _rlnMicrographName
→ All entries of _rlnMicrographName will be written to MicrographName.txt in a single column.
List all items from multiple columns in independent text files
starparser --i run_data.star --list_column _rlnDefocusU/_rlnCoordinateX
→ All entries of _rlnDefocusU will be written to DefocusU.txt and all entries of _rlnCoordinateX will be written to CoordinateX.txt.
List all items from a column that match specific particles
starparser --i run_data.star --list_column _rlnDefocusU --c _rlnMicrographName --q 200826
→ Only _rlnDefocusU entries that have 200826 in _rlnMicrographName will be written to DefocusU.txt.
Compare particles between star files and extract those that are shared and unique
starparser --i run_data1.star --find_shared _rlnMicrographName --f run_data2.star
→ Two new star files will be created named shared.star and unique.star that will have only the particles that are unique to run_data1.star relative to run_data2.star (unique.star) and only the particles that are shared between them (shared.star) based on the _rlnMicrographName column.
Extract a random set of specific particles
starparser --i run_it025_data.star --extract_random 10000 --c _rlnMicrographName --q DOA3/OAA2
→ Two new star files will be created named DOA3_10000.star and OAA2_10000.star that will have a random set of 10000 particles that match DOA3 and OAA2 in the _rlnMicrographName column, respectively.
Split a star file
starparser --i particles.star --split 3
→ Three new star files called split_1.star, split_2.star, and split_3.star will be created with roughly equal numbers of particles. In this example, particles.star has 69120 particles and the split star files have 23053, 23042, and 23025 particles, respectively.
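The uneven particle counts in the split example follow from keeping every micrograph whole. One way to picture this is a greedy assignment in which each micrograph, with all of its particles, goes to whichever output file currently holds the fewest particles. This is a hypothetical sketch with made-up names, not necessarily the strategy starparser uses.

```python
from collections import Counter

def split_by_micrograph(mic_names, n_files):
    """Assign whole micrographs to n_files groups, balancing particle counts greedily."""
    counts = Counter(mic_names)              # particles per micrograph
    groups = [[] for _ in range(n_files)]    # micrograph names per output file
    totals = [0] * n_files                   # particle count per output file
    # Place the largest micrographs first so the totals stay as even as possible.
    for mic, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        k = totals.index(min(totals))        # currently smallest output file
        groups[k].append(mic)
        totals[k] += n
    return groups, totals

# Made-up micrograph names, one entry per particle.
mic_names = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2
groups, totals = split_by_micrograph(mic_names, 2)
```

With only two micrographs of 5 and 2 particles, the same function returns totals of 5 and 2, which shows why an exact even split is not always possible.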
License
This project is licensed under the MIT License - see the LICENSE.txt file for details.