Matching Gaia clustered stars to known clusters
GaiaClusterFit
GaiaClusterFit is a Python library for optimizing the clustering of Gaia data against known clusters.
Installation
Use the package manager pip to install GaiaClusterFit.
pip install GaiaClusterFit
Basic Usage
Import library
from GaiaClusterFit import GCA
from GaiaClusterFit import evalmetric
Specify Gaia query
#GAIA database query
query ="""SELECT TOP 1000 source_id, b, l, parallax,phot_g_mean_mag,pmra,pmdec, RUWE, bp_rp,phot_g_mean_mag+5*log10(parallax)-10 as mg
FROM gaiadr3.gaia_source
WHERE l < 275 AND l > 240
AND b < 5 AND b > -15
AND phot_g_mean_mag < 18
AND RUWE < 1.4
AND parallax < 4 AND parallax > 1.8
AND parallax_error/parallax < 0.02"""
Create an instance and import data
#Create instance
job = GCA.GCAinstance(RegionName = "Char")
#Login and fetch GAIA Data
job.GaiaLogin(username='username', password='password')
job.FetchQueryAsync(query)
#Import known cluster
job.ImportRegion("G:/path/known_cluster.fits")
Set up a basic fit of the clustering function to match clustered Gaia data to known clusters
#Parameters to optimize Cluster function over (HDBscan by default)
parameters = [{"variable": "min_cluster_size", "min":10, "max":100}]
Rename the cluster table columns to match the Gaia column names
job.RenameCol(job.regiondata, [["Source", "source_id"],["Pop", "population"]])
Optimize the clustering function (HDBSCAN) over the Gaia data to match the known clusters
optimal = job.optimize_grid(fit_params=parameters, scoring_function=evalmetric.homogeneityscore)
The scoring function returns a score for the fit, by default based on homogeneity. Self-made scoring functions can be passed as well; they receive an astropy Gaia table and an astropy region table. optimize_grid returns the parameters that yield the highest score. A sketch of a custom scoring function is shown below.
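A minimal sketch of such a custom scoring function, assuming both tables contain a source_id column and the known cluster labels sit in a "population" column of the region table (as produced by the RenameCol call above); optimize_grid writes the predicted labels to datatable["population"] before calling the scorer, so the matching scheme below is only one possible choice:

import numpy as np
from sklearn.metrics import homogeneity_score

def my_score(datatable, regiondata):
    # map source_id -> known cluster label from the region table
    known = {row["source_id"]: row["population"] for row in regiondata}
    # keep only the stars that also appear in the known-cluster table
    mask = np.array([sid in known for sid in datatable["source_id"]])
    true_labels = [known[sid] for sid in datatable["source_id"][mask]]
    pred_labels = datatable["population"][mask]
    return homogeneity_score(true_labels, pred_labels)

optimal = job.optimize_grid(fit_params=parameters, scoring_function=my_score)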
Code Descriptions
GCA.GCAinstance
GCAinstance(data =None, regiondata =None, RegionName = "No region Name")
Creates an instance object used for clustering and cluster-match scoring later on. All arguments (data=None, regiondata=None, RegionName="No region Name") are optional. instance.datatable and instance.regiondata can later be populated by querying the Gaia database (GCAinstance.GaiaLogin and GCAinstance.FetchQueryAsync) or by loading Gaia FITS tables through instance.ImportDataTable and instance.ImportRegion. A usage example follows the parameter list.
- data: an astropy.table Table containing star data
- regiondata: an astropy.table Table containing known cluster data
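As an illustration, an instance can also be created directly from tables already on disk (the file names here are hypothetical):

from astropy.table import Table
from GaiaClusterFit import GCA

stars = Table.read("stars.fits")              # star data, hypothetical file
clusters = Table.read("known_cluster.fits")   # known cluster members, hypothetical file
job = GCA.GCAinstance(data=stars, regiondata=clusters, RegionName="Char")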
GCAinstance.ImportDataTable()
def ImportDataTable(self, path):  # import a FITS data table, e.g. coming from Gaia
    self.datatable = Table(fits.open(path)[1].data)
Imports a Gaia table from a .fits file and stores it in self.datatable.
- path: a string specifying the path to the .fits table file containing star data
GCAinstance.ExportDataTable()
def ExportDataTable(self, path, **kwargs):  # export self.datatable to any format (.fits recommended for re-importing)
    self.datatable.write(f'{path}', **kwargs)
Exports self.datatable to a file (typically .fits) at the specified path. Keyword arguments are passed through to astropy's Table.write() function. A usage example follows the parameter list.
- path: a string specifying the path where the .fits table file containing star data will be stored
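For example (overwrite is a standard astropy Table.write keyword; the output path is hypothetical):

job.ExportDataTable("clustered_stars.fits", overwrite=True)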
GCAinstance.ImportRegion()
def ImportRegion(self, path):  # import a FITS table of known clusters
    self.regiondata = Table(fits.open(path)[1].data)
Imports a Gaia-format table from a .fits file and stores it in self.regiondata.
- path: a string specifying the path to the .fits table file containing cluster region data
GCAinstance.ExportRegion()
def ExportRegion(self, path, **kwargs):  # export self.regiondata to any format (.fits recommended for re-importing)
    self.regiondata.write(f'{path}', **kwargs)
Exports self.regiondata to a file (typically .fits) at the specified path. Keyword arguments are passed through to astropy's Table.write() function.
- path: a string specifying the path where the .fits table file containing cluster region data will be stored
GCAinstance.GaiaLogin()
def GaiaLogin(self, username, password):
    Gaia.login(user=str(username), password=str(password))
The GCAinstance.GaiaLogin() function initiates a Gaia database session using personal credentials (username="username", password="password"). This allows asynchronous data queries (GCAinstance.FetchQueryAsync()) from the Gaia database. The session is tied to the instance, allowing multiple instances to initiate different sessions.
- username: a string specifying your Gaia username credential
- password: a string specifying your Gaia password credential
GCAinstance.FetchQueryAsync()
def FetchQueryAsync(self, query, **kwargs):
    job = Gaia.launch_job_async(query, **kwargs)
    self.datatable = job.get_results()
The GCAinstance.FetchQueryAsync(query, **kwargs) function accepts an ADQL-formatted query to fetch Gaia data. It writes the result to GCAinstance.datatable. A usage example follows the parameter list.
- query: a string containing the ADQL query to run
- kwargs: any keyword arguments that astroquery's Gaia.launch_job_async also accepts
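For example, passing astroquery's verbose keyword through to launch_job_async (query as defined in Basic Usage):

job.FetchQueryAsync(query, verbose=True)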
GCAinstance.RenameCol()
def RenameCol(self, table, newnames):
    for i in newnames:
        table.rename_column(i[0], i[1])
The RenameCol function converts the column names of an astropy.table object to a set of new names. Within GaiaClusterFit the region data and the Gaia data are required to share column names, so it is standard practice to rename the GCAinstance.regiondata columns to match the Gaia columns, e.g. GCAinstance.RenameCol(GCAinstance.regiondata, [["Source", "source_id"], ["Pop", "population"]]). The default column name for labeled cluster data in GCAinstance.datatable is "population".
- table: an astropy.table Table object
- newnames: a 2D Python list such as [["old column name 1", "new column name 1"], ["old column name 2", "new column name 2"]]
GCAinstance.Plot()
def Plot(self, xaxis="b", yaxis="l", **kwargs):
    plt.title(f"{self.regionname}")
    plt.scatter(self.datatable[xaxis], self.datatable[yaxis], **kwargs)
    plt.ylabel(yaxis)
    plt.xlabel(xaxis)
    plt.xlim(max(self.datatable[xaxis]), min(self.datatable[xaxis]))  # inverted x-axis
    plt.show()
GCAinstance.Plot() plots GCAinstance.datatable using matplotlib.pyplot. The x and y dimensions of the plot can be controlled with xaxis="Gaia parameter" and yaxis="Gaia parameter", where the Gaia parameter can be the string name of any column in GCAinstance.datatable. **kwargs takes any keyword argument matplotlib.pyplot.scatter() accepts. A usage example follows the parameter list.
- xaxis: name of the column in GCAinstance.datatable to display on the x-axis
- yaxis: name of the column in GCAinstance.datatable to display on the y-axis
- kwargs: general keyword arguments accepted by matplotlib.pyplot.scatter()
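For example, plotting galactic longitude against latitude with small markers (s is a matplotlib.pyplot.scatter keyword; job is the instance from Basic Usage):

job.Plot(xaxis="l", yaxis="b", s=1)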
GCAinstance.PlotCluster()
def PlotCluster(self, xaxis="b", yaxis="l", clusterer="HDBSCAN", remove_outliers=False, **kwargs):  # modified plot function with outlier filtering and cluster selection
    try:
        fig, ax = plt.subplots(figsize=(10, 10))
        plotdata = (self.datatable[xaxis], self.datatable[yaxis])
        labels = self.datatable[clusterer]
        if remove_outliers == True:
            plotdata = (self.datatable[xaxis][self.datatable[f"{remove_outliers}_outlier"]],
                        self.datatable[yaxis][self.datatable[f"{remove_outliers}_outlier"]])
            labels = self.datatable[clusterer][self.datatable[f"{remove_outliers}_outlier"]]
        ax.set_title(f"{clusterer} clusters in \n {self.regionname} \n Outliers removed = {remove_outliers}")
        ax.scatter(*plotdata, c=labels, **kwargs)
        ax.set_ylabel(yaxis)
        ax.set_xlabel(xaxis)
        plt.show()
        return fig, ax
    except:
        if clusterer not in self.datatable.columns:
            print(f"Error: You did not perform the {clusterer} clustering yet. No {clusterer} column found in self.datatable")
        return fig, ax
The GCAinstance.PlotCluster() function plots the cluster labels alongside the GCAinstance.datatable data. This requires GCAinstance.datatable to have been clustered beforehand with the GCAinstance.cluster() function. The x and y dimensions of the plot can be controlled with xaxis="Gaia parameter" and yaxis="Gaia parameter", where the Gaia parameter can be the string name of any column in GCAinstance.datatable. **kwargs takes any keyword argument matplotlib.pyplot.scatter() accepts. A usage example follows the parameter list.
- xaxis: name of the column in GCAinstance.datatable to display on the x-axis
- yaxis: name of the column in GCAinstance.datatable to display on the y-axis
- clusterer: name of the clustering function whose most recently formed clusters should be displayed
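For example, after GCAinstance.cluster() has been run with the default HDBSCAN clusterer (s is a matplotlib scatter keyword; job is the instance from Basic Usage):

fig, ax = job.PlotCluster(xaxis="l", yaxis="b", clusterer="HDBSCAN", s=1)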
GCAinstance.cluster()
def cluster(self, clusterer=HDBSCAN, dimensions=["b", "l", "parallax", "pmdec", "pmra"], **kwargs):
    print(f"Running {clusterer.__name__} on {self.regionname} over {dimensions}\n")
    dataselection = [self.datatable[param] for param in dimensions]  # N-dimensional clustering
    data = StandardScaler().fit_transform(np.array(dataselection).T)
    clusterer = clusterer(**kwargs)
    clusterer.fit(data)
    clusterer.fit_predict(data)  # in case of artificial or unknown stars, fit_predict can predict the cluster they would belong to
    labels = clusterer.labels_  # one label per star, encoding the cluster it is assigned to
    self.datatable[f"{clusterer.__class__.__name__}"] = labels  # append the labels to self.datatable under the clusterer's name
    self.clusterer = clusterer
    return clusterer
cluster(self, clusterer=HDBSCAN, dimensions=["b", "l", "parallax", "pmdec", "pmra"], **kwargs) clusters the GCAinstance.datatable data with the specified clustering algorithm. The function returns the clusterer instance. The resulting cluster labels are written to GCAinstance.datatable["cluster algorithm name"]. A usage example follows the parameter list.
- dimensions: a list of GCAinstance.datatable column names determining which columns the data is clustered over
- clusterer: the clustering class used to cluster the data. By default this class should only take the to-be-clustered data, e.g. clusterer=GCA.HDBSCAN, clusterer=GCA.OPTICS, clusterer=sklearn.cluster.DBSCAN, etc.
- kwargs: keyword arguments that are passed on to the clustering algorithm (HDBSCAN, DBSCAN, etc.)
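For example, continuing the Basic Usage instance (min_cluster_size and min_samples are standard HDBSCAN keywords; the values are illustrative):

clusterer = job.cluster(clusterer=GCA.HDBSCAN,
                        dimensions=["b", "l", "parallax", "pmdec", "pmra"],
                        min_cluster_size=50, min_samples=10)
# the resulting labels are stored in job.datatable["HDBSCAN"]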
GCAinstance.optimize_grid()
def optimize_grid(self, dimensions=["b", "l", "parallax", "pmdec", "pmra"], clusterer=HDBSCAN, fit_params=None, scoring_function=scoringfunction, **kwargs):
    dataselection = [self.datatable[param] for param in dimensions]  # N-dimensional clustering
    data = StandardScaler().fit_transform(np.array(dataselection).T)
    scores = []
    param_values = []
    point_variable_names = [i["variable"] for i in fit_params]
    point_variable_list = [list(range(i["min"], i["max"])) for i in fit_params]
    combination = [p for p in itertools.product(*point_variable_list)]
    combination = [dict(zip(point_variable_names, i)) for i in combination]
    for i in tqdm(combination):
        cluster = clusterer(**i, **kwargs)
        cluster.fit(data)
        cluster.fit_predict(data)  # in case of artificial or unknown stars, fit_predict can predict the cluster they would belong to
        labels = cluster.labels_
        self.datatable["population"] = labels
        scores.append(scoring_function(self.datatable, self.regiondata))
        param_values.append(i)
    max_score_index, max_score = np.argmax(scores), np.max(scores)
    return param_values[max_score_index]
GCAinstance.optimize_grid(self, dimensions=["b", "l", "parallax", "pmdec", "pmra"], clusterer=HDBSCAN, fit_params=None, scoring_function=scoringfunction, **kwargs) performs a grid search that fits the clustering function clusterer over a given set of parameter intervals fit_params to optimize a scoring_function. The scoring function compares the predicted clusters to the known clusters; the highest score corresponds to the best fit (according to the scoring_function). The function returns a dictionary with the optimized parameter values. A usage example follows the parameter list.
- dimensions: the dimensions/data columns of GCAinstance.datatable to cluster over
- clusterer: the clustering class used to cluster the data. By default this class should only take the to-be-clustered data, e.g. clusterer=GCA.HDBSCAN, clusterer=GCA.OPTICS, clusterer=sklearn.cluster.DBSCAN, etc.
- fit_params: a Python list of dicts formatted as [{"variable": "cluster argument", "min": 10, "max": 20}, {"variable": "cluster argument", "min": 5, "max": 40}]
- scoring_function: a function that takes GCAinstance.datatable and GCAinstance.regiondata and returns a score. A set of ready-to-use scoring functions is included in GaiaClusterFit.evalmetric.
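For example, a grid search over two HDBSCAN parameters using the bundled homogeneity score (continuing the Basic Usage instance; the parameter ranges and printed output are illustrative):

parameters = [{"variable": "min_cluster_size", "min": 10, "max": 100},
              {"variable": "min_samples", "min": 5, "max": 40}]
best = job.optimize_grid(dimensions=["b", "l", "parallax", "pmdec", "pmra"],
                         clusterer=GCA.HDBSCAN,
                         fit_params=parameters,
                         scoring_function=evalmetric.homogeneityscore)
print(best)   # e.g. {"min_cluster_size": 42, "min_samples": 7}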
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
License