
Matching Gaia clustered stars to known clusters


GaiaClusterFit

GaiaClusterFit is a Python library for optimizing the clustering of Gaia data against known clusters.

Installation

Use the package manager pip to install GaiaClusterFit.

pip install GaiaClusterFit

Basic Usage

Import library

from GaiaClusterFit import GCA

from GaiaClusterFit import evalmetric

Specify Gaia query

#GAIA database query

query ="""SELECT TOP 1000  source_id, b, l, parallax,phot_g_mean_mag,pmra,pmdec, RUWE, bp_rp,phot_g_mean_mag+5*log10(parallax)-10 as mg

FROM gaiadr3.gaia_source

WHERE l < 275 AND l > 240 

AND b < 5 AND b > -15

AND phot_g_mean_mag < 18

AND RUWE < 1.4

AND parallax < 4 AND parallax > 1.8

AND parallax_error/parallax < 0.02""" 
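The computed mg column is the absolute G magnitude: with parallax in milliarcseconds, the distance is d = 1000/parallax parsec, so M = m - 5*log10(d) + 5 reduces exactly to the phot_g_mean_mag + 5*log10(parallax) - 10 expression in the query. A quick stdlib check of that equivalence (the function name is only illustrative):

```python
import math

def absolute_mag(apparent_mag, parallax_mas):
    """Absolute magnitude from apparent magnitude and parallax (mas).

    d[pc] = 1000 / parallax_mas and M = m - 5*log10(d) + 5, which
    simplifies to the m + 5*log10(parallax) - 10 form in the ADQL query.
    """
    d_pc = 1000.0 / parallax_mas
    via_distance = apparent_mag - 5 * math.log10(d_pc) + 5
    via_parallax = apparent_mag + 5 * math.log10(parallax_mas) - 10
    assert math.isclose(via_distance, via_parallax)
    return via_parallax

absolute_mag(10.0, 10.0)  # a star of G = 10 at 100 pc has M_G = 5
```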

Create an instance and import data

#Create instance

job = GCA.GCAinstance(RegionName = "Char")



#Login and fetch GAIA Data

job.GaiaLogin(username='username', password='password')

job.FetchQueryAsync(query)



#Import known cluster

job.ImportRegion("G:/path/known_cluster.fits")

Fitting the clustered Gaia data to known clusters

#Parameters to optimize Cluster function over (HDBscan by default)

parameters = [{"variable": "min_cluster_size", "min":10, "max":100}]

Renaming cluster table columns to match GAIA column names

job.RenameCol(job.regiondata, [["Source", "source_id"],["Pop", "population"]])

Optimizing cluster function(HDBscan) over GAIA data to match known clusters

optimal = job.optimize_grid(fit_params=parameters, scoring_function=evalmetric.homogeneityscore)

The scoring function returns a score for the fit, based on homogeneity by default. Self-made scoring functions can be passed; they receive an astropy Gaia table and an astropy region table. optimize_grid returns the parameters with the highest score.
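For illustration only, a self-made scoring function could look like the sketch below. It follows the same call shape (Gaia table with predicted labels in a "population" column, region table with the known labels) but computes a simple purity score, with plain dict-of-lists tables standing in for the astropy tables; it is not the library's homogeneity implementation:

```python
from collections import Counter, defaultdict

def purity_score(datatable, regiondata):
    """Illustrative scoring function: (gaia table, region table) -> float,
    higher is better. Plain dicts stand in for astropy tables here."""
    # known cluster label per star, keyed on source_id
    truth = dict(zip(regiondata["source_id"], regiondata["population"]))
    # collect the known labels of each predicted cluster
    groups = defaultdict(list)
    for sid, pred in zip(datatable["source_id"], datatable["population"]):
        if sid in truth:
            groups[pred].append(truth[sid])
    matched = sum(len(v) for v in groups.values())
    if matched == 0:
        return 0.0
    # fraction of stars carrying their predicted cluster's majority label
    majority = sum(Counter(v).most_common(1)[0][1] for v in groups.values())
    return majority / matched

gaia = {"source_id": [1, 2, 3, 4], "population": [0, 0, 1, 1]}
region = {"source_id": [1, 2, 3, 4], "population": ["A", "A", "B", "B"]}
purity_score(gaia, region)  # perfect agreement scores 1.0
```

optimize_grid calls the scoring function once per parameter combination and keeps the best-scoring one.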

Code Descriptions

GCA.GCAinstance

GCAinstance(data=None, regiondata=None, RegionName="No region Name")

Creates an instance object used for clustering and cluster-match scoring later on.

All three arguments (data, regiondata, RegionName) are optional.

Later, instance.datatable and instance.regiondata can be populated by querying the Gaia database (GCAinstance.GaiaLogin and GCAinstance.FetchQueryAsync) or by uploading a Gaia FITS table through instance.ImportDataTable and instance.ImportRegion.

  • data : an astropy.table table containing star data

  • regiondata: an astropy.table table containing known cluster data

GCAinstance.ImportDataTable()

def ImportDataTable(self, path): #import a FITS data table coming from Gaia or another source

  self.datatable = Table(fits.open(path)[1].data)

Imports a Gaia table from the .fits format and stores it in self.datatable.

  • path: a string specifying the path to the .fits table file containing star data

GCAinstance.ExportDataTable()

def ExportDataTable(self, path, **kwargs): #export self.datatable to any supported format (.fits recommended for re-importing)

     self.datatable.write(f'{path}', **kwargs)

Exports self.datatable to a .fits file at the specified path. Keyword arguments are passed through to astropy's Table.write(**kwargs).

  • path: a string specifying the path where the .fits table file containing star data will be stored

GCAinstance.ImportRegion()

def ImportRegion(self, path): #import a FITS table of known clusters

  self.regiondata = Table(fits.open(path)[1].data)

Imports a Gaia table from the .fits format and stores it in self.regiondata.

  • path: a string specifying the path to the .fits table file containing cluster region data

GCAinstance.ExportRegion()

def ExportRegion(self, path, **kwargs): #export self.regiondata to any supported format (.fits recommended for re-importing)

     self.regiondata.write(f'{path}', **kwargs)

Exports self.regiondata to a .fits file at the specified path. Keyword arguments are passed through to astropy's Table.write(**kwargs).

  • path: a string specifying the path where the .fits table file containing cluster region data will be stored

GCAinstance.GaiaLogin()

def GaiaLogin(self, username, password):

  Gaia.login(user=str(username), password=str(password))

GCAinstance.GaiaLogin() initiates a Gaia database session with your personal credentials (username="username", password="password"). This allows asynchronous data queries (GCAinstance.FetchQueryAsync()) against the Gaia database. The session is constrained to the instance, allowing multiple instances to initiate different sessions.

  • username: a string specifying your GAIA username credential

  • password: a string specifying your GAIA password credential

GCAinstance.FetchQueryAsync()

def FetchQueryAsync(self, query, **kwargs):

  job = Gaia.launch_job_async(query, **kwargs)

  self.datatable = job.get_results()

GCAinstance.FetchQueryAsync(query, **kwargs) accepts an ADQL-formatted query to fetch Gaia data and writes the result to GCAinstance.datatable.

  • query: a string containing the ADQL query to run

  • kwargs: all keyword arguments that astroquery's Gaia.launch_job_async also accepts

GCAinstance.RenameCol()

def RenameCol(self, table, newnames):

    for i in newnames:

      table.rename_column(i[0],i[1])

The RenameCol function renames the columns of an astropy.table object. GaiaClusterFit requires that the region data and Gaia data share column names, so it is standard practice to rename the GCAinstance.regiondata columns to match the Gaia columns, e.g. GCAinstance.RenameCol(GCAinstance.regiondata, [["Source", "source_id"], ["Pop", "population"]]). The default column name for labeled cluster data in GCAinstance.datatable is "population".

  • table: astropy.table table object

  • newnames: 2D python list as such [["old column name 1","new column name 1"],["old column name 2","new column name 2"]]
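The [["old name", "new name"], ...] pair format can be illustrated with a plain dict standing in for the astropy table (the real RenameCol delegates to astropy's rename_column); this helper is only a sketch:

```python
def rename_columns(table, newnames):
    """Apply [["old", "new"], ...] pairs to a dict-based table,
    mirroring what RenameCol does through astropy's rename_column."""
    for old, new in newnames:
        table[new] = table.pop(old)
    return table

catalog = {"Source": [101, 102], "Pop": ["A", "B"]}
rename_columns(catalog, [["Source", "source_id"], ["Pop", "population"]])
# catalog now uses the Gaia-style keys "source_id" and "population"
```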

GCAinstance.Plot()

def Plot(self, xaxis = "b", yaxis = "l", **kwargs):

    plt.title(f"{self.regionname}")

    plt.scatter(self.datatable[xaxis],self.datatable[yaxis], **kwargs)

    plt.ylabel(yaxis)

    plt.xlabel(xaxis)

    plt.xlim(max(self.datatable[xaxis]), min(self.datatable[xaxis])) #inverted x-axis

    plt.show()

GCAinstance.Plot() plots GCAinstance.datatable using matplotlib.pyplot. The x and y dimensions of the plot can be controlled with xaxis="GAIA parameter" and yaxis="GAIA parameter", where the parameter can be the string name of any column in GCAinstance.datatable. **kwargs takes any keyword argument matplotlib.pyplot.scatter accepts.

  • xaxis: column name of column in GCAinstance.datatable to display on the x-axis

  • yaxis: column name of column in GCAinstance.datatable to display on the y-axis

  • kwargs: general keyword arguments accepted by matplotlib.pyplot.scatter()

GCAinstance.PlotCluster()

  def PlotCluster(self, xaxis="b", yaxis ="l", clusterer="HDBSCAN", remove_outliers =False , **kwargs): #modified plot function with outlier filtration and Cluster selection

    try:

      fig, ax = plt.subplots(figsize=(10,10))



      plotdata = (self.datatable[xaxis], self.datatable[yaxis])

      labels = self.datatable[clusterer]



      if remove_outliers: #truthiness check: pass e.g. remove_outliers="HDBSCAN" to select the "HDBSCAN_outlier" column

        plotdata = self.datatable[xaxis][self.datatable[f"{remove_outliers}_outlier"]],self.datatable[yaxis][self.datatable[f"{remove_outliers}_outlier"]]

        labels = self.datatable[clusterer][self.datatable[f"{remove_outliers}_outlier"]]

      ax.set_title(f"{clusterer} clusters in \n {self.regionname} \n Outliers removed = {remove_outliers} ")

      ax.scatter(*plotdata, c=labels, **kwargs)

      ax.set_ylabel(yaxis)

      ax.set_xlabel(xaxis)

      plt.show()

      return fig,ax

    except:

      if clusterer not in self.datatable.columns:

        print(f"Error: you did not perform the {clusterer} clustering yet. No {clusterer} column found in self.datatable")

      return fig,ax

The GCAinstance.PlotCluster() function plots the cluster labels alongside the GCAinstance.datatable data, and therefore requires that GCAinstance.datatable has been clustered beforehand with GCAinstance.cluster(). The x and y dimensions of the plot can be controlled with xaxis="GAIA parameter" and yaxis="GAIA parameter", where the parameter can be the string name of any column in GCAinstance.datatable. **kwargs takes any keyword argument matplotlib.pyplot.scatter accepts.

  • xaxis: column name of column in GCAinstance.datatable to display on the x-axis

  • yaxis: column name of column in GCAinstance.datatable to display on the y-axis

  • clusterer: name of the cluster function whose most recently formed clusters to display

GCAinstance.cluster()

  def cluster(self, clusterer = HDBSCAN, dimensions = ["b","l","parallax","pmdec","pmra"],**kwargs):

        print(f"Running {clusterer.__name__} on {self.regionname} over {dimensions}\n") #clusterer is still the class here, so use __name__

        dataselection = [self.datatable[param] for param in dimensions] #N dimensional HDBscan

        data =StandardScaler().fit_transform(np.array(dataselection).T)

        clusterer = clusterer(**kwargs)

        clusterer.fit(data)

        clusterer.fit_predict(data) #in case of artificial or unknown stars we can use fit_predict to predict the cluster they would belong to

        labels = clusterer.labels_ #list of all stars in which a number encodes to what cluster it is assigned

        self.datatable[f"{clusterer.__class__.__name__}"] = labels #append all labels to the designated "clustername "self.datatable table

        self.clusterer = clusterer  

        return clusterer 

cluster(self, clusterer=HDBSCAN, dimensions=["b","l","parallax","pmdec","pmra"], **kwargs) clusters the GCAinstance.datatable data with the specified clustering algorithm. The function returns the fitted clusterer instance. The resulting cluster labels are written to GCAinstance.datatable["cluster algorithm name"].

  • dimensions = ["GCAinstance.datatable column names"] determines which columns of GCAinstance.datatable are used to cluster the data

  • clusterer = cluster_algorithm passes a clustering class that is used to cluster the data. By default this class should only need the to-be-clustered data, e.g. clusterer = GCA.HDBSCAN, clusterer = GCA.OPTICS, clusterer = sklearn.cluster.DBSCAN, etc.

  • **kwargs accepts keyword arguments that are passed on to the clustering algorithm (HDBSCAN, DBSCAN, etc.)
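The internal pipeline of cluster() (stack the chosen columns, standardize, fit the clusterer) can be sketched on synthetic data. This sketch uses sklearn's DBSCAN in place of the HDBSCAN default and made-up blob data, so the column names and numbers are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# two synthetic "clusters" standing in for Gaia (b, l) columns
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(50, 2))
data = StandardScaler().fit_transform(np.vstack([blob_a, blob_b]))

# DBSCAN stands in for cluster()'s HDBSCAN default
clusterer = DBSCAN(eps=0.3, min_samples=5)
labels = clusterer.fit_predict(data)  # one integer label per star; -1 means noise
n_clusters = len(set(labels) - {-1})
```

In the library, the labels array would then be appended as a new column of GCAinstance.datatable.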

GCAinstance.optimize_grid()

def optimize_grid(self, dimensions= ["b","l","parallax","pmdec","pmra"], clusterer=HDBSCAN, fit_params=None, scoring_function=scoringfunction, **kwargs):     

      dataselection = [self.datatable[param] for param in dimensions] #N dimensional HDBscan

        

      data = StandardScaler().fit_transform(np.array(dataselection).T)

      scores= []

      param_values = []

      point_variable_names = [i["variable"] for i in fit_params]

      point_variable_list = [list(range(i["min"], i["max"])) for i in fit_params]

      combination = [p for p in itertools.product(*point_variable_list)]

      combination = [dict(zip(point_variable_names, i)) for i in combination]

      for i in tqdm(combination):

        cluster = clusterer(**i, **kwargs)

        cluster.fit(data)

        cluster.fit_predict(data) #in case of artificial or unknown stars we can use fit_predict to predict the cluster they would belong to

        labels = cluster.labels_

        self.datatable["population"] = labels

        scores.append(scoring_function(self.datatable, self.regiondata))

        param_values.append(i)

      max_score_index, max_score = np.argmax(scores) , np.max(scores)

      return param_values[max_score_index]

GCAinstance.optimize_grid(self, dimensions=["b","l","parallax","pmdec","pmra"], clusterer=HDBSCAN, fit_params=None, scoring_function=scoringfunction, **kwargs) fits the cluster function clusterer over a given set of parameter intervals fit_params to optimize a scoring_function. This scoring function compares the predicted clusters to the known clusters; the highest score marks the best fit (according to the scoring_function).

The function returns the dictionary of parameter values that produced the highest score.

  • dimensions: the columns of GCAinstance.datatable to cluster over

  • clusterer: a clustering class used to cluster the data. By default this class should only need the to-be-clustered data, e.g. clusterer = GCA.HDBSCAN, clusterer = GCA.OPTICS, clusterer = sklearn.cluster.DBSCAN, etc.

  • fit_params: a Python list of dicts formatted as follows: [{"variable": "cluster argument", "min": 10, "max": 20}, {"variable": "cluster argument", "min": 5, "max": 40}]

  • scoring_function: accepts a function that takes GCAinstance.datatable and GCAinstance.regiondata and returns a score. A set of properly formatted, out-of-the-box scoring functions is included in GaiaClusterFit.evalmetric.
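The fit_params list expands into a Cartesian grid of candidate settings, mirroring the range/itertools.product lines in the optimize_grid source above. Note that the "max" bound is exclusive, because Python's range() excludes its stop value (the parameter names here are illustrative):

```python
import itertools

fit_params = [
    {"variable": "min_cluster_size", "min": 10, "max": 13},
    {"variable": "min_samples", "min": 1, "max": 3},
]

names = [p["variable"] for p in fit_params]
ranges = [range(p["min"], p["max"]) for p in fit_params]  # "max" is exclusive
grid = [dict(zip(names, combo)) for combo in itertools.product(*ranges)]
# 3 values x 2 values = 6 candidate settings; each is one dict of kwargs
# that optimize_grid passes to the clusterer and scores in turn
```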

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT
