
Matching Gaia clustered stars to known clusters

Project description

GaiaClusterFit

GaiaClusterFit is a Python library for optimizing the clustering of Gaia data.

Installation

Use the package manager pip to install GaiaClusterFit.

pip install GaiaClusterFit

Basic Usage

Import library

from GaiaClusterFit import GCA

from GaiaClusterFit import evalmetric

Specify Gaia query

#GAIA database query

query ="""SELECT TOP 1000  source_id, b, l, parallax,phot_g_mean_mag,pmra,pmdec, RUWE, bp_rp,phot_g_mean_mag+5*log10(parallax)-10 as mg

FROM gaiadr3.gaia_source

WHERE l < 275 AND l > 240 

AND b < 5 AND b > -15

AND phot_g_mean_mag < 18

AND RUWE < 1.4

AND parallax < 4 AND parallax > 1.8

AND parallax_error/parallax < 0.02""" 
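The `mg` column in the query above is the absolute G magnitude, computed from the apparent magnitude and the parallax in milliarcseconds. As a minimal sketch, the same expression in Python looks like this (the function name `absolute_g_mag` is made up for illustration and is not part of GaiaClusterFit):

```python
import math

def absolute_g_mag(phot_g_mean_mag, parallax_mas):
    """Absolute magnitude from apparent magnitude and parallax in milliarcseconds.

    Same expression as the `mg` column in the query:
    M = m + 5*log10(parallax) - 10, which follows from M = m - 5*log10(d) + 5
    with the distance d = 1000 / parallax in parsec.
    """
    if parallax_mas <= 0:
        raise ValueError("parallax must be positive")
    return phot_g_mean_mag + 5 * math.log10(parallax_mas) - 10

# A star at 10 mas (100 pc) with G = 10 has an absolute magnitude of 5.
print(absolute_g_mag(10.0, 10.0))
```

The query restricts parallax to positive values (between 1.8 and 4 mas), so the logarithm is always defined for the selected stars.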

Create an instance and import data

#Create instance

job = GCA.GCAinstance(RegionName = "Char")



#Login and fetch GAIA Data

job.GaiaLogin(username='username', password='password')

job.FetchQueryAsync(query)



#Import known cluster

job.ImportRegion("G:/path/known_cluster.fits")

Setting up the parameters for fitting clustered Gaia data to known clusters

#Parameters to optimize Cluster function over (HDBscan by default)

parameters = [{"variable": "min_cluster_size", "min":10, "max":100}]

Renaming cluster table columns to match GAIA column names

job.RenameCol(job.regiondata, [["Source", "source_id"],["Pop", "population"]])

Optimizing cluster function(HDBscan) over GAIA data to match known clusters

optimal = job.optimize_grid(fit_params=parameters, scoring_function=evalmetric.homogeneityscore)

The scoring function returns a score for the fit, based on homogeneity by default. Self-made scoring functions can also be passed; they receive an astropy Gaia table and an astropy region table. optimize_grid returns the parameters with the highest score.
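As an illustration of a self-made scoring function, here is a minimal sketch that computes the standard homogeneity measure, 1 - H(C|K)/H(C), from the two tables. It is not the implementation behind `evalmetric.homogeneityscore`. The function name is hypothetical, and it assumes both tables carry `source_id` and `population` columns (as after the renaming step above) and support `table["column"]` lookups.

```python
import math
from collections import Counter

def homogeneity_by_source_id(datatable, regiondata):
    """Hypothetical scoring function: homogeneity of predicted vs. known labels.

    Both arguments only need to support table["column"] lookups, so astropy
    tables and plain dicts of lists both work. Stars are matched on
    "source_id"; labels are read from "population" in both tables.
    """
    known = dict(zip(regiondata["source_id"], regiondata["population"]))
    pairs = [
        (known[sid], pred)
        for sid, pred in zip(datatable["source_id"], datatable["population"])
        if sid in known
    ]
    if not pairs:
        return 0.0  # no overlap between the two tables
    n = len(pairs)
    true_counts = Counter(t for t, _ in pairs)
    pred_counts = Counter(p for _, p in pairs)
    joint_counts = Counter(pairs)

    # Entropy of the known labels, H(C).
    h_true = -sum(c / n * math.log(c / n) for c in true_counts.values())
    if h_true == 0.0:
        return 1.0  # only one known cluster: trivially homogeneous
    # Conditional entropy of known labels given predicted labels, H(C|K).
    h_cond = -sum(
        c / n * math.log(c / pred_counts[p]) for (_, p), c in joint_counts.items()
    )
    return 1.0 - h_cond / h_true
```

A function with this shape could then be passed as `scoring_function=homogeneity_by_source_id`. A score of 1 means every predicted cluster contains stars from a single known cluster; a score of 0 means the predicted clusters carry no information about the known ones.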

Code Descriptions

GCA.GCAinstance

GCAinstance(data =None, regiondata =None, RegionName = "No region Name")

Creates an instance object used for clustering and for scoring cluster matches later on.

All arguments (data =None, regiondata =None, RegionName = "No region Name") are optional.

Later, instance.datatable and instance.regiondata can be populated by querying the Gaia database (GCAinstance.GaiaLogin and GCAinstance.FetchQueryAsync) or by loading a Gaia FITS table through instance.ImportDataTable and instance.ImportRegion.

  • data : an astropy.table table containing star data

  • regiondata: an astropy.table table containing known cluster data

GCAinstance.ImportDataTable()

def ImportDataTable(self,path): #import a fits datatable coming from Gaia or elsewhere

  self.datatable =Table(fits.open(path)[1].data)

Imports a GAIA table from the .fits format and stores it to self.datatable

  • path: a string specifying the path to the .fits table file containing star data

GCAinstance.ExportDataTable()

def ExportDataTable(self, path, **kwargs): #export self.datatable to any format (.fits is recommended for re-importing)

     self.datatable.write(f'{path}',**kwargs)

Exports self.datatable to a file at the specified path (.fits is recommended). Keyword arguments are passed on to the astropy Table.write() method.

  • path: a string specifying the path where the .fits table file containing star data will be stored

GCAinstance.ImportRegion()

def ImportRegion(self,path): #import a fits table of known clusters

  self.regiondata =Table(fits.open(path)[1].data)

Imports a table of known clusters from the .fits format and stores it to self.regiondata

  • path: a string specifying the path to the .fits table file containing cluster region data

GCAinstance.ExportRegion()

def ExportRegion(self, path, **kwargs): #export self.regiondata to any format (.fits is recommended for re-importing)

     self.regiondata.write(f'{path}',**kwargs)

Exports self.regiondata to a file at the specified path (.fits is recommended). Keyword arguments are passed on to the astropy Table.write() method.

  • path: a string specifying the path where the .fits table file containing cluster region data will be stored

GCAinstance.GaiaLogin()

def GaiaLogin(self, username, password):

  Gaia.login(user=str(username), password=str(password))

GCAinstance.GaiaLogin() initiates a Gaia database session using your personal credentials (username="username", password="password"). This allows asynchronous data queries (GCAinstance.FetchQueryAsync()) against the Gaia database. The session is scoped to the instance, so multiple instances can initiate different sessions.

  • username: a string specifying your GAIA username credential

  • password: a string specifying your GAIA password credential

GCAinstance.FetchQueryAsync()

def FetchQueryAsync(self, query, **kwargs):

  job = Gaia.launch_job_async(query, **kwargs)

  self.datatable = job.get_results()

The GCAinstance.FetchQueryAsync(query, **kwargs) function accepts an ADQL query and fetches the matching Gaia data. It writes this data to GCAinstance.datatable.

  • query: a string containing the ADQL query to run

  • kwargs: all keyword arguments that astroquery's Gaia.launch_job_async also accepts
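Since the query is a plain string, it can be assembled programmatically before it is handed to FetchQueryAsync. The helper below is a hypothetical sketch (`build_region_query` is not part of GaiaClusterFit) that builds a box selection in Galactic coordinates similar to the query in Basic Usage:

```python
def build_region_query(l_range, b_range, parallax_range, top=1000):
    """Hypothetical helper: build an ADQL query for a box in Galactic coordinates.

    Each range is a (min, max) pair; l and b are in degrees, parallax in mas.
    """
    for low, high in (l_range, b_range, parallax_range):
        if not low < high:
            raise ValueError(f"invalid range: ({low}, {high})")
    return (
        f"SELECT TOP {int(top)} source_id, b, l, parallax, pmra, pmdec\n"
        "FROM gaiadr3.gaia_source\n"
        f"WHERE l > {l_range[0]} AND l < {l_range[1]}\n"
        f"AND b > {b_range[0]} AND b < {b_range[1]}\n"
        f"AND parallax > {parallax_range[0]} AND parallax < {parallax_range[1]}"
    )

query = build_region_query((240, 275), (-15, 5), (1.8, 4), top=500)
print(query)
```

The selected columns cover the default clustering dimensions (b, l, parallax, pmdec, pmra) plus source_id, which is needed to match stars against the known-cluster table.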

GCAinstance.RenameCol()

def RenameCol(self, table, newnames):

    for i in newnames:

      table.rename_column(i[0],i[1])

The RenameCol function renames columns of an astropy.table object. GaiaClusterFit requires the column names of the region data and the Gaia data to match, so it is standard practice to rename the GCAinstance.regiondata columns to the Gaia column names, e.g. GCAinstance.RenameCol(GCAinstance.regiondata, [["Source","source_id"],["Pop","population"]]). The default column name for labeled cluster data in GCAinstance.datatable is "population".

  • table: astropy.table table object

  • newnames: 2D python list as such [["old column name 1","new column name 1"],["old column name 2","new column name 2"]]

GCAinstance.Plot()

def Plot(self, xaxis = "b", yaxis = "l", **kwargs):

    plt.title(f"{self.regionname}")

    plt.scatter(self.datatable[xaxis],self.datatable[yaxis], **kwargs)

    plt.ylabel(yaxis)

    plt.xlabel(xaxis)

    plt.xlim(max(self.datatable[xaxis]),min(self.datatable[xaxis])) #x-axis runs from max to min

    plt.show()

GCAinstance.Plot() plots GCAinstance.datatable as a scatter plot using matplotlib.pyplot. The x and y dimensions of the plot are set with xaxis = "GAIA parameter" and yaxis = "GAIA parameter", where the GAIA parameter is the name of any column in GCAinstance.datatable. **kwargs are passed on to matplotlib.pyplot.scatter().

  • xaxis: column name of column in GCAinstance.datatable to display on the x-axis

  • yaxis: column name of column in GCAinstance.datatable to display on the y-axis

  • kwargs: general keyword arguments accepted by matplotlib.pyplot.scatter()

GCAinstance.PlotCluster()

  def PlotCluster(self, xaxis="b", yaxis ="l", clusterer="HDBSCAN", remove_outliers =False , **kwargs): #modified plot function with outlier filtration and Cluster selection

    try:

      fig, ax = plt.subplots(figsize=(10,10))



      plotdata = (self.datatable[xaxis], self.datatable[yaxis])

      labels = self.datatable[clusterer]



      if remove_outliers == True : 

        plotdata = self.datatable[xaxis][self.datatable[f"{remove_outliers}_outlier"]],self.datatable[yaxis][self.datatable[f"{remove_outliers}_outlier"]]

        labels = self.datatable[clusterer][self.datatable[f"{remove_outliers}_outlier"]]

      ax.set_title(f"{clusterer} clusters in \n {self.regionname} \n Outliers removed = {remove_outliers} ")

      ax.scatter(*plotdata, c=labels, **kwargs)

      ax.set_ylabel(yaxis)

      ax.set_xlabel(xaxis)

      plt.show()

      return fig,ax

    except:

      if clusterer not in self.datatable.columns:

        print(f"Error: You did not perform the {clusterer} clustering yet. No {clusterer} column found in self.datatable")

      return fig,ax

GCAinstance.PlotCluster() plots GCAinstance.datatable as a scatter plot colored by cluster label. This requires GCAinstance.datatable to have been clustered beforehand with GCAinstance.cluster(). The x and y dimensions of the plot are set with xaxis = "GAIA parameter" and yaxis = "GAIA parameter", where the GAIA parameter is the name of any column in GCAinstance.datatable. **kwargs are passed on to the matplotlib scatter call.

  • xaxis: column name of column in GCAinstance.datatable to display on the x-axis

  • yaxis: column name of column in GCAinstance.datatable to display on the y-axis

  • clusterer: cluster function name of which to display latest formed clusters

GCAinstance.cluster()

  def cluster(self, clusterer = HDBSCAN, dimensions = ["b","l","parallax","pmdec","pmra"],**kwargs):

        print(f"Running {clusterer.__name__} on {self.regionname} over {dimensions}\n") #clusterer is still a class here, so use __name__

        dataselection = [self.datatable[param] for param in dimensions] #N dimensional HDBscan

        data =StandardScaler().fit_transform(np.array(dataselection).T)

        clusterer = clusterer(**kwargs)

        clusterer.fit(data)

        clusterer.fit_predict(data) #in case of artificial or unknown stars we can use fit_predict to predict the cluster they would belong to

        labels = clusterer.labels_ #list of all stars in which a number encodes to what cluster it is assigned

        self.datatable[f"{clusterer.__class__.__name__}"] = labels #append all labels to the designated "clustername "self.datatable table

        self.clusterer = clusterer  

        return clusterer 

cluster(self, clusterer = HDBSCAN, dimensions = ["b","l","parallax","pmdec","pmra"],**kwargs) clusters the GCAinstance.datatable data with the specified clustering algorithm. The function returns the fitted clusterer instance. The resulting cluster labels are written to GCAinstance.datatable["cluster algorithm name"].

  • dimensions = ["GCAinstance.datatable column names"] determines which columns of GCAinstance.datatable are used to cluster the data

  • clusterer = cluster_algorithm passes the clustering class that is used to cluster the data. It is instantiated with **kwargs, and its fit method should only need the to-be-clustered data, e.g. clusterer = GCA.HDBSCAN, clusterer = GCA.OPTICS, clusterer = sklearn.cluster.DBSCAN, etc.

  • **kwargs accepts keywords arguments that are passed on to the cluster algorithms(HDBSCAN,DBSCAN etc)
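Before clustering, the code above passes the selected columns through StandardScaler, so that coordinates (degrees), parallax (mas) and proper motions (mas/yr) contribute on comparable scales. The following pure-Python sketch shows what that scaling step does; `standardize_columns` is a hypothetical name used only for illustration, and the library itself relies on scikit-learn for this step.

```python
from statistics import fmean, pstdev

def standardize_columns(columns):
    """Scale each column to zero mean and unit variance (population std).

    `columns` is a list of equal-length lists, one per clustering dimension.
    Returns rows (one per star), mirroring the transpose in cluster().
    """
    scaled = []
    for values in columns:
        mean = fmean(values)
        std = pstdev(values) or 1.0  # constant columns are only centred
        scaled.append([(v - mean) / std for v in values])
    return [list(row) for row in zip(*scaled)]

# Two dimensions, three stars: one varying column and one constant column.
rows = standardize_columns([[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]])
print(rows)
```

Without this step, a dimension with a large numeric range (such as Galactic longitude over tens of degrees) would dominate the distances that a density-based clusterer works with.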

GCAinstance.optimize_grid()

def optimize_grid(self, dimensions= ["b","l","parallax","pmdec","pmra"], clusterer=HDBSCAN, fit_params=None, scoring_function=scoringfunction, **kwargs):     

      dataselection = [self.datatable[param] for param in dimensions] #N dimensional HDBscan

        

      data = StandardScaler().fit_transform(np.array(dataselection).T)

      scores= []

      param_values = []

      point_variable_names = [i["variable"]for i in fit_params]

      point_variable_list = [list(range(i["min"], i["max"])) for i in fit_params]

      combination = [p for p in itertools.product(*point_variable_list)]

      combination = [dict(zip(point_variable_names, i)) for i in combination]

      for i in tqdm(combination):

        cluster = clusterer(**i, **kwargs)

        cluster.fit(data)

        cluster.fit_predict(data) #in case of artificial or unknown stars we can use fit_predict to predict the cluster they would belong to

        labels = cluster.labels_

        self.datatable["population"] = labels

        scores.append(scoring_function(self.datatable, self.regiondata))

        param_values.append(i)

      max_score_index, max_score = np.argmax(scores) , np.max(scores)

      return param_values[max_score_index]

GCAinstance.optimize_grid(self, dimensions= ["b","l","parallax","pmdec","pmra"], clusterer=HDBSCAN, fit_params=None, scoring_function=scoringfunction, **kwargs) fits the cluster function clusterer over a given set of parameter intervals fit_params to optimize a scoring_function. This scoring function compares the predicted clusters to the true clusters. The highest score marks the best fit (according to the scoring_function).

The function returns a dictionary with the parameter values that achieved the highest score

  • dimensions : the dimensions/datacolumns of GCAinstance.datatable we will cluster over

  • clusterer : the clustering class that is used to cluster the data. It is instantiated with each parameter combination plus **kwargs, and its fit method should only need the to-be-clustered data, e.g. clusterer = GCA.HDBSCAN, clusterer = GCA.OPTICS, clusterer = sklearn.cluster.DBSCAN, etc.

  • fit_params: a Python list containing dicts formatted as follows [{"variable" :"cluster argument", "min":10, "max":20},{"variable" :"cluster argument", "min":5, "max":40}]. Every integer from min up to, but not including, max is tried.

  • scoring_function: a function that takes GCAinstance.datatable and GCAinstance.regiondata and returns a score. A set of ready-to-use scoring functions is included in GaiaClusterFit.evalmetric.
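To make the grid search concrete, the sketch below reproduces the two bookkeeping steps of optimize_grid in isolation: expanding fit_params into one keyword dictionary per grid point, and picking the grid point with the highest score. The helper names `expand_grid` and `best_params` are hypothetical, and `min_samples` is used only as an example of a second HDBSCAN argument.

```python
import itertools

def expand_grid(fit_params):
    """Expand min/max intervals into one keyword dict per grid point."""
    names = [p["variable"] for p in fit_params]
    ranges = [range(p["min"], p["max"]) for p in fit_params]  # max is exclusive
    return [dict(zip(names, values)) for values in itertools.product(*ranges)]

def best_params(grid, scores):
    """Return the grid point with the highest score (first one on ties)."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return grid[best_index]

fit_params = [
    {"variable": "min_cluster_size", "min": 10, "max": 13},
    {"variable": "min_samples", "min": 1, "max": 3},
]
grid = expand_grid(fit_params)
print(len(grid))  # 3 * 2 = 6 combinations
print(grid[0])    # {'min_cluster_size': 10, 'min_samples': 1}
```

The number of combinations is the product of the interval lengths, and each combination triggers one full clustering run, so wide intervals over several parameters get expensive quickly.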

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT

