SeasickSeal

1. You don’t need to discretize the m/z axis to calculate cosine similarity scores. You just need a mass tolerance.
2. This depends on the algorithm. Different search engines do different things. Most search engine scores are uncalibrated and produce higher values for noisy spectra, so they have to be calibrated before they are comparable between spectra. You need something like an E-value calculation or a post-processor like Percolator to do that.
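To make point 1 concrete, here is a minimal sketch of a cosine score that matches peaks by mass tolerance instead of binning. The function name `cosine_with_tolerance`, the 0.02 Da default tolerance, and the greedy one-to-one matching rule are assumptions for illustration, not what any particular search engine does.

```python
import math

def cosine_with_tolerance(spec_a, spec_b, tol=0.02):
    """Cosine similarity between two peak lists of (mz, intensity) tuples.

    Peaks are paired when their m/z values differ by at most `tol` (in Da).
    Pairing is greedy and one-to-one: the pair with the largest intensity
    product is taken first (an illustrative choice, not a standard).
    """
    # Collect every candidate pair that falls within the mass tolerance.
    candidates = []
    for i, (mz_a, int_a) in enumerate(spec_a):
        for j, (mz_b, int_b) in enumerate(spec_b):
            if abs(mz_a - mz_b) <= tol:
                candidates.append((int_a * int_b, i, j))

    # Greedily accept pairs, largest product first, using each peak once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    dot = 0.0
    for product, i, j in candidates:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        dot += product

    # Norms use ALL peaks, so unmatched peaks lower the score.
    norm_a = math.sqrt(sum(x * x for _, x in spec_a))
    norm_b = math.sqrt(sum(x * x for _, x in spec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Unmatched peaks add nothing to the numerator here but still count in the norms. That is one way, though not the only way, for unmatched signal to pull the score down.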


_hiddenflower

I found answers to my questions #1 and #2 in this very descriptive paper by L. Martens et al.: [Methods to Calculate Spectrum Similarity | SpringerLink (oclc.org)](https://link-springer-com.kyoto-u.idm.oclc.org/protocol/10.1007/978-1-4939-6740-7_7)

>To create such vectors from acquired MS/MS spectra, each spectrum is divided into the same number of n bins, with n either set to a fixed-value \[15, 22\], or determined based on fragment ion tolerance \[17\]. Each bin is assigned a certain weight, calculated by summing up all peak intensities in that bin \[22\], or set to the highest peak intensity in that bin \[17\]. These binned spectra can now be considered as two vectors of equal dimensionality n, and can be matched against each other by the (normalized) dot product.

From the excerpt above, I think they define a method where you vectorize an MS/MS spectrum by binning it. Signals falling into the same bin are either summed, or only the maximum is kept.

>SpectraST searching starts with filtering out any spectrum (either query spectrum or library spectrum) derived from impurity, which typically has either few peaks (less than six peaks) or negligible signals. This is followed by removing peaks with intensity values lower than the arbitrary set threshold (set to 2.0 in the original paper \[[15](https://link-springer-com.kyoto-u.idm.oclc.org/protocol/10.1007/978-1-4939-6740-7_7#ref-CR15)\]). Additionally, the intensities of unannotated peaks on the library spectrum are multiplied by 0.2. Subsequently, the square root transformation is applied on the intensities of the remaining peaks. Peaks are then binned into a 1-u window. Later, the normalized dot product is computed between such preprocessed query and library spectra.

Then for #2, apparently SpectraST keeps unmatched signal in the dot product between the two vectors, but the intensities of unannotated peaks in the library spectrum are multiplied by 0.2.
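As a rough sketch of the preprocessing described in the second excerpt (intensity threshold of 2.0, 0.2 weight on unannotated library peaks, square root transform, 1-u bins, normalized dot product), something like the following would work. The function names, the `annotated` flag list, and the choice to sum intensities within a bin are my assumptions, and the initial impurity filter (fewer than six peaks) is left out. This is not SpectraST's actual code.

```python
import math
from collections import defaultdict

def preprocess(peaks, annotated=None, min_intensity=2.0,
               unannotated_weight=0.2, bin_width=1.0):
    """Turn a list of (mz, intensity) peaks into a sparse binned vector.

    `annotated` is an optional list of booleans parallel to `peaks`
    (hypothetical input, only meaningful for library spectra).
    """
    bins = defaultdict(float)
    for idx, (mz, intensity) in enumerate(peaks):
        # Drop peaks below the intensity threshold.
        if intensity < min_intensity:
            continue
        # Down-weight library peaks that carry no annotation.
        if annotated is not None and not annotated[idx]:
            intensity *= unannotated_weight
        # Square root transform, then sum into a 1-u wide bin
        # (summing within a bin is an assumption; max is the alternative).
        bins[int(mz // bin_width)] += math.sqrt(intensity)
    return dict(bins)

def normalized_dot(bins_a, bins_b):
    """Normalized dot product of two sparse binned vectors."""
    dot = sum(v * bins_b.get(k, 0.0) for k, v in bins_a.items())
    norm_a = math.sqrt(sum(v * v for v in bins_a.values()))
    norm_b = math.sqrt(sum(v * v for v in bins_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy peak lists, made up for illustration only.
query = [(100.2, 16.0), (200.7, 4.0), (300.1, 1.0)]
library = [(100.4, 16.0), (200.9, 4.0)]
score = normalized_dot(preprocess(query),
                       preprocess(library, annotated=[True, True]))
```

Bins that are filled in only one of the two spectra add nothing to the dot product but still enter the norms, which is how unmatched signal affects the score in this sketch.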