Month in 4 Papers (April 2024)
This series of posts is designed to bring you the newest findings and developments in the NLP field. I’ll delve into four significant research papers each month, offering a comprehensive summary. Be sure to visit my blog regularly or subscribe to my newsletter for monthly updates. Let’s dive in!
Cosine-Similarity
📝 Is Cosine-Similarity of Embeddings Really About Similarity? [paper]
Cosine similarity is the go-to metric for retrieving similar documents in RAG (Retrieval-Augmented Generation) pipelines. This study runs a series of experiments to evaluate how reliable the metric actually is. The authors show that, in linear embedding models, cosine similarity can yield arbitrary results depending on the normalization or regularization applied during training. They caution against using it blindly, since the similarities it produces are inconsistent across setups.
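To make the finding concrete, here is a small numpy sketch (my own toy construction, not the paper's code) of the rescaling freedom behind it: in a linear factorization model, absorbing a diagonal rescaling into one factor and its inverse into the other leaves the model's predictions untouched, yet changes the cosine similarities between embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy factors from a hypothetical linear model: 4 "users", 5 "items", 3 dims.
A = rng.normal(size=(4, 3))
B = rng.normal(size=(5, 3))

# Any invertible diagonal rescaling D can be absorbed into one factor
# (and D^{-1} into the other) without changing the predictions A @ B.T.
D = np.diag([0.1, 1.0, 10.0])
A2, B2 = A @ D, B @ np.linalg.inv(D)

assert np.allclose(A @ B.T, A2 @ B2.T)  # predictions are identical

def cosine(M):
    """Pairwise cosine similarities between the rows of M."""
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M @ M.T

# ...yet the item-item cosine similarities differ between the two solutions.
print(np.round(cosine(B), 2))
print(np.round(cosine(B2), 2))
```

Since the training objective can't tell the two solutions apart, nothing pins down which set of cosine similarities is the "right" one; that is the arbitrariness the paper highlights.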
Unfortunately, the authors don’t propose an alternative. In practice, we can fall back on the recommendations of the model providers, since they control the training objective and the normalization. OpenAI, for example, recommends cosine similarity for its embedding models.
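As a practical aside, here's a minimal sketch (mine, not OpenAI's sample code) assuming the official `openai` Python client and an API key in the environment. OpenAI's embeddings come normalized to unit length, so cosine similarity reduces to a plain dot product:

```python
import numpy as np
from openai import OpenAI  # assumes the `openai` package and OPENAI_API_KEY are set up

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["What is RAG?", "Retrieval-Augmented Generation explained"],
)
a, b = (np.array(d.embedding) for d in resp.data)

# Unit-norm vectors: cosine similarity == dot product, no division by norms needed.
print(float(a @ b))
```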
Synthetic Instruction Tuning
📝 Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models [paper]