
Month in 4 Papers (April 2024)

Ala Falaki, PhD
4 min read · May 4, 2024


This series of posts is designed to bring you the newest findings and developments in the NLP field. I’ll delve into four significant research papers each month, offering a comprehensive summary. Be sure to visit my blog regularly or subscribe to my newsletter for monthly updates. Let’s dive in!

Cosine-Similarity

📝 Is Cosine-Similarity of Embeddings Really About Similarity? [paper]

Cosine similarity is the go-to metric for retrieving similar documents in RAG (Retrieval-Augmented Generation) pipelines. This study runs a series of experiments to evaluate how reliable the metric actually is. The findings show that, in linear models, cosine similarity can yield arbitrary results depending on the normalization and regularization applied during training. The authors therefore caution against using it blindly, since the results can be inconsistent across settings.

Unfortunately, the authors don't offer an alternative. In practice, we can follow the recommendations of the model providers, since they control the training objective and the normalization: OpenAI, for example, recommends cosine similarity for its embedding models.
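
To make the concern concrete, here is a minimal sketch (mine, not from the paper) showing that cosine similarity is not invariant to how the embedding dimensions are scaled, which is exactly the kind of freedom that different regularization or normalization choices leave open in linear models. The vectors and scaling factors below are made up for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up query/document embeddings, purely for illustration.
query = np.array([0.8, 0.3, 0.5])
doc = np.array([0.9, 0.1, 0.4])
print(cosine_similarity(query, doc))  # similarity in the original space, ~0.97

# The same pair after an arbitrary per-dimension rescaling, the kind of
# degree of freedom that regularization choices can leave open in linear models.
scale = np.array([1.0, 5.0, 0.2])
print(cosine_similarity(query * scale, doc * scale))  # ~0.84, a different answer
```

The two calls rank the same pair of texts differently even though nothing about the underlying documents changed, which is why the paper argues the metric depends on how the embeddings were trained.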

Synthetic Instruction Tuning

📝 Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models [paper]

This paper introduces GLAN, an approach for creating an instruction-tuning dataset that takes a pre-established taxonomy of human knowledge and capabilities as its only input. Unlike methods that build the dataset from existing datasets or a handful of few-shot examples, GLAN works from a pre-curated syllabus covering a wide range of subjects. First, GPT-4 is used to compile a list of all fields of human knowledge and capabilities, which is then refined by human evaluators.

Next, this list is used to generate subjects for each field, along with a course syllabus for each subject, drawing inspiration from how the educational system is structured. Finally, the…
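
To tie the steps summarized above together, here is a rough sketch of what such a pipeline could look like. It is my own interpretation rather than the authors' implementation, and `llm` is a hypothetical stand-in for any chat-completion call (e.g., GPT-4).

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call (e.g., GPT-4)."""
    raise NotImplementedError("plug in your preferred client here")

def glan_style_pipeline(fields: list[str]) -> list[dict]:
    """Rough outline of the taxonomy -> subjects -> syllabus steps described above."""
    records = []
    for field in fields:  # fields come from the human-verified taxonomy
        # Break each field of human knowledge into concrete subjects.
        subjects = llm(f"List the main subjects taught in {field}, one per line.").splitlines()
        for subject in subjects:
            # Expand every subject into a course syllabus, mirroring the
            # structure of an educational curriculum.
            syllabus = llm(f"Write a detailed course syllabus for '{subject}'.")
            records.append({"field": field, "subject": subject, "syllabus": syllabus})
    return records
```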
