Auto-Evaluator

Get Started

Welcome to the auto-evaluator! This is an app to evaluate the performance of question-answering LLM chains. This demo has pre-loaded two things: (1) a document (the Lex Fridman podcast with Andrej Karpathy) and (2) a "test set" of question-answer pairs for this episode. The aim is to evaluate the performance of various question-answering LLM chain configurations against the test set. You can build any QA chain using the components and score its performance.

Choose the question-answering chain configuration (left) and launch an experiment using the button below. For more detail on each setting, see full the documentation here.

File Name	Size (MB)
karpathy-pod.txt	0.2

Test Dataset

Question	Answer
Why is the transformer architecture expressive in the forward pass?	The transformer architecture is expressive because it uses a general message passing scheme where nodes get to look at each other, decide what's interesting and then update each other.
What design criteria does the Transformer meet?	The transformer is very expressive in a forward pass, optimizable in the backward pass using the techniques that we have such as gradient descent, and it can run efficiently on our hardware such as GPUs.
Why is next word prediction an effective training objective?	On a sufficiently large dataset, the task of predicting the next word multi-tasks knowledge of a lot of things, including understanding of chemistry, physics, and human nature. You have to understand a lot about the world to make that prediction on an internet-scale dataset.
What was the World Of Bits project and why did it fail?	World Of Bits was an effort to give AI access to tools, such as a keyboard and mouse, in order to complete tasks, such as complete bookings. It failed because it turned out that reinforcement learning is an extremely inefficient way of training neural networks. You take many actions, but you only get a sparse reward once in a while. Starting from scratch, it is very unlikely to stumble on the correct action - such as a booking - by chance at random, so the reward signal is very sparse.
Why can additional sensors be a liability in an autonomous vehicle system?	Each sensor adds complexity to the system. The hardware must be sourced, versioned, and maintain firmware. Software must ingest it, track versions. The cost of this additional bloat or entropy must be weighted against the added benefit of that particular sensor.

Experiment Results

This table shows the each question-answer pair from the test set along with the model's answer to the question. The app will score two things: (1) the relevance of the retrieved documents relative to the question and (2) the similarity of the LLM generated answer relative to ground truth answer. The prompts for both can be seen here and can be chosen by the user in the drop-down list "Grading prompt style". The "Fast" prompt will only have the LLM grader output the score. The other prompts will also produce an explanation.

Question	Expected Answer	Observed Answer	Retrieval Relevancy Score	Answer Similarity Score	Latency (s)
Why is the transformer architecture expressive in the forward pass?	The transformer architecture is expressive because it uses a general message passing scheme where nodes get to look at each other, decide what's interesting and then update each other.	The transformer architecture is designed to be very expressive in the forward pass and is optimized for hardware with lots of parallelism. It is designed to perform a lot of operations serially and is very optimizable in the backward pass. The residual connections in the transformer support a kind of passing where nodes get to communicate and broadcast their keys and values.	GRADE: Correct JUSTIFICATION: All of the retrieved documents discuss the transformer architecture and its design features, specifically its ability to be expressive in the forward pass. Doc 3 specifically mentions the message passing scheme used by the transformer architecture, which directly answers the question. Therefore, the retrieved documents are relevant and support the answer.	GRADE: Correct JUSTIFICATION: The student answer accurately explains that the transformer architecture is designed to be expressive in the forward pass and uses a message passing scheme where nodes update each other. The additional information provided by the student does not conflict with the true answer.	9.030
What design criteria does the Transformer meet?	The transformer is very expressive in a forward pass, optimizable in the backward pass using the techniques that we have such as gradient descent, and it can run efficiently on our hardware such as GPUs.	The Transformer was designed to be both expressive in the forward pass and optimizable in the backward pass, with hardware efficiency in mind. The authors aimed to create a powerful architecture that could express general computation and support message passing. The Transformer has been remarkably stable since its release in 2016, with some minor reshuffling of layer normalizations in a pre-norm formulation.	GRADE: Correct JUSTIFICATION: The retrieved documents provide information on the design criteria of the Transformer, including its efficiency on hardware, expressiveness in the forward pass, and optimizability in the backward pass. These criteria directly answer the question and support the answer provided.	GRADE: Correct JUSTIFICATION: The student answer accurately describes the design criteria of the Transformer, including its expressiveness in the forward pass, optimizability in the backward pass, and hardware efficiency. The additional information provided does not conflict with the true answer.	9.819
Why is next word prediction an effective training objective?	On a sufficiently large dataset, the task of predicting the next word multi-tasks knowledge of a lot of things, including understanding of chemistry, physics, and human nature. You have to understand a lot about the world to make that prediction on an internet-scale dataset.	Next word prediction is an effective training objective because it is a simple objective that can be applied to language models, which have existed for a long time. By consuming a sequence of words and predicting the next word, neural nets can be trained on large datasets and prompted to solve problems. This objective requires understanding of the world and context, making it an effective way to train language models.	GRADE: Correct JUSTIFICATION: All of the retrieved documents discuss the task of predicting the next word in a sequence and how it relates to language modeling and neural networks. They also mention the importance of understanding various aspects of the world to effectively predict the next word. Therefore, they are relevant to the question and support the answer.	GRADE: Correct JUSTIFICATION: The student answer accurately explains that next word prediction is an effective training objective because it requires understanding of the world and context, making it an effective way to train language models. The answer also mentions that neural nets can be trained on large datasets and prompted to solve problems, which is also true.	9.989
What was the World Of Bits project and why did it fail?	World Of Bits was an effort to give AI access to tools, such as a keyboard and mouse, in order to complete tasks, such as complete bookings. It failed because it turned out that reinforcement learning is an extremely inefficient way of training neural networks. You take many actions, but you only get a sparse reward once in a while. Starting from scratch, it is very unlikely to stumble on the correct action - such as a booking - by chance at random, so the reward signal is very sparse.	The World of Bits project was an attempt to use reinforcement learning to teach neural networks how to complete tasks using a keyboard and mouse. The project failed because this approach was found to be inefficient and impractical for solving problems. The project also lacked a clear goal or purpose.	GRADE: Correct JUSTIFICATION: - Doc 1 mentions the World Of Bits project and the time period it was active. - Doc 2 discusses the inefficiency of reinforcement learning, which was the approach taken in the World Of Bits project. - Doc 3 is not directly related to the question, but it does not contradict the information in the other documents. - Doc 4 provides a clear explanation of what the World Of Bits project was and what it aimed to achieve.	GRADE: Correct JUSTIFICATION: The student accurately describes the World of Bits project as an attempt to use reinforcement learning to teach neural networks how to complete tasks using a keyboard and mouse. They also correctly state that the project failed due to the inefficiency and impracticality of this approach, as well as the lack of a clear goal or purpose.	10.109
Why can additional sensors be a liability in an autonomous vehicle system?	Each sensor adds complexity to the system. The hardware must be sourced, versioned, and maintain firmware. Software must ingest it, track versions. The cost of this additional bloat or entropy must be weighted against the added benefit of that particular sensor.	Additional sensors can be a liability in an autonomous vehicle system because they are not free and can add bloat to the data engine. They can also be a distraction and may change over time, requiring additional resources to maintain. It is important to focus resources on necessary and sufficient sensors and consider the full cost of adding a sensor.	GRADE: Correct JUSTIFICATION: All four documents discuss the potential drawbacks of adding additional sensors to an autonomous vehicle system, including increased complexity, cost, and potential distraction. The answer provided is supported by the information in the retrieved documents.	GRADE: Correct JUSTIFICATION: The student answer accurately explains that additional sensors can add complexity and cost to an autonomous vehicle system, and emphasizes the importance of considering the full cost and necessity of adding a sensor.	9.157

Summary

Experiment #	# of Eval Questions	Chunk Size	Overlap	Split Method	Retriever	Embedding Algorithm	Model	Grading Prompt Style	# of Chunks Retrieved	Avg Retrieval Relevancy Score	Avg Answer Similarity Score	Avg Latency (s)
1	5	2000	0	RecursiveTextSplitter	SVM	OpenAI	gpt-3.5-turbo	Descriptive	3	1	1	14.516
2	5	1500	50	RecursiveTextSplitter	TF-IDF	OpenAI	gpt-3.5-turbo	Descriptive	3	1	0.8	10.672
3	5	500	0	CharacterTextSplitter	similarity-search	OpenAI	gpt-3.5-turbo	Descriptive	3	1	1	9.621