
Llama 3 vs GPT-4: Meta Challenges OpenAI on AI Turf

Meta recently released its Llama 3 model in two sizes, 8B and 70B parameters, and open-sourced the models for the AI community. Despite its smaller size, the Llama 3 70B model has shown impressive capability, as evident from the LMSYS leaderboard. So we have compared Llama 3 with the flagship GPT-4 model to evaluate their performance across a variety of tests. On that note, let's go through our comparison between Llama 3 and GPT-4.

1. Magic Elevator Take a look at

Let's first run the magic elevator test to evaluate the logical reasoning capability of Llama 3 in comparison to GPT-4. And guess what? Llama 3 surprisingly passes the test while the GPT-4 model fails to give the right answer. That is quite surprising since Llama 3 has only 70 billion parameters, while GPT-4 is rumored to have around 1.7 trillion parameters.

Keep in mind, we ran the test on the GPT-4 model hosted on ChatGPT (available to paid ChatGPT Plus users), which appears to be using the older GPT-4 Turbo model. We ran the same test on the recently released GPT-4 model (gpt-4-turbo-2024-04-09) via the OpenAI Playground, and it passed the test. OpenAI says it is rolling out the latest model to ChatGPT, but perhaps it is not available on our account yet.

There is a tall building with a magic elevator in it. When stopping on an even floor, this elevator connects to floor 1 instead.
Starting on floor 1, I take the magic elevator 3 floors up. Exiting the elevator, I then use the stairs to go 3 floors up again.
Which floor do I end up on?
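
The puzzle's rule is easy to sanity-check in code. Here is a minimal Python sketch; the function name and the reading that "connects to floor 1" means you exit on floor 1 are my own assumptions, not part of the prompt:

```python
def magic_elevator_stop(start, floors_up):
    # The elevator travels up, but if it would stop on an even floor,
    # it connects to floor 1 instead, so you exit there.
    stop = start + floors_up
    return 1 if stop % 2 == 0 else stop

floor = magic_elevator_stop(1, 3)  # 1 + 3 = 4, an even floor, so you exit on floor 1
floor += 3                         # the stairs have no magic: 1 + 3 = 4
print(floor)  # 4
```

So the expected answer is floor 4, which is the trap most models fall into by ignoring the even-floor rule.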

Winner: Llama 3 70B and gpt-4-turbo-2024-04-09

Note: GPT-4 loses on ChatGPT Plus

2. Calculate Drying Time

Next, we ran the classic drying-time question to test the common-sense reasoning of both models. In this test, both Llama 3 70B and GPT-4 gave the correct answer (the towels dry in parallel under the Sun, so it still takes 1 hour) without delving into arithmetic. Good job, Meta!

If it takes 1 hour to dry 15 towels under the Sun, how long will it take to dry 20 towels?

Winner: Llama 3 70B and GPT-4 via ChatGPT Plus


3. Find the Apples

After that, I asked another question to compare the reasoning capability of Llama 3 and GPT-4. In this test, the Llama 3 70B model comes close to the right answer but fails to mention the box. Meanwhile, the GPT-4 model rightly answers that "the apples are still on the ground inside the box". I'm giving this round to GPT-4.

There is a basket without a bottom in a box, which is on the ground. I put three apples into the basket and move the basket onto a table. Where are the apples?

Winner: GPT-4 via ChatGPT Plus


4. Which is Heavier?

While the question seems quite simple, many AI models fail to get the right answer. In this test, however, both Llama 3 70B and GPT-4 gave the correct answer: a kilo of feathers is heavier, since a kilogram weighs more than a pound. That said, Llama 3 sometimes generates the wrong output, so keep that in mind.

What's heavier, a kilo of feathers or a pound of steel?
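
For reference, the arithmetic behind the trick is trivial: a kilogram is roughly 2.2 pounds, so the feathers win despite the material. A quick sketch (the constant is the standard kilograms-per-pound conversion; the variable names are mine):

```python
KG_PER_LB = 0.45359237  # exact definition of the avoirdupois pound in kilograms

feathers_kg = 1.0            # a kilo of feathers
steel_kg = 1.0 * KG_PER_LB   # a pound of steel, converted to kilograms

print("feathers" if feathers_kg > steel_kg else "steel")  # prints "feathers"
```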

Winner: Llama 3 70B and GPT-4 via ChatGPT Plus


5. Find the Position

Next, I asked a simple logical question, and both models gave a correct response. It's fascinating to see a much smaller Llama 3 70B model rivaling the top-tier GPT-4 model.

I am in a race and I am overtaken by the second person. What is my new position?

Winner: Llama 3 70B and GPT-4 via ChatGPT Plus


6. Solve a Math Problem

Next, we ran a complex math problem on both Llama 3 and GPT-4 to find out which model wins this test. Here, GPT-4 passes with flying colors, but Llama 3 fails to come up with the right answer. It's not surprising, though: the GPT-4 model has scored well on the MATH benchmark. Keep in mind that I explicitly asked ChatGPT not to use Code Interpreter for mathematical calculations.

Determine the sum of the y-coordinates of the four points of intersection of y = x^4 - 5x^2 - x + 4 and y = x^2 - 3x.
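
For anyone verifying the expected answer: setting the two curves equal gives the quartic x^4 - 6x^2 + 2x + 4 = 0, and by Vieta's formulas the sum of its roots is 0 while the sum of their squares is 0^2 - 2(-6) = 12, so the y-coordinates sum to 12 - 3(0) = 12. A quick numerical cross-check, sketched with NumPy's polynomial root finder:

```python
import numpy as np

# Equating the curves: x^4 - 5x^2 - x + 4 = x^2 - 3x
# rearranges to x^4 - 6x^2 + 2x + 4 = 0; its roots are the intersection x-values.
xs = np.roots([1, 0, -6, 2, 4])
ys = xs**2 - 3*xs  # evaluate y on the (simpler) parabola
print(round(sum(ys).real, 6))  # 12.0
```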

Winner: GPT-4 via ChatGPT Plus


7. Follow User Instructions

Following user instructions is crucial for an AI model, and Meta's Llama 3 70B model excels at it. It generated all 10 sentences ending with the word "mango". GPT-4 could only generate eight such sentences.

Generate 10 sentences that end with the word "mango"
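
If you want to score such outputs automatically rather than counting by hand, a simple checker works. This is a rough sketch (the sentence-splitting heuristic and function name are my own, not from the original test):

```python
import re

def count_sentences_ending_with(text, word):
    # Split on sentence-ending punctuation, then compare each
    # sentence's final word after stripping trailing punctuation.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    finals = [re.sub(r'\W+$', '', s).split()[-1].lower() for s in sentences]
    return finals.count(word.lower())

sample = "She handed me a ripe mango. I prefer apples. Nothing beats a chilled mango."
print(count_sentences_ending_with(sample, "mango"))  # 2
```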

Winner: Llama 3 70B


8. NIAH Test

Although Llama 3 currently doesn't have a long context window, we still ran the NIAH (needle in a haystack) test to check its retrieval capability. The Llama 3 70B model supports a context length of up to 8K tokens, so I placed a needle (a random statement) inside a 35K-character-long text (about 8K tokens) and asked the model to find that information. Surprisingly, Llama 3 70B found the text in no time. GPT-4 also had no trouble finding the needle.

Of course, this is a small context, but when Meta releases a Llama 3 model with a much larger context window, I'll test it again. For now, Llama 3 shows great retrieval capability.
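
For context, a NIAH-style prompt can be assembled in a few lines: hide a "needle" statement at a random depth inside filler text, then ask the model to retrieve it. A minimal sketch (all names, the filler, and the needle are illustrative, not the actual text we used):

```python
import random

def build_niah_prompt(filler_paragraphs, needle, question):
    # Insert the needle at a random position in the filler,
    # then append the retrieval question after the full context.
    docs = list(filler_paragraphs)
    docs.insert(random.randrange(len(docs) + 1), needle)
    return "\n\n".join(docs) + f"\n\nQuestion: {question}"

filler = [f"Filler paragraph {i} about nothing in particular." for i in range(200)]
needle = "The secret passphrase is 'blue-tangerine-42'."
prompt = build_niah_prompt(filler, needle, "What is the secret passphrase?")
print(needle in prompt)  # True
```

The model passes if its answer contains the needle's payload regardless of where the needle was buried.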

Winner: Llama 3 70B and GPT-4 via ChatGPT Plus

Llama 3 vs GPT-4: The Verdict

In almost all of the tests, the Llama 3 70B model has shown impressive capability, be it advanced reasoning, following user instructions, or retrieval. Only in mathematical calculations does it lag behind the GPT-4 model. Meta says Llama 3 has been trained on a larger coding dataset, so its coding performance should also be strong.

Keep in mind that we're comparing a much smaller model with GPT-4. Also, Llama 3 is a dense model, while GPT-4 is reportedly built on an MoE architecture consisting of 8x 222B models. It goes to show that Meta has done a remarkable job with the Llama 3 family of models. When the 400B+ Llama 3 model drops in the future, it should perform even better and may beat the best AI models out there.

It's safe to say that Llama 3 has upped the game, and by open-sourcing the model, Meta has significantly closed the gap between proprietary and open-source models. We ran all these tests on an Instruct model. Fine-tuned models built on Llama 3 70B should deliver exceptional performance. Apart from OpenAI, Anthropic, and Google, Meta has now officially joined the AI race.
