What kind of bug would make machine learning suddenly 40% worse at NetHack?

5000 points is pretty sad, probably 2000 of it is finding an elven dagger and naming it "Sting". And the rest is a lot of rats and goblins.

Looking at what happens on the full moon, I still don't get why the bot scored lower. Maybe a higher chance of being bitten by a wererat and dropping all your stuff? Or maybe it's because throwing tripe at attacking dogs works less often to tame them (vs. 100% normally).

edit: aha, I think this is it - attacked werecreatures are much more likely to summon help on full moons. Poor bot probably got overrun.
 
Upvote
80 (82 / -2)

mmorales

Ars Praetorian
418
Subscriptor
Oh this is great.

In the interest of sharing pain, ~25 years ago as a graduate student I coded up an unsupervised neural network. Now this was a time of lots of processor and OS diversity, I was a weird Mac person (OS X beta!) and coded it up on my PowerPC. On my Mac it learned fine, but on the department servers I had very hit or miss luck. It would never crash, just on some machines it would learn and some it wouldn't, depending on some combination of compiler settings and chips. I eventually had the following pattern:

  • PowerPC would always learn
  • Intel x86 would never learn
  • Sun Sparc would never learn
  • SGI IRIX MIPS would learn, but only at optimization -O2. Lower or higher optimization it wouldn't learn.

It took my advisor 10 minutes to figure out what was going on (I thought he was psychic). Down in a key part of the code I had e^a times e^b, and it turned out 'a' became large when 'b' became small, and the learning was in the last few bits of precision in the double.

PowerPC carried double precision to 67 bits internally, so always learned. The other chips all worked at 64 bits, so never learned. But at -O2 optimization the IRIX compiler refactored my math from e^a times e^b to e^(a+b) for speed reasons (exponentials are very slow). This optimization also gave it more precision in the answer and it learned. Lower optimization didn't make this change, and higher -O3 optimization dropped the exponential to floating point precision so it wouldn't learn.

I simply changed the source code to e^(a+b) and it learned everywhere all the time.
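
For anyone who wants to see the failure mode in miniature, here's a rough Python sketch with made-up numbers, pushed all the way to overflow so it's visible; in my actual code the damage was only in the last few bits of a double, but the cure was the same:

```python
import math

# Made-up values: 'a' is large and positive, 'b' large and negative,
# so a + b is tame but each exponential alone is out of range.
a, b = 800.0, -799.0

# e^a overflows a 64-bit double (max ~1.8e308) and e^b underflows to 0,
# so the product is garbage even though the true answer is just e^1.
try:
    naive = math.exp(a) * math.exp(b)
except OverflowError:          # CPython's math.exp raises on overflow
    naive = float("inf")

stable = math.exp(a + b)       # the rewrite the IRIX compiler did at -O2

print(naive)    # inf (via the overflow path), nowhere near e
print(stable)   # 2.718281828459045
```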
 
Upvote
210 (211 / -1)

Studbolt

Ars Scholae Palatinae
921
Ask a model to get the best score, and it will farm the heck out of low-level monsters because it never gets bored.

Farm the heck out of upper-level monsters. In Nethack, the monsters get tougher as the player descends. You can hang out in upper levels and farm monsters for as long as your patience will last before you dive into the depths of Hell.
 
Upvote
10 (13 / -3)
Kevinpurdy
My shallow NetHack playing shines through. I mean "level" as in skill/number, but can see how it's confusing. Fixed!
Upvote
10 (13 / -3)

Tofystedeth

Ars Praefectus
5,598
Subscriptor++
5000 points is pretty sad, probably 2000 of it is finding an elven dagger and naming it "Sting". And the rest is a lot of rats and goblins.

Looking at what happens on the full moon, I still don't get why the bot scored lower. Maybe a higher chance of being bitten by a wererat and dropping all your stuff? Or maybe it's because throwing tripe at attacking dogs works less often to tame them (vs. 100% normally).

edit: aha, I think this is it - attacked werecreatures are much more likely to summon help on full moons. Poor bot probably got overrun.
FTA I think the score was based on their metrics, not game score. And it wasn't necessarily playing worse at the game, just worse at whatever playstyle it had optimized for.
 
Upvote
31 (31 / 0)

TimeWinder

Ars Scholae Palatinae
1,633
Subscriptor
Farm the heck out of upper-level monsters. In Nethack, the monsters get tougher as the player descends. You can hang out in upper levels and farm monsters for as long as your patience will last before you dive into the depths of Hell.
If we're going to go down (up?) the rabbit hole of the many, often contradictory ways "level" is used in RPGs, we're going to be here until the next full moon.
 
Upvote
51 (51 / 0)
FTA I think the score was based on their metrics, not game score. And it wasn't necessarily playing worse at the game, just worse at whatever playstyle it had optimized for.
Yeah, I missed "by their own metrics". Still, higher numbers of early deaths from being overrun by summoned creatures would cause a lower score by any measure.
 
Upvote
6 (6 / 0)

bonob

Wise, Aged Ars Veteran
198
The article said:
[...] and the only thing you keep from game to game is your skill and knowledge.
And bones files!

Well, it's a very funny idea to unleash an AI agent against Nethack in any case..

The article said:
"even it can only solve sokoban and reach mines end,"
Well, at least I can do as well as the algorithm on my best runs; I feel that's somewhat reassuring ><


Edit: to make it less obscure, a bones file is a level from a previous game – the level where the previous player died – that's used as one of the current game levels. You typically find the player's corpse (your previous corpse if you only play local games), all its inventory, and likely all the monsters that killed you back then still roaming on the level. Most of your corpse inventory is cursed and very difficult to use. This is a rare occurrence and may happen only once in a game, if I'm not mistaken.

And beating Sokoban and the Mines is pretty hard (at least for non-seasoned players such as me), and I feel crazy powerful when that happens, even though I know this is just the early phase of the game.. My guess is you've reached 25% down the levels at that point, which is not much, and that's not considering the trip back up ><
 
Upvote
14 (14 / 0)

zaghahzag

Ars Scholae Palatinae
709
Subscriptor++
5000 points is pretty sad, probably 2000 of it is finding an elven dagger and naming it "Sting". And the rest is a lot of rats and goblins.

Looking at what happens on the full moon, I still don't get why the bot scored lower. Maybe a higher chance of being bitten by a wererat and dropping all your stuff? Or maybe it's because throwing tripe at attacking dogs works less often to tame them (vs. 100% normally).

edit: aha, I think this is it - attacked werecreatures are much more likely to summon help on full moons. Poor bot probably got overrun.
You made my day.
 
Upvote
3 (6 / -3)

plarstic

Smack-Fu Master, in training
89
Subscriptor
Farm the heck out of upper-level monsters. In Nethack, the monsters get tougher as the player descends. You can hang out in upper levels and farm monsters for as long as your patience will last before you dive into the depths of Hell.
Because the difficulty of the game scared me, I inevitably tried to stay at the safer upper levels for as long as possible, and it was always a lack of damn food that forced me to descend, as monsters didn't seem to respawn quickly enough to farm. That was a long, long time ago, mind you, and I barely knew the mechanics of the game, despite scouring Usenet for clues and outright walkthroughs.
 
Upvote
11 (11 / 0)

markgo

Ars Tribunus Militum
2,833
Subscriptor++
Assorted thoughts:

  1. Scary that I went “oh, of course”. Spent entirely too much time playing Nethack on University systems in the 80s.
  2. As to speculations on cause of lowered scores, I’d tend towards the worsened were attacks. It was pretty common knowledge among regular players that “you feel lucky” was a bad sign. I doubt it was shop farming with your pet—that’s a fairly complex behavior; if it can’t ascend, I doubt it can farm reliably.
  3. Amazing that Nethack still lives on. Rogue may have gotten the noun, but Nethack apparently will never die.
 
Upvote
29 (29 / 0)

neutronium

Smack-Fu Master, in training
88
Subscriptor++
I never got very far in Nethack as a kid because I'd just reroll my starting character continuously until I got a ring of polymorph. 1st turn, equip ring. All subsequent turns, chaos.

I should revisit the game with a more reasonable approach.
Polymorph + RoPC fun times, until you Genocide all "L" and forget you're a Master Lich (or Vampire Lord)...
 
Upvote
9 (9 / 0)

adespoton

Ars Tribunus Angusticlavius
9,095
Oh this is great.

In the interest of sharing pain, ~25 years ago as a graduate student I coded up an unsupervised neural network. Now this was a time of lots of processor and OS diversity, I was a weird Mac person (OS X beta!) and coded it up on my PowerPC. On my Mac it learned fine, but on the department servers I had very hit or miss luck. It would never crash, just on some machines it would learn and some it wouldn't, depending on some combination of compiler settings and chips. I eventually had the following pattern:

  • PowerPC would always learn
  • Intel x86 would never learn
  • Sun Sparc would never learn
  • SGI IRIX MIPS would learn, but only at optimization -O2. Lower or higher optimization it wouldn't learn.

It took my advisor 10 minutes to figure out what was going on (I thought he was psychic). Down in a key part of the code I had e^a times e^b, and it turned out 'a' became large when 'b' became small, and the learning was in the last few bits of precision in the double.

PowerPC carried double precision to 67 bits internally, so always learned. The other chips all worked at 64 bits, so never learned. But at -O2 optimization the IRIX compiler refactored my math from e^a times e^b to e^(a+b) for speed reasons (exponentials are very slow). This optimization also gave it more precision in the answer and it learned. Lower optimization didn't make this change, and higher -O3 optimization dropped the exponential to floating point precision so it wouldn't learn.

I simply changed the source code to e^(a+b) and it learned everywhere all the time.
If you'd done it in Lisp, you could have avoided the whole issue :D

PowerLisp on a PowerPC was my go-to back then, despite Cyc using Java.
 
Upvote
5 (5 / 0)

johnsonwax

Ars Legatus Legionis
14,629
Oh this is great.

In the interest of sharing pain, ~25 years ago as a graduate student I coded up an unsupervised neural network. Now this was a time of lots of processor and OS diversity, I was a weird Mac person (OS X beta!) and coded it up on my PowerPC. On my Mac it learned fine, but on the department servers I had very hit or miss luck. It would never crash, just on some machines it would learn and some it wouldn't, depending on some combination of compiler settings and chips. I eventually had the following pattern:

  • PowerPC would always learn
  • Intel x86 would never learn
  • Sun Sparc would never learn
  • SGI IRIX MIPS would learn, but only at optimization -O2. Lower or higher optimization it wouldn't learn.

It took my advisor 10 minutes to figure out what was going on (I thought he was psychic). Down in a key part of the code I had e^a times e^b, and it turned out 'a' became large when 'b' became small, and the learning was in the last few bits of precision in the double.

PowerPC carried double precision to 67 bits internally, so always learned. The other chips all worked at 64 bits, so never learned. But at -O2 optimization the IRIX compiler refactored my math from e^a times e^b to e^(a+b) for speed reasons (exponentials are very slow). This optimization also gave it more precision in the answer and it learned. Lower optimization didn't make this change, and higher -O3 optimization dropped the exponential to floating point precision so it wouldn't learn.

I simply changed the source code to e^(a+b) and it learned everywhere all the time.
Should have gone to college a decade earlier. Instructors loved putting the e^a times e^b trick where a and b are large positive/negative numbers in math/physics problems to blow up your calculator and figure out which students knew their basic math. Difference of two squares was another common trick. Learned to spot and simplify those a mile away.
 
Upvote
24 (24 / 0)


Wickwick

Ars Legatus Legionis
36,157
Oh this is great.

In the interest of sharing pain, ~25 years ago as a graduate student I coded up an unsupervised neural network. Now this was a time of lots of processor and OS diversity, I was a weird Mac person (OS X beta!) and coded it up on my PowerPC. On my Mac it learned fine, but on the department servers I had very hit or miss luck. It would never crash, just on some machines it would learn and some it wouldn't, depending on some combination of compiler settings and chips. I eventually had the following pattern:

  • PowerPC would always learn
  • Intel x86 would never learn
  • Sun Sparc would never learn
  • SGI IRIX MIPS would learn, but only at optimization -O2. Lower or higher optimization it wouldn't learn.

It took my advisor 10 minutes to figure out what was going on (I thought he was psychic). Down in a key part of the code I had e^a times e^b, and it turned out 'a' became large when 'b' became small, and the learning was in the last few bits of precision in the double.

PowerPC carried double precision to 67 bits internally, so always learned. The other chips all worked at 64 bits, so never learned. But at -O2 optimization the IRIX compiler refactored my math from e^a times e^b to e^(a+b) for speed reasons (exponentials are very slow). This optimization also gave it more precision in the answer and it learned. Lower optimization didn't make this change, and higher -O3 optimization dropped the exponential to floating point precision so it wouldn't learn.

I simply changed the source code to e^(a+b) and it learned everywhere all the time.
I have a bug story based on what workstations I was on. I had a CFD class in college in the mid-'90s. For the final assignment, I did most of the programming on the cluster of Sun Sparc stations but was finishing the project on some SGIs. The code would compile and run just fine on the Sparcs, but not on the SGIs. I simply had to take the '-O' (optimize) flag out of the gcc command in my makefile and then it would run. What undergrad in mechanical engineering would expect the compiler to be the source of an error and not something they wrote?
 
Upvote
8 (9 / -1)
And bones files!

Well, it's a very funny idea to unleash an AI agent against Nethack in any case..


Well, at least I can do as well as the algorithm on my best runs; I feel that's somewhat reassuring ><


Edit: to make it less obscure, a bones file is a level from a previous game – the level where the previous player died – that's used as one of the current game levels. You typically find the player's corpse (your previous corpse if you only play local games), all its inventory, and likely all the monsters that killed you back then still roaming on the level. Most of your corpse inventory is cursed and very difficult to use. This is a rare occurrence and may happen only once in a game, if I'm not mistaken.

And beating Sokoban and the Mines is pretty hard (at least for non-seasoned players such as me), and I feel crazy powerful when that happens, even though I know this is just the early phase of the game.. My guess is you've reached 25% down the levels at that point, which is not much, and that's not considering the trip back up ><
Sokoban isn't even close to 25% of the game. If I remember right, it's at level 10 at its deepest, has several off-track levels of its own (I want to say 4). The game goes down to level 53 on the main branch, and up to level ... -5, I think? It's been a minute since I've played a game.

And that doesn't include class quest levels and some of Vlad's tower. It's a big game. I think the fastest I ever completed a speed ascension was in the 14-hour range.
 
Upvote
3 (3 / 0)

scrimbul

Ars Tribunus Militum
1,808
Oh, reminds me to revisit the original ADOM. Date-changing was part of the optimised scumming strategy...
The Improved ADOM Guidebook still works, but some of the more egregious scumming for wishes was patched out.

The ADOM sequel appears to be dead, reportedly from an unfocused design, but that's only hearsay; I haven't played it myself.
 
Upvote
2 (2 / 0)

clewis

Ars Scholae Palatinae
940
Subscriptor++
I have a bug story based on what workstations I was on. I had a CFD class in college in the mid-'90s. For the final assignment, I did most of the programming on the cluster of Sun Sparc stations but was finishing the project on some SGIs. The code would compile and run just fine on the Sparcs, but not on the SGIs. I simply had to take the '-O' (optimize) flag out of the gcc command in my makefile and then it would run. What undergrad in mechanical engineering would expect the compiler to be the source of an error and not something they wrote?
Back then, -O, -O2, and -O3 were always suspect, especially -O3. IIRC, the man page warned against using them.

When I was using gcc, most of my development was without any optimization, and I'd only add it when I had good tests. I think it wasn't until the late 2000s that I finally started trusting -O2.
 
Upvote
10 (10 / 0)

Wickwick

Ars Legatus Legionis
36,157
Back then, -O, -O2, and -O3 were always suspect, especially -O3. IIRC, the man page warned against using them.

When I was using gcc, most of my development was without any optimization, and I'd only add it when I had good tests. I think it wasn't until the late 2000s that I finally started trusting -O2.
I was just using a copy of the makefile provided to us by our professor. It wasn't a programming class. It was a numerical programming class.
 
Upvote
5 (5 / 0)

CardinalJester

Smack-Fu Master, in training
34
This is a common problem we've seen with machine learning optimization at work. The flaw is in the metrics the ML systems are trained against. With simplistic, and typically fixed, metrics for success, ML agents get caught optimizing for the one thing the scoring system values, i.e., their issue with not being able to meet high-level goals. They need to implement a dynamic scoring metric, likely with a diminishing function for repeated events, to prevent these cul-de-sacs. For example, the first bat you kill is 1 pt, the tenth is 0.5 pt, the 20th is 0.001, etc.

We use a system of 'gamification' for training where each goal has a declining value for repetition, common in board game design, where the first to accomplish something gets, say, 8 points, the second 5 points, the third 1 point, and everyone else zero.
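
As a rough sketch of the idea (hypothetical names and decay rate, not from any particular framework), the shaping can be as simple as a per-event counter with a geometric falloff:

```python
from collections import defaultdict

# Hypothetical per-event counters; in a real trainer this would live in the
# reward-shaping layer, not in the agent itself.
event_counts = defaultdict(int)

def shaped_reward(event: str, base: float = 1.0, decay: float = 0.8) -> float:
    """Reward that shrinks geometrically each time the same event repeats."""
    n = event_counts[event]
    event_counts[event] += 1
    return base * decay ** n

# First bat kill is worth 1.0, the tenth ~0.13, the twentieth ~0.01,
# so grinding the same low-level monster stops paying off quickly.
rewards = [shaped_reward("kill:bat") for _ in range(20)]
print(round(rewards[0], 3), round(rewards[9], 3), round(rewards[19], 3))
# -> 1.0 0.134 0.014
```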
 
Upvote
10 (11 / -1)

DeeplyUnconcerned

Ars Praetorian
561
Subscriptor++
This is a common problem we've seen with machine learning optimization at work. The flaw is in the metrics the ML systems are trained against. With simplistic, and typically fixed, metrics for success, ML agents get caught optimizing for the one thing the scoring system values, i.e., their issue with not being able to meet high-level goals. They need to implement a dynamic scoring metric, likely with a diminishing function for repeated events, to prevent these cul-de-sacs. For example, the first bat you kill is 1 pt, the tenth is 0.5 pt, the 20th is 0.001, etc.

We use a system of 'gamification' for training where each goal has a declining value for repetition, common in board game design, where the first to accomplish something gets, say, 8 points, the second 5 points, the third 1 point, and everyone else zero.
I’d argue that they need to make their metric simpler, not more complex. Diminishing returns are good for systems design, but for a scoring/ranking system you want to get as close as possible to the thing you actually care about, and as simple as you can manage to minimise the impact of overfitting/gaming the scoring.

In this case, it needs a really hard think about what “good at nethack” means; if your simple definition statement of that doesn’t include number of monsters killed, then making monsters killed part of your scoring system is a mistake no matter how you finesse the math.
 
Upvote
16 (16 / 0)

Carewolf

Ars Tribunus Angusticlavius
9,197
Subscriptor
Oh this is great.

In the interest of sharing pain, ~25 years ago as a graduate student I coded up an unsupervised neural network. Now this was a time of lots of processor and OS diversity, I was a weird Mac person (OS X beta!) and coded it up on my PowerPC. On my Mac it learned fine, but on the department servers I had very hit or miss luck. It would never crash, just on some machines it would learn and some it wouldn't, depending on some combination of compiler settings and chips. I eventually had the following pattern:

  • PowerPC would always learn
  • Intel x86 would never learn
  • Sun Sparc would never learn
  • SGI IRIX MIPS would learn, but only at optimization -O2. Lower or higher optimization it wouldn't learn.

It took my advisor 10 minutes to figure out what was going on (I thought he was psychic). Down in a key part of the code I had e^a times e^b, and it turned out 'a' became large when 'b' became small, and the learning was in the last few bits of precision in the double.

PowerPC carried double precision to 67 bits internally, so always learned. The other chips all worked at 64 bits, so never learned. But at -O2 optimization the IRIX compiler refactored my math from e^a times e^b to e^(a+b) for speed reasons (exponentials are very slow). This optimization also gave it more precision in the answer and it learned. Lower optimization didn't make this change, and higher -O3 optimization dropped the exponential to floating point precision so it wouldn't learn.

I simply changed the source code to e^(a+b) and it learned everywhere all the time.
x86 wasn't using x87 80-bit precision? 25 years ago that would always be the default for doubles..

That was the usual story back then: people wrote code on x86, and then it only worked on x86 and broke on other platforms, and broke again when x86 switched to using SSE2 ;)
 
Upvote
8 (8 / 0)