Implementing a Transformer from scratch in PyTorch - a write-up on my experience

post by Mislav Jurić (mislav-juric) · 2023-04-25T20:51:22.049Z

Contents

  Introduction
  My goals
  Knowledge I had prior to this project
  Rules I had
  Total time spent on this project
  Useful resources if you’re implementing the Transformer from scratch
  Notes on the things which were unclear to me or tripped me up
  Conclusion
  Appendix: Time log

Introduction

As discussed in posts such as this one, a good way to test your skills as a machine learning research engineer is to implement a Transformer from scratch in PyTorch. This is exactly what I did, and below I share my experience of doing so.

The code I wrote can be found in this GitHub repository, although I don't recommend looking at it if you are going to attempt this project yourself.
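To give a sense of the scale of the task, here is a minimal sketch (my own illustration, not code from the repository above) of the scaled dot-product attention at the core of the original paper, assuming batch-first tensors:

    import math
    import torch

    def scaled_dot_product_attention(q, k, v, mask=None):
        # q, k, v: (batch, seq_len, d_k); mask (optional) broadcasts to
        # (batch, seq_len, seq_len), with 0 marking positions to hide.
        d_k = q.size(-1)
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

Everything else in the paper (multi-head attention, positional encodings, the encoder and decoder stacks) is layered on top of this one operation.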

My goals

All of my goals were achieved. I picked these goals because they ensured that my model was implemented correctly: if I can train the model and then use it, I can be convinced that my implementation works. I could have set other goals as well, but they would have been out of scope for what I had in mind – I wanted to test whether I could implement concepts from scientific papers, and that was it. Everything else can be considered future work.
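For what it’s worth, one cheap check along these lines (a sketch of my own suggestion, not something the post prescribes) is to verify that the model can overfit a single fixed batch before attempting a full training run; model, src, and tgt below are hypothetical placeholders for your Transformer and one batch of source/target token ids:

    import torch

    def overfit_one_batch(model, src, tgt, steps=200, lr=3e-4):
        # A correct implementation should drive the loss close to zero
        # when shown the same batch over and over.
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(steps):
            opt.zero_grad()
            logits = model(src, tgt[:, :-1])  # teacher forcing
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           tgt[:, 1:].reshape(-1))
            loss.backward()
            opt.step()
        return loss.item()  # should be near 0 if the model can memorize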

Knowledge I had prior to this project

Prior to this project, I had fairly limited experience with natural language processing (NLP). One of my larger projects was fine-tuning GPT-2 to generate movie scripts, but that was done using Hugging Face. I did have 2-3 years of machine learning engineering (or related) experience, but almost exclusively in the field of computer vision.

I’ll do my best to enumerate what I knew at the point of starting the project, insofar as it is relevant to the project:

Rules I had

Based on the advice from this post, I set some rules for how this project was to be conducted. They were as follows:

  1. I was allowed to read only the original paper and to start writing code from there. If I got stuck, I was allowed to go to step 2.
  2. I was allowed to consult blog posts and/or articles explaining the things I didn’t understand. If any of them happened to contain code, I wouldn’t look at it. If I got stuck here as well, I was allowed to go to step 3.
  3. I was allowed to ask questions about specific things I didn’t understand or about a particular bug in my code, after I had tried to resolve the misunderstanding or the bug on my own for a reasonable amount of time. Here I want to extend a big thank you to Louis Bradshaw, who helped me clear up some conceptual misunderstandings and also advised me on how best to hunt down the bugs in my code, as well as on solving some specific ones I found.

     Finally, if none of this worked, I was allowed to go to step 4.

  4. I was allowed to look at existing Transformer implementations, and even to copy/paste certain parts of their code. I never got to this step.

Total time spent on this project

During the project, I kept a time log. I want to note that when I worked on this project, I generally worked for 30 minutes, took a 10-minute break, worked for another 30 minutes, took another 10-minute break, and so on. So if I say I worked for 40 minutes, 30 of those minutes were spent at the computer actually working, while 10 were spent walking around the room resting. This work/rest split is what I found optimal for myself, and I note it here so you can keep it in mind when reading the numbers below.

I spent around 3550 minutes on this project in total, which translates to roughly 59 hours. Skimming my time log, I’d say around 10 to 15 hours went into clarifying my conceptual understanding and around 5 hours into writing dataset-related code; the rest was spent writing, rewriting, and debugging the model code.

If you want to look at the full time log, it can be found in the Appendix.

Useful resources if you’re implementing the Transformer from scratch

Below I list the resources I found most useful during this project. You can find more by Googling whenever something in particular is unclear; here I’m only listing the ones that really helped me. Although the list is unordered, I tried to put the resources I used at the beginning of the project near the top and the ones I used later on near the bottom.

Notes on the things which were unclear to me or tripped me up

Below I list some of the things that were unclear to me or that tripped me up. This isn’t an exhaustive list; I had many small and large “aha” moments, but these are the ones I consider most prominent. I recommend not reading through these if you want to implement the Transformer on your own as well; do come back to them if you get stuck, as I don’t think they contain any “major spoilers”. In the list, I will use Python-style indexing to explain certain things; the short sketch below shows the kind of indexing I mean:
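As a quick illustration of that indexing style (my own example, assuming batch-first tensors of token ids), the standard teacher-forcing shift in sequence-to-sequence training is usually written with exactly this kind of slicing:

    import torch

    tgt = torch.tensor([[1, 5, 9, 4, 2]])  # e.g. <bos> w1 w2 w3 <eos>
    decoder_input = tgt[:, :-1]  # [[1, 5, 9, 4]] -> <bos> w1 w2 w3
    labels = tgt[:, 1:]          # [[5, 9, 4, 2]] ->  w1 w2 w3 <eos>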

Conclusion

This project was, as far as I can recall, the most intellectually challenging project I have done so far. It was a project where I had to understand things first, implement them, and then repeat the process whenever I discovered that something didn’t work as intended. This differs from almost all the other projects I have worked on, in the sense that a lot of my mental bandwidth went into understanding the concepts; when programming something other than a scientific paper, I usually understand (almost) all of the concepts already.

This project also opened my mind in the sense that I now know that I can implement and, more importantly, understand almost everything (if not everything), given enough time and patience. And if I have that, who knows where the limits of my abilities lie? I hope to find out over the course of my career.

Appendix: Time log

Here is the time log in its “raw” format, if anyone is curious:
