Place-Based Programming - Part 1 - Places

post by lsusr · 2021-04-14T22:18:15.184Z · LW · GW · 18 comments


When I write machine learning software, I tend to use the Place-Based Programming (PBP) paradigm. PBP caches your computations so you rarely have to perform the same computation twice.

The fundamental unit of data is a place, which refers to a location on disk. Consider the hard-coded string "I am Satoshi Nakamoto.". You can compute the place of a string by hashing it.

;; This code is written in Hy.
;; See https://docs.hylang.org/en/stable/ for documentation of the language.
(import [hashlib [md5]] os)
(setv +place-dir+ ".places/")

(defn place-of [expression]
  "Returns the place of an expression"
  (os.path.join
    +place-dir+
    "str/"
    (+ (.hexdigest (md5 (.encode (str expression))))
       ".pickle")))

;; prints ".places/str/17f36dc3403a328572adcea3fd631f55.pickle"
(print (place-of '"I am Satoshi Nakamoto."))

In Lisp, the ' (quote) prefix means "do not evaluate the following expression". Note how we did not compute the place of the string's value directly. Instead, we computed the place of the source code which defines the string. We can replace our function with a macro so the user does not have to quote his or her code.

(import [hashlib [md5]] os)
(setv +place-dir+ ".places/")

(defmacro place-of [expression]
  "Returns the place of an expression"
  `(os.path.join
    +place-dir+
    (str (type '~expression))
    (+ (.hexdigest (md5 (.encode (str '~expression))))
       ".pickle")))

;; prints ".places/<class 'hy.models.HyString'>/17f36dc3403a328572adcea3fd631f55.pickle"
(print (place-of "I am Satoshi Nakamoto."))

Whenever a function returns a place, it implicitly guarantees that the place is populated. The place-of macro is not allowed to just compute where a place would be if it existed. The macro must also save our data to the place if the place is not already populated.

;; pickle is needed to serialize values to disk.
(import pickle)

(defmacro/g! place-of [expression]
  "Returns the place of an expression"
  `(do
     (setv ~g!place
           (os.path.join
             +place-dir+
             (str (type '~expression))
             (+ (.hexdigest (md5 (.encode (str '~expression))))
                ".pickle")))
     ;; Populate the place if it is not populated yet
     ;; (assumes the containing directory has already been created).
     (if-not (os.path.exists ~g!place)
             (with [f (open ~g!place "wb")]
               (pickle.dump (eval '~expression) f)))
     ~g!place))

;; prints ".places/<class 'hy.models.HyString'>/17f36dc3403a328572adcea3fd631f55.pickle"
(print (place-of "I am Satoshi Nakamoto."))

Reading from a place is easier.

(defn value-of [place]
  (with [f (open place "rb")]
    (pickle.load f)))

;; prints  "I am Satoshi Nakamoto."
(print (value-of (place-of "I am Satoshi Nakamoto.")))

This constitutes a persistent memoization system where code is evaluated no more than once.

(import [time [sleep]])
(print (value-of (place-of (do (sleep 5) "This computation takes 5 seconds"))))

The first time you run the above code it takes 5 seconds to execute. On all subsequent runs it returns almost instantly, because the value is read from its place on disk instead of being recomputed.
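
Because every cached value is just a file under +place-dir+, invalidating the cache amounts to deleting files. Here is a minimal sketch (not part of the original code; clear-places is a hypothetical helper) of wiping the whole cache so everything is recomputed on the next run:

;; Hypothetical helper, not from the original post: deleting the place
;; directory forces every expression to be recomputed on the next run.
(import shutil os)

(defn clear-places []
  "Wipe the cache of places."
  (when (os.path.exists +place-dir+)
    (shutil.rmtree +place-dir+))
  (os.makedirs +place-dir+))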

18 comments

Comments sorted by top scores.

comment by skybrian · 2021-04-15T06:21:59.549Z · LW(p) · GW(p)

This method of caching assumes that an expression always evaluates to the same value. This is sometimes true in functional programming, but only if you're careful. For example, suppose the expression is a function call, and you change the function's definition and restart your program. When that happens, you need to delete the out-of-date entries from the cache or your program will read an out-of-date answer.

Also, since you're using the text of an expression for the cache key, you should only use expressions that don't refer to any local variables. For example, caching an expression that's within a function and refers to a function parameter will result in bugs when the function is called more than once with different parameters.

So this might be okay in simple cases when you are working alone and know what you're doing, but it likely would result in confusion when working on a team.

It's also essentially the same kind of caching that's commonly done by build systems. It's common for makefiles to be subtly broken so that incremental builds are unreliable and you need to do a "clean build" (with an empty cache) when it really matters that a build is correct. (The make command will compare file dates, but that's often not enough due to missing dependencies.)

But it still might be better to switch to a build system that's designed for this sort of thing, because then at least people will expect to need to do a clean build whenever the results seem to be wrong.

(Bazel is a build system that tries very hard to make sure that incremental builds are always correct and you never need to do a "clean build," but it's hard enough to use that I don't really recommend it.)

Replies from: SatvikBeri, lsusr
comment by SatvikBeri · 2021-04-15T16:32:11.284Z · LW(p) · GW(p)

This is sometimes true in functional programming, but only if you're careful.

I think this overstates the difficulty; referential transparency is the norm in functional programming, not something unusual.

For example, suppose the expression is a function call, and you change the function's definition and restart your program. When that happens, you need to delete the out-of-date entries from the cache or your program will read an out-of-date answer.

As I understand it, this system is mostly useful if you're using it for almost every function. In that case, your inputs are hashes which contain the source code of the function that generated them, and therefore your caches will be invalidated if an upstream function's source code changes.

Also, since you're using the text of an expression for the cache key, you should only use expressions that don't refer to any local variables.

Agreed.

So this might be okay in simple cases when you are working alone and know what you're doing, but it likely would result in confusion when working on a team.

I agree that it's essentially a framework, and you'd need buy-in from a team in order to consistently use it in a repository. But I've seen teams buy into heavier frameworks pretty regularly; this version seems unusual but not particularly hard to use/understand. It's worth noting that bad caching systems are pretty common in data science, so something like this is potentially a big improvement there.

Replies from: justinpombrio
comment by justinpombrio · 2021-04-16T02:15:42.293Z · LW(p) · GW(p)

I think this overstates the difficulty; referential transparency is the norm in functional programming, not something unusual.

It really depends on what domain you're working in. If you're memoizing functions, you're not allowed to use the following things (or rather, you can only use them in functions that are not transitively called by memoized functions):

  • Global mutable state (to no-one's surprise)
  • A database, which is global mutable state
  • IO, including reading user input, fetching something non-static from the web, or logging
  • Networking with another service that has state
  • Getting the current date

Ask a programmer to obey this list of restrictions, and -- depending on the domain they're working in -- they'll either say "ok" or "wait what that's most of what my code does".
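
As a concrete illustration (a sketch added here, not part of the original comment; it assumes the place-of and value-of definitions from the post), consider caching an expression that reads the clock:

;; Hypothetical example: an impure expression silently breaks the scheme.
(import [datetime [date]])

;; The first run caches today's date; every later run returns that stale date,
;; because the source text of the expression never changes.
(print (value-of (place-of (str (date.today)))))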

As I understand, this system is mostly useful if you’re using it for almost every function. In that case, your inputs are hashes which contain the source code of the function that generated them, and therefore your caches will invalidate if an upstream function’s source code changed.

That's very clever! I don't think it's sufficient, though.

For example, say you have this code:

(defnp add1 [x] (+ x 10)) ; oops typo
(defnp add2 [x] (add1 (add1 x)))
(add2 100)

You run it once and get this cache:

(add1 100) = 110
(add1 (add1 100)) = 120
(add2 100) = 120

You fix the first function:

(defnp add1 [x] (+ x 1)) ; fixed
(defnp add2 [x] (add1 (add1 x)))
(add2 100)

You run it again, which invokes (add2 100), which is found in the cache to be 120. The add2 cache entry is not invalidated because the add2 function has not changed, nor have its inputs. The add1 cache entries would be invalidated if anything ever invoked add1, but nothing does.

(This is what I meant by "You also have to look at the functions it calls (and the functions those call, etc.)" in my other comment.)

Replies from: SatvikBeri, justinpombrio
comment by SatvikBeri · 2021-04-16T02:58:39.456Z · LW(p) · GW(p)

Ah, that's a great example, thanks for spelling it out.

comment by justinpombrio · 2021-04-16T02:48:22.972Z · LW(p) · GW(p)

I think this is fixable. An invocation (f expr1 expr2) will produce the same result as the last time you invoked it if:

  • The body of f is the same as last time.
  • Every function it calls, including transitively, has the same source code as the last time you called f. Also every macro and type definition that is used transitively. Basically any code that it depends on in any way.
  • Every function involved is pure (no state, no IO).
  • Every function involved is top-level. I'm not sure this will play well with higher-order functions.
  • The invocations expr1 and expr2 also obey this checklist.

I'm not sure this list is exhaustive, but it should be do-able in principle. If I look at a function invocation and all the code it transitively depends on (say it's 50% of the codebase), and I know that that 50% of the codebase hasn't changed since last time you ran the program, and I see that that 50% of the codebase is pure, and I trust you that the other 50% of the codebase doesn't muck with it (as it very well could with e.g. macros), then that function invocation should produce the same result as last time.
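
One way to picture the transitive requirement (a rough sketch added here, not part of justinpombrio's comment; it assumes md5 is imported as in the post and that you can already collect the source of every transitive callee) is to fold the callees' source code into the cache key:

;; Hypothetical sketch: include the source of every transitive callee in the
;; hash, so editing add1 also moves the cache key of anything built on it.
(defn deep-hash [fn-source callee-sources]
  "Hash a function's source together with its transitive callees' sources."
  (.hexdigest (md5 (.encode (+ fn-source (.join "" (sorted callee-sources)))))))

Sorting the callee sources keeps the key independent of traversal order; reliably collecting them is exactly the part that likely needs language-level support.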

This is tricky enough that it might need language level support to be practical. I'm glad that lsusr is thinking of it as "writing a compiler".

comment by lsusr · 2021-04-16T00:29:32.281Z · LW(p) · GW(p)

Note to readers: skybrian's parent comment was written when this post was titled "Place-Based Programming", before I changed the title to "Place-Based Programming - Part 1".


You are correct that the code here in Part 1 breaks when you use variables with nonlocal scope. I begin to solve this problem in Part 2 [LW · GW].

It's also essentially the same kind of caching that's commonly done by build systems.

Yes. I often think about this project as "writing a compiler". Some of the techniques I use come from Makefiles.

comment by uncomputable · 2021-04-14T23:01:38.478Z · LW(p) · GW(p)

This is generally referred to as persistent memoization.

Maybe persistent futures in this case, since you've got these intermediate "place" values.

Replies from: lsusr
comment by lsusr · 2021-04-14T23:33:53.025Z · LW(p) · GW(p)

Thanks. I have changed "simple caching system" to "persistent memoization system".

comment by NicholasKross · 2021-04-14T23:19:20.496Z · LW(p) · GW(p)

My normie-paradigm mindset translates this to something like "the variable is a pointer to a result, and the result needs to be computed once (same as a source file may need to be computed once to use)". Is this accurate, or am I missing more of it?

Replies from: lsusr
comment by lsusr · 2021-04-14T23:24:14.143Z · LW(p) · GW(p)

This is accurate. Places are analogous to pointers.

comment by SatvikBeri · 2021-04-14T23:15:36.771Z · LW(p) · GW(p)

This is very cool. The focus on caching a code block instead of just the inputs to the function makes it significantly more stable, since your cache will be automatically invalidated if you change the code in any way.

Replies from: justinpombrio
comment by justinpombrio · 2021-04-15T14:37:46.368Z · LW(p) · GW(p)

More stable, but not significantly so.

You cannot tell what an expression does just by looking at the expression. You also have to look at the functions it calls (and the functions those call, etc.). If any of those change, then the expression may change as well.

You also need to look at local variables, as skybrian points out. For example, this function:

(defn myfunc [x] (value-of (place-of [EXPR INVOLVING x])))

will behave badly: the first time you call it, it will compute the answer for the value of x you give it. The second time you call it, it will return the same cached answer, regardless of what x you give it.

Replies from: lsusr
comment by lsusr · 2021-04-16T00:32:45.059Z · LW(p) · GW(p)

Note to readers: justinpombrio's parent comment was written when this post was titled "Place-Based Programming", before I changed the title to "Place-Based Programming - Part 1".


I think the disagreement here comes from me communicating different things to different people. I showed SatvikBeri a more complete system of which this post is just a tiny part. If all you see is this post (without the "Part 1") then justinpombrio's comment makes sense. If you see the entire project then SatvikBeri's comment makes sense.

justinpombrio's particular example can be solved with the defnp macro in Part 2 [? · GW].

(defnp myfunc [x] x)

(assert (= 3 (value-of (myfunc 3)))) ;; works
comment by purge · 2021-04-16T18:30:30.146Z · LW(p) · GW(p)

Not related to the main idea, but the point of os.path.join is to combine path elements using whichever delimiter the OS requires ("/" on Unix, "\" on Windows, etc., even though Windows in particular can also handle "/").  If you don't care about that portability, you might as well use normal string concatenation.  Or if you're using os.path.join, you might as well omit the "/" delimiters in your string literals to get extra portability.
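
For concreteness (a small sketch added here, not part of purge's comment), the three styles produce the same path on Unix, but only the delimiter-free one lets os.path.join pick the separator:

(import os)

(print (os.path.join ".places/" "str/" "x.pickle"))  ; ".places/str/x.pickle" (delimiters hard-coded)
(print (+ ".places/" "str/" "x.pickle"))             ; same result via plain concatenation
(print (os.path.join ".places" "str" "x.pickle"))    ; lets os.path.join choose the separator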

comment by Pattern · 2021-04-15T20:58:00.245Z · LW(p) · GW(p)

The place-of macro is not allowed to just compute where a place would be if it existed. The macro must also save our data to the place [if] the place is not already populated.

Replies from: lsusr
comment by lsusr · 2021-04-15T21:33:54.430Z · LW(p) · GW(p)

Fixed. Thanks.

comment by Alexei · 2021-04-14T23:33:59.905Z · LW(p) · GW(p)

Satvik mentioned you had a way to go from hash to the source code?

Replies from: lsusr, SatvikBeri
comment by lsusr · 2021-04-14T23:37:38.821Z · LW(p) · GW(p)

You can go from hash to source code by saving the source code in addition to the value. You can go from place to source code by treating source code as a value. Otherwise, hashing is a one-way function.
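
A minimal sketch of that idea (added here, not lsusr's code; record-source is a hypothetical helper and the ".source.hy" suffix is arbitrary) would write the expression's text next to its pickled value:

;; Hypothetical helper: store the source text beside the pickle so a hash
;; can later be traced back to the code that produced it.
(defn record-source [place expression-text]
  "Write an expression's source code next to its pickled value."
  (with [f (open (+ place ".source.hy") "w")]
    (.write f expression-text)))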