Language Reflections

Last month, I quietly announced the project that I had been working on: a collection of Python modules and tools for reading, writing and modifying Android packages and the files contained within them. The motivation for doing this was my unwillingness to write Java just to learn about Android programming. The aim of having a collection of tools to manipulate packages was to enable the creation of new packages containing code that was assembled or compiled from languages other than Java. The irony of this is that I've learned more about Java in the last few months while actively trying to avoid it than I have in the preceding ten years of simply ignoring it.

DUCK

The Dalvik Unpythonic Compiler Kit (DUCK) is the result of this experiment to see if it was practical to create “native” Android packages using something other than the Android SDK, and also to explore how small these packages can be. Originally, I had thought that all the work would go into handling the Dalvik bytecode, but it turns out that constructing the DEX files containing the bytecode and class information was where most of the effort was needed. Other elements of packages, such as resource manifests, also required more attention than I had expected.

It's been an eye-opening experience to dig into how Java code is deployed on Android. My impression of how Java was compiled and deployed was that it must surely be distilled down to low-level bytecode that was tightly linked to the resources it used, possibly involving some highly-optimised linking magic. The realisation that it actually involves a load of indexed strings in a file (plus references to these strings) is something of a disappointment.
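To give a rough idea of what “indexed strings plus references” means, here is a minimal sketch in Python. It is a simplified model and not the real DEX layout: the StringTable class and the (class, name, return type) tuple are hypothetical, and Python's default string ordering stands in for the ordering that DEX files require. The point is only that each unique string is stored once and everything else refers to it by number.

```python
class StringTable:
    """Hypothetical, simplified model of a table of unique, sorted strings."""

    def __init__(self, strings):
        # Store each string once; duplicates collapse into a single entry.
        self.strings = sorted(set(strings))
        self.index = {text: number for number, text in enumerate(self.strings)}

    def ref(self, text):
        """Return the index that stands in for the string elsewhere in the file."""
        return self.index[text]


# Names that a tiny class might mention, with repeats.
names = ["Ljava/lang/Object;", "<init>", "V", "LMain;", "<init>", "V"]
table = StringTable(names)

# A method is then described with indices rather than with the text itself.
method = (table.ref("LMain;"), table.ref("<init>"), table.ref("V"))
```

With these names the table holds four strings, and the method description is just three small integers.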

I haven't yet finished describing the process of putting packages together. At some point I'll cover the annoying restrictions on the order of various elements in DEX files, which make it harder than necessary to minimise the space they use, and the unpleasantness around variable-length branch instructions that makes assembling bytecode a more intensive process than it perhaps needs to be.
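The trouble with variable-length branches is that the size of a branch depends on how far away its target is, but the distance depends on the sizes of the branches in between. One common way to deal with this, and not necessarily what DUCK does, is to start with the shortest form of every branch and keep growing the ones that don't fit until nothing changes. The sketch below assumes a simplified model in which a goto takes 1, 2 or 3 code units depending on whether its offset fits in 8, 16 or 32 bits; the instruction tuples and the layout function are hypothetical, and special cases such as zero offsets are ignored.

```python
def fits(offset, bits):
    """Return True if offset can be stored as a signed value of the given width."""
    limit = 1 << (bits - 1)
    return -limit <= offset < limit


def layout(instructions):
    """Choose sizes for branches, given ("goto", target_index) and ("block", size).

    Sizes and addresses are measured in 16-bit code units. A "block" stands
    for any run of fixed-size instructions.
    """
    sizes = [1 if kind == "goto" else arg for kind, arg in instructions]

    while True:
        # Recompute the address of every instruction from the current sizes.
        addresses = [0]
        for size in sizes:
            addresses.append(addresses[-1] + size)

        changed = False
        for i, (kind, arg) in enumerate(instructions):
            if kind != "goto":
                continue
            offset = addresses[arg] - addresses[i]
            needed = 1 if fits(offset, 8) else 2 if fits(offset, 16) else 3
            if needed > sizes[i]:
                # Branches only ever grow, so the loop must terminate.
                sizes[i] = needed
                changed = True

        if not changed:
            return sizes, addresses[:-1]


program = [("goto", 3), ("goto", 4), ("block", 125), ("block", 200), ("block", 1)]
sizes, addresses = layout(program)
```

In this example the first branch fits in its short form on the first pass, but has to grow on the second pass because the branch after it grew, which is exactly the kind of knock-on effect that makes assembly more work than a single pass.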

At the moment I'm still surprised that this approach actually works, which is not really an ideal point of view to have about a piece of software you want to rely on. Having said that, it's been working pretty well for a few months now, so the obvious problems have already been fixed. Most problems I encounter tend to be caused by the compiler I've been building on top of this collection of components.

Serpentine

The first language I've experimented with is based on Python syntax because I'm fairly comfortable with writing Python code. Examples of the syntax can be found in the Making Friends with Robots and Juggling Registers articles. For very simple programs, the only noticeable difference is the use of method decorators to describe the parameter and return types of those methods. If you are familiar with language bindings for libraries written in statically-typed languages, such as PyQt for the Qt framework, you are probably not surprised by this kind of decoration.

I've been slowly building up support for some of the niceties of Python's syntax that I use, and ignoring things I don't. This isn't the Python language. The compiler accepts valid Python syntax, but it won't compile all valid Python. Conversely, some valid Python syntax that the Serpentine language relies on won't run correctly in CPython. For example, we need to allow methods to be overloaded, with different versions of the same named method accepting different types of parameters. The way this is done is to allow methods with the same name to be defined more than once. In Python, this means that only the last method will be present in the class definition unless fancy tricks are performed to keep the others around.
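To show what happens in CPython, the first class below defines the same method twice and ends up with only the second version. The second class uses one possible “fancy trick”: a decorator that files each version away under its parameter types. The accepts decorator and its registry are hypothetical names made up for this sketch. They are not the decorators Serpentine uses, and a compiler would pick the right version at compile time rather than looking it up at run-time.

```python
class Plain:
    def describe(self, value):
        return "text: " + value

    def describe(self, value):  # rebinds the name; the version above is lost
        return "number: %d" % value


_versions = {}  # hypothetical registry: qualified method name -> {types: function}


def accepts(*types):
    """Hypothetical decorator that records each version by its parameter types."""
    def decorate(func):
        table = _versions.setdefault(func.__qualname__, {})
        table[types] = func

        def dispatch(self, *args):
            # Exact type match only, which is enough for a sketch.
            return table[tuple(type(arg) for arg in args)](self, *args)

        return dispatch
    return decorate


class Overloaded:
    @accepts(str)
    def describe(self, value):
        return "text: " + value

    @accepts(int)
    def describe(self, value):
        return "number: %d" % value
```

Calling Plain().describe("a") raises a TypeError because only the integer version survives, while Overloaded().describe("a") and Overloaded().describe(3) each reach the version that matches.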

As I continue to add language features, I become less motivated to support some of the finer points that make some of them nice to use in Python. With exceptions, for example, it's nice to support the try and except keywords, but the semantics of finally would require more plumbing in the generated code than I want to consider in a “simple” compiler. Occasionally, code added to support several new features is refactored into common code that reduces the complexity those features initially required. However, the compiler will only get more complex and it may become difficult to figure out how to add support for new features.

Reflections

Other than a certain amount of assembly language, I've mostly written code in C++ and Python over the last few years. The process of writing in C++, which for me is a less expressive language than Python, has often made me frustrated because the APIs or core language features have required a lot more typing just for the purpose of holding the compiler's hand. I've rarely been writing code that was so complicated that the compiler couldn't have figured out everything about the types I was using, and the idioms used for common control flow were often more verbose and less powerful than the corresponding ones in Python.

On the other hand, when writing Python code, I've noticed that it's very easy to create abstractions or overuse certain data types, like dictionaries. Some of the control flow structures include features that are useful but confusing to beginners, like the else keyword with for loops, yet fail to address issues that I, at least, appear to encounter again and again. For example, it would be nice to know if the current iteration of a loop is the last one. Over and above how code is written, there's always the question of how efficiently it executes. Python is incredibly flexible and expressive, but that comes at a cost at run-time, when methods are dispatched to override built-in behaviour and objects are dynamically resolved.
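For anyone who hasn't met them, here are both of those loop features in ordinary Python. The first function uses else with a for loop, where the else block runs only if the loop finished without a break. The second is the sort of helper I end up writing to find out whether an iteration is the last one; with_last_flag is a hypothetical name for this sketch and is not part of Python or Serpentine.

```python
def find_first_negative(values):
    for value in values:
        if value < 0:
            break
    else:
        # Runs only when the loop finished without hitting break,
        # including when there was nothing to loop over.
        return None
    return value


def with_last_flag(iterable):
    """Hypothetical helper: yield (item, is_last) pairs by reading one item ahead."""
    iterator = iter(iterable)
    try:
        previous = next(iterator)
    except StopIteration:
        return  # nothing to iterate over
    for item in iterator:
        yield previous, False
        previous = item
    yield previous, True


def join_with_commas(words):
    text = ""
    for word, is_last in with_last_flag(words):
        text += word if is_last else word + ", "
    return text
```

The helper works, but having to read one item ahead just to learn where the loop ends is the kind of thing a language could do for me.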

Breaking with Python compatibility and conforming to the restrictions of static typing has been an interesting way to try and determine which parts of the Python language are central to my idea of a good programming experience, which parts are nice to have but are optional, and which parts are lacking. Some of the restrictions are annoying for me: I tend to rebind names to different types of objects, purely to avoid polluting a namespace with different names for everything, but this isn't something I allow in the compiler for Serpentine. In a way, this is a problem with starting with the syntax of an existing language: although there is the convenience and familiarity associated with that syntax, the mechanics of using the new language can be different and that means changing old habits.

Although I'm using the compiler to create packages for Android, it might be interesting to experiment with a different language without all the baggage that Serpentine carries with it from Python. Perhaps even one that is more in tune with the idioms that the Java and Android APIs revolve around. Whatever I try, it probably won't be an existing language, purely because it would face the same problems as Serpentine, but hopefully it would be simpler than Python while remaining expressive.

This document is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.