Posts Tagged ‘Java’
Java, please stop ruining my fun.
I don’t like Java. I haven’t learned Java well because I don’t enjoy using it. I don’t enjoy using it because it’s verbose, for one, but mostly because it’s constantly making things hard for me to do. I know there are ways to do what I want, after all, millions of people use Java successfully every day, but I don’t know what they are. Furthermore, finding out what they are is excruciatingly painful.
I recently did a series of articles on a project I was doing to learn Clojure. It kind of petered out for a number of reasons, but one constant annoyance in learning Clojure was dealing with the Java-isms. Java has given Clojure a vast library of high quality software essentially for free, but it’s also brought on a lot of the pain, much of which I think needs to be fixed before Clojure can have the nice feel of my favorite dynamic languages.
Installing Clojure
The first thing one has to do is install Clojure. It’s not a package in Ubuntu yet, but it’s young, so that’s ok cause we’re veterans and don’t need no stinking packages. To compile, we just download the source and type “ant”.
And that’s it. There’s no install process that makes a nice pretty “clojure” command that takes us to the REPL or executes scripts that are passed to it. To run clojure, you need to run it using Java:
$ java -cp clojure.jar clojure.lang.Repl
That is a lot to type just to get a REPL, and getting a usable command line is even harder. After installing JLine ConsoleRunner, you need to get the library into your classpath (a rant on which is upcoming) and then run
$ java -cp jline-0.9.91.jar:clojure.jar jline.ConsoleRunner clojure.lang.Repl
Not exactly intuitive, but whatever. We put it in a bash script, put it in our path, and head off to the races. After a while, we have a few lines of a quality script we would like to save and run. How do we do that?
Obviously, it’s:
$ java -cp clojure.jar clojure.lang.Script my-script.clj
This assumes that clojure.jar is in the same directory as the script you want to run. If you don’t have clojure.jar there, you must provide a specific path to the jar file. There is no idea of a default directory where Java will look for jar files. You must provide every single jar file to Java at runtime.
Contrast this with the Python install process:
$ sudo apt-get install python
$ python
... Have fun in the interpreter
... Write a script
$ python my_script.py
Simple.
The Classpath
First of all, I’m no expert on the classpath, but it seems like an unholy abomination thrust upon us by invisible powers that must be extinguished at all costs. It would appear, and again, I am no expert, but it would appear that every single dependency of a program must be explicitly passed to Java at the time you run your program. I wrote a bash script to automate the process, but viewing the command line for running my simple Compojure-based webapp is appalling:
java -Djava.library.path=/usr/local/lib -cp :/mnt/data/Users/justin/bin/compojure/compojure.jar:/mnt/data/Users/justin/bin/compojure/deps/clojure-contrib.jar:/mnt/data/Users/justin/bin/compojure/deps/clojure.jar:/mnt/data/Users/justin/bin/compojure/deps/fact.jar:/mnt/data/Users/justin/bin/compojure/deps/jetty-6.1.14.jar:/mnt/data/Users/justin/bin/compojure/deps/jetty-util-6.1.14.jar:/mnt/data/Users/justin/bin/compojure/deps/re-rand.jar:/mnt/data/Users/justin/bin/compojure/deps/servlet-api-2.5-6.1.14.jar:/mnt/data/Users/justin/lib/clj-http-client.jar:/mnt/data/Users/justin/lib/clojure-contrib.jar:/mnt/data/Users/justin/lib/clojure.jar:/mnt/data/Users/justin/lib/commons-codec-1.3.jar:/mnt/data/Users/justin/lib/commons-httpclient-3.1.jar:/mnt/data/Users/justin/lib/commons-io-1.4-javadoc.jar:/mnt/data/Users/justin/lib/commons-io-1.4-sources.jar:/mnt/data/Users/justin/lib/commons-io-1.4.jar:/mnt/data/Users/justin/lib/commons-logging-1.1.1-javadoc.jar:/mnt/data/Users/justin/lib/commons-logging-1.1.1-sources.jar:/mnt/data/Users/justin/lib/commons-logging-1.1.1.jar:/mnt/data/Users/justin/lib/commons-logging-adapters-1.1.1.jar:/mnt/data/Users/justin/lib/commons-logging-api-1.1.1.jar:/mnt/data/Users/justin/lib/commons-logging-tests.jar:/mnt/data/Users/justin/lib/compojure.jar:/mnt/data/Users/justin/lib/jline-0.9.94.jar:/mnt/data/Users/justin/lib/tokyo-cabinet-clj.jar:/mnt/data/Users/justin/lib/tokyo-cabinet.jar:/mnt/data/Users/justin/lib/tokyocabinet.jar:/mnt/data/Users/justin/lib/tokyotyrant-0.6.jar clojure.lang.Script index.clj
That is bad. That is not correct, that is not how software should be designed, I object. Every other language I can think of off the top of my head (except JavaScript) has some structured way of finding its dependencies, and most have a way of adding additional rules to that search should the defaults not be adequate. While this can lead to “DLL hell”, I do not see how the Java situation is any better when everybody just ends up with scripts to automate the process and then those scripts pick up the wrong things and you can’t figure out why.
The classpath makes me very upset. If Clojure can find a way to mask it, I would appreciate it very much.
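For what it's worth, the kind of wrapper script mentioned above can be sketched in a few lines. This is a minimal sketch in Python rather than bash, and not the script I actually used; the function names build_classpath and java_command are made up for illustration, and it assumes every jar lives directly inside one of the directories you hand it.

```python
import glob
import os


def build_classpath(jar_dirs):
    """Join every .jar found in the given directories into one classpath string."""
    jars = []
    for directory in jar_dirs:
        # sorted() keeps the order stable from run to run
        jars.extend(sorted(glob.glob(os.path.join(directory, "*.jar"))))
    # os.pathsep is ":" on Linux and ";" on Windows, which is what java expects
    return os.pathsep.join(jars)


def java_command(jar_dirs, main_class, args):
    """Build the argument list for running main_class with all jars on the classpath."""
    return ["java", "-cp", build_classpath(jar_dirs), main_class] + list(args)
```

You would hand the result to something like subprocess.call(java_command(["/path/to/lib"], "clojure.lang.Script", ["index.clj"])), where the directory is a placeholder. Note that it blindly includes everything it finds, including any -sources and -javadoc jars, which is exactly how these scripts end up picking up the wrong things.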
Maven
First of all, what the hell is Maven? A quick trip to their site reveals a huge chunk of text with hundreds of links and an initial sentence that describes it as:
Maven, a Yiddish word meaning accumulator of knowledge, was originally started as an attempt to simplify the build processes in the Jakarta Turbine project.
I went to the site with some hope that it would provide some relief to my dependency issues (All I want is “pip install”, or “gem install”), and I get greeted with a dense paragraph of history combined with some mumbo-jumbo about “best practices”.
After reading a bit I find that Maven downloads and builds dependencies and installs them in a local repository, along with the library you are trying to compile. Perfect! Sounds like exactly what I want. However, it doesn’t mention anything about the classpath. Am I still responsible for dealing with all that muck, even though it’s tucking my libraries in a hidden directory (implying that it’s responsible for managing them)?
To answer that question I need to wade through dozens of other pages that alternately describe how to accomplish basic tasks and lecture me on software engineering. Finally I come to the conclusion that while Maven does indeed find dependencies for you, it does not actually help you execute programs with those dependencies in place. This means you either need a script that automatically passes your entire maven local repository to Java, or you need to know the dependencies that Maven was conveniently supposed to hide from you. To top it off, it doesn’t play well with Clojure. Completely useless.
(For the record, there is a Maven extension that does exactly this.)
The Last Word
Dependency management is a hard problem that all languages must learn to deal with. Higher level languages have an even harder time in that they must not only deal with whatever dependencies they have written in their own language, but also with extensions written in other languages. Clojure, which is still very young, suffers tremendously from the godawful environment that Java has ensconced itself in. I am largely a veteran of the *nix world, which seems quite different from the world Java developers have built around themselves. They have their own tools, their own build systems, their own set of “best practices”, and the Apache Foundation. What I have seen in my brief saunter over the wall has appalled me. It has appalled me far more than similar saunters into the somewhat exciting world of Microsoft and .NET. It strikes me very much as a world in need of fixing, and I hope that Clojure (or Scala) can do it. Heck, I may even do my part to help.
But probably I’ll just run back to Python.
Fourth: Regular Expressions in Clojure
Things are cruising right along now in creating my awesome Twitter portal in Clojure. So far we have gotten set up with Compojure, started using the Twitter API to grab data, and built some forms to make sure the data is relevant to the logged in user. The next little chore is to find URLs in tweets and make them into actual, clickable links. I want to keep this simple for now, so we’ll just find http:// or https:// and link that.
The Code
It turns out that the code to do this is really simple. Clojure just uses Java’s regular expression engine, but integrates it into the language a bit cleaner than Java does. A big thanks to Fatvat for basically walking me through it.
Nothing too complicated here, but there is an interesting new concept. For the first time ever, Clojure doesn’t do everything we want and we talk to Java. This is one of the most powerful attributes of Clojure. Even though it’s a young language, it’s built on a mature platform that does basically everything you need. In this case, we wanted to replace pieces of the “text” string. Nothing is actually mutated, since Java strings are immutable and .replaceAll returns a new string, but I didn’t want to slice and dice the text by hand when there was a perfectly usable Java method that would do the replacement for me.
Anyway, how does this work? “.replaceAll” is a method of java.util.regex.Matcher. What we’re trying to express in Java is:
In Clojure, re-matcher returns a Matcher built by applying a Pattern instance to a string (“text”). So, we’re calling the .replaceAll method on the object returned by re-matcher, which is a Matcher instance created out of a Pattern (indicated by the “#” reader macro). This is exactly what we want, expressed in a nice, functional style. After the instance that we’re operating on, we can pass additional arguments to the method. In this case we pass the replacement string.
Another thing you might notice is the string in the urlize function definition. Clojure has extensive support for metadata, which is something that I’ve largely ignored. In its simplest form, you can pass a string to defn as I have done, and that will be included as the docstring. The language also includes introspection features to pull these things out, but I have yet to investigate them in depth.
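As a rough illustration of the same idea outside Clojure, here is a minimal sketch in Python. Only the name urlize and the "find http:// or https:// and link it" rule come from this post; the exact pattern, the anchor markup, and the use of Python's re.sub in place of Java's Matcher.replaceAll are all assumptions on my part.

```python
import re

# Assumed pattern: "http://" or "https://" followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def urlize(text):
    """Return a copy of text with each http(s) URL wrapped in an anchor tag."""
    # \g<0> refers to the whole match, much like $0 in Java's Matcher.replaceAll.
    return URL_PATTERN.sub(r'<a href="\g<0>">\g<0></a>', text)
```

The string under the def line is the docstring, and urlize.__doc__ pulls it back out, which is the Python counterpart of the introspection mentioned above. A pattern this simple will also swallow trailing punctuation, so treat it as a starting point only.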
Again, pretty straightforward, and now we’re starting to do some real damage. I think I’m going to dive into JavaScript and CSS for a while, but I’ll be back soon with static storage. It should be fun! As always, all the code is on github.
A Threading Model Overview
I noticed in a story on Hacker News that many people do not understand the differences in threading implementations between different programming languages. In the single processor days, understanding the threading model that you were working with was not that important. With more than one core, it is a good thing to know. This is an overview.
The Beginning (C and Native Threads)
The first threading model we will look at is the standard OS level thread. Every modern OS has support for this, though the APIs change from OS to OS. Basically, a thread is an independent flow of execution that can run on its own processor, is scheduled by the OS scheduler, and can block. It acts just like its own process except that it shares resources with every other thread in the process. This mainly means that memory and file descriptors are shared between all threads in a process. This is what people mean by “native threading”. From C on Linux, you can use these threads by linking with the pthread library. BSDs generally support pthreads as well, and Windows does its own thing that is very similar.
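To make the shared memory point concrete, here is a minimal sketch. It is written in Python rather than C, since Python's threading module wraps native threads (pthreads on Linux); the names counter, add_many and run_workers are made up for the example.

```python
import threading

counter = 0                      # shared by every thread in the process
counter_lock = threading.Lock()  # guards the shared counter


def add_many(times):
    """Increment the shared counter `times` times."""
    global counter
    for _ in range(times):
        with counter_lock:       # without the lock, updates could be lost
            counter += 1


def run_workers(num_threads=4, times=10000):
    """Start num_threads native threads and wait for all of them to finish."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(times,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Every thread sees the same counter variable because they all live in one process. That sharing is the convenience of native threads, and the lock is the price you pay for it.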
Java and Green Threads
When Java came out, it introduced a different type of threading model to the world called green threads. Green threads are essentially simulated threads. The Java virtual machine would take care of switching between different green threads, but the virtual machine itself would only run in one OS thread. This generally has some advantages. OS threads have almost as much overhead as a process on most POSIX systems. It is also usually slower to switch between native threads than it is between green threads.
This can mean that in some situations, green threads are much preferable to native threads. A system can usually support a much higher number of green threads than OS threads. For instance, it would be practical to spawn a new green thread for every new connection on a web server, but it is not generally practical to spawn a new native thread for every incoming HTTP connection.
There are disadvantages, however. The biggest is that you cannot have two threads running at the same time. There is only one native thread, so it is the only thread that gets scheduled. Even if there are multiple CPUs and multiple green threads, only one CPU will be running any given green thread at any given time. This is because it all looks like one thread to the OS scheduler.
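A minimal sketch of the idea, using Python generators as stand-ins for green threads. This is a cooperative simplification and not how the JVM implemented it: each task has to yield control voluntarily, and the names worker and run_green_threads are invented for the example.

```python
from collections import deque


def worker(name, steps, log):
    """A green thread: do one step of work, then hand control back."""
    for i in range(steps):
        log.append((name, i))
        yield  # cooperative switch point


def run_green_threads(tasks):
    """Round-robin over the generators until every one has finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task up to its next yield
        except StopIteration:
            continue            # task is done, drop it
        ready.append(task)      # otherwise put it back at the end of the queue
```

The whole scheduler is one loop in one OS thread, so switching is cheap and you can have a huge number of tasks, but no two of them can ever run at the same instant, however many cores the machine has.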
Java has supported native threading since version 1.2, and it has been the default for some time now.
Python
Python is one of my favorite scripting languages, and was one of the first scripting languages to offer threading. Python exposes a threading module that manipulates native threads. This means that Python can benefit from all the advantages of true native threading, except for one catch.
Python has a global interpreter lock (GIL). This lock is necessary to keep Python threads from corrupting the global state of the interpreter. This means that no two Python instructions can be running simultaneously. The GIL gets released every 100 bytecode instructions or so (newer CPython releases switch on a time interval instead) and another Python thread is free to acquire the lock and begin executing.
On the face of it, this seems like a major flaw. However, in practice it is not that big of a deal. Any thread that blocks will generally release the GIL. C extensions can also release the GIL whenever they are not interacting with the Python/C api, so CPU intensive operations can be carried out in C without blocking the executing Python threads. The only situation in which the GIL proves problematic is when you have more than one CPU bound thread written in Python on a multi-core machine.
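Here is a minimal sketch of the "blocking threads release the GIL" point. It uses time.sleep as a stand-in for blocking I/O, and the names blocking_task and timed_run are made up for the example.

```python
import threading
import time


def blocking_task(seconds):
    """Simulate blocking I/O; time.sleep releases the GIL while it waits."""
    time.sleep(seconds)


def timed_run(num_threads, seconds):
    """Run num_threads blocking tasks at once and return the elapsed wall time."""
    threads = [threading.Thread(target=blocking_task, args=(seconds,))
               for _ in range(num_threads)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start
```

Four threads that each block for 0.2 seconds should finish in roughly 0.2 seconds total rather than 0.8, because none of them holds the GIL while waiting. Swap the sleep for a pure Python number-crunching loop and that advantage goes away.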
Stackless Python is an implementation of Python that brings “tasklets” (essentially green threads) to Python. The greenlet module is derived from their work and is compatible with the standard CPython implementation.
Ruby
Ruby’s threading model is and always has been in a state of flux. Ruby’s original implementation only supported cooperative green threads. These work fine in many situations, but they do not take advantage of multiple processors.
JRuby mapped Ruby’s threads straight to Java’s threads, which are generally OS native threads. This doesn’t work. Since Ruby’s threads are cooperative, there is no need to synchronize between the threads. Every thread can be assured that no other thread is accessing a resource while it is accessing it. This breaks down in JRuby, since native threads are generally preemptive, meaning any thread could be accessing any shared data at any time.
Because of the mismatches and the desire for native threading from the C Ruby folks, it was decided that Ruby would move to native threading in Ruby 2.0. In Ruby 1.9, a different interpreter was swapped into the standard Ruby distribution. 1.9 adds a threading model it calls Fibers, which as far as I know are a more efficient implementation of green threads.
In short, Ruby’s threading model is a poorly documented mess.
Perl
Perl has an interesting threading model, one which Mozilla borrowed for SpiderMonkey if I’m not mistaken. Instead of having a global interpreter lock like Python, Perl makes all global state thread local and spawns off a new interpreter with each new thread. This allows for true native threading. There are two catches though.
First, you must explicitly make variables available to threads outside your own. This is the nature of everything being thread local. The values must then be kept up to date across threads.
The second catch is that every new thread is very expensive to create. The interpreter is not small, and duplicating it with every thread makes for a lot of overhead.
Erlang, JavaScript, C# and so on
There are a lot of other models out there that people play with from time to time. Erlang, for instance, has a shared nothing architecture that forces you to use lightweight, user-land processes over threading. This is actually an outstanding architecture for parallel programming since it takes out all of the headaches involved with synchronizing memory, and the processes are so lightweight you can generally just spawn as many of them as you want.
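The flavor of that style can be sketched in Python, with the large caveat that this is only an analogue: Python threads are far heavier than Erlang processes, and nothing here stops you from sharing memory if you want to. The names doubler and double_all are invented, and the queues play the role of mailboxes.

```python
import queue
import threading


def doubler(inbox, outbox):
    """Receive numbers from inbox and send doubled values to outbox; None stops it."""
    while True:
        message = inbox.get()
        if message is None:      # sentinel message: shut down
            break
        outbox.put(message * 2)


def double_all(numbers):
    """Send each number to a worker as a message and collect the replies."""
    inbox, outbox = queue.Queue(), queue.Queue()
    process = threading.Thread(target=doubler, args=(inbox, outbox))
    process.start()
    for n in numbers:
        inbox.put(n)
    inbox.put(None)
    process.join()
    return [outbox.get() for _ in numbers]
```

The worker never touches the caller's data directly. Everything it knows arrives as a message, so there is nothing to lock.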
JavaScript is usually not thought of as a language that supports threading, but it needs to support it for a browser implemented largely in JavaScript like Mozilla. Its threading model is very similar to Perl’s.
C# uses native threads.
Well, I hope that makes the whole threading picture a little bit clearer. Please let me know if anything is confusing or if I messed anything up. I don’t know everything, after all.