Trying to write more
So it's been about three years since the last post here. That seems hard to believe, since the post before this recounts a tale from the job that I started right before moving to Buffalo, and it doesn't feel like I've lived here for three years at all. Time, as per usual, has passed more quickly than I'd like.
I don't feel bad about it, since writing hasn't been a priority for various reasons, but lately I've been feeling the itch. A majority of the posts here were the result of challenging myself to write a post a week, and it was surprising how easy it was to come up with at least something every week. I recall them being ok, but I haven't scrolled back and read through any of them in a while. Mostly, I just like the idea of having written.
And that's sort of what this post is about. Since the beginning of this year, I've been making a genuine effort to reprioritize some things that I've been neglecting. The crux of those efforts has been scheduling things and seeing how well I follow through. Using Google Inbox reminders, I've been setting myself the things I should be doing. Tonight I had a reminder pop up to "write", so here I am.
Other habits I've been trying to form: yoga, running, meditation, calling friends and making time to finish the book I've been reading for who knows how long. I've had varying success with all of these, and I've been happier, so I'm continuing to try.
Writing feels nice at the moment, so after editing this post I'll set a reminder for next week, and I can write something about the other things I've been trying.
In which python is great
To me, python's greatest strength is that the syntax is almost identical to the form that programs take when they appear in your mind. This makes the time from the initial thought to final code shorter, and makes it easy for others reading your code to understand your thinking. This makes it ideal for interview problems, but mostly just makes it a joy to work with.
To profile some code that I had written, I was calculating the average of many page loads. My initial approach was to write down the numbers and then plug them into a calculator. Simple enough, but, like most purely mechanical tasks, very boring for more than three or four numbers. Sounds like a good time for some python.
To start, I formatted my numbers as simply as possible. The file looked like this:
Wall 130 45 300 312 60 77
CPU 100 50 330 300 60 78
Mem 2000 2121 400 3100 1945 1224
I wanted it to print out the name of the run (Baseline in this case), and then print an average of each measurement. The code I wrote was this:
f = open('runs', 'r')
for line in f:
    parts = line.split()
    name = parts[0]
    total = 0
    count = 0
    for part in parts[1:]:
        total += int(part)
        count += 1
    print name, total / count
Short, and gets the job done. The best part is that it only took about five minutes from inception to working code. I needed to run it twice to get the names of the exceptions correct, but beyond that, it took me almost no time compared to the amount of time it saved, and that's a beautiful thing.
Could I have done it with awk or Excel? Certainly. However, one requires opening up Excel (if you even have it; also, yuck), and the other would require a bit of reading to figure out. Neither is really ideal, although the awk one might be sort of fun. Perhaps we'll see about that in a future post.
A short development tale
I missed posting last week, and this week I'm also feeling pretty overwhelmed. I'm working on finding a place to live and starting a job in a new city. So instead of trying to come up with something completely new to write about and missing a second week, I'm going to just relate a tale of code deployment gone wrong.
Roughly three years ago, the company I was working for was in a bit of flux. The development team was understaffed, so, with a low level of experience, I inherited the job of deploying the website to production.
This meant taking machines in and out of service by logging into the load balancer, rsyncing new code to every machine, and then logging directly into the production servers to clear cache before making them live again. Clearing the cache was done by going into the cache directory and typing "/bin/rm -rf ." - it was necessary to do this because rm had been aliased to "rm -i", which prompted for every file removed, mainly for safety reasons.
This process was careful work, but at the same time, it was quite tedious. So while various tasks were running, I did what any good programmer does during downtime. I dorked around on the internet.
So one day, while doing a deployment, I was looking at reddit. When I looked up, my boss was standing over my shoulder - not just any boss, but the big boss. I'm not sure how many management layers existed between us at the time, but it was several. Needless to say, I felt a bit put off that I had been caught screwing around. He had a question. I answered quickly, spun in my chair, and executed the next command in my routine. "/bin/rm -rf ." Then I got up to get some coffee.
When I came back, the command was still running, which was unusual. I didn't think much of it until I saw a string of permission denied errors appear. That was very unusual. It had never happened before. That's when I realized I had never changed directories to where the cache was. I had just removed everything in the home directory. Which was also the web root.
And that's why you automate your deployments.
(I eventually did)
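For what it's worth, even a tiny guard in a deploy script would have caught this. Here's a hedged sketch in python - the check on the directory name is my own invention, not what we actually ran:

```python
import os
import shutil

def clear_cache(cache_dir):
    # Refuse to delete anything unless the target actually looks like
    # a cache directory -- so a missed `cd` can't wipe the web root.
    if os.path.basename(cache_dir.rstrip('/')) != 'cache':
        raise RuntimeError('refusing to clear %r' % cache_dir)
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
```

A two-line sanity check like that costs nothing and turns a catastrophic typo into an error message.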
On the importance of backups
I like to keep my stuff backed up. I've got a huge external drive next to my laptop that I use as a mac time machine. For stuff like tax documents, I also throw a copy in my dropbox. And another on my web server. I used to put a copy on a thumb drive. I wish I could find the source of a piece of advice I once read, but my google skill is lacking today. Basically, the gist was: "If you have n copies of something, you really only have n - 1 copies for all practical purposes." If you only have one copy of anything, you could find yourself without that thing at any moment.
Making bunches of copies may also appeal to my inner hoarder sensibilities without taking up extra space in my house, but I'm going with the "just being safe" argument. At least publicly.
Either way, I realized just the other day, to my horror, that the only copy I have of my blog is the single database at my web host. I know for a fact that they do regular backups, but for my own peace of mind, I'd like to have copies myself.
So I wrote a little script that hooks into my django models, converts my posts to markdown formatting, and saves them to flat files. It also saves some metadata about each post in a related json file. At the moment it's nothing to write home about, and it messes up the markup that I use for code blocks, but it's a good enough backup for most purposes. I plan on extending it a bit so that I can write posts as flat files and have them converted to html and synced back out to the live site, automatically push the copies to github, and so on.
That's the other advantage. Now that my blog posts exist outside of a database, I can push the markdown files out to github. That way, if my grammar sucks or something, anyone can correct it and submit a pull request. That's assuming anyone cares to correct my grammar, but let's not get all academic here.
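The heart of such a script looks something like this - a simplified sketch where the dict keys are stand-ins for whatever fields the real django model exposes, and the markdown conversion is left out:

```python
import json
import os

def backup_posts(posts, out_dir):
    # posts: an iterable of dicts with 'slug', 'title', 'date', and
    # 'body' keys -- stand-ins for the real model's fields.
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for post in posts:
        base = os.path.join(out_dir, post['slug'])
        # the post body goes in a flat markdown file...
        with open(base + '.md', 'w') as f:
            f.write(post['body'])
        # ...and everything else in a json file next to it
        with open(base + '.json', 'w') as f:
            json.dump({'title': post['title'], 'date': post['date']}, f)
```

Flat files plus a json sidecar is about the simplest format that a future script could read back in and sync out again.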
You can check out my backup script here, and the repository with copies of all my blog posts here.
A little more Quicksorting
This has been quite a week. I've made the decision to take a new job and move my fiancee and myself to Buffalo from Brooklyn. We think that moving there will give us a better opportunity to live the lifestyle we want. A shorter commute, more space and proximity to family are among a whole host of reasons we've decided to do it. It was a difficult decision, and we'll both miss a lot of things about Brooklyn, but I think it's for the best, even though I feel a little overwhelmed right now.
Anyway, I was poking through my work computer to see if there was any code that I didn't want to leave behind, and I found something that played off last week's interest in Quicksort. It was a piece of code that I had written when I first taught myself the algorithm. And, unlike most old code you run into, it was more nicely written than what I wrote just recently:
#what I wrote last week
def quicksortLoop(set):
    if len(set) <= 1:
        return set
    high = []
    low = []
    pivot = set.pop(len(set) / 2)
    for i in set:
        if i >= pivot:
            high.append(i)
        else:
            low.append(i)
    return quicksortLoop(low) + [pivot] + quicksortLoop(high)
#what I wrote a long time ago
def quicksort(list):
    if len(list) <= 1:
        return list
    pivot = list.pop(len(list) / 2)
    less = [i for i in list if i < pivot]
    more = [i for i in list if i >= pivot]
    return quicksort(less) + [pivot] + quicksort(more)
Instead of looping through the set and putting values into two arrays, this implementation uses python's list comprehensions to build the more and less lists. I think this approach is much more elegant, yielding easier-to-read code.
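To make the difference concrete, here's a standalone snippet that partitions one list both ways; the two styles produce identical results:

```python
nums = [3, 6, 1, 8, 2, 9]
pivot = 5

# loop style: build the two lists element by element
less_loop = []
more_loop = []
for i in nums:
    if i < pivot:
        less_loop.append(i)
    else:
        more_loop.append(i)

# comprehension style: one expression per list
less_comp = [i for i in nums if i < pivot]
more_comp = [i for i in nums if i >= pivot]

assert less_loop == less_comp == [3, 1, 2]
assert more_loop == more_comp == [6, 8, 9]
```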
This led me to the question of whether or not the list comprehensions incurred a significant amount of overhead. To find out, I stripped away all the parts that distinguished the two implementations from one another, with the exception of the loop vs list comprehension pieces. Then I wrote some code to run the two functions through the same paces.
from time import time
from random import randint

#function definitions here

if __name__ == "__main__":
    loopTimes = []
    compTimes = []
    for i in range(1000):
        testSet = [randint(0, 100) for x in range(1000)]
        cSet = list(testSet)
        loopSet = list(testSet)
        compTime = time()
        quicksort(cSet)
        compTimes.append(time() - compTime)
        loopTime = time()
        quicksortLoop(loopSet)
        loopTimes.append(time() - loopTime)
    print "Avg with loop: ", sum(loopTimes) / 1000
    print "Avg with list comprehension: ", sum(compTimes) / 1000
What this does is sort a thousand lists, each containing a thousand numbers between zero and one hundred, using both sorting functions. It times each function on each list, and then calculates an average.
The results are sort of interesting.
[snpxw@PWadeiMAC:random-bits ]$ python quicksortTimer.py
Avg with loop: 0.00623641085625
Avg with list comprehension: 0.0061008477211
List comprehensions beat the loop every time, which is the opposite of what I expected. I'll speculate that the difference is either some internal optimization python makes for list comprehensions, or the pre-allocation of the low and high arrays in the loop version. I'm going to have to do a bit more research about what happens inside python to figure it out.
Regardless, the difference is negligible. Even if the list being sorted was five thousand times larger (that is, five million elements), the difference in the two implementations would be about 0.5 seconds. Not really enough to bother most people.
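If I revisit this, the standard library's timeit module would be a more robust way to measure, since it handles the clock choice and repetition bookkeeping for you. A sketch using the comprehension version (with floor division so it behaves the same on either python):

```python
import timeit
from random import randint

def quicksort(lst):
    # comprehension-style quicksort, working on a copy of the input
    if len(lst) <= 1:
        return lst
    rest = list(lst)
    pivot = rest.pop(len(rest) // 2)
    less = [i for i in rest if i < pivot]
    more = [i for i in rest if i >= pivot]
    return quicksort(less) + [pivot] + quicksort(more)

data = [randint(0, 100) for _ in range(1000)]
# average seconds per sort over 100 runs
avg = timeit.timeit(lambda: quicksort(list(data)), number=100) / 100
```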
(The full code from this post is here)